I get a version of this email from a founder every week now: "We're picking an agentic AI development company and there are 40 on our list. They all say the same thing on their homepage. How do we actually choose?"
The honest answer: about 5 of those 40 are real. The other 35 are outsourcing firms that updated their website last summer. The hard part isn't finding an agentic AI development company — it's filtering for one that actually runs agents internally, not just sells the label.
I run Shape. We ship our own AI products (Wondercut, ProductAI, MomentClip) agent-first, and we build for funded founders and corporate ventures the same way. This piece is the screening framework I'd use if I were on the buyer side of the table — five tests that separate a real agentic AI development company from a rebrand, plus exactly how to run them. Use it on us, use it on everyone else.
Five tests that separate real from rebranded
I'll give you the tests first, then explain why each one matters. If a company can't pass at least four, walk away.
| Test | What to ask for | Pass signal | Fail signal |
|---|---|---|---|
| 1. Real PRs | "Show me a PR from your own product, opened by an agent, with the verifier output visible." | A real GitHub PR, with agent plan in description and CI checks visible | "We'll send a demo" or a blog post |
| 2. Eval-to-feature ratio | "What's your eval-to-feature-code ratio on your last AI product?" | A specific number (typically 0.5:1 to 2:1) | "We have some tests" or "what do you mean by evals?" |
| 3. Typed vs shipped | "How many lines did your senior engineers personally type last week vs. how many shipped?" | Specific lopsided ratio (we run ~20% typed / 80% agent-shipped) | Suspiciously round percentages or topic change |
| 4. Week-one output | "What did you ship in week one of your last client engagement?" | A deployed working spike with one core flow | A Notion doc, a Figma file, or a kickoff meeting |
| 5. Team transparency | "Send GitHub handles + seniority of everyone who'll touch my repo, before signing." | Public profiles, all senior, with real prior work | Account manager fronting unnamed offshore staff |
Now, the why behind each.
Test 1: Show me a PR from your own product, opened by an agent
This is the single fastest filter. A real agentic AI development company runs agent-first on its own codebase every day. They can pull up GitHub, scroll their main repo, and show you a real PR opened by an agent with the agent's plan in the description and the verifier output in the checks. Shape's repos look like this. So do the repos of every other team that's actually doing the work.
A rebrand will say "we can show you a demo." A rebrand will share a blog post. A rebrand will offer a 45-minute "discovery call." None of that is a PR. Insist on the PR.
Test 2: What's your eval-to-feature-code ratio on your last AI product?
Agentic delivery without evals is autocomplete with extra steps. A real agentic AI development company has a number for this, usually somewhere between 0.5:1 and 2:1 (eval code to feature code, by volume) on production AI features. Our ProductAI codebase runs closer to 2:1 because image generation is full of edge cases. Wondercut runs about 1:1.
If the answer is "we have some tests" — that's a rebrand. If the answer is "what do you mean by evals" — that's a rebrand wearing makeup.
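If you want to sanity-check a ratio like this on a repo you can actually see, a rough count is enough. The sketch below is a minimal illustration, not Shape's tooling: it assumes eval code and feature code live in separate directories (the `evals/` and `src/` names in the usage comment are hypothetical) and it treats non-blank lines as a proxy for code volume.

```python
from pathlib import Path


def count_lines(root: Path, suffixes: tuple[str, ...] = (".py",)) -> int:
    """Count non-blank lines in files under `root` whose suffix is in `suffixes`."""
    total = 0
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            with path.open(encoding="utf-8", errors="ignore") as fh:
                total += sum(1 for line in fh if line.strip())
    return total


def eval_to_feature_ratio(eval_dir: Path, feature_dir: Path) -> float:
    """Return eval lines divided by feature lines (1.0 means a 1:1 ratio)."""
    feature_lines = count_lines(feature_dir)
    if feature_lines == 0:
        raise ValueError("no feature code found under feature_dir")
    return count_lines(eval_dir) / feature_lines


# Hypothetical usage, assuming this directory layout:
# ratio = eval_to_feature_ratio(Path("evals"), Path("src"))
# print(f"eval-to-feature ratio: {ratio:.2f}:1")
```

Line counts are a blunt instrument, so treat the result as a conversation starter. A vendor with a real number will be able to explain why theirs differs.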
Test 3: How many lines did your senior engineers personally type last week vs. how many shipped?
Honest answer at Shape: senior engineers personally type roughly 20% of what ships. The other 80% is agent-generated, human-reviewed, agent-tested, human-architected. If a senior types about as much as they would anywhere else, 20% typed means total shipped output is about five times that. In practice we ship four to six times what a traditional senior engineer ships in a week, and the verification is automated.
A rebrand will dodge the question or quote a percentage that's suspiciously round (50/50 is the most common lie). A real agentic AI development company has a number and a Git history that backs it.
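A Git history can back a number like this only if agent commits are distinguishable from hand-typed ones. The sketch below assumes one possible convention: agent-authored commits carry a `Co-authored-by:` trailer naming the agent. The marker string is an assumption for illustration, so swap in whatever the team under review actually uses. The function parses the text that `git log --numstat` prints and sums added lines into two buckets.

```python
import re

# Matches numstat lines such as "12\t3\tpath/to/file.py".
# Binary files are reported as "-\t-\tpath" and are skipped.
NUMSTAT = re.compile(r"^(\d+)\t(\d+)\t")

# Assumption: agent commits are marked with this trailer (lowercased for matching).
AGENT_MARKER = "co-authored-by: claude"


def typed_vs_shipped(log_text: str, marker: str = AGENT_MARKER) -> dict[str, int]:
    """Sum added lines per bucket from NUL-separated `git log --numstat` output."""
    totals = {"typed": 0, "agent": 0}
    for chunk in log_text.split("\x00"):
        if not chunk.strip():
            continue
        added = sum(
            int(m.group(1))
            for line in chunk.splitlines()
            if (m := NUMSTAT.match(line))
        )
        bucket = "agent" if marker in chunk.lower() else "typed"
        totals[bucket] += added
    return totals


def typed_share(totals: dict[str, int]) -> float:
    """Fraction of added lines that were typed by hand (0.0 if nothing shipped)."""
    shipped = totals["typed"] + totals["agent"]
    return totals["typed"] / shipped if shipped else 0.0


# Produce the input with:
#   git log --since="1 week ago" --numstat --format=%x00%B
```

Added lines overstate some work and understate other work (a one-line architectural fix counts as one line), so this is a rough check on the claimed ratio, not a productivity metric.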
Test 4: What did you ship in week one of your last client engagement?
Real agentic teams ship a working spike in week one. Working means deployed, accessible to the client, with one core flow functional. Not a Notion doc. Not a Figma file. A deployed thing the client can poke.
How? Because the meta-decision (we'll run agent-first) was made years ago, the tooling is already in place, and the team's muscle memory is "ship something runnable, then refine." Compare to a traditional dev shop where week one is discovery, week two is more discovery, and the first deploy happens around week four.
Test 5: Who's actually on my project — names, GitHub handles, seniority?
This is the test most founders forget. A real agentic AI development company has a small senior team (two to four engineers per project, all senior). A rebrand has a "delivery manager" who fronts a rotating cast of mid-level offshore staff. You can verify by asking for the GitHub handles of everyone who will touch your repo — and looking them up.
At Shape, you get names, GitHub profiles, and prior work links before the contract is signed. Anyone hiding their team is hiding their team for a reason.
The Shape stack — what agent-first delivery actually runs on
I covered this in detail in how my team actually ships code in 2026, but here's the version a buyer needs to see. This is what we use today (May 2026). It changes quarterly.
| Layer | What we use at Shape (May 2026) | Why |
|---|---|---|
| Primary agent | Claude Code (terminal) | Long-horizon tasks, multi-file edits, headless automation |
| Inline editor | Cursor | Tight inline edits, diffs visible per change |
| Eval framework | Custom Python harness over JSON test banks | Heavier tools (Braintrust etc.) are overkill at our scale |
| Orchestration | n8n + cron + Lambda | n8n for human-in-the-loop, cron/Lambda for pure background |
| Context & memory | Per-repo CLAUDE.md, AGENTS.md, /skills directory | Versioned in git like any other code |
| Code review | Agent for nits, human for architecture | Senior engineers leveraged on architectural decisions, not commas |
| Production gate | Tests + lint + eval pass + human approval | Agents never push to prod unsupervised. Ever. |
The stack matters less than the discipline around it. Every workflow we run on a client project, we already run on Shape's own products. There is no client work that's different from how we ship Wondercut. That's the unfair advantage and the only one that actually matters.
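To make the eval-framework and production-gate rows concrete, here is a minimal sketch of what a Python harness over a JSON test bank and a four-condition gate could look like. It is an illustration under stated assumptions, not Shape's actual harness: the `input`/`expected` schema, the exact-match scoring, and the function names are all hypothetical, and non-deterministic outputs such as generated images would need a different scorer.

```python
import json
from typing import Callable


def run_evals(bank_json: str, system: Callable[[str], str]) -> float:
    """Run every case in a JSON test bank and return the pass rate (0.0 to 1.0).

    Assumed schema: a JSON list of {"input": str, "expected": str} objects.
    """
    cases = json.loads(bank_json)
    if not cases:
        raise ValueError("test bank is empty")
    passed = sum(1 for case in cases if system(case["input"]) == case["expected"])
    return passed / len(cases)


def production_gate(
    tests_ok: bool,
    lint_ok: bool,
    eval_pass_rate: float,
    human_approved: bool,
    threshold: float = 1.0,
) -> bool:
    """Allow a deploy only when all four conditions hold."""
    return all([tests_ok, lint_ok, eval_pass_rate >= threshold, human_approved])


# Hypothetical usage with a toy "system" that upper-cases its input:
# bank = '[{"input": "a", "expected": "A"}]'
# rate = run_evals(bank, str.upper)
# can_ship = production_gate(True, True, rate, human_approved=False)  # False
```

The point of the gate is that human approval is a hard condition, not a tiebreaker: a perfect eval run with no sign-off still returns `False`.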
Team composition — what an engagement looks like from the inside
An engagement runs with two to four people. All senior. All based in Berlin, New York, or Ljubljana. We don't subcontract, we don't staff-aug, and we don't use junior offshore developers under a senior's name. The agents are the leverage. The humans are senior.
A typical project breakdown:
- Lead engineer (senior). Owns the spec, the eval design, and the architecture. Reviews every PR.
- One or two builder engineers (senior). Drive the agent loops day-to-day. Each runs three to five parallel agents during the workday.
- Designer (senior, on AI products only). Owns the UX patterns for non-deterministic outputs — the thing most teams underestimate when shipping AI.
- Marko (founder). Available for any client engagement, weekly. I read every PR for every project. That's not posturing — it's how I keep the muscle.
Compare that to a traditional AI development agency: one senior architect, three to five mid-level offshore developers, a project manager, and an account manager. That's six to eight people against our two to four: more headcount, half the leverage, double the friction.
Pricing — what it actually costs
Most agency sites refuse to show pricing. I find that disrespectful of the buyer's time. Here's ours, openly:
- Fixed-Scope MVP: $48K for a six-week build. One core flow, deployed, evaluated, instrumented, handed off. This is the most common engagement.
- Dedicated Pod: $35–60K per month, three to twelve months. Two to four senior engineers + a designer if it's an AI product. Best for post-MVP teams scaling, or corp ventures building in parallel to an internal team.
- Build-for-Equity: Case-by-case. Reserved for founders we know in spaces Shape wants a position. Don't pitch on the first call.
The cheapest AI development agency you'll find quotes $25–30K for a six-week MVP. They'll deliver a wrapper around an OpenAI call with no eval suite and a 9–14 week actual timeline. The total cost ends up higher and the asset you get is worse. The math on this is brutal and I've watched founders learn it the hard way more than once.
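Here is one way to run that math. The sketch assumes the cheap shop keeps billing at the weekly rate implied by its original quote for every week the project actually takes. That billing model is my assumption for illustration; real contracts vary, and some overruns are absorbed or renegotiated.

```python
def overrun_cost(quote_usd: int, quoted_weeks: int, actual_weeks: int) -> float:
    """Total cost if the quote's implied weekly rate is billed for every actual week."""
    return quote_usd * actual_weeks / quoted_weeks


# Best case from the figures above: $25K quote, six weeks quoted, nine weeks actual.
low = overrun_cost(25_000, 6, 9)    # 37500.0
# Worst case: $30K quote, six weeks quoted, fourteen weeks actual.
high = overrun_cost(30_000, 6, 14)  # 70000.0

print(f"effective cost: ${low:,.0f} to ${high:,.0f}")
```

Under that assumption the overrun alone puts the cheap quote at $37.5K to $70K, which straddles the $48K fixed-scope price before counting the missing eval suite or the extra weeks without a product in users' hands.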
Why three cities matter
Berlin, New York, Ljubljana. The three locations exist because of how we got here, but they also happen to work as a global delivery footprint:
- Berlin + Ljubljana cover full EU business hours and overlap the US East Coast morning.
- New York covers US East Coast business hours fully and overlaps the EU afternoon.
- For an EU client, there's always someone live during their work day.
- For a US client, the same.
- For a corp venture spread across geographies, we're already operating that way internally.
No follow-the-sun handoffs, no "we'll respond tomorrow" because the team is asleep on another continent. The trade-off is we're not the cheapest — but we're also not delivering at the cost of the founder's sleep.
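The overlap claim is easy to check yourself. This is a minimal sketch under simplifying assumptions: fixed winter UTC offsets (Berlin and Ljubljana at UTC+1, New York at UTC-5), a 9:00 to 17:00 local working day, and no handling of windows that wrap past midnight. The offsets shift by an hour during daylight saving time, though the gap between the two regions stays at six hours for most of the year.

```python
def overlap_hours(offset_a: int, offset_b: int, start: int = 9, end: int = 17) -> int:
    """Hours per day when two offices are both inside their local start-end window.

    Offsets are whole hours relative to UTC (e.g. +1 for CET, -5 for EST).
    """
    # Convert each office's local working window to UTC hours.
    a_start, a_end = start - offset_a, end - offset_a
    b_start, b_end = start - offset_b, end - offset_b
    return max(0, min(a_end, b_end) - max(a_start, b_start))


BERLIN_UTC_OFFSET = 1     # assumption: CET, winter time
NEW_YORK_UTC_OFFSET = -5  # assumption: EST, winter time

shared = overlap_hours(BERLIN_UTC_OFFSET, NEW_YORK_UTC_OFFSET)
print(f"Berlin/New York shared working hours per day: {shared}")
```

With those assumptions the two regions share two working hours a day (15:00 to 17:00 in Berlin, 9:00 to 11:00 in New York), and between them the offices span 8:00 to 22:00 UTC.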
Proof — read this section if you've read nothing else
I'm wary of agency pages that brag without showing the work. Here's the work:
- ProductAI: AI product photography for e-commerce. Built inside Shape, 4-week initial spike, shipped to 1,000+ paying users in 8 months. Agent-first start to finish. Eval suite is ~2x the feature code.
- Wondercut: AI video editing for short-form creators. 10 weeks to ship what a traditional agency had scoped as a 9-month build.
- MomentClip: Highlights tool for creators. Smaller surface, same delivery model.
- MapleNorth: Editorial product built end-to-end agent-first. Public, indexable, hits page-one Google for its target queries.
- GitHub: github.com/ShapeStudio — public org. Some repos open, some private, but the cadence and commit patterns are visible.
- Client engagements: Several funded Seed/Series A teams and one corporate venture team in the last six months. Most are under NDA. I'll walk you through them on a call.
If you want to see the patterns these builds run on, the deeper read is how my team actually ships code in 2026. If you want to see the conversion-side framing, read what agentic AI development services actually means. And if you want to see how MVP-class engagements work, how to ship a real AI product in six weeks is the piece.
FAQ
How do I know an agentic AI development company is real and not a rebrand?
Run the five tests above. Ask for a PR opened by an agent on their own product. If they can't show you, they're a rebrand. There is no other test that filters this fast.
What's the difference between an agentic AI development company and an AI development agency?
Most "AI development agencies" are traditional dev shops with a fresh page. An agentic AI development company organizes its daily work around agents executing under human supervision. The first is positioning. The second is operations.
Is Shape based in the US?
We have a New York presence. The core team is in Berlin and Ljubljana. We work US and EU hours and have shipped for clients in both regions.
Can I see code from a client engagement?
Under NDA, yes, on a call. We can show structure, eval design, and agent-PR patterns from a representative project. We don't share client code publicly, but we can show you our own.
How long does a typical engagement run?
Fixed-Scope MVP: six weeks. Dedicated pod: three to twelve months. Build-for-equity: as long as it takes, typically nine to eighteen months.
What's the worst-fit engagement for Shape?
A static marketing site, a CRUD app with no AI, or a project where the founder wants to review every line of code by hand. We're not a great fit for those. There are excellent traditional shops who are.
What I'd do if I were you
Make a shortlist of three. One of them should be us. Pick the others by running the five tests on the rest of your list — most will fail Test 1 in the first email exchange. That's the point.
Then run a 30-minute call with each of the three. Same questions, same scope description, same time window. The shop that gives the most concrete, specific answers — including admitting where they're not a fit — is the one that's been doing the work. The shop that gives the smoothest pitch is the one that's been doing the pitching.
When you're ready to run that call with us, book a 30-minute slot directly on my calendar. No deck. We'll talk about your product, what's hard about it, and whether we're the right team to ship it. If we're not, I'll tell you who is.
Written by Marko Balažic, founder of Shape — an AI venture studio shipping AI-powered products agent-first. If you're evaluating an agentic AI development company, reach out and I'll help you think through it, even if we're not the right fit.








