Evals Startups Are Not Enterprise Ready
They want to be the next "Datadog" or "Snowflake", but can they fool everyone at the same time?
This past week, I was at the AI Engineer conference in SF to get a pulse on the AI propaganda machine. And as expected, the evals hype was in full force—paid keynote plugs, booths, and workshops all pushing the idea that “you can’t build serious AI agents without buying an evals platform”.
So is that claim really true? Or are evals SaaS vendors just glorified QA testing suites trying to convince everyone that they are the next Datadog? (If you are confused about what an “eval” means, here’s a quick analogy: if an eval is a single “test” or “exam” for an AI app/agent, then an “evals platform” is like grading software. See the appendix for more).
Since it’s clear that observability, regression testing, and experimentation are real needs in agent development, I won’t waste time debating whether “evals” SaaS have utility. Evals SaaS solutions are valuable, especially for small, non-technical teams and vibe-coders.
But small teams are not where the money is in infra. Rather, I am asking the “build vs buy vs use open-source” question for enterprises. Should enterprises FOMO into buying proprietary evals solutions right now?
Plus, is “evals” even a standalone software category, or just a bag of features that belong inside existing observability and experimentation solutions?
In my view, enterprises should not rush into purchasing proprietary evals solutions, before fully operationalizing “how to do evals” with free open source solutions (e.g. Langfuse) that support open telemetry formats (e.g. OpenTelemetry).
Or, they can fork open source solutions and build their own agent observability stack that fits their business’s needs. This lets enterprises both move fast and avoid lock-in risks.
Here’s why.
Commoditization is inevitable: it’s a buyer’s market for LLMOps solutions
Here’s the brutal truth: these solutions all look the same and sit at feature parity. Which means this category of software will be a buyer’s market.
The market sentiment is that the “evals” space feels both premature and crowded. And this sentiment was shared across 100+ conversations I had with enterprise tech leaders, developers, and VCs at the conference.
Thanks to vibe coding, most vendors can ship similar features faster than ever. That means the technical moat is shallow—and the real defensibility should come from data network effects or proprietary benchmarks. So far, no vendor has shown that.
The most likely outcome for this space is consolidation: either via absorption into broader observability platforms (like Datadog or Grafana), or via open-source foundations standardizing telemetry semantics across the stack.
If you’re buying today, don’t expect long-term advantage. Expect a bake-off on price, not capability.
Evals platforms attack the trivial layer
In short, these “evals” startups solve only the trivial parts of AI agent reliability, and yet want to charge enterprise prices.
The actual difficulty in building robust evals isn’t UI polish. It’s upstream: constructing representative test suites, sourcing gold-standard data, defining meaningful pass/fail logic, and tuning scoring functions that reflect business goals.
In other words: dashboards and data collection are the table-stakes part. What’s hard is deciding what to evaluate, why, and how. You can’t outsource these hard parts to evals platforms.
But these platforms punt the hard science back to the customer while trying to charge enterprise prices. That’s the part that doesn’t sit well.
This is why the highest leverage move for enterprises isn’t to “buy into” evals—it’s to first internalize what a good eval looks like for your domain, with open source tools.
Then decide if a vendor saves you real time beyond what an open source backend + simple dashboards could already do.
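To make “the hard parts” concrete, here is a minimal sketch (in Python, with made-up business rules) of the kind of domain-specific pass/fail logic no vendor can write for you. The order-ID format and refund cap below are illustrative assumptions; the harness around a function like this is commodity, while the rules inside it are the actual eval.

```python
# Hypothetical example: domain-specific pass/fail logic for a customer-support agent.
# The criteria (order-ID citation, refund cap) are illustrative assumptions, not
# rules any evals platform ships out of the box.
import re
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    reasons: list[str]

def grade_support_reply(reply: str, known_order_ids: set[str],
                        refund_cap_usd: float = 50.0) -> EvalResult:
    reasons = []

    # Business rule 1: any order ID the agent cites must actually exist.
    cited_ids = set(re.findall(r"ORD-\d{6}", reply))
    if cited_ids and not cited_ids.issubset(known_order_ids):
        reasons.append(f"cites unknown order IDs: {cited_ids - known_order_ids}")

    # Business rule 2: never promise a refund above the cap without escalating.
    amounts = [float(a) for a in re.findall(r"\$(\d+(?:\.\d{2})?)", reply)]
    if any(a > refund_cap_usd for a in amounts) and "escalate" not in reply.lower():
        reasons.append(f"promises a refund above ${refund_cap_usd:.0f} without escalation")

    return EvalResult(passed=not reasons, reasons=reasons)
```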
Lock-in risk is disguised as ‘workflow.’
Most of these third-party “evals” solutions are essentially trying to create workflow lock-in by training your employees to do evals “their way”.
The playbook: get into the organization, get the dev teams and PMs to chuck all their evals data into the platform, and expand from there.
And the more your team internalizes their way of doing evals, the more expensive it gets to switch later.
That’s obviously a bad business decision, since AI agents will run a critical share of your business in the future.
Reliability isn’t just a feature—it’s a core engineering competency. Outsourcing a critical component like observability to an outside vendor creates failure modes you don’t control.
Category confusion ≠ product vision
Some startups in this space are still figuring out what business they’re in.
One minute they’re an “evals platform.” The next, they’re bundling prompt management, A/B testing, tracing, fine-tuning orchestration, and RAG debugging.
Basically, they are just bundling everything.
This is not necessarily a sign of ambition. It’s often a sign that the company is still pre-product/market fit. You don’t see Amplitude bundling prompt playgrounds and vector database hosting.
So when a vendor claims to be “end-to-end,” ask: does this tool actually solve a problem I have today—or is it hedging by solving every possible problem in hopes one sticks?
When a tool claims to be everything, assume it’s still searching.
Buying today = paying to be a design partner.
Most of these evals vendors didn’t even exist 18 months ago. A good chunk pivoted from “prompt playground” tools.
Their enterprise pitch is less about delivering mature software and more about recruiting early customers to teach them what to build.
That’s not inherently bad. If you’re a startup yourself and want to influence the roadmap, go for it. But for large enterprises?
You’re not just buying software. You’re underwriting someone’s product-market fit journey and teaching them how to rebuild Datadog.
Open source gives you 80% of the value, 0% of the handcuffs.
Langfuse, Phoenix, and other open tools already give you enough to get started—good tracing support, metrics visualization, model comparison tooling, etc.
If your team is serious about shipping reliable agents, you’ll end up needing internal eval infrastructure anyway—custom datasets, domain-specific pass/fail logic, integrated CI/CD hooks.
Proprietary evals platforms don’t replace that work.
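For a sense of scale, that internal work can start as small as a hand-curated dataset checked into the repo and a pytest gate in your CI pipeline. Everything below is an assumption you would swap for your own: the file path, the `run_agent` entry point, the `grade` function, and the 90% bar. The point is that none of it requires a vendor.

```python
# Sketch of an eval gate wired into CI (run via pytest in your pipeline).
# `run_agent`, `grade`, and golden_cases.jsonl are placeholders for your own
# agent entry point, domain-specific grading logic, and curated dataset.
import json

from my_agent import run_agent          # hypothetical: your agent's entry point
from my_evals.grading import grade      # hypothetical: returns True/False per case

def load_golden_cases(path="evals/golden_cases.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

def test_agent_meets_quality_bar():
    cases = load_golden_cases()
    passed = sum(grade(case["expected"], run_agent(case["input"])) for case in cases)
    pass_rate = passed / len(cases)
    # Fail the build (and block the deploy) if quality regresses below the agreed bar.
    assert pass_rate >= 0.90, f"pass rate {pass_rate:.0%} fell below the 90% bar"
```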
The right move is to build the muscle in-house, then revisit vendors in a year. By then, either the category will have consolidated—or the winner will have earned your trust.
Verdict
The bottom line is this:
evals are important, yes. Not debating that.
the market still feels very much in its formative stage, and yet very crowded. Existing solutions look remarkably similar to one another.
use cases are also changing fast, so you are not really paying for current features; you are subsidizing future ones
it pays to get started with open source evals solutions, while keeping an eye out for winners to emerge from the closed source space.
Appendix: Quick Bit on Evals
Technically speaking, an “eval” just refers to a test run against an AI app (or agent) to measure its performance.
Need #1: Regression testing: allows developers and PMs to test the impact of new changes to AI apps (and agents) before deploying them into production.
For example, suppose you are Tinder, and you want to test if you can get away with downgrading from GPT-4o to GPT-4o-mini for your AI girlfriend experience. To confidently push this change, you need to run a bunch of tests to see if the cheaper model hurts the end user experience, and quantify the results.
Evals software does not provide the tests for you (which is the hard part), but it gives you a test bench (a place to run the tests and visualize the results).
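As a rough sketch of what that test bench boils down to for the Tinder example above, here is one way to compare the two models on a fixed prompt set, assuming the OpenAI Python client and a hypothetical `score_reply` grader that your own team would have to write (writing that grader is, again, the hard part the software won’t hand you):

```python
# Sketch: run the same prompts through the current and the cheaper model, then
# compare average scores. `score_reply` is a hypothetical, domain-specific grader.
from openai import OpenAI

from my_evals.grading import score_reply   # hypothetical: returns a score in [0, 1]

client = OpenAI()
TEST_PROMPTS = [
    "I had a rough day at work, cheer me up.",
    "Do you remember what I told you about my dog?",
    # ...a representative sample of real conversations, which only you can curate
]

def average_score(model: str) -> float:
    scores = []
    for prompt in TEST_PROMPTS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        scores.append(score_reply(prompt, reply))
    return sum(scores) / len(scores)

baseline = average_score("gpt-4o")
candidate = average_score("gpt-4o-mini")
print(f"gpt-4o: {baseline:.2f} vs gpt-4o-mini: {candidate:.2f}")
# Ship the downgrade only if the drop stays within a tolerance you have agreed on.
```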
Need #2: Observability: allows for monitoring and troubleshooting AI apps both historically and in real time, by storing and visualizing logs (more precisely, traces, spans, events, etc).
For example, if your boss asks why customers hate the customer support AI agent, you can drill down into old logs to find an answer. Or you can build dashboards that show average session length, etc.
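Under the hood, that monitoring is just structured telemetry. Here is a minimal sketch using the standard OpenTelemetry Python SDK, exporting spans to the console for simplicity; in practice you would point an exporter at Langfuse, Grafana, or whichever backend you choose, and that backend is what renders the dashboards. The attribute names and the `call_llm` helper are illustrative assumptions.

```python
# Sketch: emit one trace span per agent call with the OpenTelemetry Python SDK.
# ConsoleSpanExporter keeps the example self-contained; swap in an OTLP exporter
# to ship spans to your observability backend of choice.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("support-agent")

def answer_ticket(ticket_text: str) -> str:
    # One span per agent invocation; the attribute names here are illustrative.
    with tracer.start_as_current_span("agent.answer_ticket") as span:
        span.set_attribute("agent.model", "gpt-4o-mini")
        span.set_attribute("ticket.length_chars", len(ticket_text))
        reply = call_llm(ticket_text)   # hypothetical: your actual LLM call
        span.set_attribute("reply.length_chars", len(reply))
        return reply
```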
Need #3: Experimentation: allows you to run many A/B tests on AI agents in real time, and track results.
For example, say you have two prompts (version 1 and version 2) and want to see which one customers prefer: you can run this test live. Evals solutions provide client SDKs to send over the metrics (basically the same concept as Mixpanel or Optimizely).
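As a rough sketch of what those SDK calls amount to, here is the shape of a prompt A/B test; the `track` function below stands in for whichever vendor or in-house metrics client you use, and the event names and variants are made up.

```python
# Sketch: assign each session a prompt variant and report an outcome metric.
# `track` stands in for a Mixpanel-style SDK or your own metrics client.
import random

PROMPT_VARIANTS = {
    "v1": "You are a concise, friendly assistant.",
    "v2": "You are a playful assistant who loves emojis.",
}

def track(event: str, properties: dict) -> None:
    # Placeholder: in production this would call the analytics client's SDK.
    print(event, properties)

def start_session(user_id: str) -> str:
    variant = random.choice(list(PROMPT_VARIANTS))
    track("prompt_experiment_assigned", {"user_id": user_id, "variant": variant})
    return variant

def end_session(user_id: str, variant: str, thumbs_up: bool, turns: int) -> None:
    # The outcome metrics you choose (thumbs-up rate, session length) decide the test.
    track("prompt_experiment_outcome",
          {"user_id": user_id, "variant": variant,
           "thumbs_up": thumbs_up, "turns": turns})
```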