Recently, a client asked why I’m bullish on Databricks, so I’ll share my answer with my subscribers today.
In a nutshell, Databricks will grow market share in enterprise AI with an under-appreciated but very powerful wedge - the Unity Catalog. Unity Catalog is essentially a data and AI app governance solution which already has 10K+ customers in less than 2 years of launch, and is fast becoming a category leader.
Since strong data and AI governance is a must for every enterprise customer using AI, Unity Catalog is a powerful beachhead for converting non-Databricks accounts. It also allows Databricks to land and expand its AI training and Spark revenue, by cross-selling its Mosaic AI and serverless services, as well as natural language BI tools.
Investing into Unity Catalog has been a genius move, because while VCs and hyperscalers are spending billions fighting the war of attrition over making the best LLM or AI infra tool, etc, but the competition in the data & AI governance space is much, much weaker in comparison, except maybe Snowflake’s Polaris catalog (which is a far less mature product). Other incumbent governance vendors don’t have a strong AI story and interoperability, which makes them quite outdated already.
So here’s why I’m bullish on Databricks, in a nutshell:
Data governance is the toughest challenge for enterprises when it comes to productionizing LLM apps. Not the LLMs, vector DBs, or tooling, etc. AI infra is pretty commoditized already.
Databricks provides the best solution to data & AI governance with its Unity Catalog, because it provides a one-stop-shop for governing data and AI apps. Every enterprise needs a solution here.
The kicker: winning the data and AI governance category is a far more valuable than winning the AI infra war. It has real “winner takes all” characteristics, and every large enterprise needs them. It’s a very large TAM category, with very few players who can do both data and AI governance like Databricks.
Through its Unity Catalog, Databricks can easily cross-sell its analytics and AI infra services. That’s Databricks’ AI strategy in a nutshell.
Continue reading as I double click on each of these points.
Quick announcement: Are you interested in being interviewed by Enterprise AI Trends? I am starting an interview series (podcast + transcript) and am looking for the first few interviewees who are working on AI, product, or GTM to debate challenges and opportunities in Enterprise AI. If you are interested in being interviewed, or are seeking for consulting on AI strategy - email public@nextword.dev
The real speedbump for enterprise AI adoption - poor governance
Let’s start with the importance of having an unified data + AI catalog, which is what Unity Catalog provides.
In the past, I wrote wrote about why data governance is really important for putting AI apps into production (”Data Strategy Before AI Strategy”). Essentially, it argues that you can’t ship AI apps to customers without having a strong access governance layer over how data is shared with AI apps.
For those who are unfamiliar, data governance refers to the activity of managing how your company’s data is created, consumed, entitled, and shared across employees and to customers. Data governance is performed with a tool called data catalog, essentially a UI and tooling that helps with:
discovering datasets in your org (”what datasets do we have as a company, and who / which team maintains the data, or are using data, and for what”),
grant and manage permissions to these datasets to other teams and apps
and troubleshoot issues with datasets (e.g. figure out which data pipeline broke, detect data quality issues, notify the right owners, etc).
And with the rise of AI apps, enterprises also demand a governance solution for AI artifacts, such as AI agents, AI apps, functions, tools, and so forth. These will need to be first class assets, just like “datasets” or “features”, so enterprises can have things like:
an inventory of all AI agents or apps that your firm ever created and published
a way to control which tools and function calls are available to which AI agent
… and audit trails to identify who (in your company) modified these “AI assets”
Thus, the traditional definition of data governance expanded from pure data (datasets and ML features) to include AI assets and its ancillary artifacts like functions. A whole new set of enterprise requirements emerged for efficiently governing and auditing AI apps (and agents) - and Databricks has been prescient to provide an unified solution for this. Competitors have a weak answer in comparison.
Interestingly (and perhaps shockingly) even in 2024, most enterprises don’t have a single, authoritative data catalog, let alone an unified data + AI catalog. They either have 0 or N number of data catalog(s), usually depending on how many vendor solutions they purchased in the past. They have bene putting off data cataloging projects for years since it’s often politically difficult to coordinate, and the projects don’t have immediate ROI. Many BS initiatives like “data mesh” were proposed to solve this problem, but that didn’t work because it was too abstract.
But with the rise of AI apps, especially LLM apps, enterprises can’t procrastinate on data+AI governance any longer, especially for customer facing AI apps that need high level of security. Thus, the pace of data + AI catalog adoption is picking up, and Unity Catalog - which is the best in class product - is benefitting from this tailwind.
Here’s an example use case
To see why data+AI governance is a blocker, consider an enterprise building a customer support app that handles password resets. Left column shows the step an AI chatbot needs to take, and the right shows what sensitive information requires governance.
As the table above shows, this simple workflow requires governance guardrails for both data (e.g. email, password, etc) and AI assets (e.g. function calls, retrievers, etc). There’s simply more to govern with AI apps that goes beyond applying access control on datasets. And even this app represents a potential security nightmare, if the agent retrieves the customer’s data or session, etc. Complexity is too enormous for enterprises to build on their own.
Long story short, enterprises that have a lot to lose need a strong data governance solution where every person involved in making LLM apps are onboarded. It’d be disastrous if a customer support bot uses the wrong customer’s personal info to generate responses, or invoke an unauthorized or inappropriate function call due to prompt hijacking, etc (e.g. reseting another customer’s password).
The data + AI flywheel for Databricks
Since data+AI governance is a pressing problem for enterprise AI, Unity Catalog is enjoying great traction, with 10K+ enterprise customers in less than 2 years post-launch.
Databricks knows this, and has been structuring its GTM efforts with Unity Catalog at the forefront. This is smart, because data catalogs have an inherent flywheel / network effect built in. It’s a product that gets better as more internal teams register their datasets with the catalog, which will incentivize other teams to do the same, and so on.
Databricks’ land and expand typically goes like this:
A customer adopts the open source version of Unity Catalog (or hosted version) to register their internal datasets.
Eventually, Unity Catalog becomes the “source of truth” for discovering all data that the company owns, as well as metadata.
Unity Catalog starts incorporating non-data scientist / engineer personas as well (PMs, solution architects, data audit teams, etc).
Unity Catalog becomes the “operating system” of how data / AI apps get built.
This puts Databricks in a great position to cross sell serverless Spark, hosted notebooks, scheduling, AI training, LLMOps, etc.
Thus, once a catalog is established, it’s very sticky - and given that most enterprises have data silos and poor data governance, there’s a lot of room for Databricks to further integrate themselves into enterprise data tech stack.
Overall, I think Databricks’ strategy will not only work, but it is also smart to avoid fighting expensive wars in building and promoting the best OSS LLM, etc. That’s not Databricks’ strength anyways. Ultimately, it’s about owning the end to end lifecycle of how AI and data apps get built. It starts with owning the catalog.
Risks:
Adopting a firm wide data catalog still requires a great amount of behavioral change and a strong organizational mandate - thus, it may take a while (in the order of a full year) for these migrations to be complete. That said, if it’s an enterprise that’s any serious about generative AI, it should prioritize the Unity Catalog migration within a quarter, which is doable in my opinion.