Why Data Strategy Comes Before AI Strategy
Any AI strategy will fail unless your company has its data in order
Recently, I was chatting with a Director of Engineering overseeing generative AI platforms at a Fortune 500 company - and I asked him what keeps him up at night.
He said, “We are building these LLM-powered prototypes right now, but our data infrastructure isn’t ready for production LLM apps.”
And he’s right.
Generative AI use cases (RAG, finetuning, etc.) depend on healthy data infrastructure as a foundation. Without the right solutions for data governance, observability, cataloging, data sharing, lineage, and so on, companies will sooner or later run into significant challenges when deploying and scaling AI apps in production. Shaky data foundations eventually boil over into product delays or, worse, data privacy violations.
Unfortunately, the importance of data strategy is being drowned out by the hype around newer, sexier topics like LLMs, vector databases, and RAG. My sense after speaking to enterprise customers is that many compartmentalize generative AI as a separate effort from their existing data strategy and vision. This is a mistake, since generative AI builds upon existing data infrastructure.
The elephant in the room is that many enterprises lack the data infrastructure to confidently deploy customer-facing AI apps. That includes gaps in:
data discovery and data silos - e.g. most companies don’t have a central data catalog or streamlined processes for sharing data among teams.
data observability and lineage tracking to help monitor and troubleshoot data-centric issues with Gen AI apps quickly.
data quality - e.g. Gen AI apps are very sensitive to the quality of data passed into context. Are current data quality standards good enough for data to be passed as context into, say, a customer-facing chatbot? What can be done to boost data quality?
data governance - e.g. can companies trust that there’s a single source of truth for data permission boundaries for everyone in the company? Who owns vetting permission levels? How do we approach IAM for LLM agents?
the list goes on.
In this post, I will discuss:
the real consequences of a poor data strategy & data infrastructure, which can derail generative AI efforts at companies (note: some lessons also apply to startups, which tend to have higher risk tolerance)
a sketch of how to fix a poor data strategy, and
the data requirements of generative AI workloads, and how they differ from traditional ML.
Generative AI will expose companies with weak data strategy & governance
For a data strategy to be “ready” for generative AI, two things need to happen:
addressing existing issues with enterprise data ecosystems, such as poor data governance, poor data discoverability, data silos, etc.
addressing new issues that arise from generative AI workloads, such as serving and storing embeddings, vector DBs, training data management, etc.
Of the two (old and new) issues, my claim is that the old issues will hurt more. In fact, generative AI workloads will shine more light on existing issues, because:
Many Gen AI use cases (e.g. chatbots, agent assist) are customer-facing, so the bar for data governance, quality, and observability is higher.
Technologies such as RAG tend to integrate data from various places, often in real-time, which exposes data silos.
With LLM agents, data governance becomes more complex since we have a new entity (agent programs) that can consume data. If governance is already spotty, then it will be hard to utilize LLM agents in your enterprise.
and so on.
Unfortunately, many enterprises are plagued with issues in their data ecosystem, which will haunt their future generative AI roadmap. The following is a non-exhaustive list of problems:
data silos: multiple teams & LOBs (lines of business) have their own data lakes or data stores, so any app or project that is cross-cutting can’t be built easily. Also, no one knows what data other teams have, so they re-create similar datasets, wasting time and money.
poor data governance: that is, there’s no streamlined way of vending, enforcing, and auditing read & write permissions for everyone at the company. Governance gaps include the lack of a central data permission enforcement layer, airtight auditing, and monitoring of abnormal access patterns.
poor data discoverability: poor discoverability of existing data, often marked by the absence of a single data catalog, is one of the causes of data silos. But more generally, it’s caused by the lack of a data sharing culture across teams (due to lack of cooperativeness, ownership, toxic attitudes, etc.). In some cases, teams avoid sharing or “externalizing” data because maintaining data SLAs and pipelines other people depend on is thankless work. Eventually, no one ends up “owning” data discoverability.
poor data lineage and pipeline observability: many enterprises don’t have a standard observability and lineage solution for ETL / Spark jobs. Worse, they may use multiple orchestration tools (Airflow, Prefect, cron, etc.), so job dependencies are hard to visualize, and data quality issues are hard to detect and troubleshoot.
not mentioned: poor data quality, PII data leaks, deduplication at scale, cost, etc.
Now, let’s look in more detail at how each bucket of problems affects generative AI.
Issue #1: Data silos, poor data discoverability, and the lack of data interoperability
LLM apps tend to integrate multiple data sources from different places. But if data can’t be aggregated easily because it sits in different silos, then the app can’t be built. In other words, data silos can become a serious bottleneck for generative AI apps.
Consider a RAG application (an e-commerce chatbot) that combines 1) conversation history, 2) user info, and 3) product recommendations into context. If these datasets, ML services, or retrievers sit in different lines of business, that’s easily another 2-3 months of delay. Often, these departments don’t even know the others exist, and even when they do, they may be hesitant to share data unless some SVP makes it a top priority (below: a data silo in action).
Poor data interoperability is another problem that causes speed bumps. What do we mean by interoperability? Imagine 2 (or more) lines of business using different tech stacks for ETL, orchestration, storage, and so forth - one on Parquet, the other on HDFS. This lack of standardization makes data hard to share and make available to new Gen AI apps, which adds to the delays. RAG isn’t so simple at enterprise scale.
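To make the bottleneck concrete, here’s a minimal sketch of the retrieval step for such a chatbot. The three data clients are hypothetical stand-ins (not real libraries); in practice, each one often belongs to a different line of business with its own access process and tech stack.

```python
# Minimal sketch: assembling context for an e-commerce support chatbot.
# The conversation store, profile service, and rec service are hypothetical
# stand-ins for systems that, in many enterprises, live in separate LOBs.
from dataclasses import dataclass

@dataclass
class ChatContext:
    conversation_history: list[str]   # from the support platform's data store
    user_profile: dict                # from a CRM / identity team
    recommended_products: list[dict]  # from a separate ML / recommendations team

def build_context(user_id: str, session_id: str,
                  conversation_store, profile_service, rec_service) -> ChatContext:
    history = conversation_store.get_messages(session_id)
    profile = profile_service.get_profile(user_id)
    recs = rec_service.recommend(user_id, limit=5)
    return ChatContext(history, profile, recs)

def to_prompt(ctx: ChatContext, question: str) -> str:
    # Every upstream data issue (stale profile, missing history, bad recs)
    # surfaces right here, inside the LLM prompt.
    history_text = "\n".join(ctx.conversation_history)
    return (
        f"Conversation so far:\n{history_text}\n\n"
        f"Customer profile: {ctx.user_profile}\n"
        f"Recommended products: {ctx.recommended_products}\n\n"
        f"Customer question: {question}"
    )
```

The code itself is trivial; the hard part is that each of those three calls may require a separate data-sharing agreement, access request, and integration.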
Issue #2: Poor data governance and audit trails
Here’s another sleeper risk for enterprises: poor data governance. In my experience, most enterprises don’t have a single source of truth for data entitlements and policy enforcement for every employee, nor a well-maintained data catalog with 100% coverage of every dataset in the company. This can turn into a serious security and privacy risk for generative AI apps, which are often end-user facing.
More concretely, poor data governance at enterprises comes from the following gaps, which occur for various reasons (lack of ownership, departmental politics, legacy tech stacks, etc.):
imprecise data permission boundaries for the AI app - i.e. the company didn’t think deeply about precisely which datasets the app should be entitled to read and write
ineffective enforcement of permission boundaries - i.e. the company can’t enforce the permissions properly due to technical or process reasons
bad actors or bad luck - people make suboptimal decisions (intentionally or unintentionally) that leave data insecure
And these issues can turn into severe privacy risks for generative AI apps. Consider the following failure scenarios:
Failures in RAG / chatbots: e.g. your bank’s customer support AI chatbot blurting out your bank balance to your colleague.
Failures in finetuned models:
e.g. PII leaks into the finetuning data for your LLM, resulting in hackers obtaining your customers’ SSNs.
or, the finetuning data is “poisoned” intentionally by bad actors to degrade model performance or decrease safety.
Failures in LLM agents: e.g. LLM agent accidentally grabs more data than it’s entitled to, exposing too much data to end users.
Failures in semantic search: e.g. internal knowledge base search exposes confidential documents to interns or contract workers
In all of the above patterns, the root cause boils down to the AI / LLM app reading unauthorized data and / or serving sensitive data to the wrong person. These things shouldn’t happen with airtight, central, and scalable data governance.
To prevent these failures, enterprises need to build a data catalog that they trust to serve all generative AI apps, and have mechanisms to disallow Gen AI apps from reading data from outside of this catalog. In practice, this means:
cataloging all datasets involved in generative AI workloads (finetuning, RAG, etc.) and having an onboarding process to ensure compliance
rigorously defining permission boundaries for LLM / AI apps - and storing those policies in one place
data permission enforcement: enforcing permissions on every read and write access to your data, ideally in a scalable and fine-grained (FGAC) way - see the sketch after this list
implementing an audit trail for this catalog: an efficient way to audit data access and uncover bad actors or culprits
LLM-assisted data governance: leveraging LLMs to detect governance breaches, e.g. agents that read data access logs to uncover unauthorized access, data leaks, etc.
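As a rough illustration of the enforcement and audit points, here is a minimal sketch of a per-request policy check over retrieved documents. The policy_store interface and the dataset_id field on each chunk are assumptions for this sketch, not a specific product’s API.

```python
# Sketch: enforce dataset-level permissions on every retrieval and log each
# decision, so an audit trail exists. policy_store is a hypothetical interface.
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("rag.audit")

def filter_authorized(docs: list[dict], principal: str, policy_store) -> list[dict]:
    """Keep only documents the principal (a user or an LLM agent) may read."""
    allowed = []
    for doc in docs:
        dataset = doc["dataset_id"]  # every indexed chunk carries its source dataset
        ok = policy_store.can_read(principal=principal, dataset=dataset)
        audit_log.info(
            "access_check principal=%s dataset=%s allowed=%s ts=%s",
            principal, dataset, ok, datetime.now(timezone.utc).isoformat(),
        )
        if ok:
            allowed.append(doc)
    return allowed

# Usage (assuming a retriever and policy store already exist):
#   docs = retriever.search(query, top_k=20)
#   docs = filter_authorized(docs, principal="agent:support-bot", policy_store=policy_store)
```

The same check applies whether the principal is a human or an LLM agent, which is exactly why a single, central policy store matters.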
Issue #3: No data lineage, tracing, and observability
Now, let’s talk lineage and observability for data, which is another area where enterprises struggle. Data observability and tracing are essential for troubleshooting data issues in real time. And since generative AI apps are extremely sensitive to data quality, my claim is that poor data observability will derail efforts to productionize generative AI apps.
Here’s how.
In essence, data lineage and observability boil down to 1) knowing how data was generated by tracing its predecessor(s), and 2) monitoring the status (quality, staleness, etc.) of data.
Knowing these answers is important for maintaining the uptime of systems that leverage RAG (which are the vast majority of LLM apps). The quality of RAG output is a direct function of the quality of the data passed in as context. Therefore, if garbage, irrelevant data is being indexed into vector databases, we want to know ASAP and troubleshoot exactly where it went wrong.
For example, for a financial advisor AI (like the one Morgan Stanley is building), the app will fail, and in a big way, if ETL jobs fail to vectorize today’s stock news. For time-sensitive information such as news, real-time data monitoring is essential. If data is stale or missing, the app will hallucinate or serve outdated information, especially without proper guardrails. If that happens one too many times, users will churn.
Thus, when trouble happens, you need proper data lineage, tracing, and observability, so that your enterprise can triage data quality issues as they emerge and operationalize troubleshooting. For the hypothetical financial advisor AI, that may include:
LLM-assisted data observability that notices today’s date doesn’t match the date of the provided context, and raises an alarm (staleness)
Raising an alarm when context is ill-formatted or simply off-domain (data quality), and seeing exactly where in the data lineage things went wrong - a simple freshness check is sketched below
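To make the staleness check concrete, here is a minimal sketch of a freshness and sanity gate on retrieved context before it reaches the model. The as_of and source fields on each chunk are assumptions about how the pipeline tags data, not a standard schema.

```python
# Sketch: flag stale or malformed context before it is passed to the LLM.
# Assumes each retrieved chunk carries "as_of" (ISO timestamp), "source", and "text".
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)  # for news-like data; tune per dataset

def check_context(docs: list[dict]) -> list[str]:
    """Return a list of warnings; an empty list means the context looks healthy."""
    warnings = []
    now = datetime.now(timezone.utc)
    for doc in docs:
        as_of = datetime.fromisoformat(doc["as_of"])
        if now - as_of > MAX_AGE:
            warnings.append(f"stale: {doc['source']} is {(now - as_of).days} day(s) old")
        if not doc.get("text", "").strip():
            warnings.append(f"empty or missing text from {doc['source']}")
    return warnings

docs = [{"source": "stock_news", "as_of": "2024-01-02T09:00:00+00:00", "text": "..."}]
for w in check_context(docs):
    print("ALERT:", w)  # in production, route this to your observability / on-call tooling
```

A check like this only tells you that something upstream broke; lineage is what tells you which job broke it.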
One of the best “enterprise” data catalog solutions with built-in lineage is Databricks’ Unity Catalog.
Issue #4: ETL, data prep, and entity resolution
In addition to data silos, poor governance, and poor observability, the good old data prep issues can also sneak up as generative AI blockers.
So what do we mean by “data prep issues”?
entity resolution: a sizeable chunk of customer data can’t be cleanly attributed to a single ID (identity) due to a long list of issues - a toy sketch follows this list
masking / scrubbing PII data at scale: this is important when finetuning LLMs
stale data: ETL jobs run too infrequently to support some real time use cases that need the most up-to-date data
etc
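As a toy illustration of the entity resolution point above, the sketch below collapses customer records that share a normalized email or phone number into one canonical ID. Real entity resolution needs fuzzy matching, survivorship rules, and human review; this only shows the shape of the problem.

```python
# Toy sketch: merge customer records that share a normalized email or phone into
# one canonical ID. Production entity resolution is far messier; this only
# illustrates the basic idea.
import re

def normalize_email(email: str) -> str:
    return email.strip().lower()

def normalize_phone(phone: str) -> str:
    return re.sub(r"\D", "", phone)  # keep digits only

def resolve(records: list[dict]) -> dict[str, str]:
    """Map each raw record id to a canonical id (the first record seen for that key)."""
    canonical_by_key: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for rec in records:
        keys = [f"email:{normalize_email(rec['email'])}",
                f"phone:{normalize_phone(rec['phone'])}"]
        canon = next((canonical_by_key[k] for k in keys if k in canonical_by_key), rec["id"])
        for k in keys:
            canonical_by_key.setdefault(k, canon)
        mapping[rec["id"]] = canon
    return mapping

records = [
    {"id": "crm-1", "email": "Jane.Doe@example.com",  "phone": "+1 (555) 010-2000"},
    {"id": "pos-9", "email": "jane.doe@example.com ", "phone": "555-010-2000"},
]
print(resolve(records))  # {'crm-1': 'crm-1', 'pos-9': 'crm-1'} - the same customer
```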
These data quality issues have always been elephants in the room - important, but with no real urgency to fix them. That was okay until generative AI arrived, because the scope of ML use cases had been fairly narrow until now.
But now that AI use cases are seeping across all lines of businesses, all the dirty (data) laundry is being exposed.
Thus, enterprises with better data quality will ship AI-powered products much faster than others. This differential has a historical cause: it used to be okay to just “store all the data in the data lake, just in case we need it later,” and not do anything with it. But now that all of that unstructured data could be useful, data quality is paying dividends.
Entity resolution and deduplication are especially skyrocketing in importance. Consider the use case of delivering LLM-generated, personalized welcome messages at retail stores. Personalized messages are essentially impossible when there’s no clear mapping between user information and the session. This is yet another sneaky issue (identity, entity resolution) hindering generative AI roadmaps.
Note that many SaaS vendors actually “sell” customer data platforms (CDPs) for exactly this problem, though it’s unclear whether a large enterprise should outsource this endeavor.
The growth of finetuning workloads also makes data quality more important than ever:
scalable PII detection and masking is needed when running finetuning jobs, i.e. all labeled / training data should be scanned for PII and masked where necessary - a minimal sketch follows this list
data quality over quantity: much research indicates that the amount of training data needed for finetuning is far, far smaller than the amount needed for pretraining, and that high-quality data improves both the quality and memory efficiency of LLMs. This means there needs to be 1) more manual curation, 2) more LLM-assisted curation, and so forth. Compared to traditional ML, quality matters far more than quantity, unless you are pretraining your own models.
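As a hedged example of the PII point, here is a minimal regex-based scrub pass over finetuning records. The patterns are deliberately simple and illustrative; production systems typically combine pattern matching with ML-based PII detectors and human review.

```python
# Sketch: mask obvious PII (emails, US SSNs, phone-like numbers) in prompt/completion
# pairs before they reach a finetuning job. Regexes are illustrative only.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def mask_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def scrub_records(records: list[dict]) -> list[dict]:
    """Mask PII in every prompt/completion pair destined for finetuning."""
    return [
        {"prompt": mask_pii(r["prompt"]), "completion": mask_pii(r["completion"])}
        for r in records
    ]

example = [{"prompt": "Customer jane@acme.com called about SSN 123-45-6789.",
            "completion": "Escalate to the fraud team."}]
print(scrub_records(example))
# [{'prompt': 'Customer [EMAIL] called about SSN [SSN].', 'completion': 'Escalate to the fraud team.'}]
```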
Issue #5: data flywheel, labeled data, and monetization
Until generative AI, most enterprises only vaguely knew that unstructured data is valuable: only a few grasped how to systematically turn data into products, competitive edge, and ultimately, revenue. Being “data driven” or having “data flywheels” were largely buzzwords used by MBAs at most enterprises.
This changed with generative AI, which clearly showed several paths to monetizing data via new experiences (e.g. chatbots, semantic search, agent assist) or just as currency (e.g. selling it). With a clearer monetization path for data, companies need to think harder than ever about:
what is their “data advantage”
what is “core data” that is business critical - never to be shared with partners
how to manage data labeling workflows
which datasets to acquire
which datasets to protect and never sell
which datasets to monetize, etc.
The answers to these questions are intertwined with business strategy, so the C-suite needs to be involved.
So far, we talked about how problems with “traditional” data strategy can plague future AI efforts at companies. Next, let’s talk about a few emerging areas of data strategy specific to gen AI.