
Databricks CustomerLake: The Third Path in CDP Architecture

For five years the first-party data ingestion debate has been a binary. It was always a simplification — and now there's a product that doesn't fit either side of it.

For the last five years the debate over first-party data ingestion architecture has been a binary: packaged CDP (Segment, mParticle, Tealium: collect into the vendor's platform, sync to your warehouse later) versus warehouse-native composable (Snowplow + Snowflake + dbt + Hightouch: collect into your own cloud first, treat the warehouse as canonical).

That binary was always a simplification. Databricks announced CustomerLake on June 16, 2026 at its Data + AI Summit, and it's the first credible product that doesn't fit either side of it. It belongs to a third architectural pattern that the existing assessment frameworks — including ours — weren't built to evaluate.

This post is about what that third pattern is, where it sits against the six dimensions of first-party data complexity we've been using as our diagnostic, and what it means for anyone currently working through a CDP architecture decision.

What CustomerLake actually is

CustomerLake is an agentic CDP natively embedded in the Databricks lakehouse — identity resolution, audience building, campaign automation, and activation running directly inside the same governed platform where customer data and AI models already reside, with no requirement to copy data into a separate CDP system.

Two agent types do the work. Profile Agents handle identity resolution via what Databricks calls "Agentic Identity Resolution," combining deterministic, probabilistic, and agentic workflows. Campaign Agents replace one-off campaign workflows with what the company calls "infinity campaigns" — continuous agentic loops that analyse customer signals and act in real time. Unity Catalog governs both. Lakehouse Federation lets the product reach into Snowflake, BigQuery, or cloud object storage without moving data.

It's currently in Private Preview, with HP, Circle K, AB InBev, and Getnet by Santander as named early customers. Not generally available. Not battle-tested at scale. Not a procurement option this quarter.

But the architectural pattern matters even before the product is GA. It collapses a trade-off that's structured every CDP conversation for the last five years.

Why the binary was incomplete

Until now, first-party data ingestion architecture decisions came down to a forced choice:

  • Packaged CDP gave you a managed product, a marketing-operator UI, and speed-to-value. The cost was data custody, payload fidelity, warehouse latency, and metered access to your own events.
  • Warehouse-native composable gave you custody, fidelity, sub-five-minute latency, and clean unit economics. The cost was engineering investment, operational ownership, and the absence of a packaged UI for non-technical operators.

The framework we've been building uses six complexity dimensions — custody, fidelity, schema volatility, latency, access economics, exit posture — to decide which side of that binary fits your business. High-complexity first-party data ingestion needs pushed you to composable; low-complexity needs let you buy packaged.

CustomerLake breaks the binary because it's a managed product where the data stays in your lakehouse. Plot the two choices that have always mattered (whether you get a packaged operator UI, and whether your data stays in your own governed cloud) and the existing options sit in opposite corners. CustomerLake lands in the corner that was empty.

[Figure: quadrant chart. Vertical axis: data stays in your governed cloud (top) versus data copied into a vendor store (bottom). Horizontal axis: no packaged UI (left) versus packaged, operator-ready UI (right).]

  • Warehouse-native composable (Snowplow + Snowflake + dbt + Hightouch): custody + fidelity, no packaged UI. Engineering owns it.
  • CustomerLake (the previously-empty quadrant): managed product AND data stays in your lakehouse under Unity Catalog. Agentic CDP, no copy-out.
  • Roll-your-own scripts: custody but neither product nor custody guarantees. Rarely chosen.
  • Packaged CDP (Segment / mParticle / Tealium): fast UI, speed-to-value, but you give up custody and pay to access your own events.

The two axes that have always mattered. CustomerLake fills the quadrant the binary never had a name for.

The explicit argument from Databricks is that standalone CDPs force data duplication and governance overhead by holding customer data outside the core data platform; embedding the CDP inside the lakehouse removes both.

If that holds in practice, the trade-off changes shape. You can have a managed CDP product AND keep custody in your governed lakehouse. You can have identity resolution AND avoid the schema-standardisation tax. You can have campaign automation AND keep raw events queryable in your warehouse without paying a separate vendor for export rights.

The question is no longer "packaged or composable." It's "packaged-outside-the-lakehouse, composable, or packaged-inside-the-lakehouse."

Where CustomerLake sits against the six complexity dimensions

Provisionally — the product is in Preview and the marketing claims need to be tested against actual environments — here's how the architecture appears to map.

CustomerLake vs the six complexity dimensions
Provisional: the product is in Private Preview; marketing claims still need testing against real environments.

Dimension: verdict, then what still needs proving

  • Custody: STRONG. Stays in your lakehouse under Unity Catalog. No vendor copy.
  • Fidelity: UNKNOWN. Do Profile Agents augment raw payloads, or overwrite source-of-truth?
  • Schema volatility: UNKNOWN. Lakehouse handles drift, but does the agent layer impose schema?
  • Latency: LIKELY STRONG. No batch-sync penalty; agentic-loop latency itself untested.
  • Access economics: DIFFERENT. Consumption-based, not per-MAU. Better or worse is volume-dependent.
  • Exit posture: STRONG*. *Risk shifts from CDP-vendor to Databricks-platform dependency.

Provisional read against the six-dimension diagnostic. Three dimensions are still genuinely open.

Custody. Strong. Data stays in your Databricks lakehouse under Unity Catalog. No duplication into a vendor-controlled store. This is the dimension where CustomerLake makes its loudest claim, and the architecture supports it on paper.

Fidelity. Unknown. The question is what the Profile Agents do to raw ingested payloads. If they augment without standardising, fidelity is preserved. If they materialise unified profiles that downstream activation reads from, it becomes an open question whether the original raw payload is still the source of truth. Worth testing carefully.

Schema volatility. Unknown, but lakehouse-native architectures generally handle schema drift better than packaged CDPs do. The real question is whether the agentic layer imposes implicit schema expectations.

Latency. Should be strong. If activation reads from the same lakehouse the events landed in, the batch-sync penalty that breaks packaged CDPs on this dimension doesn't apply. Specifics of the agentic loop latency need testing.

Access economics. Different. Pricing is a value-aligned consumption model rather than traditional software licensing, so the economics differ from per-MAU packaged CDPs. Whether that is better or worse depends on volume and use case, but the structural complaint about packaged CDPs (paying rent to access data you already generated) doesn't translate cleanly here, because the data is already in your platform.

Exit posture. Strong, with a caveat. The custody argument means that if you stop using CustomerLake, your data doesn't go anywhere; it's already in your lakehouse. The bigger exit question becomes Databricks platform dependency itself, which is a different (and longer-term) bet than CDP vendor dependency.
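
The fidelity question above is one of the easier ones to turn into a repeatable test. The sketch below is not based on any CustomerLake API, since none is documented in this post. It assumes you can export two snapshots yourself, keyed by a hypothetical `event_id`: the raw payloads as ingested, and the same payloads as readable after the profile layer has run. It fingerprints each payload and reports anything missing or altered.

```python
import hashlib
import json


def payload_fingerprint(payload: dict) -> str:
    """Stable SHA-256 fingerprint of a JSON-serialisable payload, independent of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fidelity_report(ingested: dict, after_agents: dict) -> dict:
    """Compare raw payloads by event ID before and after the profile layer has run.

    Both arguments map a (hypothetical) event_id to the raw payload dict.
    """
    missing = sorted(eid for eid in ingested if eid not in after_agents)
    altered = sorted(
        eid
        for eid, payload in ingested.items()
        if eid in after_agents
        and payload_fingerprint(payload) != payload_fingerprint(after_agents[eid])
    )
    return {
        "missing": missing,
        "altered": altered,
        "preserved": not missing and not altered,
    }


# Illustrative data only: the second snapshot has a lower-cased email.
ingested = {"e1": {"email": "A@Example.com", "plan": "pro"}, "e2": {"sku": "123"}}
after = {"e1": {"email": "a@example.com", "plan": "pro"}, "e2": {"sku": "123"}}
print(fidelity_report(ingested, after))
# {'missing': [], 'altered': ['e1'], 'preserved': False}
```

An `altered` entry is not automatically a failure. It is the prompt to ask whether the standardised value replaced the original or sits alongside it.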

What this changes about the framework

The complexity-as-lens diagnostic still works. The six dimensions still describe the demands your business places on first-party data ingestion. What changes is the answer space: there are now three architectural patterns the answer can land on, not two.

This has three practical consequences for anyone using the assessment framework.

First, your scorecard needs a third column. If you're running an evaluation today, you should be testing packaged-outside-lakehouse, composable, AND lakehouse-native — with the understanding that the third option may not be procurable yet, but should be in the scorecard anyway because it's the architecture you may be migrating to within 18 months.
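
A three-column scorecard can be as simple as a weighted average per candidate, with a discount applied to the option that can't yet be tested. The sketch below is one way to do that, not a prescribed method. The 0-5 scale, the weights, the scores, and the 0.7 confidence factor are all placeholders chosen to show the mechanics; none of them is an assessment of any product.

```python
DIMENSIONS = (
    "custody",
    "fidelity",
    "schema_volatility",
    "latency",
    "access_economics",
    "exit_posture",
)


def weighted_score(scores: dict, weights: dict, confidence: float = 1.0) -> float:
    """Weighted average of 0-5 dimension scores, scaled by a 0-1 confidence factor."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    raw = sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight
    return round(raw * confidence, 2)


# Placeholder weights: how much each dimension matters to *your* business.
weights = dict(zip(DIMENSIONS, (3, 2, 1, 2, 1, 1)))

# Placeholder scores per candidate, plus a confidence factor for unproven options.
candidates = {
    "packaged_outside_lakehouse": (dict(zip(DIMENSIONS, (1, 2, 2, 2, 2, 2))), 1.0),
    "composable": (dict(zip(DIMENSIONS, (5, 5, 4, 4, 4, 4))), 1.0),
    "lakehouse_native": (dict(zip(DIMENSIONS, (5, 3, 3, 4, 3, 4))), 0.7),
}

for name, (scores, confidence) in candidates.items():
    print(name, weighted_score(scores, weights, confidence))
```

Multiplying by a confidence factor is a deliberately blunt choice. It keeps the third column visible while making it hard for an untested option to win on claims alone; the factor should move toward 1.0 only as evidence arrives.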

Second, the threshold logic shifts. The previous decision rule was simple: low complexity meant packaged, high complexity meant composable. The revised rule routes through one more question first.

[Figure: revised decision flow]

  • Step 1: measure complexity on the six dimensions.
  • Step 2: already on a lakehouse (Databricks)?
  • No, low complexity: packaged CDP (Segment / mParticle).
  • No, high complexity: warehouse-native composable.
  • Yes, any complexity: CustomerLake-pattern is the leading candidate, pending the Preview unknowns.

Note: being on Databricks lowers the switching cost; it does not, by itself, raise the fit. Vendor-first procurement is still the failure mode.

The revised threshold logic adds one routing question — but lakehouse incumbency lowers switching cost, not the fit bar itself.

Low complexity and not already on a lakehouse — still packaged. High complexity and not already on a lakehouse — still composable. Any complexity and already on Databricks or a similar lakehouse — the CustomerLake-pattern becomes the leading candidate, pending the unknowns above.
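
Written out as code, the revised rule is a two-input lookup. This is a minimal sketch; the `Pattern` names and the boolean inputs are illustrative, and how you collapse six dimension scores into "high" or "low" complexity is left to your own assessment.

```python
from enum import Enum


class Pattern(Enum):
    PACKAGED = "packaged CDP outside the lakehouse"
    COMPOSABLE = "warehouse-native composable"
    LAKEHOUSE_NATIVE = "lakehouse-native (CustomerLake pattern), pending Preview unknowns"


def leading_candidate(high_complexity: bool, on_lakehouse: bool) -> Pattern:
    """Return the leading candidate pattern under the revised rule."""
    if on_lakehouse:
        # Any complexity: the lakehouse-native pattern leads, subject to the open questions.
        return Pattern.LAKEHOUSE_NATIVE
    return Pattern.COMPOSABLE if high_complexity else Pattern.PACKAGED


print(leading_candidate(high_complexity=False, on_lakehouse=False).value)
# packaged CDP outside the lakehouse
print(leading_candidate(high_complexity=True, on_lakehouse=True).value)
# lakehouse-native (CustomerLake pattern), pending Preview unknowns
```

The function returns a leading candidate, not a decision. The output still has to survive the complexity assessment and the Preview unknowns.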

Third, the existing six dimensions remain the right diagnostic, but you'll need to add evidence requirements specific to lakehouse-native CDPs. Does the Profile Agent layer preserve raw payloads? Does the agentic loop introduce its own latency? Does the consumption-based pricing scale linearly with event volume, or are there step functions? These are testable, but they're not in the current due-diligence checklist.
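
The pricing question lends itself to a simple check once you have a few quotes or bills at different volumes. The sketch below assumes a list of `(event_volume, cost)` observations that you collect yourself; the figures shown are invented for illustration and say nothing about actual Databricks pricing. It computes the marginal cost per event between consecutive observations and flags any segment that departs from the typical rate.

```python
from statistics import median


def marginal_rates(points):
    """Cost per additional event between consecutive (volume, cost) observations."""
    ordered = sorted(points)
    return [
        (v1, v2, (c2 - c1) / (v2 - v1))
        for (v1, c1), (v2, c2) in zip(ordered, ordered[1:])
        if v2 > v1  # skip duplicate volumes to avoid dividing by zero
    ]


def find_step_changes(points, tolerance=0.25):
    """Flag segments whose marginal rate differs from the median rate by more than `tolerance` (relative)."""
    rates = marginal_rates(points)
    if not rates:
        return []
    typical = median(rate for _, _, rate in rates)
    return [
        (v1, v2, rate)
        for v1, v2, rate in rates
        if abs(rate - typical) > tolerance * abs(typical)
    ]


# Invented observations: cost jumps between 3M and 4M events.
observations = [
    (1_000_000, 100.0),
    (2_000_000, 200.0),
    (3_000_000, 300.0),
    (4_000_000, 700.0),
    (5_000_000, 800.0),
]
for low, high, rate in find_step_changes(observations):
    print(f"Non-linear segment between {low:,} and {high:,} events")
# Non-linear segment between 3,000,000 and 4,000,000 events
```

The median is used as the baseline so that a single jump doesn't distort the "typical" rate. With only two or three observations the check says very little, so it is worth asking for pricing at more volume points than you expect to need.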

What stays unchanged

The fundamental claim of the framework — that first-party data ingestion architecture decisions follow from honest complexity assessment, not from vendor selection — survives the announcement intact. CustomerLake doesn't change the question. It expands the answer set.

The procurement failure mode also stays the same. Most CDP decisions still get made backwards: vendor first, complexity assessment second. A team that's already moved most analytical workload to Databricks is now going to be sold CustomerLake hard, regardless of whether the complexity profile actually fits. That's the same mistake teams make with Segment and mParticle today — just with a different vendor on the receiving end.

The discipline is the same: measure your complexity on six dimensions, weight them according to what your business actually needs, and let the architecture choice follow. The choice now has three plausible homes instead of two. The methodology that gets you there hasn't changed.

The principle

The first-party data ingestion architecture landscape just got a third path. The diagnostic that decides which path fits your business — first-party data complexity, measured on six dimensions, weighted by your strategic priorities — works the same way it did the day before the CustomerLake announcement.

The question for the framework isn't whether to throw it out. It's whether to extend it. The complexity-as-lens approach holds. The candidate set just expanded.

If you're currently mid-procurement on a packaged CDP, the announcement doesn't mean stop. It means add CustomerLake to the scorecard, weight it for the uncertainty of a Private Preview product, and revisit in 6–12 months when it's GA and there's evidence from real environments to test against.

If you're currently building a warehouse-native composable stack and you're already on Databricks, the announcement is more disruptive. Your decision was right when you made it and may still be right, but the GA timeline for CustomerLake should influence how much you invest in things CustomerLake will eventually subsume.

The framework was always meant to be revised as the architecture landscape changed. This is the first revision that matters.