
Customer Data Processing for Capturing Intent Signals in Outbound Marketing: Why Packaged CDPs Struggle and Warehouse-Native Architectures Win

The short answer. Customer data processing for custom intent signals — the workload of capturing, extracting, scoring, and activating proprietary buying signals — sits awkwardly inside packaged CDPs and runs naturally on warehouse-native architectures. The reason is structural, not configurational. This post walks through why, and what a working reference architecture looks like.

What is customer data processing for custom intent signals in outbound marketing?

Customer data processing involves building the pipeline that captures raw customer signals, stores them at scale, extracts meaning from them, scores them, resolves them to accounts and contacts, and activates them downstream. In the custom intent use case, it’s the engineering layer between collecting proprietary signals and operationalising them in sales and marketing.

And the workload is very different from traditional CDP event tracking.

You’re ingesting HTML diffs from careers pages, pricing pages, integration directories, and changelogs. You’re processing podcasts, webinars, earnings calls, and public posts that require speech-to-text and LLM-based extraction. You’re handling semi-structured feeds from GitHub, Crunchbase, job boards, and funding APIs. You’re enriching LinkedIn-derived activity. And increasingly, you’re generating vector embeddings so teams can ask semantic questions like: “Which target CIOs discussed this problem publicly in the last 90 days?”
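To make the capture side concrete, here is a minimal sketch of one such source, assuming a crawler that polls a careers page, diffs it against the last stored snapshot, and emits a raw artifact for the landing zone. The URL and local snapshot directory are illustrative stand-ins, not a prescribed layout.

```python
# Illustrative sketch (not production code): poll a careers page, diff it
# against the previous snapshot, and emit a raw change artifact.
import difflib
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

import requests

SNAPSHOT_DIR = Path("snapshots")              # stand-in for object storage
CAREERS_URL = "https://example.com/careers"   # hypothetical target page


def capture_careers_diff(account_id: str, url: str = CAREERS_URL) -> dict | None:
    """Fetch the page, compare with the last snapshot, return a raw artifact or None."""
    html = requests.get(url, timeout=30).text
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    snapshot_path = SNAPSHOT_DIR / f"{key}.html"

    previous = snapshot_path.read_text() if snapshot_path.exists() else ""
    snapshot_path.parent.mkdir(parents=True, exist_ok=True)
    snapshot_path.write_text(html)

    if previous == html:
        return None  # no change, no signal

    diff = "\n".join(difflib.unified_diff(previous.splitlines(), html.splitlines(), lineterm=""))
    return {
        "account_id": account_id,
        "source": "careers_page",
        "url": url,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "raw_diff": diff,  # lands in the warehouse as a semi-structured column
    }


if __name__ == "__main__":
    artifact = capture_careers_diff(account_id="acct_123")
    if artifact:
        print(json.dumps(artifact)[:500])
```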

The output is a constantly changing graph of scored signals tied to accounts and contacts, with decay logic, enrichment pipelines, and asynchronous processing throughout.
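The decay piece is easy to make concrete. One common pattern, offered here as an assumption rather than a prescription, is exponential time decay with a per-signal-type half-life, so a recent funding round still carries weight while a month-old careers-page change has largely faded:

```python
# Illustrative exponential-decay scoring: each signal's contribution halves
# every `half_life_days`, so stale signals fade out of the account score.
import math
from datetime import datetime, timezone

# Hypothetical per-signal-type weights and half-lives (days); tune to your ICP.
SIGNAL_CONFIG = {
    "funding_round":   {"weight": 40, "half_life_days": 90},
    "exec_hire":       {"weight": 30, "half_life_days": 60},
    "careers_page":    {"weight": 10, "half_life_days": 30},
    "podcast_mention": {"weight": 20, "half_life_days": 45},
}


def decayed_score(signal_type: str, observed_at: datetime, now: datetime | None = None) -> float:
    """Current contribution of a single signal to an account score."""
    cfg = SIGNAL_CONFIG[signal_type]
    now = now or datetime.now(timezone.utc)
    age_days = (now - observed_at).total_seconds() / 86400
    return cfg["weight"] * math.exp(-math.log(2) * age_days / cfg["half_life_days"])


def account_score(signals: list[dict]) -> float:
    """Sum decayed contributions across all signals resolved to one account."""
    return sum(decayed_score(s["type"], s["observed_at"]) for s in signals)
```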

This is not a “track a click event” workload. It’s a content intelligence pipeline feeding revenue teams — exactly the kind of unstructured, bursty, signal-heavy processing most packaged CDPs were never designed to handle.

Why do packaged CDPs struggle with this kind of customer data processing?

If you map this workload against the capabilities of the customer data infrastructure layer, the friction shows up almost everywhere. The most acute failure modes:

Multi-source ingestion patterns are wrong. Packaged CDPs ship with connectors for stable SaaS sources — CRM, e-commerce platforms, web analytics, marketing automation. They do not ship with crawlers, RSS pollers, STT pipelines, or LLM extraction steps. Every custom signal source becomes a bespoke engineering effort against a closed platform that was never designed to host that engineering.

Storage architecture is wrong. Profile and event tables are the wrong primitives for raw HTML, transcripts, JSON blobs, and embeddings. Packaged stores are tuned for profile lookup and segmentation queries. You can technically stuff a 50KB transcript into a custom attribute, but you’re abusing the schema and paying enterprise rates for what is fundamentally object storage.

The cost model penalises you. Per-profile and per-event pricing was designed for B2C clickstream economics. Bulk unstructured ingestion at the scale this use case demands — millions of scraped pages, thousands of transcripts, embeddings for all of it — explodes commercial models that assume your unit of cost is a known customer.

You don’t own the IP. These custom signals are arguably the highest-value proprietary data your marketing team will ever produce. Locking them inside a vendor’s schema, behind their export limits and data residency rules, defeats the strategic point of building them in the first place. Portability isn’t a nice-to-have here — it’s the entire business case.

The compute model is closed. Custom signal extraction needs Python, transformer models, vector operations, and the ability to call external LLM APIs at scale. Most packaged CDPs offer a thin segmentation DSL and, if you’re lucky, a vendor-curated AI layer with a narrow set of pre-built models. That isn’t where GTM engineering — the practice of small operator-led teams building proprietary signal pipelines against the CRM and the warehouse — can credibly happen.

Real-time query and writeback concurrency falls over. Deep research at scale implies agents reading account context, writing scored signals back, and looping. Packaged CDPs were architected for batch segment computation and downstream activation. Hundreds of concurrent agent processes reading and writing to the profile layer is a load profile they typically can’t sustain — and one I expect to become non-negotiable over the next eighteen months.

Identity is contact-centric and B2C-shaped. Most custom intent signals are account-level long before they’re contact-level. A funding event happens to a company; a new CIO joins a company; five engineering job posts appear on a company’s careers page. Most packaged CDPs were built around a B2C identity graph where the unit of analysis is a person. Forcing account-first signal logic into that model is possible, but it’s friction every step of the way.

None of these are configuration problems you fix in a procurement cycle. They are structural.

Packaged CDP vs warehouse-native: customer data processing at a glance

| Customer data processing capability | Packaged CDP | Warehouse-Native |
| --- | --- | --- |
| Multi-source ingestion | SaaS connectors only; custom sources need workarounds | Crawlers, API pollers, STT, LLM extractors all natively supported |
| Storage primitives | Profile + event tables | Semi-structured types (VARIANT, JSON, ARRAY), vector indexes, object storage |
| Compute model | Closed segmentation DSL + curated AI layer | Open: SQL, Python, dbt, Spark, in-warehouse LLM functions |
| Cost model | Per-profile / per-event | Per-compute / per-storage, decoupled from licensing |
| Data ownership | Locked in vendor schema | Open table formats (Iceberg, Delta), portable by default |
| Agentic AI concurrency | Batch-optimised, breaks under load | Designed for concurrent read/write at scale |
| Identity model | Often B2C / contact-centric | Account-first via dbt or composable identity layer |

How does a warehouse-native architecture handle customer data processing differently?

The inverse, mostly. The same capability dimensions, but with the architecture working with you rather than against you.

Modern cloud warehouses — Snowflake, BigQuery, Databricks — store semi-structured data natively through VARIANT, JSON, and ARRAY types. You can land raw scraped artifacts without schema gymnastics and query across them with SQL. Each of the three has shipped first-party LLM functions — Snowflake Cortex, BigQuery ML, Databricks AI Functions — that let you call extraction, classification, embedding, and summarisation models inside the warehouse, against data that never has to leave it. Vector storage and similarity search are now table stakes in the same platforms.
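As a sketch of what that looks like in practice on Snowflake: land a raw transcript as a VARIANT column, then run extraction in-warehouse with a Cortex call. The table shape, prompt, and model choice below are illustrative assumptions, and Cortex model availability varies by account and region, so treat this as the shape of the pattern rather than copy-paste SQL.

```python
# Sketch: land a raw artifact as VARIANT, then run in-warehouse LLM extraction
# with Snowflake Cortex. Table names, prompt, and model are illustrative.
import json

import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",  # or key-pair auth
    warehouse="SIGNALS_WH", database="SIGNALS", schema="RAW",
)
cur = conn.cursor()

# 1. Land the raw artifact without schema gymnastics: VARIANT holds the JSON as-is.
artifact = {"account_id": "acct_123", "source": "earnings_call", "transcript": "..."}
cur.execute(
    "INSERT INTO raw_artifacts (payload) SELECT PARSE_JSON(%s)",
    (json.dumps(artifact),),
)

# 2. Extract a structured signal inside the warehouse; the data never leaves it.
cur.execute("""
    SELECT
        payload:account_id::string AS account_id,
        SNOWFLAKE.CORTEX.COMPLETE(
            'mistral-large',
            'Extract any buying-intent statements from this transcript as JSON '
            || 'with fields {"intent": bool, "topic": string}: '
            || payload:transcript::string
        ) AS extracted
    FROM raw_artifacts
    WHERE payload:source::string = 'earnings_call'
""")
for account_id, extracted in cur.fetchall():
    print(account_id, extracted[:200])
```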

Compute is decoupled from storage and from licensing. Your scraper budget is separate from your activation budget. A spike in crawl volume doesn’t trigger a renegotiation with a CDP vendor — it triggers a slightly larger compute bill that scales linearly with what you’re actually doing.

Open table formats — Iceberg, Delta — mean the signal layer you’ve spent engineering effort building is portable by default. You can switch query engines, swap activation tooling, and keep the asset.

And critically, the language of the platform is the language of GTM engineering. SQL, Python, dbt, and Airflow or Dagster for orchestration. Composable activation layers like Hightouch, Census, GrowthLoop, and RudderStack sit on top of the warehouse and handle the identity-resolution-and-syndication problem without forcing you to centralise everything inside a closed system.
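For orchestration, a minimal Dagster sketch of the same pipeline might look like the following. The asset names are illustrative and the bodies are stubs; the point is that crawl, extract, and activate become versioned, observable software assets rather than hidden vendor configuration.

```python
# Minimal Dagster sketch: the signal pipeline expressed as software-defined assets.
# Asset names are illustrative and the function bodies are stubs.
from dagster import Definitions, asset


@asset
def raw_artifacts() -> list[dict]:
    """Crawl pages, poll APIs, and transcribe audio; land raw artifacts in the landing zone."""
    return [{"account_id": "acct_123", "source": "careers_page", "raw_diff": "..."}]


@asset
def extracted_signals(raw_artifacts: list[dict]) -> list[dict]:
    """Run LLM extraction over raw artifacts and emit typed, scored signals."""
    return [{"account_id": a["account_id"], "type": a["source"], "score": 10} for a in raw_artifacts]


@asset
def activated_segments(extracted_signals: list[dict]) -> list[str]:
    """Resolve signals to accounts and hand segments to the reverse-ETL layer."""
    return sorted({s["account_id"] for s in extracted_signals if s["score"] > 0})


defs = Definitions(assets=[raw_artifacts, extracted_signals, activated_segments])
```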

This is the same answer the pillar post arrives at for agentic AI workflows generally. Custom intent signal capture just happens to be the customer data processing use case where it hits hardest first.

A reference architecture

  • Capture layer — Playwright or Scrapy-based crawlers, webhook receivers, scheduled API pollers, speech-to-text services such as Whisper or Deepgram, and where appropriate, LLM-driven research agents producing structured output.
  • Landing zone — Object storage (S3, GCS, ADLS) for raw artifacts, immutable and dated. This is your audit trail and your replay layer when extraction logic changes — and it will change, often.
  • Warehouse — Snowflake, BigQuery, or Databricks holding both the raw artifacts as semi-structured types and the structured outputs of extraction.
  • Extraction layer — In-warehouse LLM functions and external Anthropic or OpenAI APIs, orchestrated through dbt models or Python jobs. Versioned prompts, evaluated outputs, reproducible runs (a minimal sketch of this step follows the list).
  • Signal store — Typed, scored, dated signals keyed to account and contact identifiers, with embeddings indexed for semantic retrieval. This is your proprietary intent layer.
  • Identity layer — dbt models for stitching, or a composable CDP layer (Hightouch, Census, GrowthLoop, RudderStack) sitting on top of the warehouse.
  • Activation — Reverse ETL into CRM (Salesforce, HubSpot), sales engagement (Outreach, Apollo), ad platforms (LinkedIn, Google), and increasingly, agent runtimes that read directly from the signal store.
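Here is the extraction-layer sketch referenced above: one raw artifact read from the landing zone, one typed, scored, dated signal row out. The bucket, prompt wording, scoring rule, and model name are illustrative assumptions, not prescriptions.

```python
# Sketch of the extraction layer: pull a raw artifact from the landing zone,
# extract a structured signal with an external LLM API, and shape it as a row
# for the signal store. Bucket, prompt, scoring, and model name are illustrative.
import json
from datetime import datetime, timezone

import boto3
from anthropic import Anthropic

s3 = boto3.client("s3")
llm = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT_VERSION = "careers-intent-v3"  # versioned prompts, reproducible runs


def extract_signal(bucket: str, key: str, account_id: str) -> dict:
    """Turn one raw artifact into a typed, scored, dated signal row."""
    raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    response = llm.messages.create(
        model="claude-sonnet-4-5",  # substitute whichever model you actually run
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": (
                "Classify whether this careers-page diff indicates hiring in data "
                "or platform engineering. Reply as JSON only: "
                '{"is_signal": bool, "role_families": [string], "confidence": number}\n\n'
                + raw
            ),
        }],
    )
    extracted = json.loads(response.content[0].text)

    return {
        "account_id": account_id,
        "signal_type": "engineering_hiring",
        "score": 10 * extracted.get("confidence", 0) if extracted.get("is_signal") else 0,
        "observed_at": datetime.now(timezone.utc).isoformat(),
        "prompt_version": PROMPT_VERSION,
        "source_key": key,  # points back to the immutable raw artifact for replay
    }
```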

Nothing in this architecture is exotic. All of it is being deployed in production today. None of it fits naturally inside a packaged CDP.

Is warehouse-native always the right approach to customer data processing?

No. The honest trade-offs from the pillar post still apply.

You need engineering capacity, orchestration discipline, and governance maturity. For a five-person marketing team at seed stage, the sensible play is Clay plus an LLM plus a manual LinkedIn workflow — effectively a no-code composable approach with someone else’s warehouse underneath. Same architectural shape, lower ceiling, lower investment. That’s the right answer for that context.

The argument changes at mid-market and enterprise scale. Once the account universe is in the thousands, once multiple signal sources are running concurrently, once sales and revops want signals scored, attributed, and routed deterministically — warehouse-native stops being a preference and starts being the only customer data processing architecture that can hold the workload without breaking the cost model or the engineering team.

Where this leaves you

Custom intent signal capture is a useful stress test for any customer data processing architecture. If your stack can’t capture the signal, store it cleanly, extract structure from it, score it, resolve it to an account, and push it to the systems your team actually works in — then your stack has quietly committed you to the same commoditised feeds your competitors are already buying.

The architectural decision precedes the strategic one. “Build your own signals” is a slide. Building them in a system that can scale, that you own, and that doesn’t price you out of your own data — that’s an outcome.

For the full capability framework underneath this argument, the pillar post on the customer data infrastructure layer walks through each capability dimension and the packaged-versus-composable trade-offs in depth.

FAQ

Can a packaged CDP handle customer data processing for custom intent signals?

In most cases, no — at least not without significant workarounds. Packaged CDPs are architected around profile and event primitives, SaaS connector-based ingestion, and closed compute models. Custom intent signal capture requires unstructured data ingestion, in-warehouse LLM extraction, vector storage, and open compute, none of which are native capabilities in most packaged platforms.

Why are cloud data warehouses better suited to this workload?

Cloud warehouses such as Snowflake, BigQuery, and Databricks provide native semi-structured storage, in-warehouse LLM functions, vector indexes, and open table formats. They decouple compute from licensing, which means bulk unstructured ingestion does not break the cost model. They also speak SQL and Python — the working languages of the engineering teams building these pipelines.

What does a warehouse-native customer data processing stack look like?

A typical warehouse-native stack includes capture tooling (Playwright, Scrapy, Whisper, Deepgram), object storage (S3, GCS, ADLS), a cloud warehouse (Snowflake, BigQuery, Databricks), transformation and orchestration (dbt, Airflow, Dagster), in-warehouse LLM functions (Snowflake Cortex, BigQuery ML, Databricks AI Functions) alongside external APIs (Anthropic, OpenAI), reverse ETL (Hightouch, Census), and a composable identity layer where account resolution is needed.

What is GTM engineering?

GTM engineering is the practice of small operator-led teams using code — SQL, Python, dbt, LLM workflows, custom integrations — to build proprietary go-to-market data pipelines that off-the-shelf vendors do not offer. In the intent signal context, GTM engineering treats customer data processing as a real data engineering discipline rather than a software procurement exercise.

Is warehouse-native customer data processing only for large teams?

No. The architectural shape is the same at any size; only the implementation differs. Small teams can run a low-code version of warehouse-native customer data processing using tools like Clay, lightweight LLM workflows, and manual research routines. Mid-market and enterprise teams need full warehouse-native infrastructure to handle the volume, concurrency, and governance demands at scale.