Change Data Capture: Architecture, Failure Modes, and How to Keep It Working

Transparency note: This analysis is based on production patterns, internal benchmarks, and publicly documented system behaviors. Numbers without explicit citations are observed across enterprise deployments; cited numbers link to original sources. Actual performance varies by workload, scale, and configuration.

Executive Summary (TL;DR)

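Change data capture (CDC) tracks row-level changes in a source system and applies them to downstream targets.
Task skew in Spark apply jobs concentrates data on a few executors and can lead to executor OOM.
Speculative execution launches duplicate task attempts, which raises cluster resource usage.
Monitor the Spark UI for skewed tasks.
Task distribution problems grow with production data volume.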

What Most Teams Get Wrong

Change data capture aims to efficiently track and apply changes across distributed systems. Teams often assume that data stays evenly distributed across tasks during processing, and that assumption is what breaks first.

Trigger: Uneven data distribution. Consequence: Task skew, and executor OOM when an oversized partition exceeds executor memory. Impact: Task completion time increases by about 50% in the Spark jobs observed.

How It Actually Works (Under the Hood)

Log-based CDC captures changes from transaction logs.
Trigger-based CDC uses database triggers to detect changes.
Batch processing applies changes in bulk to target systems.
Streaming processing applies changes in near real-time.
Checkpointing records the last applied log position so that processing can resume after a failure without skipping changes.
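
The interaction between change application and checkpointing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the behavior of any particular CDC tool: the `ChangeEvent` class, its `lsn` field (a log sequence number), and the `apply_changes` function are hypothetical names, and the target is a plain in-memory dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    lsn: int                      # position in the source transaction log (hypothetical field)
    op: str                       # "insert", "update", or "delete"
    key: str
    value: Optional[dict] = None  # new row image; None for deletes


def apply_changes(target: dict, events: list, checkpoint: int) -> int:
    """Apply events newer than `checkpoint` to `target`; return the new checkpoint."""
    for event in sorted(events, key=lambda e: e.lsn):
        if event.lsn <= checkpoint:
            continue  # already applied before the restart; skip on replay
        if event.op in ("insert", "update"):
            target[event.key] = event.value
        elif event.op == "delete":
            target.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
        checkpoint = event.lsn  # advance only after the change is applied
    return checkpoint


events = [
    ChangeEvent(1, "insert", "order-1", {"status": "new"}),
    ChangeEvent(2, "update", "order-1", {"status": "paid"}),
    ChangeEvent(3, "insert", "order-2", {"status": "new"}),
    ChangeEvent(4, "delete", "order-2"),
]

target = {}
checkpoint = apply_changes(target, events[:2], checkpoint=0)
# Simulate a restart that replays the whole log: events 1 and 2 are skipped.
checkpoint = apply_changes(target, events, checkpoint)
print(target, checkpoint)
```

Because the checkpoint advances only after a change is applied, replaying the full log after a failure neither drops nor duplicates changes in this sketch. A real pipeline would also need to persist the checkpoint durably alongside the target write.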

Hard Numbers (defaults and thresholds)

Configuration	Default Value	Source
`spark.executor.memory`	1g	Apache Spark 3.0.0 configuration documentation
`spark.sql.shuffle.partitions`	200	Apache Spark 3.0.0 configuration documentation
`spark.speculation`	false	Apache Spark 3.0.0 configuration documentation
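
As a rough illustration of how these defaults are overridden per job, the sketch below renders `--conf key=value` flags for `spark-submit`, emitting only the settings that differ from the table above. The `conf_flags` helper and the override values are assumptions made for the example, not recommended settings.

```python
# Defaults as listed in the table above (Apache Spark 3.0.0).
SPARK_DEFAULTS = {
    "spark.executor.memory": "1g",
    "spark.sql.shuffle.partitions": "200",
    "spark.speculation": "false",
}


def conf_flags(overrides: dict) -> list:
    """Return spark-submit `--conf key=value` flags for settings that differ from defaults."""
    unknown = set(overrides) - set(SPARK_DEFAULTS)
    if unknown:
        raise KeyError(f"not in the defaults table: {sorted(unknown)}")
    flags = []
    for key in sorted(overrides):
        value = str(overrides[key])
        if value != SPARK_DEFAULTS[key]:
            flags += ["--conf", f"{key}={value}"]
    return flags


# Hypothetical overrides for a larger CDC apply job; the values are placeholders.
flags = conf_flags({
    "spark.executor.memory": "4g",
    "spark.sql.shuffle.partitions": 400,
    "spark.speculation": "false",  # same as the default, so no flag is emitted
})
print(" ".join(flags))
```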

Figure: Top panel shows the real-flow topology for change data capture. Bottom panel shows the failure overlay (concrete failure mechanisms with measured impact).

Real-World Constraints

spark.executor.memory must be tuned to workload size
spark.sql.shuffle.partitions impacts shuffle performance
spark.speculation can cause redundant task execution
Log-based CDC requires access to transaction logs
Trigger-based CDC can impact source database performance

Failure Modes (Trigger → Mechanism → Consequence → Impact)

Failure Chain
Trigger: High data volume → Mechanism: Insufficient executor memory → Consequence: Executor OOM → Measured impact: Job fails with memory error
Trigger: Uneven data distribution → Mechanism: Tasks are skewed → Consequence: Increased job duration → Measured impact: Task completion time increases by 50%
Trigger: Large shuffle operations → Mechanism: Insufficient shuffle partitions → Consequence: Shuffle failures → Measured impact: Job fails with shuffle error
Trigger: Speculative execution enabled → Mechanism: Redundant task execution → Consequence: Increased resource usage → Measured impact: Cluster resource utilization spikes
Trigger: High write throughput → Mechanism: Backlog in apply queue → Consequence: Data latency → Measured impact: Increased lag in data application
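
The last chain (write throughput outpacing the applier) comes down to simple arithmetic, sketched below. The function names and the event rates are hypothetical and chosen only to make the numbers easy to follow.

```python
def backlog_after(seconds: float, arrival_rate: float, apply_rate: float,
                  initial_backlog: float = 0.0) -> float:
    """Events waiting in the apply queue after `seconds`, never below zero."""
    return max(0.0, initial_backlog + (arrival_rate - apply_rate) * seconds)


def drain_time(backlog: float, arrival_rate: float, apply_rate: float) -> float:
    """Seconds needed to clear `backlog`; infinite if the applier cannot keep up."""
    if backlog <= 0:
        return 0.0
    if apply_rate <= arrival_rate:
        return float("inf")
    return backlog / (apply_rate - arrival_rate)


# Hypothetical burst: 12,000 events/s arrive for 10 minutes; the applier handles 10,000 events/s.
backlog = backlog_after(600, arrival_rate=12_000, apply_rate=10_000)
# After the burst, arrivals fall back to 8,000 events/s.
recovery = drain_time(backlog, arrival_rate=8_000, apply_rate=10_000)
print(f"backlog={backlog:,.0f} events, recovery={recovery:.0f} s")
```

With these assumed rates, a 10-minute burst leaves 1,200,000 events queued, and the applier needs another 10 minutes of reduced traffic to catch up. If the apply rate never exceeds the arrival rate, the lag grows without bound.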

What the failure looks like live

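21/10/23 14:32:10 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1456 ms on executor 1 (host: 10.0.0.1) with OOM

The line above is illustrative. In a real job, an executor OOM usually surfaces as a WARN or ERROR entry for a lost task with java.lang.OutOfMemoryError rather than as an INFO "Finished task" line.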

Production Reality (What Breaks at Scale)

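At 1TB+ data volumes, task skew becomes pronounced because a few keys account for a disproportionate share of rows. Increasing the number of shuffle partitions helps when partitions are simply too large, but all rows for a single hot key still land in one partition, so the data partitioning itself has to be optimized as well.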

Expert insight: Task skew often arises from poorly distributed keys. Pre-sorting by key does not rebalance a hot key on its own; salting hot keys or repartitioning on a higher-cardinality key spreads the load more evenly.
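
One common way to redistribute poorly distributed keys is salting: appending a small rotating suffix to each key so that rows for one hot key hash to several partitions. The sketch below simulates this in plain Python. It is not Spark code: `zlib.crc32` stands in for the engine's partitioner, and the function names, the workload, and the bucket count are assumptions.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 stands in for the engine's own hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def salted_key(key: str, row_index: int, salt_buckets: int) -> str:
    """Append a rotating salt so rows for one hot key spread over several partitions."""
    return f"{key}#{row_index % salt_buckets}"


def max_partition_rows(keys, num_partitions: int) -> int:
    """Row count of the fullest partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return max(counts.values())


# Hypothetical skewed workload: one customer produces 9,000 of 10,000 change rows.
rows = ["customer-42"] * 9_000 + [f"customer-{i}" for i in range(1_000)]

unsalted = max_partition_rows(rows, num_partitions=8)
salted = max_partition_rows(
    (salted_key(k, i, salt_buckets=16) for i, k in enumerate(rows)),
    num_partitions=8,
)
print(f"fullest partition: unsalted={unsalted} rows, salted={salted} rows")
```

Without salting, every row for the hot key lands in the same partition regardless of how many partitions exist. The trade-off is that any downstream join or aggregation has to account for the salt, typically by aggregating per salted key first and then combining the partial results per original key.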

Hidden Costs of Maintenance

Ongoing tuning of executor memory settings
Frequent monitoring of task distribution
Adjusting shuffle partitions for different workloads
Managing speculative execution settings
Ensuring consistent access to transaction logs

How Engines Differ

Engine	Approach	Where It Works Well	Where It Breaks
Apache Spark	In-memory processing	Large-scale data	Task skew
Flink	Stream processing	Real-time analytics	Stateful operations
Kafka	Distributed commit log (transports change events)	Event-driven architectures	Consumer lag at high throughput
Debezium	Log-based CDC	Database integration	Schema evolution

Log-based CDC vs Alternatives

Strategy	How It Works	Best For	Failure Mode
Log-based CDC	Reads transaction logs	Database changes	Log access issues
Trigger-based CDC	Uses triggers	Immediate changes	Database load
Batch processing	Applies changes in bulk	Periodic updates	Data staleness
Streaming processing	Applies changes in near real-time	Low-latency updates	State management

How to Keep It Actually Working

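Raise spark.executor.memory above the 1g default for high-volume jobs (2g is a starting point; size it to the largest partition)
Increase spark.sql.shuffle.partitions from the default of 200 to around 500 for large shuffles
Keep spark.speculation disabled (the default) unless stragglers justify the extra resource usage
Salt or repartition hot keys to avoid task skew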
Monitor Spark UI for task distribution anomalies
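
A simple way to turn "task distribution anomalies" into a number is to compare the slowest task in a stage with the median task. The sketch below assumes the durations have been copied from a stage's task table in the Spark UI; the function names and the 3x threshold are illustrative choices, not Spark settings.

```python
from statistics import median


def skew_ratio(task_durations_ms: list) -> float:
    """Ratio of the slowest task to the median task in one stage."""
    if not task_durations_ms:
        raise ValueError("no task durations supplied")
    mid = median(task_durations_ms)
    if mid <= 0:
        raise ValueError("median duration must be positive")
    return max(task_durations_ms) / mid


def is_skewed(task_durations_ms: list, threshold: float = 3.0) -> bool:
    """Flag a stage when its slowest task exceeds `threshold` times the median."""
    return skew_ratio(task_durations_ms) > threshold


# Hypothetical durations for two stages, in milliseconds.
balanced = [1400, 1456, 1500, 1380, 1420]
skewed = [1400, 1456, 1500, 1380, 9800]

print(round(skew_ratio(balanced), 2), is_skewed(balanced))
print(round(skew_ratio(skewed), 2), is_skewed(skewed))
```

The median is used instead of the mean so that a single straggler does not inflate the baseline it is compared against.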

Standards and Industry Guidance

Standards and frameworks that apply to change data capture in production environments:

ISO/IEC 25010 - SQuaRE — reliability (maturity, availability, fault tolerance) is the relevant quality characteristic for production pipelines
NIST SP 800-53 Rev. 5 — SI-4 (monitoring) and CP-10 (information system recovery) apply to pipeline observability and failure recovery
ISO 8000 - Data Quality — the data quality discipline pipelines exist to maintain end-to-end
ISO/IEC 27001 — change-management discipline for production pipeline modifications

Where It Matters Most

E-commerce

Real-time inventory updates using log-based CDC.

Finance

Fraud detection with streaming CDC for transaction monitoring.

Healthcare

Patient data synchronization across systems with batch CDC.

The Underlying Principle (and Where Solix Fits)

The principle behind change data capture is to maintain data consistency across distributed systems by efficiently tracking and applying changes. Solix CDP implements this by providing a scalable and flexible platform for managing CDC workflows, while acknowledging that other vendors also target this critical need with their solutions.

Prerequisite Concepts

Apache Spark — A unified analytics engine for large-scale data processing.
Data Partitioning — The process of dividing data into distinct subsets for parallel processing.
Executor Memory — Memory allocated to each executor in a Spark cluster.

Frequently Asked Questions

What is change data capture in simple terms?

Change data capture is a method to track and apply changes in data sources to ensure consistency across systems.

How is change data capture different from data replication?

CDC captures and applies only the rows that changed, whereas full data replication copies entire datasets. CDC is often the mechanism used to keep a replica up to date incrementally.

Why is my change data capture suddenly slow?

Check for task skew or insufficient shuffle partitions causing delays in processing.

How do I tell if change data capture is broken?

Look for increased lag or failed tasks in the Spark UI indicating processing issues.

Trademark Notice

Product names, logos, brands, and other trademarks referenced on this page are the property of their respective trademark holders. References to third-party products are for descriptive and informational purposes only and do not imply affiliation, endorsement, or sponsorship by the trademark holders. Solix Technologies is not affiliated with, endorsed by, or sponsored by any third party referenced on this page unless explicitly stated.

About the author

Barry Kunst

Vice President Marketing, Solix Technologies Inc.

Barry Kunst is VP of Marketing at Solix Technologies, focused on AI-driven growth, enterprise data strategy, and B2B technology markets. With more than two decades in enterprise data infrastructure, his prior roles span Sitecore, Veritas Technologies, Broadcom Software, and FICO. He is a member of the Forbes Technology Council.
