Backpressure
Backpressure occurs when downstream operators can't keep up with upstream throughput. Flink's credit-based flow control propagates backpressure naturally through the pipeline, preventing data loss without unbounded buffering.
What Backpressure Is
Backpressure is the situation where a downstream operator processes data slower than the upstream operator produces it. Without flow control, buffers would grow unbounded until OOM. With flow control, the slow operator signals upstream to slow down, and the pressure propagates backward through the pipeline.
The Highway Traffic Analogy
Backpressure is like traffic on a highway. When there's an accident (slow operator) ahead, cars (events) pile up behind it. The congestion propagates backward: cars 5 miles back slow down even though they can't see the accident. Eventually, the on-ramp (source) gets backed up too. Flink's flow control is like traffic signals on the on-ramp: they meter the flow to match downstream capacity.
```
Backpressure scenario:

  Source (100K events/sec) → Map → KeyBy → Window → Sink (50K events/sec) ← BOTTLENECK!

Without flow control:
  Source produces 100K/sec
  Sink can only handle 50K/sec
  Buffer grows by 50K events/sec
  After 60 seconds: 3M events buffered
  After 5 minutes: OOM crash

With Flink's credit-based flow control:
  Sink signals "I'm full" to Window operator
  Window signals "I'm full" to KeyBy
  KeyBy signals "I'm full" to Map
  Map signals "I'm full" to Source
  Source slows down to 50K/sec (matches sink capacity)

  → No unbounded buffering, no OOM
  → But: end-to-end latency increases
  → And: Kafka consumer lag grows (source reads slower)
```
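To reproduce this behavior locally, here is a minimal sketch of a Flink job with a deliberately slow sink. The SlowSink class and the 1 ms sleep are our illustrative assumptions (the classic SinkFunction API is used for brevity); run it and the Web UI shows the source backpressured while the sink sits near busyTimeMsPerSecond = 1000:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BackpressureDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromSequence(0, Long.MAX_VALUE)  // source emits as fast as it can
           .addSink(new SlowSink());         // bottleneck: ~1K events/sec per subtask

        env.execute("Backpressure demo");
    }

    /** Sink that simulates a slow external system. */
    private static class SlowSink implements SinkFunction<Long> {
        @Override
        public void invoke(Long value, Context context) throws Exception {
            Thread.sleep(1); // ~1 ms per record caps throughput near 1K/sec
        }
    }
}
```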
Backpressure is Normal
Some backpressure is normal and expected; it means your flow control is working. Sustained backpressure is the problem: it means your pipeline can't keep up with the input rate, and you need to either increase capacity or reduce input.
How Flink Handles Backpressure
Flink uses a credit-based flow control mechanism. Each receiver tells the sender how many buffers it can accept (credits). When credits run out, the sender stops sending. This propagates naturally through the entire pipeline without any central coordinator.
```
Credit-based flow control:

  Sender (upstream)               Receiver (downstream)
  +---------------+               +---------------+
  | Output buffer | --- data -->  | Input buffer  |
  |               | <-- credits-- |               |
  +---------------+               +---------------+

  1. Receiver has N input buffers available (e.g., 4)
  2. Receiver sends credits=4 to sender
  3. Sender sends up to 4 buffers of data
  4. Receiver processes buffers, frees them
  5. Receiver sends new credits as buffers are freed
  6. If receiver is slow → no credits sent → sender blocks

Network buffer pool per TaskManager:
  - Configured via taskmanager.memory.network.fraction (default 10%)
  - Split into input buffers and output buffers
  - Each channel (connection between operators) gets exclusive buffers
  - Floating buffers shared across channels for flexibility

Backpressure propagation:
  Sink full   → no credits to Window → Window blocks
  Window full → no credits to KeyBy  → KeyBy blocks
  KeyBy full  → no credits to Source → Source stops reading Kafka
              → Kafka consumer lag increases (visible externally)
```
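The buffer pool described above is tunable in flink-conf.yaml. A minimal sketch, assuming a recent Flink release; the values shown are the usual documented defaults, which vary by version, so treat this as a starting point rather than a recommendation:

```yaml
# Fraction of total Flink memory reserved for network buffers
taskmanager.memory.network.fraction: 0.1
# Lower and upper bounds on that network memory
taskmanager.memory.network.min: 64mb
taskmanager.memory.network.max: 1gb
# Exclusive buffers per channel, plus shared floating buffers per input gate
taskmanager.network.memory.buffers-per-channel: 2
taskmanager.network.memory.floating-buffers-per-gate: 8
```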
The blocking cascade, step by step (a toy simulation follows the list):

1. Receiver runs out of input buffers: the downstream operator is processing slowly, and its input buffer pool fills up completely.
2. No credits sent upstream: since no buffers are free, the receiver can't grant credits to the sender.
3. Sender's output buffers fill: the sender can't send data (no credits), so its output buffers fill up.
4. Sender blocks on output: the sender's thread blocks when trying to write to a full output buffer and stops processing input.
5. Pressure propagates upstream: the blocked sender now can't consume from its own input buffers, causing the same cascade further upstream.
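The same cascade can be felt in plain Java: a bounded queue plays the role of the receiver's input buffers, and put() blocking is the sender running out of credits. This toy model is ours, purely for intuition; Flink implements credits inside its network stack, not with blocking queues:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class CreditFlowToy {
    public static void main(String[] args) throws InterruptedException {
        // 4 "input buffers" on the receiver side. put() blocks when all are in
        // use, which is the moment the sender is out of credits and stalls.
        BlockingQueue<Integer> inputBuffers = new ArrayBlockingQueue<>(4);

        Thread sender = new Thread(() -> {
            try {
                for (int i = 0; i < 20; i++) {
                    inputBuffers.put(i); // no free buffer = no credit = block
                    System.out.println("sent " + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread receiver = new Thread(() -> {
            try {
                for (int i = 0; i < 20; i++) {
                    int v = inputBuffers.take(); // freeing a buffer returns a credit
                    Thread.sleep(100);           // slow consumer: sender stalls behind it
                    System.out.println("processed " + v);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        sender.start();
        receiver.start();
        sender.join();
        receiver.join();
    }
}
```

Run it and you'll see a burst of "sent" lines followed by stalls: the sender fills the four buffers instantly, then advances only at the receiver's pace.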
Detecting Backpressure
Flink provides multiple ways to detect and diagnose backpressure: the Web UI backpressure indicator, task metrics, and checkpoint duration analysis.
```
Detection methods:

1. Flink Web UI → Backpressure tab
   Each operator shows: OK / LOW / HIGH
   - OK:   < 10% of time backpressured
   - LOW:  10-50% of time backpressured
   - HIGH: > 50% of time backpressured

   Reading the UI:
   Source [OK] → Map [OK] → Window [HIGH] → Sink [OK]
   → The FIRST operator showing HIGH is upstream of the bottleneck;
     the actual bottleneck is the operator AFTER the backpressured one.
   → Here, Sink is the bottleneck (Window is backpressured because Sink is slow).

2. Metrics (Prometheus/Grafana)
   - outPoolUsage: % of output buffers in use (high = sending blocked)
   - inPoolUsage: % of input buffers in use (high = receiving backed up)
   - numRecordsOutPerSecond: throughput (dropping = backpressure)
   - busyTimeMsPerSecond: time the operator is busy (1000 = fully saturated)

3. Checkpoint duration
   - Long checkpoint duration often indicates backpressure
   - Barrier alignment takes longer when buffers are full
   - Check "Alignment Duration" in the checkpoint details

4. Kafka consumer lag
   - Growing lag = source is reading slower than the production rate
   - Often caused by downstream backpressure slowing the source
```
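Here is how those metrics combine into a diagnosis, as a small Java sketch (Java 16+ for records). The OperatorMetrics record and the 0.9/800 thresholds are illustrative assumptions, not Flink APIs; the rule it encodes is that the bottleneck is busy but free to send, while backpressured upstream operators are blocked on their output buffers:

```java
/** Illustrative per-operator metrics, e.g. scraped from Flink via Prometheus. */
record OperatorMetrics(String name, double outPoolUsage, double inPoolUsage,
                       long busyTimeMsPerSecond) {}

public class BackpressureDiagnosis {

    /** Classify one operator using the heuristics described above. */
    static String classify(OperatorMetrics m) {
        boolean blockedSending = m.outPoolUsage() > 0.9;    // can't push data downstream
        boolean saturated = m.busyTimeMsPerSecond() > 800;  // more than 80% busy
        if (saturated && !blockedSending) {
            return "BOTTLENECK candidate (busy, but output flows freely)";
        } else if (blockedSending) {
            return "BACKPRESSURED (blocked trying to send downstream)";
        }
        return "OK";
    }

    public static void main(String[] args) {
        // Matches the Web UI example: Window is backpressured, Sink is the bottleneck.
        OperatorMetrics[] pipeline = {
            new OperatorMetrics("Source", 0.20, 0.10, 300),
            new OperatorMetrics("Map",    0.30, 0.20, 350),
            new OperatorMetrics("Window", 0.95, 0.90, 400),
            new OperatorMetrics("Sink",   0.10, 0.95, 1000),
        };
        for (OperatorMetrics m : pipeline) {
            System.out.printf("%-7s %s%n", m.name(), classify(m));
        }
    }
}
```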
Backpressure Indicators
- Web UI shows HIGH backpressure on operators upstream of the bottleneck
- outPoolUsage > 90% on the operator before the bottleneck
- busyTimeMsPerSecond = 1000 on the bottleneck operator (100% busy)
- Checkpoint alignment duration increasing over time
- Kafka consumer lag growing steadily
- numRecordsOutPerSecond dropping below the input rate
Resolving Backpressure
Resolving backpressure requires identifying the bottleneck and either increasing its capacity or reducing the load. The approach depends on whether the bottleneck is CPU-bound, I/O-bound, or state-access-bound.
| Bottleneck Type | Symptoms | Solutions |
|---|---|---|
| CPU-bound operator | busyTime=1000, high CPU usage | Increase parallelism, optimize UDF code, enable chaining |
| I/O-bound sink | Low CPU but high outPoolUsage | Batch writes, async I/O, increase sink parallelism |
| State access (RocksDB) | High disk I/O, slow processElement | Tune RocksDB (cache, write buffer), use SSD, reduce state size |
| Skewed keys | Some subtasks HIGH, others OK | Pre-aggregate before keyBy, use rebalance(), salt hot keys |
| GC pressure | Periodic spikes in latency | Increase heap, tune GC, reduce object creation, use RocksDB |
```java
// Strategy 1: Increase parallelism of the bottleneck operator
stream
    .keyBy(Event::getUserId)
    .process(new ExpensiveFunction())
    .setParallelism(32)          // increase just this operator
    .name("Expensive Processing");

// Strategy 2: Async I/O for slow external calls
AsyncDataStream.unorderedWait(
    stream,
    new AsyncDatabaseLookup(),
    30, TimeUnit.SECONDS,
    200);                        // 200 concurrent requests (up from default)

// Strategy 3: Pre-aggregate to reduce downstream load
stream
    .keyBy(Event::getRegion)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(1)))
    .reduce((a, b) -> merge(a, b))      // pre-aggregate
    .keyBy(Result::getRegion)
    .process(new ExpensiveSink());      // less data to process

// Strategy 4: Handle data skew with salting
stream
    .map(event -> {
        // Add random salt to hot keys
        String saltedKey = event.getKey() + "_" + random.nextInt(10);
        return event.withKey(saltedKey);
    })
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .reduce(reducer)
    // Second stage: aggregate the salt partitions back per original key
    .keyBy(result -> result.getOriginalKey())
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .reduce(reducer);
```
Find the Bottleneck First
Don't increase parallelism blindly. First identify which operator is the actual bottleneck (the one AFTER the backpressured operators). Then determine WHY it's slow (CPU, I/O, state, skew). The fix depends on the root cause: more parallelism doesn't help if the problem is a slow external system or data skew.
Backpressure vs Kafka Consumer Lag
Kafka consumer lag and Flink backpressure are related but distinct concepts. Backpressure is the internal flow control mechanism. Consumer lag is the external symptom visible in Kafka when the source can't read fast enough.
```
Relationship between backpressure and Kafka lag:

Normal operation (no backpressure):
  Kafka production rate:  100K events/sec
  Flink consumption rate: 100K events/sec
  Consumer lag:           ~0 (caught up)
  Backpressure:           none

Backpressure scenario:
  Kafka production rate:      100K events/sec
  Flink processing capacity:   60K events/sec (bottleneck downstream)
  Flink consumption rate:      60K events/sec (throttled by backpressure)
  Consumer lag:               growing by 40K events/sec
  Backpressure:               HIGH on operators upstream of the bottleneck

Key insight:
  Backpressure → source reads slower → lag grows
  Lag is the SYMPTOM; backpressure is the MECHANISM

Lag without backpressure (a different problem):
  - Source parallelism < Kafka partitions (each source subtask must drain several partitions)
  - Deserialization is the bottleneck (the source itself is slow)
  - Network issues between Flink and Kafka

Monitoring both:
  - Kafka lag: tells you HOW FAR BEHIND you are
  - Backpressure metrics: tell you WHERE the bottleneck is
  - Together: diagnose both the severity and the cause
```
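One consequence worth internalizing: lag shrinks only at the rate by which processing capacity exceeds production, so catch-up can take far longer than intuition suggests. A small Java sketch of the arithmetic, using the illustrative rates from the scenario above and assuming the bottleneck is later scaled to 120K events/sec:

```java
public class CatchUpEstimate {
    public static void main(String[] args) {
        double productionRate = 100_000;  // events/sec arriving in Kafka
        double capacityBefore = 60_000;   // events/sec the pipeline can process
        double capacityAfter  = 120_000;  // events/sec after scaling the bottleneck

        // Ten minutes of backpressure at a 40K/sec deficit:
        double lag = (productionRate - capacityBefore) * 600;  // = 24M events

        // Catch-up happens only at the surplus rate (capacity - production):
        double surplus = capacityAfter - productionRate;       // = 20K events/sec
        System.out.printf("Accumulated lag: %.0f events%n", lag);
        System.out.printf("Catch-up time:   %.0f seconds (%.0f minutes)%n",
                lag / surplus, lag / surplus / 60);             // = 1200 sec = 20 min
    }
}
```

Ten minutes of backpressure at a 40K/sec deficit leaves 24M events of lag, and a 20K/sec surplus needs 20 minutes to clear it, twice as long as the outage itself.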
| Metric | What It Tells You | Where to Look |
|---|---|---|
| Kafka consumer lag | How far behind real-time you are | Kafka monitoring (Burrow, Confluent) |
| Backpressure (Web UI) | Which operator is the bottleneck | Flink Web UI → Backpressure tab |
| busyTimeMsPerSecond | How saturated each operator is | Flink metrics (Prometheus) |
| Checkpoint duration | Overall pipeline health | Flink Web UI → Checkpoints |
Acceptable Lag
Some consumer lag is acceptable depending on your SLA. If your requirement is "results within 5 minutes of event time," then 2 minutes of lag is fine. Monitor lag trends: steady lag is okay, growing lag means you're falling further behind and need to scale.
Interview Questions
Q: What is backpressure and how does Flink handle it?
A: Backpressure occurs when downstream operators can't keep up with upstream throughput. Flink handles it with credit-based flow control: each receiver tells the sender how many buffers (credits) it can accept. When credits run out, the sender blocks. This propagates naturally upstream; eventually the source slows down to match the slowest operator's capacity. No data is lost, and no unbounded buffering occurs. The trade-off: end-to-end latency increases and Kafka consumer lag grows.
Q: How do you identify the bottleneck operator causing backpressure?
A: In the Flink Web UI, look at the backpressure indicators. The bottleneck is the operator AFTER the last backpressured operator. Example: Source [OK] → Map [HIGH] → Window [HIGH] → Sink [OK]. The Sink is the bottleneck; Map and Window are backpressured because the Sink is slow. Confirm with metrics: the bottleneck has busyTimeMsPerSecond = 1000 (100% busy) and low outPoolUsage (it's not blocked sending, it's busy processing). Upstream operators have high outPoolUsage (blocked trying to send).
Q: What's the relationship between Flink backpressure and Kafka consumer lag?
A: They're cause and effect. When downstream backpressure propagates to the Kafka source, the source reads slower than the production rate. This causes consumer lag to grow. Lag is the external symptom; backpressure is the internal mechanism. Important: lag can exist without backpressure (source parallelism too low, deserialization bottleneck). And backpressure can exist without lag (if the source isn't Kafka, or if the bottleneck is between two internal operators). Monitor both: lag tells you severity, backpressure tells you location.
Q: How would you resolve backpressure caused by data skew?
A: Data skew means some keys have much more data than others: a few subtasks are overloaded while others are idle. Solutions: (1) Two-stage aggregation with key salting: add a random suffix to hot keys, aggregate per salt, then aggregate across salts. (2) Pre-aggregation: use a small tumbling window before the expensive operation to reduce volume. (3) Rebalance: if the operation doesn't need keying, use .rebalance() for round-robin distribution. (4) Custom partitioner: distribute hot keys across multiple subtasks with a custom KeySelector.
Common Mistakes
Increasing parallelism of the wrong operator
Seeing backpressure on the Map operator and increasing Map's parallelism. But Map is backpressured because the downstream Sink is slow; increasing Map parallelism makes it worse (more data hitting the same slow sink).
Fix: Always identify the actual bottleneck (the operator AFTER the backpressured ones) before scaling. Increase parallelism of the bottleneck operator, not the backpressured upstream operators.
Adding network buffers to fix backpressure
Increasing taskmanager.memory.network.fraction thinking more buffers will fix backpressure. More buffers just delay the onset: they fill up eventually and backpressure returns, now with higher memory usage.
Fix: More buffers don't fix throughput problems; they just add latency before backpressure kicks in. Fix the root cause: optimize the bottleneck operator, increase its parallelism, or reduce the input rate.
Ignoring data skew
All subtasks of an operator show OK except one that shows HIGH backpressure. This is the hot-key problem: one key has 100x more data than others, overloading a single subtask.
Fix: Check per-subtask metrics in the Web UI. If one subtask is saturated while others are idle, you have skew. Use key salting (two-stage aggregation), pre-aggregation, or a custom partitioner to distribute hot keys.
Not monitoring backpressure proactively
Only noticing backpressure when Kafka lag alerts fire or results are delayed. By then, you're already behind and catching up takes time.
Fix: Set up Grafana dashboards with busyTimeMsPerSecond and outPoolUsage per operator. Alert when busyTime > 800 ms/sec (80% saturated); this gives you warning before full backpressure develops. Also monitor checkpoint duration trends.