
Preventing Infinite Retry Loops in Kafka Consumers

How we discovered and fixed a critical issue that could have led to runaway costs and blocked message processing

While building Listify’s real-time content processing pipeline, we uncovered a subtle but dangerous pattern in our Kafka consumers that could have led to infinite retry loops, runaway API costs, and blocked message processing. Here’s what happened and how we fixed it.

The Problem: When Failure Means Forever

Listify uses Apache Kafka to handle asynchronous processing tasks - generating AI-powered metadata, cleaning up deleted content, and processing RSS feeds. Each task is handled by a dedicated consumer service that listens to specific Kafka topics.

This part of our codebase went through rapid development cycles - we needed to get features out quickly. The initial implementation had only a simple, naive retry mechanism: if something fails, try again. We knew this was technical debt, but like many teams under time pressure, we moved forward with a mental note to revisit it later.

This is exactly why documenting technical debt and actually planning it in your backlog matters. What starts as a “we’ll fix it later” can turn into a costly oversight when left unchecked.

The pattern seemed straightforward: consume a message, process it, commit the offset if successful. If processing fails, don’t commit and let the consumer retry. Simple, right?

Not quite.

The Hidden Trap

When you disable auto-commit in Kafka (which you should for reliable processing), you take full responsibility for offset management. If your code never commits a failed message, it will retry forever. That message stays at the front of the queue, blocking all subsequent messages in the same partition.
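For context, disabling auto-commit is a single consumer setting. The sketch below assumes the confluent-kafka Python client (which matches the API shapes used in the snippets in this post); the broker address, group, and topic names are placeholders, not our real configuration:

from confluent_kafka import Consumer

# Placeholder broker, group, and topic names - the important line is enable.auto.commit
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "content-processor",
    "enable.auto.commit": False,      # we commit offsets ourselves, only after success
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["content-events"])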

We had three services with this exact pattern:

while True:
    msg = consumer.poll(1.0)
    
    if process_message(msg):
        consumer.commit(msg)  # Only commit on success
    # If it fails... we just loop and try again

No retry limits. No backoff delays. No escape hatch.

The Cost Impact

One of our services calls external AI APIs to generate embeddings and analyze content. Let’s do the math on a stuck message:

  • API cost per request: $0.00015
  • Retries per second: ~1
  • Time stuck: 1 week

Total cost: ~$90 per stuck message
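The arithmetic behind that number (a week is 604,800 seconds):

# Back-of-the-envelope check on the stuck-message cost
cost_per_request = 0.00015             # USD per API call
retries_per_second = 1
seconds_per_week = 7 * 24 * 60 * 60    # 604,800

print(cost_per_request * retries_per_second * seconds_per_week)  # ~90.72 USD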

With multiple services and messages, this could easily spiral into hundreds or thousands of dollars before anyone noticed. Of course, we had set up usage limits and billing alerts on the service provider side beforehand - something you should always do when using pay-as-you-go services. But relying solely on billing alerts as your safety net isn’t ideal. By the time you get notified, the damage is already done.

The Blocking Problem

Kafka partitions process messages sequentially. When one message can’t be processed, everything behind it waits. In our case:

  • User saves a malformed bookmark → metadata generation fails
  • Generator retries infinitely on that one item
  • All subsequent bookmarks from that user sit in queue
  • User’s new content never gets processed

This isn’t just a cost issue - it’s a user experience disaster.

The Solution: Bounded Retries with Exponential Backoff

We implemented a retry strategy with three key components:

1. Retry Limits

Set clear boundaries for retry attempts:

  • Maximum retry count (typically 8-20 attempts)
  • Maximum retry duration (10-20 minutes)
  • When either limit is hit, give up and commit the offset

This ensures no message stays in the queue forever.
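A rough sketch of the tracking involved is below; our actual RetryTracker (used in the implementation section later) isn't reproduced in full here, so treat this as an illustration of the idea rather than the exact code:

import time

class RetryTracker:
    """Tracks failed attempts per message and enforces count and time limits."""

    def __init__(self, max_retries=8, max_wait_seconds=600):
        self.max_retries = max_retries
        self.max_wait_seconds = max_wait_seconds
        self.attempts = {}       # message_key -> number of failed attempts
        self.first_failure = {}  # message_key -> timestamp of the first failure

    def should_retry(self, key):
        attempts = self.attempts.get(key, 0)
        first = self.first_failure.get(key)
        elapsed = time.time() - first if first is not None else 0
        return attempts < self.max_retries and elapsed < self.max_wait_seconds

    def mark_success(self, key):
        # Clear tracking state once a message is processed or given up on
        self.attempts.pop(key, None)
        self.first_failure.pop(key, None)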

2. Exponential Backoff

Instead of hammering failed operations immediately, space out retries:

Attempt   Delay
1         1 second
2         2 seconds
3         4 seconds
4         8 seconds
5         16 seconds
6         32 seconds
7         64 seconds
8         120 seconds (capped)

This gives transient failures time to resolve while preventing tight retry loops.
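Continuing the RetryTracker sketch from above, the backoff calculation records another failed attempt and doubles the delay each time, capped to match the table:

    def calculate_backoff(self, key, base_delay=1.0, max_delay=120.0):
        """Record another failed attempt and return how long to sleep before retrying."""
        self.first_failure.setdefault(key, time.time())
        attempt = self.attempts.get(key, 0) + 1
        self.attempts[key] = attempt
        # 1s, 2s, 4s, ... doubling each attempt, capped at max_delay
        return min(base_delay * 2 ** (attempt - 1), max_delay)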

3. Circuit Breakers for External APIs

For services calling expensive external APIs, we added circuit breakers:

  • Track consecutive failures
  • After 3 failures, “open” the circuit for 60 seconds
  • Skip API calls during that window
  • Test with one call after timeout

This prevents burning through API credits when a service is genuinely down.
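Here's a minimal circuit breaker sketch using those numbers (3 consecutive failures, 60-second window). It's an illustration of the idea, not our production code:

import time

class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures and blocks calls
    for `reset_timeout` seconds, then allows a single test call through."""

    def __init__(self, failure_threshold=3, reset_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.consecutive_failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True   # circuit closed: calls go through normally
        if time.time() - self.opened_at >= self.reset_timeout:
            return True   # half-open: let one test call through
        return False      # open: skip the external call entirely

    def record_success(self):
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.time()

One way to wire it in is to check allow_request() before each external API call and treat a skipped call like any other failure, so the normal backoff and retry limits still apply.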

Implementation Pattern

The core retry logic tracks attempts per message and calculates backoff:

retry_tracker = RetryTracker(max_retries=8, max_wait_seconds=600)
current_message = None

while True:
    # Only poll for new messages if we're not retrying
    if current_message is None:
        current_message = consumer.poll(1.0)
        if current_message is None:
            continue  # nothing to consume yet

    msg = current_message
    message_key = f"{msg.topic()}:{msg.partition()}:{msg.offset()}"

    # Check if we should give up
    if not retry_tracker.should_retry(message_key):
        logger.error("Max retries exceeded, giving up")
        consumer.commit(msg)
        retry_tracker.mark_success(message_key)  # clear tracking state
        current_message = None
        continue

    # Try processing
    if process_message(msg):
        consumer.commit(msg)
        retry_tracker.mark_success(message_key)
        current_message = None
    else:
        # Calculate backoff and wait before the next attempt
        backoff = retry_tracker.calculate_backoff(message_key)
        time.sleep(backoff)
        # Don't clear current_message - we'll retry it on the next iteration
The key insight: track the current message explicitly. Only poll for new messages when you’ve successfully processed or given up on the current one.

Results

After implementing this pattern across all our Kafka consumers:

Cost Protection:

  • Before: Unlimited API calls per stuck message (~$90/week potential)
  • After: Maximum 8 API calls per message (~$0.0012)
  • Cost reduction: 99.99%

Availability:

  • Partitions no longer blocked by single failed messages
  • Maximum processing delay: 10 minutes (down from infinite)
  • Failed messages logged for manual review

Reliability:

  • Exponential backoff allows transient failures to resolve
  • Circuit breakers prevent API cost spirals
  • Services gracefully handle external API outages

Lessons Learned

  1. Manual offset management requires careful failure handling: Disabling auto-commit is good practice, but you must have a strategy for when things go wrong.

  2. Cost tracking matters: Know the financial impact of retry loops, especially with pay-per-use APIs.

  3. Test failure scenarios: We discovered this during code review, not production. Always test what happens when external dependencies fail.

  4. Exponential backoff is your friend: It’s a simple pattern that solves multiple problems - prevents tight loops, gives systems time to recover, and reduces load on failing services.

  5. Monitor partition lag: This metric would have caught the blocking issue quickly. If partition lag grows unbounded, something’s stuck (a quick in-process check is sketched after this list).
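A rough in-process lag check with the confluent-kafka client might look like the helper below; in practice a metrics stack or the standard Kafka tooling is the usual way to watch this, so treat it as illustrative:

def log_partition_lag(consumer, logger):
    """Log how far behind this consumer is on each assigned partition."""
    for tp in consumer.assignment():
        low, high = consumer.get_watermark_offsets(tp, timeout=5.0)
        position = consumer.position([tp])[0].offset
        # Before the first message is consumed, the position can be negative (invalid)
        lag = high - position if position >= 0 else high - low
        logger.info("lag for %s[%d]: %d", tp.topic, tp.partition, lag)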

When to Give Up

The hardest question: when do you stop retrying and accept failure?

Our approach (a rough classification sketch in code follows the list):

  • Transient failures (network issues, rate limits): Retry with backoff
  • Permanent failures (invalid data, missing dependencies): Log and skip after reasonable attempts
  • External service outages: Circuit breaker prevents cost spiral, log for manual follow-up
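In code, that split can be as simple as mapping exception types to a decision. The exception classes below are illustrative placeholders, not an exhaustive list:

# Illustrative split - the real mapping depends on your APIs and data model
TRANSIENT = (ConnectionError, TimeoutError)   # network blips, rate limits
PERMANENT = (ValueError, KeyError)            # malformed payloads, missing fields

def classify_failure(exc):
    if isinstance(exc, PERMANENT):
        return "skip"    # log it, commit the offset, keep the partition moving
    if isinstance(exc, TRANSIENT):
        return "retry"   # back off and try again, up to the usual limits
    return "retry"       # unknown errors default to bounded retries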

After max retries, we commit the offset and log detailed context. Operations teams can investigate and reprocess if needed, but the queue keeps moving.

Takeaway

Kafka is a powerful tool for building asynchronous, scalable systems. But with great power comes great responsibility - especially around failure handling. A simple oversight in retry logic can lead to infinite loops, blocked partitions, and runaway costs.

If you’re building Kafka consumers with manual offset management:

  1. Set retry limits (count and time)
  2. Implement exponential backoff
  3. Always have an escape hatch (commit after max retries)
  4. Add circuit breakers for expensive external calls
  5. Monitor partition lag and processing rates

Your future self (and your finance team) will thank you.
