
Kafka Broker Recovery Guide: Avoid CPU Spikes

Learn how to properly add Kafka brokers back to a cluster after extended downtime without causing CPU spikes. A complete guide with replication throttling configuration and a step-by-step recovery process for Kafka 2.3.0.

Question

How to properly add a Kafka broker back to a cluster after extended downtime without causing CPU spikes and cluster instability?

I have a Kafka cluster running version 2.3.0 with 27 brokers. Two brokers went down separately due to disk failures and have been offline for over a month. When I attempted to add one back, it caused a CPU spike across the cluster and instability because TBs of data needed to be replicated.

According to Kafka documentation, replication throttling should help control this process. I’ve set up a test cluster with 5 brokers and similar configuration to reproduce the issue. However, when I apply replication throttling configurations, the data is still being replicated at maximum resource usage without any throttling.

Here’s the command I used to enable replication throttling:

./kafka-configs.sh --bootstrap-server <bootstrap-servers> \
  --entity-type brokers --entity-name <broker-id> \
  --alter --add-config leader.replication.throttled.rate=30000000,follower.replication.throttled.rate=30000000,leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*

My server.properties configuration:

# Network Settings
num.network.threads=12

# The number of threads that the server uses for processing requests, which may include disk I/O
num.io.threads=18

# Replica Settings
num.replica.fetchers=6

How can I achieve proper replication throttling to bring brokers back online without causing CPU spikes and cluster instability in my production Kafka environment?

GLM 4.5 Air

Brief Answer

To properly add a Kafka broker back to a cluster after extended downtime without causing CPU spikes, you need to configure replication throttling correctly at both levels: the throttle rates at the broker level and the throttled-replica lists at the topic level. You should also tune the replica fetcher settings and follow a gradual recovery process. Your current configuration most likely isn’t working because leader.replication.throttled.replicas and follower.replication.throttled.replicas are topic-level configs; setting them with --entity-type brokers never takes effect, so the rate limit has no replicas to apply to.

Contents

  • Understanding Kafka Replication Throttling
  • Why Your Throttling Configuration Isn’t Working
  • Correct Throttling Configuration
  • Step-by-Step Broker Recovery Process
  • Monitoring and Dynamic Adjustments
  • Advanced Techniques for Large Recovery Operations
  • Conclusion and Best Practices

Understanding Kafka Replication Throttling

Kafka’s replication throttling mechanism limits the rate at which data is replicated between brokers, preventing network and disk overload when bringing replicas back into sync after broker downtime. In your case, with brokers offline for over a month, there’s potentially terabytes of data to replicate, making proper throttling critical for cluster stability.

The throttling mechanism works in two directions:

  • Leader replication throttling: Controls the rate at which data is sent from the leader to followers
  • Follower replication throttling: Controls the rate at which data is received by followers from leaders

Your current configuration sets both rates to 30MB/s, which should be reasonable, but if this throttling isn’t being properly applied, the system might be replicating at full speed, causing the CPU spikes you’re experiencing.
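
As a rough sanity check on the numbers: assuming about 1 TB still has to be copied to the recovering broker, a 30 MB/s throttle works out to roughly 1,000,000 MB / 30 MB/s ≈ 33,000 seconds, a little over 9 hours per terabyte, and about 28 hours per terabyte at 10 MB/s. Choosing a rate is therefore a trade-off between recovery time and the CPU, disk, and network headroom you can spare.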


Why Your Throttling Configuration Isn’t Working

Several factors could explain why your replication throttling isn’t working as expected:

  1. Replica specification at the wrong level: leader.replication.throttled.replicas and follower.replication.throttled.replicas are topic-level configurations, not broker-level ones. Adding them with --entity-type brokers is rejected or silently ignored, so no replicas are ever marked as throttled and the rate limit has nothing to apply to. This is the most likely reason replication still runs at full speed.

  2. Metadata refresh timing: After applying throttling configurations, Kafka needs time to propagate these changes across the cluster. If you immediately started the broker after applying throttling, the configuration might not have been fully distributed.

  3. Replica fetcher saturation: With num.replica.fetchers=6, the fetcher threads may be saturated by the number of partitions that need to catch up. This limits catch-up throughput and adds CPU load, but it does not by itself disable throttling.

  4. Version limitations: Kafka 2.3.0’s tooling is more restrictive than newer releases. In particular, kafka-configs.sh with --bootstrap-server only manages broker-level configs in this version; topic-level configs, including the throttled-replica lists, still have to be changed via --zookeeper, which is easy to miss.

  5. Broker and topic settings are complementary, not overriding: the rate is a broker-level config, while the throttled-replica lists are topic-level configs. Both must be in place for the throttle to engage, and neither overrides the other. You can confirm what actually took effect with the --describe commands shown after this list.
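
To see which throttle settings actually took effect, describe the dynamic configs on both entity types (same placeholder connection strings as above; on 2.3.0 the topic entity typically still goes through ZooKeeper):

bash
# Broker level: should list the two *.replication.throttled.rate entries
./kafka-configs.sh --bootstrap-server <bootstrap-servers> \
  --entity-type brokers --entity-name <broker-id> --describe

# Topic level: should list the two *.replication.throttled.replicas entries
./kafka-configs.sh --zookeeper <zookeeper-connect> \
  --entity-type topics --entity-name <topic-name> --describe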


Correct Throttling Configuration

To properly configure replication throttling in your Kafka 2.3.0 environment:

  1. Apply throttling to specific brokers:
bash
./kafka-configs.sh --bootstrap-server <bootstrap-servers> \
  --entity-type brokers --entity-name <broker-id> \
  --alter --add-config leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760

I’ve reduced the throttled rate to 10 MB/s (10,485,760 bytes/s) for more conservative throttling; adjust it based on your cluster capacity. Because the rate is enforced per broker, apply it to every broker that will send or receive throttled traffic, i.e. the current leaders of the affected partitions as well as the recovering broker.

  2. Configure which replicas to throttle:

For bringing a specific broker back online, identify the partitions where the recovering broker is a replica. A rough approach (double-check the output if some broker ids are substrings of others, and verify the awk field number against your --describe output):

bash
./kafka-topics.sh --bootstrap-server <bootstrap-servers> --describe --topic <topic-name> \
  | grep "Replicas:.*<broker-id>" \
  | awk -v b=<broker-id> '{print $4":"b}' | paste -sd, -

Then mark those replicas as throttled. Note that these are topic-level configs, and in 2.3.0 kafka-configs.sh typically still needs --zookeeper for topic entities:

bash
./kafka-configs.sh --zookeeper <zookeeper-connect> \
  --entity-type topics --entity-name <topic-name> \
  --alter --add-config leader.replication.throttled.replicas=<partition:leader-id list>,follower.replication.throttled.replicas=<partition:recovering-broker-id list>

  3. Adjust replica fetcher settings:

Consider increasing the replica fetcher count so the catch-up traffic is spread across more threads; only do this if there is CPU headroom, since more fetchers also mean more parallel work:

bash
# In server.properties
num.replica.fetchers=12

  4. Network and I/O thread optimization:

For recovery scenarios, you might need to adjust thread counts:

bash
# In server.properties
num.network.threads=16
num.io.threads=24
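
Putting these pieces together, here is a minimal sketch of the corrected two-part throttle. The connection strings, the broker ids 101 (a partition leader) and 102 (the recovering broker), and the topic name orders are placeholders for illustration only:

bash
# 1. The rates are broker-level: set them on every broker that will send or
#    receive throttled traffic (the leaders of the affected partitions plus
#    the recovering broker)
for id in 101 102; do
  ./kafka-configs.sh --bootstrap-server <bootstrap-servers> \
    --entity-type brokers --entity-name "$id" \
    --alter --add-config leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760
done

# 2. The throttled-replica lists are topic-level: mark which partition:broker
#    pairs the rate applies to (here partition 0: leader 101, follower 102).
#    On 2.3.0 topic configs typically still go through --zookeeper.
./kafka-configs.sh --zookeeper <zookeeper-connect> \
  --entity-type topics --entity-name orders \
  --alter --add-config leader.replication.throttled.replicas=0:101,follower.replication.throttled.replicas=0:102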

Step-by-Step Broker Recovery Process

Here’s a recommended process to bring your broker back online with proper throttling:

  1. Preparation phase:

    • Verify the broker’s hardware is fully functional
    • Ensure disk space is sufficient for the expected data volume
    • Back up any critical configurations
  2. Apply throttling configurations:

    • Identify all partitions where the broker is a replica
    • Apply broker-level throttling settings as described above
    • Apply replica-specific throttling to only the partitions that need recovery
  3. Gradual restart process:

    • Start the broker with zookeeper.connect pointing to your Zookeeper ensemble
    • Monitor the startup process closely for errors
    • Check that the broker registers successfully in the cluster
  4. Monitor replication progress:

    • Use the following command to list the partitions that are still catching up (see also the monitoring sketch after this list):
    bash
     ./kafka-topics.sh --bootstrap-server <bootstrap-servers> --describe --under-replicated-partitions
     
    • Look for the UnderReplicatedPartitions metric (JMX MBean kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions) in your monitoring tools
    • Watch CPU, memory, disk I/O, and network usage across the cluster
  5. Complete recovery:

    • Once all partitions are fully replicated and in sync
    • Remove the throttling configurations from the brokers (rates) and the topics (replica lists):
    bash
     ./kafka-configs.sh --bootstrap-server <bootstrap-servers> \
       --entity-type brokers --entity-name <broker-id> \
       --alter --delete-config leader.replication.throttled.rate,follower.replication.throttled.rate
     ./kafka-configs.sh --zookeeper <zookeeper-connect> \
       --entity-type topics --entity-name <topic-name> \
       --alter --delete-config leader.replication.throttled.replicas,follower.replication.throttled.replicas
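
A rough way to automate the progress check in step 4 (a minimal sketch, assuming the stock kafka-topics.sh CLI and the same placeholder bootstrap string): keep the throttle in place until this reports zero, then remove it as in step 5.

bash
# Poll the number of under-replicated partitions once a minute until recovery completes
while true; do
  urp=$(./kafka-topics.sh --bootstrap-server <bootstrap-servers> \
        --describe --under-replicated-partitions | grep -c "Topic:")
  echo "$(date) under-replicated partitions: $urp"
  [ "$urp" -eq 0 ] && break
  sleep 60
done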
    

Monitoring and Dynamic Adjustments

Effective monitoring is crucial during the broker recovery process:

  1. Key metrics to monitor:

    • CPU utilization across all brokers (aim to keep below 70-80%)
    • Disk I/O wait time
    • Network throughput
    • JVM heap usage
    • UnderReplicatedPartitions count
    • Request latency metrics
  2. Dynamic throttling adjustments:

    • Start with a conservative throttle rate (e.g., 5-10MB/s)
    • Monitor cluster stability for 15-30 minutes
    • If stable, gradually increase the rate (e.g., by 25-50%); a sketch of bumping the rate is shown after this list
    • Continue until you find the maximum sustainable rate
  3. Alerting setup:

    • Configure alerts for CPU spikes, increased latency, or high replication lag
    • Set up notifications for when UnderReplicatedPartitions count changes
    • Monitor disk space on both the recovering broker and existing brokers
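
As an illustration of the dynamic adjustment in item 2 above, a hedged sketch of raising the throttle from 10 MB/s to 15 MB/s; the broker ids 101-105 are placeholders for the brokers carrying throttled traffic:

bash
# 15 MB/s = 15728640 bytes/s; repeat for every broker involved in the recovery
for id in 101 102 103 104 105; do
  ./kafka-configs.sh --bootstrap-server <bootstrap-servers> \
    --entity-type brokers --entity-name "$id" \
    --alter --add-config leader.replication.throttled.rate=15728640,follower.replication.throttled.rate=15728640
done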

Advanced Techniques for Large Recovery Operations

For your large-scale recovery scenario with terabytes of data, consider these advanced techniques:

  1. Incremental broker restart:

    • Instead of bringing both brokers back at once, stagger the recovery process:
    • Bring back one broker, wait for full recovery
    • Then bring back the next broker
    • This provides more control and reduces the total load on the cluster
  2. Topic-level throttled-replica lists:

    • There is no per-topic rate in Kafka; the rate limits are always broker-level. You can still prioritize topics by listing only the lower-priority topics’ replicas in the topic-level throttled-replica configs, so higher-priority topics replicate unthrottled:
    bash
    ./kafka-configs.sh --zookeeper <zookeeper-connect> \
      --entity-type topics --entity-name <topic-name> \
      --alter --add-config leader.replication.throttled.replicas=<partition:leader-id list>,follower.replication.throttled.replicas=<partition:follower-id list>
    
  3. Temporary partition reassignment:

    • For extremely large partitions, consider temporarily reassigning them to reduce the load on the recovering broker; kafka-reassign-partitions.sh can apply its own throttle while it moves data (see the sketch after this list).
  4. Hardware resource provisioning:

    • Temporarily allocate additional resources to the recovery process:
    • Increase I/O threads and network threads
    • Consider using SSDs temporarily for faster recovery
    • Ensure sufficient disk space (at least 2x the expected data volume)
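
For item 3, the reassignment tool can apply its own throttle while it moves data; a minimal sketch, assuming a hypothetical reassign.json file and placeholder connection strings (the tool is still ZooKeeper-based in 2.3.0):

bash
# reassign.json (hypothetical): {"version":1,"partitions":[{"topic":"orders","partition":0,"replicas":[101,103]}]}
./kafka-reassign-partitions.sh --zookeeper <zookeeper-connect> \
  --reassignment-json-file reassign.json --execute --throttle 10485760

# --verify reports progress and clears the throttle once the reassignment completes
./kafka-reassign-partitions.sh --zookeeper <zookeeper-connect> \
  --reassignment-json-file reassign.json --verify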

Conclusion and Best Practices

Bringing Kafka brokers back online after extended downtime requires careful planning and configuration to avoid cluster instability. Here are the key takeaways:

  1. Throttling is essential but needs proper configuration. Start with conservative rates (5-10MB/s) and gradually increase based on cluster stability.

  2. Be specific about what to throttle: set the topic-level throttled-replica lists for just the partitions being recovered rather than wildcarding everything, to avoid unnecessary restrictions on healthy replication flows.

  3. Monitor continuously during the recovery process, paying close attention to CPU, disk I/O, and network usage.

  4. Gradual approach is better - consider staggered recovery of multiple brokers rather than bringing them all back simultaneously.

  5. Version considerations - Kafka 2.3.0 has some limitations compared to newer versions (for example, later releases let kafka-configs.sh manage topic-level configs over --bootstrap-server instead of --zookeeper). If possible, consider upgrading for improved tooling around throttling.

  6. Test your process - Always test your recovery process in a non-production environment before applying to production.

By implementing these strategies, you should be able to bring your Kafka brokers back online without causing CPU spikes and cluster instability. Remember that each environment is unique, so you may need to adjust the throttling rates and recovery approach based on your specific cluster configuration and hardware capabilities.