Fix HTTP/2 Stalls in Chrome: Next.js, AWS ALB, Node.js
Resolve HTTP/2 requests stalling in Chrome from a Next.js frontend to a Node.js backend behind an AWS ALB: causes such as protocol translation, keep-alive mismatches, and PM2 clustering, plus diagnostics, ALB tweaks, and Node config fixes.
Why do HTTP/2 requests from a Next.js frontend to a Node.js backend behind an AWS Application Load Balancer (ALB) randomly stall/queue in Chrome but not in Firefox or in staging, and how can I fix or mitigate this?
Environment:
- Production: Next.js frontend + Node.js backend on EC2 behind ALB + ASG, HTTP/2 enabled, TLS terminated at ALB, ALB idle timeout 300s
- Staging: separate EC2 instances, no ALB, no HTTP/2
- Node backend runs pm2 in cluster mode
- Same AMI on staging and production
Observed behavior:
- Chrome DevTools shows requests “stalled/queued” for up to 60+ seconds before completing
- Firefox requests complete immediately
- Staging requests never stall
- Pointing staging frontend to production backend reproduces the issue
- Disabling HTTP/2 on the ALB reduces frequency but does not fully resolve the problem
- Removing middleware/logging had no effect
Relevant client code (Next.js + axios):
export const backendApiRequest = async <T = any>({
  method = "GET",
  url,
  params,
  data,
  headers = {},
  token,
  isFormData = false,
  responseType
}: ApiRequestOptions): Promise<T> => {
  if (!token) throw new Error("Missing access token");

  const finalHeaders: Record<string, string> = {
    Authorization: `Bearer ${token}`,
    ...headers,
  };

  // Only set Content-Type for JSON if not using FormData
  if (!isFormData && !(data instanceof FormData)) {
    finalHeaders["Content-Type"] = "application/json";
  }

  const response = await axios({
    method,
    url: `${API_BASE_URL}${url}`,
    params,
    data,
    headers: finalHeaders,
    responseType: responseType || "json",
    // httpsAgent: keepAliveAgent,
  });

  if (response.data instanceof Blob && response.data.type.includes("application/json")) {
    const text = await response.data.text(); // read blob as text
    const json = JSON.parse(text);
    return json;
  }

  return response.data;
};
Questions I’m looking to answer:
- Why do these requests stall only in Chrome when routed through the ALB with HTTP/2, but not in Firefox or staging?
- What are the most likely causes (e.g., ALB HTTP/2 → HTTP/1.1 translation, connection reuse/pooling, max concurrent streams per connection, Chrome’s HTTP/2 behavior, pm2 cluster/socket handling, keep-alive or TLS termination issues)?
- What are recommended fixes or mitigations (ALB settings, disabling HTTP/2, adjusting keep-alive, changing backend/client connection handling, switching to NLB, target-group tweaks, or Node/pm2 config)?
- Which diagnostic steps, logs, or metrics (e.g., ALB access logs, target health, Chrome net-export, tcpdump, backend socket stats) should I collect to pinpoint the root cause?
HTTP/2 requests from your Next.js frontend to a Node.js backend behind an AWS ALB often stall or queue in Chrome because the browser multiplexes aggressively over a single connection while the ALB translates client-side HTTP/2 to HTTP/1.1 on the backend side, hitting concurrent-stream limits and keep-alive mismatches. Firefox queues more conservatively, and staging skips the ALB and HTTP/2 entirely. Key fixes include raising Node.js's keep-alive timeout beyond the ALB's 300s idle timeout, tweaking the target group, or temporarily disabling HTTP/2 on the listener, though full resolution usually starts with diagnostics such as a Chrome net-export and the ALB access logs.
Contents
- Understanding HTTP/2 Stalls with Next.js and AWS ALB
- Why Chrome Specifically, and Not Firefox or Staging?
- Root Causes: ALB HTTP/2 Translation, Node.js Backend, and PM2
- Diagnostic Steps to Confirm the Issue
- Fix 1: ALB Configuration Changes
- Fix 2: Node.js Backend and PM2 Tweaks
- Fix 3: Client-Side Next.js Adjustments and Alternatives
- Long-Term Mitigations: NLB or Architecture Shifts
- Sources
- Conclusion
Understanding HTTP/2 Stalls with Next.js and AWS ALB
Picture this: your Next.js app fires off API calls via axios to a Node.js backend. In production, those zip through the ALB with HTTP/2 enabled on the listener. Chrome DevTools lights up with "stalled" or "queued" timings, sometimes 60 seconds or more, before requests finally complete. Frustrating, right? They do work eventually, just badly delayed.
This isn't random. HTTP/2 multiplexes multiple requests over a single TCP connection, sidestepping HTTP/1.1's per-connection queuing. Chrome leans on this for speed but queues aggressively once it hits the per-connection stream limit the server advertises (commonly 100; an ALB allows up to 128 parallel requests per connection). Your ALB terminates TLS and HTTP/2 from the client, then proxies to targets over HTTP/1.1. That translation creates bottlenecks: backend connections exhaust quickly under bursty Next.js traffic, especially with PM2 clustering juggling sockets.
Disabling HTTP/2 on the ALB cuts stalls but doesn't eliminate them, which proves HTTP/2 exacerbates an underlying keep-alive or concurrency issue rather than causing it outright. Staging? No ALB means direct HTTP/1.1: fewer hops, no multiplexing overload.
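You can verify the translation from the backend itself. Here's a minimal sketch, assuming an Express app on the EC2 targets (the route path and port are placeholders):

// protocol-check.js - log which HTTP version the ALB speaks on the backend leg
const express = require('express');
const app = express();

app.get('/protocol-check', (req, res) => {
  // Behind an ALB with a default HTTP/1.1 target group this logs "1.1",
  // even when the browser negotiated HTTP/2 with the ALB itself.
  console.log(`backend saw HTTP/${req.httpVersion} from ${req.socket.remoteAddress}`);
  res.json({ httpVersion: req.httpVersion });
});

app.listen(3000);

If this logs 1.1 while Chrome's Protocol column shows h2, every stall you see is happening across that translation boundary.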
Why Chrome Specifically, and Not Firefox or Staging?
Firefox doesn't stall because its HTTP/2 session handling is more conservative: it opens fewer streams per connection and yields control back sooner under contention, per bug reports like the Node.js GitHub issue on HTTP/2 XHR stalls in Chrome (see Sources). Chrome, on the other hand, hoards streams and stalls on "HTTP/2 framing layer" hiccups or inadequate-transport-security errors, both of which show up behind proxies like the ALB.
Staging skips the drama: separate EC2s, no ALB, no HTTP/2. Pointing the staging frontend at the production backend reproduces it? That's your smoking gun: ALB + HTTP/2 is the trigger. The same AMI rules out code differences. PM2 cluster mode may also distribute sockets unevenly across cores, making things worse under the ALB's round-robin.
Ever seen ERR_SPDY_PROTOCOL_ERROR in Chrome? Stack Overflow threads mirror this setup: Next.js-like apps behind an ALB that needed HTTP/2 disabled for stability.
Root Causes: ALB HTTP/2 Translation, Node.js Backend, and PM2
Let’s break it down. Most likely culprits:
- ALB Proxy Mismatch: Client HTTP/2 → ALB → backend HTTP/1.1. The ALB fans streams out into individual backend requests faster than the targets can absorb. Server Fault confirms: ALB speaks HTTP/1.1 to targets by default. Your 300s idle timeout sits against Node's default 5s keep-alive, a classic mismatch (demonstrated right after this list).
- Chrome HTTP/2 Queuing: Chrome multiplexes up to the advertised stream limit (128 per connection on an ALB), but stalls appear when the backend side signals GOAWAY or tight SETTINGS limits. Chrome's net log reports "stalled" as the wait before a request can even be dispatched.
- Node.js + PM2: Cluster workers don't share sockets. Bursty Next.js axios calls (no keep-alive agent) spawn new TCP connections, burning ephemeral ports and triggering 502s when Node closes sockets the ALB still considers live. PM2 makes this worse if workers' keep-alive settings drift apart.
- Other suspects: cross-zone load balancing disabled (occasional stalls routing across AZs), TLS session-resumption quirks, or the axios blob handling delaying stream consumption.
Not middleware; you already ruled that out.
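The timeout half of that first bullet is easy to see in plain Node. A quick demonstration, using nothing but the built-in http module and Node's documented defaults:

// keepalive-defaults.js - Node's defaults vs. a 300s ALB idle timeout
const http = require('http');
const server = http.createServer((req, res) => res.end('ok'));

console.log(server.keepAliveTimeout); // 5000 (ms): Node drops idle sockets after 5s
console.log(server.headersTimeout);   // 60000 (ms)

// The ALB keeps its side open for 300s and will happily reuse a socket Node
// already closed, which surfaces as 502s and stalled retries in the browser.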
Diagnostic Steps to Confirm the Issue
Don't guess; measure. Start here:
- Chrome DevTools: Export a net-log via chrome://net-export/. Look at the HTTP2_SESSION events: excessive GOAWAY frames? Stream IDs resetting? The Stack Overflow explanation of "stalled" (see Sources) walks through the timing phases.
- ALB Access Logs: Enable access logs to S3 and grep for 5xx responses, especially entries where elb_status_code and target_status_code disagree.
- CloudWatch Metrics: TargetResponseTime, ActiveConnectionCount, NewConnectionCount, RejectedConnectionCount, TargetConnectionErrorCount, HTTPCode_ELB_5XX_Count. Is HealthyHostCount dipping? Do NewConnectionCount spikes line up with the stalls?
- Backend: netstat -an | grep :YOUR_PORT | wc -l for socket exhaustion. Node: --inspect plus process.memoryUsage(). PM2: pm2 monit for per-worker CPU and memory (a ready-made endpoint sketch follows below).
- tcpdump/Wireshark: On the EC2 target: tcpdump -i any port 80 or port 443 -w capture.pcap. Note that with a default HTTP/1.1 target group this leg carries plain HTTP/1.1; HTTP/2 frames (PRIORITY, SETTINGS) are only visible on the client side of the ALB.
- Test Matrix: curl --http2 from the staging frontend, ab -n 1000 -c 100 for concurrency, and point the production frontend at the staging backend.
Prioritize the net-export and the ALB logs; between them they pinpoint most cases.
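For ongoing visibility into socket exhaustion you don't even need netstat. A sketch of a hypothetical /debug/sockets endpoint (the route name is made up for illustration; put it behind auth):

// socket-stats.js - per-worker socket counts via Node's own accounting
const express = require('express');
const http = require('http');

const app = express();
const server = http.createServer(app);

app.get('/debug/sockets', (req, res) => {
  // getConnections() counts sockets held by THIS worker only; under pm2
  // cluster mode, sample every worker and sum the results.
  server.getConnections((err, count) => {
    if (err) return res.status(500).json({ error: err.message });
    res.json({ openSockets: count, pid: process.pid, rssBytes: process.memoryUsage().rss });
  });
});

server.listen(3000);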
Fix 1: ALB Configuration Changes
Quick wins without code changes:
- Idle Timeout: Don't shrink it to match Node; extend Node past the ALB instead (see Fix 2). AWS's troubleshooting docs warn that targets which close connections before the ALB does are a classic source of 502s.
- Target Group Protocol: Stick with HTTP/1.1 (the default). HTTP/2 target groups? An AWS re:Post thread reports no speed gain, and stalls persist.
- Health Checks: Tighten the interval (e.g., 5s) and point them at a lightweight /health path so unhealthy targets drain quickly.
- Disable HTTP/2: It's a load balancer attribute (routing.http2.enabled), edited under the ALB's attributes in the console. Community reports suggest this sharply reduces stalls, at the cost of multiplexing (scriptable too; see the sketch below the list).
Enable cross-zone LB if multi-AZ.
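If your infrastructure is scripted, the idle timeout and the HTTP/2 toggle are both plain load balancer attributes. A sketch using the AWS SDK for JavaScript v3 (the ARN is a placeholder; verify the attribute keys against the current ELBv2 documentation):

// alb-attributes.js - tune idle timeout and toggle HTTP/2 in one call
const {
  ElasticLoadBalancingV2Client,
  ModifyLoadBalancerAttributesCommand,
} = require('@aws-sdk/client-elastic-load-balancing-v2');

const client = new ElasticLoadBalancingV2Client({ region: 'us-east-1' });

client.send(new ModifyLoadBalancerAttributesCommand({
  LoadBalancerArn: 'arn:aws:elasticloadbalancing:...', // placeholder ARN
  Attributes: [
    { Key: 'idle_timeout.timeout_seconds', Value: '300' }, // keep BELOW Node's keepAliveTimeout
    { Key: 'routing.http2.enabled', Value: 'false' },      // the temporary HTTP/2 kill switch
  ],
}))
  .then(() => console.log('ALB attributes updated'))
  .catch(console.error);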
Fix 2: Node.js Backend and PM2 Tweaks
Backend’s your leverage point:
// server.js - keep idle sockets open longer than the ALB's 300s idle timeout
const express = require('express');
const http = require('http');

const app = express();
const server = http.createServer(app);

server.keepAliveTimeout = 310000; // ms, must exceed the ALB idle timeout (300s)
server.headersTimeout = 320000;   // must exceed keepAliveTimeout, or Node may kill reused sockets
server.requestTimeout = 320000;   // cap on total request time; keep it >= headersTimeout

server.listen(process.env.PORT || 3000);
PM2: pm2 start ecosystem.config.js --no-daemon with:
module.exports = {
  apps: [{
    name: 'backend',
    script: './server.js',
    exec_mode: 'cluster',
    instances: 'max',
    node_args: '--max-old-space-size=4096',
    // Note: keepAliveTimeout is not a NODE_OPTIONS flag; it must be set on
    // the http.Server instance in server.js, as shown above.
  }]
};
Reload the cluster with pm2 reload ecosystem.config.js. This is the classic Stack Overflow fix for ELB 502s, and it applies to the ALB too.
Scale ASG min=2, deregistration delay 300s.
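Once reloaded, verify the new timeout actually took effect. A small sketch that times how long the server holds an idle keep-alive socket (host, port, and path are placeholders):

// verify-keepalive.js - measure the server's idle keep-alive window
const http = require('http');

const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });
const started = Date.now();

http.get({ host: 'localhost', port: 3000, path: '/health', agent }, (res) => {
  const socket = res.socket;
  res.resume(); // drain the body so the socket returns to the agent's pool

  // The agent unrefs pooled sockets, so keep the process alive ourselves.
  const timer = setInterval(() => {}, 1000);

  socket.on('close', () => {
    clearInterval(timer);
    console.log(`server closed the idle socket after ${((Date.now() - started) / 1000).toFixed(1)}s`);
  });
});

With Node defaults this reports roughly 5s; after the Fix 2 change it should report about 310s, comfortably past the ALB's 300s idle timeout.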
Fix 3: Client-Side Next.js Adjustments and Alternatives
Tweak axios for resilience:
// Add a keep-alive agent. httpsAgent only applies to server-side calls
// (SSR, API routes); in the browser, Chrome manages its own connections.
import https from 'https';
const keepAliveAgent = new https.Agent({ keepAlive: true }); // socket lifetime is governed server-side
const response = await axios({
  // ...
  httpsAgent: keepAliveAgent, // uncomment this!
  maxRedirects: 0,
  timeout: 30000 // fail fast instead of stalling forever
});
Using Next.js API routes? Proxy through /api/* so the browser never hits the backend directly. Or deploy to Vercel for its built-in edge network.
Firefox works fine? Compare protocols yourself: test with curl --http1.1 versus curl --http2 against the production URL.
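On top of the 30s timeout, it can help to retry requests that die while stalled. A minimal sketch of a hypothetical wrapper around the backendApiRequest shown earlier (the import path and retry policy are illustrative, not part of the original code):

// retryRequest.js - retry once on timeouts and connection-level failures
import axios from 'axios';
import { backendApiRequest } from './backendApiRequest'; // hypothetical path

export async function backendApiRequestWithRetry(options, retries = 1) {
  try {
    return await backendApiRequest(options);
  } catch (err) {
    // Timeouts (ECONNABORTED) and responseless network errors are the
    // signatures of a stalled connection; a retry lands on a fresh socket/stream.
    const retriable = axios.isAxiosError(err) &&
      (err.code === 'ECONNABORTED' || !err.response);
    if (retriable && retries > 0) {
      return backendApiRequestWithRetry(options, retries - 1);
    }
    throw err;
  }
}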
Long-Term Mitigations: NLB or Architecture Shifts
Tired of ALB hacks? Switch to a Network Load Balancer: TCP passthrough, no protocol translation. That keeps HTTP/2 end to end, provided your backends terminate TLS and speak HTTP/2 themselves (see the sketch at the end of this section).
Or API Gateway: Managed HTTP/2, better concurrency.
Containerize: ECS Fargate behind the ALB, which scales tasks (and their connection capacity) automatically.
Monitor with X-Ray for traces.
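Note that the NLB option moves TLS termination onto the instances, so the backend must speak HTTP/2 natively. A minimal sketch using Node's built-in http2 module (certificate paths are placeholders):

// h2-server.js - terminate TLS and HTTP/2 on the instance behind an NLB
const http2 = require('http2');
const fs = require('fs');

const server = http2.createSecureServer({
  key: fs.readFileSync('/etc/ssl/private/server.key'),  // placeholder path
  cert: fs.readFileSync('/etc/ssl/certs/server.crt'),   // placeholder path
  allowHTTP1: true, // fall back for clients that only speak HTTP/1.1
});

server.on('stream', (stream, headers) => {
  stream.respond({ ':status': 200, 'content-type': 'application/json' });
  stream.end(JSON.stringify({ ok: true }));
});

server.listen(443);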
Sources
- AWS re:Post - ALB Target Group HTTP/2 Speed
- Server Fault - ALB HTTP/2 Termination
- Stack Overflow - Disable HTTP/2 on ALB for ERR_SPDY
- Adam Crowder - Node/Express ALB 502s
- Stack Overflow - Node.js ELB 502 Keep-Alive
- Stack Overflow - Chrome Stalled Explanation
- Node.js GitHub - HTTP/2 XHR Stalls in Chrome
- AWS Docs - ALB Troubleshooting
Conclusion
HTTP/2 stalls from Next.js to a Node.js backend via an AWS ALB boil down to Chrome's multiplexing colliding with the ALB's HTTP/1.1 backend proxying and Node's short default keep-alives; Firefox and staging dodge it cleanly. Start with diagnostics (net-export, ALB logs), then extend Node's keepAliveTimeout past the ALB's idle timeout, add a keep-alive agent to axios, and treat disabling HTTP/2 as a band-aid, not a cure. For production polish, an NLB or API Gateway future-proofs the setup. Test iteratively and the stalls should disappear.