In x86 assembly from unoptimized C code, why is there a 'sub esp, 8' instruction after stdcall calls but not after cdecl calls, despite stdcall being callee-cleanup?

Compilers that use a fixed stack frame reserve space for locals and outgoing args upfront. Cdecl's plain 'ret' leaves ESP pointing at the arg slots; stdcall's 'ret 8' releases them, so the caller issues 'sub esp, 8' to reserve them again.

Why sub esp,8 After stdcall Call But Not Cdecl?

Understand why compilers generate 'sub esp, 8' after stdcall calls but not after cdecl calls in unoptimized x86 assembly. Explains stack frame management, ret 8 vs ret, and calling-convention differences for debugging.

01/26/2026, 03:30 PM

Why does the compiler generate sub esp, 8 after the call to a stdcall function, but not for cdecl?

An important difference between cdecl (caller‑cleanup) and stdcall (callee‑cleanup) calling conventions is how the stack is managed after function calls.

Consider this C program compiled without optimizations:

#include <stdio.h>

int __cdecl foo(int m, int n) {
    return m + n;
}

int __stdcall bar(int m, int n) {
    return m + n;
}

int main(void) {
    int m, n, x;
    scanf("%d%d", &m, &n);
    x = foo(m, n);
    printf("%d\n", x);
    x = bar(m, n);
    printf("%d\n", x);
    return 0;
}

Compiled functions:

foo (cdecl):

00401608 <_foo>:
 401608: 55 push ebp
 401609: 89 e5 mov ebp,esp
 40160b: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
 40160e: 8b 45 0c mov eax,DWORD PTR [ebp+0xc]
 401611: 01 d0 add eax,edx
 401613: 5d pop ebp
 401614: c3 ret

bar (stdcall):

00401615 <_bar@8>:
 401615: 55 push ebp
 401616: 89 e5 mov ebp,esp
 401618: 8b 55 08 mov edx,DWORD PTR [ebp+0x8]
 40161b: 8b 45 0c mov eax,DWORD PTR [ebp+0xc]
 40161e: 01 d0 add eax,edx
 401620: 5d pop ebp
 401621: c2 08 00 ret 0x8

Difference: ret vs ret 0x8.

In main disassembly, call to foo (cdecl):

 401654: 8b 55 ec mov edx,DWORD PTR [ebp-0x14]
 401657: 8b 45 f0 mov eax,DWORD PTR [ebp-0x10]
 40165a: 89 54 24 04 mov DWORD PTR [esp+0x4],edx
 40165e: 89 04 24 mov DWORD PTR [esp],eax
 401661: e8 a2 ff ff ff call 401608 <_foo>
 401666: 89 45 f4 mov DWORD PTR [ebp-0xc],eax

No stack adjustment.

Call to bar (stdcall):

 40167c: 8b 55 ec mov edx,DWORD PTR [ebp-0x14]
 40167f: 8b 45 f0 mov eax,DWORD PTR [ebp-0x10]
 401682: 89 54 24 04 mov DWORD PTR [esp+0x4],edx
 401686: 89 04 24 mov DWORD PTR [esp],eax
 401689: e8 87 ff ff ff call 401615 <_bar@8>
 40168e: 83 ec 08 sub esp,0x8
 401691: 89 45 f4 mov DWORD PTR [ebp-0xc],eax

Why the sub esp, 0x8 after the stdcall? In theory, stdcall callee cleans up the stack.

Note: Older compilers favored stdcall for smaller code (no per‑call cleanup in caller), but modern compilers appear to reverse this.

In unoptimized x86 code from a compiler such as GCC, cdecl and stdcall handle stack cleanup differently, but the sub esp, 8 after stdcall calls stems from a fixed stack frame strategy. The caller’s prologue reserves space for locals plus outgoing arguments upfront, and args are stored directly into those stack slots with mov instead of being pushed. A plain cdecl ret leaves ESP pointing at the arg slots (still inside the frame), so no tweak is needed. Stdcall’s ret 8 releases those slots, so sub esp, 8 pulls ESP back down and reserves them again.

cdecl vs stdcall Calling Conventions

Picture this: you’re deep in x86 assembly, debugging why your stack pointer ESP seems off after function calls. Cdecl says the caller cleans up arguments; stdcall puts that job on the callee. Simple, right? But in real disassembly from unoptimized C code, it gets quirky.

With cdecl, the caller pushes args right-to-left, calls, then adds the argument size back to ESP afterward. Stdcall flips it: the function ends with ret N, popping the return address and discarding N bytes of args in one go. No variable args for stdcall, though; fixed count only.
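To make the ret versus ret N difference concrete, here is a minimal C sketch that looks at the first bytes of a return instruction and reports how far ESP moves in 32-bit mode. The helper name ret_stack_bytes is made up for illustration; the opcodes it checks (C3 for a plain near ret, C2 followed by a little-endian 16-bit immediate for ret N) are the standard x86 encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes removed from the stack by a 32-bit near return.
 * Returns -1 if the bytes do not start with a near ret encoding. */
int ret_stack_bytes(const uint8_t *code, size_t len)
{
    assert(code != NULL);

    if (len >= 1 && code[0] == 0xC3)      /* ret: pops only the return address */
        return 4;

    if (len >= 3 && code[0] == 0xC2) {    /* ret imm16: return address + imm16 */
        int imm16 = code[1] | (code[2] << 8);   /* immediate is little-endian */
        return 4 + imm16;
    }

    return -1;
}
```

Fed the bytes from the listings, c3 (the end of _foo) gives 4 and c2 08 00 (the end of _bar@8) gives 12.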

From the x86 calling conventions page on Wikipedia, stdcall mimics Pascal but reverses param order to right-to-left pushes. Yet here’s the twist in your example: no pushes at all. The compiler skips push and writes args via mov [esp+offset], reg into slots it reserved earlier. And that fixed frame? It changes everything.

Why does this matter? Because without understanding prologue allocation, that lone sub esp, 8 looks like a glitch. Spoiler: it’s deliberate.

How Stack Frames Work in Unoptimized Code

Unoptimized code—think -O0 in GCC—prioritizes debuggability over speed. Prologue looks like:

push ebp
mov ebp, esp
sub esp, FRAME_SIZE ; locals + max args space

FRAME_SIZE covers locals and space for outgoing args across all calls. No per-call sub esp before args; instead:

mov [esp+4], edx ; 2nd arg into arg slot
mov [esp], eax ; 1st arg into arg slot
call _function

ESP before call: points to arg area. After call?

Cdecl ret: Pops the return address. ESP is now back at the first arg. Perfect: the arg space stays “allocated” until the epilogue tears down the whole frame.
Stdcall ret 8: Pops the return address, then adds 8 to ESP. ESP ends up 8 bytes above the arg slots, so the reserved argument area has been released.

Result? After a stdcall call the frame is 8 bytes smaller than the compiler planned for. Fix: the caller emits sub esp, 8 post-call, “rewinding” ESP to the arg slots. The epilogue later cleans it all up with leave or mov esp, ebp; pop ebp.
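The ESP bookkeeping for one call can be traced in a few lines of C. This is only a sketch of the arithmetic, not compiler code: the function names are invented, and the model assumes 32-bit mode with a 4-byte return address.

```c
#include <assert.h>
#include <stdint.h>

enum { RET_ADDR_BYTES = 4 };   /* size of a 32-bit return address */

/* ESP seen by the caller right after the callee returns.
 * callee_pops is the N in `ret N`: 0 for cdecl, the argument size for stdcall. */
uint32_t esp_after_return(uint32_t esp_before_call, uint32_t callee_pops)
{
    assert(esp_before_call >= RET_ADDR_BYTES);

    uint32_t esp = esp_before_call;
    esp -= RET_ADDR_BYTES;                 /* call pushes the return address */
    esp += RET_ADDR_BYTES + callee_pops;   /* ret / ret N pops it, plus N    */
    return esp;
}

/* Bytes the caller must subtract from ESP to get its reserved slots back. */
uint32_t frame_repair_bytes(uint32_t esp_before_call, uint32_t callee_pops)
{
    return esp_after_return(esp_before_call, callee_pops) - esp_before_call;
}
```

With callee_pops set to 0 (foo) the repair amount is 0; with 8 (bar) it is 8, which is exactly the operand of the sub esp, 0x8 in main.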

A Stack Overflow thread on stdcall vs cdecl makes the ordering point: ret 8 pops the return address first and only then adjusts ESP, which is not the same as add esp, 8; ret. Compilers keep the fixed-frame scheme for consistency, even if stdcall promised smaller code.

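Ever wonder why no push? Every push moves ESP, which complicates ESP-relative addressing and stack-alignment bookkeeping (EBP-relative offsets are unaffected either way). Slot-filling keeps ESP constant through the function body. In GCC this strategy corresponds to the x86 option -maccumulate-outgoing-args.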

Breaking Down the cdecl Call (foo)

Zoom into your main’s cdecl call to foo:

mov edx, [ebp-0x14] ; n
mov eax, [ebp-0x10] ; m
mov [esp+4], edx ; arg1 (n) at esp+4
mov [esp], eax ; arg0 (m) at esp
call _foo
mov [ebp-0xc], eax ; store result

Inside _foo (cdecl):

push ebp
mov ebp, esp
mov edx, [ebp+8] ; m
mov eax, [ebp+0xc] ; n
add eax, edx
pop ebp
ret ; ESP += 4 (return addr), now at m's slot

Post-ret, ESP sits on the 8-byte arg block. No add esp, 8 needed, because the prologue’s big sub esp already included that space. The next instructions use EBP offsets, oblivious. Smooth.

This matches another Stack Overflow explanation on cdecl cleanup: prologue pre-allocates arg space, so no mid-function tweaks. Caller “cleans” implicitly via epilogue.

But what if multiple calls? Frame holds enough for the biggest. Args overwrite slots as needed—no fuss.
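The “enough for the biggest” rule is just a maximum over the call sites. A rough sketch, assuming every argument takes one 4-byte slot and ignoring any alignment padding a real compiler adds; outgoing_area_bytes is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: size of the outgoing-argument area a fixed frame needs,
 * given the argument bytes of every call site in the function. */
size_t outgoing_area_bytes(const size_t *arg_bytes, size_t n_calls)
{
    assert(arg_bytes != NULL || n_calls == 0);

    size_t max_bytes = 0;
    for (size_t i = 0; i < n_calls; i++) {
        if (arg_bytes[i] > max_bytes)
            max_bytes = arg_bytes[i];   /* slots are reused, so only the max matters */
    }
    return max_bytes;
}
```

For the main in the question, scanf takes three 4-byte args (12 bytes) while foo, bar, and both printf calls take two (8 bytes each), so the area needs at least 12 bytes.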

Why sub esp, 8 Appears After stdcall (bar)

Now the puzzling bar call:

mov edx, [ebp-0x14]
mov eax, [ebp-0x10]
mov [esp+4], edx
mov [esp], eax
call _bar@8
sub esp, 0x8 ; <- Here it is!
mov [ebp-0xc], eax

_bar@8 (stdcall):

push ebp
mov ebp, esp
mov edx, [ebp+8]
mov eax, [ebp+0xc]
add eax, edx
pop ebp
ret 0x8 ; ESP += 12 (ret addr + 8 bytes args)

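ret 0x8 vaults ESP past the args. The very next mov [ebp-0xc], eax is EBP-relative, so it would still work. The damage shows up later: every following call writes its args at [esp] and [esp+4], which would now land 8 bytes higher than planned, and each further stdcall call would shift them another 8 bytes toward the locals.

Enter sub esp, 0x8: drags ESP back exactly 8 bytes, to the arg slots. The frame is whole again. Epilogue handles the rest.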

From the Stack Overflow post on EBP/ESP cleanup: cdecl repeats cleanup per call (or batches it); stdcall shifts the burden to the callee, but fixed-frame code compensates with a sub. It’s not “caller cleaning” but frame repair.

Frustrating at first glance, since stdcall should handle it. But this lets compilers use identical arg-passing for both conventions.
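What would go wrong without the sub? A small model shows ESP creeping toward the locals. Treat it as a sketch under assumptions: the listing shows the lowest local at ebp-0x14, but main’s prologue is not shown, so the 0x28-byte frame used in the usage note is a made-up value, and both function names are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESP after `calls` back-to-back stdcall calls that each end in `ret arg_bytes`.
 * compensate says whether the caller emits `sub esp, arg_bytes` after each one. */
uint32_t esp_after_stdcalls(uint32_t esp, unsigned calls,
                            uint32_t arg_bytes, bool compensate)
{
    for (unsigned i = 0; i < calls; i++) {
        esp += arg_bytes;        /* net effect of call followed by ret N */
        if (compensate)
            esp -= arg_bytes;    /* the caller's sub esp, N */
    }
    return esp;
}

/* True if writing arg_bytes of arguments at [esp] would reach the locals,
 * which start at address lowest_local and extend upward toward EBP. */
bool arg_store_hits_locals(uint32_t esp, uint32_t arg_bytes, uint32_t lowest_local)
{
    assert(esp <= lowest_local);   /* ESP must still be below the locals */
    return esp + arg_bytes > lowest_local;
}
```

Take EBP = 0x1000, so the lowest local sits at 0xFEC, and assume ESP starts at 0xFD8. With compensation ESP never moves. Without it, two stdcall calls leave ESP at 0xFE8, and the next pair of argument stores would overwrite the local at ebp-0x14.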

Compiler Strategies and Trade-offs

Older lore praised stdcall for tiny code—no repeated add esp in callers. Think Win32 API: millions of calls, callee cleans once. But unoptimized? Compilers like yours flip to caller-side slot management.

Why? Most likely a constant ESP. With outgoing-argument space reserved once, ESP stays put between calls, which keeps stack alignment and ESP-relative stores simple. Wikibooks on x86 conventions shows stdcall asm with ret 8, but does not cover frame tricks.

Trade-offs hit hard:

Convention	Upside in unoptimized code	Downside
cdecl	No post-call insns; variable args OK	Caller owns cleanup (folded into the epilogue here)
stdcall	Callee owns cleanup	Fixed args only; needs `sub esp` hack

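Modern compilers? With optimization on and real pushes, a cdecl call needs add esp, N in the caller (often merged across several calls), while a stdcall call needs nothing afterward. Stdcall shrinks callers further.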

What Changes with Optimization?

Crank optimization and, if the compiler switches to real pushes, the sub esp, 8 has no reason to exist. The prologue shrinks to just locals. Args? Real pushes:

push edx
push eax
call _bar@8 ; ret 8 removes both args; no add esp, 8 follows a stdcall call

For cdecl, optimizers can batch the add esp cleanups across calls. Or inline small functions entirely. Your note’s spot-on: stdcall regains its “smaller code” edge once args are pushed.
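Putting both argument-passing styles side by side, the caller’s post-call adjustment falls out of a two-by-two table. The sketch below encodes that table; the enum and function names are hypothetical, and it models only the ESP arithmetic, not what any particular compiler version emits.

```c
#include <assert.h>

enum convention { CONV_CDECL, CONV_STDCALL };
enum arg_style  { ARGS_PUSHED, ARGS_IN_FIXED_SLOTS };

/* Signed adjustment the caller applies to ESP after a call:
 * positive means `add esp, N`, negative means `sub esp, N`, zero means nothing. */
int caller_esp_fixup(enum convention conv, enum arg_style style, int arg_bytes)
{
    assert(arg_bytes >= 0);

    if (style == ARGS_PUSHED)
        /* Pushed args must be popped by someone: the caller for cdecl,
         * the callee's ret N for stdcall. */
        return conv == CONV_CDECL ? arg_bytes : 0;

    /* Fixed slots must stay reserved: cdecl's ret leaves them alone,
     * stdcall's ret N releases them, so the caller takes them back. */
    return conv == CONV_CDECL ? 0 : -arg_bytes;
}
```

The only negative entry is stdcall with fixed slots, which is the sub esp, 8 from the question; the only positive entry is the textbook cdecl add esp, 8.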

Test it: gcc -O0 -m32 vs -O2. Watch frames morph. But for debugging? Stick to unoptimized—those ESP dances teach volumes.

Sources

STDCALL vs CDECL: ret vs sub esp — Explains ret 8 vs add esp differences in stack adjustment: https://stackoverflow.com/questions/52727400/stdcall-vs-cdecl-ret-vs-sub-esp-have-anything-to-do-with-the-calling-conven
Unable to understand cdecl cleanup — Details prologue arg allocation avoiding per-call subs: https://stackoverflow.com/questions/49513707/unable-to-understand-example-of-cdecl-calling-convention-where-caller-doesnt-nee
STDCALL vs CDECL with EBP/ESP — Compares parameter cleanup responsibilities: https://stackoverflow.com/questions/58453998/understanding-the-concept-of-stdcall-vs-cdecl-with-ebp-and-esp-cleanup
x86 calling conventions — Defines stdcall and cdecl stack rules: https://en.wikipedia.org/wiki/X86_calling_conventions
x86 Disassembly Calling Conventions — Provides stdcall ret N examples: https://en.wikibooks.org/wiki/X86_Disassembly/Calling_Conventions

Conclusion

That sub esp, 8 after stdcall? It’s the compiler patching a callee-cleanup convention into a fixed-frame world—keeping ESP honest without per-call pushes. Cdecl flows naturally; stdcall needs the nudge. Optimize away the noise, but grasp this for debugging low-level x86. Next time ESP wanders, you’ll know why.

Authors

NeuroAnswers
