How does instruction and data prefetching work on ARM Cortex-M processors, and how does ART in Cortex-M implementations perform prefetching? Specifically:
- Which hardware blocks perform prefetching (fetch/prefetch buffers, pipeline stages, caches)?
- When and how are instructions and data prefetched (trigger conditions, alignment, burst sizes, branch handling, exceptions)?
- How does ART affect fetch behavior, latency, throughput, and real-time determinism?
- Which sections of the ARM or vendor reference manuals and timing diagrams should I study for a deeper technical understanding?
Instruction prefetching on ARM Cortex-M processors relies on the Prefetch Unit (PFU), a dedicated hardware block that buffers up to six 16-bit Thumb instructions (or three 32-bit ones) ahead of the execution pipeline to mask flash memory latency. Data prefetching is more limited: it appears mainly in Cortex-M7 via its L1 instruction and data caches with automatic linefills, while earlier cores like M3/M4 handle it through basic load/store buffering. STM32's ART accelerator builds on this for Cortex-M cores by adding a prefetch queue, branch cache, and flash-specific optimizations, cutting effective fetch latency from several wait states per access to near zero in sequential code.
Contents
- Overview of Prefetching on Cortex-M
- Hardware Blocks Involved
- How Instructions Are Prefetched
- Data Prefetching Mechanics
- ART Accelerator and Its Prefetch Role
- Performance Impacts: Latency, Throughput, and Determinism
- Practical Measurements and Tips
- Key Reference Manuals and Diagrams
- Sources
- Conclusion
Overview of Prefetching on Cortex-M
Ever wonder why your Cortex-M code runs faster on flash than you’d expect, even with wait states? It’s prefetching at work. ARM Cortex-M processors, from the nimble M0 to the beefy M7, use prefetch mechanisms to fetch instructions (and sometimes data) ahead of time, hiding the painful latency of slow NOR flash, often several wait states (5-8+ cycles) per access at higher clocks.
Prefetching isn’t some magic; it’s hardware predicting you’ll execute the next few instructions sequentially. The Cortex-M3 Technical Reference Manual spells it out: a Prefetch Unit (PFU) grabs multiple Thumb instructions into a FIFO buffer while the pipeline chugs along. Newer cores build on this. And for STM32 users? Enter ART, ST’s accelerator that turns flash into something resembling fast RAM.
But it’s not all smooth. Branches, exceptions, and misaligned code can flush buffers, spiking latency. Understanding this helps tune real-time systems.
Hardware Blocks Involved
Prefetching spreads across a few key blocks in the Cortex-M pipeline. No single “prefetcher” does it all—it’s a team effort.
Start with the Prefetch Unit (PFU), present in M3, M4, M7, and beyond. It sits before the decode stage, pulling instructions over the bus interface (the ICode bus for flash on M3/M4, the System bus when executing from RAM) into a buffer. The Cortex-M3 TRM describes it as holding a 3-word FIFO: up to six 16-bit Thumb instructions, with more potentially in flight on the bus.
Then there’s the fetch pipeline stages: Fetch → PFU buffer → Decode → Execute. The pipeline overlaps these, so while you’re executing instruction N, the PFU fetches N+3 or so.
Caches enter the chat on M7: a 4-way set-associative Instruction Cache (I-Cache) and Data Cache (D-Cache), with implementation-configurable sizes (4-16KB is typical on STM32F7/H7 parts). These fill entire cache lines (32 bytes, i.e. 8-16 instructions) on misses.
Vendor extras like STM32 ART wrap around this, adding prefetch queues and branch caches tied to the flash controller.
Data prefetch? Slimmer pickings. M0-M4 rely on load/store unit buffers (1-2 pending loads). M7’s D-Cache auto-prefetches lines on misses, but no aggressive strided prefetch like desktop CPUs.
How Instructions Are Prefetched
Triggers are straightforward: sequential execution. As the pipeline drains the PFU buffer, it issues a bus read for the next aligned chunk, usually a 32-bit word holding one Thumb-2 or two Thumb instructions.
Burst sizes and alignment: Prefetches align to 32-bit or instruction boundaries. From community insights on Stack Overflow, the PFU grabs in bursts matching the AHB bus, often 1-2 words ahead. Max depth: six instructions buffered, per the ARM DAI0321A note.
Branch handling: Taken branches flush the PFU, costing a few cycles of pipeline refill, more once flash wait states stack on top. M7 adds basic branch prediction (a small branch target buffer) feeding the PFU. Sequential or predicted-not-taken? The buffer refills seamlessly.
Exceptions: Interrupts and faults invalidate the PFU. The vector fetch and the first handler instructions go straight to the bus, so you’ll stall if flash is slow.
In practice, flash at half core speed (common on M3/M4) means prefetch keeps the pipeline fed every other cycle. Misalign? Extra bus cycles.
Data Prefetching Mechanics
Data prefetching lags behind instruction prefetching; Cortex-M isn’t built for data-heavy workloads the way superscalar application cores are.
On M0-M4: No true prefetch. The Load/Store Unit (LSU) pipelines 1-2 loads/stores, overlapping with execute. Sequential accesses might burst on AHB, but no lookahead.
Cortex-M7 flips the script with its Harvard-style D-Cache. Misses trigger a linefill: a 32-byte burst read from memory. The Feabhas blog on M7 cache notes the PFU focuses on instructions, while the D-Cache fills lines automatically on misses; there is no hardware stride prefetcher, so software must arrange sequential access patterns or issue PLD preload hints.
Alignment matters: Unaligned loads cost extra cycles; bursts prefer 32-byte aligned.
Exception stacking hits the D-Cache like any other access: cached stack data returns fast, a miss stalls on memory.
Short version: Data prefetch is opportunistic caching, not aggressive like instructions.
ART Accelerator and Its Prefetch Role
STM32’s ART (Adaptive Real-Time) accelerator is a game-changer for flash performance on Cortex-M3/M4/M7-based STM32 parts (F2/F4/F7). It’s not ARM core IP—it’s ST’s flash controller magic.
How it prefetches: ART sits between flash and the core bus, implementing an instruction prefetch queue (depth ~8-16 instructions) and branch cache (stores recent targets). Sequential fetches? Queue ahead. Branches? Predict and cache paths.
Enable it via FLASH_ACR: bit 8 (PRFTEN) for prefetch, bit 9 (ICEN) for the instruction cache, bit 10 (DCEN) for the data cache—after programming the LATENCY field to match your clock. Code snippet (via a Reddit embedded thread):
FLASH->ACR |= FLASH_ACR_PRFTEN | FLASH_ACR_ICEN | FLASH_ACR_DCEN;
Per DeepBlueEmbedded, this boosts DMIPS from flash-limited to near-RAM speeds at 80MHz+.
Data? ART enables D-Cache but no special data prefetch—relies on M7 core.
Performance Impacts: Latency, Throughput, and Determinism
Latency: Without prefetch, flash wait states (1-7+) kill throughput. PFU/ART drops effective fetch to 1 cycle/instruction sequentially. Branches? 10-20 cycle hit if flushed.
Throughput: M3 with PFU sustains ~1 IPC (instructions per cycle) from slow flash. M7 + ART? Up to 1.25 IPC bursts thanks to dual issue. Stack Overflow measurements suggest ART roughly doubles effective fetch throughput from flash, clock for clock.
Real-time determinism: Here’s the rub. Prefetch hides average latency but introduces jitter—buffer flushes on interrupts vary timing. In hard RT, disable ART for worst-case predictability (use TCM or zero-wait flash). Benchmarks show 20-50% jitter reduction with caches off.
Tradeoff: Enable for throughput-critical, disable for tight loops with interrupts.
Practical Measurements and Tips
Measure with cycle counters (DWT_CYCCNT on M3+). Time a loop with/without PRFTEN.
Tips:
- Always enable ART on STM32F4/F7 unless RT constraints.
- Align code to 32-bit for bursts.
- Use ITCM for hot loops—bypasses prefetch entirely.
- Barriers: an ISB flushes the pipeline and PFU (DSB only orders memory); budget for them in RT paths.
- Tools: STM32Cube Perf counters or oscilloscope on GPIO toggles.
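A minimal DWT setup sketch for the measurement above (register addresses are architectural, from the ARMv7-M ARM; CYCCNT is absent on ARMv6-M parts like M0):

```c
#include <stdint.h>

/* ARMv7-M debug registers at their architectural addresses. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

/* Unsigned subtraction handles a counter wrap between the two samples. */
uint32_t elapsed(uint32_t start, uint32_t end) { return end - start; }

void cyccnt_init(void) {
    DEMCR    |= (1u << 24);  /* TRCENA: power up the DWT unit */
    DWT_CYCCNT = 0u;
    DWT_CTRL |= 1u;          /* CYCCNTENA: start counting core cycles */
}

/* Time a region of code; run once with PRFTEN set and once cleared. */
uint32_t measure(void (*region)(void)) {
    uint32_t start = DWT_CYCCNT;
    region();
    return elapsed(start, DWT_CYCCNT);
}
```

Keep the measured region identical between runs; only the FLASH_ACR bits should change.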
From experience, ART shines in GUI loops but bites interrupt-heavy motor control.
Key Reference Manuals and Diagrams
Dive deeper with these:
ARM TRMs:
- Cortex-M3: Prefetch Unit (DDI0337) — PFU FIFO diagrams, pipeline timing.
- Cortex-M7: PFU & Caches (DDI0489) — Linefill bursts, BTB details.
STM32:
- Reference Manual (e.g., RM0090 for F4): FLASH_ACR section, ART description, flash timing tables (wait states vs. clock).
- Datasheet timing diagrams: AHB burst reads, flash access cycles.
Others: ARMv7-M Architecture Ref Manual (bus interfaces); app notes like AN4839 (STM32 cache).
Study pipeline diagrams—see how PFU overlaps fetch/decode.
Sources
- Prefetch Unit - Cortex-M3 Technical Reference Manual
- Prefetch Unit - Cortex-M7
- ARM DAI 0321A
- How instructions are fetched in Cortex-M processors - Stack Overflow
- Introduction to the ARM Cortex-M7 Cache – Part 3
- ARM ITCM interface and Flash access - Stack Overflow
- How to enable the ART Accelerator on STM32? - Reddit
- Getting Started With STM32 ARM Cortex MCUs
Conclusion
Cortex-M prefetching, via PFU and caches, keeps pipelines humming despite flash woes, while ART turbocharges STM32 setups for real-world speedups. Weigh throughput gains against RT jitter—test your code. Grab those TRMs; the diagrams reveal timing secrets no blog can match.