Fix Verilog FIR Filter Testbench Missing y[0]=0x0000

01/16/2026, 03:38 PM

Verilog FIR filter testbench fails to log y[0] = 0x0000 and skips to y[1] = 0xFFFE; different implementations produce inconsistent outputs

I generated low-pass FIR filter coefficients in MATLAB using:

matlab

h = fir1(99, 0.2, 'low');

Input signals are sine waves $f(t) = \sin(2 \pi f t)$ with $f = 950, 1100, 2000$ Hz, 2000 samples at 10 kHz sampling rate. Converted samples and coefficients to fixed-point Q(2,14) format. In MATLAB, applying inputs (starting with x[0] = 0) yields y[0] = 0x0000.

I implemented the FIR filter in Verilog in three ways, but I am facing two issues:

None of them produces y[0] = 0x0000. By my reasoning, y[0] should appear at the 7th positive clock edge, but the first output is yout = y[1] = 0xFFFE, and y[0] is never produced.
The outputs of the three Verilog implementations differ from each other.

Below is the direct-form implementation and testbench (other implementations omitted). Input files (x*.txt) start with 0; coefficients in coef_q214_hex.txt from MATLAB.

Verilog Module (direct_fir)

verilog

module direct_fir #(parameter integer N = 100)
(
 input wire clk,
 input wire rst, // synchronous active-high reset
 input wire vin,
 input wire signed [15:0] xin, // Q_2_14
 output reg vout,
 output reg signed [15:0] yout // Q_2_14
);

integer i;

 reg signed [15:0] xdel [0:N-1];
 reg signed [15:0] h [0:N-1];

 reg signed [63:0] acc;
 reg signed [31:0] prod;
 reg vin_d;

 reg signed [63:0] acc_abs;
 reg signed [63:0] acc_scaled;

 localparam signed [63:0] HALF_LSB = 64'sd8192; // 2^13
 localparam integer SHIFT = 14;

 initial begin
 $readmemh("coef_q214_hex.txt", h);
 end

 function automatic signed [15:0] sat16(input signed [63:0] x);
 begin
 if (x > 64'sd32767) sat16 = 16'sd32767;
 else if (x < -64'sd32768) sat16 = -16'sd32768;
 else sat16 = x[15:0];
 end
 endfunction

 always @(posedge clk) begin
 if (rst) begin
 for (i=0; i<N; i=i+1) xdel[i] <= 16'sd0;
 yout <= 16'sd0;
 vout <= 1'b0;
 vin_d <= 1'b0;
 end else begin
 if (vin) begin
 for (i=N-1; i>0; i=i-1)
 xdel[i] <= xdel[i-1];
 xdel[0] <= xin;
 end

 vin_d <= vin;
 vout <= 1'b0;

 if (vin_d) begin
 acc = 64'sd0;
 acc_abs = 64'sd0;
 acc_scaled = 64'sd0;
 for (i=0; i<N; i=i+1) begin
 prod = xdel[i] * h[i]; // Q_4_28
 acc = acc + {{32{prod[31]}}, prod}; // sign extend to 64
 end

 if (acc >= 0) begin
 acc_scaled = (acc + HALF_LSB) >>> SHIFT;
 end else begin
 acc_abs = -acc;
 acc_scaled = -((acc_abs + HALF_LSB) >>> SHIFT);
 end

 yout <= sat16(acc_scaled);
 vout <= 1'b1;
 end
 end
 end
endmodule

Testbench

verilog

`timescale 1ns/1ps
module tb_direct_fir;
 parameter integer N_TAPS = 100;
 parameter integer NSAMPLES = 2000;
 parameter integer TCLK = 10;

 reg clk;
 reg rst;
 reg vin;
 reg signed [15:0] xin;

 wire vout;
 wire signed [15:0] yout;

 reg [15:0] xmem [0:NSAMPLES-1];

 integer n;
 integer fout;

 direct_fir #(.N(N_TAPS)) dut (
 .clk (clk),
 .rst (rst),
 .vin (vin),
 .xin (xin),
 .vout(vout),
 .yout(yout)
 );

 initial begin
 clk = 1'b0;
 forever #(TCLK/2) clk = ~clk;
 end

 task apply_reset;
 begin
 rst = 1'b1;
 vin = 1'b0;
 xin = 16'sd0;

 @(posedge clk);
 @(posedge clk);
 @(posedge clk);

 rst = 1'b0;
 @(posedge clk);
 end
 endtask

 always @(posedge clk) begin
 if (!rst && vout) begin
 $fwrite(fout, "%04h\n", yout[15:0]);
 end
 end

 task run_file;
 input [8*64-1:0] in_filename;
 input [8*64-1:0] out_filename;
 begin
 $readmemh(in_filename, xmem);

 fout = $fopen(out_filename, "w");
 if (fout == 0) begin
 $display("ERROR: could not open output file %s", out_filename);
 $finish;
 end

 apply_reset;

 vin = 1'b1;
 for (n = 0; n < NSAMPLES; n = n + 1) begin
 xin = $signed(xmem[n]);
 @(posedge clk);
 end

 xin = 16'sd0;
 vin = 1'b0;
 @(posedge clk);

 for (n = 0; n < 10; n = n + 1) begin
 @(posedge clk);
 end

 $fclose(fout);
 end
 endtask

 initial begin
 rst = 1'b1;
 vin = 1'b0;
 xin = 16'sd0;

 run_file("x950_q214_hex.txt", "y950_direct_hex.txt");
 run_file("x1100_q214_hex.txt", "y1100_direct_hex.txt");
 run_file("x2000_q214_hex.txt", "y2000_direct_hex.txt");

 $display("DONE: outputs written.");
 $finish;
 end
endmodule

What could cause the missing y[0] = 0x0000 and first output being 0xFFFE? Is there a timing issue in input shifting, accumulation, fixed-point scaling/rounding, or saturation? Why might implementations differ?

Most likely you're seeing two separate problems. The first is a timing misalignment between the testbench and the DUT: the testbench changes rst, vin and xin with blocking assignments in the same time step in which the DUT samples them, so simulator scheduling decides on which edge each value is seen. Together with the vin → vin_d → vout latency, this can shift the logged stream by one sample or drop x[0] entirely. The second is small fixed-point differences between implementations: per-product rounding, accumulator width or sign-extension mistakes change the result by a few LSBs. Note that 0xFFFE itself is a plausible y[1] (it is just h[0]*x[1] when x[0] = 0), not evidence of a rounding bias in y[0]. Fix the testbench timing first, then standardize the MAC (full-precision accumulation, round once, then saturate). Diagnosis steps and concrete code patches follow.

Overview and reproduction
Cycle‑by‑cycle timing: vin, vin_d, vout and the print race
Fixed‑point math: Q(2,14) multiply/accumulate, rounding and why -2 appears
Why three implementations can differ (root causes)
Concrete fixes and testbench patches you can apply now
Debug checklist — small experiments to isolate the bug
Sources
Conclusion

fir filter: Overview and reproduction

Key facts from your direct_fir module and testbench: N = 100 taps, coefficients and inputs in Q(2,14), inputs begin with x[0] = 0, and MATLAB produces y[0] = 0. On each posedge the DUT shifts xdel if vin is high, registers vin_d <= vin, and, if vin_d is high, computes the full MAC in a blocking loop, rounds, saturates with sat16, and registers yout together with vout <= 1. The TB holds vin = 1 for the whole stream and prints yout inside an always @(posedge clk) block whenever vout is high.

Two things collide here: (A) the timing between TB stimulus, DUT sampling and the logged output can be off by one cycle, so the first line in the file may not be the sample you expect, and (B) the numeric path (multiply widths, sign extension, where you round or truncate) can differ slightly between implementations, which shows up as differences of a few LSBs.

fir filter: cycle-by-cycle timing, vin/vin_d and why y[0] is missing

Walkthrough for your code, counting rising clock edges after reset as T1, T2, T3 and assuming the DUT sees each new xin on the edge after the testbench assigns it:

T0: after reset, xdel[*] = 0, vin_d = 0, vout = 0.
T1: the DUT samples vin = 1 and xin = x[0]. It schedules the shift (xdel[0] <= x[0]) and vin_d <= 1. The if (vin_d) guard still reads the old vin_d (0), so no MAC runs this cycle.
T2: the DUT samples xin = x[1] and schedules xdel[0] <= x[1] and xdel[1] <= x[0]. The guard now reads vin_d == 1, and the MAC reads the pre-edge delay line, which holds only x[0]. The DUT computes y[0] and schedules yout <= y[0] and vout <= 1.
T3: the testbench's always @(posedge clk) block reads the values registered at T2, so it sees vout == 1 and prints yout = y[0]. This read is deterministic: a clocked block always sees what the non-blocking assignments committed after the previous edge. From here on vout stays at 1 for as long as vin is held high, because the later vout <= 1'b1 overrides the earlier vout <= 1'b0 in the same block.

Where the confusion appears:

The walkthrough assumes a fixed order between the testbench and the DUT, but the testbench assigns rst, vin and xin with blocking assignments immediately after @(posedge clk), in the same time step in which the DUT's always @(posedge clk) samples them. Verilog does not define which process runs first, so the DUT may see a new value on that same edge or on the next one. If the order is the same on every edge, the whole stream only moves by one cycle. If it differs between the edge that applies x[0] and the following one, the DUT never captures x[0] and the output stream starts at y[1].
Also note that y[0] = 0x0000 equals the reset value of yout, so in a waveform it can only be told apart from "no output yet" by vout going high.
Actionable check: log vin, vin_d, vout, xin, xdel[0] and a cycle counter for the first 10 cycles. That shows immediately whether x[0] enters the delay line and whether the printed outputs are shifted relative to the input stream.
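One way to take simulator scheduling out of the picture is to change the DUT inputs on the falling clock edge, so they are stable when the DUT samples them on the rising edge. The following is a minimal, self-contained sketch of that idea. The module name tb_stimulus_sketch, the stand-in DUT (which only registers xin when vin is high) and the sample values 0..7 are all made up for illustration; it is not the original direct_fir.

```verilog
`timescale 1ns/1ps
// Hypothetical miniature: a stand-in "DUT" that captures xin when vin is high.
module tb_stimulus_sketch;
    reg clk = 1'b0;
    reg rst = 1'b1;
    reg vin = 1'b0;
    reg signed [15:0] xin = 16'sd0;

    reg signed [15:0] captured;    // stand-in for xdel[0]
    reg               cap_valid;   // stand-in for vin_d
    integer           n_captured = 0;
    integer           errors = 0;
    integer           n;

    always #5 clk = ~clk;

    // Stand-in DUT: samples its inputs on the rising edge.
    always @(posedge clk) begin
        if (rst) begin
            captured  <= 16'sd0;
            cap_valid <= 1'b0;
        end else begin
            cap_valid <= vin;
            if (vin) captured <= xin;
        end
    end

    // Checker: reads registered values on the rising edge. This is race-free
    // because it sees what the non-blocking updates committed after the previous edge.
    always @(posedge clk) begin
        if (!rst && cap_valid) begin
            if (captured !== n_captured) errors = errors + 1;  // expect 0, 1, 2, ...
            n_captured = n_captured + 1;
        end
    end

    initial begin
        repeat (3) @(posedge clk);
        @(negedge clk) rst = 1'b0;      // release reset away from the sampling edge
        for (n = 0; n < 8; n = n + 1) begin
            @(negedge clk);             // drive on the falling edge
            vin = 1'b1;
            xin = n;                    // made-up samples 0..7
        end
        @(negedge clk) vin = 1'b0;
        repeat (3) @(posedge clk);
        $display("captured=%0d errors=%0d", n_captured, errors);  // captured=8 errors=0
        $finish;
    end
endmodule
```

Applied to the real testbench, the same idea would mean replacing the @(posedge clk) waits in apply_reset and run_file with @(negedge clk) wherever rst, vin or xin are assigned afterwards; the clocked print block can stay as it is.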

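Concrete TB suggestion: keep the clocked print, which is already race-free, and remove the race on the input side by changing rst, vin and xin on negedge clk (or with non-blocking assignments) so they are stable when the DUT samples them. If you prefer to log on the vout event instead of polling it in a clocked block, you can trigger on its rising edge:

verilog

// alternative to the clocked print block in the testbench:
// fires once per rising edge of vout, so vout must return low between samples
always @(posedge vout) begin
    if (!rst) $fwrite(fout, "%04h\n", yout[15:0]);
end

Be aware that this prints only once per rising edge of vout. With vin held high for the whole stream, as in your testbench, vout rises once and stays high, so only the first sample of each file would be logged. Use it only together with a single-cycle vin pulse per sample.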

Fixed‑point scaling, rounding and saturation in this fir filter

Short reminder of scaling:

Inputs and coefficients are Q(2,14). Multiply: Q(2,14) * Q(2,14) → Q(4,28). Sum all products in a wide accumulator (you use 64 bits) in Q(4,28). To return to Q(2,14) shift right by SHIFT = 14 bits.
Rounding policy in your code: add HALF_LSB = 2^(SHIFT-1) = 8192 to the magnitude of the accumulator, shift right by SHIFT, then restore the sign. This rounds to nearest with ties away from zero.
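As a worked check of the scaling (the operand values 0.5 and 0.25 are made up for illustration):

$$
\begin{aligned}
0.5 \cdot 2^{14} &= 8192 = \texttt{0x2000}, \qquad 0.25 \cdot 2^{14} = 4096 = \texttt{0x1000} \\
8192 \cdot 4096 &= 2^{25} = \texttt{0x02000000} \quad \text{in Q(4,28)} \\
\left\lfloor \frac{2^{25} + 2^{13}}{2^{14}} \right\rfloor &= 2048 = \texttt{0x0800} \quad \text{in Q(2,14)}, \qquad \frac{2048}{2^{14}} = 0.125
\end{aligned}
$$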

Why 0xFFFE (-2) appears, and why it is not a rounding artifact of y[0]:

If the delay line holds only x[0] = 0, every product is exactly zero, so acc = 0 and the rounded, saturated result is 0x0000 no matter how the coefficients were quantized. A first logged value of 0xFFFE therefore means the MAC ran with x[1] already in the delay line: it is y[1] = h[0]*x[1], as you identified.
The implementations can still disagree with each other by a few LSBs. If any of them truncates or rounds earlier (per product or per partial sum) rather than once at the end, the errors accumulate across 100 taps.
Sign handling mistakes (an unsigned reg [15:0] operand, or a missing $signed) make Verilog evaluate the whole expression as unsigned, which corrupts every term that has a negative operand.
In short: the missing y[0] is a timing problem, while differences between implementations come from where rounding or truncation happens, from sign handling and from accumulator width.
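Writing out the first two convolution outputs makes this concrete. With a delay line that is zero after reset and $x[0] = 0$, as stated in the question:

$$
y[0] = h[0]\,x[0] = 0, \qquad y[1] = h[0]\,x[1] + h[1]\,x[0] = h[0]\,x[1].
$$

So $y[0]$ is exactly 0x0000 regardless of coefficient quantization, and a small negative $y[1]$ only requires $h[0]\,x[1]$ to be slightly negative. It also follows that if the DUT never captures $x[0]$, every later output is unchanged (the dropped term is always $h[n]\,x[0] = 0$) and the stream simply begins with $y[1]$.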

Best practice to get bit‑identical results:

Keep full product precision for each multiply (signed 32‑bit prod), sign‑extend to the accumulator width, accumulate everything in a single wide signed accumulator (you already use 64 bits), then do a single rounding and a single saturation at the end.
Do NOT round/truncate after every product or at intermediate pipeline stages unless you replicate the exact same policy in all implementations.

Mathematically:
y_q214 = sat16( round( acc_Q4_28 / 2^14 ) )
round(·) implemented as (acc >= 0) ? ((acc + 2^(13)) >>> 14) : -((( -acc ) + 2^(13)) >>> 14)

Check this is exactly what each implementation does — any change will change bits.
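The rounding and saturation step can be isolated in one function and exercised on its own. This is a minimal sketch, not the original module: the names round_sat_sketch and round_sat_q214 are hypothetical, and the test values are chosen by hand around the half-LSB boundary.

```verilog
// Minimal sketch: round half away from zero from Q(4,28) to Q(2,14), then saturate.
module round_sat_sketch;
    localparam integer       SHIFT    = 14;
    localparam signed [63:0] HALF_LSB = 64'sd8192;   // 2^(SHIFT-1)

    function signed [15:0] round_sat_q214(input signed [63:0] acc);
        reg signed [63:0] r;
        begin
            // Round on the magnitude, then restore the sign.
            if (acc >= 0) r = (acc + HALF_LSB) >>> SHIFT;
            else          r = -((-acc + HALF_LSB) >>> SHIFT);
            // Saturate to the signed 16-bit range.
            if      (r >  64'sd32767) round_sat_q214 = 16'sh7FFF;
            else if (r < -64'sd32768) round_sat_q214 = 16'sh8000;
            else                      round_sat_q214 = r[15:0];
        end
    endfunction

    initial begin
        $display("%h", round_sat_q214(64'sd0));          // 0000
        $display("%h", round_sat_q214(64'sd8191));       // 0000 (just below half an LSB)
        $display("%h", round_sat_q214(64'sd8192));       // 0001 (exactly half, rounds away from zero)
        $display("%h", round_sat_q214(-64'sd8192));      // ffff (-1)
        $display("%h", round_sat_q214(-64'sd32768));     // fffe (-2)
        $display("%h", round_sat_q214(64'sd1 <<< 40));   // 7fff (saturated)
    end
endmodule
```

Feeding the same accumulator values through each implementation's rounding stage is a quick way to see whether they share one policy.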

Why different implementations produce inconsistent outputs

Common root causes (ordered by likelihood for your symptom set):

Handshake and sampling race: the TB changes inputs on the same edge the DUT samples them, or logs a different cycle than the one you think (see above). An implementation that asserts vout one cycle earlier or later then pairs each logged value with a different y[k].
Blocking vs non‑blocking assignment mix: designs that use blocking (‘=’) to shift/register inside a clocked always and those that use non‑blocking (‘<=’) will read/write arrays at different times inside the same posedge — that changes which xdel[] values are used by the MAC.
Per‑stage rounding/truncation vs single final rounding: rounding earlier in the pipeline introduces bias; doing it only at the end is the canonical correct approach for fixed‑point FIR if you want minimal bias.
Accumulator width and sign‑extension: not sign‑extending prod before adding, or using too‑narrow accumulator, or mismatched prod widths (some inference to DSP blocks may truncate) produce differences.
File / memory initialization or $readmemh formatting mistakes: coefficients read with wrong hex format or misinterpreted sign will produce a biased impulse response.
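Pipelining or DSP inference: in RTL simulation the * operator behaves the same however it is later mapped to hardware, so this only matters if you compare post-synthesis or gate-level simulations, or if an implementation instantiates a vendor DSP primitive or IP core with its own rounding, truncation or saturation settings.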
Testbench input timing (vin high continuously vs 1‑cycle pulses): different DUTs may expect different valid-handshake semantics.

All of the above are plausible — that’s why a short, systematic debug is best.
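The sign-handling item is easy to reproduce in isolation. In the sketch below (module name and operand values are made up), the same bit pattern is multiplied once as a signed operand, once as an unsigned operand, and once with a $signed cast:

```verilog
// Minimal sketch: one unsigned operand makes the whole multiply unsigned,
// so a negative Q(2,14) bit pattern is read as a large positive number.
module signed_mult_sketch;
    reg signed [15:0] x_s;               // declared signed
    reg        [15:0] x_u;               // same bits, declared unsigned
    reg signed [15:0] h_s;
    reg signed [31:0] p_ok, p_bad, p_fixed;

    initial begin
        x_s = 16'shFFFE;                 // -2
        x_u = 16'hFFFE;                  // 65534 when treated as unsigned
        h_s = 16'sh0003;                 // +3

        p_ok    = x_s * h_s;             // signed * signed
        p_bad   = x_u * h_s;             // unsigned operand -> unsigned expression
        p_fixed = $signed(x_u) * h_s;    // cast restores signed arithmetic

        // prints: p_ok=-6 p_bad=196602 p_fixed=-6
        $display("p_ok=%0d p_bad=%0d p_fixed=%0d", p_ok, p_bad, p_fixed);
    end
endmodule
```

The error from an unsigned operand is large rather than a few LSBs, so it would show up as grossly wrong outputs, not as a small bias.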

Concrete fixes and testbench changes (patches)

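Make the TB logging robust:

Printing on posedge vout is an option only when vout pulses once per sample (see the single-cycle vin pulse further down):

verilog

// in tb_direct_fir, replacing the clocked print block:
always @(posedge vout) begin
    if (!rst) $fwrite(fout, "%04h\n", yout[15:0]);
end

With vin held high continuously, vout never falls, so this block would fire only once per file. In that case keep the original always @(posedge clk) print and make sure the testbench changes rst, vin and xin away from the rising clock edge.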

Standardize the MAC: single final rounding and saturation (canonical form)

Keep this sequence:

prod = signed(xdel[i]) * signed(h[i]); // 32 bits
acc += sign_extend_to_64(prod);
after loop: do rounding once, then sat16

If you want a tiny code tweak to make sign extension formal and obvious:

verilog

// inside the if (vin_d) block
acc = 64'sd0;
for (i = 0; i < N; i = i + 1) begin
 // make prod signed 32 explicitly
 prod = $signed(xdel[i]) * $signed(h[i]); // prod is signed[31:0]
 acc = acc + $signed({{32{prod[31]}}, prod}); // explicit sign-extend to 64
end
// rounding exactly once as you already do

If you mix blocking and non‑blocking for the shift/MAC, be explicit about intent:

Two safe patterns:

Pattern A — two-phase (non‑blocking shift, accumulate next cycle using vin_d) — this is your current approach and is fine. Keep all shifts non‑blocking; keep MAC in next cycle reading xdel (no blocking shifts inside MAC).

Pattern B — single‑cycle compute using blocking shift then MAC in the same always block. If you choose B, then use blocking ‘=’ for shift so subsequent MAC inside the same always sees updated xdel values. But don’t mix A and B styles between implementations.
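To see how the two patterns relate, here is a hedged miniature that implements a 2-tap running sum both ways and checks that they produce the same values, with Pattern B one cycle ahead of Pattern A. Everything in it (module name pattern_sketch, signal names, the inputs 1 to 4) is hypothetical and much smaller than the 100-tap direct_fir.

```verilog
`timescale 1ns/1ps
// Hypothetical 2-tap running sum written in both styles; not the original direct_fir.
module pattern_sketch;
    reg clk = 1'b0, rst = 1'b1, vin = 1'b0;
    reg signed [15:0] xin = 16'sd0;
    integer n, ka = 0, kb = 0, errors = 0;

    always #5 clk = ~clk;

    // Pattern A: non-blocking shift, sum computed one cycle later under vin_d.
    reg signed [15:0] a0, a1, ya;
    reg               vin_d, va;
    always @(posedge clk) begin
        if (rst) begin
            a0 <= 0; a1 <= 0; ya <= 0; vin_d <= 1'b0; va <= 1'b0;
        end else begin
            if (vin) begin a1 <= a0; a0 <= xin; end
            vin_d <= vin;
            va    <= vin_d;
            if (vin_d) ya <= a0 + a1;       // reads the pre-edge delay line
        end
    end

    // Pattern B: blocking shift, sum computed in the same cycle.
    reg signed [15:0] b0, b1, yb;
    reg               vb;
    always @(posedge clk) begin
        if (rst) begin
            b0 = 0; b1 = 0; yb <= 0; vb <= 1'b0;
        end else begin
            vb <= vin;
            if (vin) begin
                b1 = b0; b0 = xin;          // updated values are visible immediately
                yb <= b0 + b1;
            end
        end
    end

    // Checker: both styles must produce 1, 3, 5, 7 for inputs 1, 2, 3, 4.
    always @(posedge clk) if (!rst) begin
        if (va) begin if (ya !== 2*ka + 1) errors = errors + 1; ka = ka + 1; end
        if (vb) begin if (yb !== 2*kb + 1) errors = errors + 1; kb = kb + 1; end
    end

    initial begin
        repeat (3) @(posedge clk);
        @(negedge clk) rst = 1'b0;
        for (n = 1; n <= 4; n = n + 1) begin
            @(negedge clk); vin = 1'b1; xin = n;   // drive away from the sampling edge
        end
        @(negedge clk) vin = 1'b0;
        repeat (4) @(posedge clk);
        $display("ka=%0d kb=%0d errors=%0d", ka, kb, errors);  // ka=4 kb=4 errors=0
        $finish;
    end
endmodule
```

The values match, but vb goes high one cycle before va. A testbench that assumes a fixed latency would therefore pair Pattern B's outputs with the wrong sample index, which is one way otherwise equivalent implementations end up looking different.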

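Make vin a single-cycle "valid" pulse (optional but clearer), and change the inputs on the falling edge so they are stable when the DUT samples them:

verilog

// in run_file loop
for (n = 0; n < NSAMPLES; n = n + 1) begin
    @(negedge clk);          // drive away from the sampling edge
    xin = $signed(xmem[n]);
    vin = 1'b1;
    @(negedge clk);
    vin = 1'b0;              // idle cycle, so vout pulses once per sample
end

This makes the handshake explicit and easier to reason about: each sample produces exactly one vout pulse.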

Add quick instrumentation in DUT (temporary) to print first cycles:

Inside direct_fir (temporary debug, guarded by an ifdef or compile-time flag), display xdel[0], a few h[], acc before rounding:

verilog

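// For debug only:
integer dbg_cyc;
always @(posedge clk) begin
    if (rst) begin
        dbg_cyc = 0;   // integers start as x; without this nothing would print
    end else begin
        dbg_cyc = dbg_cyc + 1;
        if (dbg_cyc < 20) begin
            // acc is written by the main always block on this same edge, so the
            // value shown may belong to this cycle or the previous one
            $display("Cyc=%0d vin=%b vin_d=%b xin=%h xdel0=%h acc=%0d yout=%h", dbg_cyc, vin, vin_d, xin, xdel[0], acc, yout);
        end
    end
end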

Verify coefficient loading and signs:

Print first few h[] values after $readmemh to confirm they match MATLAB Q(2,14) hex.

Debug checklist — small experiments to isolate the bug

Run these quick tests in order; stop when you identify the problem.

Print internal signals for the first 10 posedges:

vin, vin_d, xin, xdel[0…4], prod (first few), acc before rounding, yout, vout.
Does acc start at 0 when xdel are zero? If not, your reset or $readmemh may be wrong.

Remove the stimulus race: change rst, vin and xin on negedge clk instead of right after @(posedge clk), and rerun. Does the first logged value become 0x0000? If yes, the TB/DUT timing race was the main problem.
Force a single non‑zero sample: set only x[0]=some value, all others zero; log yout for a few cycles and compare to hand calc (one tap test). This isolates sign/coeff issues.
For each Verilog implementation:

Ensure each one does exactly one final round (not per-product).
Compare acc raw value (64‑bit) between implementations for the same cycle. If acc differs you know the difference is in multiply/accumulate stage (not the rounding/print).

Compare coefficients read by each implementation. Print or dump coef_q214_hex.txt parsed values.
If implementations differ in how the multiply is written (operator, instantiated multiplier or IP core), temporarily replace it with a plain * on operands that are declared signed or wrapped in $signed (available since Verilog-2001) to get consistent behavior in simulation.
If you still see -2 for the very first non‑zero output, look for any place where you implicitly treat 16‑bit regs as unsigned (missing $signed), and inspect sat16: confirm its thresholds and that x[15:0] indexing returns the expected two’s complement bits.

Sources

User-supplied Verilog direct-form FIR module and testbench

Conclusion

The symptom (missing y[0] and a first logged value of 0xFFFE) most likely comes from two separate issues: a timing race between the testbench stimulus and the DUT that shifts the logged stream by one sample, and small fixed-point rounding, sign or width differences between implementations. Start by removing the race in the testbench (drive inputs away from the rising clock edge, and only print on vout edges if vout pulses once per sample), then standardize the MAC: full-precision signed products, one wide accumulator, a single final rounding, then saturation. Use the debug checklist above to narrow down the cause. With both fixes in place, the first logged value should match MATLAB's y[0] = 0x0000 and the implementations should agree bit for bit.

Authors

NeuroAnswers
