
Fix Identical Metrics in TensorFlow Keras CNN Backbones



Why do three different CNN backbones (VGG16, ResNet50, DenseNet121) trained from scratch on the same medical imaging dataset in TensorFlow 2.15 produce identical evaluation metrics (accuracy, F1, AUC) to 4 decimal places, and how can I reliably diagnose and fix this?

Context

  • Dataset: 1,016 medical images saved as .npy — binary classification
  • Models: VGG16, ResNet50, DenseNet121 (all trained from scratch, use_pretrained=False)
  • Framework: TensorFlow 2.15, Python 3.10, Windows 11, GPU: RTX 3060
  • Input: Grayscale images (224×224×1) converted to 3-channel via Concatenate([x, x, x]) for backbone compatibility
  • Training notes: tf.random.set_seed(random_seed) and np.random.seed(random_seed) are set per model; classifier heads use unique layer names; tf.keras.backend.clear_session() is called between models

Minimal reproducible code (simplified):

python
# Simplified version
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.metrics import AUC
from tensorflow.keras.optimizers import Adam

# augmentation_layer and get_backbone are defined elsewhere (omitted for brevity)

def build_model(backbone_name, random_seed):
    tf.random.set_seed(random_seed)
    np.random.seed(random_seed)

    # Grayscale to 3-channel conversion
    image_input = tf.keras.Input(shape=(224, 224, 1))
    x = augmentation_layer(image_input)
    x = tf.keras.layers.Concatenate()([x, x, x])  # (224, 224, 3)

    # Load backbone (VGG16/ResNet50/DenseNet121)
    backbone = get_backbone(backbone_name, use_pretrained=False)
    features = backbone(x)

    # Independent classifier head with unique layer names
    x = Dense(128, activation='relu', name=f'fc1_{backbone_name}_{random_seed}')(features)
    x = Dropout(0.5)(x)
    x = Dense(64, activation='relu', name=f'fc2_{backbone_name}_{random_seed}')(x)
    output = Dense(1, activation='sigmoid', name=f'output_{backbone_name}_{random_seed}')(x)

    model = tf.keras.Model(inputs=image_input, outputs=output)
    return model

# Training
for model_name, seed in [('VGG16', 100), ('ResNet50', 200), ('DenseNet121', 300)]:
    model = build_model(model_name, seed)
    model.compile(optimizer=Adam(1e-4), loss='binary_crossentropy', metrics=['accuracy', AUC()])
    model.fit(train_ds, validation_data=val_ds, epochs=25)

    # Evaluate
    predictions = model.predict(test_ds)
    # Calculate metrics...

    tf.keras.backend.clear_session()  # Clear between models

Observed example results:

Model | Accuracy | F1 | AUC
------------|----------|--------|--------
VGG16 | 0.8234 | 0.8156 | 0.8891
ResNet50 | 0.8234 | 0.8156 | 0.8891
DenseNet121 | 0.8234 | 0.8156 | 0.8891

Diagnostics already performed

  • Prediction statistics (mean, std) ARE slightly different between models
  • Number of unique predictions varies (~100–150)
  • Training histories show different convergence patterns
  • Model parameter counts differ (VGG ~15M, ResNet ~24M, DenseNet ~7M)

Concrete questions / debugging goals

  1. Could this be a metric-calculation bug (e.g., accidentally reusing or overwriting prediction/label arrays)? What are the simplest checks to confirm predictions/labels used for metric computation are unique per model run?
  2. Can TensorFlow/Keras caching, dataset pipeline reuse, or global graph/session state cause identical final metrics even when calling tf.keras.backend.clear_session()? What specific signs indicate layer/weight reuse or shared instances (e.g., get_backbone implementation returning the same layer instance)?
  3. Could my preprocessing (grayscale→3-channel via Concatenate or augmentation_layer) be inadvertently forcing similar inputs or reducing model expressivity so different backbones converge to the same scores? How to test this (e.g., feed a fixed batch, compare backbone outputs)?
  4. Is it statistically plausible that three very different architectures achieve near-identical metrics on a small dataset by chance? How to quantify this (e.g., bootstrap the test set, compare predictions pairwise, test significance)?
  5. What targeted debugging steps and unit tests should I run in TensorFlow 2.15 to ensure model isolation and correct metric computation (examples: assert different weight tensors after training, save/compare weights, run predictions on a fixed seed batch, verify the dataset iterator is not reused, ensure get_backbone creates fresh layers each call)?

Any targeted suggestions for checks, common Keras pitfalls that cause shared layers/weights, or reproducible tests to pinpoint whether this is a metric bug, framework caching issue, data problem, or genuine model parity would be greatly appreciated.

Identical evaluation metrics across VGG16, ResNet50, and DenseNet121 backbones in TensorFlow Keras, even on medical imaging datasets, usually point to shared metric state or dataset-iterator reuse rather than true model parity. Even with different predictions and training histories, accuracy, F1, and AUC can match exactly if Keras metric objects are never reset or if tf.data pipelines cache outputs across models. Fix it by instantiating fresh metric objects per model, recreating the datasets inside the training loop, and verifying backbone outputs on a fixed input batch; this restores model isolation in TensorFlow 2.15.


Why CNN Backbones Show Identical Metrics in TensorFlow Keras

You have three very different architectures (VGG16, ResNet50, DenseNet121) all reporting the same accuracy (0.8234), F1 (0.8156), and AUC (0.8891) to four decimal places on your 1,016-image medical dataset. Predictions differ slightly, training histories vary, and parameter counts are different, yet the final metrics move in lockstep. Frustrating, right?

This looks like a framework gotcha, not a coincidence. In TensorFlow Keras, metrics such as BinaryAccuracy or AUC() are stateful objects that accumulate across calls to update_state(y_true, y_pred). If the same metric instance is reused between models (for example, created once and passed to several compile() calls, or updated manually without a reset), its state carries over from the previous run. clear_session() resets the graph but does not reset metric objects you still hold references to.
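A minimal sketch of that accumulation, using made-up labels and scores just to show the stateful behaviour:

python
import tensorflow as tf

auc = tf.keras.metrics.AUC()
auc.update_state([0, 1, 1, 0], [0.1, 0.9, 0.8, 0.3])   # "model A" scores
print(float(auc.result()))                              # AUC for model A alone
auc.update_state([0, 1, 1, 0], [0.6, 0.4, 0.7, 0.2])   # "model B" scores; state now mixes A and B
print(float(auc.result()))                              # no longer model B's AUC
auc.reset_states()                                      # what proper isolation between models requires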

Backbones from get_backbone (likely wrapping tf.keras.applications) can share layers if the factory caches instances, though your unique names and per-model seeds suggest that is not the cause. The grayscale-to-RGB conversion via Concatenate([x, x, x]) could homogenize features and push different architectures toward similar scores on a small dataset, but identical to four decimal places points to a computation bug, not convergence.

Stack Overflow threads describe the same pattern: validation metrics are computed after each epoch from predictions that can be silently reused, while training metrics are accumulated on the fly, and tf.data iterators can cache state without warning.


Metric Calculation Bugs and Quick Checks

First question: could this be a metric-calculation bug that reuses predictions or labels? Absolutely possible. Metric strings like 'accuracy' are resolved to implicit metrics tied to the loss (e.g., BinaryAccuracy for binary_crossentropy), which makes it hard to see which metric object is actually in use and whether its state is fresh across compiles.

Simplest checks:

  1. Ditch strings. Use explicit instances: metrics=[tf.keras.metrics.BinaryAccuracy(name='acc'), tf.keras.metrics.AUC(name='auc')]. Create new ones per model.
python
acc_metric = tf.keras.metrics.BinaryAccuracy()
auc_metric = tf.keras.metrics.AUC()
model.compile(..., metrics=[acc_metric, auc_metric]) # Fresh per model!

Update manually after predict: acc_metric.update_state(y_true, y_pred); print(acc_metric.result()). If the results are still identical, you have found a state leak.

  2. Check the model.evaluate(test_ds) output. Run it standalone for each model and call metric.reset_states() before each run; per the Keras docs, stateful metrics need explicit resets in multi-run setups.

  3. Grab the raw predictions and labels:

python
from sklearn.metrics import f1_score

y_true = np.concatenate([y for x, y in test_ds], axis=0)
preds_vgg = model_vgg.predict(test_ds)
f1_vgg = f1_score(y_true, (preds_vgg > 0.5).astype(int))

Do this for each model. If the values match Keras's reported metrics, the computation is fine. If they are still identical across models, the predictions themselves are being reused somewhere.

You note ~100–150 unique prediction values per model, which is a good sign. But if the test_ds iterator is exhausted or cached after the first model, later models can evaluate against stale data; recreate test_ds inside the loop.

Signs of sharing: print id(model.metrics[0]) before and after compile. If the same id shows up across models, you have a shared metric instance.
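A quick sketch of that uniqueness check, assuming model_vgg and model_res are two of the trained models and test_ds is recreated for each:

python
import numpy as np

preds_vgg = model_vgg.predict(test_ds)
preds_res = model_res.predict(test_ds)
print(np.array_equal(preds_vgg, preds_res))                  # True means the same prediction array is being reused
print(id(model_vgg.metrics[-1]), id(model_res.metrics[-1]))  # equal ids mean a shared metric instance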


Dataset Pipeline and Caching Pitfalls

tf.data.Dataset can be sneaky. If train_ds, val_ds, and test_ds are defined outside the loop, they reuse iterators and any cache() or prefetch() state across models; clear_session() resets the graph but not dataset state.

Targeted tests:

  • Move dataset creation inside model loop: Fresh iterator each time.
python
for name in ['VGG16', ...]:
    test_ds = make_dataset(test_data, shuffle=False, cache=False)  # no cache
    model = build_model(...)
    print(model.evaluate(test_ds))  # fresh evaluation per model
  • tf.data vs NumPy: load the test split as plain arrays and call model.evaluate on them; per a TensorFlow GitHub issue, datasets can change batching behaviour and reported accuracy compared with arrays (see the sketch below).
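A minimal sketch of that comparison, assuming test_ds yields (image, label) batches:

python
import numpy as np

x_test = np.concatenate([x.numpy() for x, _ in test_ds], axis=0)
y_test = np.concatenate([y.numpy() for _, y in test_ds], axis=0)
print(model.evaluate(x_test, y_test, verbose=0))  # compare against model.evaluate(test_ds)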

Global session pollution? Setting tf.random.set_seed per model is good, but backend globals can linger. Call tf.keras.backend.clear_session() between models, and tf.compat.v1.reset_default_graph() as well if you run with eager execution disabled.

Is get_backbone the culprit? Check whether it returns a cached Application instance:

python
backbone1 = get_backbone('VGG16', use_pretrained=False)
backbone2 = get_backbone('ResNet50', use_pretrained=False)
print(id(backbone1), id(backbone2))                   # same id? The factory returned the same object.
print(backbone1.weights[0] is backbone2.weights[0])   # True? The "different" backbones share weight tensors.

The tf.keras.applications constructors build a fresh model on every call, but a custom get_backbone implementation might not.
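For reference, a sketch of a factory that is guaranteed to build a fresh backbone on every call; the signature mirrors the question's get_backbone, and the include_top=False / pooling='avg' output (so the Dense head gets a flat feature vector) is an assumption:

python
import tensorflow as tf

def get_backbone(name, use_pretrained=False):
    """Hypothetical factory: constructs a brand-new backbone each time it is called."""
    constructors = {
        'VGG16': tf.keras.applications.VGG16,
        'ResNet50': tf.keras.applications.ResNet50,
        'DenseNet121': tf.keras.applications.DenseNet121,
    }
    weights = 'imagenet' if use_pretrained else None
    return constructors[name](include_top=False, weights=weights,
                              input_shape=(224, 224, 3), pooling='avg')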


Preprocessing Forcing Similarity

Converting grayscale (224×224×1) to three channels via Concatenate([x, x, x]) is the standard trick for RGB backbones, but it replicates one channel three times, so VGG, ResNet, and DenseNet all see the same colorless input and color-sensitive filters have nothing extra to learn. If augmentation_layer runs before this and is effectively deterministic (no random flips or rotations per call), the inputs become even more homogeneous.

Test it:

  1. Fixed batch probe:
python
rng = np.random.default_rng(0)                        # seeded probe batch
fixed_batch = rng.uniform(0, 1, (4, 224, 224, 1)).astype(np.float32)
vgg_preds = vgg16_model.predict(fixed_batch)          # full models from build_model (grayscale input)
res_preds = resnet50_model.predict(fixed_batch)
print(np.allclose(vgg_preds, res_preds, atol=1e-4))   # True? The models share something.

Different outputs? The architectures genuinely diverge. Identical outputs? Weights or layers are being shared, which preprocessing alone cannot explain.

  2. Bypass the concatenation: train a tiny 1-channel CNN per run. If the metrics now differ, the preprocessing path was the issue.

  3. Augmentation randomness: if augmentation_layer was built with seed=None, repeated calls on the same batch should produce different outputs (see the sketch below). Per Neptune.ai, small medical datasets (~1,000 images) amplify any such effect; bootstrap the test set to check.
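A quick check, assuming augmentation_layer is a standard Keras random-preprocessing layer (these only randomize when called with training=True):

python
import tensorflow as tf

probe = tf.random.uniform((1, 224, 224, 1))
out1 = augmentation_layer(probe, training=True)
out2 = augmentation_layer(probe, training=True)
print(bool(tf.reduce_all(out1 == out2)))  # True on every rerun means augmentation is effectively deterministic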

Different parameter counts with the same scores are possible on a tiny dataset, but matching to four decimal places across three metrics is very unlikely without a bug.


Is It Statistically Plausible?

On roughly 1,000 images (say ~200 in the test split), could three architectures hit exactly 0.8234 accuracy by chance? Note that with an N-sample test set, accuracy can only take multiples of 1/N, so a tie in accuracy alone is not shocking; accuracy, F1, and AUC all matching to four decimals is. Bootstrap it to quantify.

Quantify:

  1. Collect all predictions: preds = {name: models[name].predict(test_ds) for name in models}, where models maps each name to its trained model.
  2. Bootstrap:
python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

n_boots = 1000
acc_diffs = []
for _ in range(n_boots):
    boot_idx = resample(np.arange(len(y_true)))  # resample test indices with replacement
    boot_acc_vgg = accuracy_score(y_true[boot_idx], (preds['VGG16'][boot_idx] > 0.5).astype(int))
    boot_acc_res = accuracy_score(y_true[boot_idx], (preds['ResNet50'][boot_idx] > 0.5).astype(int))
    acc_diffs.append(abs(boot_acc_vgg - boot_acc_res))

tie_rate = np.mean(np.array(acc_diffs) == 0.0)   # how often the two accuracies tie exactly by chance

If tie_rate < 0.01, an exact tie is statistically implausible and a bug is the far more likely explanation. Per MachineLearningMastery, small medical datasets also warrant k-fold cross-validation rather than a single split.

Also run a pairwise KS test on the prediction scores: from scipy.stats import ks_2samp; ks_2samp(preds_vgg.flatten(), preds_res.flatten()). A low p-value means the score distributions differ, in which case identical ranking metrics like AUC are suspicious.
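A sketch of the full pairwise comparison, reusing the preds dict built above (the key names are an assumption):

python
from itertools import combinations
from scipy.stats import ks_2samp

for a, b in combinations(['VGG16', 'ResNet50', 'DenseNet121'], 2):
    stat, p = ks_2samp(preds[a].ravel(), preds[b].ravel())
    print(f"{a} vs {b}: KS statistic={stat:.4f}, p={p:.4f}")  # low p means the score distributions differ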

Imbalanced binary classification can produce coincidental ties in threshold-based metrics such as F1, but it cannot explain accuracy, F1, and AUC all matching across three models.


Targeted Debugging Steps and Fixes

Hit these in order:

  1. Model isolation (consolidated sketch below):
  • New metric instances per model, plus reset_states() before each evaluation.
  • Recreate datasets inside the loop; drop cache() and prefetch() while debugging.
  • Save weights after training: model.save_weights(f'{name}.h5'); reload and compare digests.
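A consolidated sketch of an isolated training loop; make_dataset is a hypothetical helper that rebuilds each tf.data pipeline from the raw arrays on every call:

python
for model_name, seed in [('VGG16', 100), ('ResNet50', 200), ('DenseNet121', 300)]:
    tf.keras.backend.clear_session()
    train_ds = make_dataset(train_data, shuffle=True)   # rebuilt every iteration
    val_ds = make_dataset(val_data, shuffle=False)
    test_ds = make_dataset(test_data, shuffle=False)

    model = build_model(model_name, seed)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss='binary_crossentropy',
                  metrics=[tf.keras.metrics.BinaryAccuracy(name='acc'),  # fresh instances per model
                           tf.keras.metrics.AUC(name='auc')])
    model.fit(train_ds, validation_data=val_ds, epochs=25)
    model.save_weights(f'{model_name}.h5')
    print(model_name, model.evaluate(test_ds, verbose=0))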
  2. Weight uniqueness: NumPy arrays are not hashable, so digest the raw bytes instead and confirm the digests differ across models.
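A small sketch of such a digest:

python
import hashlib

def weight_digest(model):
    h = hashlib.md5()
    for w in model.get_weights():
        h.update(w.tobytes())      # hash the raw weight bytes
    return h.hexdigest()

print(weight_digest(model))        # the three trained models should produce three different digests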

  3. Prediction pipeline: print prediction shapes and len(np.unique(preds)) per model, and rerun the fixed-batch probe from above.

  4. Full reset:

python
import gc; gc.collect()
tf.keras.backend.clear_session()

Run models sequentially, no loop reuse.

  5. Verify with scikit-learn: after predicting, recompute the metrics with sklearn.metrics (accuracy_score, f1_score, roc_auc_score) and compare against model.evaluate; if they match per model, the metric computation is fine.
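A short verification sketch, reusing y_true built from test_ds as earlier:

python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_pred = model.predict(test_ds).ravel()
y_hat = (y_pred > 0.5).astype(int)
print(accuracy_score(y_true, y_hat), f1_score(y_true, y_hat), roc_auc_score(y_true, y_pred))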

  6. Unit-test the backbones:

python
def test_backbones():
    apps = tf.keras.applications
    for name, ctor in [('vgg16', apps.VGG16), ('resnet50', apps.ResNet50), ('densenet121', apps.DenseNet121)]:
        base = ctor(input_shape=(224, 224, 3), weights=None, include_top=False)
        print(f"{name} layers: {len(base.layers)}, id: {id(base)}")

test_backbones()

Unique? Good.

Common pitfalls: implicit string metrics, tf.data caching, and channel-replication homogenization. The fixes above, plus longer training and more aggressive shuffling, should make the scores diverge; if they still do not, the dataset is too small for a single split, so switch to k-fold cross-validation.

The TensorFlow metrics documentation stresses recreating metric instances per model.


Sources

  1. Keras documentation: Metrics
  2. Stack Overflow: Keras CNN train/validation identical but different accuracy
  3. Stack Overflow: Difference between keras.metrics.Accuracy() and “accuracy”
  4. TensorFlow: tf.keras.metrics
  5. Stack Overflow: Different accuracy on same CNN
  6. Stack Overflow: How to use a different CNN without losing accuracy
  7. Neptune.ai: Keras Metrics
  8. Stack Overflow: Calculate Metrics for Keras CNN model
  9. TensorFlow GitHub: tf.data metrics difference
  10. MachineLearningMastery: Evaluate Keras Models

Conclusion

Pinpoint the bug with explicit per-model metrics, freshly created datasets, and fixed-batch probes; the most likely culprits are shared metric state or tf.data reuse in your TensorFlow Keras setup. Once fixed, expect the scores to diverge and reflect genuine architectural differences on your medical images, and use the bootstrap to confirm whether any remaining parity is real (it probably is not). Implement the resets and recreations, and your VGG16, ResNet50, and DenseNet121 should finally show their true colors.
