
PyTorch Dataset/DataLoader for Multivariate Time Series

Fix PyTorch Dataset and DataLoader for multivariate time series preprocessing from CSV. Ensure (B, V, L) shapes, avoid data leakage with proper scaling, and validate sliding windows for MAMBA models.


Multivariate Time Series Preprocessing from CSV for a model expecting (B, V, L) input — is my PyTorch Dataset/DataLoader implementation correct?

I have CSV time-series data with variables ['position_x', 'position_y', 'position_z'] and want inputs shaped (B, V, L) (batch_size, num_variables, window_length) for a MAMBA model (an LSTM normally expects (B, L, V)). I split data into train/test (test_ratio=0.2) and build a PyTorch Dataset that returns x and y as (V, input_length) and (V, forecast_length). Is this preprocessing correct? Any issues like data leakage or incorrect windowing?

Code:

python
import torch
from torch.utils.data import DataLoader
from sklearn.preprocessing import StandardScaler

# df: DataFrame loaded from the CSV
features = df[['position_x', 'position_y', 'position_z']]

scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

class TimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, data, input_length, forecast_length):
        """
        data: numpy array (T, V)
        returns:
            X: (V, input_length)
            y: (V, forecast_length)
        """
        self.data = torch.tensor(data, dtype=torch.float32)
        self.input_length = input_length
        self.forecast_length = forecast_length

    def __len__(self):
        return len(self.data) - self.input_length - self.forecast_length + 1

    def __getitem__(self, idx):
        x = self.data[idx : idx + self.input_length]  # (L, V)
        y = self.data[idx + self.input_length :
                      idx + self.input_length + self.forecast_length]

        x = x.T  # (V, L)
        y = y.T  # (V, F)

        return x, y

input_length = 16
forecast_length = 16
test_ratio = 0.2

split_idx = int(len(scaled_features) * (1 - test_ratio))

train_data = scaled_features[:split_idx]
test_data = scaled_features[split_idx:]

train_dataset = TimeSeriesDataset(train_data, input_length, forecast_length)
test_dataset = TimeSeriesDataset(test_data, input_length, forecast_length)

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=False,
)

test_loader = DataLoader(
    test_dataset,
    batch_size=32,
    shuffle=False,
)

Specific questions:

  • Will DataLoader batches have shape (B, V, L) given the Dataset returns (V, L)?
  • Should the StandardScaler be fit only on the training set to avoid data leakage?
  • Is the len calculation correct for sliding windows?
  • Any other pitfalls or best practices for preparing multivariate time series for models that expect (B, V, L) inputs?

Your PyTorch Dataset/DataLoader will produce batches shaped (B, V, L) if each __getitem__ returns a torch.Tensor of shape (V, L); however, your current pipeline fits StandardScaler before the train/test split (this causes data leakage), so fit the scaler on the training slice only and then transform train/test separately. The len formula you used (T - input_length - forecast_length + 1) is correct, but add guards for tiny T, consider drop_last/shuffle choices for training, and think about efficient window creation (torch.unfold) for speed.


PyTorch DataLoader & Dataset shape — will batches be (B, V, L)?

Short answer: yes — if your Dataset returns a torch.Tensor x with shape (V, L) for each sample, the default PyTorch DataLoader collate stacks samples along a new batch dimension and you get (B, V, L). The DataLoader collate behavior is standard (it stacks tensors into a single tensor on dim=0), so a sample (3, 16) becomes a batch (32, 3, 16) with batch_size=32. See the PyTorch DataLoader explanation for details: https://pythonguides.com/pytorch-dataloader/.

A few practical checks and tips

  • Make sure __getitem__ returns torch.Tensor objects (not plain numpy arrays) to avoid surprises; your code already converts data to torch.tensor in __init__, so that part is fine.
  • Validate shapes early: inside the training loop assert xb.shape == (batch_size, V, L) (or check xb.shape[1:] == (V, L)). Quick assertions catch shape mismatches before training.
  • If your model (MAMBA) expects (B, V, L) you’re done; if a library LSTM expects (B, L, V) or (L, B, V) you’ll need to permute (x = x.permute(0, 2, 1) or x = x.permute(2, 0, 1)) before feeding it in. The common LSTM convention is (seq_len, batch, features) or (batch, seq_len, features) — check model docs (for LSTM references see https://towardsdatascience.com/pytorch-lstms-for-time-series-data-cd16190929d7/).
  • DataLoader options: use shuffle=True for training (if windows are independent and you’re not carrying hidden state across batches), shuffle=False for validation/test, and consider drop_last=True to avoid smaller final batches during training.

Why this matters: convolutional layers (Conv1d) treat the middle dim as channels (C), so (B, V, L) maps directly to (B, C, L). LSTMs usually want the feature dimension last, with the sequence dimension first or second; confirm the expected layout and permute as needed.
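A quick way to verify this before training (a sketch, assuming a train_loader built as in the code above, with the question's 3 variables and input_length = 16; the permute is only needed if you swap in a batch_first LSTM):

python
# Pull one batch and confirm the (B, V, L) layout the model expects.
xb, yb = next(iter(train_loader))
assert xb.shape[1:] == (3, 16), f"expected (V, L) = (3, 16), got {tuple(xb.shape[1:])}"
assert yb.shape[1:] == (3, 16), f"expected (V, F) = (3, 16), got {tuple(yb.shape[1:])}"

# (B, V, L) feeds a Conv1d/MAMBA-style model directly; a batch_first LSTM
# expecting (B, L, V) needs the last two dimensions swapped first.
xb_lstm = xb.permute(0, 2, 1)  # (B, L, V)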


Multivariate time series preprocessing — scaling and data leakage

Your current code calls StandardScaler().fit_transform on the whole dataset before splitting. That leaks information from the test period into training: the scale parameters (mean, std) must be computed on the training split only and then applied to the test split. This is standard practice in time-series tutorials: split chronologically first, fit the scaler on the train slice, then transform train and test (see https://www.geeksforgeeks.org/data-analysis/time-series-forecasting-using-pytorch/ and https://towardsdatascience.com/introducing-pytorch-forecasting-64de99b9ef46/).

Correct order (brief):

  1. Split the raw CSV array into train_raw and test_raw (chronological split).
  2. Fit StandardScaler on train_raw only: scaler.fit(train_raw).
  3. Transform both parts: train_scaled = scaler.transform(train_raw); test_scaled = scaler.transform(test_raw).
  4. Build Dataset objects from train_scaled and test_scaled.

Why per-variable scaling is OK here

  • StandardScaler operates column-wise on a (T, V) array, so each variable (position_x, position_y, position_z) gets its own mean/std computed on training data. That’s usually what you want for multivariate time series with a single series ID. If you have many independent series (multiple IDs), consider scaling per-series or using group-aware scalers.

Inverse-scaling predictions

  • When you predict multi-step, multivariate outputs (shape e.g. B × V × F), inverse-transforming requires reshaping to (B*F, V), applying scaler.inverse_transform, then reshaping back. Keep that in mind for metric calculations and plotting.

Sliding windows and len correctness

Your len formula:
len = len(self.data) - self.input_length - self.forecast_length + 1
is mathematically correct. Reason: the last valid starting index idx satisfies idx + input_length + forecast_length <= T, so idx_max = T - input_length - forecast_length; number of integer starts = idx_max + 1 = T - input_length - forecast_length + 1.

Quick example: T=100, input_length=16, forecast_length=16 → len = 100 - 16 - 16 + 1 = 69 windows (starts at 0…68). That’s correct.

Edge cases and defensive checks

  • If len(self.data) < input_length + forecast_length then len will be ≤ 0. Add a check in init to raise a clear ValueError with an explanatory message.
  • Sliding-window stride: your code uses stride=1 (full overlap). That's fine and common. If you want non-overlapping windows or a different overlap, add a stride parameter and use it in calculating len and slicing (see the stride sketch after this list).
  • Efficient window creation: for very long time series you can precompute all windows using torch.unfold to avoid Python slicing overhead. The technique and trade-offs are described here: https://medium.com/@heyamit10/best-way-to-cut-a-pytorch-tensor-into-overlapping-chunks-14d80a99919c. Example using unfold is shown in the corrected code section below.
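A sketch of a stride-aware variant (stride is a new parameter not present in the original code; stride=1 reproduces the behaviour above):

python
class StridedTimeSeriesDataset(torch.utils.data.Dataset):
    def __init__(self, data, input_length, forecast_length, stride=1):
        self.data = torch.tensor(data, dtype=torch.float32)  # (T, V)
        self.input_length = input_length
        self.forecast_length = forecast_length
        self.stride = stride
        span = len(self.data) - input_length - forecast_length
        if span < 0:
            raise ValueError("Not enough time steps for given input/forecast lengths")
        # valid window starts are 0, stride, 2*stride, ... up to span
        self.n_windows = span // stride + 1

    def __len__(self):
        return self.n_windows

    def __getitem__(self, idx):
        start = idx * self.stride
        x = self.data[start : start + self.input_length]  # (L, V)
        y = self.data[start + self.input_length :
                      start + self.input_length + self.forecast_length]  # (F, V)
        return x.T, y.T  # (V, L), (V, F)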

Targets and alignment

  • Your target y = data[idx + input_length : idx + input_length + forecast_length] is correctly aligned for forecasting the immediate next forecast_length steps. If you need to forecast with a gap (e.g., predict starting after a horizon skip), adjust the offset accordingly.

Other pitfalls and best practices for time series PyTorch pipelines

Data leakage beyond scaling

  • Don’t use future-derived features (rolling stats that include target future steps) computed on the whole series. Compute any feature/window stats using only past values available at that time.
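For example, a past-only rolling mean feature computed with pandas (a sketch; the 5-step window and the column name position_x_rollmean_5 are arbitrary choices, not from the original code):

python
# shift(1) makes the window at time t use only values up to t-1,
# so the feature never sees the step it will later help predict.
df['position_x_rollmean_5'] = (
    df['position_x'].shift(1).rolling(window=5, min_periods=1).mean()
)
# the very first row has no past values and stays NaN; fill or drop it before scaling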

Timestamps & irregular sampling

  • If your CSV has irregular timestamps or gaps, resample or impute before windowing. Many time-series models assume constant sampling.
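A sketch with pandas, assuming the CSV has a timestamp column named 'timestamp' and a 100 ms target rate (both hypothetical for your data):

python
import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = (
    df.set_index('timestamp')
      .resample('100ms')            # enforce a constant sampling interval
      .mean()
      .interpolate(method='time')   # fill gaps introduced by resampling
)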

NaNs and types

  • Fill or flag NaNs before scaling. Ensure dtype is float32 (torch tensors) to avoid unexpected type promotions.
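A minimal cleanup step before scaling (a sketch, using the question's columns):

python
features = df[['position_x', 'position_y', 'position_z']].astype('float32')
features = features.interpolate().ffill().bfill()  # interior gaps, then leading/trailing NaNs
assert not features.isna().any().any(), "NaNs remain after imputation"
features = features.values  # (T, V) float32 array ready for splitting and scaling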

Batching, hidden state and shuffle

  • If your model is stateful (you carry hidden states across sequential batches), keep shuffle=False and feed contiguous sequences in order. For stateless windowed training, shuffle=True usually improves optimization. You choose based on model design.

Performance tips

  • num_workers>0 and pin_memory=True speed up DataLoader throughput when training on GPU.
  • Precomputing windows (unfold) trades memory for speed; on very large datasets, consider lazy indexing and converting to tensors in __getitem__ (so you don't keep two full copies).

Evaluation strategy

  • Use chronological validation (rolling/expanding-window CV) rather than random KFold for time series.
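sklearn's TimeSeriesSplit provides expanding-window folds out of the box; a sketch that also keeps the fit-the-scaler-on-train-only rule inside each fold:

python
from sklearn.model_selection import TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(features)):
    # indices are chronological: the training block always precedes validation
    train_raw, val_raw = features[train_idx], features[val_idx]
    fold_scaler = StandardScaler().fit(train_raw)   # fit inside the fold only
    train_scaled = fold_scaler.transform(train_raw)
    val_scaled = fold_scaler.transform(val_raw)
    # build TimeSeriesDataset / DataLoader from train_scaled and val_scaled as above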

Loss shape and metrics

  • Match your loss function to the target tensor shape. If your model outputs (B, V, F), compute loss directly; if it outputs (B, F, V) you may need to permute.
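For example, with MSE on (B, V, F) targets (a sketch; model, xb and yb are placeholders for your model and one batch):

python
criterion = torch.nn.MSELoss()
pred = model(xb)                 # ideally (B, V, F), matching yb
if pred.shape != yb.shape:       # e.g. the model emits (B, F, V) instead
    pred = pred.permute(0, 2, 1)
loss = criterion(pred, yb)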

Reproducibility

  • Set seeds for numpy, torch, and any randomness in DataLoader workers; keep deterministic flags if needed.
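A common seeding pattern (a sketch; the generator argument keeps DataLoader shuffling reproducible):

python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

g = torch.Generator()
g.manual_seed(42)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          drop_last=True, generator=g)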

Corrected / improved code examples (split → scale → dataset)

Minimal corrected pipeline (fixes leakage and shows typical DataLoader flags):

python
import numpy as np
import torch
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, Dataset

features = df[['position_x', 'position_y', 'position_z']].values # (T, V)

input_length = 16
forecast_length = 16
test_ratio = 0.2

split_idx = int(len(features) * (1 - test_ratio))
train_raw = features[:split_idx]
test_raw = features[split_idx:]

scaler = StandardScaler().fit(train_raw) # FIT ON TRAIN ONLY
train_scaled = scaler.transform(train_raw)
test_scaled = scaler.transform(test_raw)

class TimeSeriesDataset(Dataset):
    def __init__(self, data, input_length, forecast_length):
        self.data = torch.tensor(data, dtype=torch.float32)  # (T, V)
        self.input_length = input_length
        self.forecast_length = forecast_length
        if len(self.data) < input_length + forecast_length:
            raise ValueError("Not enough time steps for given input/forecast lengths")

    def __len__(self):
        return len(self.data) - self.input_length - self.forecast_length + 1

    def __getitem__(self, idx):
        x = self.data[idx : idx + self.input_length]  # (L, V)
        y = self.data[idx + self.input_length : idx + self.input_length + self.forecast_length]  # (F, V)
        return x.T, y.T  # (V, L), (V, F)

train_dataset = TimeSeriesDataset(train_scaled, input_length, forecast_length)
test_dataset = TimeSeriesDataset(test_scaled, input_length, forecast_length)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, drop_last=True, num_workers=4, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False, drop_last=False, num_workers=2, pin_memory=True)

Efficient precompute using torch.unfold (optional; faster iteration but uses extra memory):

python
arr = torch.tensor(train_scaled, dtype=torch.float32)  # (T, V)
n_windows = len(arr) - input_length - forecast_length + 1

# unfold appends the window dimension last, so each sample already has shape (V, window_length)
x_all = arr.unfold(0, input_length, 1)[:n_windows].contiguous()  # (n_windows, V, L)
y_all = arr.unfold(0, forecast_length, 1)[input_length:input_length + n_windows].contiguous()  # (n_windows, V, F)

# then create a simple dataset wrapping x_all, y_all, e.g. torch.utils.data.TensorDataset(x_all, y_all)

Inverse-transforming multi-step predictions (example):

python
# preds: torch.Tensor with shape (B, V, F)
B, V, F = preds.shape
preds_np = preds.permute(0, 2, 1).reshape(-1, V).cpu().numpy()  # (B*F, V)
preds_unscaled = scaler.inverse_transform(preds_np)  # (B*F, V)
preds_unscaled = preds_unscaled.reshape(B, F, V).transpose(0, 2, 1)  # (B, V, F)

Conclusion

Your Dataset/DataLoader design is correct for producing (B, V, L) inputs for MAMBA if each sample returns a torch.Tensor shaped (V, L). Fix the crucial leakage bug by fitting StandardScaler on the training split only, keep the len formula as you wrote it (with a guard for tiny T), and consider shuffle/drop_last choices plus unfolding for speed. With those changes your Dataset/DataLoader pipeline will be robust for multivariate time series training and evaluation.
