
Polars Rust: Upsample Time-Series Gaps to 5m Intervals

Learn to upsample time-series gaps in Polars Rust to exact 5-minute intervals using date_range, vstack, and forward fill. Preserve non-aligned timestamps like 00:05:17 without replacement. Rust code examples for sensor data.


How can I upsample time-series gaps in Polars (Rust) to 5-minute intervals without removing or replacing existing rows that have non-aligned seconds?

I have a Polars DataFrame like:

text
┌───────────┬─────────────────────┬───────────┐
│ sensor_id ┆ ts                  ┆ value     │
│ ---       ┆ ---                 ┆ ---       │
│ i32       ┆ datetime[ms]        ┆ f64       │
╞═══════════╪═════════════════════╪═══════════╡
│ 1551      ┆ 2025-12-25 00:00:06 ┆ -1.464e6  │
│ 1551      ┆ 2025-12-25 00:05:17 ┆ -1.4639e6 │
│ 1551      ┆ 2025-12-25 00:15:43 ┆ -1.4638e6 │
│ 1551      ┆ 2025-12-25 00:45:51 ┆ -1.4637e6 │
│ 1551      ┆ 2025-12-25 01:45:06 ┆ -1.4636e6 │
│ 1551      ┆ 2025-12-25 03:45:17 ┆ -1.4635e6 │
│ 1551      ┆ 2025-12-25 03:50:43 ┆ -1.4634e6 │
│ 1551      ┆ 2025-12-25 04:00:51 ┆ -1.4633e6 │
│ 1551      ┆ 2025-12-25 04:30:06 ┆ -1.4632e6 │
│ 1551      ┆ 2025-12-25 05:30:17 ┆ -1.4631e6 │
└───────────┴─────────────────────┴───────────┘

I tried:

rust
df = df.upsample(["sensor_id"], "ts", polars::time::Duration::parse("5m")).unwrap();
dbg!(df.clone());

but that produces:

text
┌─────────────────────┬───────────┬──────────┐
│ ts                  ┆ sensor_id ┆ value    │
│ ---                 ┆ ---       ┆ ---      │
│ datetime[ms]        ┆ i32       ┆ f64      │
╞═════════════════════╪═══════════╪══════════╡
│ 2025-12-25 00:00:06 ┆ 1551      ┆ -1.464e6 │
│ 2025-12-25 00:05:06 ┆ null      ┆ null     │
│ 2025-12-25 00:10:06 ┆ null      ┆ null     │
│ 2025-12-25 00:15:06 ┆ null      ┆ null     │
│ 2025-12-25 00:20:06 ┆ null      ┆ null     │
│ …                   ┆ …         ┆ …        │
│ 2025-12-25 05:10:06 ┆ null      ┆ null     │
│ 2025-12-25 05:15:06 ┆ null      ┆ null     │
│ 2025-12-25 05:20:06 ┆ null      ┆ null     │
│ 2025-12-25 05:25:06 ┆ null      ┆ null     │
│ 2025-12-25 05:30:06 ┆ null      ┆ null     │
└─────────────────────┴───────────┴──────────┘

Is there any way to add rows at 5-minute intervals for the gaps but not remove or replace values already present with different seconds (e.g., keep 2025-12-25 00:05:17)?

I have a function that produces the test frame:

rust
use chrono::{NaiveDateTime, Timelike};
use polars::prelude::*;

pub(crate) fn create_test_dataframe_many_gaps() -> DataFrame {
    let start =
        NaiveDateTime::parse_from_str("2025-12-25 00:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
    let end =
        NaiveDateTime::parse_from_str("2025-12-25 06:00:00", "%Y-%m-%d %H:%M:%S").unwrap();

    let mut df = DataFrame::empty();

    // deterministic irregular seconds
    let second_offsets = [6, 17, 43, 51];

    // deterministic minute jumps (forces many missing buckets)
    let minute_jumps = [5, 10, 30, 60, 120];

    for sensor_id in [1551] {
        let mut sensor_ids = Vec::new();
        let mut ts_values = Vec::new();
        let mut values = Vec::new();

        let mut current = start;
        let mut idx = 0usize;

        while current <= end {
            let ts = current
                .with_second(second_offsets[idx % second_offsets.len()])
                .unwrap();

            sensor_ids.push(sensor_id);
            ts_values.push(ts);

            // simple monotonic value (easy to assert fill)
            values.push(-1_464_000.0 + (idx as f64 * 100.0));

            // 🔥 jump forward by large, varying gaps
            let jump = minute_jumps[idx % minute_jumps.len()];
            current += chrono::Duration::minutes(jump);
            idx += 1;
        }

        let sensor_df = DataFrame::new(vec![
            Column::from(Series::new(PlSmallStr::from_str("sensor_id"), sensor_ids)),
            Column::from(Series::new(PlSmallStr::from_str("ts"), ts_values)),
            Column::from(Series::new(PlSmallStr::from_str("value"), values)),
        ])
        .unwrap();

        df = df.vstack(&sensor_df).unwrap();
    }

    df
}

What is the recommended Polars (Rust) pattern to fill missing 5-minute buckets while preserving original timestamped rows? Are there examples using upsample parameters, joins, date_range + concat, asof_join, or other approaches to achieve this?

Use a 5‑minute date_range skeleton plus concat/vstack with your original DataFrame, then sort per sensor and forward‑fill only the new (null) rows. That adds exact 5‑minute buckets while keeping original non‑aligned timestamps like 2025-12-25 00:05:17. In Rust Polars the canonical pattern is: generate a DatetimeChunked via date_range, build a skeleton DataFrame with null values, vstack it with your real data, sort by sensor_id + ts, and forward‑fill group‑wise (fill_null with FillNullStrategy::Forward). Alternatives such as asof_join and upsample/upsample_stable exist, but upsample anchors to the first timestamp's seconds per group (which is why you saw 00:05:06); the skeleton approach avoids that.


Polars Rust pattern: date_range + vstack + forward fill

Goal: add exact 5‑minute anchor rows (00:00:00, 00:05:00, 00:10:00, …) for any gaps while leaving rows with non‑aligned seconds untouched (00:05:17 remains).

Rationale in plain steps

  • Generate a 5‑minute grid (DatetimeChunked) from the desired start to end using polars’ date_range. See the date_range API: https://docs.rs/polars/latest/polars_time/fn.date_range.html.
  • Build a skeleton DataFrame whose ts column is that grid and whose value column is null (Option::<f64>::None). Include the same group key(s) (sensor_id).
  • Concatenate (vstack) the skeleton with your original DataFrame (same schema).
  • Sort by sensor_id then ts.
  • Forward‑fill only the nulls (fill_null strategy = Forward) in each sensor group: the original rows are non‑null and remain unchanged — the skeleton rows pick up the most recent known value.
  • Optionally remove skeleton rows that duplicate an original ts (e.g., if an original happened to land exactly on the 5‑minute mark) by sorting so originals come first and dropping duplicate (sensor_id, ts) pairs.

Why this beats naive upsample in your case

  • upsample in Polars constructs intervals anchored to the first timestamp for each group; that is why your output used 00:00:06, 00:05:06, … — upsample aligns on the first row’s seconds. For full control of anchor seconds and to preserve non‑aligned rows, the date_range + concat + forward_fill approach is the recommended pattern (discussed in the Polars docs and issues) — see the official resampling guide: https://docs.pola.rs/user-guide/transformations/time-series/resampling/ and the date_range API above.

Example: Polars (Rust) code — single sensor (concrete)

Below is a concrete Rust example that follows the pattern. It uses your known start/end window (as in your test generator). This example adds a boolean “skeleton” flag so you can later prefer original rows when deduping.

Note: adapt imports and minor API names to your polars version; the pattern and APIs used below are the documented building blocks (see date_range, upsample docs and join docs linked in Sources).

rust
use polars::prelude::*;
use polars::time::{date_range, Duration};
use chrono::NaiveDateTime;

// Assumes create_test_dataframe_many_gaps() is available in scope and returns
// a DataFrame with columns: ["sensor_id" (i32), "ts" (datetime[ms]), "value" (f64)].
pub fn add_5m_buckets_preserve_original() -> PolarsResult<DataFrame> {
    let mut df = create_test_dataframe_many_gaps();

    // Build a skeleton 5-minute grid (same start/end you used for the test data)
    let start = NaiveDateTime::parse_from_str("2025-12-25 00:00:00", "%Y-%m-%d %H:%M:%S").unwrap();
    let end = NaiveDateTime::parse_from_str("2025-12-25 06:00:00", "%Y-%m-%d %H:%M:%S").unwrap();

    let dt_chunked = date_range(
        "ts",
        start,
        end,
        Duration::parse("5m"),
        ClosedWindow::Both,
        TimeUnit::Milliseconds,
        None, // timezone
    )?; // DatetimeChunked

    let n = dt_chunked.len();
    // single-sensor example: sensor_id = 1551
    let sensor_col = Series::new("sensor_id", vec![1551i32; n]);
    let ts_col = dt_chunked.into_series();
    // null values for skeleton rows
    let value_col = Series::new("value", vec![Option::<f64>::None; n]);
    let mut skeleton = DataFrame::new(vec![sensor_col, ts_col, value_col])?;

    // Mark original rows as not skeleton, skeleton rows as true.
    // hstack_mut appends the new column in place.
    df.hstack_mut(&[Series::new("skeleton", vec![false; df.height()])])?;
    skeleton.hstack_mut(&[Series::new("skeleton", vec![true; n])])?;

    // vstack the original data and the skeleton grid
    let union = df.vstack(&skeleton)?; // now contains both kinds of rows

    // Group-wise: sort by ts and forward-fill nulls (fills skeleton rows only).
    // GroupBy::apply lets us operate per group.
    let filled = union
        .groupby(&["sensor_id"])?
        .apply(|mut g| {
            // sort the group by timestamp (ascending)
            g = g.sort("ts", false)?;
            // forward-fill nulls in all columns (value is filled; skeleton stays set)
            g = g.fill_null(FillNullStrategy::Forward)?;
            Ok(g)
        })?;

    // Optional: prefer original rows when ts duplicates occur:
    // sort so skeleton == false (original) comes first, then drop duplicate
    // (sensor_id, ts) pairs with unique / unique_stable (UniqueKeepStrategy::First).
    // Example (pseudocode — adapt to your version):
    // let out = filled
    //     .lazy()
    //     .sort_by_exprs(vec![col("sensor_id"), col("ts"), col("skeleton")], vec![false, false, false])
    //     .collect()?;
    // let out = out.unique_stable(Some(&["sensor_id".into(), "ts".into()]), UniqueKeepStrategy::First, None)?;

    Ok(filled)
}

This will produce:

  • exact 5‑minute anchor rows (00:00:00, 00:05:00, 00:10:00, …) with values carried forward, and
  • original rows such as 00:05:17 preserved (they were not overwritten).

See the date_range docs for the exact signature used above: https://docs.rs/polars/latest/polars_time/fn.date_range.html.
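
The commented dedup step at the end of the example can be written out concretely. A minimal sketch, assuming a polars version that exposes LazyFrame::sort_by_exprs with SortMultipleOptions and DataFrame::unique_stable (adapt names to your release); the helper name dedupe_prefer_original is made up for illustration, and filled plus the boolean skeleton flag come from the example above:

rust
use polars::prelude::*;

// Prefer original rows over skeleton rows whenever a skeleton timestamp
// collides with an original one. Assumes `filled` has the columns
// sensor_id, ts, value, skeleton (bool) built in the example above.
fn dedupe_prefer_original(filled: DataFrame) -> PolarsResult<DataFrame> {
    // skeleton == false sorts before true, so for any duplicated
    // (sensor_id, ts) pair the original row comes first
    let sorted = filled
        .lazy()
        .sort_by_exprs(
            [col("sensor_id"), col("ts"), col("skeleton")],
            SortMultipleOptions::default(),
        )
        .collect()?;
    // keep the first row per (sensor_id, ts): the original when both exist
    sorted.unique_stable(
        Some(&["sensor_id".to_string(), "ts".to_string()]),
        UniqueKeepStrategy::First,
        None,
    )
}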


Generalization: multiple sensors and production hints

If you have many sensor_ids with different coverage windows, generate a per‑sensor skeleton:

  • Compute per‑sensor min/max timestamps (groupby sensor_id → min(ts), max(ts)).
  • For each sensor_id, call date_range(min, max, 5m) and build a per‑sensor skeleton DataFrame.
  • vstack all skeletons (or build a single skeleton by repeating sensor_id for each sensor).
  • Proceed to concat, sort, and group-wise forward-fill.

Pseudocode (a concrete Rust sketch follows below):

  • bounds = df.groupby("sensor_id").agg(min_ts=min("ts"), max_ts=max("ts"))
  • for each row of bounds:
      • ts = date_range(min_ts, max_ts, 5m)
      • build the skeleton for this sensor
      • push it into a vec
  • skeleton_all = concat(skeleton_vec)
  • union = df.vstack(&skeleton_all)
  • group by sensor → sort by ts → fill_null(Forward)
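
A minimal Rust sketch of that loop, under the same assumptions as the single-sensor example above (lazy and temporal features enabled; adapt group_by/groupby naming and Series/Column wrapping to your polars version). The helper name build_skeletons is made up for illustration:

rust
use polars::prelude::*;
use polars::time::{date_range, Duration};

// Build one 5-minute skeleton per sensor, bounded by that sensor's
// observed min/max ts, and vstack them into a single frame.
pub fn build_skeletons(df: &DataFrame) -> PolarsResult<DataFrame> {
    // bounds: one row per sensor_id with min(ts) / max(ts)
    let bounds = df
        .clone()
        .lazy()
        .group_by([col("sensor_id")])
        .agg([
            col("ts").min().alias("min_ts"),
            col("ts").max().alias("max_ts"),
        ])
        .collect()?;

    let ids = bounds.column("sensor_id")?.i32()?;
    let mins = bounds.column("min_ts")?.datetime()?.as_datetime_iter();
    let maxs = bounds.column("max_ts")?.datetime()?.as_datetime_iter();

    let mut skeleton_all = DataFrame::empty();
    for ((id, min_ts), max_ts) in ids.into_iter().zip(mins).zip(maxs) {
        let (Some(id), Some(min_ts), Some(max_ts)) = (id, min_ts, max_ts) else {
            continue; // skip sensors without valid timestamps
        };
        // exact 5-minute grid for this sensor's observed window
        let grid = date_range(
            "ts",
            min_ts,
            max_ts,
            Duration::parse("5m"),
            ClosedWindow::Both,
            TimeUnit::Milliseconds,
            None,
        )?;
        let n = grid.len();
        let sensor_df = DataFrame::new(vec![
            Series::new("sensor_id", vec![id; n]),
            grid.into_series(),
            Series::new("value", vec![Option::<f64>::None; n]),
        ])?;
        // vstack onto the running result (vstack on an empty frame adopts the schema)
        skeleton_all.vstack_mut(&sensor_df)?;
    }
    Ok(skeleton_all)
}

From here, vstack the result with the original data and sort/forward-fill exactly as in the single-sensor example.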

When data is large:

  • Prefer lazy execution (LazyFrame), avoid .apply() where possible.
  • If you must use apply and your groups are many, keep the per-group operations minimal.
  • As a faster alternative to per-group apply, consider generating skeletons and using join_asof (see next section) in lazy mode.

For an example community pattern (date_range + vstack + forward fill) see this Polars GitHub issue which demonstrates the approach: https://github.com/pola-rs/polars/issues/1555.


Alternatives: upsample, upsample_stable, asof_join, group_by_dynamic

  • upsample / upsample_stable

      • upsample will create regularly spaced rows but anchors to the time component of the first timestamp per group — that’s why your attempt produced times like 00:05:06. The stable variant preserves original rows while adding interval rows, but it still uses the computed anchor. See the Polars upsample implementation notes: https://docs.pola.rs/api/rust/dev/src/polars_time/upsample.rs.html and the resampling guide: https://docs.pola.rs/user-guide/transformations/time-series/resampling/.

      • You can sometimes re-anchor by first flooring timestamps to a 5m grid, calling upsample, then re-merging originals — but that’s awkward and error-prone.

  • asof_join (recommended alternative for large tables)

      • Create a skeleton DataFrame with exact 5m timestamps (date_range) for each sensor, then asof_join the original data to the skeleton with strategy = backward. That fills each skeleton row with the most recent value from the original stream without changing original rows (a sketch follows below). The Polars joins guide has the asof_join explanation and use cases: https://docs.pola.rs/user-guide/transformations/joins/.

      • After an asof_join you typically union (vstack) the original rows back if you want both the skeleton rows and the original timestamp rows in one table.
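
A minimal sketch of that flow, assuming the asof_join feature, a skeleton frame holding only sensor_id and ts columns (so the joined value column lines up with the original schema), and the join_asof_by parameter list of recent versions (it varies between releases; adapt to yours). The helper name fill_via_asof is made up for illustration:

rust
use polars::prelude::*;

// Fill each exact-5m skeleton row with the most recent original value at or
// before it, per sensor, then union the original rows back in.
fn fill_via_asof(skeleton: &DataFrame, original: &DataFrame) -> PolarsResult<DataFrame> {
    // Backward strategy: each skeleton row takes the last original `value`
    // with ts <= skeleton ts, matched within the same sensor_id group.
    let filled = skeleton.join_asof_by(
        original,
        "ts",          // left_on
        "ts",          // right_on
        ["sensor_id"], // left_by
        ["sensor_id"], // right_by
        AsofStrategy::Backward,
        None,          // no tolerance: carry the last value forward indefinitely
    )?;
    // Union the original rows back in so both the exact grid rows and the
    // non-aligned timestamps appear in one table; sort by sensor_id + ts
    // afterwards, as in the main example.
    filled.vstack(original)
}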

  • group_by_dynamic

      • If your use case is windowing rather than exact buckets you can use group_by_dynamic with every = "5m". It can insert implicit gaps as nulls which you then fill — this is useful for rolling/window-based aggregations (a minimal sketch follows below). See the rolling/grouping docs: https://docs.pola.rs/user-guide/transformations/time-series/rolling/.
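
For completeness, a minimal windowing sketch. This assumes LazyFrame::group_by_dynamic and a DynamicGroupOptions struct with index_column/every/period fields plus a Default impl (older releases spell it groupby_dynamic, and field names may differ). It aggregates into 5-minute buckets rather than inserting exact grid rows:

rust
use polars::prelude::*;
use polars::time::Duration;

// Mean value per sensor per 5-minute window; ts must be sorted.
fn mean_per_5m_window(df: DataFrame) -> PolarsResult<DataFrame> {
    df.lazy()
        .group_by_dynamic(
            col("ts"),          // index column
            [col("sensor_id")], // additional group keys
            DynamicGroupOptions {
                index_column: "ts".into(),
                every: Duration::parse("5m"),
                period: Duration::parse("5m"),
                ..Default::default()
            },
        )
        .agg([col("value").mean().alias("value_mean")])
        .collect()
}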

  • Community posts / tutorials

      • Practical join/concat patterns are described in posts like this Rho Signal article on gap filling: https://www.rhosignal.com/posts/filling-gaps-lazy-mode/.


Performance notes & pitfalls

  • Memory: generating a huge skeleton for a long time range and millions of sensors can be heavy. Prefer per-sensor date_range limited to observed min/max or use a streaming/lazy approach.
  • groupby.apply is flexible but can be slower than vectorized LazyFrame operations or join_asof on large data. If performance is critical, implement the skeleton + asof_join path in lazy mode.
  • Time unit / timezone: be explicit about the TimeUnit you pass to date_range (milliseconds here, since your ts is datetime[ms]). Mismatched units create unexpected stepping.
  • Duplicate timestamps: if a skeleton timestamp exactly matches an original timestamp you’ll end up with two rows unless you drop duplicates. Add a small “skeleton” boolean flag when building the skeleton to allow deterministic deduplication (prefer original rows).
  • Upsample alignment: upsample behavior (anchoring) is the usual cause of the “00:05:06” result; prefer date_range if you need anchors at :00 seconds.

Sources

  1. Polars user guide — Resampling / upsample
  2. upsample.rs (Polars source) — PolarsUpsample trait
  3. Stack Overflow — Resample time series using Polars in Rust (example patterns)
  4. Polars GitHub issue: How to fill missing dates in time series · Issue #1555
  5. Polars user guide — Grouping / Rolling / Dynamic windows
  6. Rho Signal — Filling time series gaps (lazy mode)
  7. Polars prelude: PolarsUpsample docs (API)
  8. Polars user guide — Joins / asof_join
  9. polars_time::date_range documentation (Rust docs.rs)

Conclusion

To upsample to exact 5‑minute buckets while preserving non‑aligned rows in Polars (Rust), generate a 5‑minute date_range skeleton, vstack/concat it with your real data, sort by sensor + ts, and forward‑fill nulls per group. This date_range + vstack + fill_null pattern keeps rows like 2025-12-25 00:05:17 untouched and adds the exact grid rows you want; use asof_join or lazy execution as optimizations for larger datasets.
