bv.reservoir_sample
Uniform K-sample over the entity's full history via Vitter Algorithm R.
samplesis a required register-time kwarg per V0-MEM-GOV-02.
Signature
bv.reservoir_sample(
field: str,
*,
samples: int, # REQUIRED — register-time kwarg
where: bv.Col | None = None,
) -> AggDescriptor
Description
bv.reservoir_sample returns a uniform random sample of samples values
from the entity's full history of matching events using Vitter Algorithm R
(Vitter, 1985). For the first samples events the reservoir fills directly;
each subsequent event is admitted with probability samples/items_seen,
overwriting a uniformly chosen existing slot. The result is statistically
indistinguishable from sampling samples of items_seen events without
replacement, in a single pass. Use it for "show me 100 representative
transactions across this user's lifetime" or "100 random failed-login
attempts to spot-check" — features that need a uniform sample of the entire
event history without storing every event.
samples is a required keyword argument per
V0-MEM-GOV-02: the lifetime-aggregation
memory contract requires every unbounded-by-default operator to declare a
finite per-entity ceiling at register time. bv.reservoir_sample's ceiling
is exactly samples × sizeof(Value) bytes plus a u64 for items_seen.
The register-time JSON-prelude shim
(pre_check_unbounded_op_in_lifetime_mode) rejects any reservoir_sample
payload missing samples with the structured error code
unbounded_op_in_lifetime_mode. There is no fallback default — picking
samples is a deliberate capacity-planning + statistical-power step.
samples is clamped to ≥ 1 at state construction.
bv.reservoir_sample belongs to the bounded-buffer family. Per-event
update is Tier 3 (~14 ns floor / ~35 ns measured per
cost-class.md) — one Value::clone(), one modulo, one
indexed write. The clone-path variance dominates (Value::Bytes of large
payloads can be expensive — see bv.most_recent_n for
the same caveat).
Determinism: no rand:: dependency. The random index is driven by an
inline xorshift64 PRNG seeded from items_seen XOR'd with the
0x9E37_79B9_7F4A_7C15 golden-ratio constant. The same event sequence
always produces the same reservoir — replay-safe. There is no window=
kwarg in v0 — bv.reservoir_sample is lifetime-only by design (the
algorithm samples from the entire history). For "uniform sample within the
last 30 days", compose with @bv.event(cold_after="30d") per
V0-MEM-GOV-01.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
field |
str |
Yes | — | Name of the field whose values to sample. Any scalar Value type. |
samples |
int |
Yes | — | Reservoir size — number of values to retain. Must be ≥ 1 per V0-MEM-GOV-02 BoundedByRequiredKwarg("samples"). Bounds the per-entity memory ceiling at register time. |
where |
bv.Col |
No | None |
Boolean expression on event fields; only matching events are considered for the reservoir. |
Note: the wire-form params field is named samples, not k, to match
the v0 SDK signature (samples=) and the
V0-MEM-GOV-02 classifier
BoundedByRequiredKwarg("samples").
Returns
A list of up to samples values from the entity's full matching event
history, sampled uniformly. Wire form is Value::List — Python SDK readers
receive a native list. The order of the values within the list is the
arrival order in which they entered the reservoir (not their original
arrival rank), which is implementation-defined and should not be relied
upon. When fewer than samples events have been observed, the list is the
partial reservoir. Cold-start (no events) returns the empty list [] —
never null.
Complexity
| Resource | Bound |
|---|---|
| CPU per event | Tier 3 (~14 ns floor / ~35 ns measured — xorshift PRNG + modulo + one Value::clone()) — see cost-class.md. Clone-path variance: Value::Str is Arc::clone (cheap); Value::Bytes of large payloads can dominate |
| Memory per entity | BoundedByRequiredKwarg("samples") — samples × sizeof(Value) bytes + 1 u64 (items_seen) per Phase 12.8 V0-MEM-GOV-02 |
| Lifetime mode | Required — bv.reservoir_sample has no window= kwarg in v0; lifetime is the only mode (the algorithm samples from the entire history by design) |
Examples
Example 1: 100-sample of lifetime transaction amounts per user
import beava as bv
@bv.event
class Txn:
user_id: str
amount: float
@bv.table(key="user_id")
def UserAmountSample(txn) -> bv.Table:
return (
txn.group_by("user_id")
.agg(amount_sample=bv.reservoir_sample("amount", samples=100))
)
# After 50,000 transactions for "alice":
result = app.get("UserAmountSample", "alice")
# result == {"amount_sample": [12.5, 87.0, 240.0, ...]} # 100 values uniformly chosen
Example 2: 50-sample of failed login IPs
@bv.table(key="account_id")
def AccountFailedLoginIps(logins) -> bv.Table:
return (
logins.group_by("account_id")
.agg(failed_ip_sample=bv.reservoir_sample("ip_address",
samples=50,
where=bv.col("status") == "failed"))
)
Wire
JSON wire form in a register payload:
{
"kind": "derivation",
"name": "UserAmountSample",
"output_kind": "table",
"key": ["user_id"],
"agg": {
"amount_sample": {
"op": "reservoir_sample",
"params": {
"field": "amount",
"samples": 100
}
}
}
}
See examples/wire/register-fraud-team.request.json for a full payload example.
Edge cases
samplesmissing at register time: rejected with structured error codeunbounded_op_in_lifetime_modeper V0-MEM-GOV-02. The JSON-prelude shim catches this before any state is allocated.samples=0or negativesamples: clamped to1at state construction (samples.max(1)), but the SDK helper rejects pre-wire withaggregation_invalid_param.- Fewer than
samplesevents seen: returns the partial reservoir (e.g.[v1, v2, v3]after 3 events whensamples=100). - Empty stream / cold-start: returns
[](empty list) — nevernull. - Null source field (
Value::Null): events whosefieldisnullare skipped and do not count towarditems_seen(the reservoir's denominator). - Missing source field: events without
fieldare skipped — does not advanceitems_seen. where=filter excludes everything: returns[]until matching events arrive.window=kwarg attempted: raisesTypeErrorat SDK-helper-call time. The algorithm requires the entire history; for "uniform sample over the last N days" use@bv.event(cold_after="...")to bound the lifetime via per-entity TTL.- Determinism guarantee: the xorshift PRNG is seeded from
items_seenXOR'd with a golden-ratio constant — no calls torand::or wall-clock — so a snapshot + WAL replay reconstructs the same reservoir contents. This makesreservoir_samplesafe for replay-determinism contracts. - Sampling-quality guarantee: Algorithm R (Vitter, 1985) is provably uniform — each of the
items_seenevents has equal probabilitysamples/items_seenof appearing in the reservoir. - Large
Value::Bytescost: the per-event admission clones the value into the reservoir; for high-throughput workloads with large payloads, sample a derived id (hash, summary) rather than the raw bytes. - Out-of-order event-time: does not matter. beava is processing-time-only per
project_redis_shaped_no_event_time_ever; admission probability is governed by server arrival order viaitems_seen. - Lifetime mode: the only mode. Per-entity ceiling is
samples × sizeof(Value)bytes per V0-MEM-GOV-02 BoundedByRequiredKwarg("samples").
See also
- cost-class.md — performance tier (Tier 3)
- bv.most_recent_n — recency sibling (last
nevents vs. uniform sample over all events; sameBoundedByRequiredKwargpattern, different kwarg name) - bv.first_n — first-N companion (locks the first
nmatching values; never rotates) - bv.last_n — last-N companion (point/ordinal family — chooses between by your traceability bucket)
- bv.top_k — frequency-weighted-sample companion (top-K by count, not uniform random)
- V0-MEM-GOV-02 —
BoundedByRequiredKwargmemory governance contract - pipeline-dsl/compilation-rules.md — chain compilation rules