bv.entropy
Shannon entropy over a categorical-distribution field.
Signature
bv.entropy(
    field: str,
    *,
    window: str | None = None,
    where: bv.Col | None = None,
    max_categories: int = 256,
) -> AggDescriptor
Description
bv.entropy returns the Shannon entropy (log₂ base) of the categorical
distribution of a field across events that match the optional where=
predicate. State is a per-category frequency table capped at max_categories
distinct keys; once full, the cap-and-drop policy keeps the most-frequent
categories and discards the tail (Phase 19.2-06 D-05a).
Entropy ranges from 0.0 (degenerate distribution — every event has the
same value) to log₂(K) for K equally-likely categories. It quantifies how
diverse / unpredictable the field's values are for this entity. Use
bv.entropy("merchant", window="24h") for "how diverse is this user's
merchant mix today?" or bv.entropy("user_agent") for "how varied are the
client UA strings ever seen for this account?".
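The range described above can be checked with a few lines of plain Python. This is a minimal sketch of the textbook Shannon formula applied to a frequency table, not beava's implementation; shannon_entropy_bits is a hypothetical helper name, and returning None for an empty table simply mirrors the null result documented for entities with no matching events.

```python
import math
from collections import Counter


def shannon_entropy_bits(counts):
    """Shannon entropy (base 2) of a category -> count mapping.

    Hypothetical helper for illustration; returns None for an empty table.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Degenerate distribution: every event has the same value.
same = Counter(["amazon"] * 5)

# K = 4 equally likely categories: entropy reaches log2(4) = 2 bits.
uniform = Counter(["a", "b", "c", "d"])

# Skewed mix: half the events in one category, a quarter in each of two others.
skewed = Counter(["amazon", "amazon", "starbucks", "uber"])

print(shannon_entropy_bits(same))     # 0.0
print(shannon_entropy_bits(uniform))  # 2.0
print(shannon_entropy_bits(skewed))   # 1.5
```

The skewed case sits between the two extremes: three categories are present, but the result stays below log₂(3) because the mass is not spread evenly.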
bv.entropy belongs to the sketch family. Its memory class is
BoundedByConfig("max_categories", 256) per Phase 12.8 V0-MEM-GOV-02: the
per-entity memory ceiling is declared at register time via the
max_categories kwarg. Each per-event update is a single BTreeMap key insert,
which places the op in Tier 3 (~60 ns floor / ~160 ns measured); string-key
allocation is the irreducible per-event cost.
Parameters
| Name | Type | Required | Default | Description |
|---|---|---|---|---|
| field | str | Yes | — | Name of the categorical field to compute entropy over. Any hashable type (str, i64, f64). |
| window | str | No | None (lifetime) | Duration string matching \d+(ms\|s\|m\|h\|d) or "forever". |
| where | bv.Col | No | None | Boolean expression on event fields; only matching events contribute. |
| max_categories | int | No | 256 | Cap on distinct categories retained per entity. Memory bounded by O(max_categories) BTreeMap entries. Phase 12.8 BoundedByConfig ceiling. |
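The window format in the table can be pre-checked on the client before a register call. The sketch below is an illustration only: is_valid_window is a hypothetical helper, and it assumes the server applies the documented pattern as a full-string match; the register-time validation remains the authority.

```python
import re

# Pattern copied from the parameter table; "forever" is the documented alternative.
_WINDOW_RE = re.compile(r"\d+(ms|s|m|h|d)")


def is_valid_window(window):
    """Hypothetical client-side check; None means lifetime mode."""
    if window is None or window == "forever":
        return True
    return _WINDOW_RE.fullmatch(window) is not None


for candidate in ["24h", "500ms", "forever", None, "1w", "h", "24 h"]:
    print(repr(candidate), is_valid_window(candidate))
```

Using fullmatch rather than match matters here: it rejects strings with trailing text such as "24hours", which a prefix match would accept.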
Returns
A single f64 in [0, log₂(max_categories)]. When the entity has seen zero
matching events, the result is null (Python None).
Complexity
| Resource | Bound |
|---|---|
| CPU per event | Tier 3 (~60 ns algorithm floor / ~160 ns measured — BTreeMap key insert + cap-and-drop) — see cost-class.md |
| Memory per entity | BoundedByConfig("max_categories", 256) per Phase 12.8 V0-MEM-GOV-02 — BTreeMap of size ≤ max_categories |
| Lifetime mode (window=None) | Allowed; BoundedByConfig declares the per-entity ceiling at register time |
Examples
Example 1: Per-user merchant diversity, daily
import beava as bv

@bv.event
class Txn:
    user_id: str
    merchant: str
    amount: float

@bv.table(key="user_id")
def UserMerchantDiversity(txn) -> bv.Table:
    return (
        txn.group_by("user_id")
        .agg(merchant_entropy_24h=bv.entropy("merchant", window="24h"))
    )

# Push events. `app` is assumed to be an application handle created elsewhere;
# its construction is not shown on this page.
app.push("Txn", {"user_id": "alice", "merchant": "amazon", "amount": 50.0})
app.push("Txn", {"user_id": "alice", "merchant": "amazon", "amount": 20.0})
app.push("Txn", {"user_id": "alice", "merchant": "starbucks", "amount": 5.0})
app.push("Txn", {"user_id": "alice", "merchant": "uber", "amount": 12.0})

# Query
result = app.get("UserMerchantDiversity", "alice")
# result == {"merchant_entropy_24h": ~1.5}  # 2/4 amazon, 1/4 starbucks, 1/4 uber
Example 2: Lifetime user-agent entropy with a tighter cap
@bv.table(key="account_id")
def UaDiversity(reqs) -> bv.Table:
    return (
        reqs.group_by("account_id")
        .agg(ua_entropy=bv.entropy("user_agent", max_categories=64))
    )
Wire
JSON wire form in a register payload:
{
  "kind": "derivation",
  "name": "UserMerchantDiversity",
  "output_kind": "table",
  "key": ["user_id"],
  "agg": {
    "merchant_entropy_24h": {
      "op": "entropy",
      "params": {
        "field": "merchant",
        "window": "24h",
        "max_categories": 256
      }
    }
  }
}
See examples/wire/register-fraud-team.request.json for a full payload example.
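For readers assembling the payload by hand, the following sketch builds the same JSON with the standard library. entropy_agg is a hypothetical helper, not part of beava; the field names come from the wire form above, and the local max_categories check only anticipates the register-time rejection of 0 noted under Edge cases.

```python
import json


def entropy_agg(field, window, max_categories=256):
    """Hypothetical helper: build one "entropy" aggregation entry."""
    if max_categories < 1:
        # The edge-case list says max_categories=0 is rejected at register time.
        raise ValueError("max_categories must be at least 1")
    return {
        "op": "entropy",
        "params": {
            "field": field,
            "window": window,
            "max_categories": max_categories,
        },
    }


payload = {
    "kind": "derivation",
    "name": "UserMerchantDiversity",
    "output_kind": "table",
    "key": ["user_id"],
    "agg": {"merchant_entropy_24h": entropy_agg("merchant", "24h")},
}

print(json.dumps(payload, indent=2))
```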
Edge cases
- Empty stream / cold-start: result is null. No events means no entropy is defined.
- Single category (degenerate distribution): result is 0.0, i.e. perfectly predictable.
- max_categories exceeded: the cap-and-drop policy keeps the most-frequent categories; the entropy estimate is biased low because probability mass is concentrated into the retained categories. For high-cardinality fields, raise max_categories cautiously, since memory grows linearly.
- Field type: str, i64, and f64 are all hashable. Non-hashable types fail at register time with schema_mismatch.
- NaN inputs: treated as a single distinct category (NaN equals itself in the BTreeMap key); for cleaner semantics, filter with where=~bv.col("field").isnull().
- max_categories set to 0: rejected at register time with aggregation_invalid_param.
- Lifetime mode (window=None): explicitly allowed. BoundedByConfig("max_categories", 256) declares the per-entity ceiling at register time per V0-MEM-GOV-02.
- Quadkey-for-geo recipe: the recommended replacement for the deleted bv.geo_entropy op (Phase 19.2) is bv.entropy(quadkey(lat, lon, zoom), max_categories=1024). The quadkey(...) expression produces a deterministic integer cell id at apply time for entropy to bin.
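The low bias noted for an exceeded max_categories can be seen offline with synthetic data. This sketch is an assumption-laden stand-in: it truncates a finished frequency table to its top 8 entries with Counter.most_common, whereas the engine's cap-and-drop policy runs incrementally and its exact eviction order is not specified on this page. The category names and counts are made up for illustration.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (base 2) of a category -> count mapping."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


# Synthetic high-cardinality field: 4 heavy categories plus a tail of 60 rare ones.
full = Counter({f"heavy_{i}": 100 for i in range(4)})
full.update({f"rare_{i}": 1 for i in range(60)})

# Offline stand-in for the cap: keep only the 8 most frequent categories.
capped = Counter(dict(full.most_common(8)))

print(f"full:   {entropy_bits(full):.3f} bits over {len(full)} categories")
print(f"capped: {entropy_bits(capped):.3f} bits over {len(capped)} categories")
```

The capped table can never report more than log₂(8) = 3 bits, so the longer the discarded tail, the larger the gap between the two figures.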
See also
- cost-class.md: performance tier (Tier 3)
- bv.top_k: heavy-hitters companion (same BoundedByConfig pattern)
- bv.n_unique: cardinality companion
- bv.event_type_mix: proportion-per-category companion (also BoundedByConfig)
- bv.quantile: order-statistics companion
- pipeline-dsl/compilation-rules.md: chain compilation rules