Overview
Schemas define the structure and types of event payloads. They are the foundation of the MAPS data pipeline.
1. Why Schemas Matter
- Validation: Ensures every message matches expected structure.
- Transformation: Enables field-level mapping and enrichment.
- Filtering: Makes expressions type-aware and accurate.
- Statistics: Allows numeric and string fields to be aggregated correctly.
Without a schema, payloads are opaque and only minimal inference is possible.
2. Supported Formats
| Name | Description |
|---|---|
| AVRO | Apache Avro |
| CBC | Compact Binary Schema (fixed layout) |
| CBOR | Concise Binary Object Representation |
| CSV | Comma separated values |
| JSON | JavaScript Object Notation |
| MessagePack | Binary JSON-like representation |
| Native | Single Java scalar value (int, long, double, String) |
| ProtoBuf | Google Protocol Buffers |
| RAW | Opaque bytes, no structure |
| XML | Extensible Markup Language |
3. Schema Registration
Schemas are stored in the Schema Repository and are bound to contexts, usually topic names. When an event arrives, the server determines which schema applies and uses it to parse, validate, and type the payload.
How Registration Works
- Each schema has a unique ID
  Identifies a concrete schema definition (JSON, Protobuf, Avro, CBC, etc.).
- Schemas are bound to a context
  A context may be:
  - a topic
  - a topic pattern
  - a schema reference inside a processor or transformation
- Multiple versions per context
  MAPS stores all versions. The active one may be:
  - latest
  - pinned ID
  - version chosen by a transformation pipeline
- Fallback behaviour
  If no schema is registered:
  - payload becomes RAW
  - no structural validation
  - stats and transformations run best-effort only
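The binding rules above can be sketched as a small in-memory registry. This is an illustration only, not the MAPS API: the `SchemaRegistry` class, its method names, and the use of glob-style topic patterns are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import Optional


@dataclass
class SchemaRegistry:
    """Hypothetical in-memory stand-in for the Schema Repository."""
    # context (topic or topic pattern) -> schema IDs, oldest first
    versions: dict = field(default_factory=dict)
    # context -> pinned schema ID, if any
    pinned: dict = field(default_factory=dict)

    def register(self, context: str, schema_id: str) -> None:
        # Every registered version is kept; the newest is last.
        self.versions.setdefault(context, []).append(schema_id)

    def pin(self, context: str, schema_id: str) -> None:
        if schema_id not in self.versions.get(context, []):
            raise KeyError(f"{schema_id} is not registered for {context}")
        self.pinned[context] = schema_id

    def resolve(self, topic: str) -> Optional[str]:
        # An exact topic binding wins over a pattern binding.
        if topic in self.versions:
            context = topic
        else:
            # Assumption: patterns use glob syntax, e.g. "sensors/*".
            context = next(
                (c for c in self.versions if fnmatchcase(topic, c)), None
            )
        if context is None:
            return None  # no schema: the caller treats the payload as RAW
        # A pinned ID overrides "latest".
        return self.pinned.get(context, self.versions[context][-1])
```

With two versions registered under `sensors/*`, `resolve("sensors/a")` returns the latest ID until one is pinned, and an unbound topic resolves to `None`, which corresponds to the RAW fallback.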
What the Server Does During Registration
- Parse & validate the schema
  - JSON validated against internal rules
  - Avro / Protobuf / IDL correctness checks
  - CBC validated for layout integrity
- Store the schema in the Schema Repository as an internal unified model.
- Create a context binding
  Mapping: topic/pattern → schema ID
- Update dependent subsystems
  - caches
  - processing engines
  - transform pipelines
  - statistics processors
  - format conversion logic
During Event Processing
- Resolve schema based on topic
- Load schema (cached)
- Decode payload into a Typed Event
- Run processing pipeline
Typed events allow MAPS to:
- evaluate expressions with correct typing
- compute numeric/string stats
- run transformations
- convert formats safely
Without a schema, payloads are raw bytes and only limited processing is possible.
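As a rough sketch of the resolve, load, and decode steps for a JSON payload: the `SCHEMAS` and `BINDINGS` tables, the `decode` function, and the shape of the returned event are hypothetical and stand in for whatever MAPS does internally.

```python
import json
from functools import lru_cache

# Hypothetical schema store: schema ID -> {field name: expected Python type}
SCHEMAS = {"temp-v1": {"sensor": str, "celsius": float}}
# Hypothetical context bindings: topic -> schema ID
BINDINGS = {"sensors/temp": "temp-v1"}


@lru_cache(maxsize=128)
def load_schema(schema_id: str):
    # Cached so repeated events on the same topic do not reload the schema.
    return tuple(SCHEMAS[schema_id].items())


def decode(topic: str, payload: bytes) -> dict:
    schema_id = BINDINGS.get(topic)
    if schema_id is None:
        # No schema bound: keep the payload as opaque bytes.
        return {"kind": "raw", "bytes": payload}
    doc = json.loads(payload)
    fields = {}
    for name, expected in load_schema(schema_id):
        if name not in doc:
            raise ValueError(f"missing field: {name}")
        value = doc[name]
        # JSON has one number type; widen ints where the schema says float.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}")
        fields[name] = value
    return {"kind": "typed", "schema": schema_id, "fields": fields}
```

Decoding `{"sensor": "a1", "celsius": 21}` on `sensors/temp` yields a typed event whose `celsius` field is the float `21.0`, while the same bytes on an unbound topic come back as a raw event.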
4. Schema in Processing Flow
Schemas influence every stage of the event lifecycle. Once applied, the payload becomes a Typed Event, enabling correct filtering, transformation, statistics, and format conversion.
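High-Level Flow
Schema lookup → parsing into a Typed Event → filtering → transformations → statistics → format conversion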
Typed Event Definition
A Typed Event is the unified internal representation MAPS generates from any schema format:
- typed value tree
- per-field type metadata
- fast indexed paths
- normalised fields (timestamps → epoch millis)
The rest of MAPS does not care whether the original payload was JSON, Avro, Protobuf, CBC, etc.
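One way to picture a Typed Event is as a flat index from field path to a (type, value) pair, with timestamps normalised to epoch milliseconds. The sketch below is an assumption about shape only; `build_typed_event`, the dotted path syntax, and the `timestamp_fields` parameter are invented for the example.

```python
from datetime import datetime, timezone


def to_epoch_millis(value) -> int:
    """Normalise an epoch number or ISO 8601 string to epoch milliseconds."""
    if isinstance(value, (int, float)):
        return int(value)  # assumed to be epoch millis already
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return round(dt.timestamp() * 1000)


def build_typed_event(doc: dict, timestamp_fields=frozenset()) -> dict:
    """Walk a decoded payload and index every leaf by its dotted path."""
    index = {}

    def walk(node, path):
        if isinstance(node, dict):
            for key, child in node.items():
                walk(child, f"{path}.{key}" if path else key)
        elif path in timestamp_fields:
            index[path] = ("timestamp", to_epoch_millis(node))
        else:
            # Per-field type metadata is stored next to the value.
            index[path] = (type(node).__name__, node)

    walk(doc, "")
    return index
```

Because every leaf is reachable by a precomputed path, a later stage can look up `meta.seen` directly without knowing which wire format the payload arrived in.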
Step-by-Step Processing
1. Schema Lookup
Find schema bound to the topic/context.
If none exists: event stays opaque.
2. Parsing into Unified Model
Payload decoded via schema rules:
- JSON
- Protobuf
- Avro
- CBC
- MessagePack / CBOR
- XML
Output: Typed Event
3. Filtering
Typed evaluation allows:
- numeric comparisons
- string logic
- boolean checks
- proper null handling
All fast, no string-guessing.
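A short illustration of why typing matters for filters: comparing the strings `"10"` and `"9"` gives the wrong answer, while comparing the integers does not. The `typed_filter` helper and its null rule (a null field never matches) are assumptions for the example, not MAPS filter syntax.

```python
import operator

OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def typed_filter(fields: dict, name: str, op: str, literal) -> bool:
    """Evaluate `name op literal` against already-typed field values."""
    value = fields.get(name)
    if value is None:
        return False  # assumption: a null or missing field never matches
    return OPS[op](value, literal)


# String comparison is lexicographic, so "10" sorts before "9".
string_result = "10" > "9"
# Typed comparison uses the numeric values.
typed_result = typed_filter({"count": 10}, "count", ">", 9)
```

Here `string_result` is `False` and `typed_result` is `True`, which is the difference between guessing from text and evaluating against a schema.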
4. Transformations
Typed Events can be mapped into new schema types:
- renaming
- flattening
- nested extraction
- type conversion
- enrichment
- full schema-to-schema translation
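A minimal sketch of three of these operations (flattening, renaming, and type conversion) on a plain dictionary. The function names, the `_` separator, and the `renames`/`casts` parameters are hypothetical; MAPS transformations are configured differently.

```python
def flatten(doc: dict, prefix: str = "", sep: str = "_") -> dict:
    """Collapse nested objects into a single level of joined field names."""
    out = {}
    for key, value in doc.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name, sep))
        else:
            out[name] = value
    return out


def transform(doc: dict, renames=None, casts=None) -> dict:
    """Flatten, then rename fields, then convert types, in that order."""
    flat = flatten(doc)
    for old, new in (renames or {}).items():
        if old in flat:
            flat[new] = flat.pop(old)
    for name, cast in (casts or {}).items():
        # Casts refer to the post-rename field names; nulls are left alone.
        if name in flat and flat[name] is not None:
            flat[name] = cast(flat[name])
    return flat
```

For example, `transform({"dev": {"id": "7", "temp": 21.5}}, renames={"dev_id": "device"}, casts={"device": int})` produces `{"dev_temp": 21.5, "device": 7}`.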
5. Statistics
Typed Events enable correct:
- min/max/avg/median
- stddev (Welford)
- string frequency
- histograms
- sliding-window summaries
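The list names Welford's method for standard deviation. The sketch below shows the standard single-pass form of that algorithm; the `RunningStats` class is illustrative, and the choice of sample (n - 1) rather than population variance is an assumption, since the text does not say which one MAPS reports.

```python
import math


class RunningStats:
    """Single-pass min/max/mean/stddev using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the old and the updated mean, which keeps the sum stable.
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def stddev(self) -> float:
        # Sample standard deviation; 0.0 until there are two values.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
```

The update needs only three numbers of state per field, so it works on an unbounded stream without storing the values. It does require numeric input, which is why a typed field is a precondition.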
6. Format Conversion
Typed Events can be re-encoded into:
- JSON
- Protobuf
- Avro
- CBC
- MessagePack
- CBOR
- XML
- CSV (best-effort conventions)
Without a Schema
- payload is opaque
- filtering becomes string-based heuristics
- transformations are restricted
- stats degrade to string mode
- cross-format conversions are impossible
5. Schema Compatibility Matrix
Different serialization formats support different field types natively. The matrix below shows what MAPS can safely convert between formats, and where conventions or special handling are required.
| Field | Avro | CBC | CBOR | CSV | Protobuf | JSON | MessagePack | XML |
|---|---|---|---|---|---|---|---|---|
| stringId (string) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| intId (int32) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| longId (int64) | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| floatId (float32) | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| doubleId (float64) | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| booleanId | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| bytesId (binary) | ✅ | ✅ | ✅ | ⚠️ | ✅ | ⚠️ | ✅ | ⚠️ |
| arrayId (array<int>) | ✅ | ✅* | ✅ | ⚠️ | ✅ | ✅ | ✅ | ⚠️ |
| nestedObj | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ✅ | ✅ |
| enumId | ✅ | ✅ | ✅ | ⚠️ | ✅ | ✅ | ⚠️ | ✅ |
| timestampId | ✅† | ✅‡ | ✅§ | ⚠️ | ✅¶ | ✅ | ✅ | ✅ |
This table outlines schema type compatibility across all serialization formats: Avro, CBC, CBOR, CSV, Protobuf, JSON, MessagePack, and XML.
Legend
- ✅ Native / straightforward support
- ⚠️ Supported with convention (e.g. stringify, JSON-in-cell, base64, etc.)
- ✅* CBC arrays require explicit element length/layout
- ✅† Avro: `logicalType: timestamp-millis` or `timestamp-micros`
- ✅‡ CBC: store integer epoch or ISO string via `encalc`/`decalc`
- ✅§ CBOR: use tag 0/1 (date/time) or plain ISO string
- ✅¶ Protobuf: use `google.protobuf.Timestamp` or epoch millis
Notes on Compatibility
CSV
CSV cannot encode structured or binary data natively. MAPS uses conventions such as:
- JSON-in-cell for objects or arrays
- base64 for binary data
- string rendering for timestamps
Usable, but not elegant.
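The three conventions can be sketched with the Python standard library. The helper names and the exact renderings (compact JSON with sorted keys, ISO 8601 timestamps in UTC, lowercase booleans, empty cell for null) are assumptions chosen for the example; the text only states the conventions in general terms.

```python
import base64
import csv
import io
import json
from datetime import datetime, timezone


def to_csv_cell(value) -> str:
    """Render one typed value as a CSV cell using the conventions above."""
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")  # binary as base64
    if isinstance(value, (dict, list)):
        # Objects and arrays become JSON-in-cell.
        return json.dumps(value, separators=(",", ":"), sort_keys=True)
    if isinstance(value, datetime):
        # Naive datetimes are interpreted as local time by astimezone().
        return value.astimezone(timezone.utc).isoformat()
    if isinstance(value, bool):
        return "true" if value else "false"
    return "" if value is None else str(value)


def to_csv_row(fields: dict, columns: list) -> str:
    """Write the selected columns as one CSV line, quoting where needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([to_csv_cell(fields.get(c)) for c in columns])
    return buf.getvalue()
```

The `csv` writer takes care of quoting, so a JSON-in-cell value that contains commas stays in a single column.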
CBC
CBC always requires an explicit field layout. Arrays, enums, timestamps, and nested objects all rely on exact offsets and encoding rules.
Nothing is inferred.
JSON / MessagePack
These formats allow multiple valid encodings for:
- timestamps
- enums
- binary data
MAPS applies internal conventions, but without schemas conversions can be lossy or ambiguous.
XML
XML supports all MAPS field types, but round-tripping requires consistent rules for:
- ordering
- attributes vs elements
- namespaces
MAPS enforces deterministic mapping to keep cross-format conversions stable.
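To illustrate what a deterministic mapping could look like, the sketch below fixes one rule per ambiguity: fields are always child elements (never attributes), children are emitted in sorted key order, and no namespaces are used. These rules and the `item` element name for array entries are assumptions for the example, not the mapping MAPS actually applies.

```python
import xml.etree.ElementTree as ET


def to_xml(name: str, value) -> ET.Element:
    """Map a typed value to an element tree using fixed, order-stable rules."""
    elem = ET.Element(name)
    if isinstance(value, dict):
        # Sorted keys make the output independent of insertion order.
        for key in sorted(value):
            elem.append(to_xml(key, value[key]))
    elif isinstance(value, list):
        for item in value:
            elem.append(to_xml("item", item))  # hypothetical wrapper name
    elif isinstance(value, bool):
        elem.text = "true" if value else "false"
    elif value is not None:
        elem.text = str(value)
    return elem


def render(name: str, value) -> str:
    return ET.tostring(to_xml(name, value), encoding="unicode")
```

Two payloads with the same fields in a different order render to the same string, which is the property that keeps repeated cross-format conversions stable.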