
Overview

Schemas define the structure and types of event payloads. They are the foundation of the MAPS data pipeline.


1. Why Schemas Matter

  • Validation: Ensures every message matches expected structure.
  • Transformation: Enables field-level mapping and enrichment.
  • Filtering: Makes expressions type-aware and accurate.
  • Statistics: Allows numeric and string fields to be aggregated correctly.

Without a schema, payloads are opaque and only minimal inference is possible.


2. Supported Formats

Name         Description
AVRO         Apache Avro
CBC          Compact Binary Schema (fixed layout)
CBOR         Concise Binary Object Representation
CSV          Comma separated values
JSON         JavaScript Object Notation
MessagePack  Binary JSON-like representation
Native       Single Java scalar value (int, long, double, String)
ProtoBuf     Google Protocol Buffers
RAW          Opaque bytes, no structure
XML          Extensible Markup Language

3. Schema Registration

Schemas are stored in the Schema Repository and are bound to contexts, usually topic names. When an event arrives, the server determines which schema applies and uses that to parse, validate, and type the payload.

How Registration Works

  • Each schema has a unique ID
    Identifies a concrete schema definition (JSON, Protobuf, Avro, CBC, etc.).

  • Schemas are bound to a context
    A context may be:

    • a topic
    • a topic pattern
    • a schema reference inside a processor or transformation
  • Multiple versions per context
    MAPS stores all versions. The active one may be:

    • latest
    • pinned ID
    • version chosen by a transformation pipeline
  • Fallback behaviour
    If no schema is registered:

    • payload becomes RAW
    • no structural validation
    • stats and transformations run best-effort only
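
The binding rules above can be sketched as a small lookup. This is a minimal illustration, not the MAPS API: the `Binding` class, the `resolve` function, the glob-style pattern syntax, and the rule that exact topics win over patterns are all assumptions made for the example.

```python
import fnmatch
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Binding:
    """Hypothetical binding of a topic or topic pattern to schema versions."""
    pattern: str                                        # exact topic or glob pattern
    versions: list[str] = field(default_factory=list)   # schema IDs, oldest first
    pinned: Optional[str] = None                        # pinned schema ID, if any

    def active(self) -> str:
        # A pinned ID wins; otherwise the latest registered version is active.
        return self.pinned if self.pinned is not None else self.versions[-1]


def resolve(topic: str, bindings: list[Binding]) -> str:
    """Return the active schema ID for a topic, or "RAW" when nothing is bound."""
    # Assumption: exact topic bindings take precedence over pattern bindings.
    for b in bindings:
        if b.pattern == topic and b.versions:
            return b.active()
    for b in bindings:
        if fnmatch.fnmatchcase(topic, b.pattern) and b.versions:
            return b.active()
    return "RAW"  # fallback: no schema registered for this topic


bindings = [
    Binding("sensors/*", versions=["temp-v1", "temp-v2"]),
    Binding("sensors/door", versions=["door-v1", "door-v2"], pinned="door-v1"),
]
```

With these example bindings, `sensors/kitchen` resolves to the latest version of the pattern binding, `sensors/door` resolves to its pinned ID, and any unbound topic falls back to RAW.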

What the Server Does During Registration

  1. Parse & validate the schema

    • JSON validated against internal rules
    • Avro / Protobuf / IDL correctness checks
    • CBC validated for layout integrity
  2. Store the schema in the Schema Repository as an internal unified model.

  3. Create a context binding
    Mapping: topic/pattern → schema ID

  4. Update dependent subsystems

    • caches
    • processing engines
    • transform pipelines
    • statistics processors
    • format conversion logic
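
The four registration steps can be sketched with an in-memory stand-in. `SchemaRepository`, its attributes, and the toy validation rule (a JSON object with a `fields` list) are hypothetical and only illustrate the order of operations, not the server's internal model.

```python
import json


class SchemaRepository:
    """Hypothetical in-memory stand-in for the Schema Repository."""

    def __init__(self):
        self.schemas = {}    # schema ID -> parsed schema definition
        self.bindings = {}   # topic or pattern -> schema ID
        self.cache = {}      # topic -> resolved schema, filled lazily elsewhere

    def register(self, schema_id: str, context: str, schema_text: str) -> None:
        # 1. Parse and validate (toy rule: a JSON object with a "fields" list).
        schema = json.loads(schema_text)
        if not isinstance(schema, dict) or not isinstance(schema.get("fields"), list):
            raise ValueError("schema must be an object with a 'fields' list")
        # 2. Store the parsed definition.
        self.schemas[schema_id] = schema
        # 3. Bind the context (topic or pattern) to the schema ID.
        self.bindings[context] = schema_id
        # 4. Drop cached lookups so dependent components see the new binding.
        self.cache.clear()
```

Validation runs before anything is stored, so a rejected schema leaves existing bindings and caches untouched.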

During Event Processing

  1. Resolve schema based on topic
  2. Load schema (cached)
  3. Decode payload into a Typed Event
  4. Run processing pipeline

Typed events allow MAPS to:

  • evaluate expressions with correct typing
  • compute numeric/string stats
  • run transformations
  • convert formats safely

Without a schema, payloads are raw bytes and only limited processing is possible.


4. Schema in Processing Flow

Schemas influence every stage of the event lifecycle. Once applied, the payload becomes a Typed Event, enabling correct filtering, transformation, statistics, and format conversion.

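High-Level Flow

Schema Lookup → Parsing into Unified Model → Filtering → Transformations → Statistics → Format Conversion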

Typed Event Definition

A Typed Event is the unified internal representation MAPS generates from any schema format:

  • typed value tree
  • per-field type metadata
  • fast indexed paths
  • normalised fields (timestamps → epoch millis)

The rest of MAPS does not care whether the original payload was JSON, Avro, Protobuf, CBC, etc.
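
As a rough illustration of the idea, the sketch below turns a decoded payload into a flat, path-indexed structure that carries a type name alongside each value and normalises timestamps to epoch milliseconds. The function name, the schema shape (dotted path → type name), and the type names are assumptions for the example, not the MAPS internal model.

```python
from datetime import datetime, timezone


def to_typed_event(payload: dict, schema: dict) -> dict:
    """Build a path-indexed view of a payload using a hypothetical schema.

    `schema` maps dotted field paths to one of the type names
    "long", "double", "string", or "timestamp".
    """
    casts = {"long": int, "double": float, "string": str}
    event = {}
    for path, type_name in schema.items():
        value = payload
        for key in path.split("."):        # walk the nested payload
            value = value[key]
        if type_name == "timestamp":
            # Normalise ISO-8601 strings to epoch milliseconds.
            dt = datetime.fromisoformat(value)
            if dt.tzinfo is None:          # assumption: naive values are UTC
                dt = dt.replace(tzinfo=timezone.utc)
            value = int(dt.timestamp() * 1000)
        else:
            value = casts[type_name](value)
        event[path] = (type_name, value)   # per-field type metadata + value
    return event
```

Downstream code can then look up `event["device.temp"]` and receive both the declared type and a correctly typed value, regardless of how the payload was originally encoded.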

Step-by-Step Processing

1. Schema Lookup

Find schema bound to the topic/context.
If none exists: event stays opaque.

2. Parsing into Unified Model

Payload decoded via schema rules:

  • JSON
  • Protobuf
  • Avro
  • CBC
  • MessagePack / CBOR
  • XML

Output: Typed Event

3. Filtering

Typed evaluation allows:

  • numeric comparisons
  • string logic
  • boolean checks
  • proper null handling

Comparisons run directly on typed values, so they are fast and never have to guess a type from a string.
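
A small sketch shows why typing matters for filters. The `matches` helper and its rule that a missing or null field never matches are assumptions for illustration; MAPS filter expressions may handle nulls differently.

```python
import operator

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


def matches(event: dict, field: str, op: str, expected) -> bool:
    """Evaluate `field op expected` against typed values.

    Simplified null handling: a missing or null field never matches.
    """
    value = event.get(field)
    if value is None:
        return False
    return OPS[op](value, expected)
```

For example, `matches({"temp": 9}, "temp", "<", 10)` is true because 9 and 10 are compared as numbers, whereas the string comparison `"9" < "10"` is false because strings are ordered character by character.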

4. Transformations

Typed Events can be mapped into new schema types:

  • renaming
  • flattening
  • nested extraction
  • type conversion
  • enrichment
  • full schema-to-schema translation
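
Three of these operations (flattening, renaming, and type conversion) can be sketched in a few lines. The rule format used here, a `renames` dict and a `casts` dict, is hypothetical and not the MAPS transformation syntax.

```python
def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted paths, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def transform(event: dict, renames: dict, casts: dict) -> dict:
    """Flatten, then rename fields, then convert types (hypothetical rule format)."""
    out = {}
    for path, value in flatten(event).items():
        name = renames.get(path, path)     # keep the path if no rename applies
        cast = casts.get(name)             # casts are keyed by the final name
        out[name] = cast(value) if cast else value
    return out
```

Calling `transform({"device": {"id": 7, "temp": "21.5"}}, {"device.temp": "temperature"}, {"temperature": float, "device.id": str})` yields a flat record with `device.id` as a string and `temperature` as a float.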

5. Statistics

Typed Events enable correct:

  • min/max/avg/median
  • stddev (Welford)
  • string frequency
  • histograms
  • sliding-window summaries
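
Welford's method updates the mean and the sum of squared deviations one value at a time, which avoids storing the whole stream and is numerically more stable than accumulating sums of squares. The class below is a minimal sketch; the `RunningStats` name and the choice of the sample (n − 1) standard deviation are assumptions, not a description of the MAPS statistics processors.

```python
import math


class RunningStats:
    """Streaming min/max/mean/stddev using Welford's online algorithm."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0             # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)   # second factor uses the updated mean
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    def stddev(self) -> float:
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))
```

This only works on numeric input, which is why a typed field is needed: a temperature delivered as the string "21.5" would otherwise fall into string statistics.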

6. Format Conversion

Typed Events can be re-encoded into:

  • JSON
  • Protobuf
  • Avro
  • CBC
  • MessagePack
  • CBOR
  • XML
  • CSV (best-effort conventions)

Without a Schema

  • payload is opaque
  • filtering becomes string-based heuristics
  • transformations are restricted
  • stats degrade to string mode
  • cross-format conversions are impossible

5. Schema Compatibility Matrix

Different serialization formats support different field types natively. The matrix below shows what MAPS can safely convert between formats, and where conventions or special handling are required.

FieldAvroCBCCBORCSVProtobufJSONMessagePackXML
stringId (string)
intId (int32)
longId (int64)⚠️
floatId (float32)⚠️
doubleId (float64)⚠️
booleanId⚠️
bytesId (binary)⚠️⚠️⚠️
arrayId (array<int>)✅*⚠️⚠️
nestedObj⚠️
enumId⚠️⚠️
timestampId✅†✅‡✅§⚠️✅¶

This table outlines schema type compatibility across all serialization formats: Avro, CBC, CBOR, CSV, Protobuf, JSON, MessagePack, and XML.

Legend

  • ✅ Native / straightforward support
  • ⚠️ Supported with convention (e.g. stringify, JSON-in-cell, base64, etc.)
  • ✅* CBC arrays require explicit element length/layout
  • ✅† Avro: logicalType: timestamp-millis or timestamp-micros
  • ✅‡ CBC: store integer epoch or ISO string via encalc/decalc
  • ✅§ CBOR: use tag 0/1 (date/time) or plain ISO string
  • ✅¶ Protobuf: use google.protobuf.Timestamp or epoch millis

Notes on Compatibility

CSV

CSV cannot encode structured or binary data natively. MAPS uses conventions such as:

  • JSON-in-cell for objects or arrays
  • base64 for binary data
  • string rendering for timestamps

Usable, but not elegant.
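
The three conventions can be sketched with the Python standard library. The helper names are hypothetical, and the exact rendering MAPS uses (for example, the timestamp format) may differ.

```python
import base64
import csv
import io
import json
from datetime import datetime, timezone


def to_csv_cell(value) -> str:
    """Render one value as a CSV cell using the conventions listed above."""
    if isinstance(value, (dict, list)):
        return json.dumps(value, separators=(",", ":"))    # JSON-in-cell
    if isinstance(value, (bytes, bytearray)):
        return base64.b64encode(value).decode("ascii")     # base64 for binary
    if isinstance(value, datetime):
        # Timestamp as an ISO-8601 string in UTC (naive values are treated as local time).
        return value.astimezone(timezone.utc).isoformat()
    return "" if value is None else str(value)


def to_csv_row(values: list) -> str:
    """Write one row; the csv module quotes cells that contain commas or quotes."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow([to_csv_cell(v) for v in values])
    return buf.getvalue()
```

Note that a JSON-in-cell value contains double quotes, so the CSV writer wraps the cell in quotes and doubles the inner ones. A consumer has to undo both layers, which is part of why this is a convention and not native support.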

CBC

CBC always requires an explicit field layout. Arrays, enums, timestamps, and nested objects all rely on exact offsets and encoding rules.
Nothing is inferred.
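
To illustrate what an explicit layout means in practice, the sketch below encodes a record at fixed byte offsets using Python's `struct` module. It is not the CBC wire format: the field names, offsets, little-endian byte order, and the `LAYOUT` table are all invented for the example.

```python
import struct

# Hypothetical fixed layout: (field name, byte offset, struct format).
# "<" selects little-endian encoding with no padding.
LAYOUT = [
    ("device_id", 0, "<I"),     # uint32 at bytes 0..3
    ("temperature", 4, "<f"),   # float32 at bytes 4..7
    ("status", 8, "<B"),        # enum stored as uint8 at byte 8
    ("ts_millis", 9, "<q"),     # int64 epoch millis at bytes 9..16
]
RECORD_SIZE = 17


def encode(record: dict) -> bytes:
    """Write every field at its declared offset."""
    buf = bytearray(RECORD_SIZE)
    for name, offset, fmt in LAYOUT:
        struct.pack_into(fmt, buf, offset, record[name])
    return bytes(buf)


def decode(data: bytes) -> dict:
    """Read every field from its declared offset; the size must match exactly."""
    if len(data) != RECORD_SIZE:
        raise ValueError(f"expected {RECORD_SIZE} bytes, got {len(data)}")
    return {name: struct.unpack_from(fmt, data, offset)[0]
            for name, offset, fmt in LAYOUT}
```

The bytes carry no field names or type tags, so a reader without the same layout table cannot interpret them. That is the sense in which nothing is inferred.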

JSON / MessagePack

These formats allow multiple valid encodings for:

  • timestamps
  • enums
  • binary data

MAPS applies internal conventions, but without schemas conversions can be lossy or ambiguous.

XML

XML supports all MAPS field types, but round-tripping requires consistent rules for:

  • ordering
  • attributes vs elements
  • namespaces

MAPS enforces deterministic mapping to keep cross-format conversions stable.
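
One way to picture a deterministic mapping is a fixed set of rules applied the same way every time. The sketch below uses elements only (no attributes), sorts object keys, and ignores namespaces. These specific rules and the `item` element name for list entries are assumptions for illustration and are not taken from MAPS.

```python
import xml.etree.ElementTree as ET


def to_xml(tag: str, value) -> ET.Element:
    """Map a value to XML with fixed rules: elements only, keys sorted, no namespaces."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        for key in sorted(value):                  # fixed child ordering
            elem.append(to_xml(key, value[key]))
    elif isinstance(value, list):
        for item in value:                         # lists keep their own order
            elem.append(to_xml("item", item))
    else:
        elem.text = str(value)
    return elem


def render(tag: str, value) -> str:
    """Serialise to a string without an XML declaration."""
    return ET.tostring(to_xml(tag, value), encoding="unicode")
```

Because keys are sorted before writing, two payloads with the same fields in a different order produce identical XML, which keeps repeated conversions stable.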