What Serialization Really Does to Your Data

Serialization converts in-memory data structures into a format that can be stored or transmitted. Deserialization reconstructs them. This process is fundamental to every distributed system, yet its implications are often underestimated.

Key sources: "Designing Data-Intensive Applications" by Martin Kleppmann, Protocol Buffers documentation, Apache Avro specification.


What Is Serialization?

When a program runs, data exists in memory as complex structures — objects, trees, hash maps, arrays. This representation is specific to the programming language and the running process. To write this data to a file or send it over a network, it must be converted into a sequence of bytes. That conversion is serialization.

```python
# In-memory structure
user = {"name": "Alice", "age": 30, "email": "[email protected]"}

# The same data serialized as JSON bytes
b'{"name":"Alice","age":30,"email":"[email protected]"}'
```

The reverse process — reading bytes and reconstructing the in-memory structure — is deserialization.
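As a minimal sketch of the round trip, the following uses Python's standard json module. The record contents are made up for illustration, and JSON is only one of many possible encodings.

```python
import json

user = {"name": "Alice", "age": 30}

# Serialize: in-memory dict -> JSON text -> UTF-8 bytes
payload = json.dumps(user, separators=(",", ":")).encode("utf-8")

# Deserialize: bytes -> JSON text -> a new dict
restored = json.loads(payload.decode("utf-8"))

print(payload)           # b'{"name":"Alice","age":30}'
print(restored == user)  # True: same contents
print(restored is user)  # False: a separate object in memory
```

The restored value is equal to the original but is a distinct object, which is the point: only the bytes cross the process boundary.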


Common Serialization Formats

| Format | Type | Human Readable | Schema Required | Performance |
|--------|------|----------------|-----------------|-------------|
| JSON | Text | Yes | No | Moderate |
| XML | Text | Yes | Optional | Slow |
| YAML | Text | Yes | No | Slow |
| MessagePack | Binary | No | No | Fast |
| Protocol Buffers (protobuf) | Binary | No | Yes | Very fast |
| Apache Avro | Binary | No | Yes | Very fast |
| Thrift | Binary | No | Yes | Very fast |


Why Serialization Matters

Schema Evolution

The most challenging aspect of serialization is schema evolution. A service writes data in version 1 of a schema. Months later, a different service reads that data with version 2. Can it still interpret the data?

JSON approach: JSON has no explicit schema, so compatibility is the reader's job. If a field is added, old data will not have it, and the reader must handle the missing field gracefully. If a field is removed, old readers that still expect it fail unless they were written to tolerate its absence.
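One way to write such a tolerant reader is sketched below. The function name read_user, the field names, and the default values are assumptions chosen for illustration, not part of any standard.

```python
import json

def read_user(raw: bytes) -> dict:
    """Hypothetical tolerant reader for a 'user' record."""
    data = json.loads(raw)
    return {
        "name": data["name"],            # assumed present in every version
        "age": data.get("age"),          # may have been removed: default None
        "email": data.get("email", ""),  # added later: default for old data
    }

old_record = b'{"name": "Alice", "age": 30}'
new_record = b'{"name": "Alice", "email": "alice@example.com", "nickname": "Al"}'

print(read_user(old_record))  # {'name': 'Alice', 'age': 30, 'email': ''}
print(read_user(new_record))  # {'name': 'Alice', 'age': None, 'email': 'alice@example.com'}
```

The reader picks out only the keys it understands, so the unknown "nickname" key is ignored rather than treated as an error.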

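Protocol Buffers approach: Fields are identified by number rather than by name, and readers are expected to tolerate absent fields (proto3 has no required fields, and the required label in proto2 is widely discouraged because it makes later changes unsafe). Adding a field does not break old readers, which skip unknown field numbers. When removing a field, mark its number as "reserved" so it is never reused. Forward and backward compatibility follow from the encoding, provided these rules are observed.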

```protobuf
// Version 1
message User {
  required string name = 1;
  optional int32 age = 2;
}

// Version 2 (backward compatible)
message User {
  required string name = 1;
  optional int32 age = 2;
  optional string email = 3;  // Old readers ignore this
}
```
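To make "old readers ignore unknown field numbers" concrete, here is a hand-rolled sketch of a version 1 reader in Python. It is not the official protobuf library: it handles only the two wire types this message needs (varint and length-delimited), and the function names and sample bytes are assumptions made for illustration.

```python
def read_varint(buf: bytes, pos: int):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, pos
        shift += 7

def decode_user_v1(buf: bytes) -> dict:
    """Version 1 reader: knows only name (field 1) and age (field 2)."""
    user = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:    # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number == 1:
            user["name"] = value.decode("utf-8")
        elif field_number == 2:
            user["age"] = value
        # Any other field number is unknown to version 1 and is skipped.
    return user

# Bytes a version 2 writer would emit for name="Alice", age=30, email="a@b.c"
v2_message = b"\x0a\x05Alice" + b"\x10\x1e" + b"\x1a\x05a@b.c"
print(decode_user_v1(v2_message))  # {'name': 'Alice', 'age': 30}
```

Because every field carries its wire type, the reader knows how many bytes to skip for field 3 without knowing what the field means.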

Performance

Serialization performance affects throughput in data-intensive systems:

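  • JSON: Simple to debug but comparatively slow to parse. A JSON parser must scan variable-length strings, handle escaping, convert numbers from text, and build a tree of objects.
  • Protocol Buffers: Binary format in which each field is prefixed by a compact tag (field number plus wire type). Parsing is mostly a matter of reading tags and lengths and copying bytes, which is often several times faster than parsing JSON. The exact gap depends heavily on the language, the library, and the shape of the data.
  • MessagePack: A binary encoding of the JSON data model. Usually faster and smaller than JSON, but it still carries field names in every message, so it tends to trail schematized formats.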
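A rough way to see the parsing difference on your own machine is a micro-benchmark like the one below. It uses a fixed binary layout from Python's struct module as a stand-in for a schematized binary format (it is not protobuf), and the record and field layout are assumptions. Treat the timings as indicative only, since they vary by interpreter and hardware.

```python
import json
import struct
import timeit

record = {"id": 12345, "age": 30, "score": 98.6}

# Text encoding: field names and punctuation travel with every record.
json_bytes = json.dumps(record).encode("utf-8")

# Fixed binary layout standing in for a schema: little-endian
# unsigned 32-bit id, unsigned 8-bit age, 64-bit float score.
LAYOUT = struct.Struct("<IBd")
binary_bytes = LAYOUT.pack(record["id"], record["age"], record["score"])

def parse_json():
    return json.loads(json_bytes)

def parse_binary():
    return LAYOUT.unpack(binary_bytes)

if __name__ == "__main__":
    runs = 100_000
    print("sizes :", len(json_bytes), "vs", len(binary_bytes), "bytes")
    print("json  :", timeit.timeit(parse_json, number=runs), "s")
    print("binary:", timeit.timeit(parse_binary, number=runs), "s")
```

The binary reader does no text scanning at all because the layout is agreed in advance, which is the same trade a schema makes.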

Data Size

Network bandwidth and storage cost money. Serialization format directly affects both:

  • JSON: Verbose. Field names are repeated for every object. {"name":"Alice"} takes 16 bytes for 5 bytes of actual data.
  • Protocol Buffers: Compact. Field numbers replace field names. 0x0A 0x05 "Alice" (a one-byte tag for field 1, a one-byte length, then the string) takes 7 bytes, less than half the size.
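The byte counts can be checked directly. The sketch below builds the protobuf bytes by hand rather than with the protobuf library, assuming a message whose field 1 is a string called name.

```python
import json

json_bytes = json.dumps({"name": "Alice"}, separators=(",", ":")).encode("utf-8")

# Hand-built protobuf encoding of a string field number 1 set to "Alice":
#   tag    = (field number 1 << 3) | wire type 2 (length-delimited) = 0x0A
#   length = number of payload bytes = 0x05
value = "Alice".encode("utf-8")
proto_bytes = bytes([(1 << 3) | 2, len(value)]) + value

print(len(json_bytes), json_bytes)    # 16 b'{"name":"Alice"}'
print(len(proto_bytes), proto_bytes)  # 7 b'\n\x05Alice'
```

The two bytes of overhead replace the quoted field name, braces, and colon that JSON repeats in every object.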

Serialization in Distributed Systems

Inter-Service Communication

Microservices communicate over the network. Each request and response must be serialized. The choice of format affects:

  • Latency: Faster serialization means lower request latency.
  • Throughput: Smaller payloads mean more requests per second.
  • Coupling: Shared schemas create coupling between services.

Many organizations standardize on Protocol Buffers or Avro for internal traffic and reserve JSON for external APIs.

Message Queues and Streams

Systems like Kafka are a natural fit for schematized serialization. Kafka stores bytes. Producers serialize messages. Consumers deserialize them. If both sides agree on a schema registry, schema evolution becomes manageable:

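  1. Producer registers schema version N with the schema registry and receives a schema ID
  2. Producer serializes data using schema N and attaches the schema ID to the message
  3. Consumer reads the schema ID from the message (a header or a short prefix on the payload, depending on the serializer)
  4. Consumer fetches schema N from the registry, typically caching it for later messages
  5. Consumer deserializes the data using schema N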

This allows independent evolution of producers and consumers, as long as changes are backward compatible.
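The flow above can be sketched without Kafka at all. In the example below, InMemoryRegistry is a hypothetical stand-in for a real registry service, the 5-byte prefix (a zero byte followed by a 4-byte big-endian schema ID) is an assumption of this sketch, and the body is a positional JSON array, so it cannot be interpreted without the schema.

```python
import json
import struct

class InMemoryRegistry:
    """Hypothetical stand-in for a schema registry service."""
    def __init__(self):
        self._schemas = []

    def register(self, schema: dict) -> int:
        if schema not in self._schemas:
            self._schemas.append(schema)
        return self._schemas.index(schema) + 1  # IDs start at 1

    def fetch(self, schema_id: int) -> dict:
        return self._schemas[schema_id - 1]

PREFIX = struct.Struct(">bI")  # marker byte + 4-byte big-endian schema ID

def produce(registry, schema, record) -> bytes:
    schema_id = registry.register(schema)                  # step 1
    values = [record[name] for name in schema["fields"]]   # step 2
    return PREFIX.pack(0, schema_id) + json.dumps(values).encode("utf-8")

def consume(registry, message: bytes) -> dict:
    _marker, schema_id = PREFIX.unpack_from(message)       # step 3
    schema = registry.fetch(schema_id)                     # step 4
    values = json.loads(message[PREFIX.size:])             # step 5
    return dict(zip(schema["fields"], values))

registry = InMemoryRegistry()
user_schema = {"fields": ["name", "age"]}
message = produce(registry, user_schema, {"name": "Alice", "age": 30})
print(message)                     # b'\x00\x00\x00\x00\x01["Alice", 30]'
print(consume(registry, message))  # {'name': 'Alice', 'age': 30}
```

The message itself carries no field names, only the ID that tells the consumer which schema to ask for.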

Database Storage

Databases store serialized data. The format affects storage efficiency and query performance:

  • Row-oriented storage: Serialize each row as a contiguous block. Good for fetching entire records.
  • Column-oriented storage: Serialize each column as a contiguous block. Good for aggregations on specific columns.

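Parquet and ORC are columnar storage formats designed for analytics workloads. They use techniques like dictionary encoding and run-length encoding, which work well because values within a single column tend to be similar. Compression ratios depend heavily on the data; repetitive, low-cardinality columns can shrink by an order of magnitude or more.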
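The toy example below illustrates the idea behind a columnar layout and the two encodings just mentioned. It is a sketch of the concepts in plain Python, not of how Parquet or ORC lay out bytes on disk, and the sample rows are invented.

```python
from itertools import groupby

rows = [
    {"country": "DE", "amount": 10},
    {"country": "DE", "amount": 12},
    {"country": "DE", "amount": 7},
    {"country": "FR", "amount": 5},
    {"country": "FR", "amount": 9},
]

# Column-oriented layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]

print(run_length_encode(columns["country"]))  # [('DE', 3), ('FR', 2)]
print(dictionary_encode(columns["country"]))  # (['DE', 'FR'], [0, 0, 0, 1, 1])
print(sum(columns["amount"]))                 # 43
```

The aggregation on the last line touches only the amount column, which is why columnar layouts suit analytics queries.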


Security Considerations

Deserialization is a common attack vector. A malicious serialized payload can:

  • Trigger arbitrary code execution during deserialization
  • Allocate large amounts of memory (for example, the XML "billion laughs" entity-expansion attack)
  • Cause infinite loops or runaway recursion in poorly designed formats or parsers
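The first risk is easy to demonstrate with Python's pickle module, whose documentation warns against loading untrusted data. The sketch below is deliberately harmless: the class name Payload is made up, and the "attack" only evaluates an arithmetic expression.

```python
import pickle

class Payload:
    """Hypothetical attacker-controlled object."""
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        return (eval, ("6 * 7",))

malicious_bytes = pickle.dumps(Payload())

# The receiver does not need the Payload class; loading alone runs the call.
result = pickle.loads(malicious_bytes)
print(result)  # 42
```

A real attacker would substitute a call that runs a shell command, which is why the bytes themselves must be treated as code.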

Safe practices:

  1. Validate input before deserialization
  2. Use schematized, data-only formats (protobuf, Avro) rather than native object serialization (Java serialization, Python pickle)
  3. Set strict size limits on deserialized data
  4. Do not deserialize untrusted data with mechanisms that can run code while rebuilding objects, such as Java's built-in serialization or Python's pickle
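A minimal sketch of these practices for a JSON endpoint might look like the following. The size limit, the function name load_user, and the field rules are assumptions chosen for illustration; a real service would derive them from its own schema.

```python
import json

MAX_BYTES = 64 * 1024  # assumed limit; tune per endpoint

def load_user(raw: bytes) -> dict:
    """Parse an untrusted 'user' payload, rejecting anything unexpected."""
    if len(raw) > MAX_BYTES:                  # size limit before parsing
        raise ValueError("payload too large")
    data = json.loads(raw)                    # data-only format: no code runs
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(age, bool) or not isinstance(age, int) or not 0 <= age <= 150:
        raise ValueError("age must be an integer between 0 and 150")
    return {"name": name, "age": age}         # only validated fields pass through

print(load_user(b'{"name": "Alice", "age": 30, "role": "admin"}'))
# {'name': 'Alice', 'age': 30}
```

Returning a fresh dict with only the validated fields means unexpected keys such as "role" never reach the rest of the program.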

Key Takeaways

  1. Serialization converts in-memory data to bytes for storage or transmission.
  2. Schema evolution is the hardest serialization problem — plan for forward and backward compatibility.
  3. Binary formats (protobuf, Avro) are usually faster to parse and significantly more compact than JSON; measure on your own workload before relying on a specific speedup.
  4. Schema registries decouple producers and consumers for independent evolution.
  5. Deserialization is a security risk — validate input and avoid unsafe formats.

Design principle: Choose a serialization format based on your schema evolution requirements, not convenience.