Modern serialization formats in the Big Data Era: Alternatives to JSON


May 5, 2025 · ⏱️ 10 min
By Tibi V. (RnD – WebFrontend Group)

In an industry where JSON has been the standard choice for powering web APIs and microservices over the last decade, growing data complexity and scale have driven the emergence of alternative data formats optimized for different use cases.

This article takes a closer look at some of the alternative formats that have gained traction and adoption over the past year.

JSON

JavaScript Object Notation (JSON) emerged in the early 2000s as a lightweight, text-based data interchange format. Thanks to its simplicity, human readability, and compatibility with most programming languages, JSON quickly became the standard for transmitting structured data between servers and clients. While JSON remains widely supported to this day, it has several downsides:

  • Inefficient Storage: Being a text-based format, JSON results in larger file sizes compared to binary formats, leading to increased storage and bandwidth costs.
  • Parsing Overhead: JSON parsing is slower than binary alternatives, making it less suitable for high-performance data processing.
  • Lack of Schema Support: JSON lacks built-in schema validation, which can lead to inconsistencies in large-scale data processing workflows.
  • Data Redundancy: Repeated field names in large datasets contribute to unnecessary storage overhead (the sketch below makes this concrete).
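
To make the redundancy point concrete, here is a minimal Python sketch using only the standard library (the 100k-record user entity mirrors the benchmark later in this article; the field names are illustrative):

```python
import json

# A basic user entity, repeated 100k times as in a typical API payload.
users = [{"id": i, "name": f"user{i}", "email": f"user{i}@example.com"}
         for i in range(100_000)]

payload = json.dumps(users).encode("utf-8")
print(f"JSON payload: {len(payload) / 1_000_000:.1f} MB")

# Every record repeats the keys "id", "name", and "email" verbatim;
# schema-based binary formats store them once (or not at all).
```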

These limitations have led to the adoption of more efficient formats such as Apache Parquet, Avro, Protocol Buffers, and ORC, each designed to address specific needs in modern data systems.

Comparing New Data Formats

Choosing a serialization format is more than a technical preference; it is a decision that shapes storage strategies, processing solutions, and long-term system scalability. So before evaluating their performance, it is essential to understand what each of these modern formats is designed for, where it is currently used, and how its features align with different use cases.

Apache Parquet 

  • Structure: Columnar storage format.
  • Benefits: Optimized for read-heavy analytical workloads; efficient data compression and encoding schemes enhance performance.
  • Use Cases: Ideal for scenarios requiring complex queries and aggregations, such as data warehousing and business intelligence applications.
  • Industry Adoption: Used extensively by big data frameworks like Apache Spark, Google BigQuery, and Amazon Redshift.
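
As a rough illustration of the columnar access pattern, here is a minimal sketch using the pyarrow library (the file name, columns, and compression codec are illustrative choices, not prescriptions):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar layout: each field is stored (and compressed) as its own column.
table = pa.table({
    "id": list(range(100_000)),
    "name": [f"user{i}" for i in range(100_000)],
})
pq.write_table(table, "users.parquet", compression="snappy")

# Analytical reads can fetch only the columns a query actually needs.
ids_only = pq.read_table("users.parquet", columns=["id"])
print(ids_only.num_rows)
```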

Apache Avro

  • Structure: Row-based storage format.
  • Benefits: Strong support for schema evolution, with compact binary encoding that enables efficient serialization.
  • Use Cases: Suitable for write-heavy transactional operations and scenarios requiring frequent schema changes, such as log processing and streaming data pipelines.
  • Industry Adoption: Used by companies like LinkedIn for data serialization and communication between microservices.
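
A minimal sketch of Avro's schema-carrying, row-based encoding, using the fastavro package (schema and field names are illustrative; the default on "email" is the kind of hook that makes schema evolution painless):

```python
from io import BytesIO
from fastavro import writer, reader

# The schema travels with the data; fields with defaults can be
# added or removed later without breaking old readers.
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},
    ],
}

buf = BytesIO()
writer(buf, schema, [{"id": 1, "name": "Ada", "email": "ada@example.com"}])

buf.seek(0)
for record in reader(buf):
    print(record)
```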

Protocol Buffers

  • Structure: Binary serialization format.
  • Benefits: Compact, fast, and efficient, with support for schema evolution and backward/forward compatibility.
  • Use Cases: Ideal for inter-service communication, particularly in gRPC-based microservices and mobile applications where low-latency data transmission is required.
  • Industry Adoption: Used extensively by Google, as well as in various distributed systems and cloud applications.
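
A minimal sketch of the typical workflow, assuming a hypothetical user.proto has been compiled with protoc (the user_pb2 module name follows protoc's Python naming convention; the message fields are illustrative):

```python
# Hypothetical user.proto, compiled with: protoc --python_out=. user.proto
#
#   syntax = "proto3";
#   message User {
#     int64  id    = 1;
#     string name  = 2;
#     string email = 3;
#   }
import user_pb2  # generated module (hypothetical schema above)

user = user_pb2.User(id=1, name="Ada", email="ada@example.com")
payload = user.SerializeToString()   # compact binary encoding

decoded = user_pb2.User()
decoded.ParseFromString(payload)     # fast, schema-driven deserialization
print(decoded.name)
```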

Optimized Row Columnar (ORC)

  • Structure: Columnar storage format.
  • Benefits: High compression ratios and optimized read performance, specifically designed for read-heavy operations.
  • Use Cases: Well-suited for data warehousing applications requiring fast read access and complex analytics.
  • Industry Adoption: Predominantly used in the Apache Hive ecosystem for efficient data storage and querying.
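
Outside of Hive itself, ORC files can also be written and read from Python; a minimal sketch using pyarrow's ORC module (file name and columns are illustrative):

```python
import pyarrow as pa
import pyarrow.orc as orc

table = pa.table({
    "id": list(range(100_000)),
    "name": [f"user{i}" for i in range(100_000)],
})

# ORC compresses aggressively and stores per-stripe statistics
# that let readers skip data irrelevant to a query.
orc.write_table(table, "users.orc")

read_back = orc.read_table("users.orc", columns=["id"])
print(read_back.num_rows)
```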

JSON vs. Modern Formats

To showcase the different strengths of each format, we benchmarked read and write operations on a set of 100k records (a basic user entity) serialized in each format.
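
The benchmark harness itself is not reproduced here, but the sketch below shows the kind of timing loop that could produce such throughput numbers (all function names are illustrative):

```python
import time

def benchmark(label, write_fn, read_fn, records=100_000, runs=3):
    """Average write/read throughput over a few runs (illustrative only)."""
    start = time.perf_counter()
    for _ in range(runs):
        write_fn()
    write_s = (time.perf_counter() - start) / runs

    start = time.perf_counter()
    for _ in range(runs):
        read_fn()
    read_s = (time.perf_counter() - start) / runs

    print(f"{label}: write {records / write_s:,.0f} rec/s, "
          f"read {records / read_s:,.0f} rec/s")

# Hypothetical usage: benchmark("parquet", write_parquet_100k, read_parquet_100k)
```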

The conclusions below provide a brief overview of how each format behaves under load and help decide which solution fits specific needs.

Read operations benchmark:

[Chart: CPU Read By Format]

Write operations benchmark:

[Chart: CPU Write By Format]

Brief interpretation of the results:

    1. Protocol Buffers (ProtoBuf) demonstrated exceptional read performance, with the highest throughput at approximately 897,000 records/sec. Its binary structure enables ultra-fast deserialization, making it ideal for real-time systems and microservices. CPU and memory usage during write operations were reported as zero, likely due to the format's efficiency or to the limits of short-interval measurement.
    2. Apache Avro and Apache Parquet showed solid performance across the board. Parquet is better suited for read-heavy analytical workloads, with good file-size compression and decent read throughput, while Avro excelled at writes and maintained moderate read throughput, making it highly suitable for streaming data pipelines thanks to its row-based structure and schema-evolution support.
    3. ORC provided balanced read/write performance but had a slightly larger memory footprint. It is optimized for big data environments, especially when used with Apache Hive. Its partial schema-evolution support can be a drawback compared to Avro or ProtoBuf.
    4. JSON, while human-readable, performed the worst in both file size and speed. Its significantly lower throughput and larger files reaffirm its unsuitability for high-performance or storage-sensitive applications.

Key Takeaways:

  • For maximum speed and compactness, ProtoBuf is recommended, especially for latency-critical systems.
  • For data lakes and analytics, Parquet or ORC are preferable due to compression and columnar structure.
  • Avro remains a strong choice for log pipelines and Kafka-based systems.
  • Avoid JSON in scenarios where performance and storage efficiency matter.

Emerging Data Formats

Besides the formats presented above, several other emerging data formats are gaining traction:

  • BSON (Binary JSON): Used in MongoDB, BSON extends JSON by adding support for binary data types, improving efficiency while maintaining JSON compatibility.
  • MessagePack: A compact binary format that reduces size while maintaining speed, making it ideal for network communication.
  • CBOR (Concise Binary Object Representation): Designed for small-footprint applications, particularly in IoT and embedded systems.
  • UBJSON (Universal Binary JSON): Aims to be a universal, efficient binary format compatible with JSON.
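
As a quick illustration of how close these formats stay to the JSON model, MessagePack's Python bindings drop in almost directly where json is used (a minimal sketch using the msgpack package; the record is illustrative):

```python
import json
import msgpack  # pip install msgpack

record = {"id": 1, "name": "Ada", "email": "ada@example.com"}

packed = msgpack.packb(record)          # compact binary bytes
as_json = json.dumps(record).encode()   # text baseline for comparison

print(len(packed), "bytes vs", len(as_json), "bytes")
print(msgpack.unpackb(packed))          # round-trips to the same dict
```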

Conclusion

The rise of big data and performance-intensive applications has exposed JSON’s limitations, leading to the adoption of more specialized formats like the ones we detailed above. While JSON remains a viable option for many applications, understanding when to use these newer formats can significantly improve storage efficiency, query performance, and overall system scalability.

As data-driven applications continue to evolve, keeping an eye on these new formats will be crucial for optimizing data management strategies.

