Deep Dive into Dremel: Record Shredding and Assembly
In the world of big data, efficient storage and retrieval of nested data structures are paramount. Google’s Dremel system (the precursor to BigQuery and the inspiration for the popular Parquet format) introduced a columnar storage format for nested data (like JSON or Protocol Buffers) that lets a query scan only the fields it actually needs instead of reading whole records. In this post, I’ll share my own implementation of the Dremel record shredding and assembly algorithms, with code snippets and explanations of the core concepts. The implementation mostly follows the pseudocode in the original paper, so you may also find it useful to read the paper for more details.