Hadoop can store and process unstructured data such as video and text, which is a significant advantage. However, at least as many Hadoop workloads involve structured or “flexibly structured” data: data whose fields can be added, changed, or removed over time, and can even vary among concurrently ingested records. Many of the file format choices focus on managing structured and flexibly structured data.
Factors affecting functionality and performance
- Flexible schema evolution – Do you want to add and delete fields from a file? You may face massive re-processing just to add a new field if the format is not suited to a flexible, evolving schema.
- Block splitting – Since Hadoop stores and processes data in blocks, you must be able to begin reading data at any point within a file in order to take full advantage of Hadoop’s distributed processing. For example, CSV files are splittable since you can start reading at any line in the file and the data will still make sense; however, an XML file is not splittable since XML has an opening tag at the beginning and a closing tag at the end. You cannot start processing at any point in the middle of those tags.
- Compression – Since Hadoop stores large files by splitting them into blocks, it’s best if the blocks can be independently compressed. Snappy and LZO are commonly used compression technologies that enable efficient block storage and processing. If a file format does not support block compression then, if compressed, the file is rendered non-splittable. When processed, the decompressor must begin reading the file at its beginning in order to obtain any block within the file. For a large file with many blocks, this could generate a substantial performance penalty.
- Size of file – If your files are smaller than the size of an HDFS block, then splittability and block compression don’t matter. You may be able to store the data uncompressed or with a simple file compression algorithm. Of course, small files are the exception in Hadoop, and processing too many small files can cause performance issues. Hadoop wants large, splittable files so that its massively distributed engine can leverage data locality and parallel processing.

Large files in Hadoop consume a lot of disk, especially when considering standard 3x replication, so there is an economic incentive to compress data, i.e. store more data per byte of disk. There is also a performance incentive: disk IO is expensive, and if you can reduce the disk footprint through compression, you can relieve IO bottlenecks. As an example, I converted an uncompressed, 1.8 GB CSV file into the following formats, achieving much smaller disk footprints.
| Format | Size on Disk |
| --- | --- |
| Uncompressed CSV | 1.8 GB |
| Avro | 1.5 GB |
| Avro w/ Snappy Compression | 750 MB |
| Parquet w/ Snappy Compression | 300 MB |
- Processing versus query performance – There are three types of performance to consider:
- Write performance — how fast can the data be written.
- Partial read performance — how fast can you read individual columns within a file.
- Full read performance — how fast can you read every data element in a file.
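To see why block-level compression preserves splittability, here is a minimal sketch using Python's zlib. This is illustrative only: real Hadoop formats use codecs such as Snappy or LZO internally, and HDFS blocks are far larger than the toy block size used here.

```python
import zlib

BLOCK_SIZE = 16  # tiny for illustration; HDFS blocks are typically 128 MB

data = b"0123456789abcdef" * 4  # pretend this is a large file

# Block compression: each block is compressed independently, so block i
# can be decompressed without reading blocks 0..i-1 and the file remains
# splittable.
blocks = [zlib.compress(data[i:i + BLOCK_SIZE])
          for i in range(0, len(data), BLOCK_SIZE)]
block2 = zlib.decompress(blocks[2])  # random access to a single block

# Whole-file compression: a single stream. Recovering block 2 requires
# decompressing everything before it first.
whole = zlib.compress(data)
block2_again = zlib.decompress(whole)[2 * BLOCK_SIZE:3 * BLOCK_SIZE]
```

The trade-off is that independently compressed blocks compress slightly worse than one long stream, since the codec cannot exploit redundancy across block boundaries.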
A columnar, compressed file format like Parquet or ORC may optimize partial and full read performance, but it does so at the expense of write performance. Conversely, uncompressed CSV files are fast to write but, lacking compression and column orientation, slow to read. You may end up with multiple copies of your data, each formatted for a different performance profile.
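The partial-read advantage of columnar layouts can be sketched in a few lines; the layout below is illustrative, not any real format's encoding:

```python
rows = [
    {"id": 1, "name": "alice", "score": 90},
    {"id": 2, "name": "bob", "score": 85},
    {"id": 3, "name": "carol", "score": 99},
]

# Row-oriented (CSV-like): whole records stored one after another.
row_store = [(r["id"], r["name"], r["score"]) for r in rows]

# Column-oriented (Parquet/ORC-like): each column stored contiguously.
col_store = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "score": [r["score"] for r in rows],
}

# Partial read of "score": the columnar layout grabs one contiguous list,
# while the row layout must visit every record and pick the field out.
scores_columnar = col_store["score"]
scores_row = [rec[2] for rec in row_store]
```

Contiguous columns also compress better, since values of the same type sit next to each other, which is part of why columnar formats pair so well with block compression.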
A common mistake is choosing a format that is not splittable, such as the XML File and JSON File formats. Each of these formats contains a single document per file, with an opening tag at the beginning and a closing tag at the end, so processing cannot begin at an arbitrary point between those tags. Note that files containing one JSON record per line are fine.
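Splittability itself is easy to sketch: a worker assigned an arbitrary byte range of a CSV file can align itself to record boundaries without scanning from byte zero. The convention below (a record belongs to the split where it starts) is a simplified version of how Hadoop's line-based readers behave, not the actual implementation:

```python
import csv
import io

def read_split(data: bytes, start: int, end: int) -> list:
    """Read only the CSV records that *start* within [start, end)."""
    buf = io.BytesIO(data)
    if start > 0:
        buf.seek(start - 1)
        buf.readline()  # discard up to the first record boundary at/after start
    rows = []
    while buf.tell() < end:  # record-start offset decides ownership
        line = buf.readline()
        if not line:
            break
        rows.append(next(csv.reader([line.decode()])))
    return rows

data = b"1,alice\n2,bob\n3,carol\n4,dave\n"
first = read_split(data, 0, 14)          # worker 1's byte range
second = read_split(data, 14, len(data)) # worker 2's byte range
```

Two workers covering disjoint byte ranges recover every record exactly once, even when a split boundary falls mid-record. No such alignment is possible for a single XML or JSON document, which is why those formats are not splittable.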
Text/CSV Files
CSV files are still quite common and often used for exchanging data between Hadoop and external systems. They are readable and ubiquitously parsable. They come in handy when doing a dump from a database or bulk loading data from Hadoop into an analytic database.
When working with Text/CSV files in Hadoop, never include header or footer lines. Each line of the file should contain a record. This, of course, means that there is no metadata stored with the CSV file. You must know how the file was written in order to make use of it. Also, since the file structure is dependent on field order, new fields can only be appended at the end of records while existing fields can never be deleted.
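The consequences of a headerless, position-dependent format can be sketched as follows. The schema lives outside the file, and the only safe evolution is appending fields, so older rows simply come up short on the new trailing columns (field names here are made up for illustration):

```python
import csv

# The field order *is* the schema and must be supplied out of band.
schema = ["id", "name", "email"]  # v2 schema: "email" was appended to v1

v1_row = "1,alice"           # written before "email" existed
v2_row = "2,bob,bob@x.com"   # written with the new trailing field

def parse(line: str) -> dict:
    values = next(csv.reader([line]))
    # Pad missing trailing fields so old rows read cleanly under the new schema.
    values += [None] * (len(schema) - len(values))
    return dict(zip(schema, values))
```

Deleting or reordering fields, by contrast, would silently shift every value into the wrong column, which is why those operations are off the table for CSV in Hadoop.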
JSON Records
JSON records are different from JSON Files in that each line is its own JSON datum, making the files splittable. Unlike CSV files, JSON stores metadata with the data, fully enabling schema evolution. JSON support was a relative latecomer to the Hadoop toolset, and many of the native serdes contain significant bugs.
Avro Files
Avro files are quickly becoming the best multi-purpose storage format within Hadoop. Avro files store metadata with the data but also allow specification of an independent schema for reading the file.
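The independent reader schema is what makes Avro's schema evolution work. The following is a simplified sketch of the resolution idea in plain Python, not the real `avro` library API: the file stores the writer's schema alongside the data, and a reader schema supplies defaults for fields the writer never knew about.

```python
# Writer's schema travels with the file; the reader brings its own, newer one.
writer_schema = {"fields": [{"name": "id"}, {"name": "name"}]}
reader_schema = {"fields": [
    {"name": "id"},
    {"name": "name"},
    {"name": "email", "default": None},  # field added later, with a default
]}

def resolve(record: dict, reader: dict) -> dict:
    """Fill fields absent from the written record with the reader's defaults."""
    return {f["name"]: record.get(f["name"], f.get("default"))
            for f in reader["fields"]}

old_record = {"id": 1, "name": "alice"}  # written under writer_schema
```

The real Avro resolution rules also cover renamed fields via aliases and type promotions, but the default-filling shown here is the core of add-a-field evolution.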
Sequence Files
Sequence files store data in a binary format with a structure similar to CSV. Like CSV, sequence files do not store metadata with the data, so the only schema evolution option is appending new fields. Due to the complexity of reading sequence files, they are often used only for “in flight” data, such as intermediate data storage within a sequence of MapReduce jobs.
RC Files
RC Files, or Record Columnar Files, were the first columnar file format adopted in Hadoop. To add a column to your data, you must rewrite every pre-existing RC file. Also, although RC files are good for query, writing an RC file requires more memory and computation than non-columnar file formats; they are generally slower to write.
ORC Files
ORC Files, or Optimized RC Files, were invented to optimize performance in Hive and are primarily backed by Hortonworks. Like RC files, ORC is a columnar storage format that does not support flexible schema evolution, but its block compression improves on RC files; some benchmarks indicate that ORC files compress to be the smallest of all file formats in Hadoop. It is worth noting that, at the time of this writing, Cloudera Impala does not support ORC files.
Parquet Files
Parquet Files are yet another columnar file format that originated in the Hadoop ecosystem. Unlike RC and ORC files, however, Parquet serdes support limited schema evolution: new columns can be added at the end of the structure. At present, Hive and Impala can query newly added columns, but other tools in the ecosystem, such as Pig, may face challenges. Parquet is supported by Cloudera and optimized for Cloudera Impala, and native Parquet support is rapidly being added for the rest of the Hadoop ecosystem.
Note – Parquet column names must be lowercase. If a column name contains uppercase characters, Parquet will not work, and it fails silently without logging any errors.
Common Use Case Recommendations
| File Format | Use Case Recommendations |
| --- | --- |
| Sequence files | Storing intermediate data between MapReduce jobs. |
| ORC (Hortonworks/Hive) or Parquet (Cloudera/Impala) | When query performance against the data is most important; the downside is that writes are slow. |
| Avro | When your schema will change over time; query performance will be slower than ORC or Parquet. |
| CSV files | Extracting data from Hadoop to bulk load into a database. |
As discussed, each file format is optimized by purpose. Your choice of format is driven by your use case and environment. Here are the key factors to consider:
- Hadoop Distribution – Cloudera and Hortonworks support/favor different formats.
- Schema Evolution – Will the structure of your data evolve? In what way?
- Processing Requirements – Will you be crunching the data, and with what tools?
- Read/Query Requirements – Will you be using SQL on Hadoop? Which engine?
- Extract Requirements – Will you be extracting the data from Hadoop for import into an external database engine or other platform?
- Storage Requirements – Is data volume a significant factor? Will you get significantly more bang for your storage buck through compression?