Choosing the correct file format is one of the crucial steps in a big data project. With MapReduce and Spark, a prime concern is the time it takes to locate and read the relevant data, which can affect the performance of the whole job.
Because Hadoop uses replication to store data redundantly for fault tolerance, replication increases the storage cost. Processing cost, which includes CPU, network, and I/O costs, should also be taken into consideration.
Using the correct file format for a given scenario can help us lower these costs and achieve better performance.
Advantages of using appropriate file formats:
- Faster reads
- Faster writes
- Support for splittable files
- Support for schema evolution
- Support for advanced compression
File formats:
CSV (comma-separated values)
CSV files are delimited text files that store data in a row-based format.
They are mostly used for exchanging tabular data, where each header and value is separated by a delimiter such as a comma (",") or a pipe ("|"). Generally, the first row contains the header names.
For example, consider the following data in tabular format:
CustId  | First_Name | Last_Name | City
9563218 | FN1        | LN1       | Delhi
9558120 | FN2        | LN2       | Kolkata
The above data will look like this in CSV format:
CustId,First_Name,Last_Name,City
9563218,FN1,LN1,Delhi
9558120,FN2,LN2,Kolkata
CSV is well suited to tabular data intended for spreadsheet processing or human reading.
- Use this format for analysis, POCs, or small data sets.
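As a quick sketch (assuming pandas is available; file names are hypothetical), the customer table above can be written to and read back from CSV like this:

```python
import pandas as pd

# Sample customer table from the example above.
df = pd.DataFrame({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write as CSV; the first row of the file carries the header names.
df.to_csv("customers.csv", index=False)

# Read it back; pandas infers the types from the text, which is one reason
# CSV is convenient for analysis, POCs, and small data sets.
print(pd.read_csv("customers.csv"))
```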
PARQUET file format
Parquet is an open-source file format for Hadoop.
Parquet helps achieve efficient storage and good performance. It is a column-oriented format, in which all the values of a column across the records are stored together.
To understand the Parquet file format, let's take an example.
Example: For a table with the columns CustId, First_Name, Last_Name, and City, all the values for the CustId column are stored together, all the values for the First_Name column are stored together, and so on for each remaining column. Using the same record schema as above with these four fields, the table looks like this:
CustId  | First_Name | Last_Name | City
9563218 | FN1        | LN1       | Delhi
9558120 | FN2        | LN2       | Kolkata
For this table, the data in a row-wise storage format will be stored as follows:
9563218 | FN1 | LN1 | Delhi | 9558120 | FN2 | LN2 | Kolkata
Whereas the same data in a column-oriented storage format will look like this:
9563218 | 9558120 | FN1 | FN2 | LN1 | LN2 | Delhi | Kolkata
- The columnar storage format is more efficient when the workload needs to fetch only a few columns of a table.
A column-oriented file format improves query performance because it takes less time to fetch the required column values, and it needs less I/O because the values of each column are stored adjacent to each other.
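A minimal sketch of this column pruning in practice (assuming pandas with the pyarrow engine installed; file names are hypothetical): only the requested columns are read from the Parquet file.

```python
import pandas as pd

# Same sample table as above.
df = pd.DataFrame({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write the table in Parquet's columnar layout.
df.to_parquet("customers.parquet", engine="pyarrow")

# Read back only two columns; because each column is stored contiguously,
# the reader can skip the First_Name and Last_Name data entirely.
subset = pd.read_parquet("customers.parquet", columns=["CustId", "City"])
print(subset)
```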
AVRO file format
Avro is a row-based storage format, which is widely used for serialization.
Avro relies on a schema, which is stored in JSON format and is therefore easy for any program to read and understand. The data itself is stored in a binary format, making it compact and efficient.
One of the prime features of Avro is that it supports dynamic data schemas that change over time. Since this format supports schema evolution, it can easily handle schema changes like missing fields, added fields, and changed fields.
- The Avro format is preferred for the landing zone of a data lake, because downstream systems can easily retrieve table schemas from the files, and any source schema changes are easily handled.
- Due to its efficient serialization and deserialization, it offers good performance.
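A minimal sketch of writing and reading Avro, assuming the third-party fastavro package is installed (file names are hypothetical); the JSON schema travels with the binary row data, which is what makes schema evolution manageable:

```python
from fastavro import parse_schema, reader, writer

# The Avro schema is plain JSON stored alongside the binary records.
schema = parse_schema({
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "CustId", "type": "long"},
        {"name": "First_Name", "type": "string"},
        {"name": "Last_Name", "type": "string"},
        {"name": "City", "type": "string"},
    ],
})

records = [
    {"CustId": 9563218, "First_Name": "FN1", "Last_Name": "LN1", "City": "Delhi"},
    {"CustId": 9558120, "First_Name": "FN2", "Last_Name": "LN2", "City": "Kolkata"},
]

# Write the rows in Avro's compact, binary, row-based layout.
with open("customers.avro", "wb") as out:
    writer(out, schema, records)

# Read them back; the schema embedded in the file drives deserialization.
with open("customers.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```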
ORC file format
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. It was designed to overcome the limitations of other file formats, and it improves the overall performance when Hive (a SQL-like interface built on top of Hadoop) reads, writes, and processes data.
ORC stores collections of rows in one file and within the collection, the row data is stored in a columnar format.
[Figure: ORC file structure. Image reference: https://cwiki.apache.org/confluence/display/hive/languagemanual+orc]
An ORC file contains groups of row data called stripes, along with a file footer that holds auxiliary information.
The postscript, located at the end of the file, holds the compression parameters and the size of the compressed footer. The default stripe size is 250 MB; large stripe sizes enable large, efficient reads from HDFS.
Within each stripe:
- The stripe footer contains a directory of stream locations.
- Row data is used in table scans.
- Index data contains the minimum and maximum values for each column, along with the row positions within each column.
ORC generally achieves better compression than the other file formats.
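A minimal sketch using pyarrow (assuming a build with ORC support; file names are hypothetical); reading a single column shows how the columnar layout inside each stripe lets a reader skip unneeded data:

```python
import pyarrow as pa
import pyarrow.orc as orc

# Same sample table, as an Arrow table.
table = pa.table({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write an ORC file; the data is grouped into stripes with column-wise streams.
orc.write_table(table, "customers.orc")

# Read back only the City column; the stripe and index metadata let the
# reader avoid scanning the other columns.
orc_file = orc.ORCFile("customers.orc")
print(orc_file.read(columns=["City"]))
```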
Feather file format
Feather is a portable file format for storing Arrow tables or data frames (from languages such as Python and R) that utilizes the Arrow IPC format internally. The Arrow IPC format, part of the Arrow columnar specification, is designed for transporting large quantities of data in chunks.
It is a fast, lightweight, and easy-to-use binary file format for storing data frames.
Compression was not supported in the earlier version of Feather. However, Feather V2 files (the default version) support two fast compression libraries: LZ4 (using the frame format) and ZSTD.
- This file format provides high read and write performance; wherever possible, Feather operations are bound by local disk performance. Its internal structure supports random access and slicing from the middle, which means a large file can be read piece by piece without pulling the whole thing into memory.
[Image reference: "Feather V2 with Compression Support in Apache Arrow 0.17.0", Ursa Labs]
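A minimal sketch with pyarrow's feather module (the file name is hypothetical); Feather V2 is the default on write, and the compression argument selects between the LZ4 and ZSTD codecs mentioned above:

```python
import pandas as pd
import pyarrow.feather as feather

# Sample data frame to round-trip through Feather.
df = pd.DataFrame({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write a Feather V2 file compressed with LZ4 ("zstd" is the other option).
feather.write_feather(df, "customers.feather", compression="lz4")

# Read it back into a pandas data frame.
print(feather.read_feather("customers.feather"))
```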
Comparison among the Avro, Parquet, and ORC file formats
Options                      | Avro File | Parquet File | ORC File
Schema evolution support     | Best      | Good         | Better
Compression of files         | Good      | Better       | Best
Splittability support        | Good      | Good         | Best
Row or column based          | Row       | Column       | Column
Optimized for read or write  | Write     | Read         | Write
The CSV format is easy to use, human-readable, and widely used for representing tabular data, but it lacks many of the capabilities that the other formats provide, and it is too slow for querying a data lake.
ORC and Parquet are widely used in the Hadoop ecosystem to query data: ORC is mostly used with Hive, and Parquet is the default format for Spark.
Avro can also be used outside Hadoop, for example with Kafka.
Row-oriented formats usually offer better schema evolution support and write performance than column-oriented formats, which makes them a good fit for data ingestion.
Feather files generally offer faster read and write performance when used with solid-state drives, due to their simpler compression scheme.