
Big Data File Formats
March 16, 2022

Choosing the correct file format is one of the crucial steps in big data projects. When working with MapReduce or Spark, a prime concern is the time it takes to locate and read the relevant data, as this can affect the performance of the entire job.

A proper file format plays a vital role in Big Data projects.

Hadoop uses replication to store data redundantly for fault tolerance, which increases storage cost. Processing cost, which includes CPU, network, and I/O costs, should also be taken into consideration.

Using the correct file format for a given scenario can help lower these costs and deliver better performance.

Advantages of using appropriate file formats:

  1. Faster reads
  2. Faster writes
  3. Support for splittable files
  4. Support for schema evolution
  5. Advanced compression

File formats:

CSV (comma-separated values)

CSV files are comma-delimited text files in which data is stored in a row-based format.

They are mostly used for exchanging tabular data, with each header/value separated by a delimiter such as a comma (",") or a pipe ("|"). Generally, the first row contains the header names.

For example, consider the following data in tabular format:

CustId  | First_Name | Last_Name | City
9563218 | FN1        | LN1       | Delhi
9558120 | FN2        | LN2       | Kolkata

The above data will look as follows in CSV format:

CustId,First_Name,Last_Name,City

9563218,FN1,LN1,Delhi

9558120,FN2,LN2,Kolkata

CSV is a suitable option when tabular data needs to be processed in spreadsheets or read by humans.

  • Use this format for analysis, POCs, or small data sets.
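As a quick illustration (not part of the original example), the sketch below writes the sample records above to a CSV file and reads them back with pandas; the pandas library and the file name customers.csv are assumptions made for this snippet.

```python
import pandas as pd

# Sample customer records from the table above
df = pd.DataFrame({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write to CSV (sep can be changed to "|" for pipe-delimited files)
df.to_csv("customers.csv", index=False)

# Read it back; the first row is treated as the header by default
customers = pd.read_csv("customers.csv")
print(customers)
```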

PARQUET file format

Parquet is an open-source file format for Hadoop.

Parquet helps to achieve efficient storage and performance. This is a column-oriented format, where the values of each column of the same type in the records are stored together.

To understand the Parquet file format, let's take an example.

Example: Consider a table with the columns CustId, First_Name, Last_Name, and City. All the values of the CustId column are stored together, the values of the First_Name column are stored together, and the remaining columns are stored in the same way. Using the same record schema with these four fields, the table looks like:

CustId  | First_Name | Last_Name | City
9563218 | FN1        | LN1       | Delhi
9558120 | FN2        | LN2       | Kolkata

For this table, the data in a row-wise storage format will be stored as follows:

9563218, FN1, LN1, Delhi, 9558120, FN2, LN2, Kolkata

Whereas the same data in a column-oriented storage format will look like this:

9563218, 9558120, FN1, FN2, LN1, LN2, Delhi, Kolkata

  • The columnar storage format is relatively more efficient when the requirement is to fetch only a few columns from a table.

The column-oriented file format increases query performance because it takes less time to fetch the required column values. It also requires less I/O, since the required columns are stored adjacent to each other.
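To see the column-pruning benefit in practice, here is a minimal sketch using pyarrow (assumed to be installed; the file name is illustrative): it writes the same records to a Parquet file and then reads back only two of the four columns, so the remaining column chunks are never fetched.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# The same customer records, held column by column in an Arrow table
table = pa.table({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write a Parquet file; each column is stored (and compressed) separately
pq.write_table(table, "customers.parquet")

# Read back only the columns the query needs; the other column chunks are skipped
subset = pq.read_table("customers.parquet", columns=["CustId", "City"])
print(subset.to_pandas())
```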

AVRO file format

Avro is a row-based storage format, which is widely used for serialization.

Avro relies on a schema, which is stored in JSON format; this makes the data easy for any program to read and understand. The data itself is stored in a binary format, making it compact and efficient.
One of the prime features of Avro is its support for dynamic data schemas that change over time. Since the format supports schema evolution, it can easily handle schema changes such as missing fields, added fields, and changed fields.

  • The Avro format is preferred for landing data in a data lake, because downstream systems can easily retrieve table schemas from the files, and any source schema changes can be easily handled.
  • Due to its efficient serialization and deserialization property, it offers good performance.
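A minimal sketch of Avro serialization, assuming the fastavro Python library (any Avro library would work similarly); the JSON schema is declared inline and is embedded in the resulting binary file.

```python
from fastavro import parse_schema, reader, writer

# The Avro schema is defined in JSON and travels with the data file
schema = parse_schema({
    "type": "record",
    "name": "Customer",
    "fields": [
        {"name": "CustId", "type": "long"},
        {"name": "First_Name", "type": "string"},
        {"name": "Last_Name", "type": "string"},
        {"name": "City", "type": "string"},
    ],
})

records = [
    {"CustId": 9563218, "First_Name": "FN1", "Last_Name": "LN1", "City": "Delhi"},
    {"CustId": 9558120, "First_Name": "FN2", "Last_Name": "LN2", "City": "Kolkata"},
]

# Serialize the records to a compact binary Avro file
with open("customers.avro", "wb") as out:
    writer(out, schema, records)

# Deserialize; the schema embedded in the file drives the decoding
with open("customers.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```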

ORC file format

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data. It was designed to overcome the limitations of other file formats and improves overall performance when Hive (a SQL-like interface built on top of Hadoop) reads, writes, and processes data.

ORC stores collections of rows in one file and within the collection, the row data is stored in a columnar format.

[Figure: ORC file structure, showing stripes (index data, row data, stripe footer), the file footer, and the postscript]

Image reference: https://cwiki.apache.org/confluence/display/hive/languagemanual+orc

In an ORC file, groups of row data are called stripes; the file footer also contains auxiliary information about the file's contents.

The postscript, present at the end of the file, holds the compression parameters and the size of the compressed footer. The default stripe size is 250 MB; large stripe sizes enable large, efficient reads from HDFS.

The stripe footer stores a directory of stream locations.

Row data is used in table scans.

Index data consists of the min and max values for each column, along with each column's row positions.

ORC generally stores data more compression-efficiently than the other file formats.
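A minimal sketch of writing and reading ORC from Python, assuming a recent pyarrow build with ORC support; in a Hadoop cluster, the same data would more typically be written and queried through Hive or Spark.

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Write an ORC file; rows are grouped into stripes with per-column min/max indexes
orc.write_table(table, "customers.orc")

# Read it back; as with Parquet, individual columns can be selected
result = orc.read_table("customers.orc", columns=["CustId", "City"])
print(result.to_pandas())
```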

Feather file format

Feather is a portable file format for storing Arrow tables or data frames (from languages such as Python and R) that utilizes the Arrow IPC format. The Arrow IPC format, part of the Arrow columnar specification, is designed for transporting large quantities of data in chunks.

It is a fast, lightweight, and easy-to-use binary file format for storing data frames.

Compression was not supported in the earlier version of Feather. However, Feather V2 files (the default version) support two fast compression libraries: LZ4 (using the frame format) and ZSTD.

  • This file format provides high read and write performance; Feather operations should be bound by local disk performance whenever possible. Its internal structure supports random access and slicing from the middle, which means a large file can be read piece by piece without pulling the whole thing into memory.

[Figure: Feather V2 file layout]

Image Reference: Feather V2 with Compression Support in Apache Arrow 0.17.0 · Ursa Labs
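A minimal sketch of Feather V2 usage with pyarrow (assumed to be installed), showing the optional compression codec and column selection on read.

```python
import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({
    "CustId": [9563218, 9558120],
    "First_Name": ["FN1", "FN2"],
    "Last_Name": ["LN1", "LN2"],
    "City": ["Delhi", "Kolkata"],
})

# Feather V2 is the default; compression can be "lz4", "zstd", or "uncompressed"
feather.write_feather(df, "customers.feather", compression="zstd")

# Read back the whole frame, or just a subset of columns
restored = feather.read_feather("customers.feather", columns=["CustId", "City"])
print(restored)
```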

Comparison among Avro, Parquet, and ORC file formats

Options                        | Avro File | Parquet File | ORC File
Schema evolution support       | Best      | Good         | Better
Compression of files           | Good      | Better       | Best
Splittability support          | Good      | Good         | Best
Row or column based            | Row       | Column       | Column
Optimized for (read or write)  | Write     | Read         | Write

CSV is an easy-to-use, human-readable, and widely used format for representing tabular data, but it lacks many of the capabilities that the other formats provide. It is too slow to be used for querying a data lake.

ORC and Parquet are widely used in the Hadoop ecosystem to query data: ORC is mostly used in Hive, and Parquet is the default format for Spark.

Avro can be used outside of Hadoop, like in Kafka.

Row-oriented formats usually offer better schema evolution capabilities than column-oriented formats, which makes them a good fit for ingestion.

Feather files generally offer faster read and write performance on solid-state drives, due to their simpler compression scheme.

Reference

  1. https://cwiki.apache.org/confluence/display/hive/languagemanual+orc
  2. https://en.wikipedia.org/wiki/Comma-separated_values
  3. Feather V2 with Compression Support in Apache Arrow 0.17.0 · Ursa Labs
