Thursday, February 16, 2023

What is Parquet?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads. It is similar to other columnar-storage file formats available in Hadoop, namely RCFile and ORC.


Characteristics of Parquet

Free and open source file format.

Language agnostic.

Column-based format - files are organized by column, rather than by row, which saves storage space and speeds up analytics queries.

Used for analytics (OLAP) use cases, typically in conjunction with traditional OLTP databases.

Highly efficient data compression and decompression.

Supports complex data types and advanced nested data structures.
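
To make the column-based point above more concrete, here is a minimal sketch in plain Python. It is a toy illustration and not Parquet's actual on-disk format: the sample records, the field names, and the helpers `to_columns` and `run_length_encode` are all made up for this example. It shows why grouping values by column tends to help encodings such as run-length encoding, which is one of the encodings Parquet can apply: similar values sit next to each other and collapse into fewer runs.

```python
from itertools import groupby

# Hypothetical sample records; the field names are invented for illustration.
rows = [
    {"id": 1, "country": "US", "status": "active"},
    {"id": 2, "country": "US", "status": "active"},
    {"id": 3, "country": "US", "status": "inactive"},
    {"id": 4, "country": "DE", "status": "inactive"},
    {"id": 5, "country": "DE", "status": "inactive"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Column-oriented layout: each column is stored and encoded on its own.
columns = to_columns(rows)
encoded = {name: run_length_encode(values) for name, values in columns.items()}
column_runs = sum(len(runs) for runs in encoded.values())

# Row-oriented layout: values of different columns are interleaved,
# so consecutive values rarely repeat and the runs stay short.
row_major = [value for record in rows for value in record.values()]
row_runs = len(run_length_encode(row_major))

print(encoded["country"])
print("runs, column layout:", column_runs)
print("runs, row layout:", row_runs)
```

In this toy data set the column layout needs fewer runs than the row layout, because the repeated country and status values end up adjacent. Real Parquet files combine several encodings with general-purpose compression, so actual savings depend on the data.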


Benefits of Parquet

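Good for storing large volumes of data, especially structured data tables. Binary content such as images, videos, and documents can also be stored in byte-array columns, although the format is designed primarily for tabular data.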

Saves on cloud storage space by using highly efficient column-wise compression, and flexible encoding schemes for columns with different data types.

Increases data throughput and performance using techniques like data skipping, whereby queries that fetch specific column values read only those columns and need not read entire rows of data.
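
The data skipping idea above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `row_groups` layout, the column names, and the `scan` helper are hypothetical, and the minimum/maximum values are computed on the fly here, whereas a Parquet file keeps such statistics in its metadata. The sketch touches only the columns a query needs and skips whole row groups whose value range cannot match the filter.

```python
# Toy model of a columnar file: a list of row groups, each holding one list
# per column. The layout and the column names are hypothetical.
row_groups = [
    {"year": [2019, 2019, 2020], "sales": [10, 20, 30], "region": ["EU", "US", "EU"]},
    {"year": [2021, 2021, 2022], "sales": [40, 50, 60], "region": ["US", "US", "EU"]},
    {"year": [2023, 2023, 2023], "sales": [70, 80, 90], "region": ["EU", "EU", "US"]},
]


def scan(groups, select, filter_column, lower_bound):
    """Return `select` values for rows where filter_column >= lower_bound.

    Only the two named columns are accessed; every other column is ignored.
    """
    results = []
    groups_read = 0
    for group in groups:
        # Stand-in for per-row-group statistics; computed here for simplicity.
        highest = max(group[filter_column])
        if highest < lower_bound:
            continue  # No row in this group can match, so skip it entirely.
        groups_read += 1
        for key, value in zip(group[filter_column], group[select]):
            if key >= lower_bound:
                results.append(value)
    return results, groups_read


sales, groups_read = scan(
    row_groups, select="sales", filter_column="year", lower_bound=2022
)
print(sales)
print("row groups read:", groups_read, "of", len(row_groups))
```

The `region` column is never accessed, and the first row group is skipped because its highest `year` is below the filter bound. This is the kind of saving a columnar reader can achieve without scanning every row.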

References:

https://www.databricks.com/glossary/what-is-parquet
