Usage domains:
- APIs: Protocol Buffers (Protobuf), Thrift, Avro
- storage for data analysis (Hive, Impala, ...): ORC, Parquet, Avro
- data storage: SequenceFiles, compressed text (gzip, bzip2, lz4), Avro
Protobuf
- = Google's Protocol Buffers
- defines serialisation of a single record only (no container/file format)
- suited to data transport (messages over the wire)
- good for optional attributes: a message carries data only for the attributes that are actually present
- attributes are identified by numeric ids (tags), not by name - see the sketch after this list
- schema evolution
- similar: Cap'n Proto
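A minimal sketch of the id-based, optional-attribute encoding in Python. The Person message and the person_pb2 module are hypothetical, assumed to have been generated from a .proto file with protoc:

```python
# hypothetical schema, compiled with: protoc --python_out=. person.proto
#   syntax = "proto2";
#   message Person {
#     optional string name = 1;   // the numeric tags 1 and 2, not the
#     optional int32  age  = 2;   // field names, go on the wire
#   }
import person_pb2

p = person_pb2.Person()
p.name = "Ann"                      # "age" is left unset...
data = p.SerializeToString()        # ...so it takes no space in the output

q = person_pb2.Person()
q.ParseFromString(data)
print(q.name, q.HasField("age"))    # -> Ann False
```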
Thrift
- developed at Facebook, newer than Protobuf
- slightly slower, and its output slightly larger, than Protobuf
- richer built-in data types than Protobuf (e.g. sets)
- includes an RPC framework (client sketch below)
- schema evolution
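A client-side sketch of the RPC framework in Python. The Calculator service and the generated calculator module are hypothetical, assumed to have been compiled with `thrift --gen py calculator.thrift`:

```python
# hypothetical service definition in calculator.thrift:
#   service Calculator {
#     i32 add(1: i32 a, 2: i32 b)
#   }
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from calculator import Calculator   # generated code (assumption)

transport = TTransport.TBufferedTransport(TSocket.TSocket("localhost", 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Calculator.Client(protocol)

transport.open()
print(client.add(2, 3))   # remote call over Thrift's binary protocol
transport.close()
```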
Avro
- row-based
- defines both record serialisation and a container (file) format
- schema evolution
- schemas are written in JSON
- splittable in Hadoop
- container corruption: sync markers separate data blocks, so a corruption loses only the records from the damaged point to the end of that block; reading resumes at the next marker
- works well for complex tables with string attributes
- schema is stored in the file header => no external schema needed
- rows can be appended to an existing file (see the sketch after this list)
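A minimal container-file sketch using the fastavro library (assumed installed); it shows the JSON schema, the self-describing header, and row appends. The file and record names are made up:

```python
from fastavro import writer, reader

# Avro schemas are plain JSON
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# write a container file; the schema goes into the header
with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "Ann", "age": 31}])

# append more rows to the same container
with open("users.avro", "a+b") as out:
    writer(out, schema, [{"name": "Bob", "age": 25}])

# read back without any external schema
with open("users.avro", "rb") as f:
    for record in reader(f):
        print(record)
```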
Parquet + ORC
- column-based
- great when reading only a subset of attributes (column pruning) - see the sketch after this list
- schema in the footer
- splittable in Hadoop
- store per-column statistics (min, max, count); ORC additionally keeps indexes
- support hierarchical (nested) data structures
- write-once formats: existing files cannot be appended to
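A column-pruning sketch for Parquet using the pyarrow library (assumed installed); the table contents and file name are made up:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": [1, 2, 3],
    "score": [0.5, 0.9, 0.1],
    "comment": ["ok", "great", "bad"],
})
pq.write_table(table, "events.parquet")   # write-once: no later appends

# read only one column; the other columns are never read from disk
subset = pq.read_table("events.parquet", columns=["score"])

# per-column statistics live in the footer metadata
meta = pq.ParquetFile("events.parquet").metadata
print(meta.row_group(0).column(1).statistics)   # min/max/nulls for "score"
```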