Impala is the open source, analytic MPP database for Apache Hadoop that provides the fastest time-to-insight.
Key features:
- MPP, powered by Hadoop
- querying data from HDFS, S3, HBase
- SQL syntax with fast responses (“interactive” - human-scale response times)
Architecture
Daemon - runs on each DataNode
- reads and writes to data files
- gets the query from clients and becomes the coordinator for that query = parallelizes the query, send the subqueries to other data nodes, gets the results back and merges them for the final response for client
Statestore
- communicates with Statestore - assures it
- for better robustness of cluster, but not necessary (cluster works without it)
- only one in the cluster
Catalog service
- only one in the cluster, typically on the same node as statestore
- when the metadata of Impala tables are changed, it notifies all data nodes involved
- reduces the necessity to perform REFRESH or INVALIDATE METADATA command only when metadata are changed through Hive or HDFS - but not through Impala itself
Querying the data
- SQL - DML restricted to CREATE, DROP, INSERT and INSERT OVERWRITE
- impala-shell
- Hue web-based interface
- JDBC, ODBC
- Subqueries
- ———————————————
- Hive metastore - používá se pro uložení metadat, ukládá ty z Impaly i umí číst ty z Hive
- stejny HW - Impala rozdeluje rovnomerne
- 1.6 GB/second odhad rychlosti zpracování na nodu v ideálním případě
- lepší jsou binární než textová data, Parquet
- nutno spočítat STATS když se nalijou data
- EXPLAIN a SUMMARY