Cloudera Impala

Fast as Impala.

Impala is the open source, analytic MPP database for Apache Hadoop that provides the fastest time-to-insight.

Key features:

Architecture

Daemon - runs on each DataNode

reads and writes to data files
gets the query from clients and becomes the coordinator for that query = parallelizes the query, send the subqueries to other data nodes, gets the results back and merges them for the final response for client

Statestore

Catalog service

only one in the cluster, typically on the same node as statestore
when the metadata of Impala tables are changed, it notifies all data nodes involved
reduces the necessity to perform REFRESH or INVALIDATE METADATA command only when metadata are changed through Hive or HDFS - but not through Impala itself

Querying the data

SQL - DML restricted to CREATE, DROP, INSERT and INSERT OVERWRITE
impala-shell
Hue web-based interface
JDBC, ODBC
Subqueries
———————————————
Hive metastore - používá se pro uložení metadat, ukládá ty z Impaly i umí číst ty z Hive
stejny HW - Impala rozdeluje rovnomerne
1.6 GB/second odhad rychlosti zpracování na nodu v ideálním případě
lepší jsou binární než textová data, Parquet
nutno spočítat STATS když se nalijou data
EXPLAIN a SUMMARY

Odkaz na článek

Vytvořil 4. listopadu 2015 v 05:48:01 mira. Upravováno 1x, naposledy 20. února 2016 v 10:16:43, mira

Diskuze ke článku