Druid is a column-oriented, open-source, distributed data store written in Java.
Druid is a system built to allow fast ("real-time") access to large sets of seldom-changing data. It was designed with the intent of being a service and maintaining 100% uptime in the face of code deployments, machine failures and other eventualities of a production system.
external dependencies: deep storage (HDFS, S3, local filesystem), metadata storage (MySQL, Derby), ZooKeeper
timestamp-based records (real-time imports) and their analysis
Example usage:
timestamp-based aggregations
filtering, ordering by aggregated data
(Wikipedia edits: aggregate by time, filter by country, rank by number of edits)
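The Wikipedia example maps naturally onto a native topN query. The sketch below only builds the JSON body and does not contact a cluster. The data source name `wikipedia`, the dimensions `page` and `country`, the input column `count`, the country value and the interval are all assumptions made for illustration.

```python
import json


def build_wikipedia_topn(country, interval, threshold=10):
    """Build a topN query body: pages ranked by number of edits, per day."""
    return {
        "queryType": "topN",
        "dataSource": "wikipedia",  # assumed data source name
        "granularity": "day",       # aggregate by time
        "intervals": [interval],
        # filter by country (assumed dimension name)
        "filter": {"type": "selector", "dimension": "country", "value": country},
        # sum the assumed per-row "count" column into an "edits" metric
        "aggregations": [{"type": "longSum", "name": "edits", "fieldName": "count"}],
        "dimension": "page",        # assumed dimension to rank
        "metric": "edits",          # rank by number of edits
        "threshold": threshold,
    }


query = build_wikipedia_topn("Czech Republic", "2015-09-01/2015-09-08")
print(json.dumps(query, indent=2))
```

The `metric` field refers to the name of the aggregation, which is why both use `edits`.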
Configuration:
single real-time node for testing (requires only ZooKeeper; no HDFS or metadata database needed)
cluster configuration:
Realtime node - loads the new data
Coordinator node - monitors the state of the nodes in the cluster
Historical node - holds the archived segments
Broker node - receives queries from clients, forwards them to the realtime and/or historical nodes, then merges the responses and sends them back to the clients
Indexer node - loads data into the system
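The broker's scatter-and-merge role can be illustrated with a small sketch. This is not Druid's implementation; the function name `merge_partial_results` and the sample data are hypothetical, and real partial results carry more structure than a plain mapping of key to count.

```python
from collections import Counter


def merge_partial_results(partials, limit):
    """Sum per-key counts from several nodes and return the top `limit` keys."""
    totals = Counter()
    for partial in partials:
        totals.update(partial)  # adds counts for keys seen on several nodes
    # Sort by count descending, then by key, so ties are deterministic.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical partial answers from one realtime and one historical node.
realtime_partial = {"Prague": 3, "Brno": 1}
historical_partial = {"Prague": 10, "Ostrava": 4, "Brno": 2}

merged = merge_partial_results([realtime_partial, historical_partial], limit=2)
print(merged)  # [('Prague', 13), ('Ostrava', 4)]
```

The client only ever sees the merged list, so it does not need to know which node held which part of the data.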
Data
data segments (files with data over some span of time)
speed: single-digit seconds for queries over 6 TB of data
data source - similar to table in relational databases
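The idea of a segment as a file covering one span of time can be sketched as follows. This is an illustration only: the hourly granularity, the function names and the sample events are assumptions, and real segments are columnar files rather than Python lists.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def hourly_interval(ts):
    """Return the (start, end) of the hour that contains `ts`."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return (start, start + timedelta(hours=1))


def partition_into_segments(events):
    """Group events by the hourly interval their timestamp falls into."""
    segments = defaultdict(list)
    for event in events:
        segments[hourly_interval(event["timestamp"])].append(event)
    return dict(segments)


events = [
    {"timestamp": datetime(2015, 9, 1, 10, 5, tzinfo=timezone.utc), "page": "Druid"},
    {"timestamp": datetime(2015, 9, 1, 10, 40, tzinfo=timezone.utc), "page": "Kafka"},
    {"timestamp": datetime(2015, 9, 1, 11, 2, tzinfo=timezone.utc), "page": "Druid"},
]
segments = partition_into_segments(events)
for interval in sorted(segments):
    print(interval[0].isoformat(), interval[1].isoformat(), len(segments[interval]))
```

A query restricted to a time range then only needs to touch the segments whose intervals overlap that range.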
Fault tolerance
When some component dies…
historical node: can be replaced by another node, which simply loads the data from deep storage
deep storage: segments are also held on historical nodes, so data that is already loaded stays queryable
coordinator: no changes to the topology can be made while it is down, but the cluster keeps serving queries
broker: can be run in parallel
realtime: can be run in parallel and ingest the same data
metadata storage: similar to coordinator
ZooKeeper: similar to coordinator
Data ingestion
realtime:
- from Kafka, Samza, Storm - with Tranquility plugin
- realtime node (standalone)
- Indexing Service nodes with Tranquility
batch ingestion:
- HadoopDruidIndexer - better and simpler
- Indexing service nodes
how is it started?
Querying
via an HTTP REST API, with queries written in JSON
timeseries - aggregate data from the given time range
topN
groupBy
SQL: via external libraries: http://druid.io/docs/0.8.1/development/libraries.html (e.g. Sql4D, used at Yahoo)
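A minimal sketch of sending a timeseries query over HTTP, using only the Python standard library. The broker address `http://localhost:8082/druid/v2/`, the data source name `wikipedia` and the interval are assumptions; adjust them to the actual cluster. The example only builds the request, and `run_query` needs a reachable broker to return anything.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082/druid/v2/"  # assumed broker address


def build_timeseries_query(data_source, interval, granularity="hour"):
    """Build a timeseries query that counts rows per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        "aggregations": [{"type": "count", "name": "rows"}],
    }


def build_request(query, url=BROKER_URL):
    """Wrap the query as an HTTP POST with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=BROKER_URL):
    """Send the query and decode the JSON response (requires a running broker)."""
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(build_timeseries_query("wikipedia", "2015-09-01/2015-09-02"))
print(request.get_method(), request.full_url)
```

A topN or groupBy query can be sent the same way; only the JSON body changes.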