Druid.io

4.11.2015 05:46

Druid is a column-oriented, open-source, distributed data store written in Java.
It is built to allow fast ("real-time") access to large sets of seldom-changing data. It was designed to run as a service and to maintain 100% uptime in the face of code deployments, machine failures and other eventualities of a production system.

external dependencies: deep storage (HDFS, S3, local filesystem), metadata storage (MySQL, Derby), ZooKeeper
works with timestamp-based records (real-time imports) and their analysis

Example usage:
timestamp-based aggregations
filtering, ordering by aggregated data
(e.g. Wikipedia edits: aggregate by time, filter by country, rank by number of edits)
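
The Wikipedia example above could be expressed as a single Druid topN query. The following is a minimal sketch that only builds the JSON body. The data source name `wikipedia`, the dimensions `country` and `page`, and the ingested metric `count` are assumptions made for illustration; they are not taken from a real schema.

```python
import json


def wikipedia_topn_query(country, interval, threshold=10):
    """Build a Druid topN query body as a plain dict.

    The data source, dimension and metric names are hypothetical.
    """
    return {
        "queryType": "topN",
        "dataSource": "wikipedia",   # hypothetical data source name
        "intervals": [interval],      # ISO 8601 interval: start/end
        "granularity": "day",         # aggregate by time: one result per day
        "filter": {                   # filter by country
            "type": "selector",
            "dimension": "country",
            "value": country,
        },
        "dimension": "page",          # rank pages ...
        "metric": "edits",            # ... by number of edits
        "threshold": threshold,       # keep only the top N pages
        "aggregations": [
            {"type": "longSum", "name": "edits", "fieldName": "count"},
        ],
    }


query = wikipedia_topn_query("CZ", "2015-11-01/2015-11-08", threshold=5)
print(json.dumps(query, indent=2))
```

The `metric` field refers to the output name of the aggregation, so the ranking uses the summed edit count inside each time bucket.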

Configuration:
single real-time node for testing (requires only ZooKeeper; neither HDFS nor a metadata database is needed)
cluster configuration:

Realtime node - ingests new data
Coordinator node - monitors the state of the nodes in the cluster
Historical node - holds the archived segments
Broker node - receives queries from clients, forwards them to the realtime and/or historical nodes, then merges the responses and sends them back to the clients
Indexer node - loads data into the system
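
To make the broker's role more concrete, here is a conceptual sketch of the merge step for a timeseries query. It is not Druid's actual code: the function name `merge_timeseries` and the sample rows are made up, and the approach only holds for additive aggregations such as sums and counts.

```python
from collections import defaultdict


def merge_timeseries(partial_results):
    """Merge per-node timeseries results by summing values per timestamp.

    Each partial result is a list of rows shaped like
    {"timestamp": ..., "result": {metric_name: number}}.
    """
    merged = defaultdict(lambda: defaultdict(int))
    for node_result in partial_results:
        for row in node_result:
            for name, value in row["result"].items():
                merged[row["timestamp"]][name] += value
    # Identically formatted ISO 8601 timestamps sort chronologically as strings.
    return [{"timestamp": ts, "result": dict(merged[ts])} for ts in sorted(merged)]


# Hypothetical partial answers from a historical node and a realtime node.
historical = [
    {"timestamp": "2015-11-04T00:00:00.000Z", "result": {"edits": 120}},
    {"timestamp": "2015-11-04T01:00:00.000Z", "result": {"edits": 80}},
]
realtime = [
    {"timestamp": "2015-11-04T01:00:00.000Z", "result": {"edits": 15}},
    {"timestamp": "2015-11-04T02:00:00.000Z", "result": {"edits": 7}},
]

merged = merge_timeseries([historical, realtime])
for row in merged:
    print(row["timestamp"], row["result"])
```

The hour covered by both nodes (01:00) ends up with the combined value, while the other hours pass through unchanged.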

Data
data segments (files holding the data for some span of time)
query speed: single-digit seconds over 6 TB of data
data source - similar to a table in a relational database

Fault tolerance
When some component dies…
historical node: can be replaced by another node, which simply loads the data from deep storage
deep storage: segments are duplicated on the historical nodes
coordinator: no changes to the topology can be made, but everything else keeps working
broker: several can be run in parallel
realtime: several can be run in parallel and ingest the same data
metadata storage: similar to the coordinator
ZooKeeper: similar to the coordinator

Data ingestion
realtime:
 - from Kafka, Samza, Storm - with the Tranquility plugin
 - realtime node (standalone)
 - Indexing Service nodes with Tranquility
batch ingestion:
 - HadoopDruidIndexer - better and simpler
 - Indexing Service nodes
    how is it started?

Querying
via an HTTP REST API, with queries written in JSON
timeseries - aggregates data over the given time range
topN
groupBy
SQL: through external libraries, see http://druid.io/docs/0.8.1/development/libraries.html (e.g. Sql4D - used at Yahoo)
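
As an illustration of the JSON-over-HTTP interface, the sketch below builds a timeseries query and wraps it in a POST request using only the Python standard library. The broker address `localhost:8082`, the data source `wikipedia` and the metric `count` are assumptions; adjust them to the actual cluster. The helper `run_query` is hypothetical and is defined but not called here, since it needs a running broker.

```python
import json
import urllib.request

# Assumed broker address; queries are POSTed to the /druid/v2/ endpoint.
BROKER_URL = "http://localhost:8082/druid/v2/?pretty"


def timeseries_query(data_source, interval, granularity="hour"):
    """Build a timeseries query body; the metric names are hypothetical."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        "aggregations": [
            {"type": "count", "name": "rows"},
            {"type": "longSum", "name": "edits", "fieldName": "count"},
        ],
    }


def build_request(query, url=BROKER_URL):
    """Wrap the query in an HTTP POST request with a JSON body."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=BROKER_URL):
    """Send the query to the broker and decode the JSON response."""
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))


query = timeseries_query("wikipedia", "2015-11-01/2015-11-04")
request = build_request(query)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

A topN or groupBy query would be sent the same way; only the JSON body changes.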

Created on 4 November 2015 at 05:46:43 by mira. Edited once, last on 20 February 2016 at 10:16:13 by mira.

