What is Hadoop?
- an environment built on a distributed filesystem
- scalability (the function is sent to the data), robustness (redundancy)
- a collection of open-source tools
- typically provided as a service (because it is complicated to manage)
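The robustness-through-redundancy point can be illustrated with a toy model. The sketch below is not HDFS code: the function names and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware), and the block size is shrunk to a few bytes so the example stays readable. The replication factor of 3 matches the usual HDFS default.

```python
BLOCK_SIZE = 4    # bytes per block in this toy model (HDFS defaults to 128 MB)
REPLICATION = 3   # copies kept of every block (the usual HDFS default)


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + k) % len(nodes)] for k in range(replication)]
        for block_id in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable if every block has at least one live replica."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )
```

With four nodes and three replicas per block, any two nodes can fail and every block still has a live copy; losing three nodes at once can make a block unreachable.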
Two main cornerstones of Hadoop
- HDFS (Hadoop Distributed File System)
- MapReduce paradigm - sending the function to the data and collecting the results; inspired by pioneering work
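The MapReduce idea can be sketched in plain Python with the classic word count. This is a single-process simulation, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and each inner list stands in for an input split that a real cluster would process on the node where the data lives.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run map over every split, then shuffle and reduce the results."""
    mapped = (pair for split in splits for line in split for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

For example, `word_count([["big data big"], ["data tools"]])` counts words across both splits; only the small `(word, 1)` pairs travel between the phases, not the input data itself.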
Other important tools
Development tools
- MapReduce - see above
- Hive - data warehouse tool with HQL (Hive Query Language), an SQL-like language that is automatically translated into MapReduce jobs. Read-only queries, high latency. Created at Facebook.
- Pig - language and runtime; scripts are translated into MapReduce jobs. Created at Yahoo.
- Jaql
- Mahout
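To make the "translated into MapReduce jobs" idea concrete, here is a rough sketch of how a grouped aggregate could be split into a map step and a reduce step. The table `employees(dept, salary)` and all function names are hypothetical, and the real Hive planner is far more involved; this only mimics the shape of the translation in a single process.

```python
from collections import defaultdict

# Query being mimicked (hypothetical table):
#   SELECT dept, AVG(salary) FROM employees GROUP BY dept;


def map_row(row):
    """Map: the GROUP BY column becomes the key, the aggregated column the value."""
    return row["dept"], row["salary"]


def reduce_group(dept, salaries):
    """Reduce: apply the aggregate (AVG) to all values of one group."""
    return dept, sum(salaries) / len(salaries)


def run_query(rows):
    """Group mapped rows by key, then reduce each group."""
    groups = defaultdict(list)
    for row in rows:
        key, value = map_row(row)
        groups[key].append(value)
    return dict(reduce_group(dept, salaries) for dept, salaries in groups.items())
```

Every such query has to start a full batch job, which is one way to see why latency is high even for small tables.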
Data storage and management tools
- HDFS - see above
- Cassandra - NoSQL (key-value) DB, alternative to HDFS, fast
- HBase - non-relational DB on top of HDFS, good for sparse data
- HCatalog - Hadoop tables and storage management
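Why a column-oriented store such as HBase suits sparse data can be shown with a small model: only cells that actually hold a value are stored, so a missing column costs nothing. The class below is an illustrative assumption, not the HBase client API; the `family:qualifier` column names merely imitate HBase's naming style.

```python
from collections import defaultdict


class SparseTable:
    """Toy row store: row key -> {column name -> value}, absent cells are not stored."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; other columns of the row stay absent."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Read one cell, returning `default` when it was never written."""
        return self._rows.get(row_key, {}).get(column, default)

    def stored_cells(self):
        """Count the cells that physically exist in the table."""
        return sum(len(columns) for columns in self._rows.values())
```

Two rows with one distinct column each occupy two cells here, whereas a fixed relational schema would reserve four (two of them NULL).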
Control tools
- ZooKeeper - configuration management, synchronization...
- Oozie - job management
Data aggregation and mining
- Sqoop, Chukwa, Flume
Article about Hadoop (in Czech)