Hadoop Notes

Author: SpringWolfM | Published 2018-04-17 08:28

    [figure: Hadoop architecture]

    RHadoop: use the R programming language for statistical data processing
    Mahout: machine learning tools
    Hive and Pig: analyse data with SQL-like queries or simple scripts instead of writing MapReduce directly
    Sqoop: moves data into and out of the system

    Core Hadoop

    The two core parts of Hadoop:
    HDFS (Hadoop Distributed File System) -> stores data

    MapReduce -> processes data

    The Hadoop ecosystem (software that works together with Hadoop):



    Hadoop ecosystem: software that works alongside Hadoop, designed to make Hadoop easier to use.
    MapReduce code can be written in Java, Python, Ruby, or Perl, or the logic can even be expressed in SQL.

    Hive: interprets SQL-like queries (SELECT * FROM ...) into MapReduce (a small sketch follows this list)
    Pig: lets you analyse data in a simple scripting language rather than writing MapReduce
    (the script is turned into MapReduce and run on the cluster)
    Impala: queries HDFS directly with SQL, without going through MapReduce; it is built for low-latency queries, so it typically runs faster than Hive.
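
    A minimal sketch of running a Hive query from the terminal; the table and column names (purchases, cost) are hypothetical placeholders, not tables defined anywhere in this post:

    hive -e "SELECT SUM(cost) FROM purchases"   // Hive compiles the query into MapReduce jobs and prints the result

    The same query can also be typed interactively after starting the hive shell.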

    data input from outside HDFS

    Sqoop: takes data from a traditional relational database, such as Microsoft SQL Server, and puts it into HDFS, so the data can be processed together with the other data on the cluster (see the sketch after this list)
    Flume: injects data into the cluster as it is generated by external systems
    HBase: a real-time database built on top of HDFS
    Hue: a graphical front-end to the cluster
    Oozie: a workflow management tool
    Mahout: a machine learning library
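
    A rough sketch of what a Sqoop import might look like; the JDBC URL, credentials, table name, and target directory are all hypothetical placeholders:

    sqoop import --connect jdbc:mysql://dbhost/shop --username dbuser --password dbpass \
        --table purchases --target-dir /user/hadoop/purchases   // copy the table into HDFS as files

    Sqoop can also go the other way (sqoop export), pushing results from HDFS back into the relational database.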

    Cloudera has a distribution of Hadoop called CDH (free and open source) that packages together the tools of the Hadoop ecosystem.

    HDFS and MapReduce

    HDFS:
    stores one large file (i.e. a large data set) as several blocks
    DataNode: a daemon that stores blocks of data; a cluster (an HDFS) has many blocks spread across several DataNodes
    NameNode: a daemon that stores metadata about which blocks make up the original file

    [figures: a large file; HDFS content]
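
    One way to see this block layout for a real file (assuming a file such as purchases.txt has already been uploaded, as in the commands later in this post) is HDFS's fsck tool; the path here is just an example:

    hdfs fsck /user/hadoop/purchases.txt -files -blocks -locations   // list the file's blocks and which DataNodes hold each replica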

    Where HDFS can go wrong

    When a DataNode Fails

    • HDFS keeps replicas of each block: every block is stored three times in HDFS. So if one DataNode fails, the other DataNodes can still serve the data, and HDFS re-replicates the blocks that lived on the failed node.
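
    The replication factor can also be checked or changed per file. A small illustration, reusing the purchases.txt example (3 is the usual default):

    hadoop fs -setrep 3 purchases.txt   // ask HDFS to keep 3 replicas of every block of this file
    hadoop fs -ls purchases.txt         // the second column of the listing is the file's replication factor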


    When the NameNode Fails

    This is a single point of failure.

    One answer is NFS (Network File System): store the metadata on a remote disk as well. If the NameNode loses all of its data, there is still a copy of the metadata out on the network.


    [figure: running 2 NameNodes]

    Basic Hadoop commands

    HDFS is manipulated with Unix-like commands. In a terminal:

    hadoop fs -ls                           // show information about all files
    hadoop fs -put purchases.txt            // put purchases.txt into HDFS
    hadoop fs -tail purchases.txt           // show the last few lines of purchases.txt
    hadoop fs -cat purchases.txt            // show the entire contents of the file
    hadoop fs -mv purchases.txt newname.txt // rename
    hadoop fs -rm newname.txt               // delete the file
    hadoop fs -mkdir myinput                // create a directory in HDFS named myinput
    hadoop fs -put purchases.txt myinput    // upload purchases.txt into the new directory
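
    Two more commands that are handy for getting data back out and checking what landed where (the file and directory names follow the example above):

    hadoop fs -get purchases.txt local_purchases.txt   // copy a file from HDFS back to the local disk
    hadoop fs -ls myinput                              // list the contents of the myinput directory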
    

    MapReduce

    The file is divided into chunks, which are then processed in parallel.


    Data is stored as <Key, Value> pairs.
    [figures: the key/value pair problem; the MapReduce process]

    How can the final results be in sorted order?

    That leads to another question:

    If there are only 2 reducers, which keys go to the first reducer?



    We don't know, because there is no guarantee that each reducer gets the same number of keys; one might even get none.

    Daemons of MapReduce

    It is similar to the relationship between the NameNode and the DataNodes.
    A job is submitted to the JobTracker, which splits the work into mappers and reducers. The TaskTrackers run on the same machines as the DataNodes, so computation happens next to the data. If all the DataNodes holding a particular block (say, the green one in the diagram) are busy, another DataNode is chosen to process that block and the data is streamed to it over the network (this happens rather rarely).
    The mappers read their input data and produce intermediate data, which the Hadoop framework passes to the reducers (shuffle and sort). The reducers then process that data and write their final output back to HDFS.


    [figure: JobTracker and TaskTrackers]

    code for running a job

    Java or Python are typical; because of Hadoop Streaming, the code can be written in pretty much any language (a sketch of a Streaming job follows).
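
    A rough sketch of launching a Streaming job from a terminal; the path to the streaming jar varies by installation, and mapper.py / reducer.py are hypothetical scripts that read lines from stdin and write key/value lines to stdout:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -files mapper.py,reducer.py \
        -mapper mapper.py -reducer reducer.py \
        -input myinput -output joboutput   // the output directory must not exist yet

    The same scripts can be tested locally without a cluster: cat purchases.txt | ./mapper.py | sort | ./reducer.py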

    Configure a single-node cluster

    The official tutorial is still the most reliable reference!!
    https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/SingleCluster.html
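
    As a rough outline from memory (check the page above for the core-site.xml / hdfs-site.xml edits and the passphraseless-ssh setup that come first), the pseudo-distributed setup boils down to running, from the Hadoop install directory:

    bin/hdfs namenode -format                 // format the filesystem for the NameNode
    sbin/start-dfs.sh                         // start the NameNode and DataNode daemons
    // the NameNode web UI is at http://localhost:9870/ by default in Hadoop 3.x
    bin/hdfs dfs -mkdir -p /user/<username>   // create a home directory in HDFS
    sbin/stop-dfs.sh                          // stop the daemons when done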

