Hadoop Notes

Author: SpringWolfM | Published 2018-04-17 08:28

    [figure: Hadoop architecture]

    RHadoop: use the R programming language for statistical data processing
    Mahout: machine learning tools
    Hive and Pig: analyse data with SQL-like queries or simple scripts instead of writing MapReduce directly
    Sqoop: moves data into and out of the system

    Core Hadoop

    The two core parts of Hadoop:
    HDFS (Hadoop Distributed File System) -> stores data

    MapReduce -> processes data

    The Hadoop ecosystem (software that works together with Hadoop):



    Hadoop ecosystem: software that works alongside Hadoop, designed to make Hadoop easier to use.
    MapReduce code can be written in Java, Python, Ruby, or Perl, or the logic can even be expressed in SQL.

    Hive: interprets SQL-like queries (SELECT * FROM ...) into MapReduce (a small sketch follows this list)
    Pig: lets you analyse data in a simple scripting language rather than writing MapReduce
    (the script is turned into MapReduce and run on the cluster)
    Impala: queries HDFS directly with SQL, without going through MapReduce; it is built for low-latency queries, so it typically runs faster than Hive.
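
    A minimal sketch of running a Hive query from the terminal; the table and column names (purchases, cost) are hypothetical placeholders, not tables defined anywhere in this post:

    hive -e "SELECT SUM(cost) FROM purchases"   // Hive compiles the query into MapReduce jobs and prints the result

    The same query can also be typed interactively after starting the hive shell.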

    data input from outside HDFS

    Sqoop: takes data from a traditional relational database, such as Microsoft SQL Server, and puts it into HDFS, so the data can be processed together with the other data on the cluster (see the sketch after this list)
    Flume: injects data into the cluster as it is generated by external systems
    HBase: a real-time database built on top of HDFS
    Hue: a graphical front-end to the cluster
    Oozie: a workflow management tool
    Mahout: a machine learning library
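
    A rough sketch of what a Sqoop import might look like; the JDBC URL, credentials, table name, and target directory are all hypothetical placeholders:

    sqoop import --connect jdbc:mysql://dbhost/shop --username dbuser --password dbpass \
        --table purchases --target-dir /user/hadoop/purchases   // copy the table into HDFS as files

    Sqoop can also go the other way (sqoop export), pushing results from HDFS back into the relational database.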

    Cloudera has a distribution of Hadoop called CDH (free and open source) that packages together the tools of the Hadoop ecosystem.

    HDFS and MapReduce

    HDFS:
    stores one large file (i.e. a large data set) as several blocks
    DataNode: a daemon that stores blocks of data; a cluster (an HDFS) has many blocks spread across several DataNodes
    NameNode: a daemon that stores metadata about which blocks make up the original file

    [figures: a large file; HDFS content]
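
    One way to see this block layout for a real file (assuming a file such as purchases.txt has already been uploaded, as in the commands later in this post) is HDFS's fsck tool; the path here is just an example:

    hdfs fsck /user/hadoop/purchases.txt -files -blocks -locations   // list the file's blocks and which DataNodes hold each replica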

    Where HDFS can go wrong

    When a DataNode Fails

    • HDFS keeps replicas of each block: every block is stored three times in HDFS. So if one DataNode fails, the other DataNodes can still serve the data, and HDFS re-replicates the blocks that lived on the failed node.
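
    The replication factor can also be checked or changed per file. A small illustration, reusing the purchases.txt example (3 is the usual default):

    hadoop fs -setrep 3 purchases.txt   // ask HDFS to keep 3 replicas of every block of this file
    hadoop fs -ls purchases.txt         // the second column of the listing is the file's replication factor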


    When the NameNode Fails

    This is a single point of failure.

    One answer is NFS (Network File System): store the metadata on a remote disk as well. If the NameNode loses all of its data, there is still a copy of the metadata out on the network.


    [figure: running 2 NameNodes]

    Basic Hadoop commands

    HDFS is manipulated with Unix-like commands. In a terminal:

    hadoop fs -ls                           // show information about all files
    hadoop fs -put purchases.txt            // put purchases.txt into HDFS
    hadoop fs -tail purchases.txt           // show the last few lines of purchases.txt
    hadoop fs -cat purchases.txt            // show the entire contents of the file
    hadoop fs -mv purchases.txt newname.txt // rename
    hadoop fs -rm newname.txt               // delete the file
    hadoop fs -mkdir myinput                // create a directory in HDFS named myinput
    hadoop fs -put purchases.txt myinput    // upload purchases.txt into the new directory
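
    Two more commands that are handy for getting data back out and checking what landed where (the file and directory names follow the example above):

    hadoop fs -get purchases.txt local_purchases.txt   // copy a file from HDFS back to the local disk
    hadoop fs -ls myinput                              // list the contents of the myinput directory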
    

    MapReduce

    The file is divided into chunks, which are then processed in parallel.


    Data is stored as <Key, Value> pairs.
    [figures: the key/value pair problem; the MapReduce process]

    How can the final results be in sorted order?

    That leads to another question:

    If there are only 2 reducers, which keys go to the first reducer?



    We don't know, because there is no guarantee that each reducer gets the same number of keys; one might even get none.

    Daemons of MapReduce

    It is similar to the relationship between the NameNode and the DataNodes.
    A job is submitted to the JobTracker, which splits the work into mappers and reducers. The TaskTrackers run on the same machines as the DataNodes, so computation happens next to the data. If all the DataNodes holding a particular block (say, the green one in the diagram) are busy, another DataNode is chosen to process that block and the data is streamed to it over the network (this happens rather rarely).
    The mappers read their input data and produce intermediate data, which the Hadoop framework passes to the reducers (shuffle and sort). The reducers then process that data and write their final output back to HDFS.


    [figure: JobTracker and TaskTrackers]

    code for running a job

    Java or Python are typical; because of Hadoop Streaming, the code can be written in pretty much any language (a sketch of a Streaming job follows).
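
    A rough sketch of launching a Streaming job from a terminal; the path to the streaming jar varies by installation, and mapper.py / reducer.py are hypothetical scripts that read lines from stdin and write key/value lines to stdout:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -files mapper.py,reducer.py \
        -mapper mapper.py -reducer reducer.py \
        -input myinput -output joboutput   // the output directory must not exist yet

    The same scripts can be tested locally without a cluster: cat purchases.txt | ./mapper.py | sort | ./reducer.py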

    Configure a single-node cluster

    The official tutorial is still the most reliable reference!!
    https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/SingleCluster.html
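
    As a rough outline from memory (check the page above for the core-site.xml / hdfs-site.xml edits and the passphraseless-ssh setup that come first), the pseudo-distributed setup boils down to running, from the Hadoop install directory:

    bin/hdfs namenode -format                 // format the filesystem for the NameNode
    sbin/start-dfs.sh                         // start the NameNode and DataNode daemons
    // the NameNode web UI is at http://localhost:9870/ by default in Hadoop 3.x
    bin/hdfs dfs -mkdir -p /user/<username>   // create a home directory in HDFS
    sbin/stop-dfs.sh                          // stop the daemons when done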

