Hadoop Basics

Author: SpringWolfM | Published 2018-04-17 08:28

Hadoop Architecture

(figure: Hadoop architecture)

RHadoop: use the R programming language for statistical data processing
Mahout: machine learning tools
Hive and Pig: query data with SQL-like and scripting languages instead of writing MapReduce directly
Sqoop: moves data into and out of the system

Core Hadoop

The two core parts of Hadoop:

HDFS (Hadoop Distributed File System): stores data

MapReduce: processes data

Software that works alongside Hadoop (the Hadoop ecosystem):


(figure: Hadoop ecosystem)

Hadoop ecosystem: software that works along with Hadoop, designed to make Hadoop easier to use. MapReduce code can be written in Java, Python, Ruby, or Perl, and some tools even let you use SQL.

HIVE: Hive interprets SQL-like queries, such as SELECT * FROM ..., into MapReduce jobs.
PIG: lets you analyze data in a simple scripting language rather than writing MapReduce directly
(the code is turned into MapReduce and run on the cluster).
Impala: queries data directly with SQL, without MapReduce, giving low-latency queries; it runs faster than Hive.
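Conceptually, Hive turns a query like SELECT product, SUM(cost) FROM purchases GROUP BY product into map, shuffle, and reduce steps. A minimal Python sketch of that idea (the table name and rows here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical rows of a "purchases" table: (product, cost).
rows = [("apple", 1.0), ("milk", 2.5), ("apple", 3.0), ("bread", 2.0)]

# Map: emit a (key, value) pair per row.
mapped = [(product, cost) for product, cost in rows]

# Shuffle: group all values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group (here, SUM).
totals = {key: sum(values) for key, values in groups.items()}
print(totals)  # {'apple': 4.0, 'milk': 2.5, 'bread': 2.0}
```

Hive generates and runs the equivalent of these steps as MapReduce jobs on the cluster, so you never write them by hand.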

Data Input from Outside HDFS

Sqoop: takes data from a traditional relational database, such as Microsoft SQL Server, and puts it into HDFS, so it can be processed along with the other data on the cluster.
Flume: ingests data as it is generated by external systems and puts it into the cluster.
HBase: a real-time database built on top of HDFS.
Hue: a graphical front-end to the cluster.
Oozie: a workflow management tool.
Mahout: a machine learning library.

Cloudera ships a distribution of Hadoop called CDH (free and open source), which packages together the tools in the Hadoop ecosystem.

HDFS and MapReduce

HDFS:
A large file is stored as several blocks spread across the cluster.
DataNode: a daemon that stores blocks of data; a cluster (an HDFS) has several DataNodes, each holding many blocks.
NameNode: a daemon that stores metadata about which blocks make up each original file.

(figure: a large file split into blocks in HDFS)

Where HDFS Can Fail

When a DataNode Fails

  • HDFS replicates each block: every block is stored three times across the cluster. So if one DataNode fails, other DataNodes can serve the data, and HDFS re-replicates the blocks that were on the failed node.
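The replication factor is configurable per cluster; in hdfs-site.xml the (default) value of 3 would be set like this (a minimal fragment, real configuration files usually carry other properties too):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```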


When the NameNode Fails

This is a single point of failure.

One answer is NFS (Network File System): store the metadata on a remote disk as well. If the NameNode loses all its data, there is still a copy of the metadata on the network.


Another option is to run two NameNodes (an active and a standby), so the standby can take over if the active one fails.

(figure: two NameNodes)

Basic Hadoop Commands

HDFS is manipulated with Unix-like commands. In a terminal:

hadoop fs -ls                            # list files in HDFS
hadoop fs -put purchases.txt             # upload purchases.txt to HDFS
hadoop fs -tail purchases.txt            # show the last few lines of purchases.txt
hadoop fs -cat purchases.txt             # show the entire contents of the file
hadoop fs -mv purchases.txt newname.txt  # rename the file
hadoop fs -rm newname.txt                # delete the file
hadoop fs -mkdir myinput                 # create a directory in HDFS named myinput
hadoop fs -put purchases.txt myinput     # upload purchases.txt into the new directory

MapReduce

The file is divided into chunks, which are then processed in parallel.


Intermediate data is stored as <key, value> pairs.

(figure: the MapReduce process)
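The process above can be sketched in plain Python as a word count (the input lines are made up; a real job reads from HDFS). Note that the framework sorts intermediate pairs by key before handing them to the reducers:

```python
from itertools import groupby

lines = ["the quick brown fox", "the lazy dog"]  # made-up input

# Map: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: MapReduce sorts intermediate pairs by key
# before handing them to the reducers.
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each word.
counts = {word: sum(v for _, v in group)
          for word, group in groupby(mapped, key=lambda kv: kv[0])}
print(counts)
```

Because of the sort step, each reducer always sees its keys in sorted order.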

How can the final results be in sorted order? Within a single reducer the keys arrive in sorted order, so with one reducer the output is fully sorted.

That raises another question:

If there are only 2 reducers, which keys go to the first reducer?



We can't know in advance. There is no guarantee that each reducer gets the same number of keys; one might even get none.
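By default, Hadoop assigns a key to a reducer with a hash partitioner: reducer = hash(key) mod R. A small Python sketch of that idea (using a stable hash, since Python's built-in string hash is randomized per process; the keys are made up):

```python
import hashlib

def partition(key: str, num_reducers: int) -> int:
    # Stand-in for Hadoop's default HashPartitioner:
    # hash the key, then take it modulo the number of reducers.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return h % num_reducers

keys = ["apple", "bread", "milk", "dog"]
assignment = {k: partition(k, 2) for k in keys}
print(assignment)  # each key maps to reducer 0 or 1; counts may be uneven
```

A given key always lands on the same reducer, but nothing balances how many distinct keys each reducer receives.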

Daemons of MapReduce

This mirrors the relationship between the NameNode and DataNodes.
A job is submitted to the JobTracker, which splits the work into mappers and reducers. The TaskTrackers run on the same machines as the DataNodes, so tasks usually process data stored locally. If all the DataNodes holding a given block are busy, another node is chosen to process that block and the data is streamed to it over the network (this is rather rare).
The mappers read their input data and produce intermediate data, which the Hadoop framework passes to the reducers (shuffle and sort). The reducers then process that data and write their final output back to HDFS.


(figure: JobTracker and TaskTrackers)

code for running a job

Jobs are typically written in Java or Python, but thanks to Hadoop Streaming the code can be written in almost any language.
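For example, a Streaming word count can be a pair of small Python scripts that read stdin and write tab-separated key/value lines. This sketch chains the two stages in one process to show the data flow; real jobs run them as separate mapper and reducer scripts:

```python
from itertools import groupby

def mapper(lines):
    # Emit "word\t1" for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming delivers mapper output sorted by key; sum counts per word.
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Simulate the shuffle/sort between the two stages.
intermediate = sorted(mapper(["the quick fox", "the dog"]))
print("\n".join(reducer(intermediate)))
```

On a cluster this would be launched with the hadoop-streaming jar, roughly: hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input myinput -output joboutput (the jar's path and name vary by distribution).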

Configure a single cluster

The official tutorial is the most reliable guide:
https://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/SingleCluster.html


Original post: https://www.haomeiwen.com/subject/kotukftx.html