Hadoop has three installation modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Standalone mode is the default: no configuration files need to be changed (only JAVA_HOME is set) and no daemons need to be started.
Pseudo-distributed mode also runs on a single machine, but the configuration files must be edited (set JAVA_HOME, set the directory for PID files, configure core-site.xml and hdfs-site.xml) and the corresponding daemons must be started.
Fully distributed mode requires multiple hosts; each host must be configured appropriately and its daemons started.
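For reference, "editing the configuration files" for pseudo-distributed mode boils down to the two minimal settings from the Apache single-node guide cited below (the PID directory setting in hadoop-env.sh is omitted here). This is only a sketch of what those files end up containing; do not apply it before running the standalone example, or you will hit the connection-refused error described later in this post.
cd /usr/local/Cellar/hadoop/3.1.2/libexec
# core-site.xml: point the default filesystem at a local HDFS instance
cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
# hdfs-site.xml: a single node can only hold one replica of each block
cat > etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF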
Installation
- ssh localhost
If that does not work, generate an SSH key pair and authorize it:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
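If passwordless login still fails, the Apache single-node guide also tightens the permissions on the key file; this is usually already correct on macOS, so treat it as an optional extra step:
$ chmod 0600 ~/.ssh/authorized_keys   # sshd ignores an authorized_keys file that is group- or world-writable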
- brew install hadoop
- To check which version of Hadoop was installed, run
hadoop version
in a terminal; the output shows Hadoop is installed under /usr/local/Cellar/hadoop/3.1.2.
- First, set JAVA_HOME in
/usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop/hadoop-env.sh
Find the Java installation path by running /usr/libexec/java_home -V in a terminal,
which returns /Library/Java/JavaVirtualMachines/openjdk-12.0.2.jdk/Contents/Home.
Write this path into JAVA_HOME and remove the leading #:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk-12.0.2.jdk/Contents/Home
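To confirm the edit took effect, an optional sanity check is to grep for the line you just uncommented:
grep -n '^export JAVA_HOME' /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop/hadoop-env.sh   # should print the export line set above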
Note: for standalone mode this is the only configuration change you should make. Do not edit the other configuration files and then try to run in standalone mode, or you will get a "connection refused" error!
Next, let's run the standalone-mode example.
Standalone mode
Source: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html
The page above provides a grep example. Suppose we want to find, across many files, the words that start with 'dfs' and count how often each occurs. We need a directory of input files, so we first create a directory named input and copy all of Hadoop's XML configuration files into it. Then we run the MapReduce job, which writes its results into output, and finally we print the contents of output.
Run
etc Huizhi$ cd /usr/local/Cellar/hadoop/3.1.2/libexec
libexec Huizhi$ mkdir input   # create the input directory
libexec Huizhi$ cp etc/hadoop/*.xml input   # copy the XML config files into it
libexec Huizhi$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'   # run the MapReduce job
libexec Huizhi$ cat output/*   # print the results
Here, to check the output for myself, I added two extra words that start with dfs. The final result is:
$ cat output/*
1 dfstwo
1 dfsone
1 dfsadmin
The regular expression 'dfs[a-z.]+' matches tokens that start with dfs followed by one or more lowercase letters or dots.
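One practical note not shown above: the examples jar refuses to overwrite an existing output directory, so if you want to rerun the job you first have to remove it:
libexec Huizhi$ rm -r output   # the grep job fails if output/ already exists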
I ran into the following error because I had modified the other configuration files. Either restore the default configuration, or do what I did and uninstall and reinstall Hadoop:
2019-08-23 16:14:16,220 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-23 16:14:17,048 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:9000
java.net.ConnectException: Call From HuizhiXu.local/172.16.233.171 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
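For reference, "restoring the default configuration" essentially means putting core-site.xml and hdfs-site.xml back to an empty <configuration></configuration> block: the job is trying to reach a daemon on localhost:9000 that was configured (as in the pseudo-distributed sketch earlier) but never started. A quick way to check, assuming the brew install path used above:
libexec Huizhi$ cat etc/hadoop/core-site.xml   # for standalone mode this should contain only an empty <configuration> element
libexec Huizhi$ cat etc/hadoop/hdfs-site.xml   # likewise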
After running this example, I wanted to see what other programs the bundled share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar provides. Running
libexec Huizhi$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar
prints the list of available example programs.
To see how to use one of these programs, run
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar <program name>
For example, hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar aggregatewordcount
prints
usage: inputDirs outDir [numOfReducer [textinputformat|seq [specfile [jobName]]]]
- aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
This program counts how many times each word appears in the input.
Note: when the input is plain text, the following error occurs:
Caused by: java.io.IOException: file:/usr/local/Cellar/hadoop/3.1.2/libexec/input/capacity-scheduler.xml not a SequenceFile
This is because the program can only parse binary SequenceFiles; plain text has to be converted to a SequenceFile with the Hadoop API first.
- aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
This program builds a histogram of the word counts.
- bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
- dbcount: An example job that count the pageview counts from a database.
- distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
- grep: A map/reduce program that counts the matches of a regex in the input.
Searches the input files with a regular expression and writes the match counts to the output.
- join: A job that effects a join over sorted, equally partitioned datasets
- multifilewc: A job that counts words from several files.
- pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
- pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
Format:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi <number of maps> <number of samples>
Example:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi 10 50
Result:
Job Finished in 2.973 seconds
Estimated value of Pi is 3.16000000000000000000
- randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
- randomwriter: A map/reduce program that writes 10GB of random data per node.
- secondarysort: An example defining a secondary sort to the reduce.
- sort: A map/reduce program that sorts the data written by the random writer.
- sudoku: A sudoku solver.
- teragen: Generate data for the terasort
- terasort: Run the terasort
- teravalidate: Checking results of terasort
- wordcount: A map/reduce program that counts the words in the input files.
This is the most famous example: counting word occurrences. A sample run is sketched right after this list.
- wordmean: A map/reduce program that counts the average length of the words in the input files.
- wordmedian: A map/reduce program that counts the median length of the words in the input files.
- wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
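As mentioned under wordcount above, here is a quick sample run against the same input directory. The output directory name wc-output is just an illustrative choice; like the grep example, the job fails if that directory already exists:
libexec Huizhi$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount input wc-output   # count every word in input/
libexec Huizhi$ cat wc-output/* | head   # show the first few word counts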
Linux command notes:
- cd: switches the current working directory to dirName.
Format: cd [dirName]
"~" stands for the home directory, "." is the current directory, and ".." is the parent of the current directory.
Examples: cd ~    cd ../..
- grep: a command-line tool that originated on Unix. Given a list of files or standard input, grep searches for text matching one or more regular expressions and prints only the matching lines or text. (Source: Wikipedia)
- mkdir: creates a directory named dirName.
Format: mkdir dirName
- cp: copies files or directories.
Format: cp [options] source dest
The most commonly used option is -r.
Example: $ cp -r test/ newtest copies all files and subdirectories under test/ into newtest.
scp works much like cp, but it copies over SSH to or from another host, so it prompts for a password (or uses your SSH key).
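For completeness, a hypothetical scp invocation (the remote host name and destination path below are made up for illustration):
$ scp -r test/ user@remote-host:/home/user/newtest   # copy the local test/ directory to another machine over SSH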
References:
Annotated source code of AggregateWordCount
Input formats in MapReduce
https://docs.microsoft.com/bs-latn-ba/azure/hdinsight/hadoop/apache-hadoop-run-samples-linux?view=netcore-2.0