Hadoop has three installation modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.
Standalone mode is the default: no configuration files need to be changed (only JAVA_HOME is set) and no daemons need to be started.
Pseudo-distributed mode also runs on a single machine, but the configuration files must be edited (set JAVA_HOME, set the directory for PID files, configure core-site.xml and hdfs-site.xml) and the corresponding daemons must be started.
Fully distributed mode requires multiple hosts; each host must be configured appropriately and its daemons started.
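For reference, "editing the configuration files" for pseudo-distributed mode boils down to the two minimal settings from the Apache single-node guide cited below (the PID directory setting in hadoop-env.sh is omitted here). This is only a sketch of what those files end up containing; do not apply it before running the standalone example, or you will hit the connection-refused error described later in this post.
cd /usr/local/Cellar/hadoop/3.1.2/libexec
# core-site.xml: point the default filesystem at a local HDFS instance
cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
# hdfs-site.xml: a single node can only hold one replica of each block
cat > etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF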
Installation
- ssh localhost
If that does not work, generate an SSH key pair and authorize it:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
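If passwordless login still fails, the Apache single-node guide also tightens the permissions on the key file; this is usually already correct on macOS, so treat it as an optional extra step:
$ chmod 0600 ~/.ssh/authorized_keys   # sshd ignores an authorized_keys file that is group- or world-writable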
- brew install hadoop
- To check which version of Hadoop was installed, run
hadoop version
in a terminal; the output shows Hadoop is installed under /usr/local/Cellar/hadoop/3.1.2.
- First, set JAVA_HOME in
/usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop/hadoop-env.sh
Find the Java installation path by running /usr/libexec/java_home -V in a terminal,
which returns /Library/Java/JavaVirtualMachines/openjdk-12.0.2.jdk/Contents/Home.
Write this path into JAVA_HOME and remove the leading #:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/openjdk-12.0.2.jdk/Contents/Home
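To confirm the edit took effect, an optional sanity check is to grep for the line you just uncommented:
grep -n '^export JAVA_HOME' /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop/hadoop-env.sh   # should print the export line set above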
Note: for standalone mode this is the only configuration change you should make. Do not edit the other configuration files and then try to run in standalone mode, or you will get a "connection refused" error!
Next, let's run the standalone-mode example.
Standalone mode
Source: https://hadoop.apache.org/docs/r3.1.2/hadoop-project-dist/hadoop-common/SingleCluster.html
The page above provides a grep example. Suppose we want to find, across many files, the words that start with 'dfs' and count how often each occurs. We need a directory of input files, so we first create a directory named input and copy all of Hadoop's XML configuration files into it. Then we run the MapReduce job, which writes its results into output, and finally we print the contents of output.
Run
etc Huizhi$ cd /usr/local/Cellar/hadoop/3.1.2/libexec
libexec Huizhi$ mkdir input   # create the input directory
libexec Huizhi$ cp etc/hadoop/*.xml input   # copy the XML config files into it
libexec Huizhi$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar grep input output 'dfs[a-z.]+'   # run the MapReduce job
libexec Huizhi$ cat output/*   # print the results
Here, to check the output for myself, I added two extra words that start with dfs. The final result is:
$ cat output/*
1 dfstwo
1 dfsone
1 dfsadmin
The regular expression 'dfs[a-z.]+' matches tokens that start with dfs followed by one or more lowercase letters or dots.
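One practical note not shown above: the examples jar refuses to overwrite an existing output directory, so if you want to rerun the job you first have to remove it:
libexec Huizhi$ rm -r output   # the grep job fails if output/ already exists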
I ran into the following error because I had modified the other configuration files. Either restore the default configuration, or do what I did and uninstall and reinstall Hadoop:
2019-08-23 16:14:16,220 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-23 16:14:17,048 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:9000
java.net.ConnectException: Call From HuizhiXu.local/172.16.233.171 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
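For reference, "restoring the default configuration" essentially means putting core-site.xml and hdfs-site.xml back to an empty <configuration></configuration> block: the job is trying to reach a daemon on localhost:9000 that was configured (as in the pseudo-distributed sketch earlier) but never started. A quick way to check, assuming the brew install path used above:
libexec Huizhi$ cat etc/hadoop/core-site.xml   # for standalone mode this should contain only an empty <configuration> element
libexec Huizhi$ cat etc/hadoop/hdfs-site.xml   # likewise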
After running this example, I wanted to see what other programs the bundled share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar provides. Running
libexec Huizhi$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar
prints the list of available example programs.
To see how to use one of these programs, run
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar <program name>
For example, hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar aggregatewordcount
prints
usage: inputDirs outDir [numOfReducer [textinputformat|seq [specfile [jobName]]]]
- aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
This program counts how many times each word appears in the input.
Note: when the input is plain text, the following error occurs:
Caused by: java.io.IOException: file:/usr/local/Cellar/hadoop/3.1.2/libexec/input/capacity-scheduler.xml not a SequenceFile
This is because the program can only parse binary SequenceFiles; plain text has to be converted to a SequenceFile with the Hadoop API first.
- aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
This program builds a histogram of the word counts.
- bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
- dbcount: An example job that count the pageview counts from a database.
- distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
- grep: A map/reduce program that counts the matches of a regex in the input.
Searches the input files with a regular expression and writes the match counts to the output.
- join: A job that effects a join over sorted, equally partitioned datasets
- multifilewc: A job that counts words from several files.
- pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
- pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
Format:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi <number of maps> <number of samples>
Example:
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar pi 10 50
Result:
Job Finished in 2.973 seconds
Estimated value of Pi is 3.16000000000000000000
- randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
- randomwriter: A map/reduce program that writes 10GB of random data per node.
- secondarysort: An example defining a secondary sort to the reduce.
- sort: A map/reduce program that sorts the data written by the random writer.
- sudoku: A sudoku solver.
- teragen: Generate data for the terasort
- terasort: Run the terasort
- teravalidate: Checking results of terasort
- wordcount: A map/reduce program that counts the words in the input files.
This is the most famous example: counting word occurrences. A sample run is sketched right after this list.
- wordmean: A map/reduce program that counts the average length of the words in the input files.
- wordmedian: A map/reduce program that counts the median length of the words in the input files.
- wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
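As mentioned under wordcount above, here is a quick sample run against the same input directory. The output directory name wc-output is just an illustrative choice; like the grep example, the job fails if that directory already exists:
libexec Huizhi$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.2.jar wordcount input wc-output   # count every word in input/
libexec Huizhi$ cat wc-output/* | head   # show the first few word counts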
Linux command notes:
- cd: switches the current working directory to dirName.
Format: cd [dirName]
"~" stands for the home directory, "." is the current directory, and ".." is the parent of the current directory.
Examples: cd ~    cd ../..
- grep: a command-line tool that originated on Unix. Given a list of files or standard input, grep searches for text matching one or more regular expressions and prints only the matching lines or text. (Source: Wikipedia)
- mkdir: creates a directory named dirName.
Format: mkdir dirName
- cp: copies files or directories.
Format: cp [options] source dest
The most commonly used option is -r.
Example: $ cp -r test/ newtest copies all files and subdirectories under test/ into newtest.
scp works much like cp, but it copies over SSH to or from another host, so it prompts for a password (or uses your SSH key).
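For completeness, a hypothetical scp invocation (the remote host name and destination path below are made up for illustration):
$ scp -r test/ user@remote-host:/home/user/newtest   # copy the local test/ directory to another machine over SSH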
References:
Annotated source code of AggregateWordCount
Input formats in MapReduce
https://docs.microsoft.com/bs-latn-ba/azure/hdinsight/hadoop/apache-hadoop-run-samples-linux?view=netcore-2.0