Notes on installing Hadoop on Ubuntu 16.04.
Versions
操作系统: Ubuntu 16.04 LTS
Hadoop: 2.9.0
Prerequisites
Java 8. (It is best to install by downloading the JDK and configuring the environment variables yourself; installing via apt-get kept failing.)
After installation, verify:
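A minimal sketch of the manual approach (the paths and tarball name are assumptions, inferred from the JAVA_HOME used later in this guide; adjust them to your actual download):
$ sudo mkdir -p /usr/install
$ sudo tar xzf jdk-8u161-linux-x64.tar.gz -C /usr/install
$ sudo mv /usr/install/jdk1.8.0_161 /usr/install/jdk1.8
Then append to $HOME/.bashrc:
export JAVA_HOME=/usr/install/jdk1.8
export PATH=$PATH:$JAVA_HOME/bin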
user@ubuntu:~$ java -version
java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
Create a dedicated Hadoop account
Create a dedicated system account for running Hadoop. This step is not required, but it helps keep Hadoop separate from other software.
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
Configure SSH
Hadoop uses SSH to access its nodes (both local and remote).
1.Start the SSH service.
2.Generate an SSH key for hduser:
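If sshd is not already present, a minimal way to set it up (standard Ubuntu packages):
$ sudo apt-get install openssh-server
$ sudo service ssh start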
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
...
...
...
hduser@ubuntu:~$
3.Authorize hduser to SSH into the local machine:
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
4.Verify that SSH login works:
hduser@ubuntu:~$ ssh localhost
Welcome to Ubuntu 16.04.4 LTS (GNU/Linux 4.13.0-36-generic x86_64)
* Documentation: https://help.ubuntu.com
* Management: https://landscape.canonical.com
* Support: https://ubuntu.com/advantage
87 packages can be updated.
30 updates are security updates.
Last login: Wed Apr 4 03:04:20 2018 from 127.0.0.1
hduser@ubuntu:~$
Disable IPv6
This step avoids interference from IPv6 (reportedly IPv6 still has issues on Ubuntu).
Append the following to the end of /etc/sysctl.conf:
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reboot for the changes to take effect.
To verify whether IPv6 is disabled:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
An output of 0 means IPv6 is still enabled; 1 means it is disabled.
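Alternatively, the settings can be applied immediately without a reboot (standard sysctl usage, not part of the original notes):
$ sudo sysctl -p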
Installing Hadoop
Download
Download Hadoop; this installation uses version 2.9.0.
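One way to fetch the tarball (the archive.apache.org URL is an assumption; any Apache mirror works):
$ sudo mkdir -p /usr/install
$ sudo wget -P /usr/install https://archive.apache.org/dist/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz
Then unpack it and hand ownership to hduser: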
$ cd /usr/install
$ sudo tar xzf hadoop-2.9.0.tar.gz
$ sudo mv hadoop-2.9.0 hadoop
$ sudo chown -R hduser:hadoop hadoop
Update $HOME/.bashrc
Append the following to the end of $HOME/.bashrc:
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/install/hadoop
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
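Reload the file so the changes take effect in the current shell:
$ source $HOME/.bashrc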
Trying out Hadoop
Configuration
The goal is to run a pseudo-distributed, single-node Hadoop cluster.
1.hadoop-env.sh
Only JAVA_HOME needs to be added:
# /usr/install/hadoop/etc/hadoop/hadoop-env.sh
# Change this to your own Java installation directory
export JAVA_HOME=/usr/install/jdk1.8
2.core-site.xml
Configure the location of temporary data files.
Suppose the temporary data directory is set to /usr/app/hadoop/tmp:
$ sudo mkdir -p /usr/app/hadoop/tmp
$ sudo chown hduser:hadoop /usr/app/hadoop/tmp
Changing the file owner is important; otherwise Hadoop will throw a java.io.IOException.
Add the following between <configuration> ... </configuration> in core-site.xml (note that fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS, though the old key still works):
#/usr/install/hadoop/etc/hadoop/core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
3.hdfs-site.xml
Configure the replication factor:
#/usr/install/hadoop/etc/hadoop/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
Format the HDFS filesystem
Formatting is the first step of bringing up a new Hadoop installation. Do not format a running filesystem unless you have to: formatting erases all data in HDFS.
hduser@ubuntu:~$ hadoop namenode -format
The output looks like this:
hduser@ubuntu:~$ hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
18/04/07 02:35:57 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.9.0
...
...
...
18/04/07 02:36:10 INFO common.Storage: Storage directory /usr/app/hadoop/tmp/dfs/name has been successfully formatted.
18/04/07 02:36:10 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/app/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
18/04/07 02:36:10 INFO namenode.FSImageFormatProtobuf: Image file /usr/app/hadoop/tmp/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 323 bytes saved in 0 seconds.
18/04/07 02:36:10 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
18/04/07 02:36:10 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:~$
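As the DEPRECATED notice above indicates, the current equivalent command in Hadoop 2.x is:
hduser@ubuntu:~$ hdfs namenode -format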
Start the single-node cluster
hduser@ubuntu:~$ /usr/install/hadoop/sbin/start-all.sh
The output looks like this:
hduser@ubuntu:~$ /usr/install/hadoop/sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/install/hadoop/logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/install/hadoop/logs/hadoop-hduser-datanode-ubuntu.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/install/hadoop/logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting yarn daemons
starting resourcemanager, logging to /usr/install/hadoop/logs/yarn-hduser-resourcemanager-ubuntu.out
localhost: starting nodemanager, logging to /usr/install/hadoop/logs/yarn-hduser-nodemanager-ubuntu.out
hduser@ubuntu:~$
This starts the NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager. (Hadoop 1 also started a JobTracker and a TaskTracker; with the move to the YARN architecture these two no longer exist.)
Verify that the daemons are running:
hduser@ubuntu:~$ netstat -plten | grep java
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 0.0.0.0:50090 0.0.0.0:* LISTEN 1001 89227 12084/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 1001 86335 11757/java
tcp 0 0 127.0.0.1:54310 0.0.0.0:* LISTEN 1001 86425 11757/java
tcp6 0 0 :::8040 :::* LISTEN 1001 95284 12360/java
tcp6 0 0 :::8042 :::* LISTEN 1001 95296 12360/java
tcp6 0 0 :::8088 :::* LISTEN 1001 97396 12242/java
tcp6 0 0 :::8030 :::* LISTEN 1001 89932 12242/java
tcp6 0 0 :::8031 :::* LISTEN 1001 89925 12242/java
tcp6 0 0 :::8032 :::* LISTEN 1001 89938 12242/java
tcp6 0 0 :::8033 :::* LISTEN 1001 97402 12242/java
tcp6 0 0 :::45447 :::* LISTEN 1001 95277 12360/java
hduser@ubuntu:~$
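Alternatively (not part of the original notes), the JDK's jps tool lists the running Java daemons by name; you should see NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager:
hduser@ubuntu:~$ jps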
Stop the cluster
hduser@ubuntu:~$ /usr/install/hadoop/sbin/stop-all.sh
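Like start-all.sh, stop-all.sh is deprecated in Hadoop 2.x; the recommended equivalents are:
hduser@ubuntu:~$ /usr/install/hadoop/sbin/stop-yarn.sh
hduser@ubuntu:~$ /usr/install/hadoop/sbin/stop-dfs.sh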
Run a MapReduce job
1.Download the input data
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download the plain-text versions of these books to /tmp/gutenberg.
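One possible way to fetch them (the gutenberg.org URLs are assumptions inferred from the file names below; use whatever mirror works for you):
hduser@ubuntu:~$ mkdir -p /tmp/gutenberg
hduser@ubuntu:~$ cd /tmp/gutenberg
hduser@ubuntu:/tmp/gutenberg$ wget http://www.gutenberg.org/files/4300/4300-0.txt
hduser@ubuntu:/tmp/gutenberg$ wget http://www.gutenberg.org/files/5000/5000-8.txt
hduser@ubuntu:/tmp/gutenberg$ wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
After downloading, the directory should look like this: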
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3600
-rw-r--r-- 1 hduser hadoop 1580890 Aug 17 2017 4300-0.txt
-rw-r--r-- 1 hduser hadoop 1428841 Apr 6 2015 5000-8.txt
-rw-r--r-- 1 hduser hadoop 674570 Apr 7 00:42 pg20417.txt
hduser@ubuntu:~$
2.Restart the Hadoop cluster (if it was stopped):
/usr/install/hadoop/sbin/start-all.sh
3.Upload the local files to HDFS
hduser@ubuntu:~$ hadoop fs -mkdir /usr
hduser@ubuntu:~$ hadoop fs -mkdir /usr/hduser
hduser@ubuntu:~$ hadoop fs -mkdir /usr/hduser/gutenberg
hduser@ubuntu:~$ hadoop fs -put /tmp/gutenberg/*.txt /usr/hduser/gutenberg/
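The three mkdir commands above can also be combined into one using the -p flag, which the Hadoop 2.x fs shell supports:
hduser@ubuntu:~$ hadoop fs -mkdir -p /usr/hduser/gutenberg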
Check the upload:
hduser@ubuntu:~$ hadoop fs -ls /usr/hduser/gutenberg
Found 3 items
-rw-r--r-- 1 hduser supergroup 1580890 2018-04-07 18:01 /usr/hduser/gutenberg/4300-0.txt
-rw-r--r-- 1 hduser supergroup 1428841 2018-04-07 18:01 /usr/hduser/gutenberg/5000-8.txt
-rw-r--r-- 1 hduser supergroup 674570 2018-04-07 18:01 /usr/hduser/gutenberg/pg20417.txt
hduser@ubuntu:~$
4.Run the MapReduce job
hduser@ubuntu:~$ cd /usr/install/hadoop
hduser@ubuntu:/usr/install/hadoop$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar wordcount /usr/hduser/gutenberg /usr/hduser/gutenberg-output
This job counts how many times each word occurs in the input files. When it finishes, list the output directory:
hduser@ubuntu:/usr/install/hadoop$ hadoop fs -ls /usr/hduser/gutenberg-output
Found 2 items
-rw-r--r-- 1 hduser supergroup 0 2018-04-07 03:11 /usr/hduser/gutenberg-output/_SUCCESS
-rw-r--r-- 1 hduser supergroup 880802 2018-04-07 03:11 /usr/hduser/gutenberg-output/part-r-00000
A sample of the data in part-r-00000:
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
Hadoop Web UI
Hadoop ships with a web UI for monitoring and administration. Open http://192.168.1.6:50070/dfshealth.html#tab-overview (substitute your own host's address) to see the NameNode overview page; the YARN ResourceManager UI listens on port 8088.
References:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/CLIMiniCluster.html