Spark in Practice (1): Setting Up a Spark 2.0 Environment

Author: padluo | Published: 2018-02-28 13:19

Software to prepare for the installation

CentOS-7-x86_64-Everything-1611.iso

spark-2.0.1-bin-hadoop2.7.tgz

hadoop-2.7.3.tar.gz

scala-2.11.8.tgz

jdk-8u91-linux-x64.tar.gz

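Create the Linux virtual machines (all nodes)

Guest operating system: CentOS-7-x86_64.

Network and hostname settings:

General tab: check "Automatically connect to this network when it is available".

IPv4 tab settings:

hostname     Address          Netmask        Gateway
sparkmaster  192.168.169.221  255.255.255.0
sparknode1   192.168.169.222  255.255.255.0
sparknode2   192.168.169.223  255.255.255.0

Installation type: Minimal Install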

Create the user (all nodes)

su root
useradd spark
passwd spark
su spark
cd ~
pwd
mkdir softwares

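Switch the locale to English (all nodes)

# Show the current locale settings
locale

LANG=en_US.utf8
export LC_ALL=en_US.utf8

# Change the system default: edit /etc/locale.conf so that it contains the line below
vi /etc/locale.conf

LANG=en_US.utf8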

Change the hostname (all nodes)

vi /etc/hostname

# Each node's file contains only its own name.
# On 192.168.169.221
sparkmaster
# On 192.168.169.222
sparknode1
# On 192.168.169.223
sparknode2

Edit hosts (all nodes)

su root
vi /etc/hosts

192.168.169.221 sparkmaster
192.168.169.222 sparknode1
192.168.169.223 sparknode2

To reach the cluster by hostname from Windows as well, add the same entries to the Windows hosts file, which is located under C:\Windows\System32\drivers\etc.
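Editing /etc/hosts by hand on three machines is easy to get wrong, so the entries above could also be appended by a small script. The following is a minimal sketch, not part of the original setup: the function name add_host_entry and the HOSTS_FILE variable are made up for illustration, and the usage assumes it is run as root on each node. An entry whose hostname is already listed is skipped, so the script can be run more than once.

```shell
#!/bin/bash
# Hypothetical helper: append "IP HOSTNAME" to a hosts file unless the
# hostname is already listed there.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    # Match the hostname as a whole whitespace-delimited field on a non-comment line.
    if ! grep -Eq "^[^#]*[[:space:]]${name}([[:space:]]|\$)" "$file"; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Assumed default; override HOSTS_FILE to try the function on a scratch file.
HOSTS_FILE="${HOSTS_FILE:-/etc/hosts}"

# Usage (run as root on every node):
# add_host_entry "$HOSTS_FILE" 192.168.169.221 sparkmaster
# add_host_entry "$HOSTS_FILE" 192.168.169.222 sparknode1
# add_host_entry "$HOSTS_FILE" 192.168.169.223 sparknode2
```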

Configure a static IP (all nodes)

vi /etc/sysconfig/network-scripts/ifcfg-ens33

# BOOTPROTO=dhcp
BOOTPROTO=static
IPADDR0=xxx
GATEWAY0=xxx
NETMASK=xxx
DNS1=xxx

systemctl restart network

Disable the firewall (all nodes)

systemctl status firewalld.service

systemctl stop firewalld.service
systemctl disable firewalld.service

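Configure passwordless SSH login (all nodes)

su spark
cd ~
  • ssh-keygen -t rsa -P ''
  • Copy out the contents of the id_rsa.pub generated on each node.
  • Put the collected public keys of all nodes together into the file .ssh/authorized_keys under the user's home directory on every node.
  • The authorized_keys file on every node must have its permissions set to 600: chmod 600 authorized_keys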
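The manual copy-and-paste in the steps above can be sketched as a small script. This is only an illustration under stated assumptions: the three id_rsa.pub files are assumed to have been collected into the current directory under made-up names such as sparkmaster.pub, and the function name install_pubkeys is hypothetical. While password login is still enabled, running ssh-copy-id from the OpenSSH client tools against each host is another way to reach the same result.

```shell
#!/bin/bash
# Hypothetical helper: merge public keys into an authorized_keys file,
# skipping keys that are already present, and set the permissions sshd expects.
install_pubkeys() {
    local ssh_dir="$1"; shift
    local auth="$ssh_dir/authorized_keys"
    local key_file line
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$auth"
    for key_file in "$@"; do
        while IFS= read -r line || [ -n "$line" ]; do
            [ -z "$line" ] && continue
            # -F: fixed string, -x: whole line, so only exact duplicates are skipped.
            grep -Fxq -- "$line" "$auth" || printf '%s\n' "$line" >> "$auth"
        done < "$key_file"
    done
    chmod 600 "$auth"
}

# Usage (as user spark on every node; the *.pub file names are assumptions):
# install_pubkeys ~/.ssh sparkmaster.pub sparknode1.pub sparknode2.pub
```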

Upload the software (master node)

Upload the prepared software (JDK, Hadoop, Spark, Scala) to sparkmaster:/home/spark/softwares

Install the JDK (master node)

tar -zxvf jdk-8u91-linux-x64.tar.gz
vi ~/.bashrc

export JAVA_HOME=/home/spark/softwares/jdk1.8.0_91
export PATH=$PATH:$JAVA_HOME/bin

source ~/.bashrc
which java

Install Scala (master node)

tar -zxvf scala-2.11.8.tgz
vi ~/.bashrc

export SCALA_HOME=/home/spark/softwares/scala-2.11.8
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin

source ~/.bashrc
which scala

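Install Hadoop (master node)

tar -zxvf hadoop-2.7.3.tar.gz

Hadoop configuration directory: /home/spark/softwares/hadoop-2.7.3/etc/hadoop. The <property> blocks below go inside the <configuration> element of each file.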

core-site.xml

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://sparkmaster:8082</value>
</property>

hdfs-site.xml

<property>
    <name>dfs.name.dir</name>
    <value>file:/home/spark/softwares/hadoop-2.7.3/hdfs/name</value>
</property>
<property>
    <name>dfs.data.dir</name>
    <value>file:/home/spark/softwares/hadoop-2.7.3/hdfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>sparkmaster:9001</value>
</property>
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>

masters

sparkmaster

slaves

sparkmaster
sparknode1
sparknode2

hadoop-env.sh

export JAVA_HOME=${JAVA_HOME}

Environment variables

vi ~/.bashrc

export HADOOP_HOME=/home/spark/softwares/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin

source ~/.bashrc

Install Spark (master node)

tar -zxvf spark-2.0.1-bin-hadoop2.7.tgz

# Configuration directory: /home/spark/softwares/spark-2.0.1-bin-hadoop2.7/conf
cd /home/spark/softwares/spark-2.0.1-bin-hadoop2.7/conf

vi slaves

sparkmaster
sparknode1
sparknode2

vi spark-env.sh

export SPARK_HOME=$SPARK_HOME
export HADOOP_HOME=$HADOOP_HOME
export MASTER=spark://sparkmaster:7077
export SCALA_HOME=$SCALA_HOME
export SPARK_MASTER_IP=sparkmaster

vi ~/.bashrc

export SPARK_HOME=/home/spark/softwares/spark-2.0.1-bin-hadoop2.7
export PATH=$PATH:$JAVA_HOME/bin:$SCALA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin

source ~/.bashrc
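
Each PATH line added to ~/.bashrc above repeats the earlier entries, so PATH can accumulate duplicates, and every further `source ~/.bashrc` adds more. One possible way to avoid this, sketched here as an assumption rather than the author's setup, is to keep the four exports in one place and append each bin directory only when it is missing. The helper name path_append is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: append a directory to PATH only if it is not already there.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

# Install locations taken from the sections above.
export JAVA_HOME=/home/spark/softwares/jdk1.8.0_91
export SCALA_HOME=/home/spark/softwares/scala-2.11.8
export HADOOP_HOME=/home/spark/softwares/hadoop-2.7.3
export SPARK_HOME=/home/spark/softwares/spark-2.0.1-bin-hadoop2.7

for dir in "$JAVA_HOME/bin" "$SCALA_HOME/bin" "$HADOOP_HOME/bin" "$SPARK_HOME/bin"; do
    path_append "$dir"
done
export PATH
```

With this form, sourcing ~/.bashrc repeatedly leaves PATH unchanged after the first time.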

Set up a local yum repository (local method) (master node)

Mount the ISO image and copy its contents

su root
mkdir -p /mnt/CentOS /mnt/dvd
mount /dev/cdrom /mnt/dvd
df -h
cp -av /mnt/dvd/* /mnt/CentOS
umount /mnt/dvd

Back up the existing yum configuration files

cd /etc/yum.repos.d
rename .repo .repo.bak *.repo

Create a new yum configuration file

vi /etc/yum.repos.d/local.repo

[local]
name=CentOS-$releasever - Local
baseurl=file:///mnt/CentOS
enabled=1
gpgcheck=0

# Verify
yum list | grep mysql

Set up a local yum repository (http method) (master node)

Start the httpd service

# Check whether httpd is installed
rpm -qa|grep httpd
# Install it if it is missing
yum install -y httpd
# Start the httpd service
# (CentOS 6 equivalent: service httpd start)
systemctl status httpd.service
systemctl start httpd.service
# Enable httpd at boot
# (CentOS 6 equivalent: chkconfig httpd on)
systemctl is-enabled httpd.service
systemctl enable httpd.service

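Install the yum repository

# Create the directory CentOS7 under /var/www/html/
mkdir -p /var/www/html/CentOS7

# Move the ISO contents into CentOS7
# (alternative: copy them, then delete the originals)
# cp -av /mnt/CentOS/* /var/www/html/CentOS7/
# rm -rf /mnt/CentOS/*
mv /mnt/CentOS/* /var/www/html/CentOS7/

The yum repository built from the ISO image is now ready. Verify access from a browser:

http://sparkmaster/CentOS7/

Use the yum repository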

# Back up the existing repo files
# mkdir -p /etc/yum.repos.d/repo.bak
# cd /etc/yum.repos.d/
# cp *.repo *.repo.bak repo.bak/
# rm -rf *.repo *.repo.bak

cd /etc/yum.repos.d/
# Create the file CentOS-http.repo
vi CentOS-http.repo

[http]
name=CentOS-$releasever - http
baseurl=http://sparkmaster:80/CentOS7/
enabled=1
gpgcheck=1
gpgkey=http://sparkmaster:80/CentOS7/RPM-GPG-KEY-CentOS-7

# Disable the local yum repository set up earlier by setting enabled=0 in local.repo

# Refresh the yum repositories
yum clean all
yum repolist
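
The local.repo and CentOS-http.repo files above differ only in their id, base URL and GPG settings, so they could be generated instead of typed. The following is a rough sketch under that assumption; the function name repo_stanza is hypothetical, and the name= line it prints reuses the repo id rather than the exact wording used above.

```shell
#!/bin/bash
# Hypothetical helper: print a yum .repo stanza for the given id and base URL.
# GPG checking stays off unless a key URL is passed as the third argument.
repo_stanza() {
    local id="$1" baseurl="$2" gpgkey="${3:-}"
    printf '[%s]\n' "$id"
    # $releasever is left literal on purpose; yum expands it at run time.
    printf 'name=CentOS-$releasever - %s\n' "$id"
    printf 'baseurl=%s\n' "$baseurl"
    printf 'enabled=1\n'
    if [ -n "$gpgkey" ]; then
        printf 'gpgcheck=1\ngpgkey=%s\n' "$gpgkey"
    else
        printf 'gpgcheck=0\n'
    fi
}

# Usage (as root):
# repo_stanza local file:///mnt/CentOS > /etc/yum.repos.d/local.repo
# repo_stanza http http://sparkmaster:80/CentOS7/ \
#     http://sparkmaster:80/CentOS7/RPM-GPG-KEY-CentOS-7 > /etc/yum.repos.d/CentOS-http.repo
```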

Cluster yum repository configuration (http method) (all nodes)

# On sparknode1/sparknode2
cd /etc/yum.repos.d
rename .repo .repo.bak *.repo

# On sparkmaster
scp /etc/yum.repos.d/*.repo sparknode1:/etc/yum.repos.d/
scp /etc/yum.repos.d/*.repo sparknode2:/etc/yum.repos.d/

File synchronization tool (all nodes)

Use rsync to synchronize the software installed under /home/spark/softwares on the master node (JDK, Hadoop, Spark, Scala) to the other nodes.

rpm -qa | grep rsync
yum list | grep rsync
yum install -y rsync

cd ~/softwares
vi sync_tools.sh

#!/bin/bash
# Sync everything in the current directory to the worker nodes.
echo "-----begin to sync jobs to other workplat-----"
SERVER_LIST='sparknode1 sparknode2'
for SERVER in $SERVER_LIST
do
    rsync -avz ./* $SERVER:/home/spark/softwares
done
echo "-----sync jobs is done-----"

chmod 700 sync_tools.sh
./sync_tools.sh
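
Before copying several hundred megabytes to each node, it can help to see exactly which rsync commands will run. The variant below is a sketch, not the author's script: the DRY_RUN and SRC_DIR variables and the function name sync_to_nodes are assumptions introduced here. It prints the commands by default and runs them only when DRY_RUN=0 is set.

```shell
#!/bin/bash
# Hypothetical variant of sync_tools.sh: print the rsync commands by default
# and execute them only when DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"
SRC_DIR="${SRC_DIR:-/home/spark/softwares}"

sync_to_nodes() {
    local server
    for server in "$@"; do
        # The trailing slash on the source copies the directory's contents
        # rather than the directory itself.
        if [ "$DRY_RUN" = "1" ]; then
            echo "rsync -avz $SRC_DIR/ $server:$SRC_DIR/"
        else
            rsync -avz "$SRC_DIR/" "$server:$SRC_DIR/"
        fi
    done
}

# Usage on sparkmaster as user spark:
# sync_to_nodes sparknode1 sparknode2              # print only
# DRY_RUN=0 ./this_script.sh                       # after adding the call above to the script
```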

Synchronize the environment variable configuration (all nodes)

# On sparknode1/sparknode2
mv ~/.bashrc ~/.bashrc.bak

# On sparkmaster
su spark
scp ~/.bashrc sparknode1:~/.bashrc
scp ~/.bashrc sparknode2:~/.bashrc

# On sparknode1/sparknode2
source ~/.bashrc

Start Spark and verify

cd $SPARK_HOME
cd sbin
./stop-all.sh
./start-all.sh
jps

Verify:

http://sparkmaster:8080/

Start HDFS and verify

cd $HADOOP_HOME
# Format the NameNode (first start only; reformatting erases the HDFS metadata)
hdfs namenode -format
cd sbin
./stop-all.sh
./start-dfs.sh
jps

Verify:

http://sparkmaster:50070
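
Besides the two web pages, the jps output on each node can be checked against the daemons expected there. The helper below is a minimal sketch with a made-up name, check_daemons. It assumes the usual jps line format of a process id followed by a class name, and the standard daemon names Master and Worker for Spark standalone and NameNode, SecondaryNameNode and DataNode for HDFS.

```shell
#!/bin/bash
# Hypothetical helper: read `jps` output on stdin and report which of the
# expected daemon names are present. Returns non-zero if any is missing.
check_daemons() {
    local jps_out missing=0 name
    jps_out=$(cat)
    for name in "$@"; do
        # jps prints "<pid> <name>" per line; compare the name with the second field.
        if printf '%s\n' "$jps_out" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
            echo "OK       $name"
        else
            echo "MISSING  $name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage on sparkmaster, which is listed as both master and worker above:
# jps | check_daemons Master Worker NameNode SecondaryNameNode DataNode
# Usage on sparknode1 / sparknode2:
# jps | check_daemons Worker DataNode
```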

This completes the Spark 2.0 environment setup.


