1. Download Spark
Link: http://spark.apache.org/downloads.html
2. Upload spark-2.3.1.tgz to the ~/software directory
3. Extract spark-2.3.1.tgz into the ~/app directory
[hadoop@hadoop001 software]$ tar -zxvf spark-2.3.1.tgz -C ~/app
4. Prerequisites for building Spark:
1) Maven 3.3.9 or newer
2) Java 8+
3) Scala 2.11.8
4) Git
Verify that each component is installed:
[hadoop@hadoop001 software]$ java -version
java version "1.8.0_45"
Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
[hadoop@hadoop001 software]$ mvn -version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
Maven home: /home/hadoop/app/apache-maven-3.3.9
Java version: 1.8.0_45, vendor: Oracle Corporation
Java home: /usr/java/jdk1.8.0_45/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "2.6.32-431.el6.x86_64", arch: "amd64", family: "unix"
[hadoop@hadoop001 software]$ scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_45).
Type in expressions for evaluation. Or try :help.
scala>
5. Add the following to your environment variables:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
If you don’t add these parameters to MAVEN_OPTS, you may see errors and warnings like the following:
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.11/classes...
[ERROR] Java heap space -> [Help 1]
You can fix these problems by setting the MAVEN_OPTS variable as discussed before.
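The export above can be verified in the current shell. A minimal self-contained sketch (it uses a demo profile file in /tmp rather than the real ~/.bash_profile, so it is safe to run anywhere):

```shell
# Demo: append the MAVEN_OPTS export to a profile file and confirm it takes effect.
# /tmp/demo_profile stands in for ~/.bash_profile so this snippet is self-contained.
profile=/tmp/demo_profile
echo 'export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"' > "$profile"
. "$profile"
echo "MAVEN_OPTS is: $MAVEN_OPTS"
```

In a real session you would append the line to ~/.bash_profile and `source` it, as shown in the next step.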
6. Confirm that all required environment variables are set
[hadoop@hadoop001 ~]$ vi .bash_profile
export JAVA_HOME=/usr/java/jdk1.8.0_45
export HADOOP_HOME=/home/hadoop/app/hadoop-2.6.0
export HIVE_HOME=/home/hadoop/app/hive-1.1.0-cdh5.7.0
export MVN_HOME=/home/hadoop/app/apache-maven-3.3.9
export FINDBUGS_HOME=/home/hadoop/app/findbugs-1.3.9
export PROTOC_HOME=/usr/local/protobuf
export SQOOP_HOME=/home/hadoop/app/sqoop-1.4.6-cdh5.7.0
export SCALA_HOME=/home/hadoop/app/scala-2.11.8
export FLUME_HOME=/home/hadoop/app/apache-flume-1.6.0-cdh5.7.0-bin
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
export PATH=$FLUME_HOME/bin:$SCALA_HOME/bin:$SQOOP_HOME/bin:$PROTOC_HOME/bin:$FINDBUGS_HOME/bin:$MVN_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$JAVA_HOME/bin:$PATH
[hadoop@hadoop001 ~]$ source .bash_profile
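As a quick sanity check before building, the key variables can be verified in one loop. A sketch (the variable names are taken from the .bash_profile above; uses bash indirect expansion, so run it with bash, not sh):

```shell
# check_vars reports whether each named environment variable is set.
check_vars() {
  for v in "$@"; do
    if [ -n "${!v}" ]; then
      echo "OK      $v"
    else
      echo "MISSING $v"
    fi
  done
}

# Variable names from the .bash_profile shown above:
check_vars JAVA_HOME SCALA_HOME MVN_HOME MAVEN_OPTS
```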
7. Configure Maven's local repository
[hadoop@hadoop001 conf]$ cd /home/hadoop/app/apache-maven-3.3.9/conf
[hadoop@hadoop001 conf]$ cat settings.xml
<!-- localRepository
| The path to the local repository maven will use to store artifacts.
|
| Default: ${user.home}/.m2/repository
<localRepository>/path/to/local/repo</localRepository>
-->
<localRepository>/home/hadoop/maven_repo</localRepository>
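The same change can be made non-interactively with sed. A hedged sketch (a minimal stand-in settings.xml is created in /tmp so the snippet is runnable as-is; the real file is $MVN_HOME/conf/settings.xml):

```shell
# Create a minimal stand-in settings.xml (the real one ships with Maven).
cat > /tmp/settings.xml <<'EOF'
<settings>
</settings>
EOF

# Insert the localRepository element just before the closing tag (GNU sed).
sed -i 's#</settings>#  <localRepository>/home/hadoop/maven_repo</localRepository>\n</settings>#' /tmp/settings.xml

grep localRepository /tmp/settings.xml
```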
8. Install Git
[hadoop@hadoop001 ~]$ sudo yum install git
9. Configure make-distribution.sh
[hadoop@hadoop001 ~]$ cd /home/hadoop/app/spark-2.3.1/dev
[hadoop@hadoop001 dev]$ vi make-distribution.sh
#VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)
#SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | tail -n 1)
#SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
# | grep -v "INFO"\
# | tail -n 1)
#SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
# | grep -v "INFO"\
# | fgrep --count "<id>hive</id>";\
# # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
# # because we use "set -o pipefail"
# echo -n)
VERSION=2.3.1
SCALA_VERSION=2.11
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
SPARK_HIVE=1
(These four values are normally detected by running Maven, which is slow; hard-coding them skips the detection step and saves build time.)
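The edit above can also be scripted instead of done in vi. A sketch that comments out a detection line and appends the pinned values (shown against a stand-in file in /tmp so it is safe to run; the real file is dev/make-distribution.sh):

```shell
f=/tmp/make-distribution-demo.sh

# Stand-in containing one of the detection lines from the real script.
cat > "$f" <<'EOF'
VERSION=$("$MVN" help:evaluate -Dexpression=project.version)
EOF

# Comment out the detection line, then pin the values used in this guide.
sed -i 's/^VERSION=\$(/#&/' "$f"
cat >> "$f" <<'EOF'
VERSION=2.3.1
SCALA_VERSION=2.11
SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0
SPARK_HIVE=1
EOF

grep '^VERSION=' "$f"
```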
10. Add the Cloudera repository to pom.xml (place it inside the existing <repositories> section)
[hadoop@hadoop001 spark-2.3.1]$ vi pom.xml
<repository>
<id>cloudera</id>
<name>cloudera Repository</name>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
11. Compile Spark
[hadoop@hadoop001 spark-2.3.1]$ ./dev/make-distribution.sh \
> --name 2.6.0-cdh5.7.0 \
> --tgz \
> -Dhadoop.version=2.6.0-cdh5.7.0 \
> -Phadoop-2.6 \
> -Phive -Phive-thriftserver \
> -Pyarn
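If the build succeeds, make-distribution.sh writes a tarball named spark-<VERSION>-bin-<NAME>.tgz into the source root. A small sketch assembling the expected file name from the values used above:

```shell
# Values pinned in make-distribution.sh and passed via --name above.
VERSION=2.3.1
NAME=2.6.0-cdh5.7.0
expected="spark-${VERSION}-bin-${NAME}.tgz"
echo "Expect to find: $expected"

# In the real build tree you would then verify with:
#   ls -lh ~/app/spark-2.3.1/"$expected"
```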