
Big Data Environment Deployment (Part 3)

By 梧上擎天 | Published 2018-11-08 15:30

    Kafka Installation

    1 Download

    Download address: http://kafka.apache.org/downloads


    2 Extract

    tar -zxvf kafka_2.11-0.11.0.1.tgz -C /opt/wsqt/core

    mv kafka_2.11-0.11.0.1/ kafka

    Distribute the kafka directory to the other brokers:

    scp -r kafka hadoop@hadoop004:/opt/wsqt/core/

    scp -r kafka hadoop@hadoop005:/opt/wsqt/core/

    3 Edit the configuration file

    On hadoop003, hadoop004, and hadoop005, edit the file on each node:

    vi /opt/wsqt/core/kafka/config/server.properties

    Configure the host, the ZooKeeper connection string, and the broker ID.

    The settings that differ between the three brokers are broker.id and host.name; everything else is identical. The full file, with values shown for one of the brokers:

    broker.id=1
    delete.topic.enable=true
    host.name=192.168.139.137
    port=9092
    num.network.threads=3
    num.io.threads=8
    socket.send.buffer.bytes=102400
    socket.receive.buffer.bytes=102400
    socket.request.max.bytes=104857600
    log.dirs=/opt/wsqt/data/kafka
    num.partitions=2
    num.recovery.threads.per.data.dir=1
    log.retention.hours=168
    log.segment.bytes=1073741824
    log.retention.check.interval.ms=300000
    zookeeper.connect=hadoop003:2181,hadoop004:2181,hadoop005:2181
    zookeeper.connection.timeout.ms=6000
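    A sketch of the per-node overrides, assuming hadoop003/hadoop004/hadoop005 take broker IDs 1/2/3; the IPs are placeholders for each host's own address:

    # hadoop004
    broker.id=2
    host.name=<hadoop004 IP>
    # hadoop005
    broker.id=3
    host.name=<hadoop005 IP>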

    4 Configure environment variables
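    A minimal sketch of the environment variables, assuming they go into /etc/profile (or ~/.bashrc) on each broker so that kafka-server-start.sh used below resolves without a full path:

    export KAFKA_HOME=/opt/wsqt/core/kafka
    export PATH=$PATH:$KAFKA_HOME/bin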

    5 Start Kafka

    Start the broker on hadoop003, hadoop004, and hadoop005 separately:

    kafka-server-start.sh -daemon /opt/wsqt/core/kafka/config/server.properties

    6 Verify with jps

    jps

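    If each node shows a Kafka process in jps, a quick end-to-end check is to create and list a test topic against the ZooKeeper ensemble from server.properties; the topic name and counts below are only an example:

    kafka-topics.sh --create --zookeeper hadoop003:2181,hadoop004:2181,hadoop005:2181 --replication-factor 2 --partitions 2 --topic test
    kafka-topics.sh --list --zookeeper hadoop003:2181,hadoop004:2181,hadoop005:2181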

    Spark Installation

    1 Download

    http://spark.apache.org/downloads.html


    2 Extract

    tar -zxvf spark-2.3.0-bin-hadoop2.6.tgz -C /opt/wsqt/core

    mv spark-2.3.0-bin-hadoop2.6/ spark

    Distribute the spark directory:

    scp -r spark hadoop@hadoop002:/opt/wsqt/core/

    scp -r spark hadoop@hadoop003:/opt/wsqt/core/

    scp -r spark hadoop@hadoop004:/opt/wsqt/core/

    scp -r spark hadoop@hadoop005:/opt/wsqt/core/

    3 Edit the configuration files

    spark/conf/slaves

    ## A Spark Worker will be started on each of the machines listed below.
    hadoop002
    hadoop003
    hadoop004
    hadoop005
    
    spark/conf/spark-defaults.conf (optional; may be left unchanged)

    ## Master setting; list several masters for high availability
    spark.master spark://hadoop001:7077,hadoop002:7077

    spark.eventLog.enabled true
    spark.eventLog.dir hdfs://wsqt/SparkeventLog
    spark.eventLog.compress false

    ## Class used to serialize objects sent over the network or cached in serialized form
    spark.serializer org.apache.spark.serializer.KryoSerializer

    ## Memory for the Spark application driver; on YARN the driver corresponds to the ApplicationMaster
    spark.driver.memory 1g

    ## Amount of memory to use per executor process, in the same format as JVM
    ## memory strings (e.g. 512m, 2g); this corresponds to the container size, i.e. the memory each executor may use
    spark.executor.memory 2g

    spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

    ## Local scratch directory; set several directories to spread I/O
    ## (may be overridden by environment variables set by the cluster manager)
    spark.local.dir /opt/wsqt/data/disk1/tmp/spark/local

    ## Port for your application's dashboard, which shows memory and workload data
    spark.ui.port 4040

    ## Compression settings used while jobs run
    ## Whether broadcast variables are compressed before being sent
    spark.broadcast.compress true
    ## Whether serialized RDD partitions are compressed
    spark.rdd.compress false
    ## Codec used to compress internal data
    spark.io.compression.codec snappy
    ## Snappy block size, in bytes
    spark.io.compression.snappy.block.size 32768

    ## Default number of tasks to use across the cluster for distributed shuffle
    ## operations (groupByKey, reduceByKey, etc) when not set by user.
    ## Number of cores of the local machines
    spark.default.parallelism 12
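    Because spark.eventLog.enabled is true, the HDFS directory named in spark.eventLog.dir has to exist before applications run; a minimal preparation step, assuming the wsqt nameservice from the config above:

    hdfs dfs -mkdir -p hdfs://wsqt/SparkeventLog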
    
    spark/conf/spark-env.sh

    export JAVA_HOME=/opt/wsqt/core/java
    export HADOOP_CONF_DIR=/opt/wsqt/core/hadoop/etc/hadoop
    export SPARK_HOME=/opt/wsqt/core/spark

    ## Several executor instances may be started
    export SPARK_EXECUTOR_INSTANCES=1
    export SPARK_EXECUTOR_CORES=1
    export SPARK_EXECUTOR_MEMORY=1G
    export SPARK_DRIVER_MEMORY=1G

    ## Web UI port of the master node
    export SPARK_MASTER_WEBUI_PORT=18080

    ## Web UI port of the worker nodes
    export SPARK_WORKER_WEBUI_PORT=18081

    export SPARK_WORKER_DIR=/opt/wsqt/core/spark/work
    export SPARK_LOG_DIR=/opt/wsqt/logs/spark

    ## Directory for pid files, so that long-running daemons do not lose their pid files under the default /tmp
    export SPARK_PID_DIR=/opt/wsqt/tmp

    ## Add the MySQL connector so Spark can work with the Hive metastore
    export SPARK_CLASSPATH=$SPARK_CLASSPATH:/opt/wsqt/core/spark/lib/mysql-connector-java-5.1.23-bin.jar
    ## Note: a MySQL connector of the right version can be found under /opt/wsqt/core/hive/lib/; copy it to /opt/wsqt/core/spark/lib

    ## Configure ZooKeeper in the daemon options to provide master high availability
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop001:2181,hadoop002:2181,hadoop003:2181 -Dspark.deploy.zookeeper.dir=/spark"

    ## Needed so Spark can pick up Hadoop's native libraries, e.g. snappy
    export LD_LIBRARY_PATH=/opt/wsqt/core/hadoop/lib/native
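    The spark directory was distributed before these files were edited, so the updated conf directory needs to be synced to the other nodes again; a sketch using the same scp pattern as above:

    scp -r /opt/wsqt/core/spark/conf hadoop@hadoop002:/opt/wsqt/core/spark/
    scp -r /opt/wsqt/core/spark/conf hadoop@hadoop003:/opt/wsqt/core/spark/
    scp -r /opt/wsqt/core/spark/conf hadoop@hadoop004:/opt/wsqt/core/spark/
    scp -r /opt/wsqt/core/spark/conf hadoop@hadoop005:/opt/wsqt/core/spark/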
    

    4 Modify the sbin directory scripts and permissions

    vi /opt/wsqt/core/spark/sbin/start-master.sh

    5 Start Spark

    Start the master and all workers in one step:

    /opt/wsqt/core/spark/sbin/start-spark-all.sh

    Or start only the master service:

    /opt/wsqt/core/spark/sbin/start-master.sh

    Or start all worker services separately; this script can only be run on the currently active master node:

    /opt/wsqt/core/spark/sbin/start-slaves.sh


    6 Open the master node's web UI (port 18080, as set in spark-env.sh) to check cluster status

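    Besides the web UI, a quick smoke test is to submit the bundled SparkPi example against the standalone master; the examples jar path matches the spark-2.3.0-bin-hadoop2.6 layout but is an assumption here:

    /opt/wsqt/core/spark/bin/spark-submit \
    --master spark://hadoop001:7077,hadoop002:7077 \
    --class org.apache.spark.examples.SparkPi \
    /opt/wsqt/core/spark/examples/jars/spark-examples_2.11-2.3.0.jar 100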

    Flume Installation

    1 Download

    http://www.apache.org/dyn/closer.lua/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz

    2 Extract

    tar -zxvf apache-flume-1.8.0-bin.tar.gz

    mv apache-flume-1.8.0-bin/ flume

    3 Environment variables
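    A minimal sketch, assuming the variables go into /etc/profile (or ~/.bashrc) on the node running the agent so that flume-ng and $FLUME_HOME used below resolve; the install path follows the pattern used for Kafka and Spark and is an assumption here:

    export FLUME_HOME=/opt/wsqt/core/flume
    export PATH=$PATH:$FLUME_HOME/bin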

    4 Configure the agents

    Example 1: logger sink

    exec-memory-logger.sources = exec-source
    exec-memory-logger.sinks = logger-sink
    exec-memory-logger.channels = memory-channel
    exec-memory-logger.sources.exec-source.type = exec
    exec-memory-logger.sources.exec-source.command = tail -F /opt/wsqt/logs/myproject/access.log
    exec-memory-logger.sources.exec-source.shell = /bin/sh -c
    exec-memory-logger.channels.memory-channel.type = memory
    exec-memory-logger.sinks.logger-sink.type = logger
    exec-memory-logger.sources.exec-source.channels = memory-channel
    exec-memory-logger.sinks.logger-sink.channel = memory-channel
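    To exercise this agent once it is running, append lines to the tailed file; creating the directory first is an assumption, since the config only references the path:

    mkdir -p /opt/wsqt/logs/myproject
    echo "hello flume $(date)" >> /opt/wsqt/logs/myproject/access.log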

    Example 2: Kafka sink

    exec-memory-kafka.sources = exec-source
    exec-memory-kafka.sinks = kafka-sink
    exec-memory-kafka.channels = memory-channel
    exec-memory-kafka.sources.exec-source.type = exec
    exec-memory-kafka.sources.exec-source.command = tail -F /opt/wsqt/logs/myproject/access.log
    exec-memory-kafka.sources.exec-source.shell = /bin/sh -c
    exec-memory-kafka.channels.memory-channel.type = memory
    exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
    exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop003:9092
    exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
    exec-memory-kafka.sinks.kafka-sink.batchSize = 5
    exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1
    exec-memory-kafka.sources.exec-source.channels = memory-channel
    exec-memory-kafka.sinks.kafka-sink.channel = memory-channel

    5 Start the agent

    flume-ng agent \
    --name exec-memory-logger \
    --conf-file $FLUME_HOME/conf/streaming_project.conf \
    -Dflume.root.logger=INFO,console
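    The same command starts the Kafka variant by switching --name and --conf-file to the exec-memory-kafka agent (the config file name below is an assumption); delivery can then be checked with a console consumer on any broker:

    flume-ng agent \
    --name exec-memory-kafka \
    --conf-file $FLUME_HOME/conf/exec-memory-kafka.conf \
    -Dflume.root.logger=INFO,console

    kafka-console-consumer.sh --bootstrap-server hadoop003:9092 --topic streamingtopic --from-beginning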

    This deployment guide is split into three parts.
