HADOOP

Author: 手扶拖拉机_6e4d | Published 2019-08-25 14:02

    HADOOP:
    1. Create the hadoop user
    useradd -m hadoop -s /bin/bash
    Set a password
    passwd hadoop (WOaiyuyu123)
    Grant administrator privileges (a sketch follows)
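    The original does not show a command for this step; a minimal sketch, assuming CentOS/RHEL where members of the wheel group may use sudo:

    # add the hadoop user to the wheel group (assumption: wheel is sudo-enabled in /etc/sudoers)
    usermod -aG wheel hadoop
    # or add an explicit sudoers entry via visudo:   hadoop ALL=(ALL) ALL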

    2. Hadoop download URL:
    https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
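
    A minimal sketch of downloading and unpacking it; the /usr/local/hadoop install path is assumed from the later steps:

    wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.8.5/hadoop-2.8.5.tar.gz
    mkdir -p /usr/local/hadoop
    # unpacks to /usr/local/hadoop/hadoop-2.8.5, which matches HADOOP_HOME configured below
    tar -zxvf hadoop-2.8.5.tar.gz -C /usr/local/hadoop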

    3. Configure the Java environment variables:
    which java
    /usr/bin/java

    ls -lrt  /usr/bin/java
    

    /usr/bin/java -> /etc/alternatives/java

    ls -lrt /etc/alternatives/java
    /etc/alternatives/java -> /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.el7_6.x86_64/jre/bin/java

    cd /usr/lib/jvm

    vi /etc/profile

    export JAVA_HOME=/usr/lib/jvm/java-1.8.0
    export JRE_HOME=$JAVA_HOME/jre
    export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
    export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib
    

    source /etc/profile
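
    A quick check that the variables took effect:

    echo $JAVA_HOME    # should print /usr/lib/jvm/java-1.8.0
    java -version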

    4. Check the Hadoop version
    bin/hadoop version

    HADOOP configuration

    vim /etc/profile

    export HADOOP_HOME=/usr/local/hadoop/hadoop-2.8.5
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    

    source /etc/profile

    5. Edit the Hadoop configuration
    5.1 Edit core-site.xml

    <!-- HDFS (NameNode) RPC address -->
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
        <!-- Base directory for data Hadoop generates at runtime -->
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/usr/local/hadoop/tmp</value>
            <description>Abase for other temporary directories.</description>
        </property>
        <property>
            <name>io.file.buffer.size</name>
            <value>4096</value>
        </property>
    </configuration>
    
    

    5.2 Edit hdfs-site.xml
    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
        <property>
            <name>dfs.namenode.name.dir</name>
            <value>/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
            <name>dfs.datanode.data.dir</name>
            <value>/usr/local/hadoop/tmp/dfs/data</value>
        </property>
    </configuration>

    Edit: /usr/local/hadoop/hadoop-2.8.5/etc/hadoop/hadoop-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0

    5.3 Format the NameNode
    bin/hdfs namenode -format

    5.4 Start
    sbin/start-dfs.sh
    Start the NameNode and DataNode individually:
    sbin/hadoop-daemon.sh start namenode
    sbin/hadoop-daemon.sh start datanode

    Startup failed: ssh needs to be installed
    localhost: /usr/local/hadoop/hadoop-2.8.5/sbin/slaves.sh: line 60: ssh: command not found
    localhost: /usr/local/hadoop/hadoop-2.8.5/sbin/slaves.sh: line 60: ssh: command not found
    Starting secondary namenodes [0.0.0.0]
    0.0.0.0: /usr/local/hadoop/hadoop-2.8.5/sbin/slaves.sh: line 60: ssh: command not found

    5.5 Install ssh
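    The original leaves this step blank; a minimal sketch, assuming CentOS inside a container (so sshd is started directly rather than via systemd):

    yum -y install openssh-server openssh-clients
    # generate host keys and start the ssh daemon so start-dfs.sh can ssh to each node
    ssh-keygen -A
    /usr/sbin/sshd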

    5.6 Set the root password
    passwd root (YUyuaiwo123)

    5.7 Create a directory
    bin/hdfs dfs -mkdir -p /user/wudy/input

    Upload the file under the wcinput directory to /user/wudy/input
    bin/hdfs dfs -put wcinput/wc.input /user/wudy/input

    Count the words

    bin/hadoop   jar   share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar  wordcount   /user/wudy/input   /user/wudy/output
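
    To inspect the result afterwards, for example:

    bin/hdfs dfs -cat /user/wudy/output/*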
    

    Caveats when re-formatting the NameNode in pseudo-distributed mode:
    (The original screenshot is omitted. Re-formatting while old DataNode data is still present leads to a clusterID mismatch, so clear the old data first; a sketch follows.)
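
    A minimal sketch of that procedure, using the paths configured above (hadoop.tmp.dir and the Hadoop logs directory):

    sbin/stop-dfs.sh
    # clear the old NameNode/DataNode data and logs before re-formatting
    rm -rf /usr/local/hadoop/tmp /usr/local/hadoop/hadoop-2.8.5/logs
    bin/hdfs namenode -format
    sbin/start-dfs.sh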

    5.8 Start YARN and run a MapReduce job
    Analysis:
    1> Configure the cluster to run MR on YARN
    2> Start the cluster and test basic operations
    3> Run the WordCount example on YARN

    Steps:
    Configure the cluster:
    Configure yarn-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0

    Configure yarn-site.xml

    <!-- How reducers fetch data -->
    <configuration>
        <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        </property>

    <!-- ResourceManager hostname (a bare hostname, not an hdfs:// URI) -->
        <property>
             <name>yarn.resourcemanager.hostname</name>
             <value>localhost</value>
        </property>
    </configuration>
    

    Configure mapred-env.sh
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0

    Configure mapred-site.xml (Hadoop 2.x ships only mapred-site.xml.template; copy it first, see the command after the block)

    <configuration>
            <property>
                    <name>mapreduce.framework.name</name>
                    <value>yarn</value>
            </property>
    </configuration>
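
    The copy step, which the original omits (run in the etc/hadoop directory):

    cp mapred-site.xml.template mapred-site.xml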
    

    Start the ResourceManager and NodeManager:

        sbin/yarn-daemon.sh start resourcemanager  (sbin/yarn-daemon.sh stop resourcemanager)
        sbin/yarn-daemon.sh start nodemanager   (sbin/yarn-daemon.sh stop nodemanager)
        jps
    
    • Hadoop web UI ports:
      50070 - HDFS web UI (172.17.0.4:50070/explorer.html#/)
      8088 - MapReduce / YARN web UI (hadoop4:8088/cluster)
    vim  yarn-site.xml
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>hadoop2:8032</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>hadoop2:8030</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>hadoop2:8031</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>hadoop2:8033</value>
    </property>
    
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>hadoop2:8088</value>
    </property>
    

    6. Configure the JobHistory server:
    vim mapred-site.xml

        <!-- JobHistory server address -->
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>hadoop2:10020</value>
        </property>
    
        <!-- JobHistory server web UI address -->
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>hadoop2:19888</value>
        </property>        
    

    Start the history server: sbin/mr-jobhistory-daemon.sh start historyserver
    Stop the history server: sbin/mr-jobhistory-daemon.sh stop historyserver

    YARN log aggregation:
    Requires restarting the NodeManager, ResourceManager and JobHistoryServer

    vim yarn-site.xml

        <property>
                <name>yarn.log-aggregation-enable</name>
                <value>true</value>
        </property>

        <property>
                <name>yarn.log-aggregation-retain-seconds</name>
                <value>604800</value>
        </property>
    
    • 7. Fully distributed mode (the main focus for development):
    • 7.1 Write a cluster distribution script, xsync
      scp (secure copy)
      scp copies data directly between servers
      Format: user@hadoop3:$pdir/$fname (destination user@host:destination path/file name)
      scp -r logs root@172.17.0.3:/opt/logs (push the data from hadoop2 to hadoop3)

    On hadoop3, pull hadoop2's logs into the current directory (/opt):
    scp -r root@hadoop2:/usr/local/hadoop/hadoop-2.8.5/logs ./

    If a new machine hadoop5 is added, sync hadoop3's profile to hadoop5 directly:
    scp /etc/profile root@hadoop5:/etc/profile

    • 7.2 rsync, a remote synchronization tool
      rsync is mainly used for synchronization and mirroring; it is fast, avoids copying identical content, supports symbolic links, and only transfers files that differ. The xsync script:
    #!/bin/bash
    # abort if no argument was given
    pcount=$#
    if((pcount==0)); then
    echo no args;
    exit;
    fi

    # get the file name
    p1=$1
    fname=`basename $p1`
    echo fname=$fname

    # resolve the parent directory to an absolute path
    pdir=`cd -P $(dirname $p1); pwd`
    echo pdir=$pdir

    # get the current user name
    user=`whoami`

    # push the file/directory to hadoop2..hadoop4
    for((host=2; host<5; host++)); do
        echo ------------hadoop$host----------
        rsync -rvl $pdir/$fname $user@hadoop$host:$pdir
    done
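
    Assuming the script above is saved as /usr/local/bin/xsync and made executable, usage would look like:

    chmod +x /usr/local/bin/xsync
    xsync /usr/local/hadoop/hadoop-2.8.5/etc/hadoop/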
    
    • 7.3 Cluster configuration
      Cluster deployment plan:

                 172.17.0.2 (hadoop2)    172.17.0.3 (hadoop3)            172.17.0.4 (hadoop4)
      HDFS       NameNode, DataNode      DataNode                        SecondaryNameNode, DataNode
      YARN       NodeManager             ResourceManager, NodeManager    NodeManager

    Configure hadoop2:
    vim core-site.xml

            <property>
                    <name>fs.defaultFS</name>
                    <value>hdfs://hadoop2:9000</value>
            </property>
    

    vim hdfs-site.xml

            <property>
                 <name>dfs.replication</name>
                 <value>3</value>
            </property>
    
            <property>
                 <name>dfs.namenode.secondary.http-address</name>
                 <value>hadoop4:50090</value>
            </property>
    

    vim yarn-site.xml

        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    
        <property>
             <name>yarn.resourcemanager.hostname</name>
             <value>hadoop3</value>
        </property>
    

    vim mapred-site.xml

        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
    
    • rsync download URL:
      https://download.samba.org/pub/rsync/rsync-3.1.3.tar.gz

    Note: use xsync to sync hadoop2's configuration to hadoop3 and hadoop4 (for now I edited them by hand)

    • The sync command looks like this:
      rsync -avzP --delete mapred-env.sh root@hadoop3:/usr/local/hadoop/hadoop-2.8.5/etc/hadoop/

    Stop the services first

    sbin/hadoop-daemon.sh  stop namenode
    sbin/hadoop-daemon.sh  stop datanode
    sbin/yarn-daemon.sh stop nodemanager
    sbin/yarn-daemon.sh stop resourcemanager
    sbin/mr-jobhistory-daemon.sh stop historyserver
    
    • Delete the data and logs directories on hadoop2, hadoop3 and hadoop4, then re-format
    rm -rf logs
    rm -rf /usr/local/hadoop/tmp/dfs/name/current
    bin/hdfs namenode -format

    • Clear the DataNode data (there is no "datanode -format" command; removing its data directory is enough, the DataNode re-registers against the freshly formatted NameNode)
    rm -rf /usr/local/hadoop/tmp/dfs/data/current
    

    Start the NameNode and DataNode on hadoop2:

        sbin/hadoop-daemon.sh start namenode
        sbin/hadoop-daemon.sh start datanode
    

    Start the DataNode on hadoop3 and hadoop4:
    sbin/hadoop-daemon.sh start datanode

    • Configure passwordless SSH login

    Install ssh-copy-id
    yum -y install openssh-clients

    Generate id_rsa and id_rsa.pub
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

    Copy hadoop2's public key to the remote hosts

        cd  (cd .ssh)
        ssh-copy-id root@172.17.0.3
        ssh-copy-id -i  /root/.ssh/id_rsa.pub  root@172.17.0.4
    

    The public key also has to be copied to the node itself (otherwise ssh hadoop2 from hadoop2 still asks for a password; we want a node to ssh to itself without a password too, and likewise for hadoop3 and hadoop4):
    ssh-copy-id hadoop2
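
    A quick way to verify passwordless login from hadoop2 (hostnames taken from the cluster plan above):

    ssh hadoop3 hostname
    ssh hadoop4 hostname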

    Likewise, generate a key pair on hadoop3 and copy its public key to hadoop2 and hadoop4
    Likewise, generate a key pair on hadoop4 and copy its public key to hadoop2 and hadoop3

    Finally, look at the authorized_keys file on hadoop2 (the two entries below were generated by hadoop3 and hadoop4 respectively)
    [root@f71da2a2f780 .ssh]#cat authorized_keys

    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQCvvYSOd2BbJtzA4E1YnNXFQ78rbRX+bPR9WZARFNQ0Cyh187Sc97W+Bcn1qkxCDHQyI3mwJV/9w66x/Sg7qcBZpjOFpf1F1jT+CUwVaVwhWLj0PhdkvyUYvlMTRkVdl4JYkWezw97p5Sd0OjJ0Lirp92xzByr5Lt128ZqMfvWYE1elR/ZGfcv361U3A6agZyZEV3tvXZ/acqaPfXzbCcP2YGqd8jUKi42rx50dOVvXinSuD8v9mA+LJ0prKzB2dh0PIkCZ9mUBPu8IgyVtYmzZpdNc3bzcaeBjRUOKnlSVTGnssuHl89+mspETgk5y+huLqQ+3XK1aoMXXm0St9CAzujPlwv2kvjcZWSeyAci6/i2KKvML4or42kDZz1nYtzUhcMGoOZrjVMxoLgzs9eUUA4jIZazPf8FX8I5oh7Kpd5HY8XC6B63pFhWpAzlyyW2cq7j9wQDzb2dktzNtrOqEsylCKYMs8cbRXXSaZ2+3UILevtXwt5rI0AYyOwpNdqk= root@e9c4e3e03433
    ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDJ51DLAzR1kKHI+wrE//oiCT1blAymqEzICbXHrfYM0UcnNCW7J14Pab7KsdaVWhWz0qEPbRInJKqw53klgFK2o3lb7dPCYCeacPl90uI1RxupzEZwrzlZwAeuFgzZeoH4G7I/mCsxmK8upYVeYdoX2BhjbOJQVacRXpdtuMMTTo0VgeXxllWoxR9lJGVwXUn8dbIaQCr5HMGrqwiCuHpPw/zi31TN9V20ubb833eCzXmY+DgtVRoka01Ir8fnuqVAbd404SzwxN9bvM6oyoozK/23UT8tJJNFy2FvzO6trp+2LS+m7IPZN/eSvb0XgQOWV3RiF8e/pYkQ6ep1DA+XnCQsY1qVBk1X7zCr4zC14ounIvnYdHs00+uoAoSRK74uitr4i+GiJaONsN+6n2MQceT3HR1K0KSDhrOZb77EGi4eVH3PaM3+0mmvjVlNoc7gNKtWMe0h8sB6ROGVAAsKxlRjjVxcgtNtVr1nw01KV5HwHBmtWt2BpFMOms0arB8= root@6e59d53ca6b9
    

    Key step: starting the whole cluster at once
    1. Configure slaves
    cd /usr/local/hadoop/hadoop-2.8.5/etc/hadoop/
    vim slaves
    Add the following (no extra spaces or blank lines are allowed in this file):

            hadoop2
            hadoop3
            hadoop4
    

    (Configure hadoop3 and hadoop4 the same way)

    Start the cluster from hadoop2:
    (this starts the HDFS daemons: NameNode, DataNodes and SecondaryNameNode)

    [root@f71da2a2f780 hadoop-2.8.5]# sbin/start-dfs.sh
    Starting namenodes on [hadoop2]
    hadoop2: starting namenode, logging to /usr/local/hadoop/hadoop-2.8.5/logs/hadoop-root-namenode-f71da2a2f780.out
    hadoop3: datanode running as process 3551. Stop it first.
    hadoop2: starting datanode, logging to /usr/local/hadoop/hadoop-2.8.5/logs/hadoop-root-datanode-f71da2a2f780.out
    hadoop4: datanode running as process 3877. Stop it first.
    Starting secondary namenodes [hadoop4]
    hadoop4: secondarynamenode running as process 3970. Stop it first.
    

    Start YARN as a whole

    Start (or stop) YARN on hadoop3:

    [root@e9c4e3e03433 hadoop-2.8.5]# sbin/stop-yarn.sh
    stopping yarn daemons
    stopping resourcemanager
    hadoop4: stopping nodemanager
    hadoop2: stopping nodemanager
    hadoop3: stopping nodemanager
    hadoop4: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
    hadoop2: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
    hadoop3: nodemanager did not stop gracefully after 5 seconds: killing with kill -9
    
    • Note: if the NameNode (hadoop2) and the ResourceManager (hadoop3) are not on the same machine, do not start YARN on the NameNode;
      start YARN on the machine where the ResourceManager (hadoop3) runs.
    • Note: daemons on hadoop3 and hadoop4 that were started from hadoop2 must also be stopped from hadoop2; running the stop directly on hadoop3 gives
    [root@e9c4e3e03433 hadoop-2.8.5]# sbin/hadoop-daemon.sh stop datanode
                                       no datanode to stop
    

    This is caused by the cluster_id not matching, as shown below:

    [root@f71da2a2f780 hadoop-2.8.5]# sbin/stop-dfs.sh
    Stopping namenodes on [hadoop2]
    hadoop2: no namenode to stop
    hadoop2: stopping datanode
    hadoop4: stopping datanode
    hadoop3: stopping datanode
    Stopping secondary namenodes [hadoop4]
    hadoop4: stopping secondarynamenode
    

    Basic cluster test:
    [root@f71da2a2f780 hadoop-2.8.5]# bin/hdfs dfs -put wcinput/wc.input /

    Path where the file's blocks are stored on disk:
    /usr/local/hadoop/hadoop-2.8.5/tmp/dfs/data/current/BP-365936823-172.17.0.2-1568430331460/current/finalized/subdir0/subdir0/

    Command reference:
    1> Start/stop HDFS as a whole
    sbin/start-dfs.sh / sbin/stop-dfs.sh

    2> Start/stop YARN as a whole
    sbin/start-yarn.sh / sbin/stop-yarn.sh

    • ----------------- Cluster time synchronization ----------------------
      1. Install crontab
      yum -y install vixie-cron crontabs
    Option           Function
    -e               edit the crontab entries
    -l               list the crontab entries
    -r               remove all crontab entries of the current user


    Edit: crontab -e
    * * * * * command-to-run

    Field            Meaning                          Range
    1. first  "*"    minute of the hour               0-59
    2. second "*"    hour of the day                  0-23
    3. third  "*"    day of the month                 1-31
    4. fourth "*"    month of the year                1-12
    5. fifth  "*"    day of the week                  0-7 (0 and 7 both mean Sunday)

    Special symbol   Meaning
    1   *      any value; e.g. a * in the first field means the job runs at every minute of the hour
    2   ,      a list of discrete values; e.g. "0 8,12,16 * * *" runs at 8:00, 12:00 and 16:00 every day
    3   -      a continuous range; e.g. "0 5 * * 1-6" runs at 5:00 a.m. from Monday to Saturday
    4   */n    every n units; e.g. "*/10 * * * *" runs every 10 minutes

    • Practical example (the original screenshot is omitted); a sketch of a typical time-sync cron job follows.
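
    A minimal sketch, assuming hadoop2 runs an NTP service the other nodes can reach (ntpdate and the 10-minute interval are assumptions, not from the original):

    # crontab entry on hadoop3 and hadoop4: sync the clock from hadoop2 every 10 minutes
    */10 * * * * /usr/sbin/ntpdate hadoop2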
