Setting up Hadoop Pseudo-Distributed Mode with Docker

Author: niracler | Published 2019-04-10 11:01

    This should count as the first article in my big data series. The road ahead is long and winding; I hadn't planned to move toward big data this soon, but a project here demands it, so learn I must. Fair warning: this article is not beginner-friendly. It has quite a few prerequisites: you need Docker installed, and you should know the common Docker commands as well as how to use Dockerfiles and docker-compose. My (admittedly idiosyncratic) Python programmer leveling guide has learning materials for all of these. The project repository is at https://github.com/Niracler/bigdata-exercise

    This article walks through the details of building a pseudo-distributed Hadoop setup with Docker. If that sounds like too much trouble, there are simpler routes. I'll lead the way: run the quick start below and taste the joy of success before worrying about the details.

    Clone the project first; all of the Dockerfiles below come from it:

    $git clone https://github.com/Niracler/bigdata-exercise.git
    $cd bigdata-exercise/docker-hadoop/
    

    Start it up

    $sudo su
    $docker-compose -f docker-compose-local.yml up -d
    

    Visit http://localhost:50070/ and you should see the NameNode web UI:

    (Screenshot: the HDFS NameNode overview page)

    Enter the container

    $docker exec -it namenode bash
    

    Test uploading a file

    $touch test
    $hdfs dfs -put test  /
    $hdfs dfs -ls /
    
    (Screenshot: hdfs dfs -ls / listing the uploaded test file)
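
    To double-check the upload, you can read the file back (it was created with touch, so expect empty output) and print a cluster report listing the live datanodes; both are standard HDFS commands:

    $hdfs dfs -cat /test
    $hdfs dfsadmin -report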

    And then? There is no "and then". The rest of this article goes through the Dockerfiles and the other files in detail.

    Building the images

    Directory structure

    First, make sure you are working inside the following directory structure.

    (Screenshot: directory tree of docker-hadoop/)
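
    Reconstructed from the build commands and file names used below (so treat this layout as an educated guess at the repository, not gospel), the tree should look roughly like:

    docker-hadoop/
    ├── base/
    │   ├── Dockerfile
    │   └── entrypoint.sh
    ├── namenode/
    │   ├── Dockerfile
    │   └── run.sh
    ├── datanode/
    │   ├── Dockerfile
    │   └── run.sh
    ├── nodemanager/
    │   ├── Dockerfile
    │   └── run.sh
    ├── resourcemanager/
    │   ├── Dockerfile
    │   └── run.sh
    ├── historyserver/
    │   ├── Dockerfile
    │   └── run.sh
    ├── hadoop.env
    └── docker-compose-local.yml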

    Building the hadoop-base image

    The hadoop-base Dockerfile. All of the service-specific Dockerfiles below build on this one. Its main job is to provide the JDK (via the openjdk:8 base image) and to download Hadoop:

    FROM openjdk:8
    MAINTAINER Niracler <niracler@gmail.com>
    
    # Download and install Hadoop: http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz
    ENV HADOOP_VERSION 2.7.7
    ENV HADOOP_URL http://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
    
    # Download the tarball, extract it into /opt, then remove it
    RUN set -x \
        && curl -fSL "$HADOOP_URL" -o /tmp/hadoop.tar.gz \
        && tar -xvf /tmp/hadoop.tar.gz -C /opt/ \
        && rm /tmp/hadoop.tar.gz*
    
    RUN ln -s /opt/hadoop-$HADOOP_VERSION/etc/hadoop /etc/hadoop
    RUN cp /etc/hadoop/mapred-site.xml.template /etc/hadoop/mapred-site.xml
    RUN mkdir /opt/hadoop-$HADOOP_VERSION/logs
    
    RUN mkdir /hadoop-data
    
    # Set environment variables
    ENV HADOOP_PREFIX=/opt/hadoop-$HADOOP_VERSION
    ENV HADOOP_CONF_DIR=/etc/hadoop
    ENV MULTIHOMED_NETWORK=1
    
    ENV USER=root
    ENV PATH $HADOOP_PREFIX/bin/:$PATH
    
    # Add the entrypoint script
    ADD entrypoint.sh /entrypoint.sh
    RUN chmod a+x /entrypoint.sh
    
    ENTRYPOINT ["/entrypoint.sh"]
    

    The entrypoint script, entrypoint.sh:

    #!/bin/bash
    
    # Set some sensible defaults
    export CORE_CONF_fs_defaultFS=${CORE_CONF_fs_defaultFS:-hdfs://`hostname -f`:8020}
    
    function addProperty() {
      local path=$1
      local name=$2
      local value=$3
    
      local entry="<property><name>$name</name><value>${value}</value></property>"
      local escapedEntry=$(echo $entry | sed 's/\//\\\//g')
      sed -i "/<\/configuration>/ s/.*/${escapedEntry}\n&/" $path
    }
    
    function configure() {
        local path=$1
        local module=$2
        local envPrefix=$3
    
        local var
        local value
    
        echo "Configuring $module"
        for c in `printenv | perl -sne 'print "$1 " if m/^${envPrefix}_(.+?)=.*/' -- -envPrefix=$envPrefix`; do
            name=`echo ${c} | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;'`
            var="${envPrefix}_${c}"
            value=${!var}
            echo " - Setting $name=$value"
            addProperty /etc/hadoop/$module-site.xml $name "$value"
        done
    }
    
    configure /etc/hadoop/core-site.xml core CORE_CONF
    configure /etc/hadoop/hdfs-site.xml hdfs HDFS_CONF
    configure /etc/hadoop/yarn-site.xml yarn YARN_CONF
    configure /etc/hadoop/httpfs-site.xml httpfs HTTPFS_CONF
    configure /etc/hadoop/kms-site.xml kms KMS_CONF
    
    if [ "$MULTIHOMED_NETWORK" = "1" ]; then
        echo "Configuring for multihomed network"
    
        # HDFS
        addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.rpc-bind-host 0.0.0.0
        addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.servicerpc-bind-host 0.0.0.0
        addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.http-bind-host 0.0.0.0
        addProperty /etc/hadoop/hdfs-site.xml dfs.namenode.https-bind-host 0.0.0.0
        addProperty /etc/hadoop/hdfs-site.xml dfs.client.use.datanode.hostname true
        addProperty /etc/hadoop/hdfs-site.xml dfs.datanode.use.datanode.hostname true
    
        # YARN
        addProperty /etc/hadoop/yarn-site.xml yarn.resourcemanager.bind-host 0.0.0.0
        addProperty /etc/hadoop/yarn-site.xml yarn.nodemanager.bind-host 0.0.0.0
        addProperty /etc/hadoop/yarn-site.xml yarn.timeline-service.bind-host 0.0.0.0
    
        # MAPRED
        addProperty /etc/hadoop/mapred-site.xml yarn.nodemanager.bind-host 0.0.0.0
    fi
    
    if [ -n "$GANGLIA_HOST" ]; then
        mv /etc/hadoop/hadoop-metrics.properties /etc/hadoop/hadoop-metrics.properties.orig
        mv /etc/hadoop/hadoop-metrics2.properties /etc/hadoop/hadoop-metrics2.properties.orig
    
        for module in mapred jvm rpc ugi; do
            echo "$module.class=org.apache.hadoop.metrics.ganglia.GangliaContext31"
            echo "$module.period=10"
            echo "$module.servers=$GANGLIA_HOST:8649"
        done > /etc/hadoop/hadoop-metrics.properties
    
        for module in namenode datanode resourcemanager nodemanager mrappmaster jobhistoryserver; do
            echo "$module.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31"
            echo "$module.sink.ganglia.period=10"
            echo "$module.sink.ganglia.supportsparse=true"
            echo "$module.sink.ganglia.slope=jvm.metrics.gcCount=zero,jvm.metrics.memHeapUsedM=both"
            echo "$module.sink.ganglia.dmax=jvm.metrics.threadsBlocked=70,jvm.metrics.memHeapUsedM=40"
            echo "$module.sink.ganglia.servers=$GANGLIA_HOST:8649"
        done > /etc/hadoop/hadoop-metrics2.properties
    fi
    
    exec "$@"
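
    The clever part of this script is how configure() maps environment variables onto Hadoop XML properties: the prefix (CORE_CONF, HDFS_CONF, ...) picks the target *-site.xml, and in the rest of the variable name a triple underscore becomes a dash, a double underscore becomes a literal underscore, and a single underscore becomes a dot. Here is a minimal sketch of that mangling, using the exact perl expression from the script and variable names that appear later in this article:

    #!/bin/bash
    # Demonstrate the env-var -> property-name mangling used by configure()
    for c in dfs_namenode_name_dir yarn_log___aggregation___enable yarn_timeline___service_leveldb___timeline___store_path; do
        # ___ -> '-', then __ -> '_' (via a '@' placeholder), then _ -> '.'
        name=$(echo "$c" | perl -pe 's/___/-/g; s/__/@/g; s/_/./g; s/@/_/g;')
        echo "$c -> $name"
    done
    # Prints:
    #   dfs_namenode_name_dir -> dfs.namenode.name.dir
    #   yarn_log___aggregation___enable -> yarn.log-aggregation-enable
    #   yarn_timeline___service_leveldb___timeline___store_path -> yarn.timeline-service.leveldb-timeline-store.path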
    

    Build the hadoop-base image

    $docker build -t="hadoop-base" ./base
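
    As a quick sanity check of my own (not a step from the original walkthrough), you can run hadoop version through the entrypoint; it should print Hadoop 2.7.7 and exit:

    $docker run --rm hadoop-base hadoop version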
    

    Building hadoop-namenode

    The hadoop-namenode Dockerfile:

    FROM hadoop-base
    MAINTAINER niracler <niracler@gmail.com>
    
    HEALTHCHECK CMD curl -f http://localhost:50070/ || exit 1
    
    ENV HDFS_CONF_dfs_namenode_name_dir=file:///hadoop/dfs/name
    RUN mkdir -p /hadoop/dfs/name
    VOLUME /hadoop/dfs/name
    
    ADD run.sh /run.sh
    RUN chmod a+x /run.sh
    
    EXPOSE 50070
    
    CMD ["/run.sh"]
    

    The startup script, run.sh:

    #!/bin/bash
    
    namedir=`echo $HDFS_CONF_dfs_namenode_name_dir | perl -pe 's#file://##'`
    if [ ! -d $namedir ]; then
      echo "Namenode name directory not found: $namedir"
      exit 2
    fi
    
    if [ -z "$CLUSTER_NAME" ]; then
      echo "Cluster name not specified"
      exit 2
    fi
    
    if [ "`ls -A $namedir`" == "" ]; then
      echo "Formatting namenode name directory: $namedir"
      $HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR namenode -format $CLUSTER_NAME
    fi
    
    $HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR namenode
    

    Build the hadoop-namenode image

    $docker build -t="hadoop-namenode" ./namenode
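
    If you want to smoke-test this image on its own before wiring up docker-compose, something like the following should work (a sketch: CLUSTER_NAME is required by run.sh above, and 50070 is the port the Dockerfile exposes):

    $docker run --rm -e CLUSTER_NAME=test -p 50070:50070 hadoop-namenode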
    

    Building hadoop-datanode

    The hadoop-datanode Dockerfile:

    FROM hadoop-base
    MAINTAINER niracler <niracler@gmail.com>
    
    HEALTHCHECK CMD curl -f http://localhost:50075/ || exit 1
    
    ENV HDFS_CONF_dfs_datanode_data_dir=file:///hadoop/dfs/data
    RUN mkdir -p /hadoop/dfs/data
    VOLUME /hadoop/dfs/data
    
    ADD run.sh /run.sh
    RUN chmod a+x /run.sh
    
    EXPOSE 50075
    
    CMD ["/run.sh"]
    

    The startup script, run.sh:

    #!/bin/bash
    
    datadir=`echo $HDFS_CONF_dfs_datanode_data_dir | perl -pe 's#file://##'`
    if [ ! -d $datadir ]; then
      echo "Datanode data directory not found: $datadir"
      exit 2
    fi
    
    $HADOOP_PREFIX/bin/hdfs --config $HADOOP_CONF_DIR datanode
    

    Build the hadoop-datanode image

    $docker build -t="hadoop-datanode" ./datanode
    

    Building hadoop-nodemanager

    The hadoop-nodemanager Dockerfile:

    FROM hadoop-base
    MAINTAINER Niracler <niracler@gmail.com>
    
    HEALTHCHECK CMD curl -f http://localhost:8042/ || exit 1
    
    ADD run.sh /run.sh
    RUN chmod a+x /run.sh
    
    EXPOSE 8042
    
    CMD ["/run.sh"]
    

    The startup script, run.sh:

    #!/bin/bash
    
    $HADOOP_PREFIX/bin/yarn --config $HADOOP_CONF_DIR nodemanager
    

    Build the hadoop-nodemanager image

    $docker build -t="hadoop-nodemanager" ./nodemanager
    

    Building hadoop-resourcemanager

    The hadoop-resourcemanager Dockerfile:

    FROM hadoop-base
    MAINTAINER niracler <niracler@gmail.com>
    
    HEALTHCHECK CMD curl -f http://localhost:8088/ || exit 1
    
    ADD run.sh /run.sh
    RUN chmod a+x /run.sh
    
    EXPOSE 8088
    
    CMD ["/run.sh"]
    

    The startup script, run.sh:

    #!/bin/bash
    
    $HADOOP_PREFIX/bin/yarn --config $HADOOP_CONF_DIR resourcemanager
    

    Build the hadoop-resourcemanager image

    $docker build -t="hadoop-resourcemanager" ./resourcemanager
    

    Building hadoop-historyserver

    The hadoop-historyserver Dockerfile:

    FROM hadoop-base
    MAINTAINER niracler <niracler@gmail.com>
    
    HEALTHCHECK CMD curl -f http://localhost:8188/ || exit 1
    
    ENV YARN_CONF_yarn_timeline___service_leveldb___timeline___store_path=/hadoop/yarn/timeline
    RUN mkdir -p /hadoop/yarn/timeline
    VOLUME /hadoop/yarn/timeline
    
    ADD run.sh /run.sh
    RUN chmod a+x /run.sh
    
    EXPOSE 8188
    
    CMD ["/run.sh"]
    

    The startup script, run.sh:

    #!/bin/bash
    
    $HADOOP_PREFIX/bin/yarn --config $HADOOP_CONF_DIR historyserver
    

    Build the hadoop-historyserver image

    $docker build -t="hadoop-historyserver" ./historyserver
    

    The configuration file, hadoop.env (loaded by every service in the compose file below):

    CORE_CONF_fs_defaultFS=hdfs://namenode:8020
    CORE_CONF_hadoop_http_staticuser_user=root
    CORE_CONF_hadoop_proxyuser_hue_hosts=*
    CORE_CONF_hadoop_proxyuser_hue_groups=*
    
    HDFS_CONF_dfs_webhdfs_enabled=true
    HDFS_CONF_dfs_permissions_enabled=false
    
    YARN_CONF_yarn_log___aggregation___enable=true
    YARN_CONF_yarn_resourcemanager_recovery_enabled=true
    YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
    YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
    YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
    YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
    YARN_CONF_yarn_timeline___service_enabled=true
    YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
    YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
    YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
    YARN_CONF_yarn_timeline___service_hostname=historyserver
    YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
    YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
    YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031
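
    Every line here goes through the same configure() mangling from entrypoint.sh, so adding a setting is just adding a line. For example (my own illustration, not part of the original file), pinning the HDFS replication factor to 3 to match the three datanodes would presumably look like:

    HDFS_CONF_dfs_replication=3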
    

    Start

    This uses docker-compose; the contents of docker-compose-local.yml are as follows:

    version: "2"
    
    services:
      namenode:
        image: hadoop-namenode
        hostname: namenode
        container_name: namenode
        volumes:
          - ./data/namenode:/hadoop/dfs/name
        environment:
          - CLUSTER_NAME=test
        env_file:
          - ./hadoop.env
        ports:
          - "50070:50070"
    
      resourcemanager:
        image: hadoop-resourcemanager
        hostname: resourcemanager
        container_name: resourcemanager
        depends_on:
          - "namenode"
        links:
          - "namenode"
        env_file:
          - ./hadoop.env
    
      historyserver:
        image: hadoop-historyserver
        hostname: historyserver
        container_name: historyserver
        volumes:
          - ./data/historyserver:/hadoop/yarn/timeline
        depends_on:
          - "namenode"
        links:
          - "namenode"
        env_file:
          - ./hadoop.env
    
      nodemanager1:
        image: hadoop-nodemanager
        hostname: nodemanager1
        container_name: nodemanager1
        depends_on:
          - "namenode"
          - "resourcemanager"
        links:
          - "namenode"
          - "resourcemanager"
        env_file:
          - ./hadoop.env
    
      datanode1:
        image: hadoop-datanode
        hostname: datanode1
        container_name: datanode1
        depends_on:
          - "namenode"
        links:
          - "namenode"
        volumes:
          - ./data/datanode1:/hadoop/dfs/data
        env_file:
          - ./hadoop.env
    
      datanode2:
        image: hadoop-datanode
        hostname: datanode2
        container_name: datanode2
        depends_on:
          - "namenode"
        links:
          - "namenode"
        volumes:
          - ./data/datanode2:/hadoop/dfs/data
        env_file:
          - ./hadoop.env
    
      datanode3:
        image: hadoop-datanode
        hostname: datanode3
        container_name: datanode3
        depends_on:
          - "namenode"
        links:
          - "namenode"
        volumes:
          - ./data/datanode3:/hadoop/dfs/data
        env_file:
          - ./hadoop.env
    

    Start

    $docker-compose -f docker-compose-local.yml up -d
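
    Once it is up, you can list the seven containers and their health status, follow the namenode logs, and eventually tear everything down again. These are plain docker/docker-compose commands, nothing specific to this project:

    $docker-compose -f docker-compose-local.yml ps
    $docker logs -f namenode
    $docker-compose -f docker-compose-local.yml down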
    

    Visit http://localhost:50070/ and you should see the NameNode UI, now reporting three live datanodes:

    (Screenshot: the HDFS NameNode overview page)
