###Introduction to Big Data
1 Origins of big data
With the development of computer technology and the spread of the Internet, information has accumulated and grown explosively. Collecting, searching, and summarizing it has become increasingly difficult, so new technologies are needed to solve the problem.
2 What is big data
A collection of data that cannot be captured, managed, and processed with conventional software tools within a tolerable time frame; it requires new processing models to turn large volumes of diverse information into assets.
The goal is to extract valuable information quickly from data of all types.
What can it do?
Data analysis helps enterprises cut costs, improve efficiency, develop new products, and make smarter business decisions.
Correlations between information and data ---> spotting business trends, judging research quality, preventing the spread of disease, fighting crime, etc.
Related technologies: massively parallel processing databases, data-mining grids, distributed file systems/databases, cloud computing platforms, and scalable storage systems.
3 The 5 V characteristics of big data
Volume (large scale)
Variety (many data types)
Velocity (speed / timeliness)
Veracity (accuracy)
Value (high value)
4 Big data and Hadoop
What is Hadoop
A software platform for analyzing and processing data
Open-source software, written in Java
A distributed infrastructure
Features:
High reliability, high scalability, high efficiency, high fault tolerance, low cost
Official descriptions:
Hadoop YARN – introduced in 2012 is a platform responsible for managing computing resources in clusters and using them for scheduling users' applications; [9][10] Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing. The term Hadoop has come to refer not just to the aforementioned base modules and sub-modules, but also to the ecosystem, [11] or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failure.
###Hadoop
1 History
Originated at Google ---> Google's implementation is closed source; Hadoop is the open-source counterpart
2 Components:
Common components
Core components
Ecosystem
HDFS architecture
HDFS roles and concepts
NameNode, Client, SecondaryNameNode, DataNode
MapReduce architecture
MapReduce roles and concepts
YARN architecture
YARN roles and concepts
###Hadoop Installation
1 yum -y install java-1.8.0-openjdk-devel (jps needs the -devel package) --> verify: java -version , jps
2 hadoop dir: /usr/local/hadoop
3 edit the config file: vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
$ export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64/jre ----> path found with: rpm -qc java-1.8.0-openjdk
$ export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
4 use the command:
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /usr/local/hadoop/ss/ /usr/local/hadoop/yy
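A minimal usage sketch for the command above, assuming a few text files are used as input (the *.txt files shipped in the Hadoop directory will do); in this standalone setup the result is written to local part-r-* files, and the output directory yy must not exist beforehand:
mkdir /usr/local/hadoop/ss
cp /usr/local/hadoop/*.txt /usr/local/hadoop/ss/     # LICENSE.txt, NOTICE.txt, README.txt as sample input
# ... run the wordcount command above ...
cat /usr/local/hadoop/yy/part-r-00000                # one "word  count" pair per line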
###Hadoop Configuration
The six main configuration files:
core-site.xml      global configuration; write it yourself (templates on the official site)
hadoop-env.sh
hdfs-site.xml      write it yourself (templates on the official site)
mapred-site.xml
yarn-site.xml
slaves             node list file; write it yourself
• File formats
– hadoop-env.sh
JAVA_HOME
HADOOP_CONF_DIR
– XML configuration format
<property>
<name>key name</name>
<value>value</value>
<description> description </description>   (optional)
</property>
Configuration file template:
<configuration>
<property>
<name></name>
<value></value>
</property>
<property>
<name></name>
<value></value>
</property>
<property>
<name></name>
<value></value>
</property>
</configuration>
Building a fully distributed HDFS cluster
Environment preparation:
- selinux=disabled
- disable firewalld: systemctl stop/mask firewalld
- install on all hosts: yum -y install java-1.8.0-openjdk-devel
1. hostname
2. host name resolution in /etc/hosts
192.168.1.10 hadoop-nn01
192.168.1.11 node1
192.168.1.12 node2
192.168.1.13 node3
rsync -av /etc/hosts node1:/etc/
3. every host can SSH to every other host without a password, and the first SSH login must not prompt for "yes"
vim /etc/ssh/ssh_config    # note: this is the client config, not the server config sshd_config
Host *
GSSAPIAuthentication yes
StrictHostKeyChecking no       # disable strict host key checking (copy the StrictHostKeyChecking line from the commented examples in the file)
rm -rf /root/.ssh/known_hosts
cd /root/.ssh
ssh-keygen -t rsa -b 2048 -N ''
ssh-copy-id -i id_rsa.pub node1/node2/node3/192.168.1.10    # run once per host; !!! remember to copy the key to yourself as well
4. edit the configuration files
Official reference:
https://hadoop.apache.org/docs/r2.7.6/hadoop-project-dist/hadoop-common/core-default.xml
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64/jre   # path found with: rpm -ql java-1.8.0-openjdk
export HADOOP_CONF_DIR="/usr/local/hadoop/etc/hadoop"   # set this to the actual configuration directory
fs.defaultFS: the default file system (address) parameter
hadoop.tmp.dir: the root directory for all data ----> /var/hadoop ---> comparable to /var/lib/mysql (the directory where all data is stored)
vim /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-nn01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop</value>
</property>
</configuration>
dfs.namenode.http-address               # address declaration
dfs.namenode.secondary.http-address     # address declaration
dfs.replication                         # number of replicas per file
vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop-nn01:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-nn01:50090</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
vim /usr/local/hadoop/etc/hadoop/slaves
node1
node2
node3
5. sync the full configuration to all nodes
rsync -aSH --delete /usr/local/hadoop node1:/usr/local/ &
rsync -aSH --delete /usr/local/hadoop node2:/usr/local/ &
rsync -aSH --delete /usr/local/hadoop node3:/usr/local/ &
6. create the data directory specified in the configuration file
mkdir /var/hadoop
7. on the namenode, format the file system
./bin/hdfs namenode -format
8. start the cluster
./sbin/start-dfs.sh
9. check that each host runs the correct roles
jps
10. check the HDFS cluster status
./bin/hdfs dfsadmin -report
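A small verification sketch, assuming the passwordless SSH set up in step 3 (role names as printed by jps):
jps                                                   # on hadoop-nn01: NameNode, SecondaryNameNode
for i in node{1..3}; do echo $i; ssh $i jps; done     # on each datanode: DataNode
./bin/hdfs dfsadmin -report | grep 'Live datanodes'   # expect: Live datanodes (3)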
------------------------------------------------------------------------
Node Management / NFS Gateway
** If the HDFS data gets corrupted, how to recover:
rm -rf /var/hadoop/*        # on the namenode and on datanodes node1-node3, then format and start the cluster again
Configuration files to edit under /usr/local/hadoop/etc/hadoop:
mapred-site.xml
mapreduce.framework.name
vim /usr/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
yarn.resourcemanager.hostname
yarn.nodemanager.aux-services
vim /usr/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-nn01</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
for i in node{1..3}
do
rsync -aSH --delete /usr/local/hadoop/etc/ ${i}:/usr/local/hadoop/etc/ -e 'ssh'
done
./sbin/start-yarn.sh
./bin/yarn node -list
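A small verification sketch for YARN, assuming the daemons started cleanly (role names as printed by jps; all three nodes should show as RUNNING):
jps                                                   # on hadoop-nn01: ResourceManager (plus NameNode, SecondaryNameNode)
for i in node{1..3}; do echo $i; ssh $i jps; done     # on each node: NodeManager and DataNode
./bin/yarn node -list                                 # expect: Total Nodes:3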
Web UI verification:
namenode
192.168.1.10:50070
secondary namenode
192.168.1.10:50090
datanode
192.168.1.11:50075
resourcemanager
192.168.1.10:8088
nodemanager
192.168.1.11:8042
./bin/hadoop fs -ls /                            # the root directory of the cluster file system
./bin/hadoop fs -ls hdfs://hadoop-nn01:9000/     # the same directory, addressed with the full URI
Three commands differ from the regular shell:
touchz
put
get
./bin/hadoop fs -mkdir /abc
./bin/hadoop fs -touchz /abc/yyy
./bin/hadoop fs -ls /abc
./bin/hadoop fs -put /etc/passwd /abc/
./bin/hadoop fs -get /abc/*.txt /dev/shm
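A quick check that the uploads really live in HDFS (paths follow the examples above):
./bin/hadoop fs -ls /abc                  # should list yyy and passwd
./bin/hadoop fs -cat /abc/passwd | head   # the content is read back from the datanodes, not the local disk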
Analyzing data with the cluster:
# for small amounts of data a single machine is actually faster; the cluster only pays off for truly large analysis jobs
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount hdfs://hadoop-nn01:9000/abc/*.txt hdfs://hadoop-nn01:9000/output
The results can be viewed and downloaded from the web UI:
http://192.168.1.10:50070/explorer.html#/output
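The result can also be read from the command line; a minimal sketch (with the default single reducer the result file is part-r-00000):
./bin/hadoop fs -ls /output
./bin/hadoop fs -cat /output/part-r-00000 | head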
Adding a worker node:
Prerequisites: firewalld disabled, selinux=disabled
1 create a new VM: set its IP and hostname
2 passwordless SSH: ssh-copy-id -i id_rsa.pub 192.168.1.15
3 copy /usr/local/hadoop to the new node: rsync -aSH --delete /usr/local/hadoop/ 192.168.1.15:/usr/local/hadoop
4 add the new node to /etc/hosts & /usr/local/hadoop/etc/hadoop/slaves, then sync both files to all hosts
ansible all -m copy -a 'src=/etc/hosts dest=/etc/hosts'
ansible all -m copy -a 'src=/usr/local/hadoop/etc/hadoop/slaves dest=/usr/local/hadoop/etc/hadoop/slaves'
5 start the datanode daemon on the new node:
./sbin/hadoop-daemon.sh start datanode
6 set the balancer bandwidth, then rebalance the data
./bin/hdfs dfsadmin -setBalancerBandwidth <bytes per second, e.g. several tens of millions>
./sbin/start-balancer.sh
7. verify: check the cluster status
./bin/hdfs dfsadmin -report
ansible all -m shell -a 'jps'
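A hedged check that the new datanode has joined (the three original nodes plus the new one):
./bin/hdfs dfsadmin -report | grep 'Live datanodes'   # expect: Live datanodes (4)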
Repairing a node:
Removing (decommissioning) a node
Preparation for the test: fetch a file of a few GB with lftp (lftp get <file>)
./bin/hadoop fs -put xx.G hdfs://hadoop-nn01:9000/xx
vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop-nn01:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-nn01:50090</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/usr/local/hadoop/etc/hadoop/exclude</value>
</property>
</configuration>
touch /usr/local/hadoop/etc/hadoop/exclude
newnode                       # the hostname of the node to remove goes into the exclude file
Node states during decommissioning:
Normal
Decommission in progress
Decommissioned
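A minimal sketch of triggering and watching the decommission, assuming the standard dfsadmin workflow (the name newnode comes from the list above):
echo newnode >> /usr/local/hadoop/etc/hadoop/exclude   # hostname of the node to retire
./bin/hdfs dfsadmin -refreshNodes                      # re-read dfs.hosts.exclude and start decommissioning
./bin/hdfs dfsadmin -report                            # status goes Normal -> Decommission in progress -> Decommissioned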
Note: the original data size and the disk space actually used can differ (replication stores extra copies).
YARN node management
Add a node:
./sbin/yarn-daemon.sh start nodemanager
Remove a node:
./sbin/yarn-daemon.sh stop nodemanager
List the nodes:
./bin/yarn node -list
##NFS Gateway
DRBD + heartbeat
network RAID1
only the changed blocks are synchronized
typical use: the back end of a web cluster
On the NameNode VM:
1. sync /etc/hosts
2. on the namenode & nfsgw, create a proxy user with the same uid, gid, and group name on both machines (see the sketch after step 5)
3. stop the whole cluster: ./sbin/stop-all.sh
4. edit the configuration file
vim /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop-nn01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop</value>
</property>
<property>
<name>hadoop.proxyuser.nfsuser.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.nfsuser.hosts</name>
<value>*</value>
</property>
</configuration>
5. start the cluster and verify:
./sbin/start-dfs.sh
./bin/hdfs dfsadmin -report
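A minimal sketch of step 2, assuming uid/gid 200 and the name nfsuser for both the user and the group (200 matches the chown 200.200 used below); run the same two commands on hadoop-nn01 and on nfsgw:
groupadd -g 200 nfsuser
useradd -u 200 -g 200 nfsuser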
On the nfsgw VM:
1. remove the old /usr/local/hadoop and copy a fresh tree over from the namenode
2. edit the configuration file:
vim /usr/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoop-nn01:50070</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-nn01:50090</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>nfs.exports.allowed.hosts</name>
<value>* rw</value>
</property>
<property>
<name>nfs.dump.dir</name>
<value>/var/nfstmp</value>
</property>
</configuration>
3. create the dump directory and grant nfsuser the permissions it needs:
mkdir /var/nfstmp
chown 200.200 /var/nfstmp
setfacl -m u:nfsuser:rwx /usr/local/hadoop/logs/
4. start the services (the order matters):
./sbin/hadoop-daemon.sh --script ./bin/hdfs start portmap    # run as root
su - nfsuser
./sbin/hadoop-daemon.sh --script ./bin/hdfs start nfs3       # run as nfsuser
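A hedged check that both daemons are running (class names as shown by jps):
jps                       # as root: Portmap
su - nfsuser -c 'jps'     # as nfsuser: Nfs3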
A new VM, used as the NFS client
1. install the NFS utilities and mount the gateway
yum -y install nfs-utils
mount -t nfs -o vers=3,proto=tcp,noatime,nolock,sync 192.168.1.15:/ /mnt
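A minimal sketch to confirm the mount works (paths follow the earlier examples; the NFS gateway supports reads and sequential writes):
df -h /mnt                # the HDFS namespace appears as an NFS mount
ls /mnt                   # the cluster's root directory (e.g. abc, output from the earlier steps)
cp /etc/hosts /mnt/abc/   # a simple write test through the gateway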