HDFS

Author: lufaqiang | Published: 2017-08-29 15:35

This post covers three assignments:
1. A summary of common Hadoop shell commands, giving the name, usage, meaning, and an example for each.

Common commands
ls: list the contents of a directory

hadoop fs -ls /user

mkdir: create a directory
hadoop fs -mkdir /user/hdfs/lufaqiang

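copyFromLocal/put: copy files or directories from the local file system to HDFS

cat: write the files at the given paths to stdout

tail: write the last kilobyte of the file to stdout

copyToLocal/get: copy files from HDFS to the local file system
cp: copy files from a source path to a destination path

du
Usage: hadoop fs -du URI [URI …]
Displays the sizes of all files in a directory, or the size of the file when only a single file is given.

dus: display a summary of file sizes (deprecated in Hadoop 2.x; use hadoop fs -du -s instead)

stat: display status information for a path
Usage: hadoop fs -stat URI [URI …]
Returns the status information for the given path.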

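test
Usage: hadoop fs -test -[ezd] URI
Options
-e: check whether the file exists. Returns 0 if it does.
-z: check whether the file is zero bytes long. Returns 0 if it is.
-d: check whether the path is a directory. Returns 0 if it is.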
touchz
Usage: hadoop fs -touchz URI [URI …]
Creates an empty, zero-byte file.

hadoop fs -cat /user/hdfs/*/xingzuo
![image.png](https://img.haomeiwen.com/i6891055/91798b5e23431973.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

The wildcard * stands for every user's directory, so this prints everyone's zodiac sign (xingzuo).
Sorted:

hadoop fs -cat /user/hdfs/*/xingzuo | sort

Sort, then count identical lines:

hadoop fs -cat /user/hdfs/*/xingzuo | sort | uniq -c

Prefix each line with its count, then sort by count in descending order (-n makes sort compare the counts numerically):

hadoop fs -cat /user/hdfs/*/xingzuo | sort | uniq -c | sort -rn

Print the first column of every user's gender file (xingbie):

hadoop fs -cat /user/hdfs/*/xingbie | awk '{print $1}'

Compute the proportion of male students (nan):

hadoop fs -cat /user/hdfs/*/xingbie | awk '{if($1=="nan") ++sum_nan; ++sum} END {print "nan: " sum_nan/sum}'

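uniq
The uniq command reports or omits repeated lines in a file. It is usually combined with sort, because it only detects duplicates that are adjacent.
Options
-c or --count: prefix each line with the number of times it occurs;
-d or --repeated: print only the lines that are repeated;
-f<fields> or --skip-fields=<fields>: skip the given number of fields when comparing;
-s<chars> or --skip-chars=<chars>: skip the given number of characters when comparing;
-u or --unique: print only the lines that occur exactly once;
-w<chars> or --check-chars=<chars>: compare no more than the given number of characters.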
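To make the options above concrete, here is a minimal sketch that runs uniq on local sample data. The sign names and the helper function names are made up for illustration; they are not the contents of the xingzuo files on the cluster.

```shell
# Hypothetical sample data: one zodiac sign per line, in no particular order.
sample() {
  printf '%s\n' leo aries leo virgo aries leo
}

# uniq only collapses adjacent duplicates, so the input is sorted first.
# awk then turns uniq's "count word" output into "word=count".
count_signs() {
  sample | sort | uniq -c | awk '{print $2 "=" $1}'
}

# -d keeps only the values that occur more than once.
repeated_signs() { sample | sort | uniq -d; }

# -u keeps only the values that occur exactly once.
single_signs() { sample | sort | uniq -u; }

count_signs
repeated_signs
single_signs
```

Dropping the sort step would leave the three leo lines non-adjacent, and uniq would then report them as separate entries.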
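tr
The tr command replaces, squeezes, and deletes characters read from standard input. It maps one set of characters onto another and is often used to write compact one-liners.
Options
-c or --complement: use the complement of the first character set, that is, operate on every character not in it;
-d or --delete: delete every character that belongs to the first character set;
-s or --squeeze-repeats: replace each run of a repeated character with a single occurrence;
-t or --truncate-set1: first truncate the first character set to the length of the second.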
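The following sketch shows each of these tr options on short, made-up input strings. The function names are hypothetical helpers; the last one, which splits text into one word per line, is the usual first step of a word count.

```shell
# Map one character set onto another of equal length: uppercase to lowercase.
to_lower() { tr 'A-Z' 'a-z'; }

# -d deletes every character in the set (here: digits).
strip_digits() { tr -d '0-9'; }

# -s squeezes each run of the listed character into a single one.
squeeze_spaces() { tr -s ' '; }

# -c complements the set, so every non-letter is replaced by a newline;
# -s then squeezes runs of newlines, leaving one word per line.
split_words() { tr -cs 'A-Za-z' '\n'; }

printf 'How MANY   loved\n' | to_lower | squeeze_spaces
printf 'a1b22c\n' | strip_digits
printf 'old, and grey\n' | split_words
```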
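sort
The sort command is very useful on Linux: it sorts the lines of its input and writes the result to standard output. It can read from named files or from stdin.
Options
-b: ignore leading blanks on each line;
-c: check whether the file is already sorted;
-d: when sorting, consider only letters, digits, and blanks, ignoring all other characters;
-f: treat lowercase letters as uppercase when sorting;
-i: ignore characters outside the ASCII range 040 to 176 (octal) when sorting;
-m: merge several files that are already sorted;
-M: sort by month, using the first three letters as the month abbreviation;
-n: sort by numeric value;
-o<output file>: write the sorted result to the given file;
-r: sort in reverse order;
-t<separator>: set the field separator used for sorting;
+<start field> -<end field>: sort on the given fields, from the start field up to but not including the end field (an obsolete form; current versions use -k instead).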
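As a rough illustration of -t, -k, -n, and -r working together, the sketch below sorts a few word=count records. The records reuse the sample values from the assignment text and the function names are made up for this example.

```shell
# Hypothetical "word=count" records in arbitrary order.
records() {
  printf '%s\n' 'or=14' 'how=20' 'end=10' 'and=15' 'about=13'
}

# -t sets the field separator, -k2,2 selects the second field as the key,
# -n compares it numerically and -r reverses the order (largest first).
by_count_desc() { records | sort -t '=' -k2,2 -n -r; }

# With no options, whole lines are compared as text.
by_word() { records | sort; }

# Without -n, numbers are compared as strings.
numbers_as_text() { printf '%s\n' 10 9 100 | sort; }
numbers_as_numbers() { printf '%s\n' 10 9 100 | sort -n; }

by_count_desc
by_word
numbers_as_text
numbers_as_numbers
```

The last two functions show why -n matters when sorting counts: as text, 9 sorts after 100.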
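head
The head command prints the first part of a file.
-q: never print the file name headers
-v: always print the file name headers
-c<bytes>: print the given number of bytes from the start
-n<lines>: print the given number of lines from the start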
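A short sketch of the two most commonly used head options, on made-up input. The function names are hypothetical.

```shell
# Hypothetical five-line input.
lines() { printf '%s\n' one two three four five; }

# -n keeps the first N lines.
first_two() { lines | head -n 2; }

# -c keeps the first N bytes.
first_bytes() { printf 'abcdefgh\n' | head -c 3; }

first_two
first_bytes
```

Placed at the end of a sorted pipeline, head -n 5 is what turns a full frequency table into a top-5 list.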
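2. HDFS API and programming
a) Remotely read the file /user/tanqi/when_you_old.txt (an English poem) from HDFS on the machine bigdata@47.94.18.202
b) Find the 5 words that occur most often in the file (case-insensitive)
c) Write those words and their counts to /user/{yourname}/top.txt in HDFS on bigdata@47.94.18.202
The final data looks something like:
how=20
and=15
or=14
about=13
end=10
3. HDFS shell exercise
a) Implement the statistics from assignment 2 with the HDFS shell
b) Save the final result in /user/{yourname}/top_shell.txt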
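One possible way to approach assignment 3 is sketched below; it is an illustration under stated assumptions, not the author's solution. The function top_words is a hypothetical helper that reads text on stdin, and it assumes that words consist only of ASCII letters, so a word such as "don't" would be split in two. The sample text is made up so that the pipeline can be tried locally without a cluster.

```shell
# top_words N: read text on stdin, print the N most frequent words as word=count.
top_words() {
  tr 'A-Z' 'a-z' |          # lowercase everything: counting is case-insensitive
    tr -cs 'a-z' '\n' |     # one word per line; any non-letter is a separator
    grep -v '^$' |          # drop the empty line a leading separator produces
    sort | uniq -c |        # count identical words
    sort -k1,1nr -k2,2 |    # highest count first, ties in alphabetical order
    head -n "$1" |
    awk '{print $2 "=" $1}'
}

# Local stand-in for the poem.
printf 'How many loved, and how many more\nAnd HOW true\n' | top_words 2

# On the cluster the same function could be fed from and written back to HDFS,
# since "hadoop fs -put -" reads its input from stdin:
#   hadoop fs -cat /user/tanqi/when_you_old.txt | top_words 5 |
#     hadoop fs -put - /user/{yourname}/top_shell.txt
```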

For the remote read, the following can serve as a reference:

import java.net.URISyntaxException;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

public class HdfsOperation {
    public static void main(String[] args) throws IOException, URISyntaxException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://nns");
        conf.set("dfs.nameservices", "nns");
        conf.set("dfs.ha.namenodes.nns", "bigdata001,bigdata002");
        conf.set("dfs.namenode.rpc-address.nns.bigdata001", "47.94.18.202:9000");
        conf.set("dfs.namenode.rpc-address.nns.bigdata002", "47.94.3.55:9000");
        conf.set("dfs.client.failover.proxy.provider.nns", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        FileSystem fs = FileSystem.get(conf);
        FileStatus[] list = fs.listStatus(new Path("/user/tanqi/"));
        for (FileStatus file : list) {
            System.out.println(file.getPath().getName());
        }
        fs.close();
    }
}

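Assignment analysis:
1. Read the HDFS file
2. Count the words

Convert to lowercase
Split into words with the split function
Count using the word as the key
Sort to get the top N
3. Write to an HDFS file
Configuration conf = new Configuration()
The configuration used to connect to HDFS
FileSystem fs = FileSystem.get(URI.create(file), conf)
Sorting data held in a HashMap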

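Connecting to Hadoop remotely

Preparation
1. On Windows 7, pick a directory and unpack hadoop-2.6.5 into it,
for example: E:\Hadoopsrc\hadoop-2.6.5
2. Add the following environment variables on Windows 7:
HADOOP_HOME=E:\Hadoopsrc\hadoop-2.6.5
HADOOP_BIN_PATH=%HADOOP_HOME%\bin
HADOOP_PREFIX=E:\Hadoopsrc\hadoop-2.6.5
Also append ;%HADOOP_HOME%\bin to the end of the PATH variable.
Remote debugging with Eclipse
1. hadoop-eclipse-plugin is a Hadoop plugin for Eclipse that lets you browse HDFS directories and file contents directly in the IDE.
Copy the downloaded hadoop-eclipse-plugin-2.6.0.jar into the eclipse/plugins directory and restart Eclipse; nothing else is needed.
2. From hadoop-common-project\hadoop-common\src\main\winutils in the Hadoop 2.6.0 source tree,
copy winutils.exe into the %HADOOP_HOME%\bin directory and
copy hadoop.dll into the %windir%\system32 directory, i.e. C:\Windows\System32.
3. Configure the hadoop-eclipse-plugin plugin
Start Eclipse and open Window -> Show View -> Other.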

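Under Window -> Preferences -> Hadoop Map/Reduce, set the Hadoop root directory on Windows 7.

Then, in the Map/Reduce Locations panel, click the elephant icon and add a Location.

Check the port numbers: 8020 is the port for remote connections.
Also check the ports that were set when Hadoop was configured, and the address used for remote calls.

This dialog is important:
Location name: just a label; any name will do.
Map/Reduce(V2) Master Host: the IP address of the Hadoop master in the virtual machine. The port below it corresponds to the port given by the dfs.datanode.ipc.address property in hdfs-site.xml.
DFS Master Port: corresponds to the port given by fs.defaultFS in core-site.xml.
User name: must match the user that runs Hadoop in the virtual machine. I installed and run Hadoop 2.6.5 as the user hadoop, so I enter hadoop here; if you installed it as root, use root instead.
Once these parameters are set, click Finish and Eclipse knows how to connect to Hadoop. If all goes well, the HDFS directories and files appear in the Project Explorer panel.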

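Try right-clicking a file and choosing delete. The first attempt usually fails with a long message that boils down to insufficient permissions, because the user logged in to Windows 7 is not the user that runs Hadoop in the virtual machine. There are several ways around this. For example, you could create an administrator account named hadoop on Windows 7, log in as that user, and then develop in Eclipse, but that is tedious. The simplest fix is to add the following to hdfs-site.xml: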

<property>
     <name>dfs.permissions</name>
     <value>false</value>
  </property>

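Then, in the virtual machine, run hadoop dfsadmin -safemode leave
To be safe, also run hadoop fs -chmod 777 /
In short, this switches Hadoop's permission checks off completely (they are not needed while learning; do not do this in production). Finally, restart Hadoop and repeat the delete operation in Eclipse; it should now succeed.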
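4. Create a sample project that reads a file from HDFS
Create a new project and choose Map/Reduce Project.

Once the project is created, the JAR files it needs are imported automatically.

Code for the assignment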

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOperation {
    // Read the given HDFS file and return its `top` most frequent words, one "word=count" per line
    public static String ReadStatHDFS(String file, Integer top) throws IOException {
        // Maps each word to the number of times it occurs
        HashMap<String, Integer> hasWord = new HashMap<String, Integer>();
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(file), conf);
        FSDataInputStream hdfsInStream = fs.open(new Path(file));
        BufferedReader br = new BufferedReader(new InputStreamReader(hdfsInStream));
        try {
            String line = br.readLine();
            while (line != null) {
                // Lowercase first so that counting is case-insensitive, then split on commas and spaces
                String[] arrLine = line.toLowerCase().trim().split(",| ");
                for (int i = 0; i < arrLine.length; i++) {
                    String word = arrLine[i].trim();
                    if (word.equals("")) {
                        continue;
                    }
                    if (!hasWord.containsKey(word)) { // first occurrence of this word
                        hasWord.put(word, 1);
                    } else { // seen before: add 1 to its count
                        Integer nCounts = hasWord.get(word);
                        hasWord.put(word, nCounts + 1);
                    }
                }
                line = br.readLine();
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            br.close();
            fs.close();
        }
        // Sort the entries by count, largest first
        List<Map.Entry<String, Integer>> mapList = new ArrayList<Map.Entry<String, Integer>>(hasWord.entrySet());
        Collections.sort(mapList, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        // Keep the first `top` entries; Map.Entry.toString() yields "word=count"
        String top_line = "";
        for (int i = 0; i < Math.min(mapList.size(), top); i++) {
            top_line = top_line + mapList.get(i).toString() + "\n";
        }

        return top_line;
    }

    // Create a new file at the given location and write the text to it
    public static void WriteToHDFS(String file, String words) throws IOException, URISyntaxException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(file), conf);
        Path path = new Path(file);
        FSDataOutputStream out = fs.create(path); // create the file

        // Write the text as UTF-8 bytes
        out.write(words.getBytes("UTF-8"));
        out.close();
    }

    public static void main(String[] args) throws IOException, URISyntaxException {
        // Read the file and compute the top 5 words by number of occurrences
        String fileRead = "hdfs://192.168.119.132:9000/user/lufaqiang/bbb.txt";
        String statLine = ReadStatHDFS(fileRead, 5);

        System.out.println(statLine);

        String fileWrite = "hdfs://192.168.119.132:9000/user/lufaqiang/ccc.txt";
        WriteToHDFS(fileWrite, statLine);
    }
}
