Running the Python Version of Wordc with Hadoop Streaming

Author: 苟雨 | Published: 2017-07-14 08:56

    Writing the map function
    wordcount_mapper.py

    #!/usr/bin/env python   
    
    # ---------------------------------------------------------------
    # This mapper reads lines of text from stdin and emits one
    # tab-separated <word, 1> record per word.
    # ---------------------------------------------------------------
    
    import sys            
    
    for line in sys.stdin:  
        line = line.strip()  
        keys = line.split() 
        for key in keys:    
            value = 1        
            print('{0}\t{1}'.format(key, value))  # {0} and {1} are filled with key and value
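
To see what the mapper emits without starting Hadoop, the same split-and-format step can be wrapped in a function and called on a sample line. This is only a sketch: `map_line` and the sample sentence are hypothetical names and data introduced for illustration, not part of the original script.

```python
def map_line(line):
    """Return the tab-separated <word, 1> records the mapper would print for one line."""
    # Same logic as the mapper loop: strip the line, split on whitespace,
    # and pair every token with the count 1.
    return ['{0}\t{1}'.format(word, 1) for word in line.strip().split()]


# Hypothetical sample input, not taken from the article.
sample = "Hello hadoop, hello streaming"
records = map_line(sample)
for record in records:
    print(record)
```

Because the mapper only splits on whitespace, `Hello` and `hello` are counted as different words, and `hadoop,` keeps its trailing comma. Lowercasing or stripping punctuation would have to be added explicitly if that matters for your data.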
                           
    
    

    The reduce function
    wordcount_reducer.py

    #!/usr/bin/env python
    
    # ---------------------------------------------------------------
    # This reducer reads the mapper's <word, 1> records from stdin,
    # sorted by key, and emits one <word, total-count> record per word.
    # ---------------------------------------------------------------
    import sys
    
    last_key      = None              
    running_total = 0
    
    # -----------------------------------
    # Loop over the input and accumulate a count per key
    # -----------------------------------
    for input_line in sys.stdin:
        input_line = input_line.strip()
        this_key, value = input_line.split("\t", 1) 
        value = int(value)           
     
        if last_key == this_key:     
            running_total += value   # add value to running total
    
        else:
            if last_key:          
                print( "{0}\t{1}".format(last_key, running_total) )
                                   
            running_total = value    #reset values
            last_key = this_key
    
    # Emit the final key; testing for None avoids a NameError when the input is empty
    if last_key is not None:
        print( "{0}\t{1}".format(last_key, running_total)) 
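
The reducer relies on identical keys arriving next to each other, which Hadoop guarantees by sorting the mapper output by key before the reduce phase. The sketch below restates the running-total logic as a generator and feeds it the same records unsorted and sorted to show the difference. `reduce_sorted` and the sample records are hypothetical, introduced only for this illustration.

```python
def reduce_sorted(records):
    """Yield (word, total) pairs from 'word<TAB>count' records.

    Assumes records with the same key are adjacent, as they are after a sort.
    """
    last_key = None
    running_total = 0
    for record in records:
        this_key, value = record.strip().split("\t", 1)
        if this_key == last_key:
            running_total += int(value)
        else:
            # Key changed: flush the previous key before starting a new total.
            if last_key is not None:
                yield last_key, running_total
            last_key = this_key
            running_total = int(value)
    if last_key is not None:
        yield last_key, running_total


# Made-up mapper output in arrival order (not sorted).
mapper_output = ["b\t1", "a\t1", "b\t1", "a\t1", "c\t1"]

unsorted_result = list(reduce_sorted(mapper_output))
sorted_result = list(reduce_sorted(sorted(mapper_output)))

print(unsorted_result)  # every record is flushed separately
print(sorted_result)    # counts are merged per word
```

On the command line, the same check is commonly done by piping a text file through the mapper, `sort`, and the reducer before submitting the job to the cluster.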
    
    
    
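    If you are running on YARN, you need to download the streaming jar separately ([reference](http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-streaming/2.7.3)). The input path should already contain a few files prepared in advance.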
    

    Use an absolute path for the streaming jar and for the mapper and reducer scripts. The output directory must not already exist; the job fails if it does.

    hadoop jar /Download/hadoop-streaming-2.7.3.jar \
    -input /hello \
    -output /output \
    -mapper /usr/local/yarn/hadoop-2.7.3/wordcount/wordcount_mapper.py \
    -reducer /usr/local/yarn/hadoop-2.7.3/wordcount/wordcount_reducer.py
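
If you prefer to launch the job from a script, the command above can be assembled as an argument list, which avoids line-continuation mistakes in the shell. This is a minimal sketch: `build_streaming_command` is a hypothetical helper, and it only uses the paths and the `-input`, `-output`, `-mapper` and `-reducer` options that appear in the command above.

```python
import shlex


def build_streaming_command(jar, input_path, output_path, mapper, reducer):
    """Assemble the argument list for a `hadoop jar` streaming job."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


command = build_streaming_command(
    jar="/Download/hadoop-streaming-2.7.3.jar",
    input_path="/hello",
    output_path="/output",  # must not exist yet
    mapper="/usr/local/yarn/hadoop-2.7.3/wordcount/wordcount_mapper.py",
    reducer="/usr/local/yarn/hadoop-2.7.3/wordcount/wordcount_reducer.py",
)

# Print the command for inspection instead of running it.
print(" ".join(shlex.quote(arg) for arg in command))

# To actually submit the job (assumes `hadoop` is on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```

Since the scripts are invoked directly through their shebang line, they presumably need execute permission (for example via `chmod +x`) on the machine where the tasks run.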

    Then look in /output to see the result.
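
The result consists of tab-separated `word<TAB>count` lines, the same format the reducer prints. Assuming you have fetched those lines as text (for example with `hadoop fs -cat` on the part files under /output, which is an assumption about your setup), a small sketch like the following turns them into a dictionary and ranks the words. `parse_counts` and the sample lines are hypothetical.

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict mapping word to integer count."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t", 1)
        counts[word] = int(count)
    return counts


# Made-up output lines, only to illustrate the format.
sample_output = ["hadoop\t2\n", "hello\t3\n", "streaming\t1\n"]

counts = parse_counts(sample_output)
# Rank words from most to least frequent.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```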
