Data Algorithms: Hadoop/Spark Big Data Processing, Chapter 15

Author: _Kantin | Published 2018-07-08 11:22

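This chapter covers sentiment analysis.

The idea behind the sentiment analysis algorithm

Whether a sentence is positive or negative is decided by the words it contains. First build a set of positive words and a set of negative words, then work out what percentage of the words in the sentence are positive and what percentage are negative. The sentiment of the sentence is determined from these two percentages.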
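The idea above can be sketched in plain Java for a single sentence. This is a minimal illustration, not the book's code: the word lists, the class name `SentimentScorer`, the tokenization rule, and the decision rule in `label` (positive above zero, negative below zero) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SentimentScorer {
    // Hypothetical word lists, for illustration only.
    static final Set<String> POSITIVE = new HashSet<>(Arrays.asList("good", "great", "love"));
    static final Set<String> NEGATIVE = new HashSet<>(Arrays.asList("bad", "awful", "hate"));

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Share of positive words minus share of negative words; 0.0 for empty text.
    static double score(String text) {
        List<String> words = tokenize(text);
        if (words.isEmpty()) {
            return 0.0;
        }
        int positive = 0;
        int negative = 0;
        for (String w : words) {
            if (POSITIVE.contains(w)) positive++;
            if (NEGATIVE.contains(w)) negative++;
        }
        return (double) (positive - negative) / words.size();
    }

    // Assumed decision rule: the sign of the score picks the label.
    static String label(double score) {
        if (score > 0) return "positive";
        if (score < 0) return "negative";
        return "neutral";
    }

    public static void main(String[] args) {
        String sentence = "I love this great phone";
        double s = score(sentence);
        System.out.println(sentence + " -> " + s + " (" + label(s) + ")");
    }
}
```

In this sentence two of the five words are in the positive list and none are in the negative list, so the score is 2/5 and the label is "positive".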

How this chapter implements it

1. As MapReduce pseudocode, using the US presidential election as the example.

++Implementation as MapReduce pseudocode++

1. Map side

setup(){
    positiveWords = <load positive words from distributed cache>
    negativeWords = <load negative words from distributed cache>
    allCandidates = <load all candidates from distributed cache>
}
map(key,value){
    date = key  // date of the tweet
    List<String> tweetWords = normalizeAndTokenize(value)
    // check each candidate
    for(String candidate : allCandidates){
        if(candidate is in the tweetWords){
            // count the positive and negative words in the tweet
            int positiveCount = <count of positive words in tweetWords>
            int negativeCount = <count of negative words in tweetWords>
            // divide by the total number of words (cast to double to avoid integer division)
            double positiveRatio = (double) positiveCount / tweetWords.size();
            double negativeRatio = (double) negativeCount / tweetWords.size();
            outputKey = Pair(date,candidate)
            outputValue = positiveRatio - negativeRatio;
            emit(outputKey,outputValue)
        }
    }
}
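For readers who want to run the map step without a cluster, here is a hedged plain-Java sketch of the same logic. It does not use the Hadoop API: the class `TweetMapper`, its constructor (standing in for `setup()`), the tokenization rule, and the `"date,candidate"` string key (standing in for `Pair(date,candidate)`) are all assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TweetMapper {
    private final Set<String> positiveWords;
    private final Set<String> negativeWords;
    private final Set<String> candidates;

    // Stands in for setup(): in Hadoop these sets would come from the distributed cache.
    TweetMapper(Set<String> positiveWords, Set<String> negativeWords, Set<String> candidates) {
        this.positiveWords = positiveWords;
        this.negativeWords = negativeWords;
        this.candidates = candidates;
    }

    // Returns the (key, value) pairs that the pseudocode would emit for one tweet.
    List<Map.Entry<String, Double>> map(String date, String tweet) {
        // Assumed normalization: lowercase, split on non-alphanumeric characters.
        List<String> words = new ArrayList<>();
        for (String w : tweet.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<Map.Entry<String, Double>> output = new ArrayList<>();
        if (words.isEmpty()) {
            return output;
        }
        // The counts do not depend on the candidate, so compute them once per tweet.
        int positiveCount = 0;
        int negativeCount = 0;
        for (String w : words) {
            if (positiveWords.contains(w)) positiveCount++;
            if (negativeWords.contains(w)) negativeCount++;
        }
        double score = (double) (positiveCount - negativeCount) / words.size();
        // Emit one pair for every candidate mentioned in the tweet.
        for (String candidate : candidates) {
            if (words.contains(candidate)) {
                output.add(new AbstractMap.SimpleEntry<>(date + "," + candidate, score));
            }
        }
        return output;
    }
}
```

Because the positive and negative counts are the same for every candidate in a tweet, this sketch computes them once before the candidate loop; the emitted values match what the pseudocode produces.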

2. Reduce side

// key: a (date, candidate) pair; values: the list of sentiment ratios emitted for that key
reduce(key,values){
    double sumOfRatio = 0.0
    int n = 0;
    for(Double value : values){
        n++;
        sumOfRatio += value;
    }
    emit(key,sumOfRatio/n)
}
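The reduce step can likewise be tried out in plain Java. The sketch below is an assumption-laden stand-in rather than Hadoop code: `RatioReducer` and `shuffleAndReduce` are hypothetical names, keys are plain strings, and the grouping that the framework's shuffle phase would perform is simulated with a `TreeMap`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RatioReducer {
    // Average of all ratios emitted for one key; 0.0 if there are none.
    static double reduce(Iterable<Double> values) {
        double sumOfRatio = 0.0;
        int n = 0;
        for (double value : values) {
            n++;
            sumOfRatio += value;
        }
        return n == 0 ? 0.0 : sumOfRatio / n;
    }

    // Simulates the shuffle: group mapper output by key, then reduce each group.
    static Map<String, Double> shuffleAndReduce(List<Map.Entry<String, Double>> pairs) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }
}
```

The result holds one average sentiment score per key, which in the pseudocode corresponds to one score per candidate per day.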