A Small Hands-On Project with Website Access Logs

Author: _Kantin | Published 2017-10-16 22:05 · 61 reads

    1. First download the well-known open-source UserAgentParser project from GitHub and package it with Maven.

     The command is: mvn clean package -DskipTests. Afterwards the jar appears under the project's target directory. Since we want the artifact installed into the local Maven repository, run mvn clean install -DskipTests instead.
    
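    Put together, the build-and-install sequence above is (run from the project root; a sketch, not project-specific):

    ```sh
    # build the jar without running tests; output lands in target/
    mvn clean package -DskipTests

    # install the artifact into the local Maven repository (~/.m2)
    # so other local projects can depend on it
    mvn clean install -DskipTests
    ```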

    2. The log output from installing into the local Maven repository looks like this:

    (screenshot: QQ截图20171016200304.png)

    3. The pom.xml configuration in the local project:

    (screenshot: QQ截图20171016211548.png)
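    The screenshot is not legible here. For reference, a locally installed artifact is consumed by declaring its coordinates in pom.xml; the coordinates below are illustrative — use whatever groupId/artifactId/version the UserAgentParser project's own pom declares:

    ```xml
    <dependency>
        <!-- hypothetical coordinates: check the parser project's pom.xml for the real ones -->
        <groupId>com.kumkee</groupId>
        <artifactId>UserAgentParser</artifactId>
        <version>0.0.1</version>
    </dependency>
    ```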

    4. Counting browser hits locally with a HashMap — the code is as follows:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.commons.lang3.StringUtils; // or org.apache.commons.lang, depending on the version used
    import org.junit.Test;
    // UserAgentParser / UserAgent come from the project installed in step 1;
    // adjust the import to that project's actual package.

    public class realWorks {
        UserAgentParser userAgentParser = new UserAgentParser();

        @Test
        public void testReadFile() throws Exception {
            // path to the log file on the C drive
            String path = "C:/Users/Administrator/Desktop/Document/data source/access.log.10";
            BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(path))));
            String line;
            Map<String, Integer> map = new HashMap<String, Integer>();
            // assignment-in-condition: each iteration tests the line just read,
            // avoiding the off-by-one of checking the previous line's value
            while ((line = reader.readLine()) != null) {
                if (StringUtils.isNotBlank(line)) {
                    int idx = getCharacterPosition(line, "\"", 5);
                    if (idx < 0) {
                        continue; // skip lines without a user-agent field
                    }
                    String source = line.substring(idx);
                    UserAgent agent = userAgentParser.parse(source);
                    String browser = agent.getBrowser();
                    if (map.get(browser) != null) {
                        map.put(browser, map.get(browser) + 1);
                    } else {
                        map.put(browser, 1);
                    }
                }
            }
            reader.close();
            for (Map.Entry<String, Integer> entry : map.entrySet()) {
                System.out.println(entry.getKey() + " " + entry.getValue());
            }
        }

        @Test
        public void testGetCharacterPosition() {
            String value = "60.247.54.4 - - [18/Sep/2013:07:16:09 +0000] \"GET /wp-content/uploads/2013/05/favicon.ico HTTP/1.1\" 200 1150 \"-\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
            int index = getCharacterPosition(value, "\"", 5);
            System.out.println(index);
        }

        // offset of the index-th occurrence of operator in value, or -1 if absent
        private int getCharacterPosition(String value, String operator, int index) {
            Matcher matcher = Pattern.compile(Pattern.quote(operator)).matcher(value);
            int mIdx = 0;
            while (matcher.find()) {
                if (++mIdx == index) {
                    return matcher.start();
                }
            }
            return -1;
        }
    }
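    The helper above just finds the offset of the n-th occurrence of a character, then takes the substring from there — in an Apache-style access log the 5th double quote opens the user-agent field. A minimal, dependency-free sketch of that extraction (the class name, method name, and shortened sample line are ours):

    ```java
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class QuoteExtractDemo {
        // Return the substring starting at the n-th occurrence of token, or null if absent.
        static String extractFromNth(String value, String token, int n) {
            Matcher m = Pattern.compile(Pattern.quote(token)).matcher(value);
            int count = 0;
            while (m.find()) {
                if (++count == n) {
                    return value.substring(m.start());
                }
            }
            return null; // fewer than n occurrences
        }

        public static void main(String[] args) {
            String line = "60.247.54.4 - - [18/Sep/2013:07:16:09 +0000] "
                    + "\"GET /favicon.ico HTTP/1.1\" 200 1150 \"-\" "
                    + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36\"";
            // quotes 1-2 wrap the request, 3-4 wrap the referrer,
            // so the 5th quote opens the user-agent field
            System.out.println(extractFromNth(line, "\"", 5));
        }
    }
    ```

    Returning -1/null for short lines (instead of letting Matcher.start() throw IllegalStateException) lets callers skip malformed records rather than crash on them.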
    

    5. Running the count at scale with MapReduce

    5.1 Add the Maven assembly plugin; afterwards, package with mvn assembly:assembly
      <build>
        <plugins>
          <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
              <archive>
                <manifest>
                  <mainClass></mainClass>
                </manifest>
              </archive>
              <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
              </descriptorRefs>
            </configuration>
          </plugin>
        </plugins>
      </build>
    
    5.2 First -put the log file onto HDFS, then run the following command:

    hadoop jar /var/tmp/hadoop_train-1.0-SNAPSHOT-jar-with-dependencies.jar com.lzk.hadoop.LogApp /access.log.10 /browser
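    Spelled out, step 5.2 is roughly the following sequence (HDFS paths and jar location as in the command above):

    ```sh
    # upload the raw log to the HDFS root
    hadoop fs -put access.log.10 /access.log.10

    # run the job; /browser must not exist yet, or FileOutputFormat refuses to start
    hadoop jar /var/tmp/hadoop_train-1.0-SNAPSHOT-jar-with-dependencies.jar \
        com.lzk.hadoop.LogApp /access.log.10 /browser

    # inspect the result
    hadoop fs -cat /browser/part-r-00000
    ```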

    5.3 The full code:
    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    // UserAgentParser / UserAgent come from the project installed in step 1;
    // adjust the import to that project's actual package.

    public class LogApp {
        public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private UserAgentParser userAgentParser;
            private final LongWritable one = new LongWritable(1);

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                userAgentParser = new UserAgentParser();
            }

            @Override
            protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
                int idx = getCharacterPosition(line, "\"", 5);
                if (idx < 0) {
                    return; // skip lines without a user-agent field
                }
                String source = line.substring(idx);
                UserAgent agent = userAgentParser.parse(source);
                context.write(new Text(agent.getBrowser()), one);
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                userAgentParser = null;
            }
        }

        public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

            Configuration configuration = new Configuration();

            Job job = Job.getInstance(configuration, "LogApp");
            job.setJarByClass(LogApp.class);
            FileInputFormat.setInputPaths(job, new Path(args[0]));

            // mapper settings
            job.setMapperClass(LogApp.MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);

            // reducer settings
            job.setReducerClass(LogApp.MyReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);

            // output path for the job (must not already exist on HDFS)
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }

        // offset of the index-th occurrence of operator in value, or -1 if absent
        private static int getCharacterPosition(String value, String operator, int index) {
            Matcher matcher = Pattern.compile(Pattern.quote(operator)).matcher(value);
            int mIdx = 0;
            while (matcher.find()) {
                if (++mIdx == index) {
                    return matcher.start();
                }
            }
            return -1;
        }
    }
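    At its core the job is word count over the browser field: the mapper emits one (browser, 1) pair per record, the shuffle groups the pairs by browser, and the reducer sums each group. That flow can be sketched in plain Java without Hadoop (the class name and sample browser strings are ours):

    ```java
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BrowserCountSim {
        // What MyMapper + MyReducer compute, minus Hadoop: one (browser, 1)
        // pair per record, then a per-key sum — i.e. the shuffle + reduce phase.
        static Map<String, Long> countBrowsers(List<String> browsers) {
            Map<String, Long> counts = new HashMap<>();
            for (String b : browsers) {
                counts.merge(b, 1L, Long::sum); // insert 1, or add 1 to the existing count
            }
            return counts;
        }

        public static void main(String[] args) {
            Map<String, Long> counts =
                countBrowsers(List.of("Chrome", "Firefox", "Chrome", "Unknown", "Chrome"));
            // tab-separated key/count pairs, like the job's part-r-00000 output
            counts.forEach((k, v) -> System.out.println(k + "\t" + v));
        }
    }
    ```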
    

    6. Log output from the local run and from HDFS

    (screenshots: QQ截图20171016221033.png, QQ截图20171016221039.png)

    Permalink: https://www.haomeiwen.com/subject/etiluxtx.html