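I. Data source:
We analyze the Tomcat access log of a technical forum website and compute some of the forum's key metrics, which the site operators can use as a reference when making decisions.
The purpose of building this system is to obtain business-related metrics that third-party tools do not provide.
Each line of the log has 5 parts: visitor IP, access time, requested resource, access status (HTTP status code), and the traffic of this request.
Here is part of the data: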
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wx_jqr.gif HTTP/1.1" 200 1770
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/recommend_1.gif HTTP/1.1" 200 1030
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/logo.png HTTP/1.1" 200 4542
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /data/attachment/common/c8/common_2_verify_icon.png HTTP/1.1" 200 582
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /static/js/logging.js?y7a HTTP/1.1" 200 603
8.35.201.144 - - [30/May/2013:17:38:20 +0800] "GET /uc_server/avatar.php?uid=29331&size=middle HTTP/1.1" 301 -
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/common_smilies_var.js?y7a HTTP/1.1" 200 3184
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/pn.png HTTP/1.1" 200 592
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/swfupload.swf?preventswfcaching=1369906718144 HTTP/1.1" 200 13333
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/editor/editor.gif HTTP/1.1" 200 13648
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/94/42_avatar_middle.jpg HTTP/1.1" 200 6153
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/03/13/51_avatar_middle.jpg HTTP/1.1" 200 5087
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/04/87/94_avatar_middle.jpg HTTP/1.1" 200 5117
8.35.201.165 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/01/01/03_avatar_middle.jpg HTTP/1.1" 200 5844
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/04/12/85_avatar_middle.jpg HTTP/1.1" 200 3174
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=53635&size=middle HTTP/1.1" 301 -
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/arw_r.gif HTTP/1.1" 200 65
8.35.201.166 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/px.png HTTP/1.1" 200 210
8.35.201.144 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/pmto.gif HTTP/1.1" 200 152
8.35.201.161 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/search.png HTTP/1.1" 200 3047
8.35.201.163 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/avatar.php?uid=57232&size=middle HTTP/1.1" 301 -
8.35.201.164 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/05/83/35_avatar_middle.jpg HTTP/1.1" 200 7171
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /uc_server/data/avatar/000/01/54/22_avatar_middle.jpg HTTP/1.1" 200 5396
8.35.201.166 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/arrow_top.gif HTTP/1.1" 200 51
8.35.201.160 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/arw_l.gif HTTP/1.1" 200 844
8.35.201.144 - - [30/May/2013:17:38:21 +0800] "GET /static/image/common/qmenu.png HTTP/1.1" 200 1744
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/smile.gif HTTP/1.1" 200 1662
27.19.74.143 - - [30/May/2013:17:38:21 +0800] "GET /static/image/smiley/default/sad.gif HTTP/1.1" 200 1237
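The five-part layout described above can also be checked with a single regular expression. The following is only a minimal sketch under the assumption that every line looks like the samples shown; the class name `AccessLogLineDemo` and the method `split` are hypothetical and are not part of the original project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AccessLogLineDemo {
    // Groups: ip, [timestamp], "request line", 3-digit status, traffic ("-" when no body was sent).
    // The two "\\S+" fields after the ip are the "- -" placeholders seen in the sample data.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)$");

    /** Returns {ip, time, request, status, traffic}, or null if the line does not match. */
    static String[] split(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4), m.group(5) };
    }

    public static void main(String[] args) {
        String line = "8.35.201.144 - - [30/May/2013:17:38:20 +0800] "
                + "\"GET /uc_server/avatar.php?uid=29331&size=middle HTTP/1.1\" 301 -";
        String[] fields = split(line);
        System.out.println(fields == null ? "no match" : String.join(" | ", fields));
    }
}
```

Returning null for a non-matching line lets the caller skip malformed records instead of failing on them.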
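II. Data cleaning
1. Clean the data into the following format (IP, access time as yyyyMMddHHmmss, requested resource; tab-separated in the job output). The lines here only illustrate the column layout: the Mapper in 2.2 also drops .css, .js and image requests.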
110.52.250.126 20130530173820 data/cache/style_1_widthauto.css?y7a
110.52.250.126 20130530173820 source/plugin/wsh_wx/img/wsh_zk.css
110.52.250.126 20130530173820 data/cache/style_1_forum_index.css?y7a
110.52.250.126 20130530173820 source/plugin/wsh_wx/img/wx_jqr.gif
27.19.74.143 20130530173820 data/attachment/common/c8/common_2_verify_icon.png
27.19.74.143 20130530173820 data/cache/common_smilies_var.js?y7a
2. Write the MapReduce cleaning program
2.1 Utility class
package com.neusoft;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/**
 * Created by Administrator on 2019/1/8.
 */
public class LogParser {
    // SimpleDateFormat is not thread-safe; do not share these instances across threads.
    public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
            "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
            "yyyyMMddHHmmss");

    public static void main(String[] args) {
        final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        LogParser parser = new LogParser();
        final String[] array = parser.parse(S1);
        System.out.println("Sample record: " + S1);
        System.out.format(
                "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                array[0], array[1], array[2], array[3], array[4]);
    }

    /**
     * Parses the English-locale time string of a log line.
     *
     * @param string time text such as 30/May/2013:17:38:20
     * @return the parsed date
     * @throws IllegalArgumentException if the text is not a valid time
     */
    private Date parseDateFormat(String string) {
        try {
            return FORMAT.parse(string);
        } catch (ParseException e) {
            // Fail with a clear message instead of returning null and
            // hitting a NullPointerException later in parseTime.
            throw new IllegalArgumentException("Unparseable log time: " + string, e);
        }
    }

    /**
     * Parses one line of the log.
     *
     * @param line a raw access log line
     * @return an array of 5 elements: ip, time, url, status, traffic
     */
    public String[] parse(String line) {
        String ip = parseIP(line);
        String time = parseTime(line);
        String url = parseURL(line);
        String status = parseStatus(line);
        String traffic = parseTraffic(line);
        return new String[] { ip, time, url, status, traffic };
    }

    // Traffic is the second token after the closing quote of the request.
    private String parseTraffic(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1)
                .trim();
        return trim.split(" ")[1];
    }

    // Status is the first token after the closing quote of the request.
    private String parseStatus(String line) {
        final String trim = line.substring(line.lastIndexOf("\"") + 1)
                .trim();
        return trim.split(" ")[0];
    }

    // The "url" is the whole quoted request, e.g. GET /path HTTP/1.1.
    private String parseURL(String line) {
        final int first = line.indexOf("\"");
        final int last = line.lastIndexOf("\"");
        return line.substring(first + 1, last);
    }

    // Assumes the +0800 offset that appears in the sample data.
    private String parseTime(String line) {
        final int first = line.indexOf("[");
        final int last = line.indexOf("+0800]");
        String time = line.substring(first + 1, last).trim();
        Date date = parseDateFormat(time);
        return dateformat1.format(date);
    }

    private String parseIP(String line) {
        return line.split("- -")[0].trim();
    }
}
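As an aside, the time conversion done by the utility class above (from 30/May/2013:17:38:20 +0800 to 20130530173820) can also be written with the thread-safe java.time API. This is only a sketch, not the original author's code; the class name `LogTimeDemo` and the method `toCompact` are made up for illustration.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class LogTimeDemo {
    // Input pattern matches the text between the square brackets of a log line.
    private static final DateTimeFormatter IN =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);
    // Output pattern matches the cleaned format used in this article.
    private static final DateTimeFormatter OUT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Converts the bracketed log time to yyyyMMddHHmmss, keeping the wall-clock time of the log. */
    static String toCompact(String bracketContent) {
        return OffsetDateTime.parse(bracketContent, IN).format(OUT);
    }

    public static void main(String[] args) {
        System.out.println(toCompact("30/May/2013:17:38:20 +0800"));
    }
}
```

Unlike SimpleDateFormat, DateTimeFormatter instances are immutable, so they can safely be kept in static fields, and a malformed time raises a DateTimeParseException rather than yielding null.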
2.2 Mapper program
package com.neusoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class CleanMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    private final LogParser logParser = new LogParser();
    private final Text outputValue = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        final String[] parsed = logParser.parse(value.toString());
        // step 1: drop requests for static resources.
        // parsed[2] still carries the " HTTP/1.1" suffix here, so the two
        // endsWith checks rarely match; step 4 catches those requests.
        if (parsed[2].startsWith("GET /static/")
                || parsed[2].startsWith("GET /uc_server")
                || parsed[2].endsWith(".css")
                || parsed[2].endsWith(".js")) {
            return;
        }
        // step 2: strip the leading HTTP method and slash
        if (parsed[2].startsWith("GET /")) {
            parsed[2] = parsed[2].substring("GET /".length());
        } else if (parsed[2].startsWith("POST /")) {
            parsed[2] = parsed[2].substring("POST /".length());
        }
        // step 3: strip the trailing protocol string
        if (parsed[2].endsWith(" HTTP/1.1")) {
            parsed[2] = parsed[2].substring(0, parsed[2].length()
                    - " HTTP/1.1".length());
        }
        // step 4: drop any remaining stylesheet, script and image requests
        if (parsed[2].contains(".css")
                || parsed[2].contains(".js")
                || parsed[2].contains(".jpg")
                || parsed[2].contains(".png")
                || parsed[2].contains(".gif")
                || parsed[2].contains(".jpeg")) {
            return;
        }
        // step 5: write only the first three fields (ip, time, url)
        outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
        context.write(key, outputValue);
    }
}
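The filtering and trimming rules of the Mapper can be tried out without a Hadoop cluster. The sketch below restates them as a plain function on the quoted request string; `CleanRuleDemo` and `cleanRequest` are hypothetical names, and the rules are copied from the Mapper above rather than taken from any library.

```java
import java.util.Optional;

final class CleanRuleDemo {
    // Substrings that mark stylesheet, script and image requests (same list as the Mapper).
    private static final String[] STATIC_MARKERS = { ".css", ".js", ".jpg", ".png", ".gif", ".jpeg" };

    /**
     * Takes the quoted request of a log line, e.g. "GET /member.php?mod=register HTTP/1.1",
     * and returns the cleaned URL, or an empty Optional when the record should be dropped.
     */
    static Optional<String> cleanRequest(String request) {
        if (request.startsWith("GET /static/") || request.startsWith("GET /uc_server")) {
            return Optional.empty();
        }
        String url = request;
        // Strip the leading method and slash.
        if (url.startsWith("GET /")) {
            url = url.substring("GET /".length());
        } else if (url.startsWith("POST /")) {
            url = url.substring("POST /".length());
        }
        // Strip the trailing protocol string.
        if (url.endsWith(" HTTP/1.1")) {
            url = url.substring(0, url.length() - " HTTP/1.1".length());
        }
        for (String marker : STATIC_MARKERS) {
            if (url.contains(marker)) {
                return Optional.empty();
            }
        }
        return Optional.of(url);
    }

    public static void main(String[] args) {
        System.out.println(cleanRequest("GET /member.php?mod=register HTTP/1.1"));
        System.out.println(cleanRequest("GET /static/image/common/faq.gif HTTP/1.1"));
    }
}
```

Because the check uses contains rather than endsWith, a URL such as page.jsp would also be dropped by the ".js" marker; whether that is desired depends on the site.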
2.3 Reducer program
package com.neusoft;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class CleanReducer extends Reducer<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset of the line within its input file.
        // Lines from different input files can share an offset, so write
        // every value rather than only the first one.
        for (Text value : values) {
            context.write(value, NullWritable.get());
        }
    }
}
2.4 Driver program
package com.neusoft;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanDriver {
    public static void main(String[] args) throws Exception {
        // Settings for running from a local Windows development machine.
        System.setProperty("HADOOP_USER_NAME", "root");
        System.setProperty("hadoop.home.dir", "e:/hadoop-2.8.3");
        // Both the input path (args[0]) and the output path (args[1]) are required.
        if (args == null || args.length < 2) {
            System.err.println("Usage: CleanDriver <input path> <output path>");
            System.exit(1);
        }
        // Remove a previous local output directory (helper class from 2.5).
        FileUtil.deleteDir(args[1]);
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // Locate the job jar through the driver class.
        job.setJarByClass(CleanDriver.class);
        job.setMapperClass(CleanMapper.class);
        job.setReducerClass(CleanReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Cap each input split at 1 MB.
        FileInputFormat.setMaxInputSplitSize(job, 1024 * 1024);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean bResult = job.waitForCompletion(true);
        System.out.println("--------------------------------");
        System.exit(bResult ? 0 : 1);
    }
}
2.5 Utility class for deleting a folder
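package com.neusoft;

import java.io.File;

/**
 * Created by bee on 3/25/17.
 */
public class FileUtil {
    /**
     * Recursively deletes a local file or directory.
     *
     * @return true if the path existed and was deleted
     */
    public static boolean deleteDir(String path) {
        File dir = new File(path);
        if (!dir.exists()) {
            System.out.println("File or directory does not exist!");
            return false;
        }
        // listFiles() returns null when the path is a regular file.
        File[] children = dir.listFiles();
        if (children != null) {
            for (File f : children) {
                if (f.isDirectory()) {
                    deleteDir(f.getAbsolutePath());
                } else {
                    f.delete();
                }
            }
        }
        return dir.delete();
    }
}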
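The same recursive delete can be expressed with java.nio.file, which reports failures as exceptions instead of silent false return values. This is a sketch only; `DeleteTreeDemo` and `deleteTree` are hypothetical names and, like the helper above, it works on the local file system and not on HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class DeleteTreeDemo {
    /** Deletes a local file or directory tree; returns false if the path does not exist. */
    static boolean deleteTree(String path) throws IOException {
        Path root = Paths.get(path);
        if (!Files.exists(root)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("cleaned");
        Files.createFile(tmp.resolve("part-r-00000"));
        System.out.println(deleteTree(tmp.toString()));
    }
}
```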
III. Analyzing the cleaned data
1. Create the table
1.1 Create a directory on HDFS to back the external table
hdfs dfs -mkdir -p /project/techbbs/cleaned
1.2 Create the external table
Enter Hive and create an external table:
CREATE EXTERNAL TABLE techbbs(ip string, atime string, url string) PARTITIONED BY (logdate string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/project/techbbs/cleaned';
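1.3 Add a partition
After the partitioned table has been created, a partition has to be added. The statement is as follows (here the partition holds the logs of the single day 2015-04-25):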
hive>ALTER TABLE techbbs ADD PARTITION(logdate='2015_04_25') LOCATION '/project/techbbs/cleaned/2015_04_25';
1.4 Load the data
Load the cleaned data into the table just created:
0: jdbc:hive2://localhost:10000> load data local inpath '/root/cleaned' into table techbbs partition(logdate='2015_04_25');
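2 Compute the statistics
2.1 PV count
Page views (PV) is the total number of pages viewed by all users; each time a user opens a page, it is recorded once. Here we only need to count the records in the log. The HQL is as follows: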
0: jdbc:hive2://localhost:10000> SELECT COUNT(1) AS PV FROM techbbs WHERE logdate='2015_04_25';
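2.2 Number of registered users
The forum's registration page is member.php, and when a user clicks register, the requested URL is member.php?mod=register. So here we only need to count the log records whose URL is member.php?mod=register. The HQL is as follows: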
0: jdbc:hive2://localhost:10000> select count(*) from techbbs where logdate='2015_04_25' and url like '%member.php?mod=register%';
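2.3 Number of unique IPs
This is the number of distinct IPs that visit the site within one day. An IP counts as 1 no matter how many pages it visits. So here we only need to count the distinct IPs in the log. SQL does this with the DISTINCT keyword, and HQL uses the same keyword: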
0: jdbc:hive2://localhost:10000> SELECT COUNT(DISTINCT ip) AS IP FROM techbbs WHERE logdate='2015_04_25';
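To make the two counts from 2.1 and 2.3 concrete, here is a small in-memory sketch over cleaned, tab-separated records. It is not how Hive executes the queries; `PvIpDemo`, `pageViews` and `uniqueIps` are hypothetical names, and the URLs in the sample records are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PvIpDemo {
    /** PV: every cleaned record is one page view, like COUNT(1). */
    static long pageViews(List<String> cleanedLines) {
        return cleanedLines.size();
    }

    /** Unique IPs: the first tab-separated field, de-duplicated, like COUNT(DISTINCT ip). */
    static long uniqueIps(List<String> cleanedLines) {
        Set<String> ips = new HashSet<>();
        for (String line : cleanedLines) {
            ips.add(line.split("\t", 2)[0]);
        }
        return ips.size();
    }

    public static void main(String[] args) {
        // Illustrative records only: ip, time, url.
        List<String> lines = Arrays.asList(
                "27.19.74.143\t20130530173820\tmember.php?mod=register",
                "27.19.74.143\t20130530173821\tmember.php",
                "110.52.250.126\t20130530173820\tmember.php");
        System.out.println("PV=" + pageViews(lines) + ", IP=" + uniqueIps(lines));
    }
}
```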
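2.4 Number of bounce users
This is the number of visits that left the site after viewing only one page, that is, visits with a single page view and no further requests. Here we group the records by user IP; if a group contains only one record, that IP is a bounce user. Adding up these users gives the number of bounce users. The HQL is as follows: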
0: jdbc:hive2://localhost:10000> select count(*) from (select ip,count(ip) as num from techbbs where logdate='2015_04_25' group by ip) as tmpTable where tmpTable.num = 1;
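PS: The bounce rate is the percentage of visits that left the site after viewing only one page out of all visits, that is, single-page visits / total visits. Here, dividing the number of bounce users obtained above by the PV count gives the bounce rate.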
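The bounce calculation can be sketched in plain Java in the same way: count records per IP, keep the IPs seen exactly once, and divide by the PV count. The names `BounceDemo`, `bounceUsers` and `bounceRate` are hypothetical, and the sample records are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BounceDemo {
    /** Number of IPs that appear in exactly one cleaned record (GROUP BY ip ... num = 1). */
    static long bounceUsers(List<String> cleanedLines) {
        Map<String, Integer> hitsPerIp = new HashMap<>();
        for (String line : cleanedLines) {
            hitsPerIp.merge(line.split("\t", 2)[0], 1, Integer::sum);
        }
        long bounces = 0;
        for (int hits : hitsPerIp.values()) {
            if (hits == 1) {
                bounces++;
            }
        }
        return bounces;
    }

    /** Bounce users divided by PV, as described in the PS above; 0 for an empty log. */
    static double bounceRate(List<String> cleanedLines) {
        if (cleanedLines.isEmpty()) {
            return 0.0;
        }
        return (double) bounceUsers(cleanedLines) / cleanedLines.size();
    }

    public static void main(String[] args) {
        // Illustrative records only: ip, time, url.
        List<String> lines = Arrays.asList(
                "27.19.74.143\t20130530173820\tmember.php",
                "27.19.74.143\t20130530173821\tmember.php?mod=register",
                "110.52.250.126\t20130530173820\tmember.php",
                "8.35.201.144\t20130530173821\tmember.php");
        System.out.println(bounceUsers(lines) + " bounce users, rate " + bounceRate(lines));
    }
}
```

Note that this divides by page views, following the PS; dividing by the number of unique IPs instead would give the share of visitors who bounced.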