1 Requirements Analysis
Webmaster tools: http://seo.chinaz.com/
Project modules
- Basic user information analysis
- Browser analysis (ads)
- Regional analysis:
  analyze by IP address -> resolve it to province / city / district
- Browsing-depth analysis:
  number of pages a user visits within a single session
  counting by session alone is not accurate enough; also count the pages one user visits within a given time window
- External-link analysis
  used for app promotion
- Order analysis
- Event analysis
  reserved / extensible module
Basic user information analysis
New visitors, active visitors, total visitors
New members, active members, total members
Metrics must always be tied to dimensions
Browser analysis
Adds the browser dimension on top of the basic user information analysis -> dimension combinations
When a metric appears in multiple modules -> merge the MR jobs
Regional analysis
The request contains the client IP -> resolve it -> plot it on the map
Mobile-phone positioning (may be disabled or blocked)
Order analysis
Total orders, successful orders, refunds...
2 Data Sources
Collecting analytics data must not affect the web system
JS tracking code is embedded in the pages
Nginx records the client IP; the tracking data is wrapped into the request (query string)
The data finally lands in HDFS
JS SDK execution flow
Data fields
en      event name
ver     version number
pl      platform
sdk     SDK type
b_rst   browser resolution
b_iev   browser information (user agent)
u_ud    unique user / visitor identifier
l       client language
u_mid   member id
u_sd    session id
c_time  client time
p_url   current page URL
p_ref   previous page URL (referrer)
tt      current page title
ca      category name of the event
ac      action name of the event
kv_*    custom attributes of the event
du      duration of the event
oid     order id
on      order name
cua     payment amount
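To make these fields concrete, here is a small hypothetical Java sketch of how an SDK might URL-encode a few of the parameters above into the query string of the /log.gif request that Nginx will log. The parameter keys come from the list above; the class name and the sample values are made up for illustration only.
------------
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: assemble tracking parameters into a /log.gif query string.
public class LogGifUrlBuilder {
    public static void main(String[] args) throws UnsupportedEncodingException {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("en", "e_pv");                          // event name (sample value)
        params.put("ver", "1");                            // version
        params.put("pl", "website");                       // platform
        params.put("u_ud", "visitor-uuid-placeholder");    // visitor id
        params.put("u_sd", "session-id-placeholder");      // session id
        params.put("c_time", String.valueOf(System.currentTimeMillis())); // client time
        params.put("p_url", "http://example.com/index.html");             // current page URL

        StringBuilder qs = new StringBuilder("/log.gif?");
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (qs.charAt(qs.length() - 1) != '?') qs.append('&');
            qs.append(e.getKey()).append('=').append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        // Nginx logs this request line together with the client IP
        System.out.println(qs);
    }
}
------------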
launch event
Fired when a page is loaded
Used for new-visitor statistics
pageview event
Fired when a user visits or refreshes a page
chargeRequest event
Invoked explicitly by the application when an order is placed
event (generic business event)
Fired when a visitor or member triggers a business action
Java SDK
For server-side events such as payment success
The web system and Nginx may not be on the same machine; network latency or disconnects could block the web system -> send asynchronously (see the sketch below)
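A minimal sketch of the asynchronous send described above, assuming a bounded BlockingQueue and a single background sender thread (all class names here are hypothetical): the web code only enqueues, the worker thread does the network I/O, the FIFO queue preserves ordering, and a full queue makes the caller block unless an offer-with-timeout is used instead.
------------
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of the java-sdk asynchronous sender.
public class AsyncEventSender {
    // bounded queue: when it is full, producers block in send()
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final Thread worker;

    public AsyncEventSender() {
        worker = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    String event = queue.take();   // FIFO queue preserves event order
                    post(event);                   // network I/O happens off the web thread
                }
            } catch (InterruptedException ignored) {
                // shut down quietly
            }
        }, "event-sender");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called by web code; blocks only if the queue is full. */
    public void send(String event) throws InterruptedException {
        queue.put(event);
        // alternative: queue.offer(event, 10, TimeUnit.MILLISECONDS) to avoid blocking the web request
    }

    private void post(String event) {
        // placeholder: issue the HTTP request to the Nginx log endpoint here
        System.out.println("sending: " + event);
    }
}
------------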
3 Architecture
Edraw Max 7: draw the architecture diagram
StarUML: draw the sequence diagram
ETL -> storage -> analysis (MR / Hive) -> MySQL
Architecture flow
java-sdk sends asynchronously; the queue preserves ordering; beware of blocking when the queue is full
Nginx captures the IP and writes the request to a local log
Flume monitors the local log -> HDFS
HDFS stores the data in time-based directories; Flume can create the directories in HDFS automatically
IP resolution can be done in the ETL step
A third-party library parses the UA string to get browser information (see the first sketch after this list)
Rowkey design
Filter the HBase data and analyze it with MR
The MR job writes results into MySQL through a custom OutputFormat (see the second sketch after this list)
Sqoop can move data between MySQL and Hive in both directions; it runs on MR
Map Hive columns to HBase columns so Hive can query HBase directly and avoid duplicating data
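For the UA-parsing step above, one option is a third-party parser such as the UserAgentUtils library; the sketch below assumes that dependency and its API, so treat it as illustrative rather than the project's actual choice — any equivalent UA parser works.
------------
import eu.bitwalker.useragentutils.Browser;
import eu.bitwalker.useragentutils.UserAgent;

// Sketch: extract browser name/version from the b_iev (user-agent) field during ETL.
// Assumes the eu.bitwalker UserAgentUtils dependency is on the classpath.
public class UserAgentParser {
    public static String[] parse(String uaString) {
        UserAgent ua = UserAgent.parseUserAgentString(uaString);
        Browser browser = ua.getBrowser();
        String name = (browser == null) ? "unknown" : browser.getName();
        String version = (ua.getBrowserVersion() == null) ? "unknown" : ua.getBrowserVersion().getVersion();
        return new String[] { name, version };
    }

    public static void main(String[] args) {
        String ua = "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0 Safari/537.36";
        String[] info = parse(ua);
        System.out.println(info[0] + " " + info[1]);
    }
}
------------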
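For writing MR results into MySQL, Hadoop also ships a ready-made DBOutputFormat; if you roll your own OutputFormat, the core is a RecordWriter holding a JDBC connection, as in the sketch below. The JDBC URL, table and column names are placeholders, and a complete OutputFormat would additionally implement getRecordWriter / checkOutputSpecs / getOutputCommitter.
------------
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch of the RecordWriter inside a custom OutputFormat that writes reducer
// output straight into MySQL. Connection details and table layout are placeholders.
public class MysqlRecordWriter extends RecordWriter<Text, NullWritable> {
    private final Connection conn;
    private final PreparedStatement ps;

    public MysqlRecordWriter() throws SQLException {
        conn = DriverManager.getConnection("jdbc:mysql://node-01:3306/report", "user", "passwd");
        ps = conn.prepareStatement("INSERT INTO stats_user(date_dim, new_users) VALUES (?, ?)");
    }

    @Override
    public void write(Text key, NullWritable value) throws IOException {
        try {
            // assume the reducer emits "date\tcount" as the key
            String[] parts = key.toString().split("\t");
            ps.setString(1, parts[0]);
            ps.setInt(2, Integer.parseInt(parts[1]));
            ps.executeUpdate();   // batching (addBatch/executeBatch) is better in practice
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        try {
            ps.close();
            conn.close();
        } catch (SQLException e) {
            throw new IOException(e);
        }
    }
}
------------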
4 Logging with Nginx
Reference: <<JS Programming Guide>>
Install Nginx
tar -zxvf tengine-2.1.0.tar.gz
yum install gcc pcre-devel openssl-devel -y
cd tengine-2.1.0
./configure
make && make install
cd /usr/local/nginx/sbin
./nginx
# Edit the configuration file (/usr/local/nginx/conf/nginx.conf)
------------
http {
...
log_format my_format
...
server {
...
location = /log.gif {
default_type image/gif;
access_log /opt/data/access.log my_format;
}
------------
cd /etc/rc.d/init.d
vi nginx
------------
#!/bin/sh
#
# nginx - this script starts and stops the nginx daemon
#
# chkconfig: - 85 15
# description: Nginx is an HTTP(S) server, HTTP(S) reverse \
# proxy and IMAP/POP3 proxy server
# processname: nginx
# config: /etc/nginx/nginx.conf
# config: /etc/sysconfig/nginx
# pidfile: /var/run/nginx.pid
# Source function library.
. /etc/rc.d/init.d/functions
# Source networking configuration.
. /etc/sysconfig/network
# Check that networking is up.
[ "$NETWORKING" = "no" ] && exit 0
nginx="/usr/local/nginx/sbin/nginx"
prog=$(basename $nginx)
NGINX_CONF_FILE="/usr/local/nginx/conf/nginx.conf"
[ -f /etc/sysconfig/nginx ] && . /etc/sysconfig/nginx
lockfile=/var/lock/subsys/nginx
make_dirs() {
# make required directories
user=`$nginx -V 2>&1 | grep "configure arguments:" | sed 's/[^*]*--user=\([^ ]*\).*/\1/g' -`
options=`$nginx -V 2>&1 | grep 'configure arguments:'`
for opt in $options; do
if [ `echo $opt | grep '.*-temp-path'` ]; then
value=`echo $opt | cut -d "=" -f 2`
if [ ! -d "$value" ]; then
# echo "creating" $value
mkdir -p $value && chown -R $user $value
fi
fi
done
}
start() {
[ -x $nginx ] || exit 5
[ -f $NGINX_CONF_FILE ] || exit 6
make_dirs
echo -n $"Starting $prog: "
daemon $nginx -c $NGINX_CONF_FILE
retval=$?
echo
[ $retval -eq 0 ] && touch $lockfile
return $retval
}
stop() {
echo -n $"Stopping $prog: "
killproc $prog -QUIT
retval=$?
echo
[ $retval -eq 0 ] && rm -f $lockfile
return $retval
}
restart() {
configtest || return $?
stop
sleep 1
start
}
reload() {
configtest || return $?
echo -n $"Reloading $prog: "
killproc $nginx -HUP
RETVAL=$?
echo
}
force_reload() {
restart
}
configtest() {
$nginx -t -c $NGINX_CONF_FILE
}
rh_status() {
status $prog
}
rh_status_q() {
rh_status >/dev/null 2>&1
}
case "$1" in
start)
rh_status_q && exit 0
$1
;;
stop)
rh_status_q || exit 0
$1
;;
restart|configtest)
$1
;;
reload)
rh_status_q || exit 7
$1
;;
force-reload)
force_reload
;;
status)
rh_status
;;
condrestart|try-restart)
rh_status_q || exit 0
;;
*)
echo $"Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload|configtest}"
exit 2
esac
------------
chmod +x nginx
chkconfig --add nginx
systemctl restart nginx
5 Log Collection with Flume
http://flume.apache.org/
A Flume source can collect from multiple web nodes; a single Flume agent is a single point of failure
Flume can send data to different destinations (e.g. a Kafka sink)
Usually used together with Kafka
Data can also be pushed from Java into Flume over RPC (see the sketch below)
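A minimal sketch of pushing an event from Java into a Flume avro source using the Flume SDK's RpcClient, following the client pattern in the Flume developer guide; the host and port below assume the avro source configured on node-03:10086 later in this section.
------------
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

// Sketch: send one event from Java to a Flume avro source over RPC.
public class FlumeRpcExample {
    public static void main(String[] args) throws EventDeliveryException {
        RpcClient client = RpcClientFactory.getDefaultInstance("node-03", 10086);
        try {
            Event event = EventBuilder.withBody("hello from java", StandardCharsets.UTF_8);
            client.append(event);   // blocks until the agent acknowledges the event
        } finally {
            client.close();
        }
    }
}
------------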
Single-node configuration
tar -zxvf apache-flume-1.6.0-bin.tar.gz
cp flume-env.sh.template flume-env.sh
vi flume-env.sh
---------
export JAVA_HOME=/opt/java
--------
vi /etc/profile
flume-ng version
# Create the Flume configuration file
vi /root/flumedir/option
--------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = node-02
a1.sources.r1.port = 44444
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
--------
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
# Test Flume from node-03
yum install -y telnet
telnet node-02 44444
Two-node configuration
# Edit option on node-02
--------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = netcat
a1.sources.r1.bind = node-02
a1.sources.r1.port = 44444
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-03
a1.sinks.k1.port = 10086
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
--------
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
# Edit option on node-03
--------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = avro
a1.sources.r1.bind = node-03
a1.sources.r1.port = 10086
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
--------
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
Configuration for monitoring a file (exec source)
# vi option3
------------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/dfun.log
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = node-03
a1.sinks.k1.port = 10086
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
-----------
flume-ng agent --conf-file option3 --name a1 -Dflume.root.logger=INFO,console
touch /root/dfun.log
# Duplicate lines in dfun.log to generate test data: in vi, yank from the current line to the end of the file
:.,$y
# then paste
p
Reading from a directory (spooldir source)
-----------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /root/abc
a1.sources.r1.fileHeader = false
a1.sinks.k1.type = logger
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
-----------
flume-ng agent --conf-file option --name a1 -Dflume.root.logger=INFO,console
Uploading to HDFS
-----------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/dfun.log
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
------------
# Append to the file and watch the directories/files being created in HDFS
echo "hello dfun hello flume" >> dfun.log
6 ETL
Writing the Nginx log into HDFS
# Flume configuration
--------------
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/data/access.log
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://dfun/log/%Y%m%d
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10240
a1.sinks.k1.hdfs.idleTimeout = 5
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
-----------
# Visit a page in the browser; the log entries are written to HDFS
Free IP geolocation database:
http://ip.taobao.com/
The system should not depend on a third-party web service; download the database and resolve IPs locally (see the sketch below)
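A minimal sketch of resolving IPs against a locally downloaded database during ETL, assuming a hypothetical CSV of sorted ranges "startIp,endIp,province,city"; the file name, format, and field layout are assumptions for illustration only.
------------
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: look up province/city for an IP in a locally downloaded, sorted range file.
// The CSV format "startIp,endIp,province,city" and the file path are hypothetical.
public class LocalIpResolver {
    private static class Range {
        long start, end;
        String region;
    }

    private final List<Range> ranges = new ArrayList<>();

    public LocalIpResolver(String csvPath) throws IOException {
        try (BufferedReader br = new BufferedReader(new FileReader(csvPath))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] f = line.split(",");
                Range r = new Range();
                r.start = ipToLong(f[0]);
                r.end = ipToLong(f[1]);
                r.region = f[2] + " " + f[3];
                ranges.add(r);   // file assumed to be pre-sorted by start IP
            }
        }
    }

    /** Binary search over the sorted ranges; returns "unknown" when the IP is not covered. */
    public String resolve(String ip) {
        long v = ipToLong(ip);
        int lo = 0, hi = ranges.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Range r = ranges.get(mid);
            if (v < r.start) hi = mid - 1;
            else if (v > r.end) lo = mid + 1;
            else return r.region;
        }
        return "unknown";
    }

    private static long ipToLong(String ip) {
        String[] p = ip.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8) | Long.parseLong(p[3]);
    }
}
------------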