Offline Project in Practice: Web Log Monitoring and Analysis (6): Loading Data in the ETL Stage

Author: 做个合格的大厂程序员 | Published 2020-07-02 20:30

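1. Understanding the table fields and how they are parsed

Fact table design:

[Figure: fact table design, screenshot Xnip2020-06-29_13-45-05]

Dimension table design:

[Figure: dimension table design, screenshot Xnip2020-06-29_13-47-43]

The data in dimension tables usually has to be generated by your own scripts, following rules that fit the business, although a tool can also be used. Either way, the goal is to make later join-based analysis easier. For example, the rows of the time dimension table are typically generated in advance, spanning from the earliest date the business needs up to the current date. Depending on the granularity of your analysis, you can generate year, quarter, month, week, day, hour, and similar attributes.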
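As an illustration of generating time dimension rows with a script, here is a minimal Python sketch. The function name `build_time_dimension` and the column names (`date_key`, `year`, `quarter`, `month`, `week`, `day`) are assumptions made for this example and are not taken from the dimension table design above; the week number follows the ISO calendar, which may differ from your business definition.

```python
from datetime import date, timedelta


def build_time_dimension(start, end):
    """Return one row per calendar day from start to end (both inclusive).

    Column names are hypothetical; adapt them to your own dimension table.
    """
    rows = []
    current = start
    while current <= end:
        # isocalendar() yields (ISO year, ISO week number, ISO weekday)
        _iso_year, iso_week, _iso_weekday = current.isocalendar()
        rows.append({
            "date_key": current.strftime("%Y%m%d"),
            "year": current.year,
            "quarter": (current.month - 1) // 3 + 1,
            "month": current.month,
            "week": iso_week,
            "day": current.day,
        })
        current += timedelta(days=1)
    return rows


rows = build_time_dimension(date(2018, 11, 1), date(2018, 11, 3))
for row in rows:
    # Comma-separated here for readability; a file meant for Hive would use
    # whatever field delimiter the target table declares.
    print(",".join(str(value) for value in row.values()))
```

An hour-level dimension could be produced the same way by stepping with `timedelta(hours=1)` over `datetime` values instead of dates.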

Creating the ODS layer tables

We create new tables in Hive to receive the cleaned data.

1. Raw log data table

drop table if exists ods_weblog_origin;
create table ods_weblog_origin(
valid string,
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
http_referer string,
http_user_agent string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

2. Clickstream model: pageviews table

drop table if exists ods_click_pageviews;
create table ods_click_pageviews(
session string,
remote_addr string,
remote_user string,
time_local string,
request string,
visit_step string,
page_staylong string,
http_referer string,
http_user_agent string,
body_bytes_sent string,
status string)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

3. Clickstream model: visit table

drop table if exists ods_click_stream_visit;
create table ods_click_stream_visit(
session     string,
remote_addr string,
inTime      string,
outTime     string,
inPage      string,
outPage     string,
referal     string,
pageVisits  int)
partitioned by (datestr string)
row format delimited
fields terminated by '\001';

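Then load the data into each of these tables, partition by partition (for example with Hive's load data inpath statement). That completes the first step.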
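Since the same load has to be repeated for every table and every day, the statements can be generated by a small script. The sketch below only builds the HiveQL text; the HDFS directories in `SOURCES` are hypothetical placeholders, because the article does not say where the earlier cleaning steps wrote their output, and the use of `overwrite` is also an assumption.

```python
# Hypothetical HDFS directories, one per ODS table. Replace them with the
# real output locations of the cleaning jobs.
SOURCES = {
    "ods_weblog_origin": "/weblog/preprocessed/{datestr}",
    "ods_click_pageviews": "/weblog/pageviews/{datestr}",
    "ods_click_stream_visit": "/weblog/visit/{datestr}",
}


def build_load_statements(datestr):
    """Build one 'load data inpath' statement per ODS table for one day."""
    statements = []
    for table, path_template in SOURCES.items():
        path = path_template.format(datestr=datestr)
        statements.append(
            f"load data inpath '{path}' overwrite into table {table} "
            f"partition (datestr='{datestr}');"
        )
    return statements


for statement in build_load_statements("20181101"):
    print(statement)
```

The printed statements can then be run in the Hive CLI or passed to it from a scheduling script.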

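2. Detail tables, wide tables, and narrow tables

Concepts

In the fact table, some attributes are fused into a single field. For example, year, month, day, hour, minute, and second together make up the access time. When you need to group and aggregate by just one of these attributes, every query has to slice and concatenate strings, which is very inefficient.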

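To make analysis easier, a single field of the fact table can be split so that several attributes are extracted into new fields of their own. Because the table now has more fields, it is called a wide table, and the original is called a narrow table.

And because the information in the wide table is clearer and more detailed, it can also be called a detail table.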
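To make the narrow-to-wide idea concrete, the following Python sketch splits one time field into separate attributes once, so that later grouping needs no string slicing. It assumes `time_local` has the form `YYYY-MM-DD HH:MM:SS`; the function name `widen` and the sample values are made up for illustration.

```python
from collections import Counter


def widen(time_local):
    """Split 'YYYY-MM-DD HH:MM:SS' into the attributes a wide table stores."""
    daystr, timestr = time_local.split(" ", 1)
    _year, month, day = daystr.split("-")
    hour = timestr.split(":")[0]
    return {
        "time_local": time_local,
        "daystr": daystr,
        "timestr": timestr,
        "month": month,
        "day": day,
        "hour": hour,
    }


# Narrow rows: only the fused time field (sample values, not real log data)
narrow_rows = ["2018-11-01 06:49:18", "2018-11-01 06:52:03", "2018-11-01 07:10:45"]

# Wide rows: attributes are extracted once, at load time
wide_rows = [widen(t) for t in narrow_rows]

# Grouping by hour is now a plain column lookup
hits_per_hour = Counter(row["hour"] for row in wide_rows)
print(hits_per_hour)
```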

Implementing the detail (wide) table

Create the detail table dw_weblog_detail:

drop table if exists dw_weblog_detail;
create table dw_weblog_detail(
valid           string, -- validity flag
remote_addr     string, -- source IP
remote_user     string, -- user identifier
time_local      string, -- full access timestamp
daystr          string, -- access date
timestr         string, -- access time
month           string, -- access month
day             string, -- access day
hour            string, -- access hour
request         string, -- requested URL
status          string, -- response code
body_bytes_sent string, -- bytes sent
http_referer    string, -- referer URL
ref_host        string, -- referer host
ref_path        string, -- referer path
ref_query       string, -- referer query string
ref_query_id    string, -- value of the id parameter in the referer query
http_user_agent string -- client user agent
)
partitioned by(datestr string);

Populate the detail wide table dw_weblog_detail with an insert ... select query:

-- Columns are matched by position, so the select list follows the column order of dw_weblog_detail.
insert into table dw_weblog_detail partition(datestr='20181101')
select c.valid,c.remote_addr,c.remote_user,c.time_local,
substring(c.time_local,1,10) as daystr, -- Hive substring positions are 1-based
substring(c.time_local,12) as timestr,
substring(c.time_local,6,2) as month,
substring(c.time_local,9,2) as day,
substring(c.time_local,12,2) as hour,
c.request,c.status,c.body_bytes_sent,c.http_referer,c.ref_host,c.ref_path,c.ref_query,c.ref_query_id,c.http_user_agent
from
-- Inner query: strip the quotes from http_referer and split it into host, path, query, and the id parameter
(select
a.valid,a.remote_addr,a.remote_user,a.time_local,
a.request,a.status,a.body_bytes_sent,a.http_referer,a.http_user_agent,b.ref_host,b.ref_path,b.ref_query,b.ref_query_id
from ods_weblog_origin a
lateral view parse_url_tuple(regexp_replace(a.http_referer, "\"", ""), 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as ref_host, ref_path, ref_query, ref_query_id) c;

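This query uses a nested subquery. The way to read it is to analyze the innermost query first and then work outward level by level, matching the column names produced by the inner query against the columns referenced by the outer query.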
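To see what the inner query produces for a single row, here is a rough Python analogue of the `regexp_replace` plus `parse_url_tuple` step, built on the standard `urllib.parse` module. It is only an approximation for illustration: edge cases such as missing parts or malformed URLs may behave differently in Hive, and the sample referer URL is made up.

```python
from urllib.parse import urlparse, parse_qs


def parse_referer(http_referer):
    """Approximate the inner query: strip quotes, then split the URL."""
    cleaned = http_referer.replace('"', "")   # like regexp_replace(..., "\"", "")
    parts = urlparse(cleaned)
    query_params = parse_qs(parts.query)      # maps each key to a list of values
    return {
        "ref_host": parts.hostname,                           # 'HOST'
        "ref_path": parts.path,                               # 'PATH'
        "ref_query": parts.query,                             # 'QUERY'
        "ref_query_id": query_params.get("id", [None])[0],    # 'QUERY:id'
    }


# Hypothetical referer value, quoted the way it appears in the raw log
sample = '"http://www.example.com/path/page?id=42&src=home"'
print(parse_referer(sample))
```

The outer query then only adds the time-derived columns on top of these four referer columns, which is why reading the inner level first makes the column mapping easy to follow.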

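Common categories of traffic analysis

Metrics are the foundation of website analysis; they record and measure the various behaviors of visitors on a website. Traffic, for example, is a website metric that measures how many visits a site receives. Before analyzing traffic, let's go over some common metrics.

Basic metrics

PageView, page views (PV): each time a user opens one page of the website, 1 PV is recorded. Opening the same page several times adds to the PV count each time. Put simply, PV is the total number of page loads.

Unique Visitor, unique visitors (UV): the number of distinct users who visit the website within 1 day, identified by browser cookie. The same visitor coming back several times in one day is counted only once.

Visits (VV): the series of activities from a visitor entering the website to leaving it counts as one visit, also called a session. One visit (session) may contain multiple PVs.

IP: the number of distinct IP addresses that visit the website within 1 day. The same IP address visiting several times in one day is counted only once. The IP metric used to serve as a proxy for user identity; today it is used more often to obtain the visitor's geographic location.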
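The four basic metrics can be illustrated on a handful of page view records for one day. In this sketch the records are made up, and the `visitor` field stands in for the browser cookie identifier, which is an assumption; `session` and `remote_addr` mirror column names of the pageviews table.

```python
# One record per page load within a single day (sample data, not real logs)
pageviews = [
    {"visitor": "u1", "session": "s1", "remote_addr": "10.0.0.1", "request": "/index"},
    {"visitor": "u1", "session": "s1", "remote_addr": "10.0.0.1", "request": "/list"},
    {"visitor": "u1", "session": "s2", "remote_addr": "10.0.0.1", "request": "/index"},
    {"visitor": "u2", "session": "s3", "remote_addr": "10.0.0.2", "request": "/index"},
    {"visitor": "u3", "session": "s4", "remote_addr": "10.0.0.2", "request": "/detail"},
]


def basic_metrics(records):
    """Compute PV, UV, VV and IP for one day's records."""
    return {
        "pv": len(records),                               # every page load counts
        "uv": len({r["visitor"] for r in records}),       # distinct visitors
        "vv": len({r["session"] for r in records}),       # distinct sessions
        "ip": len({r["remote_addr"] for r in records}),   # distinct IP addresses
    }


metrics = basic_metrics(pageviews)
print(metrics)  # {'pv': 5, 'uv': 3, 'vv': 4, 'ip': 2}
```

Note how two visitors behind the same IP address (u2 and u3) make the IP count smaller than UV, which is one reason IP is a poor proxy for user identity.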

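Composite metrics

Average visit frequency: the average number of visits (sessions) per unique visitor within one day. Average visit frequency = visits / unique visitors (VV / UV).

Pages per visitor (average visit depth): the average number of page views generated per unique visitor. Pages per visitor = page views / unique visitors (PV / UV).

Average visit duration: the average time spent on the website per visit (session). It reflects how attractive the website is to visitors. Average visit duration = total visit duration / visits.

Bounce rate: the number of visits in which the user arrived at your website and left after viewing only one page, as a percentage of all visits. It is an important indicator for evaluating a website's performance.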
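The composite metrics follow directly from the formulas above. The sketch below computes them from a small list of visit (session) records; the data is made up, `page_visits` mirrors the `pageVisits` column of the visit table, and `visitor` and `stay_seconds` are hypothetical fields introduced for this example. It assumes a non-empty list of visits.

```python
# One record per visit (session) within a single day (sample data)
visits = [
    {"visitor": "u1", "page_visits": 3, "stay_seconds": 120},
    {"visitor": "u1", "page_visits": 1, "stay_seconds": 0},
    {"visitor": "u2", "page_visits": 4, "stay_seconds": 300},
    {"visitor": "u3", "page_visits": 1, "stay_seconds": 0},
]


def composite_metrics(visit_records):
    """Derive the composite metrics from per-visit records (must be non-empty)."""
    vv = len(visit_records)                                   # visits
    uv = len({v["visitor"] for v in visit_records})           # unique visitors
    pv = sum(v["page_visits"] for v in visit_records)         # page views
    total_stay = sum(v["stay_seconds"] for v in visit_records)
    bounces = sum(1 for v in visit_records if v["page_visits"] == 1)
    return {
        "avg_visit_frequency": vv / uv,        # VV / UV
        "avg_visit_depth": pv / uv,            # PV / UV
        "avg_visit_duration": total_stay / vv, # total duration / VV
        "bounce_rate": bounces / vv,           # single-page visits / VV
    }


result = composite_metrics(visits)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```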

Article link: https://www.haomeiwen.com/subject/dtldqktx.html