网络数据统计分析笔记|| 网络流数据分析

作者: 周运来就是我 | 来源:发表于2020-10-02 11:21 被阅读0次

网络数据统计分析笔记|| 网络流数据分析
网络数据统计分析笔记|| 操作网络数据
网络数据统计分析笔记|| 网络数据可视化
通宵整理的140种Python标准库、第三方库和外部工具都有！
140种Python标准库、第三方库和外部工具
通宵整理的140种Python标准库、第三方库和外部工具都有！
网络数据统计分析笔记|| 动态网络
网络数据统计分析笔记|| 网络数据的描述性分析
读《吴军·阅读与写作课50讲》|| 暨简书九十万字纪实
【网络管理】网络监视与数据存储

前情回顾：

Gephi网络图极简教程
 Network在单细胞转录组数据分析中的应用
 网络数据统计分析笔记|| 为什么研究网络
 网络数据统计分析笔记|| 操作网络数据
 网络数据统计分析笔记|| 网络数据可视化
 网络数据统计分析笔记|| 网络数据的描述性分析
 网络数据统计分析笔记||网络图的数学模型
 网络数据统计分析笔记|| 网络图的统计模型
 网络数据统计分析笔记|| 网络拓扑结构推断

网络流(network-flows)是一种类比水流的解决问题方法，与线性规划密切相关。网络流的理论和应用在不断发展，出现了具有增益的流、多终端流、多商品流以及网络流的分解与合成等新课题。网络流的应用已遍及通讯、运输、电力、工程规划、任务分派、设备更新以及计算机辅助设计等众多领域。

网络流(Network Flow)的相关定义：

源点：有n个点，有m条有向边，有一个点很特殊，只出不进，叫做源点。
汇点：另一个点也很特殊，只进不出，叫做汇点。
容量和流量：每条有向边上有两个量，容量和流量，从i到j的容量通常用c[i,j]表示,流量则通常是f[i,j].

通常可以把这些边想象成道路，流量就是这条道路的车流量，容量就是道路可承受的最大的车流量。很显然的，流量<=容量。而对于每个不是源点和汇点的点来说，可以类比的想象成没有存储功能的货物的中转站，所有“进入”他们的流量和等于所有从他本身“出去”的流量。
最大流：把源点比作工厂的话，问题就是求从工厂最大可以发出多少货物，是不至于超过道路的容量限制，也就是，最大流。先认识一下S(source)和T(sink)的概念。S就是常说的源点,T就是汇点(也就是起点和重点,这个跟最短路的概念是一样的)。我们有一张图,要求从源点流向汇点的最大流量(可以有很多条路到达汇点),就是我们的最大流问题(max flow)。

library(sand)
data(calldata)
names(calldata)
# ---
## [1] "Orig"    "Dest"    "DistEuc" "DistRd"  "O.GRP"
## [6] "D.GRP"   "Flow"
# ---

head(calldata)
      Orig             Dest DistEuc DistRd   O.GRP     D.GRP     Flow
1 Bruck/Le             Wien   36.14     37 7828880 285193984 57.61270
2 Bruck/Le       Mistelbach   61.46     81 7828880   7685130  1.39963
3 Bruck/Le      Wr.Neustadt   46.50     56 7828880  27369776 19.90791
4 Bruck/Le St.P<U+00BA>lten   88.60     96 7828880  24601360  3.90906
5 Bruck/Le           Zwettl  135.13    145 7828880   8143690  1.38712
6 Bruck/Le       Hollabrunn   78.75     85 7828880   6385840  1.49779

min.call <- min(calldata$Flow)
calldata$FlowCnt <- round(5 * calldata$Flow / min.call)

W <- xtabs(FlowCnt ~ Orig + Dest, calldata)
head(W)

            Dest
Orig         Amstetten BadIschl Bischofsh. Bruck/Le Bruck/Mur Feldkirch Freistadt   Graz Hartberg
  Amstetten          0     4222       6731      640      7235      6342      7160  10389     3950
  BadIschl        2514        0       3695     1184      2981      2192       990  13802     1194
  Bischofsh.      3113     3812          0     1095      5807      4926       649  17626     2259
  Bruck/Le        2812     1137       1902        0      2715      3865       261   8717     7369
  Bruck/Mur       7485     5337       7184      842         0     14447       529  38500     7360
  Feldkirch       3304      850        774     7593     71570         0       235 127013    31020
            Dest
Orig         Hollabrunn Innsbruck Judenburg Kirch/Krems Klagenfurt Landeck Leibnitz Lienz/Os Liezen
  Amstetten         522      4346      2610        4186       6767     715     1883     1452   9408
  BadIschl          620      3363      2232        3083       2640     512     1095      715   5157
  Bischofsh.       1195      6711      6011        1165      11946    1172     2378     1826  11635
  Bruck/Le         2925      2246      1486         836       4012     773     1339     1795   1006
  Bruck/Mur         494      8665      9391        2129      23218    1875     7887     1841     60
  Feldkirch        1217      7470     10397        1073       6040    1570    11627     3904  19740
            Dest
Orig           Linz Mistelbach Reutte Ried/Inn Salzburg Sp/Drau        St.P�lten        V�cklabruck
  Amstetten   92097       4455   1054     8286     6356    2365             2761              10190
  BadIschl    28618        338    196     6619    16401     653             3132              42022
  Bischofsh.  14742        844    556     7667    43725    6740             4770               7043
  Bruck/Le     7268       2734    413     1890     2363     750             7635               2559
  Bruck/Mur   18372        458   1639     2731     6052    3964             5402               4311
  Feldkirch   15229       6164   1423     4688     1110     806             9586               4748
            Dest
Orig           Wien Wolfsberg        W�rgl Wr.Neustadt ZellamSee Zwettl
  Amstetten   24920       692         2937        6439      3218   2393
  BadIschl    43925       264         1681        2968      1771    666
  Bischofsh.  12277      1915         5585        4153       811   1647
  Bruck/Le   112525       453         3502       38883       795   2709
  Bruck/Mur    9941      4789         6385        6326      3855    992
  Feldkirch  169658       348         3360       10088      6953   1498


g.cd <- graph_from_adjacency_matrix(W, weighted=TRUE)

in.flow <- strength(g.cd, mode="in")
out.flow <- strength(g.cd, mode="out")
vsize <- sqrt(in.flow + out.flow) / 100
pie.vals <- lapply((1:vcount(g.cd)),
   function(i) c(in.flow[i], out.flow[i]))
ewidth <- E(g.cd)$weight / 10^5
set.seed(42)
plot(g.cd, vertex.size=vsize, vertex.shape="pie",
   vertex.pie=pie.vals, edge.width=ewidth,
   edge.arrow.size=0.1)

calldata$lFlowCnt <- log(calldata$FlowCnt, 10)
calldata$lO.GRP <- log(calldata$O.GRP, 10)
calldata$lD.GRP <- log(calldata$D.GRP, 10)
calldata$lDistRd <- log(calldata$DistRd, 10)

library(car)
scatterplotMatrix( ~ lFlowCnt + lO.GRP + lD.GRP + 
   lDistRd, data=calldata, regLine=list(col="red"),
   smooth=list(spread=FALSE,col.smooth="goldenrod"),
   col="powderblue")

以上可以看成是特征选择

formula.s <- FlowCnt ~ lO.GRP + lD.GRP + lDistRd


formula.g <- FlowCnt ~ Orig + Dest + lDistRd


gm.s <- glm(formula.s, family="poisson", data=calldata)
gm.g <- glm(formula.g, family="poisson", data=calldata)


summary(gm.s)

Call:
glm(formula = formula.s, family = "poisson", data = calldata)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-475.06   -54.16   -29.20    -2.09  1149.93  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.149e+01  5.394e-03   -2131   <2e-16 ***
lO.GRP       1.885e+00  4.306e-04    4376   <2e-16 ***
lD.GRP       1.670e+00  4.401e-04    3794   <2e-16 ***
lDistRd     -2.191e+00  7.909e-04   -2770   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 45490237  on 991  degrees of freedom
Residual deviance: 10260808  on 988  degrees of freedom
AIC: 10270760

Number of Fisher Scoring iterations: 5

gm.g$aic
# ---
## [1] 5466814
# ---
gm.s$aic
# ---
## [1] 10270760
# ---

# CHUNK 11
plot(calldata$lFlowCnt,log(gm.g$fitted.values,10),
   cex.lab=1.5,
   xlab=expression(Log[10](paste("Flow Volume"))),
   col="green", cex.axis=1.5, ylab="", ylim=c(2, 5.75))
mtext(expression(Log[10](paste("Fitted Value"))), 2,
   outer=T, cex=1.5, padj=1)
clip(0.5,5.75,2,5.75)
abline(0, 1, lwd=2, col="darkgoldenrod1")

res <- residuals.glm(gm.g, type="response")
relres <- res/calldata$FlowCnt
lrelres <- log(abs(relres),10)
res.sgn <- (relres>=0)

plot(calldata$lFlowCnt[res.sgn], lrelres[res.sgn],
   xlim=c(0.5,5.75), ylim=c(-3.5,3.5),
   xlab=expression(Log[10](paste("Flow Volume"))),
   cex.lab=1.5, cex.axis=1.5, ylab="", col="lightgreen")
mtext(expression(Log[10](paste("Relative Error"))), 2,
   outer=T, cex=1.5, padj=1)
par(new=T)
plot(calldata$lFlowCnt[!res.sgn], lrelres[!res.sgn],
   xlim=c(0.5,5.75), ylim=c(-3.5, 3.5),
   xlab=expression(Log[10](paste("Flow Volume"))),
   cex.lab=1.5, cex.axis=1.5, ylab="", col="darkgreen")
mtext(expression(Log[10](paste("Relative Error"))), 2,
   outer=T, cex=1.5, padj=1)
clip(0.5,5.75,-3.5,3.5)
abline(h=0, lwd=2, col="darkgoldenrod2")

网络流的预测：流量矩阵估计

有了网络链路上的流量测量，再加上力量在起讫点间链路上分布方式的知识，对于我们准确预测起讫点间的流量就足够了。这被称为流量矩阵预测（traffic matrix estimation ）。

流量矩阵是一个二维矩阵与其邻接矩阵元素,tij决定的流量采购从节点我和退出节点j。tij价值也被称为交通需求和每个需求代表的数量每一对网络节点之间的数据传输。在由4个节点组成的网络中，每个节点是一个流量源或一个流量汇聚，流量矩阵包含12个需求(图5a)。当节点1、2代表流量源，节点3、4代表流量目的地时，只有4个需求(图5b)。

library(networkTomography)
data(bell.labs)


g.bl <- graph_from_literal(fddi:switch:local:corp 
                            ++ Router)
plot(g.bl)

B <- bell.labs$A
Z <- bell.labs$X
x <- bell.labs$Y

# CHUNK 16
library(lattice)
traffic.in <- c("dst fddi","dst switch",
   "dst local","dst corp")
traffic.out <- c("src fddi","src switch",
   "src local","src corp")
my.df <- bell.labs$df
my.df$t <- unlist(lapply(my.df$time, function(x) {
     hrs <- as.numeric(substring(x, 11, 12))
     mins <- as.numeric(substring(x, 14, 15))
     t <- hrs + mins/60
     return(t)}))

# Separate according to whether data
# are incoming or outgoing.
my.df.in <- subset(my.df, nme %in% traffic.in)
my.df.out <- subset(my.df, nme %in% traffic.out)


# Set up trellis plots for each case.
p.in <- xyplot(value / 2^10 ~ t | nme, data=my.df.in,
   type="l", col.line="goldenrod",
   lwd=2, layout=c(1,4),
   xlab="Hour of Day", ylab="Kbytes/sec")
p.out <- xyplot(value / 2^10 ~ t | nme, data=my.df.out,
   type="l", col.line="red",
   lwd=2, layout=c(1,4),
   xlab="Hour of Day", ylab="Kbytes/sec")
 
# Generate trellis plots.
print(p.in, position=c(0,0.5,1,1), more=TRUE)
print(p.out, position=c(0,0,1,0.5))

B.full <- rbind(B, 2 - colSums(B))
write.table(format(B.full),
   row.names=F, col.names=F, quote=F)
# ---
## 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
## 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
## 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
## 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
## 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0
## 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
## 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
# ---

层析引力方法（tomogravity algorithm ）

计算机通信网络的建模和分析产生了各种有趣的统计问题。本文讨论了网络的tomog- raphy问题，tomog- raphy用于描述两类大容量逆问题。第一个处理被动层析成像，即在单个路由器/节点级别收集聚集数据，其目标是恢复p级信息。这里的主要问题是估计出发地-目的地转换矩阵。第二种方法是主动层析成像，它利用重构链路级信息fr的方法进行数据分析

x.full <- Z %*% t(B.full)
tomo.fit <- tomogravity(x.full, B.full, 0.01)
zhat <- tomo.fit$Xhat

# CHUNK 19
nt <- nrow(Z); nf <- ncol(Z)
t.dat <- data.frame(z = as.vector(c(Z) / 2^10),
   zhat = as.vector(c(zhat) / 2^10),
   t <- c(rep(as.vector(bell.labs$tvec), nf)))

od.names <- c(rep("fddi->fddi", nt),
   rep("fddi->local", nt),
   rep("fddi->switch", nt), rep("fddi->corp",nt),
   rep("local->fddi", nt), rep("local->local",nt),
   rep("local->switch", nt), rep("local->corp",nt),
   rep("switch->fddi", nt), rep("switch->local",nt),
   rep("switch->switch", nt), rep("switch->corp",nt),
   rep("corp->fddi", nt), rep("corp->local",nt),
   rep("corp->switch", nt), rep("corp->corp",nt))

t.dat <- transform(t.dat, OD = od.names)

xyplot(z~t | OD, data=t.dat,
   panel=function(x, y, subscripts){
     panel.xyplot(x, y, type="l", col.line="blue")
     panel.xyplot(t.dat$t[subscripts], 
                  t.dat$zhat[subscripts], 
                  type="l", col.line="green")
   }, as.table=T, subscripts=T, xlim=c(0,24),
   xlab="Hour of Day", ylab="Kbytes/sec",
   layout=c(4,4))

https://blog.csdn.net/BigFatSheep/article/details/78771897
https://blog.csdn.net/dengtiaolu0407/article/details/102199361?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromMachineLearnPai2-1.channel_param
https://www.cnblogs.com/FibonacciHeap/articles/9691400.html
基于引力模型的区域物流需求预测研究
https://sites.cs.ucsb.edu/~suri/cs231/Flows.pdf
http://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/y2h.pdf
https://www.noction.com/knowledge-base/network-capacity-planning
Traffic matrix estimation: A neural network approach with extended input and expectation maximization iteration

https://rdrr.io/cran/networkTomography/
https://arxiv.org/pdf/0708.0945.pdf

网络数据统计分析笔记|| 网络流数据分析
前情回顾： Gephi网络图极简教程[https://www.jianshu.com/p/86145943695a...
网络数据统计分析笔记|| 操作网络数据
前情回顾：Gephi网络图极简教程[https://www.jianshu.com/p/86145943695a]...
网络数据统计分析笔记|| 网络数据可视化
前情回顾：Gephi网络图极简教程[https://www.jianshu.com/p/86145943695a]...
通宵整理的140种Python标准库、第三方库和外部工具都有！
这些库可用于文件读写、网络抓取和解析、数据连接、数清洗转换、数据计算和统计分析、图像和视频处理、音频处理、数据挖掘...
140种Python标准库、第三方库和外部工具
这些库可用于文件读写、网络抓取和解析、数据连接、数清洗转换、数据计算和统计分析、图像和视频处理、音频处理、数据挖掘...
通宵整理的140种Python标准库、第三方库和外部工具都有！
这些库可用于文件读写、网络抓取和解析、数据连接、数清洗转换、数据计算和统计分析、图像和视频处理、音频处理、数据挖掘...
网络数据统计分析笔记|| 动态网络
前情回顾： Gephi网络图极简教程[https://www.jianshu.com/p/86145943695a...
网络数据统计分析笔记|| 网络数据的描述性分析
前情回顾：Gephi网络图极简教程[https://www.jianshu.com/p/86145943695a]...
读《吴军·阅读与写作课50讲》|| 暨简书九十万字纪实
公元2020年国庆，在写完《网络数据的统计分析》笔记之后，我在简书的官方统计写作量达到九十万字。 2017年加入简...
【网络管理】网络监视与数据存储
网络监视和管理工具用于采集数据流并提供分析能力的工具叫做网络监视器；监视网络通信的目的是对数据流进行分析，发现数...