A Hadoop project I did in a 2016 class, whose goal was to track which websites every user visited. I'm revisiting it this semester while studying network security:
First, let's understand what Tor is:
Tor is free software that helps shield your online activity from surveillance, preventing others from learning which websites you visit, where you are located, and so on. Our goal here, however, is still to track Tor users.
"The Tor network disguises your identity by moving your traffic across different Tor servers, and encrypting that traffic so it isn't traced back to you. Anyone who tries would see traffic coming from random nodes on the Tor network, rather than your computer. (For a more in-depth explanation, check out this post from our sister blog, Gizmodo)."
However:
"Tor is handy, but it's far from perfect. Don't think just because you're using Tor that you're perfectly anonymous. Someone like the NSA can tell if you're a Tor user and that makes them more likely to target you. With enough work, the government can figure out who you are."
Most of the collected data is useless, but nobody dares throw it away, since it might turn out to matter someday. At the same time, the agency wants to be able to find out, easily and at any time, what a given IP has been doing.
Query Focused Datasets (QFDs) solve this problem. In a parallel system like Hadoop, access time is dominated by disk latency and network latency. By splitting the data into buckets, writes can be efficiently parallelized (with each bucket potentially written by a separate process) and easily communicated in parallel. Then, when any particular entry is needed, only that one bucket needs to be searched for the value.
Within a bucket we don't bother sorting: since most buckets will never be read, and since accessing a bucket is dominated by the time it takes to load it from disk (rather than by the time to search for an entry within it), it is better from a total-cost perspective not to sort the data.
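The bucketing idea can be sketched in plain Java as a tiny in-memory store. This is a minimal illustration, not the project's actual code: the bucket count, record layout, and class names are assumptions, and in the real system each bucket would be a separate file on HDFS.

```java
import java.util.*;

// Minimal sketch of a query-focused dataset (QFD) bucket store.
// Buckets are append-only and unsorted: a lookup loads exactly one
// bucket and scans it linearly, since disk latency dominates the
// cost of an in-bucket search anyway.
class QfdStore {
    private final int numBuckets;
    private final List<List<String[]>> buckets; // each record: {key, value}

    QfdStore(int numBuckets) {
        this.numBuckets = numBuckets;
        this.buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) buckets.add(new ArrayList<>());
    }

    private int bucketOf(String key) {
        // Stable hash -> bucket index; floorMod handles negative hashCodes.
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // Append without sorting: most buckets are never read, so sorting
    // at write time would be wasted work.
    void put(String key, String value) {
        buckets.get(bucketOf(key)).add(new String[] {key, value});
    }

    // Load only the one bucket the key hashes to, then scan it.
    List<String> get(String key) {
        List<String> out = new ArrayList<>();
        for (String[] rec : buckets.get(bucketOf(key)))
            if (rec[0].equals(key)) out.add(rec[1]);
        return out;
    }
}
```

A query for a key that was never written loads one (empty or unrelated) bucket and returns nothing, which is exactly the "most buckets are never read" trade-off described above.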
Then the TOTALFAIL analysis is simply taking advantage of this in a parallel structure. With the initial QFDs in place, it's easy to do a query for "all Tor users" based on the IPs used by Tor, and then do a query for "each of these users, what is their activity." Now when a No Such Agency analyst wants to find out more about a particular Tor user, they can easily discover all their activity.
Each record includes both a timestamp and the network 4-tuple (source IP, source port, destination IP, destination port). For requests, it includes any tracking cookie seen in the HTTP request, while web replies include any extracted username. Requests also include the full HTTP headers (which include other features), but for this purpose they are simply treated as an opaque blob of data and excluded from the analysis.
The first MapReduce identifies matching request/reply pairs for each user.
The second MapReduce takes the matched pairs, builds QueryFocusedDataSet objects from them, serializes them, and stores them in HDFS.
The code below finds, for each request, the reply within 10 seconds of its timestamp. [The original code had several bugs: the emitted key should be a WTRKey based on srcIP, destIP, or cookie, and the value should be the matching pair <srcIP, destIP>.] The three keys are not merged into one; a separate copy of the pair is emitted under each:
<srcIP, ...>
<destIP, ...>
<cookie, ...>
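The pairing-and-emit step can be simulated without Hadoop as below. The record fields, the 10-second window check, and the string encoding of the pair are illustrative assumptions; in the real job the emits would go through a Hadoop `Mapper` context with a `WTRKey`.

```java
import java.util.*;

// Sketch of the first MapReduce's pairing logic, simulated in plain Java.
// For every request, find a reply on the reversed flow whose timestamp is
// within 10 seconds, then emit the matching pair once under each of three
// keys (srcIP, destIP, cookie) -- three separate emits, not one merged key.
class PairEmitter {
    static final long WINDOW_SECONDS = 10;

    static class Record {
        final long ts; final String srcIP, destIP, cookie;
        Record(long ts, String srcIP, String destIP, String cookie) {
            this.ts = ts; this.srcIP = srcIP; this.destIP = destIP; this.cookie = cookie;
        }
    }

    // Returns the emitted (key -> "srcIP,destIP") entries.
    static List<Map.Entry<String, String>> emitPairs(List<Record> requests,
                                                     List<Record> replies) {
        List<Map.Entry<String, String>> emits = new ArrayList<>();
        for (Record req : requests) {
            for (Record rep : replies) {
                // A reply's source is the request's destination and vice versa.
                boolean sameFlow = rep.srcIP.equals(req.destIP)
                                && rep.destIP.equals(req.srcIP);
                if (sameFlow && Math.abs(rep.ts - req.ts) <= WINDOW_SECONDS) {
                    String pair = req.srcIP + "," + req.destIP;
                    emits.add(Map.entry(req.srcIP, pair));   // keyed by srcIP
                    emits.add(Map.entry(req.destIP, pair));  // keyed by destIP
                    emits.add(Map.entry(req.cookie, pair));  // keyed by cookie
                }
            }
        }
        return emits;
    }
}
```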
Format stored in HDFS: <<request, reply>, Null>. This is then read back out of HDFS and loaded into the reducer.
Here the name is hashed to create a path, which is recorded in the path file.
That is: srcIP-destIP-cookie-hash/ containing the matching pair <srcIP, destIP>.
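A minimal sketch of that path-naming step, under the assumption that the key type (srcIP, destIP, or cookie) plus a hash of the key name selects the bucket file; the exact path format and bucket count here are illustrative, not the project's actual scheme.

```java
// Sketch: derive the bucket path for a QFD key. The file at that path
// would hold the serialized matching pairs <srcIP, destIP> for every
// key name that hashes into this bucket.
class QfdPaths {
    static final int NUM_BUCKETS = 1024; // assumed bucket count

    // keyType is one of "srcIP", "destIP", "cookie".
    static String bucketPath(String keyType, String keyName) {
        int h = Math.floorMod(keyName.hashCode(), NUM_BUCKETS);
        return keyType + "-" + h; // e.g. "srcIP-417"
    }
}
```

Because the mapping is a pure hash, every writer and reader independently computes the same path for the same key, which is what lets the buckets be written by separate processes and queried without any index.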
TotalFailJob.java: given a known IP address for a Tor exit node, query its srcIP QFDs to find all cookies associated with that IP.
The cookies from the previous step are then used to query the QFDs for all of that user's request/reply pairs.
Finally, output the results. The logic can feel a bit convoluted, mainly because the MapReduce work is somewhat involved.
In short: first find <ipA-request1, reply1>, <ipA-request2, reply2>, ...
Since every pair carries cookie information, we can use all of ipA's cookies to find every destination ipA has contacted, i.e., all the replies ipA received.
Because of how Tor works:
user --> Tor node --> website
website --> Tor node --> user
So as long as we know every request coming out of a Tor exit node, we can first learn the IPs of all the Tor users, and then query what those IPs have been doing.
For example, if user A reaches 4399.com through a Tor node, both the Tor node and user A end up holding 4399.com's cookie, and user A and the Tor node also form a matching pair. The Tor node acts like a man in the middle, so in the end we can ignore it: the two IPs holding the 4399 cookie form the true matching pair <user A, 4399.com>.
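That cookie join can be sketched as follows. The grouping of pairs by cookie and the string encoding of pairs are assumptions for illustration; the point is only the logic of dropping the Tor hop.

```java
import java.util.*;

// Sketch of the cookie join: the Tor exit node appears in two matching
// pairs that share the same tracking cookie -- <user, torNode> and
// <torNode, site>. Removing the Tor hop links the two endpoints that
// hold the same cookie into the true pair <user, site>.
class TorUnmask {
    // pairsByCookie: cookie -> list of {srcIP, destIP} pairs seen with it
    static List<String> truePairs(Map<String, List<String[]>> pairsByCookie,
                                  String torNodeIP) {
        List<String> out = new ArrayList<>();
        for (List<String[]> pairs : pairsByCookie.values()) {
            String user = null, site = null;
            for (String[] p : pairs) {
                if (p[1].equals(torNodeIP)) user = p[0]; // user -> torNode
                if (p[0].equals(torNodeIP)) site = p[1]; // torNode -> site
            }
            if (user != null && site != null)
                out.add(user + "," + site); // Tor hop removed
        }
        return out;
    }
}
```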
TotalFailJob Mapper
Use the Tor IP to query the user's source IP, then from that source IP get all of the IP's cookies, then use the cookies to find all the requests. Finally, generate <srcIP, destIP> for Tor users.
Iterate over a set of <srcIP, destIP> pairs.
So if the NSA wants to look up an IP, the data structure needs to store <srcIP, destIP>: given a srcIP, you can find its corresponding destIPs. It also needs to store <srcIP, cookie> and <cookie, destIP> structures.
For Tor users: first use <TorIP, userIP> to find the user's IP, then look up <userIP, cookie>, then <cookie, destIP>, and finally merge the results into <userIP, destIP>, completing the search for the Tor user's activity.
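The three-step lookup chain can be sketched with plain maps standing in for the serialized QFDs; the map and method names are assumptions, and a real implementation would load each mapping from its QFD bucket on HDFS instead.

```java
import java.util.*;

// Sketch of the TotalFailJob lookup chain: <TorIP, userIP> finds the Tor
// users behind an exit node, <userIP, cookie> finds their cookies,
// <cookie, destIP> finds their destinations, merged into <userIP, destIP>.
class TotalFailLookup {
    static List<String> activityOf(
            String torExitIP,
            Map<String, List<String>> userIPsByTorIP,   // <TorIP, userIP> QFD
            Map<String, List<String>> cookiesByUserIP,  // <userIP, cookie> QFD
            Map<String, List<String>> destIPsByCookie)  // <cookie, destIP> QFD
    {
        List<String> results = new ArrayList<>(); // "userIP,destIP" pairs
        for (String userIP : userIPsByTorIP.getOrDefault(torExitIP, List.of()))
            for (String cookie : cookiesByUserIP.getOrDefault(userIP, List.of()))
                for (String destIP : destIPsByCookie.getOrDefault(cookie, List.of()))
                    results.add(userIP + "," + destIP);
        return results;
    }
}
```

Each of the three loops corresponds to loading and scanning one QFD bucket, so the whole chain touches only three buckets per Tor exit node queried.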