HTML Parsing

Author: kim_xx | Published: 2018-05-12 11:05

What is this?

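Servers usually return data to the client as JSON or XML. Parsing JSON is probably second nature by now. XML is a bit more work, but the format has been around for a long time and there are plenty of established solutions. In one of my earlier projects, however, the data had to be scraped from a web page, which meant working out how to extract data from HTML.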

A magical third-party library

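I had never heard of this approach before. What on earth is it?? After some digging I discovered that there is a syntax called XPath (front-end experts, please go easy on me). As an iOS developer, it felt like discovering a new continent. Then on GitHub I found Ono, a library written by mattt (the well-known author of AFNetworking). It is simply fantastic!!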

How do you use it?

Create a new project and add Ono with CocoaPods. Note that you need to add two entries to Info.plist, App Transport Security Settings and, under it, Allow Arbitrary Loads set to YES, to allow plain HTTP traffic.
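For reference, when Info.plist is opened as source code, those two entries correspond to the raw keys below. This is a sketch of only that fragment. Allowing arbitrary loads turns App Transport Security off for every connection, so it is probably best kept to demos like this one.

```xml
<key>NSAppTransportSecurity</key>
<dict>
    <key>NSAllowsArbitraryLoads</key>
    <true/>
</dict>
```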

The URL used in the demo is an HTML page found on Sohu Sports: http://cbadata.sports.sohu.com/iframes/top_sch_2015.html

Before parsing, you first need to know what XPath is.

OK, on to the main part.

First, load the HTML into an NSData object, then create an ONOXMLDocument from it.

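    NSMutableArray *array = [NSMutableArray array]; // collects the parsed Post objects
    // Download the page data. This call blocks until the download finishes.
    NSData *data = [NSData dataWithContentsOfURL:[NSURL URLWithString:kUrlStr]];

    NSError *error = nil;
    ONOXMLDocument *doc = [ONOXMLDocument HTMLDocumentWithData:data error:&error];
    if (!doc) {
        NSLog(@"Failed to parse HTML: %@", error); // doc is nil when the download or the parsing failed
    }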
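Because dataWithContentsOfURL: is synchronous, calling it on the main thread freezes the UI while the page downloads. The sketch below swaps in NSURLSession to do the same download in the background. FetchHTMLDocument is a hypothetical helper name, not part of Ono or of the demo, and the import line may need adjusting to how Ono is integrated in your project.

```objc
#import <Foundation/Foundation.h>
#import <Ono/Ono.h> // assumption: change to "Ono.h" if your setup needs it

// Hypothetical helper: downloads urlString in the background, parses the
// response as HTML, and calls completion on the main queue.
static void FetchHTMLDocument(NSString *urlString,
                              void (^completion)(ONOXMLDocument *doc, NSError *error)) {
    NSURL *url = [NSURL URLWithString:urlString];
    if (!url) {
        NSError *badURL = [NSError errorWithDomain:NSURLErrorDomain
                                              code:NSURLErrorBadURL
                                          userInfo:nil];
        dispatch_async(dispatch_get_main_queue(), ^{
            completion(nil, badURL);
        });
        return;
    }

    NSURLSessionDataTask *task =
        [[NSURLSession sharedSession] dataTaskWithURL:url
                                    completionHandler:^(NSData *data, NSURLResponse *response, NSError *error) {
            ONOXMLDocument *doc = nil;
            NSError *resultError = error;
            if (data) {
                // Same Ono call as in the post, now running off the main thread.
                doc = [ONOXMLDocument HTMLDocumentWithData:data error:&resultError];
            }
            dispatch_async(dispatch_get_main_queue(), ^{
                completion(doc, resultError);
            });
        }];
    [task resume];
}
```

It could be called as FetchHTMLDocument(kUrlStr, ^(ONOXMLDocument *doc, NSError *error) { ... }), with the XPath lookups moved inside the completion block.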

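To parse the data you need the XPath of the target node. You can get it with Chrome: open Developer Tools, find the node you want to parse, then right-click it and copy its XPath.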

Copying the XPath

Once you have the XPath you want, you can extract the data you need.
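As a minimal sketch of that lookup, the example below finds a parent node with firstChildWithXPath: and reads the text of its children. The HTML string, the "times" id, and the ChildTexts helper are made up for illustration; they are not the markup of the Sohu page or code from the demo.

```objc
#import <Foundation/Foundation.h>
#import <Ono/Ono.h> // assumption: change to "Ono.h" if your setup needs it

// Hypothetical helper: returns the text of every child element of the
// first node that matches parentXPath.
static NSArray<NSString *> *ChildTexts(NSData *htmlData, NSString *parentXPath) {
    NSError *error = nil;
    ONOXMLDocument *doc = [ONOXMLDocument HTMLDocumentWithData:htmlData error:&error];
    if (!doc) {
        NSLog(@"Parse failed: %@", error);
        return @[];
    }

    ONOXMLElement *parent = [doc firstChildWithXPath:parentXPath];
    NSMutableArray<NSString *> *texts = [NSMutableArray array];
    for (ONOXMLElement *child in parent.children) {
        // Fall back to an empty string so a nil value is never inserted.
        [texts addObject:child.stringValue ?: @""];
    }
    return texts;
}

int main(void) {
    @autoreleasepool {
        // Made-up markup, for illustration only.
        NSString *html = @"<html><body><div id=\"times\">"
                         @"<span>19:35</span><span>20:00</span>"
                         @"</div></body></html>";
        NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
        NSLog(@"%@", ChildTexts(data, @"//*[@id=\"times\"]"));
    }
    return 0;
}
```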

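    // Iterate over the child nodes of postsParentElement (the parent node,
    // assumed to have been looked up earlier; its lookup is not shown here).
    NSMutableArray *timesArray = [NSMutableArray array];
    [postsParentElement.children enumerateObjectsUsingBlock:^(ONOXMLElement * _Nonnull obj, NSUInteger idx, BOOL * _Nonnull stop) {
        NSLog(@"%@", obj.stringValue);
        // Strip spaces and newlines; use @"" so a nil stringValue cannot crash addObject:.
        NSString *text = obj.stringValue ?: @"";
        [timesArray addObject:[[text stringByReplacingOccurrencesOfString:@" " withString:@""] stringByReplacingOccurrencesOfString:@"\n" withString:@""]];
    }];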
    
    for (int i = 0; i < timesArray.count; i++) {
        Post *post = [Post new];
        post.time = timesArray[i];
        
        // team1
        //*[@id="List1_1"]/div/table/tr[1]/td[1]
        ONOXMLElement *team1 = [doc firstChildWithXPath:[NSString stringWithFormat:@"//*[@id=\"List1_%d\"]/div/table/tr[1]/td[1]", i+1]];
        post.team1 = team1.stringValue;
        //*[@id="List1_1"]/div/table/tr[1]/td[3]
        ONOXMLElement *team2 = [doc firstChildWithXPath:[NSString stringWithFormat:@"//*[@id=\"List1_%d\"]/div/table/tr[1]/td[3]", i+1]];
        post.team2 = team2.stringValue;
        //*[@id="List1_1"]/div/table/tr[2]/td[1]
        ONOXMLElement *score1 = [doc firstChildWithXPath:[NSString stringWithFormat:@"//*[@id=\"List1_%d\"]/div/table/tr[2]/td[1]", i+1]];
        post.score1 = score1.stringValue;
        //*[@id="List1_1"]/div/table/tr[2]/td[3]
        ONOXMLElement *score2 = [doc firstChildWithXPath:[NSString stringWithFormat:@"//*[@id=\"List1_%d\"]/div/table/tr[2]/td[3]", i+1]];
        post.score2 = score2.stringValue;
        //*[@id="List1_1"]/div/table/tr[3]/td/a[1]
        ONOXMLElement *link = [doc firstChildWithXPath:[NSString stringWithFormat:@"//*[@id=\"List1_%d\"]/div/table/tr[3]/td/a[1]", i+1]];
        post.link = [link valueForAttribute:@"href"];
        
        
        [array addObject:post];
    }
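The loop above fills a Post object whose declaration is not shown. A minimal sketch of such a model might look like this, assuming every field is simply kept as the string read from the page; the property names come from the code above, while the types are a guess.

```objc
#import <Foundation/Foundation.h>

// Assumed model class: one scheduled game, all fields stored as strings.
@interface Post : NSObject
@property (nonatomic, copy) NSString *time;   // text collected in timesArray
@property (nonatomic, copy) NSString *team1;  // tr[1]/td[1]
@property (nonatomic, copy) NSString *team2;  // tr[1]/td[3]
@property (nonatomic, copy) NSString *score1; // tr[2]/td[1]
@property (nonatomic, copy) NSString *score2; // tr[2]/td[3]
@property (nonatomic, copy) NSString *link;   // href of tr[3]/td/a[1]
@end

@implementation Post
@end
```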

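You may hit a pitfall here: the XPath copied from the browser returns no data. When that happens, check whether the path is actually correct. In my case the copied path was //*[@id="List1_2"]/div/table/tbody/tr[1]/td[1]/a, but the path that actually worked was //*[@id="List1_2"]/div/table/tr[2]/td[1]. Note the missing tbody: browsers insert a tbody element into the DOM they build even when the HTML source has none, so a copied path can contain a step that the raw HTML does not have. Beyond that, you may simply need to experiment.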
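One difference between the two paths is the tbody step, which browsers add to the DOM even when the HTML source does not contain it. A small helper along these lines could strip that step from a copied path before handing it to Ono. XPathByRemovingTbody is a hypothetical name, and the approach only makes sense when the raw HTML really has no tbody, so treat it as a sketch.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: removes every "/tbody" or "/tbody[n]" step from an
// XPath string copied from the browser.
static NSString *XPathByRemovingTbody(NSString *xpath) {
    // Match "/tbody", optionally followed by an index such as "[1]",
    // but only when the step ends there (next character is "/" or end of string).
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"/tbody(\\[\\d+\\])?(?=/|$)"
                                                  options:0
                                                    error:NULL];
    return [regex stringByReplacingMatchesInString:xpath
                                           options:0
                                             range:NSMakeRange(0, xpath.length)
                                      withTemplate:@""];
}
```

For example, XPathByRemovingTbody(@"//*[@id=\"List1_2\"]/div/table/tbody/tr[1]/td[1]/a") returns //*[@id="List1_2"]/div/table/tr[1]/td[1]/a. Row and column indexes are left untouched, so those still need checking by hand.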

I did not implement the CSS-selector approach. If you are interested, you can try it yourself.
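As a starting point, Ono also offers CSS-based lookups such as firstChildWithCSS: and enumerateElementsWithCSS:usingBlock:. The sketch below collects the text of every element that matches a selector. TextsMatchingCSS is a hypothetical helper, it has not been tried against the Sohu page, and the block signature shown is the one from Ono 2.x as far as I know.

```objc
#import <Foundation/Foundation.h>
#import <Ono/Ono.h> // assumption: change to "Ono.h" if your setup needs it

// Hypothetical helper: returns the text of every element matching a CSS selector.
static NSArray<NSString *> *TextsMatchingCSS(ONOXMLDocument *doc, NSString *css) {
    NSMutableArray<NSString *> *texts = [NSMutableArray array];
    [doc enumerateElementsWithCSS:css
                       usingBlock:^(ONOXMLElement *element, NSUInteger idx, BOOL *stop) {
        // Fall back to an empty string so a nil value is never inserted.
        [texts addObject:element.stringValue ?: @""];
    }];
    return texts;
}
```

A call such as TextsMatchingCSS(doc, @"#List1_1 td") would be the rough counterpart of the //*[@id="List1_1"] paths used earlier, although it selects all cells under that node instead of one specific cell.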

Demo for this post: https://pan.baidu.com/s/11pNP14sZhVTliuP3js_ibA (password: abeo)