Python Crawler Day 8 (urllib library: parsing links, part 01)

Author: 南音木 | Published: 2019-04-11 17:52


These are personal study notes kept for my own lookup. They are for reference only, and discussion is welcome.

Parsing links

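The urllib library provides the parse module, which defines the standard interface for handling URLs: extracting the parts of a URL, joining them back together, and converting links.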

1. urlparse()

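This function identifies a URL and splits it into segments. A standard URL follows the pattern scheme://netloc/path;params?query#fragment, and urlparse() breaks it into those six parts.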

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result),result)

Output:
<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
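
To make the mapping between the URL and the six fields easier to see, here is a minimal sketch that pairs each field name with its value. It assumes only that ParseResult is a named tuple (so `_fields` is available); the variable names `url` and `parts` are chosen for illustration.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
result = urlparse(url)

# ParseResult is a named tuple, so _fields lists the six component names in order.
parts = dict(zip(result._fields, result))

for name, value in parts.items():
    print(f'{name:>8}: {value}')
```

Reading the printout top to bottom follows the URL from left to right: scheme, netloc, path, params, query, fragment.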

API usage

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

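1. urlstring: required. This is the URL to parse.
2. scheme: the default scheme (for example http or https). It is used only when the URL itself carries no scheme; if the URL has one, the scheme parsed from the URL takes precedence.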

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)

Output (the URL has no leading //, so the host is parsed as part of path and netloc is empty):
ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',scheme='https')
print(result)

Output (the scheme in the URL overrides the scheme argument):
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
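
In the scheme-less example, the host ends up in path because the string does not start with `//`. The sketch below is one way to see the difference: it parses the same address with and without a leading `//`, both times with scheme='https' as the default. The names `no_slashes` and `with_slashes` are illustrative only.

```python
from urllib.parse import urlparse

# Without '//', urlparse cannot tell where the host ends, so it all goes into path.
no_slashes = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')

# With a leading '//', the host is recognized as netloc and the default scheme still applies.
with_slashes = urlparse('//www.baidu.com/index.html;user?id=5#comment', scheme='https')

print(repr(no_slashes.netloc), repr(no_slashes.path))
print(repr(with_slashes.netloc), repr(with_slashes.path))
```

This matters when cleaning up user-supplied addresses: prefixing `//` before parsing is a simple way to get a usable netloc while still supplying the scheme separately.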

3. allow_fragments: whether to recognize the fragment. If it is set to False, the fragment is not split out; it stays part of the path, params, or query, and the fragment field is empty.

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment',allow_fragments=False)
print(result)

Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5#comment', fragment='')

from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(result)

Output:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')
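
For a side-by-side view, the following sketch parses one URL twice, once with the default setting and once with allow_fragments=False, and prints the query and fragment fields from each result. The variable names are illustrative.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

with_fragment = urlparse(url)                            # default: allow_fragments=True
without_fragment = urlparse(url, allow_fragments=False)  # '#comment' stays in the query

print(repr(with_fragment.query), repr(with_fragment.fragment))
print(repr(without_fragment.query), repr(without_fragment.fragment))
```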

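The returned ParseResult is a tuple, so each component can be read either by attribute name or by index:

from urllib.parse import urlparse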
result = urlparse('http://www.baidu.com/index.html#comment',allow_fragments=False)
print(result.scheme,result[0],result.netloc,result[1],sep='\n')

Output:
http
http
www.baidu.com
www.baidu.com
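
Because the result behaves like a six-item tuple, it can also be unpacked in one statement, and a modified copy can be made with the named-tuple method `_replace`. The snippet below is a small sketch of both; the variable names are illustrative.

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

# Tuple unpacking works because ParseResult has exactly six items.
scheme, netloc, path, params, query, fragment = result
print(len(result), scheme, netloc, path, params, query, fragment)

# _replace returns a new ParseResult; the original is left unchanged.
https_result = result._replace(scheme='https')
print(https_result.scheme, result.scheme)
```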

2. urlunparse()

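urlunparse() is the counterpart of urlparse(). It takes an iterable as its argument, and that iterable must contain exactly 6 items; otherwise an error is raised about too few or too many values.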

from urllib.parse import urlunparse

data =['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data))

Output:
http://www.baidu.com/index.html;user?a=6#comment
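
The sketch below checks two points from this section: feeding the result of urlparse() straight into urlunparse() rebuilds the example URL, and a list with only five items is rejected. Catching ValueError reflects the behavior I would expect from CPython and should be treated as an assumption rather than a documented guarantee.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

# Round trip: parsing and then unparsing gives back the original URL here.
rebuilt = urlunparse(urlparse(url))
print(rebuilt == url)

# A five-item list is one short of the required six components.
error_message = None
try:
    urlunparse(['http', 'www.baidu.com', 'index.html', 'user', 'a=6'])
except ValueError as exc:  # assumption: CPython raises ValueError for a wrong length
    error_message = str(exc)
    print('error:', error_message)
```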


Article link: https://www.haomeiwen.com/subject/yigtwqtx.html
