[Crawler] parse.urljoin(): converting a relative URL into an absolute URL

Author: 上弦同学 | Published: 2018-11-27 17:51


from urllib import parse
url=parse.urljoin(response.url, post_url)

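Purpose: join a base URL and a relative URL into a single absolute URL, resolving the relative part against the base.

The function handles a number of different cases sensibly. The examples below give a feel for its behavior: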

>>> from urllib.parse import urljoin
>>> urljoin("http://www.google.com/1/aaa.html", "bbbb.html")
'http://www.google.com/1/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html", "2/bbbb.html")
'http://www.google.com/1/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html", "/2/bbbb.html")
'http://www.google.com/2/bbbb.html'
>>> urljoin("http://www.google.com/1/aaa.html", "http://www.google.com/3/ccc.html")
'http://www.google.com/3/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html", "http://www.google.com/ccc.html")
'http://www.google.com/ccc.html'
>>> urljoin("http://www.google.com/1/aaa.html", "javascript:void(0)")
'javascript:void(0)'
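
A few more cases come up often when crawling and are not covered by the examples above: parent-directory references, query-only and fragment-only references, scheme-relative references, and the effect of a trailing slash on the base. The sketch below reuses the same base URL; `cdn.example.com` and the variable names are made up for illustration only.

```python
from urllib.parse import urljoin

base = "http://www.google.com/1/aaa.html"

# "../" climbs one directory above the folder that holds the base document.
parent = urljoin(base, "../bbbb.html")
# -> 'http://www.google.com/bbbb.html'

# A query-only reference keeps the base path and sets the query string.
query_only = urljoin(base, "?page=2")
# -> 'http://www.google.com/1/aaa.html?page=2'

# A fragment-only reference keeps the whole base URL and appends the fragment.
fragment_only = urljoin(base, "#top")
# -> 'http://www.google.com/1/aaa.html#top'

# A scheme-relative reference ("//host/...") inherits only the base scheme.
scheme_relative = urljoin(base, "//cdn.example.com/x.js")
# -> 'http://cdn.example.com/x.js'

# Without a trailing slash, the last segment of the base is replaced;
# with a trailing slash, it is treated as a directory and kept.
no_slash = urljoin("http://www.google.com/1/aaa", "bbbb.html")
# -> 'http://www.google.com/1/bbbb.html'
with_slash = urljoin("http://www.google.com/1/aaa/", "bbbb.html")
# -> 'http://www.google.com/1/aaa/bbbb.html'
```

The trailing-slash case is the one most likely to surprise: a listing page whose URL ends in `/` resolves relative links underneath itself, while one that does not resolves them next to itself.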

The following code comes from a project that crawls articles from 伯乐在线 (Jobbole):

print(response.url)  # http://blog.jobbole.com/all-posts/
print(post_url)  # http://blog.jobbole.com/114523/
print(parse.urljoin(response.url, post_url))  # http://blog.jobbole.com/114523/
yield Request(url=parse.urljoin(response.url, post_url), meta={"cover_img": img_url}, callback=self.parse_detail)

Here post_url is already a valid absolute URL, so joining it with the base URL response.url through urljoin returns it unchanged.

However, to guard against the case where the extracted article URL is relative, the code still uses the form
url=parse.urljoin(response.url, post_url).
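
As a rough illustration of why the defensive urljoin call is worthwhile, here is a minimal sketch that resolves a mixed list of extracted links against a page URL using only the standard library. The helper name `absolute_links` and the sample hrefs are hypothetical, not part of the original project; filtering out non-HTTP schemes such as `javascript:` is an added assumption about what a crawler would want to follow.

```python
from urllib.parse import urljoin, urlparse


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url and keep only http(s) results.

    Hypothetical helper for illustration; not taken from the original project.
    """
    resolved = []
    for href in hrefs:
        # Absolute hrefs pass through unchanged; relative ones are resolved.
        full_url = urljoin(base_url, href.strip())
        # Skip pseudo-links such as "javascript:void(0)".
        if urlparse(full_url).scheme in ("http", "https"):
            resolved.append(full_url)
    return resolved


links = absolute_links(
    "http://blog.jobbole.com/all-posts/",
    [
        "http://blog.jobbole.com/114523/",  # already absolute
        "/114523/",                         # root-relative
        "page/2/",                          # relative to the current directory
        "javascript:void(0)",               # not a followable link
    ],
)
```

With these inputs, the absolute and root-relative hrefs both resolve to http://blog.jobbole.com/114523/, `page/2/` resolves under `/all-posts/`, and the `javascript:` entry is dropped.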