Python 3 Crawler Notes 15: Scrapy Selectors

Author: X_xxieRiemann | Published: 2019-04-07 16:04

These notes use the Scrapy shell (an interactive testing environment) together with a sample page hosted on the Scrapy documentation server:

[https://docs.scrapy.org/en/latest/_static/selectors-sample1.html]

The full HTML source of that page:

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

I usually work with CSS selectors, so XPath is not covered here.

# The result is a list of Selector objects, so css, xpath, and re can still be chained on it; ::text extracts the text content of the tag.
In [3]: response.css('title::text')
Out[3]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

# Get the tag's text; extract() returns every match as a list (newer Scrapy versions also offer getall())
In [4]: response.css('title::text').extract()
Out[4]: ['Example website']

# With several matching tags, extract() returns the text of all of them, here the five <a> tags
In [5]: response.css('a::text').extract()
Out[5]: 
['Name: My image 1 ',
 'Name: My image 2 ',
 'Name: My image 3 ',
 'Name: My image 4 ',
 'Name: My image 5 ']

# Use ::attr(xxxx) to get the value of the tag's xxxx attribute
In [7]: response.css('a::attr(href)').extract()
Out[7]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']


# extract_first() returns only the first match (newer Scrapy versions also offer get())
In [8]: response.css('a::attr(href)').extract_first()
Out[8]: 'image1.html'
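
The href values above are relative, while the sample page declares http://example.com/ in its base tag. As a minimal sketch of how such links could be turned into absolute URLs, the snippet below uses the standard library's `urljoin`. The names `BASE_URL`, `hrefs`, and `absolute_links` are illustrative and not part of the original session; inside a Scrapy callback, `response.urljoin` is the usual helper for the same job.

```python
from urllib.parse import urljoin

# Assumption: the base URL declared by the sample page's <base href=...> tag.
BASE_URL = "http://example.com/"

# The relative links returned by response.css('a::attr(href)').extract().
hrefs = ["image1.html", "image2.html", "image3.html", "image4.html", "image5.html"]

# Resolve every relative link against the base URL.
absolute_links = [urljoin(BASE_URL, href) for href in hrefs]

print(absolute_links[0])  # http://example.com/image1.html
```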

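# Regular expressions: re() returns a plain list of strings rather than Selectors, so no further css/xpath calls can be chained on the result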
In [10]: response.css('a::text').re('Name:(.*)')
Out[10]: 
[' My image 1 ',
 ' My image 2 ',
 ' My image 3 ',
 ' My image 4 ',
 ' My image 5 ']

# re_first() returns only the first regex match
In [11]: response.css('a::text').re_first('Name:(.*)')
Out[11]: ' My image 1 '
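
To make the behaviour of re() and re_first() more concrete, here is a rough sketch with Python's own `re` module applied to the strings extracted earlier. It only approximates what the selector methods do for a pattern with a single capture group, and it does not call Scrapy. The variable names and the tighter pattern at the end are illustrative assumptions.

```python
import re

# The strings returned by response.css('a::text').extract() in the session above.
texts = [
    "Name: My image 1 ",
    "Name: My image 2 ",
    "Name: My image 3 ",
    "Name: My image 4 ",
    "Name: My image 5 ",
]

pattern = re.compile(r"Name:(.*)")

# Roughly what re() does: collect the captured group from every string.
all_matches = [m for text in texts for m in pattern.findall(text)]

# Roughly what re_first() does: the first captured group, or None if nothing matched.
first_match = all_matches[0] if all_matches else None

# A tighter pattern (illustrative) that keeps the surrounding spaces out of the group.
trimmed = [m for text in texts for m in re.findall(r"Name:\s*(.*?)\s*$", text)]

print(repr(first_match))  # ' My image 1 '
print(trimmed[0])         # My image 1
```

The captured group keeps the spaces around the name because `(.*)` matches everything after the colon; moving the whitespace outside the group, as in the last pattern, is one way to avoid a separate strip step.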

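# Get the text of the <a> tag whose href equals image2.html. The value is unquoted, so the . must be escaped as \. (a raw string keeps Python from touching the backslash)
In [18]: response.css(r'a[href=image2\.html] ::text').extract()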
Out[18]: ['Name: My image 2 ']
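
The backslash in that selector has to pass through two layers: Python's string literal rules first, then the CSS parser. In a normal Python string, `\.` is not a recognized escape sequence and recent Python versions warn about it. The sketch below compares three ways of writing the same selector as plain strings; no Scrapy call is made, and the variable names are illustrative.

```python
# A raw string keeps the backslash exactly as typed.
raw_selector = r"a[href=image2\.html] ::text"

# The same selector with the backslash doubled in a normal string.
escaped_selector = "a[href=image2\\.html] ::text"

# Quoting the attribute value means no CSS escape is needed at all.
quoted_selector = 'a[href="image2.html"] ::text'

print(raw_selector == escaped_selector)  # True
print(raw_selector)                      # a[href=image2\.html] ::text
```

Quoting the attribute value is usually the least error-prone of the three, since it avoids the backslash entirely.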


# Find the <a> tags whose href contains "image", then get the src attribute of the img tag inside each of them
In [19]: response.css('a[href*=image] img::attr(src)').extract()
Out[19]: 
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']
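
The `*=` in the selector above is one of several standard CSS attribute operators. As a rough illustration of their matching rules, the sketch below mirrors four of them with plain Python string tests on the attribute values from the sample page. It does not call Scrapy, and `ATTRIBUTE_TESTS` and `match` are hypothetical helper names introduced only for this example.

```python
# The href and src values from the sample page.
hrefs = [f"image{i}.html" for i in range(1, 6)]
srcs = [f"image{i}_thumb.jpg" for i in range(1, 6)]

# String tests that mirror the CSS attribute operators (illustrative mapping).
ATTRIBUTE_TESTS = {
    "=": lambda value, target: value == target,            # [attr=target]
    "*=": lambda value, target: target in value,           # [attr*=target]
    "^=": lambda value, target: value.startswith(target),  # [attr^=target]
    "$=": lambda value, target: value.endswith(target),    # [attr$=target]
}


def match(values, operator, target):
    """Return the values that satisfy the given attribute operator."""
    test = ATTRIBUTE_TESTS[operator]
    return [value for value in values if test(value, target)]


print(match(hrefs, "=", "image2.html"))         # ['image2.html']
print(len(match(hrefs, "*=", "image")))         # 5
print(len(match(srcs, "$=", "_thumb.jpg")))     # 5
```

Because every href on the sample page contains "image", the `*=` selector in the session matches all five links; an exact match with `=` narrows it down to one.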

Reference: https://doc.scrapy.org/en/latest/topics/selectors.html
