
Python Web Scraping ———07.31.201

Author: 腾腾4ever | Published 2017-08-01 12:01

Three different methods in data scraping: six.moves.urllib, BeautifulSoup, and RE-xpath

Just writing down what I've learned about web data scraping so that I won't forget everything and have to start all over the next time I need the technique.


To write code that works across Python 2.x and 3.x more easily, use the "six" library:

from six.moves import urllib

A typical request looks like this:

url = ...

hdr = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36', 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'}  # adjust to match your own browser and OS

req = urllib.request.Request(url, headers=hdr)

doc = urllib.request.urlopen(req).read()  # this gives you the raw page as bytes
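Since read() hands back bytes, you'll usually want to decode before feeding the page to a text-based parser. A minimal sketch, assuming the page is utf-8 encoded (check the response headers if unsure):

text = doc.decode('utf-8')  # assumption: the page is utf-8 encoded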

Now it all comes down to which parsing tool you prefer: BeautifulSoup, regular expressions… What I have tried is RE-xpath, RE pattern matching, and BeautifulSoup.

For RE-xpath:

IP_ADDRESS_PATH = '//td[2]/text()'

PORT_ADDRESS_PATH = '//tr/td[3]/text()'

You need to understand the html file and know how to construct the xpath towards the nodes you want to extract. So the above IP_ADDRESS_PATH is actually saying: starting from the root, find the text of every second td; PORT_ADDRESS_PATH takes the third td of every row.

IP_list = list(set(re.findall(IP_ADDRESS_PATH, doc)))

Then use the re.findall() method to find the contents of the nodes you want. set() removes duplicates and list() turns the result back into a list.

** This wasn't working for me, even though the xpath itself was constructed correctly (verified by an html tester). The real problem: re.findall() treats its first argument as a regular expression, not an xpath, so the path string never matches anything in the document.
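To actually evaluate the xpath, a library like lxml does the job. A minimal sketch, assuming lxml is installed and doc is the page fetched above:

from lxml import html

tree = html.fromstring(doc)                           # parse the raw page into an element tree
IP_list = list(set(tree.xpath(IP_ADDRESS_PATH)))      # the xpath is now actually evaluated
port_list = list(set(tree.xpath(PORT_ADDRESS_PATH)))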

For RE pattern matching:

prep = re.compile(r"""<tr\s.*>….\n....</tr>""", re.VERBOSE)

\s matches a whitespace character in the regex, \n matches a newline, and .* matches any run of characters. Together these summarize the pattern of the specific block you're interested in, which may be repeated many times in the page.

proxy_list = prep.findall(doc)  

proxy_list = list(set(proxy_list))

proxy_list now contains every block of html that matches the pattern, deduplicated.
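To pull the actual values out of each block, a more targeted pattern with capture groups is usually easier. A sketch assuming a hypothetical row layout, with text being the decoded page from above (your page's markup will differ):

import re

# hypothetical row layout: <td>1.2.3.4</td><td>8080</td>
pair_re = re.compile(r'<td>(\d{1,3}(?:\.\d{1,3}){3})</td>\s*<td>(\d+)</td>')
pairs = list(set(pair_re.findall(text)))  # [(ip, port), ...], deduplicated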

For beautifulsoup:

You still need six.moves.urllib to open the url.

req = urllib.request.Request(url, headers=hdr)

doc = urllib.request.urlopen(req).read()

from bs4 import BeautifulSoup as bs

soup = bs(doc, 'lxml')

So now you've opened up the html file and can start parsing with the beautiful BeautifulSoup.

list1 = [tr.find_all('td') for tr in soup.find_all('tr')]
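Each entry of list1 is the list of td cells in one row. Pulling the text out is then straightforward; a sketch assuming the same hypothetical column layout as above (IP in the second td, port in the third):

proxies = []
for tds in list1:
    if len(tds) >= 3:                       # header rows use <th>, so their td list is empty
        ip = tds[1].get_text(strip=True)    # second <td>
        port = tds[2].get_text(strip=True)  # third <td>
        proxies.append((ip, port))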

Okay, those are the three methods I've learned and tried for parsing html files and extracting data. What I ended up using is "bs"; just make sure the method under "bs" is find_all(), not findall().
