Week 1_Practice 1.4_Crawl Images

By Li_Tang | Published 2016-12-26 14:35

This time I am learning to write a crawler for a page that loads asynchronously and to download all of its images to my local PC. On such a page, the content is loaded dynamically, which means new content is loaded every time you scroll down and reach the bottom of the page.

Here is the website I am crawling: 

Image_Web

Here is the code:

Code

From this session, I have learned the following skills:

1) 

A clear picture of the overall code structure is critical to writing good code. The main program can be divided into three parts:

- import all the modules, either from third parties or from your own modules

- compose the main code, including all the functions that form the workflow, and wrap them in a main() function

- use "if __name__ == '__main__': main()" to start the whole program (see the sketch after this list)

2) 

A proxy, or at least a user agent, should be inserted into the crawling program to keep the crawling process running:

"r = requests.get(url, proxies=proxies, headers=headers)"

"headers" come from the html file, and proxies come in the form of "proxies = {"http": "127.0.0.1:8888}". For my program, I am using a public university VPN so I realized I don't need to use any proxy and user agent.

However, I ran into trouble: the crawling process was terminated after running for a while. I need to solve this problem in the future.
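
One guess, which I have not confirmed, is that the server throttles rapid requests; a minimal sketch of pausing between page requests (the URL pattern is the same placeholder used in the code below):

"""

import time
import requests

for page in range(1, 10):
    r = requests.get("base_url{}".format(page))  # placeholder URL pattern
    time.sleep(2)  # pause between requests so the server is not hammered

"""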

3) 

I learned how to download an image to the local PC. The critical code is:

"""

for page in range(1,10):

    url = "base_url{}".format(page)

    if r.status_code != 200:

        continue

    soup = BeautifulSoap(r.text, 'html.parser')

    imgs = soup.select('css selector')

    for img in imgs:

        src = soup.select('css selector)

        download(src)

"""

"""

def download(url):

    r = requests.get(url, proxies=proxies, headers=headers)

    if r.status_code != 200:

        return

    filename = url.split('?')[0].split('/')[-2]

    target = "./{}.jpg".format(filename)

    with open(target, 'wb') as fs:

        fs.write(r.content)

    print("%s => %s" %(url,target))

"""
