Week 1_Practice 1.4_Crawl Images

By Li_Tang | Published 2016-12-26 14:35

This time I am learning to write a crawler for a page that loads asynchronously and to download all of its images to my local PC. On such a page, the content is loaded dynamically, which means new content is loaded every time you scroll down and reach the bottom of the page.

Here is the website I am crawling: 

Image_Web

Here is the code:

Code

From this session, I have learned the following skills:

1) 

A clear picture of the overall code structure is critical to writing good code. The main program can be divided into three parts:

- import all the modules, either from third parties or from your own modules

- compose the main code, including all the functions that form the workflow, and wrap them in a main() function

- use "if __name__ == '__main__': main()" to start the whole program (see the sketch after this list)

2) 

A proxy, or at least a user agent, should be inserted into the crawling program to keep the crawling process running:

"r = requests.get(url, proxies=proxies, headers=headers)"

"headers" come from the html file, and proxies come in the form of "proxies = {"http": "127.0.0.1:8888}". For my program, I am using a public university VPN so I realized I don't need to use any proxy and user agent.

However, I ran into trouble: the crawling process was terminated after running for a while. I need to solve this problem in the future.
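
One guess, which I have not confirmed, is that the server throttles rapid requests; a minimal sketch of pausing between page requests (the URL pattern is the same placeholder used in the code below):

"""

import time
import requests

for page in range(1, 10):
    r = requests.get("base_url{}".format(page))  # placeholder URL pattern
    time.sleep(2)  # pause between requests so the server is not hammered

"""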

3) 

I learned how to download an image to the local PC. The critical code is:

"""

for page in range(1,10):

    url = "base_url{}".format(page)

    if r.status_code != 200:

        continue

    soup = BeautifulSoap(r.text, 'html.parser')

    imgs = soup.select('css selector')

    for img in imgs:

        src = soup.select('css selector)

        download(src)

"""

"""

def download(url):

    r = requests.get(url, proxies=proxies, headers=headers)

    if r.status_code != 200:

        return

    filename = url.split('?')[0].split('/')[-2]

    target = "./{}.jpg".format(filename)

    with open(target, 'wb') as fs:

        fs.write(r.content)

    print("%s => %s" %(url,target))

"""
