Scraping the Douban Movie Top 250 with Python

Author: 485b1aca799e | Published: 2017-06-18 10:45
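    • Scraping the Douban Movie Top 250 is slightly more involved than the Maoyan Top 100. The main tools used here are the BeautifulSoup HTML parsing library and regular expressions. In my opinion, XPath queries and regular expressions are the most powerful tools for scraping static pages.
    • If Chinese text comes out garbled in Python, consider converting explicitly where necessary, for example encoding with encode("UTF-8") or decoding with decode("GBK").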
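The encoding note above can be illustrated with a minimal sketch. It assumes Python 3, where `str.encode` produces bytes and `bytes.decode` turns them back into text; the sample title is only an illustration and is not fetched from the site.

```python
text = "肖申克的救赎"  # sample title, used only for illustration

utf8_bytes = text.encode("UTF-8")  # str -> bytes
gbk_bytes = text.encode("GBK")     # same text, different byte sequence

# Decoding with the codec that produced the bytes recovers the text.
roundtrip_utf8 = utf8_bytes.decode("UTF-8")
roundtrip_gbk = gbk_bytes.decode("GBK")

# Decoding with the wrong codec either fails or yields garbled characters.
try:
    garbled = gbk_bytes.decode("UTF-8")
except UnicodeDecodeError:
    garbled = None

print(roundtrip_utf8 == text, roundtrip_gbk == text)
```

Garbled output usually means the bytes were decoded with a codec other than the one used to encode them, so the fix is to find the page's actual encoding and not to re-encode the text blindly.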
    import requests
    from bs4 import BeautifulSoup
    import re
    import pandas as pd
    import time
    
    film_url="https://movie.douban.com/top250"
    
    url_set=["https://movie.douban.com/top250"]# first page
    url_setx=["https://movie.douban.com/top250"]# first page only, for testing
    
    for i in range(25,250,25):
        url_set.append(film_url+"?start="+str(i)+"&filter=")
    print(url_set)
    
    name=[]#film name
    director=[]
    star=[]#film star
    date=[]# film date
    score=[]#film score
    
    
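    # Assumption: Douban may reject the default requests User-Agent, so send a browser-like one.
    headers={"User-Agent":"Mozilla/5.0"}
    
    for url in url_set:
        html=requests.get(url,headers=headers).content
        x=BeautifulSoup(html,"html.parser")# name the parser explicitly to avoid a warning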
        
        # Poster <img> tags: the alt attribute holds the film title
        y=x.find_all(name="img",attrs={"class":"","src":re.compile(".*jpg$")})
        for i in y:
            name.append(i.attrs["alt"])
        
        # Info paragraphs hold the director, cast, and release year as plain text
        y1=x.find_all(name="p",attrs={"class":""})
        for i in y1:
            # i.text is already str in Python 3; encoding it to bytes would make
            # re.search fail when the pattern is a str
            n=re.search(pattern="导演: (.*)主(.*)",string=i.text)
            if n is not None and n.group(1) is not None:
                director.append(n.group(1).strip())
            else:
                director.append(None)
            if n is not None and n.group(2) is not None:
                # group(2) starts right after "主", so drop the remaining "演: " prefix
                tmp=re.sub(string=n.group(2),pattern="演: ",repl="")
                star.append(tmp)
            else:
                star.append(None)
            # The first four-digit run in the paragraph is taken as the release year
            m=re.search(pattern="[0-9]{4}",string=i.text)
            if m is not None:
                date.append(m.group(0))
            else:
                date.append(None)
                
        y2=x.find_all(name="span",attrs={"class":"rating_num","property":"v:average"})
        for i in y2:
            if i is not None:
                score.append(float(i.string))
            else:
                score.append(None)
        time.sleep(2)
    
    #combine the lists column-wise into a DataFrame (all lists must have the same length)
    data={"name":name,"director":director,"star":star,"date":date,"score":score}
    x=pd.DataFrame(data)
    print(x)
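The regular-expression step can be tried offline before hitting the site. The sketch below uses only the standard library and a hypothetical sample string modeled on one info paragraph, so the real page text may differ. The helper name `parse_info` is an assumption made for illustration, and the pattern is a slightly stricter variant of the one in the script: it requires the full "主演: " label.

```python
import re

# Hypothetical sample of the text inside one info paragraph; the live page may differ.
sample = (
    "\n  导演: 弗兰克·德拉邦特 Frank Darabont\xa0\xa0\xa0主演: 蒂姆·罗宾斯 Tim Robbins /...\n"
    "  1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n"
)

def parse_info(text):
    """Return (director, star, year) from one info paragraph; missing parts become None."""
    # Lazy group for the director, then optional whitespace (including non-breaking
    # spaces) before the cast label; "." stops at the end of the line.
    credits = re.search(r"导演: (.*?)\s*主演: (.*)", text)
    director = credits.group(1).strip() if credits else None
    star = credits.group(2).strip() if credits else None
    # The first four-digit run is taken as the release year.
    year = re.search(r"[0-9]{4}", text)
    return director, star, (year.group(0) if year else None)

print(parse_info(sample))
# ('弗兰克·德拉邦特 Frank Darabont', '蒂姆·罗宾斯 Tim Robbins /...', '1994')
```

Returning `None` for missing fields keeps every list the same length, which `pd.DataFrame` requires when it is built from a dict of lists.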
    
    
    
