Psst, a little secret: with a Python crawler you can grab every movie on a certain Tencent site, no VIP membership needed

Author: 编程新视野 | Published 2018-12-04 14:23

    A crawler, written in Python, that scrapes the full movie listing from Tencent Video.


    Of course, the road to learning Python has its difficulties; without good study material, it is hard to make progress.


    Without further ado, here is the code:


    <pre>
# -*- coding: utf-8 -*-
import re

from bs4 import BeautifulSoup


def get_pages(tag_url):
    """Return the number of list pages under one category URL."""
    tag_html = gethtml(tag_url)
    # The pager lives in: <div class="mod_pagenav" id="pager">
    soup = BeautifulSoup(tag_html, 'html.parser')
    div_page = soup.find_all('div', {'class': 'mod_pagenav', 'id': 'pager'})

    # Each page link looks like:
    # <a class="c_txt6" href="http://v.qq.com/list/1_2_-1_-1_1_0_24_20_0_-1_0.html"
    #    title="25"><span>25</span></a>
    re_pages = r'<a class=.+?><span>(.+?)</span></a>'
    p = re.compile(re_pages, re.DOTALL)
    pages = p.findall(str(div_page[0]))

    if len(pages) > 1:
        # The second-to-last <span> holds the last page number
        return pages[-2]
    else:
        return 1


def getmovielist(html):
    """Extract each movie list block from a category page."""
    # Target block: <ul class="mod_list_pic_130">
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('ul', {'class': 'mod_list_pic_130'})

    for div_html in divs:
        div_html = str(div_html).replace('\n', '')
        getmovie(div_html)
    </pre>
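    The code above calls a `gethtml()` helper, and the `__main__` block further down calls `gettags()`, but the post never shows either. Here is a minimal sketch of the two; the fetch uses only the standard library, and the `gettags` regex is an assumption based on the sample list URLs that appear in the code, so the real index-page markup may differ:

    ```python
import re
import urllib.request


def gethtml(url):
    """Fetch one page and return its decoded HTML."""
    req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode('utf-8', errors='ignore')


def gettags(html):
    """Pull the per-category list URLs out of the index page.

    The pattern below is a guess modeled on the sample pager links
    in the code; adjust it to the actual markup.
    """
    re_tags = r'<a href="(http://v\.qq\.com/list/.+?\.html)"'
    return re.findall(re_tags, html)
    ```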


    <pre>
import pymongo


def getmovie(html):
    """Parse the movie entries out of one list block and store them in MongoDB."""
    global NUM
    global m_type
    global m_site

    # Each entry looks like:
    # <li><a class="mod_poster_130" href="..." target="_blank" title="...">
    #   <img ...></li>
    re_movie = r'<li><a class="mod_poster_130" href="(.+?)" target="_blank" title="(.+?)"><img.+?</li>'
    p = re.compile(re_movie, re.DOTALL)
    movies = p.findall(html)
    if movies:
        conn = pymongo.MongoClient('localhost', 27017)
        movie_db = conn.dianying
        playlinks = movie_db.playlinks

        for movie in movies:
            NUM += 1
            print("%s : %d" % ("=" * 70, NUM))
            values = dict(
                movie_title=movie[1],
                movie_url=movie[0],
                movie_site=m_site,
                movie_type=m_type,
            )
            print(values)
            playlinks.insert_one(values)
            print("_" * 70)
    else:
        print("Not Find")


def getmovieinfo(url):
    """Fetch one movie page and pull the play links out of the album block."""
    html = gethtml(url)
    soup = BeautifulSoup(html, 'html.parser')

    # Target block: <div class="pack pack_album album_cover">
    divs = soup.find_all('div', {'class': 'pack pack_album album_cover'})

    # Links inside look like:
    # <a href="http://www.tudou.com/albumplay/9NyofXc_lHI/32JqhiKJykI.html"
    #    target="new" title="《血滴子》独家纪录片" wl="1"> </a>
    re_info = r'<a href="(.+?)" target="new" title="(.+?)" wl=".+?"> </a>'
    p_info = re.compile(re_info, re.DOTALL)
    m_info = p_info.findall(str(divs[0]))
    if not m_info:
        print("Not find movie info")

    return m_info


def insertdb(movieinfo):
    global conn
    movie_db = conn.dianying_at
    movies = movie_db.movies
    movies.insert_one(movieinfo)


if __name__ == "__main__":
    # One client shared by insertdb(); without this, its `global conn`
    # lookup would fail at runtime.
    conn = pymongo.MongoClient('localhost', 27017)

    tags_url = "http://v.qq.com/list/1_-1_-1_-1_1_0_0_20_0_-1_0.html"
    tags_html = gethtml(tags_url)
    tag_urls = gettags(tags_html)
    print(tag_urls)
    </pre>
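    The excerpt stops before the loop that walks from page 1 to the last page. The sample pager link in the comments (`..._1_0_24_20_0_-1_0.html` carrying `title="25"`) suggests the seventh underscore-separated field of the list URL is a zero-based page index. A hypothetical helper built on that assumption:

    ```python
def build_page_url(base_url, page):
    """Build the URL of one list page.

    Hypothetical: splices a zero-based page index into the 7th
    underscore-separated field of the list URL, which is what the
    sample pager link (page 25 -> field value 24) suggests.
    """
    prefix = "http://v.qq.com/list/"
    fields = base_url[len(prefix):-len(".html")].split("_")
    fields[6] = str(page)
    return prefix + "_".join(fields) + ".html"
    ```

    With that, the driver can call `getmovielist(gethtml(build_page_url(tag_url, page)))` for each page up to `get_pages(tag_url)`.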



        Original link: https://www.haomeiwen.com/subject/gpbmcqtx.html