Targeted Stock Data Crawler

Author: NiceBlueChai | Published 2017-11-22 15:59

    Crawler goals
    Fetch the names and trading data of every stock listed on the Shanghai and Shenzhen exchanges
    Save the results to a file

    Technical route
    requests-bs4-re

    Candidate data sites
    Sina Finance: http://finance.sina.com.cn/stock/
    Baidu Stock: https://gupiao.baidu.com/stock
    Selection criteria
    The stock data is embedded in the HTML page, not generated dynamically by JavaScript, and crawling is not restricted by a robots protocol

    Choosing the data sites
    Stock list:
    East Money: http://quote.eastmoney.com/stocklist.html
    Per-stock data:
    Baidu Stock: https://gupiao.baidu.com/stock/
    A single stock: https://gupiao.baidu.com/stock/sz002439.html

    Program structure
    Fetch the stock list from East Money
    Look up each code from the list on Baidu Stock to fetch its data
    Save the collected data to a file


    Per-stock data is kept as key-value pairs
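A minimal sketch of what one stock's record looks like as key-value pairs (the field labels and values below are hypothetical, just to show the layout the crawler builds up):

```python
# Hypothetical per-stock record; keys are the field labels scraped
# from the page, values are the matching figures (all made up here)
keys = ["open", "high", "volume"]
vals = ["10.00", "10.50", "123000"]

infoDict = {"stock name": "Example Co."}  # hypothetical name field
infoDict.update(dict(zip(keys, vals)))    # pair labels with values

print(infoDict["high"])  # -> 10.50
```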

    main()

    import requests, re
    from bs4 import BeautifulSoup
    import traceback
    
    def getHTMLText(url):
        return ""
    
    def getStockList(lst,stockURL):
        return ""
    
    def getStockInfo(lst,stockURL,fpath):
        return ""
    
    def main():
        stock_list_url="http://quote.eastmoney.com/stocklist.html"
        stock_info_url="https://gupiao.baidu.com/stock/"
        output_file="D:/BaiduStockInfo.txt"
        slist=[]
        getStockList(slist,stock_list_url)
        getStockInfo(slist,stock_info_url,output_file)
    
    main()
    

    getHTMLText()

    def getHTMLText(url):
        try:
            r=requests.get(url)
            r.raise_for_status()
            r.encoding=r.apparent_encoding
            return r.text
        except:
            return ""
    

    getStockList()

    def getStockList(lst,stockURL):
        html=getHTMLText(stockURL)
        soup=BeautifulSoup(html,'html.parser')
        a=soup.find_all('a')
        for i in a:
            try:
                href=i.attrs['href']
                lst.append(re.findall(r"[s][zh]\d{6}",href)[0])
            except:
                continue       
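As a quick sanity check on the regular expression above, it pulls a code such as sz002439 out of a link's href (the href below is a made-up example of the kind found on the list page):

```python
import re

# Hypothetical href of the kind found on the East Money list page
href = "http://quote.eastmoney.com/sz002439.html"

# [s][zh]\d{6} matches "s", then "z" or "h", then six digits
codes = re.findall(r"[s][zh]\d{6}", href)
print(codes)  # -> ['sz002439']
```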
    

    getStockInfo()

    def getStockInfo(lst, stockURL, fpath):
        for stock in lst:
            url = stockURL + stock + ".html"
            html = getHTMLText(url)
            try:
                if html=="":
                    continue
                infoDict = {}
                soup = BeautifulSoup(html, 'html.parser')
                stockInfo = soup.find('div',attrs={'class':'stock-bets'})
     
                name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
                infoDict.update({'股票名称': name.text.split()[0]})
                 
                keyList = stockInfo.find_all('dt')
                valueList = stockInfo.find_all('dd')
                for i in range(len(keyList)):
                    key = keyList[i].text
                    val = valueList[i].text
                    infoDict[key] = val
                 
                with open(fpath, 'a', encoding='utf-8') as f:
                    f.write( str(infoDict) + '\n' )
            except:
                traceback.print_exc()
                continue
    

    Refining the example: improving the user experience

    Speed: optimize encoding detection
    r.apparent_encoding has to analyze the response text and runs slowly, so the encoding can be worked out by hand once and passed in instead

    def getHTMLText(url, code="utf-8"):
        try:
            r = requests.get(url)
            r.raise_for_status()
            r.encoding = code
            return r.text
        except:
            return ""
     
    def getStockList(lst, stockURL):
        html = getHTMLText(stockURL, "GB2312")
        soup = BeautifulSoup(html, 'html.parser') 
        a = soup.find_all('a')
        for i in a:
            try:
                href = i.attrs['href']
                lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
            except:
                continue
    

    Better experience: add a dynamic progress display
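A dynamic progress display is commonly done by printing a percentage prefixed with a carriage return `\r`, so that each update overwrites the previous line instead of scrolling. A helper like the hypothetical `progress_line` below (a sketch, not the author's exact code) builds that string; getStockInfo could call it once per stock:

```python
def progress_line(done, total):
    # \r moves the cursor back to the start of the line, so printing
    # this string with end="" overwrites the previous percentage
    return "\rProgress: {:.2f}%".format(done * 100 / total)

# Usage inside the crawl loop (sketch):
# count = 0
# for stock in lst:
#     ...
#     count += 1
#     print(progress_line(count, len(lst)), end="")

print(progress_line(25, 50), end="")  # prints: Progress: 50.00%
```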
