selenium can drive a browser to open a page and run a search, and pyquery can parse the page source. The basic approach is as follows:
- First, open https://www.jingdong.com in Chrome
- Type 美食 into the search box
- Click the search button
- Parse the page source and extract the product information on this page
- Then click through to the next page
- Parse the source again and extract that page's product information
- Build a loop that repeats the steps above
Use selenium to perform the search in the browser (a text sketch follows the screenshots):
![](https://img.haomeiwen.com/i16825884/dfebce6315a42198.png)
![](https://img.haomeiwen.com/i16825884/17830f67a639af3b.png)
![](https://img.haomeiwen.com/i16825884/4e5080d6dee1b7ad.png)
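In case the screenshots don't render, here is a minimal sketch of the search step. It mirrors the `search()` function in the complete script below and uses the same selectors (`#key` for the search box, the button under `#search`):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)

browser.get('https://www.jingdong.com')
# Wait for the search box, type the keyword, then click the search button
box = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#key')))
box.send_keys('美食')
button = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, '#search > div > div.form > button')))
button.click()
```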
Up to this point, the search works fine.
Still with selenium, perform the page turn (again, a text sketch follows the screenshot):
![](https://img.haomeiwen.com/i16825884/6521e93bdc0b14e0.png)
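A condensed sketch of the page turn, reusing the `browser` and `wait` objects from the sketch above. JD's pagination bar has a "go to page" input box, so the code types the target page number, clicks the jump button, and then waits until the highlighted current-page marker matches, which confirms the page actually changed. The name `go_to_page` here is illustrative; the full script below calls it `next_page`:

```python
def go_to_page(page_number):
    # Type the target page into the jump box at the bottom of the results
    box = wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
    box.clear()
    box.send_keys(str(page_number))
    wait.until(EC.element_to_be_clickable(
        (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a'))).click()
    # The current-page marker updating confirms the jump succeeded
    wait.until(EC.text_to_be_present_in_element(
        (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'),
        str(page_number)))
```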
Use pyquery to extract the product information (a sketch follows the screenshots):
![](https://img.haomeiwen.com/i16825884/b9104142329a9870.png)
![](https://img.haomeiwen.com/i16825884/e7f04ceb0907c1e2.png)
![](https://img.haomeiwen.com/i16825884/2d08ba6cc46c9135.png)
![](https://img.haomeiwen.com/i16825884/bc0c74bb0275e72a.png)
![](https://img.haomeiwen.com/i16825884/f8a4b54bddd007e2.png)
![](https://img.haomeiwen.com/i16825884/ca4ea0cfad314574.png)
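The extraction itself mirrors the `get_products` function in the full script: feed the page source to pyquery and iterate over the `.gl-item` nodes. In the real script the HTML comes from `browser.page_source`; the stand-in markup here is made up so the example runs on its own:

```python
from pyquery import PyQuery as pq

# A stripped-down stand-in for browser.page_source; structure follows JD's listing
html = '''
<div id="J_goodsList"><ul class="gl-warp">
  <li class="gl-item">
    <div class="p-img"><img data-lazy-img="//img.example.com/food.jpg"></div>
    <div class="p-price">29.90</div>
    <div class="p-name">Sample snack</div>
  </li>
</ul></div>'''

doc = pq(html, parser='html')
for item in doc('#J_goodsList .gl-warp .gl-item').items():
    # Lazily loaded images keep the real URL in data-lazy-img rather than src
    image = (item.find('.p-img img').attr('src')
             or item.find('.p-img img').attr('data-lazy-img'))
    print({'image': image,
           'price': item.find('.p-price').text(),
           'title': item.find('.p-name').text()})
```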
The next step is to wrap this in a loop and store the results in MongoDB.
Here is the complete code (the `config` module it imports is sketched after the listing):
```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq
from config import *  # provides MONGO_URL, MONGO_DB, MONGO_TABLE (see below)
import pymongo

client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]

browser = webdriver.Chrome()
wait = WebDriverWait(browser, 10)


def search():
    """Open the JD home page, search for 美食, and return the total page count."""
    print('Searching...')
    try:
        url = 'https://www.jingdong.com'
        browser.get(url)
        # Wait for the search box and the search button to be ready
        input = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#key')))
        submit = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#search > div > div.form > button')))
        input.send_keys('美食')
        submit.click()
        # The <b> element at the bottom of the results holds the total page count
        total = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b')))
        get_products()
        return int(total.text)
    except TimeoutException:
        # Retry the whole search on timeout
        return search()


def next_page(page_number):
    """Jump to the given page via the page-number input box at the bottom."""
    print('Turning to page', page_number)
    try:
        input = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input')))
        submit = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a')))
        input.clear()
        input.send_keys(page_number)
        submit.click()
        # Wait until the highlighted page number matches the one we asked for
        wait.until(EC.text_to_be_present_in_element(
            (By.CSS_SELECTOR, '#J_bottomPage > span.p-num > a.curr'),
            str(page_number)))
        get_products()
    except TimeoutException:
        next_page(page_number)


def get_products():
    """Parse the current page with pyquery and save each product to MongoDB."""
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '#J_goodsList .gl-warp .gl-item')))
    html = browser.page_source
    doc = pq(html, parser='html')  # force the html parser, see the note below
    items = doc('#J_goodsList .gl-warp .gl-item').items()
    for item in items:
        # Lazily loaded images keep the real URL in data-lazy-img instead of src
        if item.find('.p-img img').attr('src'):
            image = item.find('.p-img img').attr('src')
        else:
            image = item.find('.p-img img').attr('data-lazy-img')
        product = {
            'image': image,
            'price': item.find('.p-price').text(),
            'title': item.find('.p-name').text(),
            'commit': item.find('.p-commit').text(),
            'shop': item.find('.p-shop').text()
        }
        print(product)
        save_to_mongo(product)


def save_to_mongo(result):
    try:
        # insert_one() replaces the insert() method deprecated in pymongo
        if db[MONGO_TABLE].insert_one(result):
            print('Saved to MongoDB')
    except Exception:
        print('Failed to save to MongoDB', result)


def main():
    total = search()
    for i in range(2, total + 1):
        next_page(i)
    browser.close()


if __name__ == '__main__':
    main()
```
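The script does `from config import *`, but the config module itself isn't shown in the post. It only needs to define the three MongoDB constants used above, so a minimal `config.py` would look something like this (the host and names are placeholders; adjust them to your own setup):

```python
# config.py: assumed contents, values are placeholders
MONGO_URL = 'localhost'   # or a full mongodb:// connection string
MONGO_DB = 'jingdong'     # database name
MONGO_TABLE = 'product'   # collection name
```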
![](https://img.haomeiwen.com/i16825884/62c2d53f93d8dc88.png)
One pitfall to watch out for:
![](https://img.haomeiwen.com/i16825884/2fd9da66f486d2f7.png)
pyquery defaults to an xhtml parser, which can garble text or fail to return attributes, so remember to pass `parser='html'`.
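For reference, forcing lxml's HTML parser is a single keyword argument when building the document. A minimal standalone example (the fragment and URL are made up; in the script the input is `browser.page_source`):

```python
from pyquery import PyQuery as pq

# Made-up fragment similar to JD's lazy-loaded image markup
html = '<div class="p-img"><img data-lazy-img="//img.example.com/a.jpg"></div>'

doc = pq(html, parser='html')  # explicitly use lxml's HTML parser
print(doc('.p-img img').attr('data-lazy-img'))  # -> //img.example.com/a.jpg
```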
Thanks for reading. Code has to be written regularly; after a month without practice I start to forget. Keep at it, everyone!