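I am trying to fetch Amazon prices using PhantomJS and Python. I want to parse the page with Beautiful Soup to get the new and used prices for a book. The problem is that when I pass in the page source of the request made with PhantomJS, the prices are just 0,00. The code is this simple test.
I can't tell whether Amazon has measures in place to prevent price scraping or whether I am doing something wrong, because when I try other, simpler pages I can get the data I want.
PS: I am in a country where the Amazon API is not supported, which is why I need the scraper.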
import re
import urlparse

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

link = 'http://www.amazon.com/gp/offer-listing/1119998956/ref=dp_olp_new?ie=UTF8&condition=new'  # 'http://www.amazon.com/gp/product/1119998956'


class AmzonScraper(object):
    def __init__(self):
        self.driver = webdriver.PhantomJS()
        self.driver.set_window_size(1120, 550)

    def scrape_prices(self):
        # Load the offer listing and parse the rendered page source.
        self.driver.get(link)
        s = BeautifulSoup(self.driver.page_source)
        return s

    def scrape(self):
        source = self.scrape_prices()
        print(source)
        self.driver.quit()


if __name__ == '__main__':
    # The class defined above is AmzonScraper (the original called TaleoJobScraper, which is undefined here).
    scraper = AmzonScraper()
    scraper.scrape()
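Please refer to the following approach:
First, following @Nick Bailey's comment, study the Terms of Use and make sure there are no violations on your side.
To solve it, you need to tweak the PhantomJS desired capabilities: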
caps = webdriver.DesiredCapabilities.PHANTOMJS
caps["phantomjs.page.settings.userAgent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 (KHTML, like Gecko) Chrome/15.0.87"
self.driver = webdriver.PhantomJS(desired_capabilities=caps)
self.driver.maximize_window()
And, to make it bulletproof, you can write a custom Expected Condition and wait for the price to become non-zero:
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class wait_for_price(object):
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        try:
            # Uses the public find_element rather than the private EC._find_element helper.
            element_text = driver.find_element(*self.locator).text.strip()
            return element_text != "0,00"
        except StaleElementReferenceException:
            # The element was re-rendered between lookup and read; try again on the next poll.
            return False
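To see why this works: `WebDriverWait.until` calls the condition with the driver repeatedly until it returns a truthy value or the timeout expires. The sketch below imitates that loop in plain Python so it can run without a browser. `FakeElement`, `FakeDriver`, `price_is_nonzero`, and `poll_until` are hypothetical stand-ins for illustration, not Selenium APIs, and the real implementation differs in detail.

```python
import time


class FakeElement(object):
    """Hypothetical stand-in for a WebElement; only exposes .text."""
    def __init__(self, text):
        self.text = text


class FakeDriver(object):
    """Hypothetical driver whose price reads "0,00" for the first few lookups."""
    def __init__(self, texts):
        self._texts = list(texts)
        self.lookups = 0

    def find_element(self, by, value):
        # Serve the next queued text; repeat the last one once the queue runs out.
        index = min(self.lookups, len(self._texts) - 1)
        self.lookups += 1
        return FakeElement(self._texts[index])


class price_is_nonzero(object):
    """Same idea as wait_for_price, minus the Selenium-specific exception handling."""
    def __init__(self, locator):
        self.locator = locator

    def __call__(self, driver):
        text = driver.find_element(*self.locator).text.strip()
        return text != "0,00"


def poll_until(driver, condition, timeout=2.0, interval=0.01):
    """Simplified imitation of WebDriverWait(driver, timeout).until(condition)."""
    deadline = time.time() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.time() > deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(interval)


# The placeholder price is seen twice before the real one shows up.
driver = FakeDriver([" 0,00 ", "0,00", "$12.34"])
poll_until(driver, price_is_nonzero(("class name", "olpOfferPrice")))
print(driver.lookups)
```

The condition itself stays tiny because the waiting, retrying, and timing out all live in the polling loop; the condition only answers "is the price ready yet?" for a single look at the page.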
Usage:
    def scrape_prices(self):
        self.driver.get(link)
        # Block (for up to 200 seconds) until the first offer price is no longer "0,00".
        WebDriverWait(self.driver, 200).until(wait_for_price((By.CLASS_NAME, "olpOfferPrice")))
        s = BeautifulSoup(self.driver.page_source)
        return s
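Once the page source holds real prices, the remaining step is turning the text of each matched element into a number. Here is a minimal sketch, assuming the price strings look like `$12.34`, `$1,234.56`, or `0,00`. The helper `parse_price` and the sample strings are hypothetical; in the scraper the strings would come from something like `[tag.get_text() for tag in s.find_all(class_="olpOfferPrice")]`.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Convert a price string to Decimal; return None if no digits are found.

    Assumption: a separator followed by exactly two trailing digits is the
    decimal mark; any other comma or dot is a thousands separator.
    """
    match = re.search(r"\d[\d.,]*", text)
    if match is None:
        return None
    number = match.group(0).rstrip(".,")
    decimal_mark = re.search(r"[.,](\d{2})$", number)
    if decimal_mark:
        # Strip thousands separators from the whole part, then re-attach the cents.
        whole = re.sub(r"[.,]", "", number[:decimal_mark.start()])
        return Decimal(whole + "." + decimal_mark.group(1))
    return Decimal(re.sub(r"[.,]", "", number))


samples = ["  $12.34 ", "$1,234.56", "0,00", "no price"]
prices = [parse_price(sample) for sample in samples]
```

From there, dropping `None` and zero values and taking `min()` of the rest would give the cheapest offer on the page.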
