电子商务网站使用验证码保护产品页面,以防止自动价格抓取。 CaptchaAI 可让您构建可靠的价格监控系统,自动应对这些挑战。
问题
价格监控机器人在主要平台上面临验证码:
| 平台 | 验证码类型 | 扳机 |
|---|---|---|
| 亚马逊 | 图像验证码、reCAPTCHA | 高请求量 |
| 沃尔玛 | Cloudflare Turnstile | 机器人检测 |
| 易趣 | reCAPTCHA v2 | 可疑的图案 |
| 百思买 | Cloudflare 验证流程 | 所有自动交通 |
| Shopify 商店 | reCAPTCHA v3 | 因商店配置而异 |
如果没有验证码处理,您的监控管道会默默地失败,从而在数据中留下定价差距。
建筑学
Scheduler (every 30 min)
→ URL Queue
→ Scraper Workers (5-10 concurrent)
→ Fetch page
→ CAPTCHA detected?
→ Yes → CaptchaAI → Solve → Retry page
→ No → Parse prices
→ Store in database
→ Alert on price changes
执行
价格监控器(Python)
import requests
import time
import re
import json
import os
from datetime import datetime
API_KEY = os.environ["CAPTCHAAI_API_KEY"]
BASE_URL = "https://ocr.captchaai.com"
def solve_captcha(method, params):
params["key"] = API_KEY
params["method"] = method
resp = requests.get(f"{BASE_URL}/in.php", params=params)
if not resp.text.startswith("OK|"):
raise Exception(f"Submit failed: {resp.text}")
task_id = resp.text.split("|")[1]
for _ in range(60):
time.sleep(5)
result = requests.get(f"{BASE_URL}/res.php", params={
"key": API_KEY, "action": "get", "id": task_id,
})
if result.text == "CAPCHA_NOT_READY":
continue
if result.text.startswith("OK|"):
return result.text.split("|", 1)[1]
raise Exception(f"Solve failed: {result.text}")
raise TimeoutError("CAPTCHA solve timed out")
def fetch_with_captcha(url, session):
"""Fetch a page, solving CAPTCHAs if encountered."""
resp = session.get(url)
# Check for reCAPTCHA
match = re.search(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', resp.text)
if match:
site_key = match.group(1)
token = solve_captcha("userrecaptcha", {
"googlekey": site_key,
"pageurl": url,
})
resp = session.post(url, data={"g-recaptcha-response": token})
# Check for Turnstile
match = re.search(
r'class="cf-turnstile"[^>]*data-sitekey=["\']([^"\']+)', resp.text
)
if match:
site_key = match.group(1)
token = solve_captcha("turnstile", {
"sitekey": site_key,
"pageurl": url,
})
resp = session.post(url, data={"cf-turnstile-response": token})
return resp
def extract_price(html, selectors):
"""Extract price from HTML using regex patterns."""
for pattern in selectors:
match = re.search(pattern, html)
if match:
price_str = match.group(1).replace(",", "")
return float(price_str)
return None
def monitor_prices(products):
"""Monitor prices for a list of products."""
session = requests.Session()
session.headers["User-Agent"] = (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 Chrome/120.0.0.0"
)
results = []
for product in products:
try:
resp = fetch_with_captcha(product["url"], session)
price = extract_price(resp.text, product["selectors"])
results.append({
"name": product["name"],
"url": product["url"],
"price": price,
"timestamp": datetime.utcnow().isoformat(),
"status": "ok",
})
print(f" {product['name']}: ${price}")
except Exception as e:
results.append({
"name": product["name"],
"url": product["url"],
"price": None,
"timestamp": datetime.utcnow().isoformat(),
"status": f"error: {e}",
})
print(f" {product['name']}: ERROR - {e}")
return results
# Define products to monitor
products = [
{
"name": "Wireless Headphones",
"url": "https://example.com/product/headphones",
"selectors": [
r'class="price"[^>]*>\$?([\d,]+\.?\d*)',
r'itemprop="price" content="([\d.]+)"',
],
},
{
"name": "Bluetooth Speaker",
"url": "https://example.com/product/speaker",
"selectors": [
r'class="price"[^>]*>\$?([\d,]+\.?\d*)',
],
},
]
print("Starting price check...")
results = monitor_prices(products)
# Save results
with open("prices.json", "w") as f:
json.dump(results, f, indent=2)
Node.js 实现
const axios = require("axios");
const cheerio = require("cheerio");
const API_KEY = process.env.CAPTCHAAI_API_KEY;
async function solveCaptcha(method, params) {
params.key = API_KEY;
params.method = method;
const submit = await axios.get("https://ocr.captchaai.com/in.php", {
params,
});
const taskId = String(submit.data).split("|")[1];
for (let i = 0; i < 60; i++) {
await new Promise((r) => setTimeout(r, 5000));
const poll = await axios.get("https://ocr.captchaai.com/res.php", {
params: { key: API_KEY, action: "get", id: taskId },
});
const text = String(poll.data);
if (text === "CAPCHA_NOT_READY") continue;
if (text.startsWith("OK|")) return text.split("|").slice(1).join("|");
throw new Error(text);
}
throw new Error("Timeout");
}
async function monitorPrice(url) {
const resp = await axios.get(url);
const $ = cheerio.load(resp.data);
// Check for reCAPTCHA
const siteKey = $(".g-recaptcha").attr("data-sitekey");
if (siteKey) {
const token = await solveCaptcha("userrecaptcha", {
googlekey: siteKey,
pageurl: url,
});
// Re-fetch with token
const formResp = await axios.post(url, { "g-recaptcha-response": token });
return cheerio.load(formResp.data);
}
const price = $('[itemprop="price"]').attr("content") || $(".price").text();
return parseFloat(price.replace(/[^0-9.]/g, ""));
}
调度
使用 cron 每 30 分钟运行一次检查:
# crontab -e
*/30 * * * * cd /opt/monitor && python price_monitor.py >> /var/log/prices.log 2>&1
或者使用Python的schedule库:
import schedule
schedule.every(30).minutes.do(lambda: monitor_prices(products))
while True:
schedule.run_pending()
time.sleep(60)
成本估算
| 体积 | 验证码/Day | 预计。每日费用 |
|---|---|---|
| 50 个产品,每 30 分钟一次 | 〜2,400 | 〜$2-5 |
| 200 个产品,每 15 分钟一次 | 〜19,200 | 〜$15-30 |
| 每小时 1000 个产品 | 〜24,000 | 〜$20-40 |
并非所有页面加载都会触发验证码。实际成本可能会低 50-70%。
常问问题
如何检测价格变化?
将当前价格与存储值进行比较。对 >5% 的变化发出警报有助于过滤微小波动中的噪音。
即使使用验证码解决,我也会被阻止吗?
QA 测试会话和用户代理以最大程度地减少阻塞。随着时间的推移进行空间请求,而不是突发获取。
我可以监控多种货币的价格吗?
是的。解析价格旁边的货币符号。 CaptchaAI 在全球范围内工作,无论目标站点的区域设置如何。