应用场景

通过验证码处理进行电子商务价格监控

电子商务网站使用验证码保护产品页面,以防止自动价格抓取。 CaptchaAI 可让您构建可靠的价格监控系统,自动应对这些挑战。

问题

价格监控机器人在主要平台上面临验证码:

平台 验证码类型 扳机
亚马逊 图像验证码、reCAPTCHA 高请求量
沃尔玛 Cloudflare Turnstile 机器人检测
易趣 reCAPTCHA v2 可疑的图案
百思买 Cloudflare 验证流程 所有自动交通
Shopify 商店 reCAPTCHA v3 因商店配置而异

如果没有验证码处理,您的监控管道会默默地失败,从而在数据中留下定价差距。

建筑学

Scheduler (every 30 min)
    → URL Queue
        → Scraper Workers (5-10 concurrent)
            → Fetch page
            → CAPTCHA detected?
                → Yes → CaptchaAI → Solve → Retry page
                → No → Parse prices
            → Store in database
        → Alert on price changes

执行

价格监控器(Python)

import requests
import time
import re
import json
import os
from datetime import datetime

API_KEY = os.environ["CAPTCHAAI_API_KEY"]
BASE_URL = "https://ocr.captchaai.com"


def solve_captcha(method, params):
    params["key"] = API_KEY
    params["method"] = method

    resp = requests.get(f"{BASE_URL}/in.php", params=params)
    if not resp.text.startswith("OK|"):
        raise Exception(f"Submit failed: {resp.text}")

    task_id = resp.text.split("|")[1]

    for _ in range(60):
        time.sleep(5)
        result = requests.get(f"{BASE_URL}/res.php", params={
            "key": API_KEY, "action": "get", "id": task_id,
        })
        if result.text == "CAPCHA_NOT_READY":
            continue
        if result.text.startswith("OK|"):
            return result.text.split("|", 1)[1]
        raise Exception(f"Solve failed: {result.text}")

    raise TimeoutError("CAPTCHA solve timed out")


def fetch_with_captcha(url, session):
    """Fetch a page, solving CAPTCHAs if encountered."""
    resp = session.get(url)

    # Check for reCAPTCHA
    match = re.search(r'data-sitekey=["\']([A-Za-z0-9_-]+)["\']', resp.text)
    if match:
        site_key = match.group(1)
        token = solve_captcha("userrecaptcha", {
            "googlekey": site_key,
            "pageurl": url,
        })
        resp = session.post(url, data={"g-recaptcha-response": token})

    # Check for Turnstile
    match = re.search(
        r'class="cf-turnstile"[^>]*data-sitekey=["\']([^"\']+)', resp.text
    )
    if match:
        site_key = match.group(1)
        token = solve_captcha("turnstile", {
            "sitekey": site_key,
            "pageurl": url,
        })
        resp = session.post(url, data={"cf-turnstile-response": token})

    return resp


def extract_price(html, selectors):
    """Extract price from HTML using regex patterns."""
    for pattern in selectors:
        match = re.search(pattern, html)
        if match:
            price_str = match.group(1).replace(",", "")
            return float(price_str)
    return None


def monitor_prices(products):
    """Monitor prices for a list of products."""
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 Chrome/120.0.0.0"
    )

    results = []
    for product in products:
        try:
            resp = fetch_with_captcha(product["url"], session)
            price = extract_price(resp.text, product["selectors"])

            results.append({
                "name": product["name"],
                "url": product["url"],
                "price": price,
                "timestamp": datetime.utcnow().isoformat(),
                "status": "ok",
            })
            print(f"  {product['name']}: ${price}")

        except Exception as e:
            results.append({
                "name": product["name"],
                "url": product["url"],
                "price": None,
                "timestamp": datetime.utcnow().isoformat(),
                "status": f"error: {e}",
            })
            print(f"  {product['name']}: ERROR - {e}")

    return results


# Define products to monitor
products = [
    {
        "name": "Wireless Headphones",
        "url": "https://example.com/product/headphones",
        "selectors": [
            r'class="price"[^>]*>\$?([\d,]+\.?\d*)',
            r'itemprop="price" content="([\d.]+)"',
        ],
    },
    {
        "name": "Bluetooth Speaker",
        "url": "https://example.com/product/speaker",
        "selectors": [
            r'class="price"[^>]*>\$?([\d,]+\.?\d*)',
        ],
    },
]

print("Starting price check...")
results = monitor_prices(products)

# Save results
with open("prices.json", "w") as f:
    json.dump(results, f, indent=2)

Node.js 实现

const axios = require("axios");
const cheerio = require("cheerio");

const API_KEY = process.env.CAPTCHAAI_API_KEY;

async function solveCaptcha(method, params) {
  params.key = API_KEY;
  params.method = method;

  const submit = await axios.get("https://ocr.captchaai.com/in.php", {
    params,
  });
  const taskId = String(submit.data).split("|")[1];

  for (let i = 0; i < 60; i++) {
    await new Promise((r) => setTimeout(r, 5000));
    const poll = await axios.get("https://ocr.captchaai.com/res.php", {
      params: { key: API_KEY, action: "get", id: taskId },
    });
    const text = String(poll.data);
    if (text === "CAPCHA_NOT_READY") continue;
    if (text.startsWith("OK|")) return text.split("|").slice(1).join("|");
    throw new Error(text);
  }
  throw new Error("Timeout");
}

async function monitorPrice(url) {
  const resp = await axios.get(url);
  const $ = cheerio.load(resp.data);

  // Check for reCAPTCHA
  const siteKey = $(".g-recaptcha").attr("data-sitekey");
  if (siteKey) {
    const token = await solveCaptcha("userrecaptcha", {
      googlekey: siteKey,
      pageurl: url,
    });
    // Re-fetch with token
    const formResp = await axios.post(url, { "g-recaptcha-response": token });
    return cheerio.load(formResp.data);
  }

  const price = $('[itemprop="price"]').attr("content") || $(".price").text();
  return parseFloat(price.replace(/[^0-9.]/g, ""));
}

调度

使用 cron 每 30 分钟运行一次检查:

# crontab -e
*/30 * * * * cd /opt/monitor && python price_monitor.py >> /var/log/prices.log 2>&1

或者使用Python的schedule库:

import schedule

schedule.every(30).minutes.do(lambda: monitor_prices(products))

while True:
    schedule.run_pending()
    time.sleep(60)

成本估算

体积 验证码/Day 预计。每日费用
50 个产品,每 30 分钟一次 〜2,400 〜$2-5
200 个产品,每 15 分钟一次 〜19,200 〜$15-30
每小时 1000 个产品 〜24,000 〜$20-40

并非所有页面加载都会触发验证码。实际成本可能会低 50-70%。

常问问题

如何检测价格变化?

将当前价格与存储值进行比较。对 >5% 的变化发出警报有助于过滤微小波动中的噪音。

即使使用验证码解决,我也会被阻止吗?

QA 测试会话和用户代理以最大程度地减少阻塞。随着时间的推移进行空间请求,而不是突发获取。

我可以监控多种货币的价格吗?

是的。解析价格旁边的货币符号。 CaptchaAI 在全球范围内工作,无论目标站点的区域设置如何。

相关指南

该文章已禁用评论。