您的验证码解决工作器看起来还活着——进程正在运行——但它在 10 分钟内没有成功解决任务。 API 密钥已耗尽,或者工作人员陷入循环。如果没有健康检查,您的编排器就会不断地将工作路由给已死亡的工作人员。运行状况端点让负载均衡器和 Kubernetes 检测问题并重新路由。
三种类型的健康检查
| 查看 | 问题 | 失败响应 |
|---|---|---|
| 活跃度 | 进程是否正在运行? | 重启容器 |
| 准备情况 | 可以接受工作吗? | 停止路由流量 |
| 依赖 | 上游服务还好吗? | 优雅地降级 |
Python:Flask 健康端点
import requests
import time
import threading
from flask import Flask, jsonify
from dataclasses import dataclass, field
API_KEY = "YOUR_API_KEY"
RESULT_URL = "https://ocr.captchaai.com/res.php"
app = Flask(__name__)
@dataclass
class WorkerHealth:
"""Tracks worker health metrics."""
started_at: float = field(default_factory=time.monotonic)
last_solve_at: float = 0.0
total_solved: int = 0
total_failed: int = 0
consecutive_failures: int = 0
balance: float | None = None
balance_checked_at: float = 0.0
_lock: threading.Lock = field(default_factory=threading.Lock)
def record_success(self):
with self._lock:
self.total_solved += 1
self.last_solve_at = time.monotonic()
self.consecutive_failures = 0
def record_failure(self):
with self._lock:
self.total_failed += 1
self.consecutive_failures += 1
@property
def success_rate(self) -> float:
total = self.total_solved + self.total_failed
return self.total_solved / total if total > 0 else 1.0
@property
def seconds_since_last_solve(self) -> float:
if self.last_solve_at == 0:
return time.monotonic() - self.started_at
return time.monotonic() - self.last_solve_at
health = WorkerHealth()
# Thresholds
MAX_CONSECUTIVE_FAILURES = 10
MAX_SECONDS_WITHOUT_SOLVE = 600 # 10 minutes
MIN_BALANCE = 1.0
def check_balance() -> float | None:
"""Check CaptchaAI balance."""
now = time.monotonic()
# Cache balance for 60 seconds
if health.balance is not None and now - health.balance_checked_at < 60:
return health.balance
try:
resp = requests.get(RESULT_URL, params={
"key": API_KEY, "action": "getbalance", "json": 1,
}, timeout=10).json()
health.balance = float(resp.get("request", 0))
health.balance_checked_at = now
return health.balance
except Exception:
return health.balance # Return cached value on error
@app.route("/health/live")
def liveness():
"""Liveness probe — is the process responsive?"""
return jsonify({"status": "ok", "uptime_s": int(time.monotonic() - health.started_at)}), 200
@app.route("/health/ready")
def readiness():
"""Readiness probe — can the worker accept tasks?"""
issues = []
# Check consecutive failures
if health.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
issues.append(f"consecutive_failures={health.consecutive_failures}")
# Check time since last solve
if health.total_solved > 0 and health.seconds_since_last_solve > MAX_SECONDS_WITHOUT_SOLVE:
issues.append(f"no_solve_for={int(health.seconds_since_last_solve)}s")
# Check balance
balance = check_balance()
if balance is not None and balance < MIN_BALANCE:
issues.append(f"low_balance=${balance:.2f}")
if issues:
return jsonify({
"status": "not_ready",
"issues": issues,
"stats": {
"solved": health.total_solved,
"failed": health.total_failed,
"success_rate": round(health.success_rate, 3),
},
}), 503
return jsonify({
"status": "ready",
"stats": {
"solved": health.total_solved,
"failed": health.total_failed,
"success_rate": round(health.success_rate, 3),
"balance": balance,
},
}), 200
@app.route("/health/dependencies")
def dependencies():
"""Check upstream dependencies."""
checks = {}
# CaptchaAI API reachability
try:
resp = requests.get(RESULT_URL, params={
"key": API_KEY, "action": "getbalance", "json": 1,
}, timeout=10)
checks["captchaai_api"] = {
"status": "ok" if resp.status_code == 200 else "degraded",
"response_ms": int(resp.elapsed.total_seconds() * 1000),
}
except Exception as e:
checks["captchaai_api"] = {"status": "down", "error": str(e)}
all_ok = all(c["status"] == "ok" for c in checks.values())
return jsonify({
"status": "ok" if all_ok else "degraded",
"checks": checks,
}), 200 if all_ok else 503
# --- Worker loop (runs in background) ---
def worker_loop():
"""Simulated CAPTCHA solving worker."""
while True:
try:
# ... solve CAPTCHA logic ...
health.record_success()
except Exception:
health.record_failure()
time.sleep(1)
threading.Thread(target=worker_loop, daemon=True).start()
JavaScript:表达健康端点
const express = require("express");
const API_KEY = "YOUR_API_KEY";
const RESULT_URL = "https://ocr.captchaai.com/res.php";
const app = express();
const health = {
startedAt: Date.now(),
lastSolveAt: 0,
totalSolved: 0,
totalFailed: 0,
consecutiveFailures: 0,
balance: null,
balanceCheckedAt: 0,
recordSuccess() {
this.totalSolved++;
this.lastSolveAt = Date.now();
this.consecutiveFailures = 0;
},
recordFailure() {
this.totalFailed++;
this.consecutiveFailures++;
},
get successRate() {
const total = this.totalSolved + this.totalFailed;
return total > 0 ? this.totalSolved / total : 1;
},
};
async function checkBalance() {
if (health.balance !== null && Date.now() - health.balanceCheckedAt < 60000) {
return health.balance;
}
try {
const url = `${RESULT_URL}?key=${API_KEY}&action=getbalance&json=1`;
const resp = await (await fetch(url)).json();
health.balance = parseFloat(resp.request);
health.balanceCheckedAt = Date.now();
return health.balance;
} catch {
return health.balance;
}
}
app.get("/health/live", (req, res) => {
res.json({ status: "ok", uptimeMs: Date.now() - health.startedAt });
});
app.get("/health/ready", async (req, res) => {
const issues = [];
if (health.consecutiveFailures >= 10) {
issues.push(`consecutive_failures=${health.consecutiveFailures}`);
}
if (health.totalSolved > 0) {
const silentMs = Date.now() - health.lastSolveAt;
if (silentMs > 600_000) {
issues.push(`no_solve_for=${Math.round(silentMs / 1000)}s`);
}
}
const balance = await checkBalance();
if (balance !== null && balance < 1.0) {
issues.push(`low_balance=$${balance.toFixed(2)}`);
}
const stats = {
solved: health.totalSolved,
failed: health.totalFailed,
successRate: Math.round(health.successRate * 1000) / 1000,
balance,
};
if (issues.length > 0) {
return res.status(503).json({ status: "not_ready", issues, stats });
}
res.json({ status: "ready", stats });
});
app.get("/health/dependencies", async (req, res) => {
const checks = {};
try {
const start = Date.now();
const url = `${RESULT_URL}?key=${API_KEY}&action=getbalance&json=1`;
const resp = await fetch(url);
checks.captchaaiApi = {
status: resp.ok ? "ok" : "degraded",
responseMs: Date.now() - start,
};
} catch (e) {
checks.captchaaiApi = { status: "down", error: e.message };
}
const allOk = Object.values(checks).every((c) => c.status === "ok");
res.status(allOk ? 200 : 503).json({
status: allOk ? "ok" : "degraded",
checks,
});
});
app.listen(8080, () => console.log("Health server on :8080"));
Kubernetes 配置
apiVersion: apps/v1
kind: Deployment
metadata:
name: captcha-worker
spec:
replicas: 3
template:
spec:
containers:
- name: worker
image: captcha-worker:latest
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /health/live
port: 8080
initialDelaySeconds: 10
periodSeconds: 15
failureThreshold: 3
readinessProbe:
httpGet:
path: /health/ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 2
健康检查响应代码
| 端点 | 200 | 503 |
|---|---|---|
/health/live |
流程响应 | 进程冻结 - 重新启动 |
/health/ready |
可以接受工作 | 停止发送任务 |
/health/dependencies |
所有依赖项都OK | 上游降级 |
操作员阈值
- 使用就绪性来阻止新工作,使用活动性来触发重启,并使用平衡警报来捕获吞吐量下降的情况。
- 将运行状况检查与队列深度、最近错误率和依赖项可达性联系起来,而不是仅与进程正常运行时间联系起来。
- 保持值班响应人员可见的阈值,以便健康变化是可操作的而不是嘈杂的。
故障排除
| 问题 | 原因 | 处理方式 |
|---|---|---|
| Worker不断重启 | 活性阈值太低 | 增加 failureThreshold 或 periodSeconds |
| 启动期间工作线程标记为未就绪 | 尚无解决方案被视为“太长” | 仅在第一次解决后检查 seconds_since_last_solve |
| 平衡检查会减慢运行状况端点 | 每个请求都会调用 API | 使用 TTL 缓存余额(建议 60 秒) |
| 健康端点本身崩溃 | 检查中未处理的异常 | 将每个检查包装在 try/except 中;返回降级而不是 500 |
| 依赖性检查的误报 | 余额检查期间出现网络故障 | 通过重新验证时失效的方法使用缓存值 |
常问问题
Kubernetes 应多久探测一次运行状况端点?
活跃度:每 10-30 秒一次,故障阈值为 3。就绪性:每 5-10 秒一次,故障阈值为 2。更频繁的探测可以更快地检测问题,但会增加开销。
运行状况端点是否应该访问 CaptchaAI API?
仅用于就绪性和依赖性检查,并始终缓存结果。活性探针永远不应该发出外部调用——它必须立即响应以证明进程还活着。
如何监控多名员工的健康状况?
以 Prometheus 格式 (/metrics) 与运行状况端点一起公开运行状况指标。使用 Grafana 仪表板聚合各个工作人员以查看整个车队的运行状况。
相关文章
下一步
让您的 CAPTCHA 工作人员做好生产准备 –”获取您的 CaptchaAI API 密钥并添加健康检查端点。
相关指南:
- CAPTCHA API 调用的断路器模式
- 用于验证码解决的隔板模式
- 使用 Prometheus 和 Grafana 监控验证码解决率