Great migration

2026-03-25 19:47:10 +01:00
parent e3b7879eb2
commit 5de0d57612
28 changed files with 152 additions and 122 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -0,0 +1,98 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## What This Project Does
+
+Beaky is a CLI tool for verifying the truthfulness of sports betting tickets. It reads ticket URLs from an Excel file, classifies the bets on each ticket (via web scraping or OCR), then resolves each bet against a football statistics API to determine if the ticket is genuine.
+
+## Commands
+
+```bash
+# Install (with dev dependencies)
+pip install -e ".[dev]"
+
+# Install Playwright browser (required for link classifier and screenshotter)
+playwright install chromium
+
+# Run the CLI
+beaky <mode> [--config config/application.yml] [--id <ticket_id>] [--classifier {link,img,both}] [--dump]
+
+# Modes:
+#   screen   - screenshot all ticket URLs to data/screenshots/<id>.png
+#   parse    - print all links loaded from Excel
+#   compare  - classify tickets and print bet comparison table
+#   resolve  - classify via link classifier, then resolve bets against football API
+
+# Run the REST API (default: http://0.0.0.0:8000)
+beaky-api
+
+# Run tests
+pytest
+
+# Lint
+ruff check .
+
+# Format
+ruff format .
+```
+
+## Architecture
+
+The pipeline has three processing stages (1-3) and two entry points that drive them (4-5):
+
+1. **Scanner** (`scanner/scanner.py`) — Reads `data/odkazy.xlsx` and produces `Link` objects (id, url, date).
+
+2. **Classifiers** — Two independent classifiers both produce a `Ticket` (list of typed `Bet` objects):
+   - **Link classifier** (`link_classifier/classifier.py`) — Launches a headless Chromium browser via Playwright, navigates to the ticket URL (a Czech Fortuna betting site), and parses the DOM using CSS selectors to extract bet details.
+   - **Image classifier** (`image_classifier/classifier.py`) — Runs pytesseract OCR on screenshots in `data/screenshots/`, then uses regex to parse the raw text into bets. The text is segmented into bet blocks: a date marks the start of a block and a sport prefix marks its end.
+
+3. **Resolver** (`resolvers/resolver.py`) — Takes a classified `Ticket` and resolves each bet's outcome (WIN/LOSE/VOID/UNKNOWN) by querying the `api-sports.io` football API. Matches fixtures using team name similarity (SequenceMatcher) and date proximity. Results are disk-cached in `data/fixture_cache/` to avoid redundant API calls.
+
+4. **CLI** (`cli.py`) — Ties everything together. Handles `--classifier` and `--dump` flags; renders ANSI-colored comparison tables for side-by-side link-vs-image output.
+
+5. **REST API** (`api/`) — FastAPI app exposing a single endpoint. Runs the full pipeline (screenshot → both classifiers → resolve) for a given URL and returns the verdict. Classifiers and resolver are instantiated once at startup (`app.state`) and reused across requests.
+
+### Core Domain Models (`datamodels/ticket.py`)
+
+`Bet` is an abstract Pydantic dataclass with a `resolve(MatchInfo) -> BetOutcome` method. Concrete subtypes include: `WinDrawLose`, `WinDrawLoseDouble`, `WinLose`, `BothTeamScored`, `GoalAmount`, `GoalHandicap`, `HalfTimeResult`, `HalfTimeDouble`, `HalfTimeFullTime`, `CornerAmount`, `TeamCornerAmount`, `MoreOffsides`, `Advance`, `UnknownBet`. Adding a new bet type requires: a new subclass here, detection regex in both classifiers, and a `resolve()` implementation.
+
+### REST API
+
+**Endpoint:** `POST /api/v1/resolve`
+
+```json
+{ "url": "<fortuna ticket url>", "debug": false }
+```
+
+The response includes `verdict` and per-bet `outcome`/`fixture_id`/`confidence`. With `debug: true`, it also returns the raw `link_ticket`, `img_ticket`, and per-bet `match_info`.
+
+The ticket ID is derived as `md5(url) % 10^9` (the MD5 hash of the URL, modulo one billion), so it is stable across restarts. Screenshots are saved to `data/screenshots/{ticket_id}.png`.
+
+**Environment variables** (all optional):
+
+| Var | Default |
+|---|---|
+| `BEAKY_CONFIG` | `config/application.yml` |
+| `BEAKY_HOST` | `0.0.0.0` |
+| `BEAKY_PORT` | `8000` |
+| `LOG_LEVEL` | value from `config/application.yml` → `api.log_level` |
+
+OpenAPI docs are available at `/docs` when the server is running.
+
+### Logging
+
+All modules except the CLI use `logging` instead of `print()`; the user-facing output in `cli.py` still uses `print()`. Resolver debug output (fixture matching, API calls) goes through `_ansi.log()`, which emits at `DEBUG` level with ANSI colors preserved. To see it, set `api.log_level: DEBUG` in `config/application.yml` or the `LOG_LEVEL=DEBUG` environment variable.
+
+### Configuration
+
+Config is loaded from `config/application.yml` into Pydantic dataclasses (`Config`, `ScreenshotterConfig`, `ResolverConfig`, `ImgClassifierConfig`, `ApiConfig`). Key fields:
+- `path` — path to the input Excel file
+- `resolver.api_key` — api-sports.io API key
+- `resolver.league_map` — maps Czech league name patterns to API league IDs (longest-match wins)
+- `resolver.cache_path` — disk cache directory (default: `data/fixture_cache`)
+- `api.log_level` — logging level for the API server (default: `INFO`)
+
+### Bet text language
+
+All bet type strings are in Czech (from the Fortuna betting platform). Regex patterns in both classifiers match Czech text (e.g. `"Výsledek zápasu"`, `"Počet gólů"`).
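
Note on the file added above: three of the mechanisms that CLAUDE.md describes (the stable ticket ID, the longest-match lookup in `resolver.league_map`, and team-name similarity via `SequenceMatcher`) can be sketched in Python. This is only an illustration under stated assumptions, not the repository's code. All function names are hypothetical; the ticket ID assumes the MD5 digest of the UTF-8 encoded URL is read as an integer; league patterns are assumed to be case-insensitive substrings; and the similarity score assumes simple lowercasing and trimming before comparison.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Mapping, Optional


def ticket_id_from_url(url: str) -> int:
    """Derive a stable ID as md5(url) modulo 10**9 (hypothetical helper)."""
    # Assumption: the URL is UTF-8 encoded and the full 128-bit digest is used.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % 10**9


def match_league(league_map: Mapping[str, int], league_name: str) -> Optional[int]:
    """Return the league ID of the longest pattern contained in league_name."""
    # Assumption: patterns are plain substrings compared case-insensitively.
    name = league_name.lower()
    hits = [pattern for pattern in league_map if pattern.lower() in name]
    if not hits:
        return None
    # Longest match wins, as stated for resolver.league_map.
    return league_map[max(hits, key=len)]


def team_similarity(a: str, b: str) -> float:
    """Score two team names in [0.0, 1.0] with difflib.SequenceMatcher."""
    # Assumption: names are normalized by trimming and lowercasing only.
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


if __name__ == "__main__":
    # Placeholder URL and league IDs, used for illustration only.
    print(ticket_id_from_url("https://example.com/ticket/123"))
    print(match_league({"liga": 1, "1. liga": 2}, "Česko 1. liga"))
    print(team_similarity("Sparta Praha", "AC Sparta Praha"))
```

Because the ID depends only on the URL, the same ticket maps to the same screenshot file name on every run. The real resolver also weighs date proximity when picking a fixture, which this sketch leaves out.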