Migration of the Nations (Stěhování národů)

This commit is contained in:
2026-03-25 19:47:10 +01:00
parent e3b7879eb2
commit 5de0d57612
28 changed files with 152 additions and 122 deletions

2
.gitignore vendored

@@ -1,5 +1,5 @@
.idea/
beaky-backend/data/
report.xml
# Byte-compiled / optimized / DLL files

98
CLAUDE.md Normal file

@@ -0,0 +1,98 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Project Does
Beaky is a CLI tool for verifying the truthfulness of sports betting tickets. It reads ticket URLs from an Excel file, classifies the bets on each ticket (via web scraping or OCR), then resolves each bet against a football statistics API to determine if the ticket is genuine.
## Commands
```bash
# Install (with dev dependencies)
pip install -e ".[dev]"
# Install Playwright browser (required for link classifier and screenshotter)
playwright install chromium
# Run the CLI
beaky <mode> [--config config/application.yml] [--id <ticket_id>] [--classifier {link,img,both}] [--dump]
# Modes:
# screen - screenshot all ticket URLs to data/screenshots/<id>.png
# parse - print all links loaded from Excel
# compare - classify tickets and print bet comparison table
# resolve - classify via link classifier, then resolve bets against football API
# Run the REST API (default: http://0.0.0.0:8000)
beaky-api
# Run tests
pytest
# Lint
ruff check .
# Format
ruff format .
```
## Architecture
Data flows through five stages:
1. **Scanner** (`scanner/scanner.py`) — Reads `data/odkazy.xlsx` and produces `Link` objects (id, url, date).
2. **Classifiers** — Two independent classifiers both produce a `Ticket` (list of typed `Bet` objects):
- **Link classifier** (`link_classifier/classifier.py`) — Launches a headless Chromium browser via Playwright, navigates to the ticket URL (a Czech Fortuna betting site), and parses the DOM using CSS selectors to extract bet details.
   - **Image classifier** (`image_classifier/classifier.py`) — Runs pytesseract OCR on screenshots in `data/screenshots/`, then parses the raw text into bets with regexes. Text is segmented into per-bet blocks using a date line as the start trigger and a sport-prefix line as the end trigger.
3. **Resolver** (`resolvers/resolver.py`) — Takes a classified `Ticket` and resolves each bet's outcome (WIN/LOSE/VOID/UNKNOWN) by querying the `api-sports.io` football API. Matches fixtures using team name similarity (SequenceMatcher) and date proximity. Results are disk-cached in `data/fixture_cache/` to avoid redundant API calls.
4. **CLI** (`cli.py`) — Ties everything together. Handles `--classifier` and `--dump` flags; renders ANSI-colored comparison tables for side-by-side link-vs-image output.
5. **REST API** (`api/`) — FastAPI app exposing a single endpoint. Runs the full pipeline (screenshot → both classifiers → resolve) for a given URL and returns the verdict. Classifiers and resolver are instantiated once at startup (`app.state`) and reused across requests.
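The Resolver's fixture-matching heuristic can be sketched as follows. This is an illustrative simplification, not the actual `resolvers/resolver.py` code: the fixture dict shape, the averaging of the two name ratios, and the 7-day falloff are assumptions.

```python
from datetime import datetime
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1] between lowercased team names
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_fixture(fixtures, team1, team2, bet_date):
    """Pick the fixture whose team names and kickoff date best match the bet."""
    best, best_score = None, 0.0
    for f in fixtures:
        names = (name_similarity(team1, f["home"]) + name_similarity(team2, f["away"])) / 2
        days_off = abs((f["date"] - bet_date).days)
        date_prox = max(0.0, 1 - days_off / 7)  # 1.0 same day, 0.0 at 7+ days away
        score = names * date_prox
        if score > best_score:
            best, best_score = f, score
    return best, best_score
```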
### Core Domain Models (`datamodels/ticket.py`)
`Bet` is an abstract Pydantic dataclass with a `resolve(MatchInfo) -> BetOutcome` method. Concrete subtypes include: `WinDrawLose`, `WinDrawLoseDouble`, `WinLose`, `BothTeamScored`, `GoalAmount`, `GoalHandicap`, `HalfTimeResult`, `HalfTimeDouble`, `HalfTimeFullTime`, `CornerAmount`, `TeamCornerAmount`, `MoreOffsides`, `Advance`, `UnknownBet`. Adding a new bet type requires: a new subclass here, detection regex in both classifiers, and a `resolve()` implementation.
### REST API
**Endpoint:** `POST /api/v1/resolve`
```json
{ "url": "<fortuna ticket url>", "debug": false }
```
Response includes `verdict` and per-bet `outcome`/`fixture_id`/`confidence`. With `debug: true` also returns raw `link_ticket`, `img_ticket`, and per-bet `match_info`.
Ticket ID is derived as `md5(url) % 10^9` — stable across restarts. Screenshots are saved to `data/screenshots/{ticket_id}.png`.
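The ID derivation can be reproduced with the standard library. This is a sketch of the formula as stated; the actual helper name in the codebase is not shown in this diff.

```python
import hashlib


def ticket_id(url: str) -> int:
    # md5 digest interpreted as an integer, reduced mod 10^9 -> deterministic ID
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % 10**9
```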
**Environment variables** (all optional):
| Var | Default |
|---|---|
| `BEAKY_CONFIG` | `config/application.yml` |
| `BEAKY_HOST` | `0.0.0.0` |
| `BEAKY_PORT` | `8000` |
| `LOG_LEVEL` | `api.log_level` from `config/application.yml` |
OpenAPI docs available at `/docs` when the server is running.
### Logging
All modules use `logging` for diagnostics; the only `print()` calls are the CLI's user-facing output in `cli.py`. Resolver debug output (fixture matching, API calls) goes through `_ansi.log()`, which emits at `DEBUG` level with ANSI colors preserved. Set `api.log_level: DEBUG` in `config/application.yml` (or the `LOG_LEVEL=DEBUG` env var) to see it.
### Configuration
Config is loaded from `config/application.yml` into Pydantic dataclasses (`Config`, `ScreenshotterConfig`, `ResolverConfig`, `ImgClassifierConfig`, `ApiConfig`). Key fields:
- `path` — path to the input Excel file
- `resolver.api_key` — api-sports.io API key
- `resolver.league_map` — maps Czech league name patterns to API league IDs (longest-match wins)
- `resolver.cache_path` — disk cache directory (default: `data/fixture_cache`)
- `api.log_level` — logging level for the API server (default: `INFO`)
### Bet text language
All bet type strings are in Czech (from the Fortuna betting platform). Regex patterns in both classifiers match Czech text (e.g. `"Výsledek zápasu"`, `"Počet gólů"`).
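Bet-type detection might look like the following. The patterns mirror the two examples above but are hypothetical; they are not the classifiers' actual regexes.

```python
import re

# Hypothetical mapping from Czech bet-type phrases to Bet subclass names
BET_PATTERNS = {
    "WinDrawLose": re.compile(r"Výsledek zápasu"),
    "GoalAmount": re.compile(r"Počet gólů"),
}


def detect_bet_type(text: str) -> str:
    for bet_type, pattern in BET_PATTERNS.items():
        if pattern.search(text):
            return bet_type
    return "UnknownBet"
```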


@@ -64,3 +64,7 @@ img_classifier:
  target_path: data/screenshots/
log_level: INFO  # set to DEBUG to see raw classifier and resolver output
api:
  host: 0.0.0.0
  port: 8000


@@ -16,6 +16,9 @@ dependencies = [
    "playwright==1.58.0",
    "requests>=2.32.0",
    "diskcache>=5.6",
    "pytesseract==0.3.13",
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
]
[project.optional-dependencies]
@@ -30,6 +33,7 @@ dev = [
[project.scripts]
beaky = "beaky.cli:main"
beaky-api = "beaky.api.main:main"
[tool.ruff]


@@ -3,11 +3,8 @@ import re as _re
import shutil
from datetime import datetime
import yaml
from pydantic import ValidationError
from beaky import _ansi
from beaky.config import load_config
from beaky.datamodels.ticket import Bet, Ticket
from beaky.image_classifier.classifier import img_classify
from beaky.link_classifier.classifier import LinkClassifier
@@ -191,16 +188,6 @@ def _print_dump(ticket: Ticket, label: str) -> None:
        print(f" {k}: {val}")
def load_config(path: str) -> Config | None:
    with open(path) as f:
        config_dict = yaml.safe_load(f)
    try:
        return Config(**config_dict)
    except ValidationError as e:
        print("Bad config")
        print(e)
        return None
def main() -> None:
    parser = argparse.ArgumentParser(prog="beaky")
    parser.add_argument("--config", help="Path to config file.", default="config/application.yml")
@@ -212,8 +199,10 @@ def main() -> None:
                        help="Dump all bet fields untruncated (compare mode only).")
    args = parser.parse_args()
    try:
        config = load_config(args.config)
    except RuntimeError as e:
        print(e)
        return
    # always load testing data, we will modify that later


@@ -0,0 +1,34 @@
from dataclasses import field as _field

import yaml
from pydantic import ValidationError
from pydantic.dataclasses import dataclass

from beaky.image_classifier.config import ImgClassifierConfig
from beaky.resolvers.config import ResolverConfig
from beaky.screenshotter.config import ScreenshotterConfig


def load_config(path: str) -> "Config":
    with open(path) as f:
        data = yaml.safe_load(f)
    try:
        return Config(**data)
    except ValidationError as exc:
        raise RuntimeError(f"Invalid config at {path}: {exc}") from exc


@dataclass
class ApiConfig:
    host: str = "0.0.0.0"
    port: int = 8000


@dataclass
class Config:
    path: str
    screenshotter: ScreenshotterConfig
    resolver: ResolverConfig
    img_classifier: ImgClassifierConfig
    log_level: str = "INFO"
    api: ApiConfig = _field(default_factory=ApiConfig)


@@ -179,7 +179,8 @@ class TicketResolver:
        if cache_key not in self._fixture_cache:
            if cache_key in self._disk_cache and not cache_may_be_stale:
                self._fixture_cache[cache_key] = self._disk_cache[cache_key]
                _ansi.log(
                    _ansi.gray(f" │ /fixtures served from disk cache ({len(self._fixture_cache[cache_key])} fixtures)"))
            else:
                date_from = (center - timedelta(days=_DATE_WINDOW)).strftime("%Y-%m-%d")
                date_to = (center + timedelta(days=_DATE_WINDOW)).strftime("%Y-%m-%d")
@@ -197,7 +198,8 @@ class TicketResolver:
                self._disk_cache[cache_key] = cacheable
                _ansi.log(_ansi.gray(f"{len(cacheable)} non-NS fixture(s) written to disk cache"))
        else:
            _ansi.log(
                _ansi.gray(f" │ /fixtures (±{_DATE_WINDOW}d of {date_str}, league={league_id}) served from memory"))
        fixture, name_match, date_prox = _best_fixture_match(
            self._fixture_cache[cache_key], bet.team1Name, bet.team2Name, center
@@ -227,7 +229,8 @@ class TicketResolver:
        if results:
            league_id = results[0]["league"]["id"]
            league_found_name = results[0]["league"]["name"]
            _ansi.log(
                _ansi.gray(f" │ matched {league_found_name!r} id={league_id} (API fallback, confidence=0.7)"))
            self._league_cache[key] = (league_id, 0.7)
            return league_id, 0.7

0
beaky-frontend/.gitkeep Normal file


@@ -1,88 +0,0 @@
import os
import re
import sys
import argparse
from datetime import datetime

import pytz
from openpyxl import Workbook


def process_files(starting_id, output_filename="output.xlsx"):
    # Find all txt files in the current directory
    txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]
    if not txt_files:
        print("No .txt files found in the current directory.")
        return

    # Regex patterns for input data
    date_pattern = re.compile(r'\[.*?(\d{1,2})\s+(\d{1,2}),\s+(\d{4})\s+at\s+(\d{1,2}:\d{2})\]')
    url_pattern = re.compile(r'(https?://[^\s]+)')

    # Timezone setup (CET to UTC)
    local_tz = pytz.timezone("Europe/Prague")

    # Set up the Excel Workbook
    wb = Workbook()
    ws = wb.active
    ws.title = "Fortuna Data"
    ws.append(["ID", "URL", "Date_UTC"])  # Add headers

    current_id = starting_id
    success_files = []
    for filename in txt_files:
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                content = f.read()
            dates = date_pattern.findall(content)
            urls = url_pattern.findall(content)

            # Extract and format the data
            for i in range(min(len(dates), len(urls))):
                month, day, year, time_str = dates[i]
                # Parse the datetime from the text file
                dt_str = f"{year}-{month}-{day} {time_str}"
                local_dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M")
                # Convert CET to UTC
                localized_dt = local_tz.localize(local_dt)
                utc_dt = localized_dt.astimezone(pytz.utc)
                # Format to ISO 8601 with T and Z
                formatted_date = utc_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
                # Add a new row to the Excel sheet
                ws.append([current_id, urls[i], formatted_date])
                current_id += 1
            # Queue file for deletion
            success_files.append(filename)
        except Exception as e:
            print(f"Error processing (unknown): {e}", file=sys.stderr)

    # Save the Excel file
    try:
        wb.save(output_filename)
        print(f"Successfully saved data to {output_filename}")
        # Clean up only if save was successful
        for filename in success_files:
            os.remove(filename)
            print(f"Deleted: (unknown)")
    except Exception as e:
        print(f"Failed to save {output_filename}. No text files were deleted. Error: {e}", file=sys.stderr)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract URLs to an Excel file with ISO UTC dates.")
    parser.add_argument("start_id", type=int, help="Starting ID for the output")
    parser.add_argument("--output", type=str, default="extracted_data.xlsx",
                        help="Output Excel filename (default: extracted_data.xlsx)")
    args = parser.parse_args()
    process_files(args.start_id, args.output)


@@ -1,14 +0,0 @@
from pydantic.dataclasses import dataclass

from beaky.image_classifier.config import ImgClassifierConfig
from beaky.resolvers.config import ResolverConfig
from beaky.screenshotter.config import ScreenshotterConfig


@dataclass
class Config:
    path: str
    screenshotter: ScreenshotterConfig
    resolver: ResolverConfig
    img_classifier: ImgClassifierConfig
    log_level: str = "INFO"