Migration of the Nations (Stěhování národů)

This commit is contained in:
2026-03-25 19:47:10 +01:00
parent e3b7879eb2
commit 5de0d57612
28 changed files with 152 additions and 122 deletions

2
.gitignore vendored

@@ -1,5 +1,5 @@
.idea/
beaky-backend/data/
report.xml
# Byte-compiled / optimized / DLL files

98
CLAUDE.md Normal file

@@ -0,0 +1,98 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Project Does
Beaky is a CLI tool for verifying the truthfulness of sports betting tickets. It reads ticket URLs from an Excel file, classifies the bets on each ticket (via web scraping or OCR), then resolves each bet against a football statistics API to determine if the ticket is genuine.
## Commands
```bash
# Install (with dev dependencies)
pip install -e ".[dev]"
# Install Playwright browser (required for link classifier and screenshotter)
playwright install chromium
# Run the CLI
beaky <mode> [--config config/application.yml] [--id <ticket_id>] [--classifier {link,img,both}] [--dump]
# Modes:
# screen - screenshot all ticket URLs to data/screenshots/<id>.png
# parse - print all links loaded from Excel
# compare - classify tickets and print bet comparison table
# resolve - classify via link classifier, then resolve bets against football API
# Run the REST API (default: http://0.0.0.0:8000)
beaky-api
# Run tests
pytest
# Lint
ruff check .
# Format
ruff format .
```
## Architecture
Data flows through five stages:
1. **Scanner** (`scanner/scanner.py`) — Reads `data/odkazy.xlsx` and produces `Link` objects (id, url, date).
2. **Classifiers** — Two independent classifiers both produce a `Ticket` (list of typed `Bet` objects):
- **Link classifier** (`link_classifier/classifier.py`) — Launches a headless Chromium browser via Playwright, navigates to the ticket URL (a Czech Fortuna betting site), and parses the DOM using CSS selectors to extract bet details.
   - **Image classifier** (`image_classifier/classifier.py`) — Runs pytesseract OCR on screenshots in `data/screenshots/`, then parses the raw text into bets with regexes. Text is segmented into per-bet blocks using a date line as the start trigger and a sport-prefix line as the end trigger.
3. **Resolver** (`resolvers/resolver.py`) — Takes a classified `Ticket` and resolves each bet's outcome (WIN/LOSE/VOID/UNKNOWN) by querying the `api-sports.io` football API. Matches fixtures using team name similarity (SequenceMatcher) and date proximity. Results are disk-cached in `data/fixture_cache/` to avoid redundant API calls.
4. **CLI** (`cli.py`) — Ties everything together. Handles `--classifier` and `--dump` flags; renders ANSI-colored comparison tables for side-by-side link-vs-image output.
5. **REST API** (`api/`) — FastAPI app exposing a single endpoint. Runs the full pipeline (screenshot → both classifiers → resolve) for a given URL and returns the verdict. Classifiers and resolver are instantiated once at startup (`app.state`) and reused across requests.
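The Resolver's fixture-matching heuristic can be sketched as follows. This is an illustrative simplification, not the actual `resolvers/resolver.py` code: the fixture dict shape, the averaging of the two name ratios, and the 7-day falloff are assumptions.

```python
from datetime import datetime
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1] between lowercased team names
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_fixture(fixtures, team1, team2, bet_date):
    """Pick the fixture whose team names and kickoff date best match the bet."""
    best, best_score = None, 0.0
    for f in fixtures:
        names = (name_similarity(team1, f["home"]) + name_similarity(team2, f["away"])) / 2
        days_off = abs((f["date"] - bet_date).days)
        date_prox = max(0.0, 1 - days_off / 7)  # 1.0 same day, 0.0 at 7+ days away
        score = names * date_prox
        if score > best_score:
            best, best_score = f, score
    return best, best_score
```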
### Core Domain Models (`datamodels/ticket.py`)
`Bet` is an abstract Pydantic dataclass with a `resolve(MatchInfo) -> BetOutcome` method. Concrete subtypes include: `WinDrawLose`, `WinDrawLoseDouble`, `WinLose`, `BothTeamScored`, `GoalAmount`, `GoalHandicap`, `HalfTimeResult`, `HalfTimeDouble`, `HalfTimeFullTime`, `CornerAmount`, `TeamCornerAmount`, `MoreOffsides`, `Advance`, `UnknownBet`. Adding a new bet type requires: a new subclass here, detection regex in both classifiers, and a `resolve()` implementation.
### REST API
**Endpoint:** `POST /api/v1/resolve`
```json
{ "url": "<fortuna ticket url>", "debug": false }
```
Response includes `verdict` and per-bet `outcome`/`fixture_id`/`confidence`. With `debug: true` also returns raw `link_ticket`, `img_ticket`, and per-bet `match_info`.
Ticket ID is derived as `md5(url) % 10^9` — stable across restarts. Screenshots are saved to `data/screenshots/{ticket_id}.png`.
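The ID derivation can be reproduced with the standard library. This is a sketch of the formula as stated; the actual helper name in the codebase is not shown in this diff.

```python
import hashlib


def ticket_id(url: str) -> int:
    # md5 digest interpreted as an integer, reduced mod 10^9 -> deterministic ID
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % 10**9
```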
**Environment variables** (all optional):
| Var | Default |
|---|---|
| `BEAKY_CONFIG` | `config/application.yml` |
| `BEAKY_HOST` | `0.0.0.0` |
| `BEAKY_PORT` | `8000` |
| `LOG_LEVEL` | `api.log_level` from `config/application.yml` |
OpenAPI docs available at `/docs` when the server is running.
### Logging
All modules use `logging` for diagnostics; the only `print()` calls are the CLI's user-facing output in `cli.py`. Resolver debug output (fixture matching, API calls) goes through `_ansi.log()`, which emits at `DEBUG` level with ANSI colors preserved. Set `api.log_level: DEBUG` in `config/application.yml` (or the `LOG_LEVEL=DEBUG` env var) to see it.
### Configuration
Config is loaded from `config/application.yml` into Pydantic dataclasses (`Config`, `ScreenshotterConfig`, `ResolverConfig`, `ImgClassifierConfig`, `ApiConfig`). Key fields:
- `path` — path to the input Excel file
- `resolver.api_key` — api-sports.io API key
- `resolver.league_map` — maps Czech league name patterns to API league IDs (longest-match wins)
- `resolver.cache_path` — disk cache directory (default: `data/fixture_cache`)
- `api.log_level` — logging level for the API server (default: `INFO`)
### Bet text language
All bet type strings are in Czech (from the Fortuna betting platform). Regex patterns in both classifiers match Czech text (e.g. `"Výsledek zápasu"`, `"Počet gólů"`).
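Bet-type detection might look like the following. The patterns mirror the two examples above but are hypothetical; they are not the classifiers' actual regexes.

```python
import re

# Hypothetical mapping from Czech bet-type phrases to Bet subclass names
BET_PATTERNS = {
    "WinDrawLose": re.compile(r"Výsledek zápasu"),
    "GoalAmount": re.compile(r"Počet gólů"),
}


def detect_bet_type(text: str) -> str:
    for bet_type, pattern in BET_PATTERNS.items():
        if pattern.search(text):
            return bet_type
    return "UnknownBet"
```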


@@ -64,3 +64,7 @@ img_classifier:
  target_path: data/screenshots/
log_level: INFO  # set to DEBUG to see raw classifier and resolver output
api:
  host: 0.0.0.0
  port: 8000


@@ -16,6 +16,9 @@ dependencies = [
    "playwright==1.58.0",
    "requests>=2.32.0",
    "diskcache>=5.6",
    "pytesseract==0.3.13",
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
]
[project.optional-dependencies]
@@ -30,6 +33,7 @@ dev = [
[project.scripts]
beaky = "beaky.cli:main"
beaky-api = "beaky.api.main:main"
[tool.ruff]


@@ -3,11 +3,8 @@ import re as _re
import shutil
from datetime import datetime
import yaml
from pydantic import ValidationError
from beaky import _ansi
from beaky.config import load_config
from beaky.datamodels.ticket import Bet, Ticket
from beaky.image_classifier.classifier import img_classify
from beaky.link_classifier.classifier import LinkClassifier
@@ -191,16 +188,6 @@ def _print_dump(ticket: Ticket, label: str) -> None:
        print(f" {k}: {val}")
def load_config(path: str) -> Config | None:
    with open(path) as f:
        config_dict = yaml.safe_load(f)
    try:
        return Config(**config_dict)
    except ValidationError as e:
        print("Bad config")
        print(e)
        return None
def main() -> None:
    parser = argparse.ArgumentParser(prog="beaky")
    parser.add_argument("--config", help="Path to config file.", default="config/application.yml")
@@ -212,8 +199,10 @@ def main() -> None:
                        help="Dump all bet fields untruncated (compare mode only).")
    args = parser.parse_args()
    try:
        config = load_config(args.config)
    except RuntimeError as e:
        print(e)
        return
    # always load testing data, we will modify that later


@@ -0,0 +1,34 @@
from dataclasses import field as _field

import yaml
from pydantic import ValidationError
from pydantic.dataclasses import dataclass

from beaky.image_classifier.config import ImgClassifierConfig
from beaky.resolvers.config import ResolverConfig
from beaky.screenshotter.config import ScreenshotterConfig


def load_config(path: str) -> "Config":
    with open(path) as f:
        data = yaml.safe_load(f)
    try:
        return Config(**data)
    except ValidationError as exc:
        raise RuntimeError(f"Invalid config at {path}: {exc}") from exc


@dataclass
class ApiConfig:
    host: str = "0.0.0.0"
    port: int = 8000


@dataclass
class Config:
    path: str
    screenshotter: ScreenshotterConfig
    resolver: ResolverConfig
    img_classifier: ImgClassifierConfig
    log_level: str = "INFO"
    api: ApiConfig = _field(default_factory=ApiConfig)


@@ -179,7 +179,8 @@ class TicketResolver:
        if cache_key not in self._fixture_cache:
            if cache_key in self._disk_cache and not cache_may_be_stale:
                self._fixture_cache[cache_key] = self._disk_cache[cache_key]
                _ansi.log(
                    _ansi.gray(f" │ /fixtures served from disk cache ({len(self._fixture_cache[cache_key])} fixtures)"))
            else:
                date_from = (center - timedelta(days=_DATE_WINDOW)).strftime("%Y-%m-%d")
                date_to = (center + timedelta(days=_DATE_WINDOW)).strftime("%Y-%m-%d")
@@ -197,7 +198,8 @@ class TicketResolver:
                self._disk_cache[cache_key] = cacheable
                _ansi.log(_ansi.gray(f"{len(cacheable)} non-NS fixture(s) written to disk cache"))
        else:
            _ansi.log(
                _ansi.gray(f" │ /fixtures (±{_DATE_WINDOW}d of {date_str}, league={league_id}) served from memory"))
        fixture, name_match, date_prox = _best_fixture_match(
            self._fixture_cache[cache_key], bet.team1Name, bet.team2Name, center
@@ -227,7 +229,8 @@ class TicketResolver:
        if results:
            league_id = results[0]["league"]["id"]
            league_found_name = results[0]["league"]["name"]
            _ansi.log(
                _ansi.gray(f" │ matched {league_found_name!r} id={league_id} (API fallback, confidence=0.7)"))
            self._league_cache[key] = (league_id, 0.7)
            return league_id, 0.7

0
beaky-frontend/.gitkeep Normal file


@@ -1,88 +0,0 @@
import os
import re
import sys
import argparse
from datetime import datetime

import pytz
from openpyxl import Workbook


def process_files(starting_id, output_filename="output.xlsx"):
    # Find all txt files in the current directory
    txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]
    if not txt_files:
        print("No .txt files found in the current directory.")
        return

    # Regex patterns for input data
    date_pattern = re.compile(r'\[.*?(\d{1,2})\s+(\d{1,2}),\s+(\d{4})\s+at\s+(\d{1,2}:\d{2})\]')
    url_pattern = re.compile(r'(https?://[^\s]+)')

    # Timezone setup (CET to UTC)
    local_tz = pytz.timezone("Europe/Prague")

    # Set up the Excel Workbook
    wb = Workbook()
    ws = wb.active
    ws.title = "Fortuna Data"
    ws.append(["ID", "URL", "Date_UTC"])  # Add headers

    current_id = starting_id
    success_files = []
    for filename in txt_files:
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                content = f.read()
            dates = date_pattern.findall(content)
            urls = url_pattern.findall(content)

            # Extract and format the data
            for i in range(min(len(dates), len(urls))):
                month, day, year, time_str = dates[i]
                # Parse the datetime from the text file
                dt_str = f"{year}-{month}-{day} {time_str}"
                local_dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M")
                # Convert CET to UTC
                localized_dt = local_tz.localize(local_dt)
                utc_dt = localized_dt.astimezone(pytz.utc)
                # Format to ISO 8601 with T and Z
                formatted_date = utc_dt.strftime("%Y-%m-%dT%H:%M:%SZ")
                # Add a new row to the Excel sheet
                ws.append([current_id, urls[i], formatted_date])
                current_id += 1
            # Queue file for deletion
            success_files.append(filename)
        except Exception as e:
            print(f"Error processing (unknown): {e}", file=sys.stderr)

    # Save the Excel file
    try:
        wb.save(output_filename)
        print(f"Successfully saved data to {output_filename}")
        # Clean up only if save was successful
        for filename in success_files:
            os.remove(filename)
            print(f"Deleted: (unknown)")
    except Exception as e:
        print(f"Failed to save {output_filename}. No text files were deleted. Error: {e}", file=sys.stderr)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract URLs to an Excel file with ISO UTC dates.")
    parser.add_argument("start_id", type=int, help="Starting ID for the output")
    parser.add_argument("--output", type=str, default="extracted_data.xlsx",
                        help="Output Excel filename (default: extracted_data.xlsx)")
    args = parser.parse_args()
    process_files(args.start_id, args.output)


@@ -1,14 +0,0 @@
from pydantic.dataclasses import dataclass

from beaky.image_classifier.config import ImgClassifierConfig
from beaky.resolvers.config import ResolverConfig
from beaky.screenshotter.config import ScreenshotterConfig


@dataclass
class Config:
    path: str
    screenshotter: ScreenshotterConfig
    resolver: ResolverConfig
    img_classifier: ImgClassifierConfig
    log_level: str = "INFO"