Commit: Stěhování národů ("Migration of peoples")

.gitignore (vendored, 2 changes)

@@ -1,5 +1,5 @@
 .idea/
-data/
+beaky-backend/data/
 report.xml

 # Byte-compiled / optimized / DLL files
CLAUDE.md (new file, 98 lines)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Project Does

Beaky is a CLI tool for verifying the truthfulness of sports betting tickets. It reads ticket URLs from an Excel file, classifies the bets on each ticket (via web scraping or OCR), then resolves each bet against a football statistics API to determine whether the ticket is genuine.

## Commands

```bash
# Install (with dev dependencies)
pip install -e ".[dev]"

# Install the Playwright browser (required for the link classifier and screenshotter)
playwright install chromium

# Run the CLI
beaky <mode> [--config config/application.yml] [--id <ticket_id>] [--classifier {link,img,both}] [--dump]

# Modes:
#   screen  - screenshot all ticket URLs to data/screenshots/<id>.png
#   parse   - print all links loaded from Excel
#   compare - classify tickets and print a bet comparison table
#   resolve - classify via the link classifier, then resolve bets against the football API

# Run the REST API (default: http://0.0.0.0:8000)
beaky-api

# Run tests
pytest

# Lint
ruff check .

# Format
ruff format .
```

## Architecture

Data flows through five stages:
1. **Scanner** (`scanner/scanner.py`) — Reads `data/odkazy.xlsx` and produces `Link` objects (id, url, date).
2. **Classifiers** — Two independent classifiers, each producing a `Ticket` (a list of typed `Bet` objects):
   - **Link classifier** (`link_classifier/classifier.py`) — Launches a headless Chromium browser via Playwright, navigates to the ticket URL (a Czech Fortuna betting site), and parses the DOM using CSS selectors to extract bet details.
   - **Image classifier** (`image_classifier/classifier.py`) — Runs pytesseract OCR on screenshots in `data/screenshots/`, then uses regexes to parse the raw text into bets. Block segmentation is driven by date-start and sport-prefix end triggers.
3. **Resolver** (`resolvers/resolver.py`) — Takes a classified `Ticket` and resolves each bet's outcome (WIN/LOSE/VOID/UNKNOWN) by querying the `api-sports.io` football API. Matches fixtures using team-name similarity (SequenceMatcher) and date proximity. Results are disk-cached in `data/fixture_cache/` to avoid redundant API calls.
4. **CLI** (`cli.py`) — Ties everything together. Handles the `--classifier` and `--dump` flags; renders ANSI-colored comparison tables for side-by-side link-vs-image output.
5. **REST API** (`api/`) — FastAPI app exposing a single endpoint. Runs the full pipeline (screenshot → both classifiers → resolve) for a given URL and returns the verdict. Classifiers and the resolver are instantiated once at startup (`app.state`) and reused across requests.
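The resolver's fixture matching (stage 3) can be sketched as follows. The function names, the linear date decay, and the 3-day window default are illustrative assumptions, not the project's actual scoring:

```python
from datetime import datetime
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; case-insensitive so "Sparta Praha" matches "SPARTA PRAHA".
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def date_proximity(fixture_date: datetime, bet_date: datetime, window_days: int = 3) -> float:
    # 1.0 for an exact date match, decaying linearly to 0.0 at the window edge.
    delta_days = abs((fixture_date - bet_date).total_seconds()) / 86400
    return max(0.0, 1.0 - delta_days / window_days)
```

A best-fixture search would then score every candidate fixture in the date window by combining both values and keep the highest-scoring pair of team names.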
### Core Domain Models (`datamodels/ticket.py`)

`Bet` is an abstract Pydantic dataclass with a `resolve(MatchInfo) -> BetOutcome` method. Concrete subtypes include `WinDrawLose`, `WinDrawLoseDouble`, `WinLose`, `BothTeamScored`, `GoalAmount`, `GoalHandicap`, `HalfTimeResult`, `HalfTimeDouble`, `HalfTimeFullTime`, `CornerAmount`, `TeamCornerAmount`, `MoreOffsides`, `Advance`, and `UnknownBet`. Adding a new bet type requires a new subclass here, detection regexes in both classifiers, and a `resolve()` implementation.
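The subtype pattern can be sketched roughly as below. This is a stdlib-only sketch: the real classes are Pydantic dataclasses, and the field names on `MatchInfo` and `BothTeamScored` are assumptions for illustration:

```python
from dataclasses import dataclass
from enum import Enum


class BetOutcome(Enum):
    WIN = "WIN"
    LOSE = "LOSE"
    VOID = "VOID"
    UNKNOWN = "UNKNOWN"


@dataclass
class MatchInfo:
    # Hypothetical shape; the real MatchInfo carries full fixture statistics.
    home_goals: int
    away_goals: int


@dataclass
class BothTeamScored:
    bet_on_yes: bool  # assumed field: did the bettor pick "both teams score"?

    def resolve(self, info: MatchInfo) -> BetOutcome:
        both_scored = info.home_goals > 0 and info.away_goals > 0
        return BetOutcome.WIN if both_scored == self.bet_on_yes else BetOutcome.LOSE
```

Adding a new bet type then means writing one such subclass plus its detection regexes in both classifiers.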
### REST API

**Endpoint:** `POST /api/v1/resolve`

```json
{ "url": "<fortuna ticket url>", "debug": false }
```
Response includes `verdict` and per-bet `outcome`/`fixture_id`/`confidence`. With `debug: true` it also returns the raw `link_ticket`, `img_ticket`, and per-bet `match_info`.

The ticket ID is derived as `md5(url) % 10^9`, so it is stable across restarts. Screenshots are saved to `data/screenshots/{ticket_id}.png`.
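One way to read the `md5(url) % 10^9` derivation is the following sketch (the exact byte handling in the project may differ):

```python
import hashlib


def ticket_id(url: str) -> int:
    # md5 is used here for stability across restarts, not for security;
    # the modulo keeps the id to at most 9 digits.
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16) % 10**9
```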
**Environment variables** (all optional):

| Var | Default |
|---|---|
| `BEAKY_CONFIG` | `config/application.yml` |
| `BEAKY_HOST` | `0.0.0.0` |
| `BEAKY_PORT` | `8000` |
| `LOG_LEVEL` | value from `config/application.yml` → `api.log_level` |

OpenAPI docs are available at `/docs` while the server is running.
### Logging

All modules use `logging` (no `print()`); only the CLI's user-facing output (`cli.py`) still uses `print`. Resolver debug output (fixture matching, API calls) goes through `_ansi.log()`, which emits at `DEBUG` level with ANSI colors preserved. Set `api.log_level: DEBUG` in `config/application.yml` (or the `LOG_LEVEL=DEBUG` env var) to see it.
### Configuration

Config is loaded from `config/application.yml` into Pydantic dataclasses (`Config`, `ScreenshotterConfig`, `ResolverConfig`, `ImgClassifierConfig`, `ApiConfig`). Key fields:

- `path` — path to the input Excel file
- `resolver.api_key` — api-sports.io API key
- `resolver.league_map` — maps Czech league-name patterns to API league IDs (longest match wins)
- `resolver.cache_path` — disk cache directory (default: `data/fixture_cache`)
- `api.log_level` — logging level for the API server (default: `INFO`)
### Bet text language

All bet-type strings are in Czech (from the Fortuna betting platform). Regex patterns in both classifiers match Czech text (e.g. `"Výsledek zápasu"`, `"Počet gólů"`).
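Detection along those lines might look like this sketch; the pattern table and helper are illustrative, the real patterns live in the two classifiers:

```python
import re

# Illustrative only: maps a Czech bet-type phrase to the bet class it signals.
BET_TYPE_PATTERNS = {
    "WinDrawLose": re.compile(r"Výsledek zápasu"),  # "Match result"
    "GoalAmount": re.compile(r"Počet gólů"),        # "Number of goals"
}


def detect_bet_type(text: str) -> str:
    for bet_type, pattern in BET_TYPE_PATTERNS.items():
        if pattern.search(text):
            return bet_type
    return "UnknownBet"
```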
config/application.yml

@@ -64,3 +64,7 @@ img_classifier:
   target_path: data/screenshots/

 log_level: INFO # set to DEBUG to see raw classifier and resolver output
+
+api:
+  host: 0.0.0.0
+  port: 8000
pyproject.toml

@@ -16,6 +16,9 @@ dependencies = [
     "playwright==1.58.0",
     "requests>=2.32.0",
     "diskcache>=5.6",
+    "pytesseract==0.3.13",
+    "fastapi>=0.115",
+    "uvicorn[standard]>=0.34",
 ]

 [project.optional-dependencies]

@@ -30,6 +33,7 @@ dev = [

 [project.scripts]
 beaky = "beaky.cli:main"
+beaky-api = "beaky.api.main:main"


 [tool.ruff]
beaky-backend/src/beaky/cli.py

@@ -3,11 +3,8 @@ import re as _re
 import shutil
 from datetime import datetime

-import yaml
-from pydantic import ValidationError
-
 from beaky import _ansi
-from beaky.config import Config
+from beaky.config import load_config
 from beaky.datamodels.ticket import Bet, Ticket
 from beaky.image_classifier.classifier import img_classify
 from beaky.link_classifier.classifier import LinkClassifier

@@ -191,16 +188,6 @@ def _print_dump(ticket: Ticket, label: str) -> None:
             print(f"    {k}: {val}")


-def load_config(path: str) -> Config | None:
-    with open(path) as f:
-        config_dict = yaml.safe_load(f)
-    try:
-        return Config(**config_dict)
-    except ValidationError as e:
-        print("Bad config")
-        print(e)
-        return None
-
 def main() -> None:
     parser = argparse.ArgumentParser(prog="beaky")
     parser.add_argument("--config", help="Path to config file.", default="config/application.yml")

@@ -212,8 +199,10 @@ def main() -> None:
                         help="Dump all bet fields untruncated (compare mode only).")

     args = parser.parse_args()
-    config = load_config(args.config)
-    if config is None:
+    try:
+        config = load_config(args.config)
+    except RuntimeError as e:
+        print(e)
         return

     # always load testing data, we will modify that later
beaky-backend/src/beaky/config.py (new file, 34 lines)

from dataclasses import field as _field

import yaml
from pydantic import ValidationError
from pydantic.dataclasses import dataclass

from beaky.image_classifier.config import ImgClassifierConfig
from beaky.resolvers.config import ResolverConfig
from beaky.screenshotter.config import ScreenshotterConfig


def load_config(path: str) -> "Config":
    with open(path) as f:
        data = yaml.safe_load(f)
    try:
        return Config(**data)
    except ValidationError as exc:
        raise RuntimeError(f"Invalid config at {path}: {exc}") from exc


@dataclass
class ApiConfig:
    host: str = "0.0.0.0"
    port: int = 8000


@dataclass
class Config:
    path: str
    screenshotter: ScreenshotterConfig
    resolver: ResolverConfig
    img_classifier: ImgClassifierConfig
    log_level: str = "INFO"
    api: ApiConfig = _field(default_factory=ApiConfig)
beaky-backend/src/beaky/resolvers/resolver.py

@@ -179,7 +179,8 @@ class TicketResolver:
         if cache_key not in self._fixture_cache:
             if cache_key in self._disk_cache and not cache_may_be_stale:
                 self._fixture_cache[cache_key] = self._disk_cache[cache_key]
-                _ansi.log(_ansi.gray(f" │ /fixtures served from disk cache ({len(self._fixture_cache[cache_key])} fixtures)"))
+                _ansi.log(
+                    _ansi.gray(f" │ /fixtures served from disk cache ({len(self._fixture_cache[cache_key])} fixtures)"))
             else:
                 date_from = (center - timedelta(days=_DATE_WINDOW)).strftime("%Y-%m-%d")
                 date_to = (center + timedelta(days=_DATE_WINDOW)).strftime("%Y-%m-%d")

@@ -197,7 +198,8 @@ class TicketResolver:
                     self._disk_cache[cache_key] = cacheable
                     _ansi.log(_ansi.gray(f" │ {len(cacheable)} non-NS fixture(s) written to disk cache"))
         else:
-            _ansi.log(_ansi.gray(f" │ /fixtures (±{_DATE_WINDOW}d of {date_str}, league={league_id}) served from memory"))
+            _ansi.log(
+                _ansi.gray(f" │ /fixtures (±{_DATE_WINDOW}d of {date_str}, league={league_id}) served from memory"))

         fixture, name_match, date_prox = _best_fixture_match(
             self._fixture_cache[cache_key], bet.team1Name, bet.team2Name, center

@@ -227,7 +229,8 @@ class TicketResolver:
         if results:
             league_id = results[0]["league"]["id"]
             league_found_name = results[0]["league"]["name"]
-            _ansi.log(_ansi.gray(f" │ matched {league_found_name!r} id={league_id} (API fallback, confidence=0.7)"))
+            _ansi.log(
+                _ansi.gray(f" │ matched {league_found_name!r} id={league_id} (API fallback, confidence=0.7)"))
             self._league_cache[key] = (league_id, 0.7)
             return league_id, 0.7
beaky-frontend/.gitkeep (new empty file)
(deleted file, 88 lines: helper script that converts .txt files of URLs and timestamps into an Excel sheet)

import os
import re
import sys
import argparse
from datetime import datetime

import pytz
from openpyxl import Workbook


def process_files(starting_id, output_filename="output.xlsx"):
    # Find all txt files in the current directory
    txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]

    if not txt_files:
        print("No .txt files found in the current directory.")
        return

    # Regex patterns for input data
    date_pattern = re.compile(r'\[.*?(\d{1,2})\s+(\d{1,2}),\s+(\d{4})\s+at\s+(\d{1,2}:\d{2})\]')
    url_pattern = re.compile(r'(https?://[^\s]+)')

    # Timezone setup (CET to UTC)
    local_tz = pytz.timezone("Europe/Prague")

    # Set up the Excel workbook
    wb = Workbook()
    ws = wb.active
    ws.title = "Fortuna Data"
    ws.append(["ID", "URL", "Date_UTC"])  # Add headers

    current_id = starting_id
    success_files = []

    for filename in txt_files:
        try:
            with open(filename, 'r', encoding='utf-8') as f:
                content = f.read()

            dates = date_pattern.findall(content)
            urls = url_pattern.findall(content)

            # Extract and format the data
            for i in range(min(len(dates), len(urls))):
                month, day, year, time_str = dates[i]

                # Parse the datetime from the text file
                dt_str = f"{year}-{month}-{day} {time_str}"
                local_dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M")

                # Convert CET to UTC
                localized_dt = local_tz.localize(local_dt)
                utc_dt = localized_dt.astimezone(pytz.utc)

                # Format to ISO 8601 with T and Z
                formatted_date = utc_dt.strftime("%Y-%m-%dT%H:%M:%SZ")

                # Add a new row to the Excel sheet
                ws.append([current_id, urls[i], formatted_date])
                current_id += 1

            # Queue file for deletion
            success_files.append(filename)

        except Exception as e:
            print(f"Error processing {filename}: {e}", file=sys.stderr)

    # Save the Excel file
    try:
        wb.save(output_filename)
        print(f"Successfully saved data to {output_filename}")

        # Clean up only if save was successful
        for filename in success_files:
            os.remove(filename)
            print(f"Deleted: {filename}")

    except Exception as e:
        print(f"Failed to save {output_filename}. No text files were deleted. Error: {e}", file=sys.stderr)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Extract URLs to an Excel file with ISO UTC dates.")
    parser.add_argument("start_id", type=int, help="Starting ID for the output")
    parser.add_argument("--output", type=str, default="extracted_data.xlsx",
                        help="Output Excel filename (default: extracted_data.xlsx)")
    args = parser.parse_args()

    process_files(args.start_id, args.output)
(deleted file, 14 lines: the previous Config dataclass)

from pydantic.dataclasses import dataclass

from beaky.image_classifier.config import ImgClassifierConfig
from beaky.resolvers.config import ResolverConfig
from beaky.screenshotter.config import ScreenshotterConfig


@dataclass
class Config:
    path: str
    screenshotter: ScreenshotterConfig
    resolver: ResolverConfig
    img_classifier: ImgClassifierConfig
    log_level: str = "INFO"