# ADR-005: YAML-based Seed Jobs at Application Startup

- Status: Accepted
- Date: 2026-03-17
- Deciders: Kamerplanter Development Team

## Context
Kamerplanter requires extensive master data (botanical families, species, cultivars, fertilizers, nutrient plans, starter kits, workflows, IPM data, activities, etc.) that must be consistently present in the database on first start and on every update. The following decisions had to be made:
- Data format: In which format are seed data maintained?
- Execution timing: When and how are seed jobs executed?
- Idempotency: How is it ensured that repeated seeding does not create duplicates?
- Enrichment process: How are new master data added and existing data extended?
## Decision

### Declarative YAML Files as Single Source of Truth
All seed data is stored as YAML files in src/backend/app/migrations/seed_data/. YAML was chosen because it is human-readable, diff-friendly (Git), and well-suited for hierarchical data. Each domain has its own file (e.g., species.yaml, plagron.yaml, starter_kits.yaml).
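To make the layout concrete, a seed file might look like the following. This is an illustrative sketch only — the field names here are hypothetical; the real structure follows the Pydantic models in domain/models/:

```yaml
# src/backend/app/migrations/seed_data/species.yaml (illustrative sketch)
species:
  - scientific_name: "Monstera deliciosa"
    family: "Araceae"
    common_names:
      - "Swiss cheese plant"
    light: "bright"
```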
### Startup Seeding in FastAPI Lifespan
Seed jobs are executed at application startup in the FastAPI lifespan hook (main.py) — not as a separate CLI command or migration step. The execution order is fixed:
1. `ensure_collections()` — Create collections and graph
2. `seed_location_types()` — Location types
3. `run_seed()` — Core master data (families, species, cultivars, IPM, workflows)
4. `run_seed_starter_kits()` — Onboarding starter kits
5. `run_seed_adventskalender()` — Seasonal kits
6. `run_seed_plant_info()` / `run_seed_plant_info_extended()` — Extended plant data
7. `run_seed_plagron()` / `run_seed_gardol()` — Product-specific fertilization plans
8. `run_seed_nutrient_plans_outdoor()` — Outdoor nutrient plans
9. `run_seed_activities()` — Activity definitions
10. `run_seed_lifecycles_outdoor()` — Outdoor lifecycles
11. Conditional: `run_seed_light_mode()` — only when `KAMERPLANTER_MODE=light`
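The ordering logic can be sketched as a lifespan context manager. This is a stdlib-only illustration with stand-in job bodies — the real implementations live in migrations/seed_*.py and are wired into the FastAPI app in main.py:

```python
import asyncio
from contextlib import asynccontextmanager

executed = []

# Stand-in seed jobs; the real functions live in migrations/seed_*.py.
def ensure_collections():
    executed.append("ensure_collections")

def seed_location_types():
    executed.append("seed_location_types")

def run_seed():
    executed.append("run_seed")

def run_seed_light_mode():
    executed.append("run_seed_light_mode")

MODE = "full"  # stands in for the KAMERPLANTER_MODE environment variable

@asynccontextmanager
async def lifespan(app):
    # Fixed order: collections first, then master data that later jobs reference.
    ensure_collections()
    seed_location_types()
    run_seed()
    if MODE == "light":
        run_seed_light_mode()  # conditional seed job
    yield  # the application serves requests while the context is open

async def main():
    # FastAPI would drive this context manager itself via FastAPI(lifespan=lifespan).
    async with lifespan(app=None):
        pass

asyncio.run(main())
```

Because the jobs run inside the lifespan, they complete before the first request is served — which is exactly why an incomplete seed blocks startup (see "Why no Transaction Rollback?" below).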
### Idempotency via Lookup-before-Create
Each seed job checks whether a record already exists (by scientific_name, kit_id, product_name, etc.) before creating it. Four patterns are used:
- Lookup + Create/Update: Existence check by unique field, then insert or selective update of defined fields
- Selective Field Update: only predefined `seed_update_fields` are overwritten — user-defined changes to other fields are preserved
- Backfill Missing: count existing entries and add the missing ones (e.g., nutrient plan phases)
- Exception-based: try/except around graph-edge inserts that raise an error on duplicates
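A minimal sketch of the first two patterns combined (lookup-before-create with selective field update), using a plain dict as a stand-in for the ArangoDB collection; the field names are illustrative:

```python
# Only these fields may be overwritten on re-seed; everything else belongs
# to the user once the document exists. (Illustrative field names.)
SEED_UPDATE_FIELDS = {"family", "light_requirement"}

db = {}  # stand-in for an ArangoDB collection: scientific_name -> document

def seed_species(doc: dict) -> str:
    key = doc["scientific_name"]
    existing = db.get(key)
    if existing is None:
        db[key] = dict(doc)
        return "created"
    # Selective update: user edits to non-seed fields survive re-seeding.
    changed = False
    for field in SEED_UPDATE_FIELDS:
        if field in doc and existing.get(field) != doc[field]:
            existing[field] = doc[field]
            changed = True
    return "updated" if changed else "skipped"

seed_species({"scientific_name": "Monstera deliciosa", "family": "Araceae"})
db["Monstera deliciosa"]["nickname"] = "Monty"   # user-defined field
# Re-running the same seed neither duplicates the record nor touches "nickname".
seed_species({"scientific_name": "Monstera deliciosa", "family": "Araceae"})
```

The returned action string (`created` / `updated` / `skipped`) is what the structured logging described under "Testing" reports per record.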
### Reference Resolution via Intermediate Maps
YAML files use human-readable names (scientific_name, product_name). During seeding, intermediate maps are built (name -> _key) that subsequent steps use to resolve references (e.g., species_names in starter kits -> species_keys).
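A sketch of this two-step resolution — one seed step builds the map, the next consumes it (document shapes and keys are illustrative):

```python
# Documents as they exist after the species seed step has run.
species_docs = [
    {"_key": "sp-001", "scientific_name": "Monstera deliciosa"},
    {"_key": "sp-002", "scientific_name": "Ficus lyrata"},
]

# Intermediate map: human-readable name -> database _key.
species_map = {d["scientific_name"]: d["_key"] for d in species_docs}

# A starter kit as read from starter_kits.yaml — it references species by name.
starter_kit_yaml = {
    "kit_id": "beginner-01",
    "species_names": ["Monstera deliciosa", "Ficus lyrata"],
}

# Resolve the names to _keys for the document that is actually stored.
starter_kit_doc = {
    "kit_id": starter_kit_yaml["kit_id"],
    "species_keys": [species_map[n] for n in starter_kit_yaml["species_names"]],
}
print(starter_kit_doc["species_keys"])  # prints ['sp-001', 'sp-002']
```

This is also why the startup order matters: the map can only be built after the referenced collection has been seeded.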
### Enrichment Process for Seed Data
The process for adding or extending master data follows a fixed schema:
#### 1. Edit or Create YAML File
New data is added to the appropriate YAML file in seed_data/. For new domains, a new file is created. The structure follows the Pydantic models in domain/models/.
#### 2. Extend Pydantic Model (if needed)
If new fields are required, the Pydantic model in domain/models/ is extended. Pydantic v2 automatically handles coercion from YAML strings to enums, lists, etc.
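A small sketch of this coercion, assuming a hypothetical Species model (Pydantic is the library named above; the model and field names are made up for illustration):

```python
from enum import Enum
from pydantic import BaseModel

class LightLevel(str, Enum):
    LOW = "low"
    BRIGHT = "bright"

class Species(BaseModel):
    scientific_name: str
    light: LightLevel            # YAML delivers a plain string here
    common_names: list[str] = []

# What a YAML loader hands over: plain dicts, strings, and lists.
raw = {
    "scientific_name": "Monstera deliciosa",
    "light": "bright",
    "common_names": ["Swiss cheese plant"],
}

species = Species(**raw)
# species.light is now the LightLevel.BRIGHT enum member, not the string.
```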
#### 3. Adapt Seed Function
The corresponding seed function in migrations/seed_*.py is extended:
- Add new fields to `seed_update_fields` (so existing records get updated)
- Add new reference resolutions (if the new field references other entities)
- `yaml_loader.load_yaml()` uses the same mechanism for all files
#### 4. Observe Startup Order
When a new seed file is created, it must be hooked into main.py in the correct order — dependencies (e.g., species before starter kits) determine the position.
#### 5. Enrichment via Agents
For initial creation and extension of plant data, the plant-info-document-generator agent is available. It researches botanical data and produces structured documents that are then transferred into the YAML files.
#### 6. Testing
A restart of the application automatically triggers all seed jobs. Structured logging (structlog) records every action (created/updated/skipped) with identifiers.
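The per-action log lines can be pictured like this. The real application uses structlog; this stdlib-logging sketch only mimics the key=value shape of its output, and the job and identifier values are made up:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("seed")

def log_action(job: str, action: str, identifier: str) -> None:
    # structlog would bind these as structured key/value pairs instead.
    log.info("seed_job=%s action=%s id=%s", job, action, identifier)

# Tally what a seed run did, per action.
summary = {"created": 0, "updated": 0, "skipped": 0}

for name, action in [("Monstera deliciosa", "created"),
                     ("Ficus lyrata", "skipped")]:
    log_action("run_seed", action, name)
    summary[action] += 1
```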
## Rationale

### Why YAML and not SQL Migrations, JSON, or CSV?
- SQL Migrations (Alembic-style) are not suitable for ArangoDB as a document database
- JSON is less readable and harder to diff than YAML for deeply nested structures
- CSV cannot represent hierarchical data (nested phases, dosage lists)
- YAML is the natural compromise: machine-readable, human-readable, Git-diff-friendly
### Why Startup and not Separate Migrations?
- Simplicity: No separate migration command needed, no forgotten step during deployment
- Always consistent: Every startup guarantees complete master data
- Idempotency: Repeated execution is safe — no state tracking (no migration table) needed
- Kubernetes-ready: Pods can restart at any time; seed jobs are part of the startup lifecycle
### Why no Transaction Rollback?
- ArangoDB transactions across many collections are complex and limited
- Partial seeding is acceptable: idempotency ensures that a restart fills in the rest
- Startup blocks on errors anyway — an incomplete seed leads to a pod restart
## Consequences

### Positive
- Master data is versioned and reviewable (Git)
- Simple onboarding process: `git pull` + restart = current data
- Clear separation: YAML = data, Python = orchestration
- Extensible: a new seed file plus a hook in `main.py` is sufficient
- Observable: structured logging shows exactly what was seeded
### Negative
- Startup time increases with growing data volume (currently ~2-3s, acceptable)
- No atomic rollback on partial failure (compensated by idempotency)
- Order of seed jobs must be maintained manually (dependency graph is implicit)
- `seed_update_fields` must be manually extended for new fields — otherwise existing records are not updated