ADR-005: YAML-based Seed Jobs at Application Startup

Status: Accepted
Date: 2026-03-17
Deciders: Kamerplanter Development Team

Context

Kamerplanter requires extensive master data (botanical families, species, cultivars, fertilizers, nutrient plans, starter kits, workflows, IPM data, activities, etc.) that must be consistently present in the database on first start and on every update. The following decisions had to be made:

Data format: In which format is seed data maintained?
Execution timing: When and how are seed jobs executed?
Idempotency: How is it ensured that repeated seeding does not create duplicates?
Enrichment process: How is new master data added and existing data extended?

Decision

Declarative YAML Files as Single Source of Truth

All seed data is stored as YAML files in src/backend/app/migrations/seed_data/. YAML was chosen because it is human-readable, diff-friendly (Git), and well-suited for hierarchical data. Each domain has its own file (e.g., species.yaml, plagron.yaml, starter_kits.yaml).

Startup Seeding in FastAPI Lifespan

Seed jobs are executed at application startup in the FastAPI lifespan hook (main.py) — not as a separate CLI command or migration step. The execution order is fixed:

ensure_collections() — Create collections and graph
seed_location_types() — Location types
run_seed() — Core master data (families, species, cultivars, IPM, workflows)
run_seed_starter_kits() — Onboarding starter kits
run_seed_adventskalender() — Seasonal kits
run_seed_plant_info() / run_seed_plant_info_extended() — Extended plant data
run_seed_plagron() / run_seed_gardol() — Product-specific fertilization plans
run_seed_nutrient_plans_outdoor() — Outdoor nutrient plans
run_seed_activities() — Activity definitions
run_seed_lifecycles_outdoor() — Outdoor lifecycles
Conditional: run_seed_light_mode() — Only when KAMERPLANTER_MODE=light
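
The fixed order above can be sketched as a lifespan context manager. This is a minimal illustration, not the project's main.py: the job bodies are stubs that only record their names, and it assumes the jobs are plain synchronous callables and that the two slash-separated pairs run in the order listed. The `make_job` helper, the `calls` list, and `simulate_startup` are hypothetical names introduced for the example.

```python
import asyncio
import os
from contextlib import asynccontextmanager

calls: list[str] = []  # records the order in which the stub jobs ran


def make_job(name: str):
    """Return a stub seed job that only records its own name."""
    def job() -> None:
        calls.append(name)
    job.__name__ = name
    return job


# Fixed order taken from the list above; the job bodies are placeholders.
SEED_JOBS = [make_job(name) for name in (
    "ensure_collections",
    "seed_location_types",
    "run_seed",
    "run_seed_starter_kits",
    "run_seed_adventskalender",
    "run_seed_plant_info",
    "run_seed_plant_info_extended",
    "run_seed_plagron",
    "run_seed_gardol",
    "run_seed_nutrient_plans_outdoor",
    "run_seed_activities",
    "run_seed_lifecycles_outdoor",
)]
run_seed_light_mode = make_job("run_seed_light_mode")


@asynccontextmanager
async def lifespan(app):
    """Run all seed jobs before the application starts serving requests."""
    for job in SEED_JOBS:
        job()
    # Conditional job: only when KAMERPLANTER_MODE=light
    if os.environ.get("KAMERPLANTER_MODE") == "light":
        run_seed_light_mode()
    yield  # the application handles requests while suspended here


async def simulate_startup() -> None:
    """Enter and leave the lifespan once, as a server would on start/stop."""
    async with lifespan(app=None):
        pass
```

In a FastAPI application, a function like this would be passed as `FastAPI(lifespan=lifespan)`; an exception raised by any job before the `yield` aborts startup.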

Idempotency via Lookup-before-Create

Each seed job checks whether a record already exists (by scientific_name, kit_id, product_name, etc.) before creating it. Four patterns are used:

Lookup + Create/Update: Existence check by unique field, then insert or selective update of defined fields
Selective Field Update: Only predefined seed_update_fields are overwritten — user-defined changes to other fields are preserved
Backfill Missing: Count existing entries, add missing ones (e.g., nutrient plan phases)
Exception-based: try/except around graph edge inserts that raise an error on duplicate
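
The first two patterns can be illustrated with an in-memory stand-in for a collection. This is a sketch under stated assumptions, not the project's seed code: the `collection` dict replaces the database, `upsert_species` is a hypothetical helper, and the contents of `SEED_UPDATE_FIELDS` are chosen only for the example.

```python
from typing import Any

# Stand-in for a database collection: scientific_name -> stored document.
collection: dict[str, dict[str, Any]] = {}

# Only these fields are overwritten on existing records (illustrative choice).
SEED_UPDATE_FIELDS = ("family", "common_names")


def upsert_species(seed: dict[str, Any]) -> str:
    """Lookup-before-create with selective update; returns the action taken."""
    existing = collection.get(seed["scientific_name"])
    if existing is None:
        collection[seed["scientific_name"]] = dict(seed)
        return "created"
    # Collect only seed-owned fields whose value differs from the stored one.
    changes = {
        field: seed[field]
        for field in SEED_UPDATE_FIELDS
        if field in seed and existing.get(field) != seed[field]
    }
    if not changes:
        return "skipped"
    existing.update(changes)  # fields outside SEED_UPDATE_FIELDS stay untouched
    return "updated"
```

Running the same seed twice yields "created" and then "skipped", and a field edited by the user that is not listed in `SEED_UPDATE_FIELDS` survives a later "updated" run. The returned action string corresponds to the created/updated/skipped events recorded in the logs.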

Reference Resolution via Intermediate Maps

YAML files use human-readable names (scientific_name, product_name). During seeding, intermediate maps are built (name -> _key) that subsequent steps use to resolve references (e.g., species_names in starter kits -> species_keys).
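
A minimal sketch of this resolution step follows. The helper names `build_key_map` and `resolve_species_keys` are hypothetical, and raising on an unknown name is an assumption made for the example; the actual seed functions may handle missing references differently.

```python
def build_key_map(docs: list[dict], name_field: str) -> dict[str, str]:
    """Map a human-readable name to the stored document's _key."""
    return {doc[name_field]: doc["_key"] for doc in docs}


def resolve_species_keys(kit: dict, species_keys_by_name: dict[str, str]) -> dict:
    """Return a copy of the kit with species_names replaced by species_keys."""
    unknown = [n for n in kit["species_names"] if n not in species_keys_by_name]
    if unknown:
        # Assumption: an unresolved reference is treated as an error.
        raise ValueError(f"unknown species in kit {kit['kit_id']}: {unknown}")
    resolved = {k: v for k, v in kit.items() if k != "species_names"}
    resolved["species_keys"] = [species_keys_by_name[n] for n in kit["species_names"]]
    return resolved
```

The map is built once from the records seeded in an earlier step and then reused by every later step that references those records by name.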

Enrichment Process for Seed Data

The process for adding or extending master data follows a fixed sequence of steps:

1. Edit or Create YAML File

New data is added to the appropriate YAML file in seed_data/. For new domains, a new file is created. The structure follows the Pydantic models in domain/models/.

2. Extend Pydantic Model (if needed)

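If new fields are required, the Pydantic model in domain/models/ is extended. In its default (lax) mode, Pydantic v2 coerces the values parsed from YAML into the declared field types (e.g., strings into enum members), so the YAML files can stay free of type annotations.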

3. Adapt Seed Function

The corresponding seed function in migrations/seed_*.py is extended:

Add new fields to seed_update_fields (so existing records get updated)
Add new reference resolutions (if the new field references other entities)
Load new YAML files via yaml_loader.load_yaml(), the same mechanism the existing seed functions use

4. Observe Startup Order

When a new seed file is created, it must be hooked into main.py in the correct order — dependencies (e.g., species before starter kits) determine the position.
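
One way to sanity-check the position of a new job is to write its dependencies down and compare them against the startup order. The sketch below is not part of the current implementation, where the order is maintained by hand: `DEPENDS_ON` and `order_violations` are hypothetical, and apart from "species before starter kits" the listed dependencies are illustrative assumptions.

```python
# Hypothetical dependency declaration: job name -> jobs that must run earlier.
DEPENDS_ON = {
    "run_seed": ["ensure_collections"],
    "run_seed_starter_kits": ["run_seed"],  # species before starter kits
    "run_seed_plant_info": ["run_seed"],
}


def order_violations(
    order: list[str], depends_on: dict[str, list[str]]
) -> list[tuple[str, str]]:
    """Return (job, dependency) pairs where the dependency does not run first."""
    position = {name: index for index, name in enumerate(order)}
    violations = []
    for job, deps in depends_on.items():
        if job not in position:
            continue  # job is not part of this startup order
        for dep in deps:
            if dep not in position or position[dep] > position[job]:
                violations.append((job, dep))
    return violations
```

An empty result means every declared dependency runs before the job that needs it; a non-empty result names the pairs that are out of order or missing.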

5. Enrichment via Agents

For initial creation and extension of plant data, the plant-info-document-generator agent is available. It researches botanical data and produces structured documents that are then transferred into the YAML files.

6. Testing

A restart of the application automatically triggers all seed jobs. Structured logging (structlog) records every action (created/updated/skipped) with identifiers.

Rationale

Why YAML and not SQL Migrations, JSON, or CSV?

SQL Migrations (Alembic-style) are not suitable for ArangoDB as a document database
JSON is less readable and harder to diff than YAML for deeply nested structures
CSV cannot represent hierarchical data (nested phases, dosage lists)
YAML is the natural compromise: machine-readable, human-readable, Git-diff-friendly

Why Startup and not Separate Migrations?

Simplicity: No separate migration command needed, no forgotten step during deployment
Always consistent: Every startup guarantees complete master data
Idempotency: Repeated execution is safe — no state tracking (no migration table) needed
Kubernetes-ready: Pods can restart at any time; seed jobs are part of the startup lifecycle

Why no Transaction Rollback?

ArangoDB transactions across many collections are complex and limited
Partial seeding is acceptable: idempotency ensures that a restart fills in the rest
Startup blocks on errors anyway — an incomplete seed leads to a pod restart
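
The argument that a restart fills in the rest can be made concrete with a small simulation. This is a sketch with an in-memory set standing in for the database; `seed_all` and its `fail_on` parameter are hypothetical and exist only to inject a failure partway through a run.

```python
from typing import Optional

store: set[str] = set()  # stand-in for the database


def seed_all(records: list[str], fail_on: Optional[str] = None) -> int:
    """Insert missing records one by one; returns how many were created."""
    created = 0
    for record in records:
        if record == fail_on:
            # Simulated crash: records inserted so far stay in the store.
            raise RuntimeError(f"simulated failure while seeding {record}")
        if record not in store:  # lookup-before-create
            store.add(record)
            created += 1
    return created
```

If the first run fails partway, the records written so far remain. The next run skips them, creates only the missing ones, and a further run creates nothing.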

Consequences

Positive

Master data is versioned and reviewable (Git)
Simple onboarding process: git pull + restart = current data
Clear separation: YAML = data, Python = orchestration
Extensible: new seed file + hook into main.py is sufficient
Observable: structured logging shows exactly what was seeded

Negative

Startup time increases with growing data volume (currently ~2-3s, acceptable)
No atomic rollback on partial failure (compensated by idempotency)
Order of seed jobs must be maintained manually (dependency graph is implicit)
seed_update_fields must be manually extended for new fields — otherwise existing records are not updated