Reset Choreography
Choreography-driven, event-based system-state reset across Dashboard.Api and its optional components (fetcher, demo-driver). The API is the orchestrator and single source of truth; components react to control-stream events and report back via POST /api/control/events.
Authoritative behaviour: API_SPECIFICATION.md §10. This document is the visual reference only — wire shapes live in openapi.yaml.
Control-stream event vocabulary
| type | Emitted when | component | Effect on components |
|---|---|---|---|
| reset-initiated | POST /api/control/reset accepted | * | Drain: stop fetching/ingesting, block own API + UI, then ack paused. |
| reset-started | Acks in OR timeout elapsed | * | Reset window opened; ingest briefly returns 503. |
| reset-completed | Data cleared, gates released | * | Recover: clear state, re-ingest/backfill, unblock, report running. |
correlation_id — the universal process id (#265). It originates at reset-initiated (where correlation_id == id) and is carried by every downstream command frame (reset-started / reset-completed) and component event (reset-ack, post-reset status), so the whole saga shares one filterable key. Each event keeps its own unique id (PK / SSE cursor), distinct from correlation_id; there is no reset_id field.
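The correlation rule can be illustrated with a minimal sketch. The `id` and `correlation_id` fields come from the text above; the `new_event` and `saga` helpers, the single `type` key used for every event, and the use of `uuid4` as a stand-in for uuidv7 are assumptions made for illustration, not the API's actual wire shape (that lives in openapi.yaml).

```python
import uuid


def new_event(event_type, correlation_id=None):
    """Build an event dict; omitting correlation_id marks the saga root."""
    event_id = str(uuid.uuid4())  # stand-in only: the design uses uuidv7
    return {
        "id": event_id,  # always unique per event
        "type": event_type,
        # At reset-initiated the correlation_id equals the event's own id.
        "correlation_id": correlation_id or event_id,
    }


def saga(events, correlation_id):
    """Return every event that belongs to one reset cycle."""
    return [e for e in events if e["correlation_id"] == correlation_id]


root = new_event("reset-initiated")
cid = root["correlation_id"]
log = [
    root,
    new_event("reset-started", cid),
    new_event("reset-ack", cid),
    new_event("reset-completed", cid),
    new_event("reset-initiated"),  # an unrelated, later cycle
]
cycle = saga(log, cid)  # the four events of the first cycle, in order
```

Filtering on `correlation_id` rather than on `id` is what lets one key select command frames and component events alike.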
Acks are POST /api/control/events with event_type: reset-ack, state: paused, and the required header X-Correlation-Id: <reset-initiated id>. The orchestrator gates on correlation_id (the stored header value, matched against the in-flight cycle). A reset-ack with a missing/mismatched correlation_id is recorded but does not count toward the gate (see API_SPECIFICATION.md §7 Channel 3).
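As a rough sketch of the gating rule, the two functions below decide whether a recorded ack counts and whether the gate is open. The `event_type`, `state`, and `correlation_id` names follow the text; the `component` field on the stored ack and both function names are hypothetical, and the real orchestrator may structure this differently.

```python
def counts_toward_gate(ack, in_flight_correlation_id, expected_components):
    """True only for a paused reset-ack from an expected component in this cycle."""
    return (
        ack.get("event_type") == "reset-ack"
        and ack.get("state") == "paused"
        and ack.get("component") in expected_components  # hypothetical field
        and ack.get("correlation_id") == in_flight_correlation_id
    )


def gate_satisfied(acks, in_flight_correlation_id, expected_components):
    """The gate opens once every expected component has a counting ack."""
    acked = {
        a["component"]
        for a in acks
        if counts_toward_gate(a, in_flight_correlation_id, expected_components)
    }
    return acked == set(expected_components)


expected = {"dashboard-fetcher", "demo-driver"}
acks = [
    {"event_type": "reset-ack", "state": "paused",
     "component": "dashboard-fetcher", "correlation_id": "c1"},
    # Recorded, but ignored by the gate: the header was missing.
    {"event_type": "reset-ack", "state": "paused",
     "component": "demo-driver", "correlation_id": None},
]
open_before = gate_satisfied(acks, "c1", expected)
acks.append({"event_type": "reset-ack", "state": "paused",
             "component": "demo-driver", "correlation_id": "c1"})
open_after = gate_satisfied(acks, "c1", expected)
```

Keeping the mismatched ack in the list mirrors the "recorded but does not count" behaviour: nothing is dropped, the gate just does not see it.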
Sequence
```mermaid
sequenceDiagram
    autonumber
    actor Op as Operator
    participant API as Dashboard.Api
    participant DB as PostgreSQL
    participant F as Fetcher
    participant D as Demo Driver
    Note over F,D: Both hold an open GET /api/control/stream
    Op->>API: POST /api/control/reset
    activate API
    Note over API: idle → draining<br/>advisory lock · correlation_id = reset-initiated id (uuidv7)
    API->>DB: emit reset-initiated {id, correlation_id: id, component:"*"}
    API-->>Op: 202 Accepted {correlation_id, state: draining}
    deactivate API
    Note over API: reset → 409 · ingest STILL OPEN
    DB-->>F: reset-initiated
    DB-->>D: reset-initiated
    par Fetcher drains
        F->>F: stop poll loop + ingestion
        F->>API: POST /api/control/events {reset-ack, paused} · X-Correlation-Id
    and Demo driver drains
        D->>D: block /demo/ API · stop emit · UI banner
        D->>API: POST /api/control/events {reset-ack, paused} · X-Correlation-Id
    end
    Note over API: orchestrator counts acks for correlation_id
    alt both acks in (fetcher + demo-driver)
        DB-->>API: ack(fetcher) + ack(demo-driver)
    else AckTimeoutSeconds (default 10s) elapses
        Note over API: proceed — components optional
    end
    Note over API: draining → resetting · ingest gate ON
    API->>DB: emit reset-started {correlation_id}
    Note over API: POST /api/deployments → 503 (brief)
    API->>DB: clear deployment history + fetcher cursors
    Note over API: gates OFF · resetting → idle · reset reopens
    API->>DB: emit reset-completed {correlation_id}
    DB-->>F: reset-completed
    DB-->>D: reset-completed
    par Fetcher recovers
        F->>F: clear cursor → backfill (initial ingestion)
        F->>API: POST /api/deployments (backfill batch)
        F->>F: resume periodic poll
        F->>API: POST /api/control/events {running}
    and Demo driver recovers
        D->>D: unblock /demo/ API · clear UI banner
        D->>API: POST /api/control/events {running}
    end
```
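From a component's point of view, the sequence above reduces to two reactions: drain and ack on reset-initiated, recover and report on reset-completed. The sketch below models that with an injected `post_event` callable in place of a real HTTP client. The `Component` class, the callable's signature, and the `status` event type used for the running report are assumptions for illustration; only `reset-ack`, `paused`, `running`, and the `X-Correlation-Id` header come from this document.

```python
class Component:
    """Minimal model of a component reacting to control-stream events."""

    def __init__(self, name, post_event):
        self.name = name
        self.post_event = post_event  # hypothetical: callable(body, headers)
        self.state = "running"

    def on_control_event(self, event):
        if event["type"] == "reset-initiated":
            # Drain: stop work first, then ack with the cycle's correlation id.
            self.state = "paused"
            self.post_event(
                {"event_type": "reset-ack", "state": "paused"},
                {"X-Correlation-Id": event["correlation_id"]},
            )
        elif event["type"] == "reset-completed":
            # Recover: clear local state, resume work, report running.
            self.state = "running"
            self.post_event(
                {"event_type": "status", "state": "running"},  # name assumed
                {"X-Correlation-Id": event["correlation_id"]},
            )
        # reset-started requires no action from the component in this sketch.


posts = []
fetcher = Component("dashboard-fetcher", lambda body, headers: posts.append((body, headers)))
fetcher.on_control_event({"type": "reset-initiated", "id": "c1", "correlation_id": "c1"})
fetcher.on_control_event({"type": "reset-started", "correlation_id": "c1"})
fetcher.on_control_event({"type": "reset-completed", "correlation_id": "c1"})
```

Injecting `post_event` keeps the reaction logic testable without a running API.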
State machine (API-driven)
Implemented with the Stateless library (dotnet-state-machine/stateless). The current state is persisted externally in a DB state row and loaded per transition; a Postgres advisory lock elects a single driver across instances (NFR-05). A GateMaxTtl safety abort (configured via GateMaxTtlSeconds) prevents the system from wedging with ingestion blocked if the driving instance dies mid-cycle.
```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Draining: POST /reset → emit reset-initiated
    Draining --> Resetting: both acks OR 10s timeout<br/>emit reset-started · ingest gate ON
    Resetting --> Idle: data cleared · gates OFF<br/>emit reset-completed
    Draining --> Draining: POST /reset → 409
    Resetting --> Resetting: POST /reset → 409 · ingest → 503
    Draining --> Idle: GateMaxTtl exceeded (abort)
    Resetting --> Idle: GateMaxTtl exceeded (abort)
```
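The transitions in the diagram can be written out as a plain lookup table. This is an illustrative Python model only, not the Stateless-based implementation; the trigger names, `ResetRejected`, and `fire` are hypothetical.

```python
# (current state, trigger) -> next state, mirroring the diagram's arrows.
TRANSITIONS = {
    ("idle", "reset_requested"): "draining",
    ("draining", "acks_in_or_timeout"): "resetting",
    ("resetting", "data_cleared"): "idle",
    ("draining", "gate_max_ttl_exceeded"): "idle",
    ("resetting", "gate_max_ttl_exceeded"): "idle",
}


class ResetRejected(Exception):
    """Raised when POST /reset arrives mid-cycle (maps to HTTP 409)."""


def fire(state, trigger):
    """Return the next state, or raise if the trigger is not permitted."""
    if trigger == "reset_requested" and state != "idle":
        raise ResetRejected(state)  # the state itself does not change
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        raise ValueError(f"{trigger!r} is not permitted in state {state!r}") from None


state = "idle"
for trigger in ("reset_requested", "acks_in_or_timeout", "data_cleared"):
    state = fire(state, trigger)
# state is "idle" again after one full cycle
```

In this model the two self-loops from the diagram (POST /reset returning 409) are expressed as a rejection rather than a transition, since they leave the state unchanged.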
Decisions (locked)
| # | Decision |
|---|---|
| 1 | State machine via the Stateless library; state externally persisted (DB state row) + Postgres advisory lock for single-driver election across instances (NFR-05). |
| 2 | Proceed when both acks (fetcher + demo-driver) are in OR AckTimeoutSeconds elapses; default 10 s. |
| 3 | Reset clears only deployment_events + fetcher_state; control/component tables left to the 2 h retention job. |
| 4 | Event types reset-initiated / reset-started / reset-completed; the legacy reset type is dropped (no alias). |
| 5 | Ack = POST /api/control/events {event_type: reset-ack, state: paused} + required header X-Correlation-Id = the reset-initiated id. The ack-gate keys on correlation_id (#265, Option A — reset_id retired everywhere). A missing/mismatched correlation_id is recorded but does not count toward the gate. |
| 6 | No status endpoint — reset progress is observable via the control-stream events only. |
Config keys (appsettings + env override): AckTimeoutSeconds (default 10), ExpectedComponents (default dashboard-fetcher, demo-driver), GateMaxTtlSeconds (default 60).
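A minimal sketch of defaults with environment overrides is shown below. The key names and default values are the ones listed above; the `RESET__` prefix, the comma-separated list format, and the `load_settings` helper are hypothetical and do not describe the service's actual configuration binding.

```python
import os

DEFAULTS = {
    "AckTimeoutSeconds": 10,
    "ExpectedComponents": ["dashboard-fetcher", "demo-driver"],
    "GateMaxTtlSeconds": 60,
}


def load_settings(env=None, prefix="RESET__"):
    """Start from the defaults and apply any prefixed environment overrides."""
    env = os.environ if env is None else env
    settings = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        raw = env.get(prefix + key)  # hypothetical variable naming
        if raw is None:
            continue
        if isinstance(default, int):
            settings[key] = int(raw)
        else:
            # Lists are assumed to arrive as comma-separated strings.
            settings[key] = [part.strip() for part in raw.split(",") if part.strip()]
    return settings


overridden = load_settings({
    "RESET__AckTimeoutSeconds": "3",
    "RESET__ExpectedComponents": "dashboard-fetcher",
})
```

With these overrides the ack wait shortens to 3 seconds and only the fetcher is expected, while `GateMaxTtlSeconds` keeps its default of 60.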
GateMaxTtlSeconds enforcement. This is a hard wall-clock ceiling on the entire orchestrator cycle, not just a checkpoint. A linked CancellationTokenSource armed with GateMaxTtlSeconds is created when the cycle starts and passed to every await inside the cycle — including ClearDataTablesAsync. If the ceiling fires (timeout, not graceful shutdown), the cycle is force-aborted: state written to idle, a reset-completed event emitted on the control stream so connected components recover, and the advisory lock released. The ack-wait (AckTimeoutSeconds) is a separate, inner timeout and is unaffected.
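The relationship between the two timeouts can be sketched with nested deadlines. This example swaps in Python's `asyncio.wait_for` for the linked CancellationTokenSource described above, so it illustrates the shape of the behaviour rather than the .NET implementation. The `run_cycle` function and its callable parameters are hypothetical, and advisory-lock handling is omitted.

```python
import asyncio


async def run_cycle(wait_for_acks, clear_data, emit, ack_timeout, gate_max_ttl):
    """Run one reset cycle under a hard ceiling and report how it ended."""

    async def cycle():
        try:
            # Inner timeout: on expiry, proceed anyway (components are optional).
            await asyncio.wait_for(wait_for_acks(), ack_timeout)
        except asyncio.TimeoutError:
            pass
        emit("reset-started")
        await clear_data()
        emit("reset-completed")

    aborted = False
    try:
        # Outer ceiling: bounds every await in the cycle, clear_data included.
        await asyncio.wait_for(cycle(), gate_max_ttl)
    except asyncio.TimeoutError:
        # Forced abort: still emit reset-completed so components recover.
        aborted = True
        emit("reset-completed")
    # Both paths end with the state written back to idle.
    return {"state": "idle", "aborted": aborted}


async def never():
    await asyncio.sleep(3600)


async def instant():
    return None


normal_events, abort_events = [], []
# Acks never arrive: the inner timeout fires and the cycle completes normally.
normal = asyncio.run(run_cycle(never, instant, normal_events.append, 0.01, 1.0))
# Data clearing hangs: the outer ceiling fires and the cycle is force-aborted.
forced = asyncio.run(run_cycle(instant, never, abort_events.append, 0.01, 0.05))
```

Because the ceiling cancels the inner task from outside, an expired ack wait and an expired ceiling follow different paths: the first continues the cycle, the second ends it.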