Preflight, restart dashboard, sidecar, watch list, verify Kronos, rate-limit recovery.

MAF — Runbook

Operational tasks, ordered by frequency. Each command is paste-ready.

Table of contents

  1. Preflight: am I healthy?
  2. Service management — systemd
  3. Manual start / restart (development)
  4. Wiring up trtools2 (TT2_API_KEY)
  5. Start the Kronos sidecar
  6. Watch / unwatch a symbol
  7. Verify the Kronos loop end-to-end
  8. Trigger an arena from the shell
  9. Tail live events
  10. Ollama rate-limit recovery
  11. Production gotchas

Preflight: am I healthy?

python -m maf doctor

Walks Redis · Ollama Cloud · arenas-load · streams-write · trtools2 · fomo2 · mirofish. Exits with status 0 only when every required check passes. Each failing line prints an actionable hint (doctor.py).
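For cron jobs or CI gates, the exit code can be checked from Python. This is a minimal sketch that relies only on the behaviour stated above (exit status 0 means every required check passed). The helper name run_preflight and the 120 s timeout are illustrative assumptions, and sys.executable assumes the script runs inside the same virtualenv as MAF.

```python
import subprocess
import sys
from typing import Sequence

# Command from this runbook; the wrapper around it is illustrative only.
DOCTOR_CMD = (sys.executable, "-m", "maf", "doctor")


def run_preflight(cmd: Sequence[str] = DOCTOR_CMD, timeout: float = 120.0) -> bool:
    """Return True only if the command exits with status 0."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        print(f"preflight could not run: {exc}", file=sys.stderr)
        return False
    if proc.returncode != 0:
        # Pass the doctor output (including its hints) through unchanged.
        sys.stderr.write(proc.stdout + proc.stderr)
    return proc.returncode == 0
```

A deploy script could call run_preflight() and abort when it returns False.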

The dashboard's status bar surfaces the same info as coloured pills — redis · ollama · trtools2 · fomo2 · mirofish · kronos_refresher · mirofish_refresher. The last two are red when only the dashboard is running (no worker); see service management below.


Service management — systemd

Production runs two long-lived processes: the dashboard and the service-mode worker. Both are managed by systemd, survive reboots, and have their logs rotated.

One-time install

sudo mkdir -p /var/log/maf
sudo chown trbck:trbck /var/log/maf

sudo cp deploy/systemd/maf-dashboard.service /etc/systemd/system/
sudo cp deploy/systemd/maf-worker.service     /etc/systemd/system/
sudo cp deploy/logrotate.d/maf                /etc/logrotate.d/maf

sudo systemctl daemon-reload
sudo systemctl enable --now maf-dashboard.service
sudo systemctl enable --now maf-worker.service

See deploy/README.md for the unit-file breakdown.

Day-to-day ops

| Action                                        | Command                                          |
|-----------------------------------------------|--------------------------------------------------|
| Restart dashboard (after template / JS edits) | sudo systemctl restart maf-dashboard             |
| Restart worker (after arena YAML edits)       | sudo systemctl restart maf-worker                |
| Restart both                                  | sudo systemctl restart maf-dashboard maf-worker  |
| Tail both                                     | sudo journalctl -u maf-dashboard -u maf-worker -f |
| Status                                        | systemctl status maf-dashboard maf-worker        |
| Force log rotation (don't wait for cron)      | sudo logrotate -f /etc/logrotate.d/maf           |

Verifying the worker is up

Within ~60 s of maf-worker starting, the dashboard's kronos_refresher + mirofish_refresher pills turn green — that's the heartbeat key landing in Redis. If they stay red, check journalctl -u maf-worker -n 100.
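The pill logic amounts to a freshness check on a heartbeat timestamp. The sketch below illustrates that idea only: the function name pill_state and the 90 s threshold are assumptions, not MAF's actual values, and the real Redis key name is not shown here.

```python
import time
from typing import Optional


def pill_state(last_beat: Optional[float], now: Optional[float] = None,
               max_age_s: float = 90.0) -> str:
    """Map a heartbeat timestamp (epoch seconds) to a pill colour.

    max_age_s is an assumed threshold chosen for illustration.
    """
    if last_beat is None:
        return "red"  # key missing: worker never started, or the key expired
    current = time.time() if now is None else now
    return "green" if (current - last_beat) <= max_age_s else "red"
```

A missing heartbeat and a stale heartbeat both map to red, which matches the "check the worker journal" advice above.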


Manual start / restart (development)

If you're not using systemd (dev box, CI, ad-hoc):

# stop the old process
pkill -9 -f 'python -m maf' ; sleep 3

# dashboard only (no refreshers — fine for UI work)
nohup ./.venv/bin/python -m maf --dashboard \
    --host 127.0.0.1 --port 8420 --log-level WARNING \
    >/tmp/maf-dashboard.log 2>&1 &
disown

# service mode (refreshers + trigger dispatcher + control inbox)
nohup ./.venv/bin/python -m maf --log-level INFO \
    >/tmp/maf-worker.log 2>&1 &
disown

Verify:

sleep 5
curl -sI http://127.0.0.1:8420/api/system/status | head -1
curl -sI https://maf.techvizier.com/ | head -1

Wiring up trtools2 (TT2_API_KEY)

MAF talks to trtools2's dashboard API (default port 8888) via the trtools2_api adapter. trtools2 requires X-API-Key auth — copy the value from trtools2's config into MAF's .env:

TT2_KEY=$(grep -E '^TT2_API_KEY=' /home/trbck/workspace/trtools2/config/dashboard.env | cut -d= -f2-)
if grep -q '^TT2_API_KEY=' /home/trbck/workspace/MAF/.env 2>/dev/null; then
    sed -i "s|^TT2_API_KEY=.*|TT2_API_KEY=${TT2_KEY}|" /home/trbck/workspace/MAF/.env
else
    echo "TT2_API_KEY=${TT2_KEY}" >> /home/trbck/workspace/MAF/.env
fi

MAF's python-dotenv auto-loader picks it up on next startup — restart the dashboard and worker.
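The same replace-or-append logic can be written in Python, which avoids one sharp edge of the sed version: a key containing | or & would be misinterpreted by sed's replacement syntax. A minimal sketch; the function name upsert_env is hypothetical.

```python
from pathlib import Path


def upsert_env(env_path: Path, key: str, value: str) -> None:
    """Set KEY=value in a dotenv file: replace an existing line or append one."""
    lines = env_path.read_text().splitlines() if env_path.exists() else []
    prefix = f"{key}="
    new_line = f"{key}={value}"
    replaced = False
    for i, line in enumerate(lines):
        if line.startswith(prefix):
            lines[i] = new_line  # value is written verbatim, never interpreted
            replaced = True
    if not replaced:
        lines.append(new_line)
    env_path.write_text("\n".join(lines) + "\n")
```

Usage would be upsert_env(Path("/home/trbck/workspace/MAF/.env"), "TT2_API_KEY", key), followed by the restart described above.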

Verify:

curl -s 'http://127.0.0.1:8420/api/data/sources/alpaca_live/feed_health/sample' \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('ok' if d['ok'] else 'fail')"

The dashboard's freshness badges on the alpaca_live arena's Setup tab also flip to external · HTTP API once auth works.

What if trtools2 is on a different host?

Override the base URL:

echo 'TRTOOLS2_API_URL=https://trtools2.internal:8888' >> /home/trbck/workspace/MAF/.env

Or per-binding in YAML:

- name: feed_health
  adapter: trtools2_api
  config:
    query_type: feed_stats
    base_url: https://trtools2.internal:8888
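With two override points, it helps to be explicit about which one wins. The sketch below assumes the per-binding base_url takes precedence over TRTOOLS2_API_URL, which in turn takes precedence over a local default. That ordering and the default host are assumptions (the runbook only states the default port), so confirm against the trtools2_api adapter before relying on it.

```python
import os
from typing import Mapping, Optional

# Assumed default host; only the port (8888) is stated in this runbook.
DEFAULT_BASE_URL = "http://127.0.0.1:8888"


def resolve_base_url(binding_config: Mapping[str, str],
                     env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the base URL: binding config, then environment, then default (assumed order)."""
    source = os.environ if env is None else env
    url = (binding_config.get("base_url")
           or source.get("TRTOOLS2_API_URL")
           or DEFAULT_BASE_URL)
    return url.rstrip("/")
```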

Start the Kronos sidecar

cd /home/trbck/workspace/MAF/services/kronos-svc

# one-time setup (only on a fresh box; ~1.1 GB venv after install)
python3 -m venv .venv
./.venv/bin/pip install -r requirements.txt

# start
PORT=5102 LOG_LEVEL=INFO nohup ./.venv/bin/python server.py \
    >/tmp/kronos-svc.log 2>&1 &
disown
sleep 4

# health
curl -s http://127.0.0.1:5102/health
# {"status":"ok","model":"NeoQuasar/Kronos-small","loaded":false}

The model loads lazily on the first /forecast call (~12 s). Subsequent calls are warm (~5–10 s for Kronos-small on CPU).

Source: server.py + PredictorPool.
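Instead of a fixed sleep 4, a start script can poll /health until the sidecar answers. The sketch below assumes only the response shape shown above (status, model, loaded). The helper names and the retry counts are illustrative. Because the model loads lazily, loaded=false is treated as healthy.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple


def parse_health(body: str) -> Tuple[bool, bool]:
    """Return (service_ok, model_loaded) from a /health JSON body."""
    data = json.loads(body)
    return data.get("status") == "ok", bool(data.get("loaded", False))


def fetch_health(url: str = "http://127.0.0.1:5102/health") -> str:
    """Fetch the raw /health body from the sidecar."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_until_ok(fetch: Callable[[], str] = fetch_health,
                  attempts: int = 10, delay_s: float = 1.0) -> bool:
    """Poll until status is ok; give up after the given number of attempts."""
    for attempt in range(attempts):
        try:
            ok, _loaded = parse_health(fetch())
            if ok:
                return True
        except (OSError, ValueError):
            pass  # not listening yet, or the body is not valid JSON
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```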


Watch / unwatch a symbol

Adding a symbol kicks off the KronosRefresher for that symbol (provided the service-mode worker is running). Costs zero until a fresh forecast is requested.

# watch
curl -X POST -H 'Content-Type: application/json' \
  -d '{"target_id":"NVDA","kind":"symbol","ttl_seconds":21600}' \
  http://localhost:8420/api/watch

# list
curl -s 'http://localhost:8420/api/watch?kind=symbol' | jq .

# unwatch
curl -X DELETE 'http://localhost:8420/api/watch/NVDA?kind=symbol'

Backing class: WatchList.
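For scripting several symbols, the same calls can be built with the standard library. A sketch mirroring the curl commands above; the function names are hypothetical, and the endpoint paths and payload fields are taken from those commands.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "http://localhost:8420"


def watch_request(symbol: str, ttl_seconds: int = 21600,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/watch request (21600 s is 6 hours)."""
    body = json.dumps({"target_id": symbol, "kind": "symbol",
                       "ttl_seconds": ttl_seconds}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/watch", data=body, method="POST",
        headers={"Content-Type": "application/json"})


def unwatch_request(symbol: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the DELETE /api/watch/<symbol>?kind=symbol request."""
    return urllib.request.Request(
        f"{base_url}/api/watch/{quote(symbol, safe='')}?kind=symbol", method="DELETE")
```

Each request is sent with urllib.request.urlopen(req, timeout=10); building and sending are kept separate so the payload can be inspected first.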


Verify the Kronos loop end-to-end

Hermetic 4-step test that proves real torch inference + cache + adapter + specialist contract, without depending on Ollama Cloud:

# make sure kronos-svc is running (see above)
cd /home/trbck/workspace/MAF
PYTHONPATH=src ./.venv/bin/python scripts/verify_kronos_loop.py

Expected output:

[1] sidecar health      ✓ /health → model=NeoQuasar/Kronos-small loaded=True
[2] refresher tick      ✓ cache populated: direction=NEUTRAL prob_up=0.0
[3] adapter read        ✓ age=0.0s direction=NEUTRAL prob_up=0.0
[4] kronos_specialist   ✓ AgentSignal: signal=NEUTRAL confidence=0.27 …

Source: verify_kronos_loop.py. The script seeds 80 synthetic NVDA bars, runs one refresher tick, then exercises the specialist with a stub LLM that follows the prompt's deterministic mapping rule.
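When the script runs unattended, its output can be checked mechanically. The sketch below assumes the line format shown in the expected output (a [n] prefix and a ✓ on success). How failing steps are printed is not documented here, so any step without a ✓ is simply treated as not passed.

```python
import re
from typing import Dict

# Matches lines such as "[1] sidecar health      ✓ /health → ..."
STEP_RE = re.compile(r"^\[(\d+)\]\s+(.+?)\s+✓")


def passed_steps(output: str) -> Dict[int, str]:
    """Return {step_number: step_name} for every line carrying a check mark."""
    steps: Dict[int, str] = {}
    for line in output.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps[int(match.group(1))] = match.group(2)
    return steps


def loop_ok(output: str, expected_steps: int = 4) -> bool:
    """True when steps 1..expected_steps all passed."""
    return sorted(passed_steps(output)) == list(range(1, expected_steps + 1))
```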


Trigger an arena from the shell

Four ways. The control plane (A and B) is the production path; it is the same plumbing the dashboard uses.

A. Direct XADD (raw)

redis-cli XADD maf:control:in '*' data '{
  "command": "run_arena",
  "correlation_id": "demo-1",
  "args": {
    "arena": "market_pulse",
    "target": {"ticker": "NVDA"},
    "action_mode": "manual"
  }
}'

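# wait for ack ('$' only returns entries added after this call starts;
# start it in a second terminal before the XADD if the arena may finish quickly)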
redis-cli XREAD COUNT 1 BLOCK 60000 STREAMS maf:control:out '$'
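Hand-quoting JSON inside a shell string is error-prone, so the message can be built in Python instead. This sketch reproduces the message shape from the XADD above (a single data field holding JSON). The function names and the generated correlation_id format are assumptions, and sending uses redis-py's Redis.xadd, which is an extra dependency imported only when needed.

```python
import json
import uuid
from typing import Dict, Optional


def run_arena_fields(arena: str, ticker: str, action_mode: str = "manual",
                     correlation_id: Optional[str] = None) -> Dict[str, str]:
    """Build the field map for XADD maf:control:in."""
    message = {
        "command": "run_arena",
        # Hypothetical ID scheme; any unique string works for matching the ack.
        "correlation_id": correlation_id or f"cli-{uuid.uuid4().hex[:8]}",
        "args": {"arena": arena, "target": {"ticker": ticker},
                 "action_mode": action_mode},
    }
    return {"data": json.dumps(message)}


def send_control(fields: Dict[str, str], stream: str = "maf:control:in"):
    """Append the message to the control stream and return the entry ID."""
    import redis  # redis-py; assumes Redis on localhost:6379
    return redis.Redis().xadd(stream, fields)
```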

B. Via the Python client (prettier)

from maf.control.client import ControlClient
import asyncio

async def main():
    client = ControlClient()
    ack = await client.send("run_arena", {
        "arena": "market_pulse",
        "target": {"ticker": "NVDA"},
        "action_mode": "manual",
    })
    print(ack["result"]["synthesis_verdict"],
          ack["result"]["synthesis_confidence"])

asyncio.run(main())

Source: ControlClient + ControlInbox.

C. Via the dashboard Run button

Open the dashboard and click Run on any arena card. The adaptive modal collects target + mode, sends POST /api/arenas/{name}/run, and renders the verdict + reasoning inline. This is the path most operators use.

D. Auto-firing via smart triggers

Use the Setup tab on the arena page to add a trigger rule. The TriggerDispatcher auto-fires the arena when a matching event hits the configured stream. See config/trigger_templates.yaml for prebuilt rules.


Tail live events

python -m maf events                      # all events
python -m maf events --filter arena.complete
python -m maf events --arena market_pulse

Or in the browser: https://maf.techvizier.com/live.

Backend: EventBusReader → WS pump at /ws/events.


Ollama rate-limit recovery

Symptom: many HTTP 429 — Too Many Requests lines in the logs, arenas falling back to verdict=HOLD confidence=0.0.

Steps:

  1. Confirm it's a rate limit (not a billing issue):

     curl -s -o /dev/null -w '%{http_code}\n' \
         -H "Authorization: Bearer $OLLAMA_API_KEY" \
         https://ollama.com/v1/models

     200 means the account is live. 401 means the key is invalid (rotate).

  2. Wait 3–5 minutes. The per-account RPM resets quickly. Re-run the arena.

  3. If the limit keeps biting, lower the burst by reducing max_react_steps on the noisier specialists. Each ReAct iteration = one LLM call.

  4. Switch the quick-profile model to a smaller one in PROFILE_MAP. glm-4.7 is small but might still hit the limit; try gemma3:4b for the quick profile. The heavy-reasoning profile can stay on gpt-oss:120b.
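Steps 1 and 3 can be sketched as two small helpers: one maps the status code from the curl probe to the next action, the other estimates the burst size from max_react_steps. The specialist names in the usage note are hypothetical, and the estimate counts only ReAct iterations (one LLM call each, as stated above), not any other calls an arena may make.

```python
from typing import Mapping


def diagnose(status_code: int) -> str:
    """Translate the /v1/models probe result into the next action."""
    if status_code == 200:
        return "account live: treat the 429s as rate limiting, wait and retry"
    if status_code == 401:
        return "key invalid: rotate OLLAMA_API_KEY"
    if status_code == 429:
        return "still rate limited: wait 3-5 minutes"
    return f"unexpected status {status_code}: investigate before retrying"


def worst_case_react_calls(max_react_steps: Mapping[str, int]) -> int:
    """Upper bound on ReAct LLM calls per arena run (one call per iteration)."""
    return sum(max_react_steps.values())
```

For example, two specialists capped at 6 and 4 steps can issue up to 10 calls per run; lowering the first cap to 3 brings that to 7.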


Production gotchas

Two clones, one inode

/home/trbck/wp/MAF and /home/trbck/workspace/MAF are the same directory (bind-mounted). Editing one edits the other. No syncing needed.

Refresher loops only run in service mode

The dashboard process (python -m maf --dashboard) does not start the Kronos/MiroFish refreshers or the trigger dispatcher. Those live in the full service mode (python -m maf with no flags). The kronos_refresher / mirofish_refresher status pills in the dashboard surface this — green means the heartbeat key is fresh, red with "not running" means start the worker.

For one-off testing without a long-running worker:

python -c "
import asyncio
from maf.scheduler.kronos_refresher import KronosRefresher
import httpx
async def main():
    r = KronosRefresher(sidecar_url='http://localhost:5102')
    async with httpx.AsyncClient(timeout=120) as http:
        await r._tick(http)
    await r.aclose()
asyncio.run(main())
"

Config saved but didn't take effect

The Setup tab writes YAML atomically but in-memory configs are loaded once at process start. Either:

  • Restart the worker (sudo systemctl restart maf-worker) for arena and trigger changes to take effect.
  • Send a reload_config control command (rebuilds the in-memory arena graph without a restart):

    redis-cli XADD maf:control:in '*' data '{"command":"reload_config","args":{}}'
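The underlying behaviour is a snapshot taken at process start that goes stale when the file changes. The sketch below illustrates that pattern only; the class name LoadedOnceConfig is hypothetical and this is not MAF's loader.

```python
from pathlib import Path


class LoadedOnceConfig:
    """Holds a config file's text as read at construction time."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.text = path.read_text()  # snapshot taken at "process start"

    def is_stale(self) -> bool:
        """True when the file on disk no longer matches the snapshot."""
        return self.path.read_text() != self.text

    def reload(self) -> None:
        """Refresh the snapshot, the in-process analogue of reload_config."""
        self.text = self.path.read_text()
```

Until reload() runs, every reader of the snapshot keeps seeing the old values, which is why a successful save in the Setup tab is not enough on its own.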

Save returned 412 in the Setup tab

Another browser tab (or another operator) saved this arena's config while you were editing. Reload the Setup tab to pull the latest ETag, re-apply your edits, save again. The on-disk YAML has the other writer's version — your edits weren't lost in the UI, just not yet persisted.
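The 412 comes from optimistic concurrency: a save succeeds only if the ETag the editor loaded still matches the stored version. The sketch below simulates two tabs against an in-memory store. How MAF actually derives its ETag is not documented here, so the hash-based tag and the ConfigStore class are assumptions.

```python
import hashlib
from typing import Tuple


class ConfigStore:
    """In-memory stand-in for the on-disk YAML and its ETag."""

    def __init__(self, text: str) -> None:
        self.text = text

    @property
    def etag(self) -> str:
        # Assumed scheme: a short content hash.
        return hashlib.sha256(self.text.encode("utf-8")).hexdigest()[:16]

    def save(self, new_text: str, if_match: str) -> Tuple[int, str]:
        """Return (status, current_etag); 412 when if_match is out of date."""
        if if_match != self.etag:
            return 412, self.etag  # Precondition Failed: someone else saved first
        self.text = new_text
        return 200, self.etag
```

If tabs A and B both load the same ETag and A saves first, B's save returns 412 with the new ETag; B succeeds only after retrying with that ETag, which is what reloading the Setup tab does.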

Disk pressure

The kronos-svc venv is 1.1 GB; the Docker image (when built) is ~3 GB. With 13 GB free, prefer the native venv path. To free space, prune Docker. This is destructive: it removes all stopped containers, all unused images, and unused volumes.

docker system prune -af --volumes
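Before building the image or creating the venv, free space can be checked from Python. A small sketch using shutil.disk_usage; the function names and the 2 GB headroom are illustrative assumptions.

```python
import shutil

GB = 10 ** 9  # decimal gigabytes, matching the figures quoted above


def free_gb(path: str = "/") -> float:
    """Free space on the filesystem containing path, in GB."""
    return shutil.disk_usage(path).free / GB


def fits(required_gb: float, path: str = "/", headroom_gb: float = 2.0) -> bool:
    """True when required_gb plus headroom is available at path."""
    return free_gb(path) >= required_gb + headroom_gb
```

For example, fits(1.1, "/home/trbck/workspace/MAF") checks for the venv and fits(3.0) for the Docker image.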

Browser cache

After any nav / CSS / JS change, hard-refresh in the browser (Ctrl-Shift-R on Linux, Cmd-Shift-R on macOS). Cloudflare sits in front of the public domain and caches JS for a few minutes.