# NoETL Catalog-Driven MCP Architecture

How NoETL turns the Model Context Protocol into a first-class object in the catalog: registered alongside playbooks and credentials, deployed and operated entirely through playbooks, and surfaced in the GUI as a friendly, form-driven workspace.
This document is a high-level tour. For the deploy walkthrough, see MCP End-to-End on Local Kind; for the GKE variant, see MCP End-to-End on GKE.
## Why this design
A working MCP integration needs three things working together:
- Where does an MCP server live? Some are managed services (Anthropic-hosted), some run in-cluster (a kubernetes-mcp-server Deployment), some are local processes. The system has to know how to reach each one and how to provision new ones.
- Who can run what? Granting "may invoke `pods` on the Kubernetes MCP" should not require a separate IAM system — the same authorisation primitives that gate playbook execution should gate MCP tool calls.
- What's it doing right now? Every MCP call should be visible in the same execution dashboard that shows playbook runs. Same events, same audit trail, same retry semantics.
The answer NoETL converged on: the catalog is the source of truth, and every MCP operation goes through a playbook.
## The four phases
```mermaid
flowchart TB
    subgraph "Phase 1: Catalog"
        Mcp[(Mcp resource<br/>kind: Mcp)]
        Playbook[(Lifecycle agents<br/>kind: Playbook)]
    end
    subgraph "Phase 2: Authorisation"
        Auth[check_playbook_access<br/>auth.playbook_permissions]
    end
    subgraph "Phase 3: Lifecycle agents"
        Deploy[deploy / undeploy / restart]
        Status[status / discover]
    end
    subgraph "Phase 4: GUI"
        Catalog[Catalog browser]
        RunDialog[Friendly run dialog]
    end
    Catalog --> Mcp
    Catalog --> Playbook
    Mcp -.references.-> Playbook
    Catalog --> RunDialog
    RunDialog -- "POST /api/mcp/.../lifecycle/{verb}" --> Auth
    Auth -- allowed --> Deploy
    Auth -- allowed --> Status
    Deploy -- "helm upgrade" --> Cluster[(Kubernetes cluster)]
    Status -- "kubectl get" --> Cluster
```
Each phase is implemented and merged:
| Phase | Implementation | Released in |
|---|---|---|
| 1 — Mcp resource lifecycle endpoint | `/api/mcp/{path}/lifecycle/{verb}` + `_ui_schema` | noetl ≥ 2.26 |
| 2 — server-side `check_playbook_access` | `noetl/server/api/auth/check_access.py` | noetl ≥ 2.27 |
| 3 — Kubernetes MCP lifecycle agent fleet | `automation/agents/kubernetes/lifecycle/*` | ops main |
| 4 — friendly run dialog + Mcp tile renderer | `gui/src/components/PlaybookRunDialog.tsx` | gui ≥ 1.3 |
## Phase 1 — `kind: Mcp` is a catalog resource

`Playbook`, `Credential`, `Memory` and now `Mcp` all live in the same catalog table, distinguished by their `kind` column. An `Mcp` resource looks like:
```yaml
apiVersion: noetl.io/v2
kind: Mcp
metadata:
  name: kubernetes
  path: mcp/kubernetes
spec:
  url: http://kubernetes-mcp-server.mcp.svc.cluster.local:8080/mcp
  protocol: mcp/1.0
  lifecycle:
    deploy: automation/agents/kubernetes/lifecycle/deploy
    undeploy: automation/agents/kubernetes/lifecycle/undeploy
    redeploy: automation/agents/kubernetes/lifecycle/redeploy
    restart: automation/agents/kubernetes/lifecycle/restart
    status: automation/agents/kubernetes/lifecycle/status
    discover: automation/agents/kubernetes/lifecycle/discover
  discovery:
    initialize_url: http://kubernetes-mcp-server.mcp.svc.cluster.local:8080/healthz
    tools_list_url: http://kubernetes-mcp-server.mcp.svc.cluster.local:8080/mcp/tools
    refresh_via: automation/agents/kubernetes/lifecycle/discover
  runtime:
    agent: automation/agents/kubernetes/runtime
  deployment:
    namespace: mcp
    chart_ref: oci://ghcr.io/containers/charts/kubernetes-mcp-server
    image_tag: v0.0.61
    toolsets: "core,config"
```
Every block describes intent rather than commands:
- `spec.url` — where the running MCP server lives
- `spec.lifecycle.{verb}` — which playbook deploys/inspects/operates this MCP
- `spec.runtime.agent` — which playbook the GUI calls when a user invokes a tool
- `spec.deployment` — knobs the lifecycle agents read at install time
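As a minimal sketch (the function and variable names here are illustrative, not a real NoETL API), the dispatcher's verb-to-playbook resolution against this shape might look like:

```python
# Hypothetical helper: given an Mcp catalog entry (as a parsed dict)
# and a lifecycle verb, return the playbook path that implements it.
# Field names mirror the YAML above; the function itself is a sketch.
def resolve_lifecycle_playbook(mcp_resource: dict, verb: str) -> str:
    lifecycle = mcp_resource.get("spec", {}).get("lifecycle", {})
    if verb not in lifecycle:
        raise KeyError(f"no '{verb}' lifecycle playbook registered")
    return lifecycle[verb]

mcp = {
    "kind": "Mcp",
    "metadata": {"path": "mcp/kubernetes"},
    "spec": {"lifecycle": {
        "deploy": "automation/agents/kubernetes/lifecycle/deploy",
        "status": "automation/agents/kubernetes/lifecycle/status",
    }},
}
print(resolve_lifecycle_playbook(mcp, "deploy"))
# automation/agents/kubernetes/lifecycle/deploy
```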
Managed MCP providers use the same catalog shape without a deploy
lifecycle. For example, the Google Cloud GKE managed endpoint is
registered as kind: Mcp at mcp/gcp, points at
https://container.googleapis.com/mcp/read-only, and uses the
terminal agent playbook mcp/gcp/gke. No pod is installed in the
cluster; the NoETL worker obtains a Google Cloud access token through
Workload Identity or an explicit execution workload override, calls
the managed endpoint, and records the response as a normal execution.
The server exposes:
- `POST /api/mcp/{path}/lifecycle/{verb}` — dispatch a lifecycle agent
- `POST /api/mcp/{path}/discover` — refresh the tools list
- `GET /api/catalog/{path}/ui_schema` — render a workload form for any catalog entry
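A sketch of how a script could target the lifecycle route. The base URL and the `{"workload": ...}` body shape are assumptions for illustration, not a documented NoETL client API:

```python
import json
from urllib import request

# Hypothetical client sketch: build (but don't send) the POST for
# /api/mcp/{path}/lifecycle/{verb}. Payload shape is an assumption.
def lifecycle_request(base: str, mcp_path: str, verb: str,
                      workload: dict) -> request.Request:
    url = f"{base}/api/mcp/{mcp_path}/lifecycle/{verb}"
    return request.Request(
        url,
        data=json.dumps({"workload": workload}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = lifecycle_request(
    "http://localhost:8082", "mcp/kubernetes", "status",
    {"expected_kube_context": "kind-noetl"},
)
print(req.get_method(), req.full_url)
# POST http://localhost:8082/api/mcp/mcp/kubernetes/lifecycle/status
```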
## Phase 2 — auth-as-playbook, server-side

Every dispatch through the MCP routes runs `check_playbook_access` on the server, against the same `auth.playbook_permissions` table the gateway already uses:
- `enforce` — deny → 403, missing token → 401, DB error → 503
- `advisory` — log "would deny", proceed
- `skip` — pass through (the local-kind default)
The GUI can also call `/api/auth/check-access` itself for UI gating (greying out a button before the user clicks it), but the server-side enforcement remains the source of truth even if the GUI misbehaves.
## Phase 3 — lifecycle agents are playbooks

Each lifecycle verb is a regular `kind: Playbook` resource with `metadata.agent: true`:
```yaml
# automation/agents/kubernetes/lifecycle/deploy.yaml
metadata:
  name: kubernetes_mcp_lifecycle_deploy
  path: automation/agents/kubernetes/lifecycle/deploy
  agent: true
  capabilities: [kubernetes, mcp:lifecycle:deploy]
workload:
  mcp_resource: ... # populated by the dispatcher
  verb: deploy
  expected_kube_context: kind-noetl
workflow:
  - step: deploy
    tool:
      kind: shell
      cmds:
        - |
          # in-cluster guard: skip the local-terminal check
          if [ -z "${KUBERNETES_SERVICE_HOST:-}" ]; then
            ...
          fi
          helm upgrade --install "$RELEASE_NAME" "$CHART_REF" ...
    next:
      spec: { mode: exclusive }
      arcs: [{ step: end }]
  - step: end
    tool:
      kind: python
      code: |
        result = {"status": "completed", "agent": "...", "text": deploy_output}
```
A few things worth noticing:

- Same DSL. Lifecycle agents use the exact same `workflow:`/`step:`/`tool:` shape every other playbook does. Nothing special about being a "lifecycle agent" beyond the `metadata.agent: true` flag and the `capabilities:` list.
- `kind: shell` runs everywhere. The distributed worker now ships its own shell tool kind that calls `subprocess.run` with Jinja-rendered commands, conservative env forwarding (only `PATH` + `KUBERNETES_SERVICE_HOST` + explicit `task.env`), per-cmd timeout, and structured failure aggregation. The local Rust binary's `kind: shell` works the same way.
- In-cluster vs local. The `KUBERNETES_SERVICE_HOST` env var (which the kubelet always sets inside a Pod) lets the same agent run from an operator's terminal or from a worker pod. The local terminal path keeps the kubectl-context guard; in-cluster execution skips it because the worker's SA already pins it to the right cluster.
- Results flow back as events. The Python `end` step explicitly returns a structured result with `status`, `text`, `mcp_path`, `verb`. The dispatcher's `playbook.completed` event surfaces that text inline in the GUI's run dialog.
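The in-cluster guard is small enough to show in full. A sketch of the same check in Python (function names are illustrative; the agents above express it in shell):

```python
import os

def running_in_cluster() -> bool:
    # The kubelet injects KUBERNETES_SERVICE_HOST into every Pod, so
    # its presence distinguishes worker-pod execution from a terminal.
    return bool(os.environ.get("KUBERNETES_SERVICE_HOST"))

def should_check_kube_context() -> bool:
    # Local terminals keep the kubectl-context guard; in-cluster
    # execution skips it because the worker's SA pins the cluster.
    return not running_in_cluster()
```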
## Phase 4 — friendly run dialog + Mcp browser

The GUI's catalog browser auto-detects `kind: Mcp` entries and renders them as workspaces with verb buttons:
```
mcp/kubernetes :: Read-only Kubernetes runtime agent backed by the Kubernetes MCP server

  [ status ]  inspect through agent playbook
  [ tools  ]  list MCP tools through agent playbook
  [ deploy ]  full helm upgrade via lifecycle.deploy
  [ pods   ]  runtime agent: pods across namespaces
  [ events ]  runtime agent: recent cluster events
  ...
```
Each button opens a friendly run dialog generated from `/api/catalog/{path}/ui_schema`. The endpoint walks the playbook's `workload:` block and emits a JSON-Schema-shaped description of every field — Antd renders `string` as inputs, `enum` as selects, `object` as JSON textareas with live validation, `boolean` as checkboxes. The user submits, the dispatcher validates the workload against the agent's Pydantic contract, and the resulting execution streams back into the dialog through SSE.
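An illustrative reduction of what the `ui_schema` endpoint does — the real endpoint emits a richer JSON-Schema document; the mapping below is a simplified assumption:

```python
# Hypothetical sketch: walk a workload: block (as a dict of defaults)
# and map each value to a form-field type the GUI can render.
def ui_schema(workload: dict) -> dict:
    def field(value):
        if isinstance(value, bool):
            return {"type": "boolean"}                   # -> checkbox
        if isinstance(value, dict):
            return {"type": "object"}                    # -> JSON textarea
        if isinstance(value, list):
            return {"type": "array"}
        return {"type": "string", "default": value}      # -> text input
    return {
        "type": "object",
        "properties": {k: field(v) for k, v in workload.items()},
    }

schema = ui_schema({"verb": "deploy", "dry_run": False, "values": {}})
```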
Two things make this nice in practice:
- No code path drift. The form fields are derived from the agent's actual `workload:` schema. If you add a field, the form picks it up next time the user opens the dialog. No `<form>` to maintain alongside the YAML.
- Polling is epoch-guarded. Closing the dialog mid-run can't zombie the polling loop — `stopPolling()` increments an epoch counter and any in-flight `getExecution()` bails before re-scheduling.
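The epoch-guard pattern is small enough to sketch. This is illustrative Python, not the GUI's actual TypeScript; the names mirror the idea, not the code:

```python
# Epoch-guard sketch: stop() bumps the epoch, and any in-flight poll
# compares the epoch it captured at start before re-scheduling.
class Poller:
    def __init__(self):
        self.epoch = 0
        self.scheduled = 0          # count of re-schedules, for clarity

    def stop(self):
        self.epoch += 1             # invalidates every in-flight poll

    def on_poll_result(self, captured_epoch: int):
        if captured_epoch != self.epoch:
            return                  # dialog closed mid-run: bail
        self.scheduled += 1         # safe to schedule the next poll

p = Poller()
tick = p.epoch                      # a poll starts, capturing the epoch
p.stop()                            # user closes the dialog mid-run
p.on_poll_result(tick)              # in-flight result arrives: ignored
assert p.scheduled == 0             # no zombie re-schedule
```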
## Cross-cutting: the DSL schema

The same Pydantic models (`Playbook`, `Step`, `NextRouter`, `Tool`, ...) drive:

- Server validation at `POST /api/catalog/register` — a malformed playbook gets a 422 with the field path before it hits the catalog.
- Engine load at execute time — every playbook the worker picks up is reconstructed from the same model.
- A published JSON Schema at `noetl/core/dsl/playbook.schema.json`, auto-generated from the Pydantic models via `python -m noetl.core.dsl._generate_schema`. Editors that read the schema URL get autocomplete and inline error reporting against the canonical v10 contract.
Catalog `kind` is also authoritative from the YAML payload, not the request parameter — `noetl catalog register mcp_kubernetes.yaml` correctly stores it as `kind: mcp` even when the CLI defaults its hint to `Playbook`.
## Cross-cutting: RBAC

Two service accounts, two ClusterRoles:

| SA | Lives in | Granted by | What it does |
|---|---|---|---|
| `noetl-worker` | `noetl` namespace | `noetl-worker-lifecycle-installer` ClusterRole | helm install + kubectl create namespace + apply chart resources |
| `kubernetes-mcp-server` | `mcp` namespace (chart-managed) | `kubernetes-mcp-server-reader` ClusterRole | read pods/events/services/etc. across all namespaces |
Both are namespace-scoped Subjects bound to cluster-wide rules — narrowed to read-only for the MCP server and CRUD-on-chart-resources for the worker. Neither is cluster-admin; each gets only the broadest verbs it actually needs.
## Why dispatcher → worker → in-cluster execution
A reasonable alternative is to have the noetl-server execute shell commands directly. We chose the dispatch path because:
- Audit trail. Every lifecycle invocation produces the same `playbook.initialized`/`command.issued`/`command.done` events as a regular playbook run. No special tracing for MCP ops.
- Backpressure. The worker pool has admission/concurrency controls. A burst of `lifecycle.status` calls from a polling GUI doesn't compete with helm-driven `lifecycle.deploy` runs for a single thread on the noetl-server pod.
- Failure isolation. A misbehaving shell that pegs CPU or burns through file descriptors hurts a worker pod, not the serving pod that the GUI depends on.
- Consistency. Same dispatch semantics, same retry config, same error envelope as every other tool kind. Lifecycle operations look like any other playbook execution from the outside.
## Where the architecture is going
Open follow-ups (none blocking):
- Mcp tab + Add-MCP wizard in the GUI. With the JSON schema and the curated `mcp_kubernetes.yaml` template, the wizard can prefill from a known-good shape and live-validate user input against `playbook.schema.json` as they type.
- Bake the kubernetes-mcp-server reader RBAC into the chart values so a fresh `lifecycle.deploy` doesn't need a manual `kubectl apply` afterwards.
- Tighter cluster RBAC for the worker by label selector or namespace allowlist once helm 3.x supports the necessary scoping cleanly.
- More MCP servers. The catalog template shape is generic — any MCP server that ships a helm chart can drop into the same lifecycle agent fleet with a different `spec.deployment.chart_ref` and a new `spec.runtime.agent`.
## Read more
- MCP End-to-End on Local Kind — full bring-up from scratch
- MCP End-to-End on GKE — same architecture in the cloud
- Older Kubernetes MCP runbook — pre-architecture-PR, kept for context
- Sink-driven storage — the event/projection pattern this builds on