LLM_Monitor

An LLM observability and orchestration platform implemented as a set of containerized microservices. A C#/.NET server acts as the edge/API gateway, forwarding user messages to a Python service built on LangChain/LangGraph that orchestrates the full AI workflow: policy and prompt-injection guardrails, retrieval-augmented generation, tool invocation, conversation memory, and a final grounded response.

The architecture deliberately mirrors commercial LLM application systems (edge gateway + orchestration service + data/observability plane) at reduced scale. This is a learning project under active development: every line of code is written by hand, and AI is used only for code review and for explaining concepts.

Architecture

user ─▶ [C#/.NET API gateway] ─HTTP─▶ [Python LangChain/LangGraph orchestrator]
             (validation,                 │ policy/injection check (guardrails)
              telemetry,                  │ RAG retrieval (pgvector)
              response shaping)           │ tool invocation (bounded agent loop)
                                          │ conversation memory (checkpointer)
                                          ▼
                    [Ollama: local LLM + embeddings]  [pgvector/Postgres: vectors, telemetry, history]

All services run as Docker containers on a private network, orchestrated via docker-compose.

Services

dotnet_server (C# / ASP.NET Core) — Edge gateway: request validation, middleware-based telemetry, forwards to the orchestrator via IHttpClientFactory
langchain_service (Python / Flask / LangChain / LangGraph) — Orchestrates the AI pipeline: guardrails → RAG → tools → memory → response
ollama — Serves local LLMs and embedding models (qwen2.5, nomic-embed-text)
pgvector-service (PostgreSQL + pgvector) — Vector store for RAG; relational store for telemetry and history
model-provisioning — Container job ensuring required models are pulled before serving
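
The langchain_service pipeline order (guardrails → RAG → tools → memory → response) can be illustrated with a minimal plain-Python sketch. It deliberately does not use the LangGraph API; it only shows the shape of the idea: nodes that update shared state, a conditional edge that blocks on a policy violation, and a bounded tool loop. Every name here (PipelineState, MAX_TOOL_STEPS, needs_tool, the toy injection check) is hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass, field

MAX_TOOL_STEPS = 3  # hypothetical bound on the tool loop


@dataclass
class PipelineState:
    """Shared state passed from node to node."""
    message: str
    history: list = field(default_factory=list)  # stand-in for checkpointer memory
    blocked: bool = False
    context: list = field(default_factory=list)
    tool_calls: int = 0
    response: str = ""
    trace: list = field(default_factory=list)    # visited nodes, useful for telemetry


def guardrails(state):
    state.trace.append("guardrails")
    # Toy check only; a real guardrail would be far more thorough.
    state.blocked = "ignore previous instructions" in state.message.lower()
    return state


def block(state):
    state.trace.append("block")
    state.response = "Request blocked by policy."
    return state


def retrieve(state):
    state.trace.append("retrieve")
    state.context = [f"document about: {state.message}"]  # stand-in for a pgvector search
    return state


def needs_tool(state):
    # Hypothetical rule; a real agent would inspect the model's output instead.
    return "calculate" in state.message.lower()


def call_tool(state):
    state.trace.append("tools")
    state.tool_calls += 1
    return state


def respond(state):
    state.trace.append("respond")
    state.response = f"Answer grounded in {len(state.context)} document(s)."
    state.history.append(state.message)  # remember the turn
    return state


def run_pipeline(message, history=None):
    state = PipelineState(message=message, history=history if history is not None else [])
    state = guardrails(state)
    if state.blocked:                      # conditional edge: stop on violation
        return block(state)
    state = retrieve(state)
    while needs_tool(state) and state.tool_calls < MAX_TOOL_STEPS:
        state = call_tool(state)           # the loop cannot exceed MAX_TOOL_STEPS
    return respond(state)
```

Calling run_pipeline("hello").trace shows which nodes ran, which is the kind of per-request decision record that node-level telemetry can capture.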

Key Engineering Decisions

Mock/live seam via a model factory + env config — enables full pipeline development on low-compute hardware with a one-flag switch to real inference; demonstrates test-double and dependency-inversion thinking
LangGraph state-machine orchestration — the request pipeline is modeled as nodes, conditional edges, and shared state rather than a monolithic function, supporting branching (block on policy violation), a bounded tool loop, and checkpointer-based memory
pgvector over a standalone vector DB — keeps vectors and relational telemetry in one engine, enabling joins and simpler operations
Idempotent startup ingestion + healthcheck-gated dependencies — services wait for readiness, and re-running setup never duplicates data
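
As an illustration of the mock/live seam, here is a minimal sketch of a model factory driven by environment config. It uses only the standard library, not LangChain model classes, so it is an approximation of the idea and not the project's actual factory. The environment variable names (LLM_MODE, OLLAMA_BASE_URL, OLLAMA_MODEL), the class names, and the default host "ollama" are assumptions; the live path calls Ollama's /api/generate endpoint.

```python
import json
import os
import urllib.request


class MockChatModel:
    """Test double: returns a canned reply without any inference."""

    def invoke(self, prompt):
        return f"[mock] echo: {prompt}"


class OllamaChatModel:
    """Live model: sends the prompt to an Ollama server over HTTP."""

    def __init__(self, base_url, model):
        self.base_url = base_url.rstrip("/")
        self.model = model

    def invoke(self, prompt):
        payload = json.dumps(
            {"model": self.model, "prompt": prompt, "stream": False}
        ).encode("utf-8")
        request = urllib.request.Request(
            f"{self.base_url}/api/generate",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=60) as resp:
            return json.loads(resp.read())["response"]


def create_model(env=None):
    """Return a mock or live model depending on LLM_MODE (hypothetical flag)."""
    env = os.environ if env is None else env
    mode = env.get("LLM_MODE", "mock").lower()
    if mode == "mock":
        return MockChatModel()
    if mode == "live":
        return OllamaChatModel(
            base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434"),  # assumed host
            model=env.get("OLLAMA_MODEL", "qwen2.5"),
        )
    raise ValueError(f"Unknown LLM_MODE: {mode!r}")
```

Callers depend only on the invoke method, so the rest of the pipeline is unchanged when the flag flips from mock to live.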

Observability (the project's thesis)

Middleware- and node-level telemetry captures latency, token counts, model name, and pipeline decisions. A correlation ID is propagated across services so that a single request can be traced end to end. The design is intended to lead to OpenTelemetry export and dashboards.
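
A minimal sketch of node-level timing with correlation-ID propagation, assuming the gateway forwards an X-Correlation-ID header (a common convention; the header name is an assumption, not confirmed by this project). The in-memory TELEMETRY list stands in for the Postgres telemetry table, and the helper names (begin_request, outgoing_headers, traced) are hypothetical. Token counts and model name are omitted to keep the example short.

```python
import time
import uuid
from contextvars import ContextVar
from functools import wraps

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name
_correlation_id = ContextVar("correlation_id", default="")
TELEMETRY = []  # stand-in for a telemetry table


def begin_request(headers):
    """Adopt the caller's correlation ID, or mint one if none was sent.

    Note: this does an exact-case lookup on a plain dict; real HTTP
    frameworks expose case-insensitive header mappings.
    """
    cid = headers.get(CORRELATION_HEADER) or uuid.uuid4().hex
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to downstream calls so the ID keeps flowing."""
    return {CORRELATION_HEADER: _correlation_id.get()}


def traced(node_name):
    """Decorator that records latency and the current correlation ID."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even when the node raises.
                TELEMETRY.append({
                    "correlation_id": _correlation_id.get(),
                    "node": node_name,
                    "latency_ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query):
    return [f"document for: {query}"]  # placeholder for a real retrieval step


begin_request({CORRELATION_HEADER: "abc123"})
retrieve("what is pgvector?")
```

After this runs, the last entry in TELEMETRY carries the node name, its latency, and the same correlation ID the gateway supplied, which is what allows rows from different services to be joined per request.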

Tech Stack

Python, C#, ASP.NET Core, Flask, LangChain, LangGraph, Ollama, PostgreSQL, pgvector, Docker, docker-compose

Status

In active development. The multi-container environment, end-to-end local inference, and the mock/live model factory are in place; the LangGraph pipeline, RAG integration, conversation memory, and telemetry persistence are currently being wired in. An evaluation harness (golden dataset + LLM-as-judge) is planned.