AI Infrastructure & Local Deployment • NDA • 2026
Private AI Platform for 200-Employee Engineering Firm
A fully self-hosted AI platform running Qwen3 models on 4x NVIDIA L40S GPUs for a German engineering consultancy, replacing €14K/month in cloud AI subscriptions with local chat, RAG, transcription, and code assistance for all 200 employees.
200 Employees Onboarded
€0 Per-Query Cost
94% Monthly Active Adoption
<4 Months Hardware Payback Period
Qwen3-32B · Qwen3-8B · Qwen2.5-Coder-32B · NVIDIA L40S · vLLM · Open WebUI · Whisper · Ollama · Qdrant · FastAPI · Docker · Keycloak
Deep Dive
The platform was delivered in four components over three months, starting with the hardware and model-serving infrastructure and then rolling out capabilities to departments in stages. Short integration sketches for components 1, 3, and 4 follow the table.
| Component | Focus | Status | Key Deliverables |
|---|---|---|---|
| 1 | GPU Infrastructure & Model Serving | Completed | Supermicro 4x L40S server, vLLM inference engine, Ollama model management, Docker orchestration, Qwen3-32B and Qwen3-8B deployment |
| 2 | Company-Wide Chat Platform | Completed | Open WebUI deployment, Keycloak SSO with Active Directory, per-department prompt templates, model selection interface, conversation history |
| 3 | RAG Knowledge Base | Completed | Qdrant vector database, 23 years of project archives indexed, DIN/Eurocode standards, proposal templates, source-cited search interface |
| 4 | Productivity Tools | Completed | Local Whisper transcription (German + English), meeting summary generation, Qwen2.5-Coder-32B for Python/MATLAB code assistance, report draft automation |
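
Component 1 exposes the Qwen3 models through vLLM's OpenAI-compatible API, which is what Open WebUI and the internal FastAPI services talk to. Below is a minimal client sketch assuming the default vLLM port; the internal hostname `ai-gpu-01` and the exact model ID are illustrative, not the production values.

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; host, port, and model ID are illustrative.
client = OpenAI(
    base_url="http://ai-gpu-01:8000/v1",  # hypothetical internal hostname, default vLLM port
    api_key="local",                      # vLLM accepts any key unless started with --api-key
)

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",  # model name as registered with the vLLM server
    messages=[
        {"role": "system", "content": "You are an assistant for a structural engineering consultancy."},
        {"role": "user", "content": "Summarise the wind load assumptions in DIN EN 1991-1-4."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

Standardising on one OpenAI-compatible endpoint means the chat platform, the RAG pipeline, and the productivity tools can share the same serving layer without custom adapters.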

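For the RAG knowledge base (component 3), retrieval runs against Qdrant and every answer is returned with its source documents. The sketch below shows the shape of that retrieval step; the collection name, payload fields, and the assumption that the query embedding comes from a locally served embedding model are illustrative, not the production schema.

```python
from qdrant_client import QdrantClient

# Hypothetical collection name and payload fields; the production schema is not public.
qdrant = QdrantClient(url="http://qdrant:6333")

def retrieve(query_vector: list[float], top_k: int = 5) -> list[dict]:
    """Return the top-k archive chunks with source metadata for citation.

    query_vector is produced by a locally served embedding model (not shown here).
    """
    hits = qdrant.search(
        collection_name="project_archive",
        query_vector=query_vector,
        limit=top_k,
        with_payload=True,
    )
    # Each hit keeps the chunk text plus the document it came from, which the
    # chat layer renders as a source citation next to the generated answer.
    return [
        {"text": h.payload.get("text", ""), "source": h.payload.get("source_file", ""), "score": h.score}
        for h in hits
    ]
```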

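Component 4 chains local Whisper transcription into the same vLLM endpoint to draft meeting summaries. A rough pipeline sketch using the reference openai-whisper package; the deployed service may use a different Whisper runtime, and the endpoint and model IDs are the same illustrative values as above.

```python
import whisper
from openai import OpenAI

# Reference openai-whisper package; hostname and model IDs are illustrative.
asr = whisper.load_model("large-v3")
llm = OpenAI(base_url="http://ai-gpu-01:8000/v1", api_key="local")

def summarise_meeting(audio_path: str) -> str:
    """Transcribe a meeting recording locally and draft a summary with action items."""
    # Whisper auto-detects the language, covering both German and English meetings.
    transcript = asr.transcribe(audio_path)["text"]

    reply = llm.chat.completions.create(
        model="Qwen/Qwen3-8B",  # the smaller model is sufficient for summarisation
        messages=[
            {"role": "system", "content": "Summarise the meeting transcript as bullet points, then list action items."},
            {"role": "user", "content": transcript},
        ],
        max_tokens=400,
    )
    return reply.choices[0].message.content
```

Nothing in this pipeline leaves the server room: audio, transcript, and summary all stay on the local GPUs, which is the point of the deployment.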