Squig Model Server
Run Local LLMs at Full Speed
A single-binary desktop server that runs any GGUF model with an OpenAI-compatible API. Continuous batching, speculative decoding, smart GPU offloading, and multi-model parallelism — zero cloud dependency, zero config.
Screenshots
See it in action
Features
Everything you need
Continuous Batching
Serve multiple concurrent requests per model with parallel slots — no queuing, no waiting.
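From the client side, continuous batching simply means concurrent requests are fine. A minimal sketch, assuming a model is already loaded and listening locally; the port (8080) and model name below are placeholders, so substitute the values shown in the dashboard:

```ts
// Fire several chat requests at the same model at once; with parallel slots
// they are served concurrently rather than queued one after another.
// Port and model name are placeholders.
async function main() {
  const prompts = [
    "Summarize GGUF in one sentence.",
    "What is KV cache quantization?",
    "Name three uses for a local LLM.",
  ];

  const replies = await Promise.all(
    prompts.map(async (content) => {
      const res = await fetch("http://localhost:8080/v1/chat/completions", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "llama-3.1-8b-instruct",
          messages: [{ role: "user", content }],
        }),
      });
      const json = await res.json();
      return json.choices[0].message.content;
    }),
  );

  console.log(replies);
}

main();
```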
Speculative Decoding
2–3× faster generation: a small draft model proposes tokens and the main model verifies them in parallel.
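The idea in sketch form (greedy variant, not Squig's internal code): draftNextToken and mainNextToken below are toy stand-ins for real models, which sample from logits over a vocabulary and verify all drafted positions in a single parallel forward pass.

```ts
// Conceptual sketch of speculative decoding (greedy variant), not Squig's code.
// Toy stand-ins for the small draft model and the large main model.
const draftNextToken = (ctx: number[]) => (ctx[ctx.length - 1] * 2) % 97;
const mainNextToken = (ctx: number[]) => (ctx[ctx.length - 1] * 2) % 101;

function speculativeStep(context: number[], k: number): number[] {
  // 1. The cheap draft model proposes k tokens autoregressively.
  const proposed: number[] = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draftNextToken([...context, ...proposed]));
  }

  // 2. The main model verifies the proposals; tokens are accepted until the
  //    first disagreement, where the main model's own token is used instead.
  const accepted: number[] = [];
  for (const draftToken of proposed) {
    const mainToken = mainNextToken([...context, ...accepted]);
    if (mainToken === draftToken) {
      accepted.push(draftToken); // agreement: keep the drafted token
    } else {
      accepted.push(mainToken);  // disagreement: correct and stop
      break;
    }
  }

  // Every accepted token matches what the main model would have produced, so
  // output is unchanged while up to k tokens land per main-model pass.
  return accepted;
}

console.log(speculativeStep([3, 7, 11], 4)); // e.g. [22, 44, 88, 75]
```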
Smart GPU Offloading
Automatic layer splitting between GPU and CPU. Detects available VRAM and tunes context size, KV cache quantization, and batch sizes at load time.
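Roughly the kind of arithmetic involved (this is not Squig's actual heuristic, and every number below is made up for illustration):

```ts
// Illustrative only: a rough "how many layers fit in VRAM?" estimate.
// Real numbers come from GGUF metadata and detected hardware; these are made up.
const vramBytes     = 8 * 1024 ** 3;   // detected VRAM (8 GiB GPU)
const reservedBytes = 2 * 1024 ** 3;   // reserve for KV cache, compute buffers, display
const modelBytes    = 8.0 * 1024 ** 3; // e.g. a 13B model at ~4-bit quantization
const layerCount    = 40;

const bytesPerLayer = modelBytes / layerCount;
const gpuLayers = Math.min(
  layerCount,
  Math.floor((vramBytes - reservedBytes) / bytesPerLayer),
);

console.log(`Offload ${gpuLayers}/${layerCount} layers to GPU, rest stay on CPU`);
// -> Offload 30/40 layers to GPU, rest stay on CPU
```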
Multi-Model Parallelism
Load multiple models simultaneously — each runs in its own isolated process on a unique port.
OpenAI-Compatible API
A drop-in replacement for the OpenAI API, so the official SDK works unchanged. Chat completions, text completions, embeddings, tool calling, and streaming, all at /v1.
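A minimal sketch using the official openai npm package; the port, API key, and model name are placeholders, so substitute the values shown in the dashboard:

```ts
import OpenAI from "openai";

// Placeholders: use the port and model name shown in the Squig dashboard.
const client = new OpenAI({
  baseURL: "http://localhost:8080/v1", // Squig's OpenAI-compatible endpoint
  apiKey: "sk-local",                  // assumed unused by the local server
});

async function main() {
  const response = await client.chat.completions.create({
    model: "llama-3.1-8b-instruct",    // whichever GGUF model is loaded
    messages: [{ role: "user", content: "Say hello in five words." }],
  });
  console.log(response.choices[0].message.content);
}

main();
```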
HuggingFace Integration
Search, browse, and one-click download GGUF models from HuggingFace Hub with live progress tracking.
Multi-Backend GPU Support
CUDA, Vulkan, ROCm, or CPU — auto-detects your hardware and picks the fastest backend. Run side-by-side builds.
KV Cache Quantization
Quantize the key and value caches independently, with 9 data types to choose from. Save 50–75% of KV cache memory with minimal quality loss.
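For a sense of scale, a back-of-the-envelope calculation for a Llama-3-8B-class model at 8K context. The figures are illustrative only, not measurements from Squig, and ignore per-block scale overhead in the quantized formats:

```ts
// KV cache sizing: 2 caches (K and V) x layers x context x KV heads x head dim x bytes/element.
const layers = 32;
const kvHeads = 8;
const headDim = 128;
const contextTokens = 8192;

// Approximate bytes per element for a few cache data types.
const bytesPerElement = { f16: 2.0, q8_0: 1.0, q4_0: 0.5 };

for (const [dtype, bytes] of Object.entries(bytesPerElement)) {
  const totalBytes = 2 * layers * contextTokens * kvHeads * headDim * bytes;
  console.log(`${dtype}: ${(totalBytes / 1024 ** 3).toFixed(2)} GiB`);
}
// -> f16: 1.00 GiB, q8_0: 0.50 GiB (50% saved), q4_0: 0.25 GiB (75% saved)
```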
Self-Optimizing Performance
Built-in performance profiler with automated tuning suggestions, bottleneck detection, and trend analysis.
Single Binary, Full Dashboard
Svelte 5 dashboard compiled into the Rust binary — model management, chat, performance monitoring, and dev tools in one file.
Live GPU Monitoring
Real-time GPU utilization, VRAM usage, temperature, power draw, and clock speeds via nvidia-smi.
Auto-Updater
Signed Windows and Linux installers with over-the-air updates via Tauri 2.
What's New
Latest in v0.1.0
Initial Release
Desktop app with OpenAI-compatible API, multi-backend GPU support, and embedded dashboard.
Speculative Decoding
2–3× inference speedup with draft model prediction and parallel verification.
HuggingFace Hub
Search, browse, and download GGUF models directly from HuggingFace with one click.
Self-Optimization
Performance profiler with automated tuning suggestions and bottleneck detection.
Pricing
Simple, fair pricing
Squig Model Server License
Start using Squig Model Server today
Join developers and creators who build with AI that respects their workflow.