v0.1.0 — Live

Squig Model Server

Run Local LLMs at Full Speed

A single-binary desktop server that runs any GGUF model with an OpenAI-compatible API. Continuous batching, speculative decoding, smart GPU offloading, and multi-model parallelism — zero cloud dependency, zero config.

Desktop · AI · Inference · GPU · Free

Screenshots

See it in action

Features

Everything you need

Continuous Batching

Serve multiple concurrent requests per model with parallel slots — no queuing, no waiting.

🚀 Speculative Decoding

2–3× faster generation: a small draft model proposes tokens and the main model verifies them in parallel.
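The draft-then-verify idea can be shown with a toy greedy sketch. This is illustrative only, not Squig's implementation; `draft_next` and `main_next` are stand-ins for the two models, each mapping a token context to its next token:

```python
def speculative_step(draft_next, main_next, prefix, k=4):
    """One round of greedy speculative decoding over a token prefix."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # 2. The main model checks the proposals. (A real server scores all
    #    k positions in a single batched forward pass, which is where
    #    the speedup comes from.)
    ctx = list(prefix)
    accepted = []
    for t in proposed:
        verdict = main_next(ctx)
        accepted.append(verdict)
        ctx.append(verdict)
        if verdict != t:  # first disagreement: keep main's token, stop
            break
    return accepted

# Toy models: the draft agrees with the main model for two steps, then diverges.
target = [1, 2, 3, 4]
main = lambda ctx: target[len(ctx)]
draft = lambda ctx: target[len(ctx)] if len(ctx) < 2 else 99
print(speculative_step(draft, main, [], k=4))  # [1, 2, 3]
```

Each round thus emits every accepted draft token plus one corrected token, so the main model advances several positions per forward pass instead of one.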

🧠 Smart GPU Offloading

Automatic layer splitting between GPU and CPU. Detects VRAM, tunes context, KV cache quant, and batch sizes at load time.
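In principle, layer splitting reduces to a budget calculation. A hypothetical sketch of the heuristic (the reserve size and per-layer cost are illustrative assumptions, not Squig's actual tuner):

```python
def split_layers(total_layers, layer_bytes, free_vram, reserve=1 << 30):
    """Offload as many transformer layers to the GPU as fit in free VRAM,
    keeping a reserve (here 1 GiB, an assumed figure) for the KV cache
    and scratch buffers. Returns (gpu_layers, cpu_layers)."""
    budget = max(0, free_vram - reserve)
    gpu_layers = min(total_layers, budget // layer_bytes)
    return gpu_layers, total_layers - gpu_layers

# 32-layer model, ~200 MiB per layer, 4 GiB of free VRAM:
gpu, cpu = split_layers(32, 200 * 2**20, 4 * 2**30)
print(gpu, cpu)  # 15 17
```

A real tuner also trades layer count against context length and batch size, but the VRAM budget is the binding constraint either way.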

🔀 Multi-Model Parallelism

Load multiple models simultaneously — each runs in its own isolated process on a unique port.

🔌 OpenAI-Compatible API

Drop-in replacement for the OpenAI SDK. Chat completions, text completions, embeddings, tool calling, and streaming — all at /v1.
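Because the endpoints follow the OpenAI schema, any OpenAI client works once it points at the local server. A minimal, dependency-free sketch using only the Python standard library — the port and model filename here are assumptions, so check the dashboard for your actual values:

```python
import json
from urllib import request

# Assumed values: the real port and model name come from your Squig dashboard.
BASE_URL = "http://localhost:8080/v1"

def chat_request(model, messages):
    """Build a POST request for the /v1/chat/completions endpoint."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request("my-model-q4_k_m.gguf",
                   [{"role": "user", "content": "Hello!"}])
# With the server running:
# reply = json.load(request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```

The official `openai` SDK works the same way: construct the client with `base_url` set to the local server and any placeholder API key.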

🤗 HuggingFace Integration

Search, browse, and one-click download GGUF models from HuggingFace Hub with live progress tracking.

🎮 Multi-Backend GPU Support

CUDA, Vulkan, ROCm, or CPU — auto-detects your hardware and picks the fastest backend. Backend builds install side by side.

💾 KV Cache Quantization

Independent key and value cache quantization with 9 datatypes. Save 50–75% KV memory with minimal quality loss.
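The 50–75% figure follows directly from element sizes. A back-of-the-envelope sketch — the model geometry is an assumed 7B-class example, and the ~0.56 bytes/element for q4_0 mirrors llama.cpp's 18-bytes-per-32-element block layout:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    # Keys and values each hold n_layers * n_kv_heads * head_dim * ctx_len
    # elements, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Assumed 7B-class geometry (illustrative, not a Squig default):
layers, kv_heads, hdim, ctx = 32, 8, 128, 8192
f16 = kv_cache_bytes(layers, kv_heads, hdim, ctx, 2.0)      # f16 K and V
q4  = kv_cache_bytes(layers, kv_heads, hdim, ctx, 18 / 32)  # q4_0 ≈ 0.5625 B/elem

print(f"f16 KV cache:  {f16 / 2**20:.0f} MiB")
print(f"q4_0 KV cache: {q4 / 2**20:.0f} MiB ({1 - q4 / f16:.0%} saved)")
```

For this geometry the f16 cache is 1024 MiB and the q4_0 cache is 288 MiB, a 72% saving; quantizing only one of K or V lands between the two, which is why the keys and values are tunable independently.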

📊 Self-Optimizing Performance

Built-in performance profiler with automated tuning suggestions, bottleneck detection, and trend analysis.

📦 Single Binary, Full Dashboard

Svelte 5 dashboard compiled into the Rust binary — model management, chat, performance monitoring, and dev tools in one file.

🌡️ Live GPU Monitoring

Real-time GPU utilization, VRAM usage, temperature, power draw, and clock speeds via nvidia-smi.
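This style of polling builds on `nvidia-smi`'s machine-readable query mode. A sketch of how such a poller can work in principle (the query fields are standard `nvidia-smi` options, but this is not Squig's actual code):

```python
import csv
import io

FIELDS = "utilization.gpu,memory.used,temperature.gpu,power.draw"

def parse_smi(text):
    """Parse output of:
    nvidia-smi --query-gpu=<FIELDS> --format=csv,noheader,nounits
    into one dict per GPU."""
    rows = csv.reader(io.StringIO(text))
    return [dict(zip(FIELDS.split(","), (v.strip() for v in row)))
            for row in rows if row]

# Example line as nvidia-smi would emit it for one GPU:
sample = "42, 5120, 61, 180.5\n"
print(parse_smi(sample))
# To poll a real GPU, run the command above via subprocess on a timer
# and feed its stdout to parse_smi.
```

Polling on an interval and diffing successive samples is enough to drive live utilization, VRAM, temperature, and power graphs.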

🔄 Auto-Updater

Signed Windows and Linux installers with over-the-air updates via Tauri 2.

What's New

Latest in v0.1.0

🎉 Initial Release

Desktop app with OpenAI-compatible API, multi-backend GPU support, and embedded dashboard.

🚀 Speculative Decoding

2–3× inference speedup with draft model prediction and parallel verification.

🤗 HuggingFace Hub

Search, browse, and download GGUF models directly from HuggingFace with one click.

📊 Self-Optimization

Performance profiler with automated tuning suggestions and bottleneck detection.

Pricing

Simple, fair pricing

Free

Squig Model Server License

100% free — no account required
All future updates included
OpenAI-compatible API
Multi-backend GPU support
Windows + Linux installers
Local-first — your data stays yours

Start using Squig Model Server today

Join developers and creators who build with AI that respects their workflow.