Kimari

Local AI Runtime

Server running
v0.1.82-alpha — gate BLOCKED

Run useful local AI on older NVIDIA GPUs

Kimari is the framework. Kimari-4B is not released. No public weights, adapters, or GGUF files are available.

Why Kimari?

Built for the GPUs you already own.

Older GPU support

Designed for GTX 1060 and GTX 1080 — not just the latest cards.

Zero cloud dependency

Everything runs locally. No subscriptions, API keys, or telemetry.

OpenAI-compatible

Drop-in endpoint for existing tools, agents, and integrations.

CLI-first

One command to install, start, and diagnose. No npm needed.

Honest status

No inflated benchmarks, no "coming soon" claims. Alpha means alpha.

Future Kimari models

Private training/eval infrastructure for future Kimari models.

Integration Code

Connect your apps in seconds with the OpenAI-compatible API.
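For example, here is a minimal Python sketch using the official openai package (v1+). It assumes the server is running at the default endpoint and that the TinyLlama test model from the API examples below is loaded; the api_key value is a placeholder, since the local server does not check it.

# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local Kimari endpoint.
client = OpenAI(base_url="http://127.0.0.1:11435/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="tinyllama-1.1b-chat-v1.0.Q4_K_M",  # model id from the API examples below
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)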


4 GPU Profiles · 228 tok/s Peak Prompt Speed · 100% Local-first · 3 Integration Guides

Current Status

Transparent project health at a glance.

Project Status
  • Framework / CLI: Working
  • Local GGUF runtime: Working
  • OpenAI-compatible endpoint: Working
  • GTX 1060 validation: Working
  • Gateway Dashboard: Preview
  • Open WebUI / Continue configs: Working
  • Kimari SFT/private adapter: Private
  • Public Kimari-4B weights: Blocked
  • Public GGUF Kimari model: Blocked
  • Release gate: Blocked

Model Status
  • TinyLlama test profile: Working
  • Kimari Runtime 1.5B: Private
  • Kimari Core 3B: Private
  • Kimari-4B: Blocked
  • Official Kimari GGUF: Blocked

Safety Defaults

  • Host: 127.0.0.1
  • Public bind: Disabled
  • Gate: BLOCKED
  • Tokens in UI: No
  • Public upload: No
  • Public GGUF: No
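One quick way to confirm the localhost-only default from outside the CLI is a minimal Python socket probe. This is a sketch; the LAN address below is a hypothetical placeholder for your machine's own network IP.

import socket

def port_open(host: str, port: int = 11435) -> bool:
    # connect_ex returns 0 when a TCP connection succeeds
    with socket.socket() as s:
        s.settimeout(2)
        return s.connect_ex((host, port)) == 0

print("loopback reachable:", port_open("127.0.0.1"))  # True while the server runs
print("LAN reachable:", port_open("192.168.1.50"))    # hypothetical LAN IP; expect False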

Quick Start

From zero to running in under 5 minutes.

1

Install

$ curl -fsSL https://raw.githubusercontent.com/smouj/kimari-local-ai/main/install.sh | bash

One-command install on Linux/WSL2

2

Open Console

$ kimari console

Guided setup experience

3

Diagnose

$ kimari doctor --deep

14 diagnostic checks

4

Download Model

$ kimari pull test

TinyLlama 1.1B test model

5

Start API

$ kimari start

OpenAI-compatible endpoint at 127.0.0.1:11435/v1

6

Open Dashboard

$ kimari gateway start --open

Local Gateway Dashboard


CLI Command Explorer

Every command at your fingertips.

$ kimari doctor --deep

14 deep diagnostic checks

$ kimari info

Installation info & version

$ kimari status

Server status

View full CLI reference

Command Builder

Visually construct Kimari CLI commands with the right flags.


System Requirements

Check if your system is ready to run Kimari.

System Requirements Check
Simulated compatibility check
  • Python 3.10+: PASS
  • NVIDIA GPU (6GB+ VRAM): PASS
  • CUDA 12.1+: PASS
  • 4GB free disk space: PASS
  • WSL2 / Linux: RECOMMENDED
  • llama.cpp binary: PASS
Compatibility Score: 85%

This is a simulated check. Run kimari doctor --deep for real diagnostics.

Detailed Requirements

Minimum and recommended specs for running Kimari.

Component | Minimum | Recommended
GPU | NVIDIA GTX 1060 6GB | NVIDIA RTX 3060 12GB+
VRAM | 6 GB | 12 GB+
CPU | 4 cores, 2.5 GHz | 8+ cores, 3.5 GHz+
RAM | 8 GB | 16 GB+
Disk | 4 GB free | 20 GB+ SSD
OS | Ubuntu 20.04+ / WSL2 | Ubuntu 22.04+ native
CUDA | 12.1 | 12.4+

Find Your Profile

Answer 3 questions and get a personalized GPU recommendation.

GPU Compatibility Quiz


Local Runtime Validation

Tested on a real NVIDIA GTX 1060 6GB under WSL2. Measured in tok/s with CUDA acceleration.

CUDA vs CPU
NVIDIA GTX 1060 6GB — TinyLlama 1.1B Q4_K_M
Prompt processing: 228 tok/s (CUDA) vs 77 tok/s (CPU)
Token generation: 73 tok/s (CUDA) vs 33 tok/s (CPU)
TinyLlama test model — NOT Kimari-4B. Local validation only.
GPU Profiles
Pre-tuned for specific hardware — VRAM and Quantization optimized
test (Default): GPU Any 6 GB+ · VRAM 6 GB · Quant Q4_K_M
gtx1060 (Requires Kimari-4B): GPU GTX 1060 · VRAM 6 GB · Quant Q4_K_M
gtx1080 (Requires Kimari-4B): GPU GTX 1080 · VRAM 8 GB · Quant Q5_K_M
turbo (Requires Kimari-4B): GPU 6 GB+ · VRAM 6 GB · Quant IQ4_XS

Validated test environment: GPU GTX 1060 6GB · OS WSL2 Ubuntu 24.04 · Backend llama-server CUDA · Test model TinyLlama 1.1B Q4_K_M

GPU Performance Comparison

Estimated benchmark performance across different NVIDIA GPUs.

Prompt Processing (tok/s)

GTX 1060 6GB: 228 tok/s
GTX 1080 8GB: 310 tok/s
RTX 3060 12GB: 520 tok/s
RTX 4070 12GB: 890 tok/s

Token Generation (tok/s)

GTX 1060 6GB: 73 tok/s
GTX 1080 8GB: 98 tok/s
RTX 3060 12GB: 145 tok/s
RTX 4070 12GB: 230 tok/s
GTX 1060 results are validated; others are projected estimates.
Test models: TinyLlama 1.1B (GTX 1060, GTX 1080) · Qwen3-4B (RTX 3060, RTX 4070)

Benchmark History

Performance progression across Kimari versions on GTX 1060.

All measurements on GTX 1060 6GB with TinyLlama 1.1B Q4_K_M. Prompt and generation speeds measured separately.

GPU VRAM Calculator

Check if your GPU can run a specific model — see VRAM usage at a glance.

VRAM Usage: 1.2 GB / 6 GB

Model fits!

Runs comfortably on GTX 1060 6GB

Recommended Quantization

IQ4_XS

Max Context Length

4,096 tokens

Optimization Tips

Reduce context length to 2048

Saves ~0.3 GB VRAM for faster inference on 6 GB cards

Use IQ4_XS quantization

Saves 15% VRAM vs Q4_K_M with minimal quality loss

Close other GPU applications

Free VRAM for larger batch sizes and smoother generation

Estimated VRAM values. Actual usage depends on context length and batch size.
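The calculator's arithmetic can be approximated with a simple heuristic. This is a sketch, not Kimari's actual formula; the bits-per-weight figures and the KV-cache and overhead constants are rough assumptions.

BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "IQ4_XS": 4.3}  # approximate

def estimate_vram_gb(params_b: float, quant: str, ctx: int = 4096) -> float:
    weights = params_b * BITS_PER_WEIGHT[quant] / 8  # quantized weight size in GB
    kv_cache = 0.3 * ctx / 4096                      # assumed ~0.3 GB per 4k context
    overhead = 0.3                                   # assumed runtime overhead
    return weights + kv_cache + overhead

# TinyLlama 1.1B at Q4_K_M with 4096 context -> ~1.3 GB,
# in line with the ~1.2 GB shown above.
print(f"{estimate_vram_gb(1.1, 'Q4_K_M'):.1f} GB")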

Model Performance Estimator

Estimate how a model will perform on your GPU for different tasks.

Latency: 180 ms
Tokens/sec: 73 tok/s
Time to First Token: 120 ms
Max Context: 4,096 tokens

Estimated Performance

These are projected estimates based on GPU specs and model size, not actual benchmarks. Real performance depends on quantization, context length, system config, and batch size. Only GTX 1060 results have been validated on real hardware.
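A common back-of-envelope check for the generation number: token generation is usually memory-bandwidth-bound, so tok/s is roughly memory bandwidth divided by the bytes read per token. This is a sketch only; the efficiency factor is an assumption tuned to match the validated GTX 1060 result.

def estimate_gen_tok_s(model_gb: float, bandwidth_gb_s: float,
                       efficiency: float = 0.3) -> float:
    # Each generated token reads (roughly) the whole model from VRAM.
    return bandwidth_gb_s * efficiency / model_gb

# GTX 1060 6GB has ~192 GB/s of memory bandwidth; TinyLlama Q4_K_M
# weights are ~0.7 GB -> ~82 tok/s, the same ballpark as the
# measured 73 tok/s.
print(f"{estimate_gen_tok_s(0.7, 192):.0f} tok/s")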

Model Comparison

Compare models side-by-side across speed, quality, VRAM, and context metrics.

Model Comparison Tool

Metric | TinyLlama 1.1B | Qwen3-4B | Kimari-4B preview
Parameters | 1.1B | 4B | 4B
Gen Speed | 73 tok/s | 42 tok/s | 38 tok/s
VRAM Required | 1.2 GB | 3.8 GB | 3.2 GB
Context Length | 4,096 | 8,192 | 8,192
Quantization | Q4_K_M | Q5_K_M | Q4_K_M
Quality Score | 35/100 | 65/100 | 70/100

Quality scores are estimated. Only GTX 1060 speed results are validated.

Local vs Cloud

How running Kimari locally compares to cloud API services.

Local vs Cloud Performance

Simulated comparison. Kimari values based on GTX 1060 6GB benchmarks. Cloud values are typical for API services.

Resource Monitor

Simulated real-time GPU, memory, and CPU usage gauges.

Live (simulated): GPU 45% · VRAM 20% · CPU 12% · RAM 38% · Temp 62°C · Power 85W

GPU: GTX 1060 6GB · Driver: CUDA 12.1 · Model: TinyLlama Q4_K_M · Uptime: 4h 23m

Simulated metrics for demonstration. Real values available via kimari status.

GPU Temperature Simulator

See how GPU temperature changes under different workloads.

Idle Mode

System idle, no model loaded

Temperature: 38°C · Fan Speed: 30% · Power Draw: 15W
Simulated temperatures. Actual values depend on ambient temp, airflow, and thermal paste.

Model Download Simulator

Simulate downloading different model quantizations with speed and progress.

Simulated download. Actual speed depends on network and HuggingFace CDN.

Deployment Checklist

Track your progress from install to production.

Progress: 0% · Prerequisites 0/3 · Install 0/2 · Configure 0/2 · Test 0/3 · Deploy 0/2

Progress saved to your browser localStorage.

Error Troubleshooter

Search for common errors and find solutions quickly.

CUDA out of memory (Runtime)

Reduce context length with --ctx-size flag, use a smaller quantization (IQ4_XS or Q4_K_M), or close other GPU applications. Run kimari optimize for recommended settings.

Model not found (Models)

Run kimari pull test or kimari pull recommended to download the model. Check that the model file exists in ~/.config/kimari/models/.

Port already in use (Server)

Another process is using port 11435. Run kimari stop first, or use --port flag to specify a different port.

llama-server binary not found (Install)

Re-run the install script to download llama-server. Ensure /usr/local/bin/llama-server exists and is executable. Run kimari doctor to check.

Slow token generation (Performance)

Ensure CUDA is properly installed and the GPU is being used (check for "CUDA" in kimari status output). CPU-only mode is significantly slower. Run kimari optimize for GPU-specific settings.

Connection refused (Server)

Make sure the server is running with kimari start. Check the endpoint URL matches the running server (default: http://127.0.0.1:11435/v1).

GGUF file corrupt (Models)

Re-download the model with kimari pull. Verify integrity with kimari models hash <path>. The SHA256 should match the published hash.
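To compute the hash without the CLI, the check is plain SHA256 over the file. A minimal Python sketch; the model filename is hypothetical.

import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    # Hash the file in 1 MiB chunks to keep memory use flat.
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

model = Path("~/.config/kimari/models/model.gguf").expanduser()  # hypothetical name
print(sha256_of(model))  # compare against the published hash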

Gate BLOCKED (Models)

This is expected. The Kimari-4B release gate is BLOCKED — no public weights are available. Use test or recommended profiles with community models instead.

Gateway dashboard blank (Gateway)

Run kimari gateway setup to install dependencies, then kimari gateway start --open. Ensure Node.js 18+ is installed.

Context length overflow (Runtime)

The requested context length exceeds available VRAM. Reduce --ctx-size or use a smaller model. GTX 1060 6GB supports up to 8192 context with Q4_K_M on TinyLlama.

Common errors and solutions. For detailed diagnostics, run kimari doctor --deep.

Prompt Template Library

Ready-to-use system prompts for local LLMs — copy and customize.

Code Assistant

Development

Full-stack coding helper with best practices

You are an expert software engineer. Help me write clean, efficient, and well-documented code. When I describe a problem...

Creative Writing

Creative

Story and narrative generation assistant

You are a creative writing assistant. Help me craft engaging stories, poems, and narratives. Focus on vivid descriptions...

Code Reviewer

Development

Thorough code review with security focus

You are a senior code reviewer. Analyze my code for bugs, security vulnerabilities, performance issues, and style proble...

Data Analyst

Analysis

Data interpretation and insight generation

You are a data analysis expert. Help me interpret data, identify trends, and create insights. When I provide data, give ...

Technical Writer

Writing

Documentation and technical content creation

You are a technical writer. Help me create clear, accurate documentation for software projects. Write API docs, README f...

Debug Assistant

Development

Step-by-step debugging and error resolution

You are a debugging expert. When I share an error message or describe unexpected behavior, help me identify the root cau...

Smart Summarizer

Analysis

Multi-length text summarization

You are an expert summarizer. When I provide text, create a concise summary that captures the key points, main arguments...

Translation Expert

Language

Context-aware translation with cultural notes

You are a professional translator. Translate text between languages while preserving tone, context, and cultural nuances...

Configuration Wizard

Get a personalized configuration in 3 steps.


How Kimari Compares

See how Kimari stacks up against other local LLM tools.

Tools compared: Kimari · Ollama · LM Studio · text-gen-webui
Features compared:
  • Old GPU focus
  • CLI-first
  • OpenAI-compatible
  • Gateway Dashboard
  • GPU Profiles
  • Model Hashing
  • Local-only default
  • Open Source

How Kimari Works

A look inside the local AI runtime architecture.

  • Gateway Dashboard: http://127.0.0.1:3105
  • Your Applications: Open WebUI · Continue.dev
  • OpenAI-Compatible API: http://127.0.0.1:11435/v1
  • Kimari Runtime: llama-server · CUDA · GGUF
  • NVIDIA Hardware: GPU · VRAM · CUDA Cores
  • CLI Commands: kimari doctor · start · pull

Try the Terminal

Interactive simulator — type a Kimari command and see the output. No real server needed.

kimari-console
Kimari Terminal Simulator v0.1.82-alpha
Type "help" for available commands.
$

Ask Kimari AI

Chat with an AI assistant that knows everything about Kimari.

Powered by local AI knowledge

API Health Check

Simulate a health check on the local API endpoint.


Gateway Dashboard

Local monitoring and management from your browser.

http://127.0.0.1:3105

Overview

Runtime Active

GPU: GTX 1060 · VRAM: 1.2 / 6 GB · Model: TinyLlama · Gate: BLOCKED

VRAM Usage: 1221 MiB / 6144 MiB (20%)

$ kimari gateway start --open

Frequently Asked Questions

Common questions about Kimari, answered honestly.

API Endpoint Explorer

OpenAI-compatible endpoints ready for your integrations.

POST /v1/chat/completions

Send a chat message and get a completion response

Request Body
{
  "model": "tinyllama-1.1b-chat-v1.0.Q4_K_M",
  "messages": [
    {
      "role": "user",
      "content": "Hello, how are you?"
    }
  ],
  "temperature": 0.7,
  "max_tokens": 256
}
Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "model": "tinyllama-1.1b-chat-v1.0.Q4_K_M",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! I am doing well, thank you for asking. How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 18,
    "total_tokens": 30
  }
}
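The same request can also be sent with plain HTTP and no SDK. A minimal sketch mirroring the request body above:

import requests

payload = {
    "model": "tinyllama-1.1b-chat-v1.0.Q4_K_M",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 0.7,
    "max_tokens": 256,
}
r = requests.post("http://127.0.0.1:11435/v1/chat/completions",
                  json=payload, timeout=60)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])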

Compatible Models

Detailed specifications for each compatible model.

🦙 TinyLlama 1.1B (1.1B params)
Quantization: Q4_K_M · VRAM: 1.2 GB · Speed: 73 tok/s · Context: 4096
Fast, compact model for basic chat and testing. Great starting point.

🧠 Qwen3-4B (4B params)
Quantization: Q5_K_M · VRAM: 3.8 GB · Speed: 42 tok/s · Context: 8192
Balanced model with good quality. Recommended for most use cases.

🔮 Kimari-4B (4B params)
Quantization: Q4_K_M · VRAM: 3.2 GB · Speed: 38 tok/s · Context: 8192
Future Kimari model. Not released — gate BLOCKED.

🦙 Llama 2 7B (7B params)
Quantization: Q4_K_M · VRAM: 5.5 GB · Speed: 18 tok/s · Context: 4096
Larger model requiring more VRAM. Best on 8GB+ cards.

Phi-3 Mini (3.8B params)
Quantization: Q4_K_M · VRAM: 2.8 GB · Speed: 55 tok/s · Context: 4096
Microsoft's compact model. Strong reasoning for its size.

💎 Gemma 2 2B (2B params)
Quantization: Q5_K_M · VRAM: 1.8 GB · Speed: 62 tok/s · Context: 8192
Google's efficient model. Good quality-to-size ratio.

What's New

Latest changes and improvements.

Project Timeline

Major milestones on the path to local AI.

Release · Jan 2026

v0.1.0: Initial CLI framework

First working CLI with doctor, start, pull commands

Feature · Feb 2026

v0.1.4: Gateway Dashboard preview

Local web dashboard for monitoring runtime

Milestone · Mar 2026

v0.1.6: GTX 1060 validated

228 tok/s prompt processing confirmed on real hardware

Feature · Mar 2026

v0.1.7: Integration guides

Open WebUI, Continue.dev, OpenClaw configurations

Feature · Apr 2026

v0.1.8: GPU profiles & hashing

Pre-tuned profiles and SHA256 model verification

Current · May 2026

v0.1.82: Current alpha

Gate BLOCKED — quality and safety review in progress

Safety & Privacy

Local-first and conservative by default.

  • Localhost-only defaults
  • No public bind unless explicitly requested
  • No token storage in dashboard
  • No public model upload
  • No public GGUF
  • No benchmark claims without reproducible validation
  • No automatic gate transitions

Roadmap

Honest, incremental progress.

Current

Local runtime + Gateway Dashboard polish

Next

Private adapter runtime preview

Next

Agent Gateway tools & web-search dry-run

Next

Manual review of private outputs

Later

Private GGUF export

Later

GTX 1060 / GTX 1080 validation

Later

Public preview decision

Hugging Face

The Space is a compatibility/demo tool. It does not run Kimari-4B.