Web — Tech News

How to Tune llama.cpp --n-gpu-layers: A Practical VRAM Guide (2026)

You already know what --n-gpu-layers does. It moves transformer layers onto your GPU. This post is the next step: how to actually pick the number. If …

localllm llamacpp gpu vram

How fast is LlamaStash? Overhead, throughput, and a fair comparison with Ollama and LM Studio

Originally published at deepu.tech. In my release post for LlamaStash I made a claim I need to back up. The wrapper adds zero overhead vs running lla…

ai llamacpp benchmark llm

Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance

The Claude Agent SDK exposes three budget tiers (haiku, sonnet, opus) and reads its routing target from environment variables on every call. That …

llm claude llamacpp benchmark

Qwen 3.6 27B and 35B MTP vs Standard on 16GB GPU

I tested speculative decoding (Multi-Token Prediction, MTP) performance in Qwen 3.6 27B and 35B on an RTX 4080 with 16 GB VRAM. For a broader view of …

selfhosting llm ai llamacpp

Ollama vs llama.cpp vs vLLM: Which Should You Use in 2026?

From the Best GPU for LLM archive. The canonical version has interactive calculators, an up-to-date GPU comparison table, and live pricing. Three tool…

ollama llamacpp vllm comparison

Discontinued Optane Local LLM Powers a Kimi K2.5 Desktop Run

A user on r/LocalLLaMA reported on May 12 that an Optane local LLM desktop build ran Moonshot’s Kimi K2.5 at about 4 tokens per second using discontin…

intel optane kimik25 llamacpp