Testing & QA — Tech News

EN

Hardening an AI coding agent: the failures, and the code that fixed them

At Univoco we build retrieval-augmented assistants over a customer's own documentation. One of them is a coding agent that writes code for a proprieta…

ai llm rag agents

EN

5 Practical RAG Challenges and How to Mitigate Them

Retrieval-Augmented Generation (RAG) sounds simple on paper: embed your documents, retrieve the relevant chunks, stuff them into a prompt, let the LLM…

rag ai llm machinelearning

EN

The memory layer that never calls an LLM: what that buys, and what it costs

Part 4 of The Answerability Problem, and the one that isn't about abstention. Parts 1–3 argued that the field measures the wrong half and that my o…

ai rag opensource discuss

EN

Relevance is not answerability: six signals, and none of them beat plain cosine

Part 2 of The Answerability Problem. Part 1 showed the standard harness excluding the questions that test refusal, my own system scoring 0.000, and…

ai rag machinelearning discuss

EN

Corrective RAG for billing: the bug is not retrieval, it's the model narrating correct numbers wrong

Most RAG demos are graded by an audience that cannot check the answer. Ask a docs bot something, get a fluent paragraph back, nobody in the room knows…

rag python llm ai

RU

Cloud AI falls short and MiniLM-L6 breaks on philosophy: building a local RAG for semantically complex texts

This project was a long time in the making and eventually began as yet another attempt to make sense of the philosophical texts written by Jane Roberts in the second…

rag local llm nlp python vector search

EN

Data, Context & RAG Lineage Governance for Enterprise AI Agents

The RAG Security Gap: Retrieval-Augmented Generation (RAG) has rapidly emerged as the foundational architecture for grounding enterprise AI agents in p…

ai security architecture rag

EN

I gave the same fabricated answer to RAGAS and DeepEval. One scored it 0.0. The other scored it 1.0

Here's an output from a RAG system asserting a pricing claim it was never given, for a question its context couldn't answer. I ran it past the two mos…

ai llm rag testing

EN

Stop Stuffing Your LLM Agent's Context Window: Structured Memory Categories with Mem0

Most tutorials on giving an LLM agent "memory" show you the same…

ai llm python rag

EN

Your RAG Index Might Be Lying to You: Data Freshness Is the Missing Signal for AI Systems

A follow-up to "How Old Is My Data?" The failure mode that gets worse when a machine is reading the data: in a classic dashboard, stale data is a human p…

ai rag observability opentelemetry

EN

🚀 From Transformers to AI Agents: The Complete Engineering Guide to Modern AI Architecture (LLMs, RAG, Vector Databases & Agentic Systems)

Most people think ChatGPT is "the AI." In reality, ChatGPT is just one layer of a much larger engineering stack. Modern AI applications aren't powered…

llm rag agenticsystem ai

EN

Coverage Before Creativity: The RAG Gate That Keeps My Blog Pipeline Honest

The first failure I had to eliminate in the blog pipeline was not a bad paragraph. It was a bad evidence set. The system was finding a few nearby chun…

rag supabase nextjs typescript

RU

Architectural pattern "LangGraph, hybrid RAG + signature engine": a universal graph for streaming data

We tried to automate the first line of the SOC. We wanted to combine the flexibility of LLMs with the reliability of signature engines. All of this fit into…

python langgraph ai-agents analysis pipeline llm rag cybersecurity soc pattern

RU

As a QA engineer I still write documentation, but with AI it takes me hours, not days

In this article I describe how I tackled the problem of getting up to speed on a project, first by hand and then with AI tools, and how I managed to shorten the path from day…

testing qa rag artificial intelligence documentation

EN

ur rag can return 200 and still be cooked — how we built Goose on SigNoz

ok so real talk. your latency graph looks clean. error rate is chilling. every /chat is HTTP 200. and the answers are just… wrong. like confidently wr…

rag signoz ai opensource

EN

Your baseline scored 0.000. That's a broken harness, not a result.

Your baseline scored 0.000? Before you publish the win, here is the checklist I now run, because a zero from a baseline is almost never a result. It i…

rag llm testing ai

RU

RUMBA: a Russian-language benchmark for evaluating long-term memory

Memory has become one of the most in-demand features of dialogue and agent systems. If a user regularly turns to an assistant, whether for work, …

benchmark llm memory rag memory

EN

The Watermelon Effect: How My AI Scored 94% in Testing But Only 22.2% in Real Use

A discovery that changed how I think about AI evaluation and led me to build an open-source testing framework. …

ai llm rag testing

EN

The Silent Vector Contamination Bug: Why Your Concurrent Embeddings Might Be Lying to You

TL;DR: If you run concurrent inference (e.g., via OpenVINO AsyncInferQueue or custom threading) for text/code embeddings, your tests might show 0 exce…

machinelearning rag openvino embeddings

EN

What Michael Vicente’s RAG Project Taught Me About Building Smarter AI

I recently came across a weekend project shared by Michael Vicente, and it changed how I think about building AI-powered applications. Michael built …

ai chatgpt rag software

EN

Your hallucination checker only sees the final paragraph

Your hallucination checker only sees the final paragraph. That’s the bleed. A fluent wrong number often starts earlier: empty retrieval, a swallowed t…

agents ai llm rag

EN

Chat with Your Documents: Building a RAG Pipeline with AWS Blocks

One of the first features users expect from an AI application is deceptively simple: Upload a document. Ask questions. Get accurate answers. Whether y…

aws rag ai agents

EN

RAG for developers who aren't AI engineers: what actually matters

Most non-AI developers have a mental model of RAG that is either wrong or dangerousl…

ai architecture llm rag

EN

I open-sourced a macro execution layer to reduce coding-agent turns (60-task benchmark)

Disclosure: I maintain Tura. A coding agent often spends a separate model turn on each part of a routine workflow: inspect the environment, edit packa…

ai opensource testing rag

EN

Retrieval-Augmented Self-Recall — Part 6: The Fine-Tune That Did Nothing, and Shipping It as an MCP Server

Part 6 (finale) of Retrieval-Augmented Self-Recall. Code: RE-call. Part 5: the gap threshold that didn't transfer. I fine-tuned the embedder on my o…

ai rag mcp machinelearning

EN

Retrieval-Augmented Self-Recall — Part 4: Benchmarking Retrieval and Honesty

Part 4 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 3: the honesty guards. A standard RAG benchmark would have handed my retriever a perf…

ai rag testing programming

EN

Retrieval-Augmented Self-Recall — Part 3: Teaching RAG to Say "I Don't Know"

Part 3 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 2: hybrid retrieval on Postgres. Ask your agent "have we tried this filter on this ma…

ai rag llm programming

EN

Retrieval-Augmented Self-Recall — Part 2: Hybrid RAG on Nothing but Postgres

Part 2 of Retrieval-Augmented Self-Recall. Code: RE-call. Part 1: the self-recall problem. Say "vector search" and the reflex is a dedicated vector …

ai rag postgres database

RU

LLM-wiki vs. RAG: evaluating and comparing

There have already been several good articles about LLM-wiki here (1, 2 and 3), so I won't dwell on Andrej Karpathy's idea in detail. In a nutshell: …

llm llm-agents wiki rag quality evaluation claude-code langchain rag system wilcoxon scores

EN

Genkit Agents API, ORA, Python AI Explainer: New Tools for Workflow Automation

Today's Highlights: This week, Google's Genkit ships a powerful Agents A…

ai rag automation