Dual Encoder vs Cross-Encoder: Why Your RAG Pipeline Needs Both
My RAG pipeline looked fine on paper. Fast retrieval. Decent cosine scores. But when I tested it with real queries, the top results were always a litt…
The Problem: I wanted to monitor discussions around products, bugs and trends across communities. Examples: Reddit, Hacker News, GitHub Issues, Forums. Mos…
Spam detection datasets are surprisingly bad once you move outside English. Most public datasets are: tiny, outdated, English-only, SMS-only, or missi…
In the previous post, we saw what chunking is and the various methodologies of chunking. In this post, we are going to see the next stage of the RAG pi…
This article was written by Erik Hatcher. This is the third and final article of this hybrid search series. First, we surveyed the (hybrid) search la…
Semantic Chunking: Let's consider two paragraphs, A and B, both focusing on strings in Python. Paragraph A focuses on typecasting and paragraph B focuses on accessing char…
The Foundation of Modern AI Systems When people think of tools like ChatGPT, they often assume the intelligence comes from a single powerful system th…
Most “what watch should I buy?” discussions online skew heavily male. A friend wanted to launch a women’s watch, so I helped with a small data analysi…
TL;DR: Since version 23.0.0, Manticore can make searches like xt850 match xt 850 using bigram_delimiter together with digit-aware bigram_index modes. …
In Week 11 Tenacious-Bench, we trained a LoRA adapter on Tenacious-style B2B sales emails using Supervised Fine-Tuning (SFT). We got a real performanc…
Furigana are the small hiragana annotations that sit above kanji to show how they should be read. Schoolbooks, kid manga, and language-learning materi…
Background: I did some research online and found a nice course that teaches how to build an LLM from scratch. The course is shared publicly online and all the…
Most text analysis solutions fall into one of two problems: Too expensive: the OpenAI API costs money for every call. Too complex: Hosting your own Huggi…
This is my Day 2 of learning AI fundamentals where I will be covering the following concepts: Vector Embeddings How Tokenisation and Vector Embeddings…