Build a RAG Model from Scratch

A companion site for this is coming soon; this guide is complete, and there is more to come.
Tech‑Scroll 114: The Parable of Hidden Knowledge
“He who hides knowledge, hides light.
And he who hides light, lets others stumble in the dark.”
Some build walls around knowledge to profit from the confusion of others. Others take down those walls and let the light through. This scroll is for those who believe knowledge should be shared, code should be seen, and that true mastery is found in teaching it forward.
⚙️ Goal
Build, from first principle to functioning system, a Retrieval‑Augmented Generation (RAG) pipeline with its own embedding model, served locally through llama‑server or Ollama, and integrated into a minimal web application (frontend + backend). By the end, we will have:
- Our own trained embedding model (using snowflake-arctic-embed:137m, 274 MB).
- A vector database for document search.
- A tiny language model (qwen3:0.6b, 522 MB) serving RAG queries.
- A full walkthrough: setup → training → integration → UI.
Each section answers: how, why, when, and where not to.
🧩 Section 1 – Understanding What We’re Building
RAG, short for Retrieval‑Augmented Generation, is a system that helps computers remember things before they answer you.
🧒 What it is (so a 10‑year‑old could follow)
Think of RAG as a library helper:
- It first finds the right books (retrieval).
- Then it reads them and writes you a new answer (generation).
It’s like having a smart assistant that checks the library before replying.
💡 Why we use it
Large language models (LLMs) like Qwen or Llama know a lot from their training, but they forget new facts or company data. RAG lets us feed them new knowledge without retraining. We give them context right when they need it.
⚙️ How it works
- We break our texts into small pieces called chunks.
- We turn each chunk into a vector — a long list of numbers that capture meaning. Sentences with similar meaning have numbers that look alike.
- We store these vectors in a vector database.
- When you ask a question, we also turn your question into a vector and find which document vectors are closest.
- We send those matching pieces to a small LLM that writes an answer using both its general knowledge and your data.
🔢 Why vectors work
Computers can’t understand words, but they can compare numbers fast. A vector lets a machine tell when two ideas are close, even if the words differ.
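To make this concrete, here is a tiny sketch with made‑up three‑number vectors (real embeddings have hundreds of dimensions) showing how a computer scores closeness with cosine similarity:
import math
def cosine(a, b):
    # dot product divided by the product of the vector lengths: near 1.0 means "same direction", near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
cat, kitten, bridge = [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.9]
print(cosine(cat, kitten))  # high score: similar meaning
print(cosine(cat, bridge))  # low score: unrelated meaning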
🧠 Why two models
We use two models because they do different jobs:
- Embedding model (Snowflake‑Arctic‑Embed) → turns words into vectors; it’s like a translator between language and math.
- Language model (Qwen3 0.6B) → turns context back into sentences; it’s like a storyteller that reads the facts and writes a reply.
⏱️ When to use each
- Use the embedding model when you’re storing or searching knowledge.
- Use the language model when you’re ready to answer a question.
Put simply: the embedder finds the right pages, the language model explains them in words you understand.
🧱 Section 2 – Environment & Dependencies
🖥️ System Requirements
It really doesn’t matter what computer or operating system you use. The key tools are the same everywhere.
Whether you’re on Windows (perhaps using WSL2), macOS, or Linux (Arch or Alpine preferred), you just need these ingredients:
- Python ≥ 3.11 or Node ≥ 20 for running code and servers.
- Ollama installed (curl -fsSL https://ollama.com/install.sh | sh) to manage local AI models.
- llama‑server (optional alternative runtime for serving models).
- Vector store: sqlite-vss, qdrant, or weaviate (local mode) to hold your embeddings.
If you have an account with Hugging Face, that’s a wonderful starting point to explore and download open models safely, learn about datasets, and share your own experiments.
🧰 Install
Here we explain what each part does so that you’re learning, not just copying.
- Python3 / pip / git / curl / build-base – the essential tools. Python runs scripts; pip installs packages; git lets you fetch source code; curl downloads from the web; build‑base gives compilers for native extensions.
- torch – PyTorch, the math engine that lets our models run and learn.
- transformers – a library by Hugging Face containing pre‑trained AI models and tokenizers.
- sentence‑transformers – utilities to build and fine‑tune embedding models for RAG.
- qdrant‑client – the connector to our vector database; it stores and searches embeddings.
- fastapi / uvicorn – these run our backend web API; FastAPI handles requests, Uvicorn serves them fast.
- langchain – tools to chain together LLMs, embeddings, and logic.
- Node / Vite / React / axios – Node runs JavaScript locally, Vite builds the frontend, React builds the UI, and axios sends requests to the backend.
Now that you know what you’re installing, you’re holding the fishing rod, not being the parrot.
# Alpine / Arch basics
apk add python3 py3-pip git curl build-base
# or
pacman -S python git curl base-devel
# Python deps
pip install torch transformers sentence-transformers qdrant-client fastapi uvicorn langchain
# Node (optional for frontend)
npm create vite@latest rag-ui -- --template react
cd rag-ui && npm install axios
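Once everything is installed, a quick sanity check helps before moving on. This is a small sketch, assuming Ollama is running on its default port 11434:
# check_setup.py - verify the Python packages import and that Ollama is reachable
import importlib, urllib.request
for pkg in ("torch", "transformers", "sentence_transformers", "qdrant_client", "fastapi"):
    importlib.import_module(pkg)
    print(pkg, "OK")
print(urllib.request.urlopen("http://127.0.0.1:11434/api/tags").read()[:200])  # lists the models Ollama has pulled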
🧠 Section 3 – Embedding Model
We use Snowflake‑Arctic‑Embed:137M, a small, open model for text embedding.
What we are doing
Here we are teaching the computer how to understand meaning instead of just memorizing words. When we install this embedding model, we are giving our system the ability to turn any sentence into a pattern of numbers — that’s what we store and compare in the vector database.
Why we are doing it
Language models need context. Embeddings let us measure how close two ideas are in meaning, even if the words differ. This step builds the bridge between plain text and mathematical understanding.
What is Snowflake‑Arctic‑Embed
It’s an open, compact model made by Snowflake for creating embeddings efficiently. It’s small enough to run locally, fast enough for interactive work, and open so we can inspect and learn from it. It’s ideal for:
- Learning how embeddings work.
- Running offline or on modest hardware.
- Rapid testing of RAG systems.
When to use another embedding model
If you need:
- Higher accuracy → try text-embedding-3-large (OpenAI) or E5-Large-V2.
- Multilingual coverage → use LaBSE or distiluse-base-multilingual.
- Domain-specific understanding → fine‑tune a small model with your own text.
There’s no single “best” model — choose based on speed, cost, and quality for your needs. The important lesson is understanding what each does so you can swap or improve it later; that’s the fishing rod we pass on here.
Download
ollama pull snowflake-arctic-embed:137m
Use in Python
Here we use Python to talk to Ollama — but before running this, make sure Ollama is installed. We installed it earlier because it is the program that actually manages and serves our AI models locally. Think of Ollama as the engine that keeps the models running: it is cross‑platform (works on Windows, macOS, and Linux), simple to use, and fast. It lets us pull models, create embeddings, and chat without needing cloud access.
So here, Python is our language of instruction, Ollama is the tool that runs the model, and together they form the hook and line of our fishing setup — connecting your code to the model that thinks.
import ollama
texts = ["RAG integrates retrieval and generation.", "Embeddings map meaning to numbers."]
embeds = [ollama.embeddings(model="snowflake-arctic-embed:137m", prompt=t)["embedding"] for t in texts]
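It helps to peek at what came back. Each embedding is a flat list of floats, and every text embedded with the same model has the same length, which is the vector size we later give the database:
print(len(embeds), "vectors, each with", len(embeds[0]), "numbers")
print(embeds[0][:5])  # the first five numbers of the first vector: meaning, as math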
Why this model?
Think of this as the fishing reel — it lets you pull in results smoothly without heavy gear.
- Small on purpose (274 MB): Embedding models don’t write essays; they turn text into numbers. That job needs fewer parameters than big chat models (which can be many GB). Fewer parameters → smaller file → faster load.
- Runs on CPU-only: Because it’s compact, it fits in normal RAM and can do math fast enough without a GPU. That means it works on almost any laptop or server, including WSL2 on Windows and low‑power boxes.
- Speed vs. quality sweet spot: Big models can be slightly more precise, but this size gives quick results for RAG where retrieval speed matters.
- Cross‑platform via Ollama: One command to pull and run, same API on Linux/macOS/Windows.
- When to not use small: If you need best‑in‑class semantic precision across huge, tricky corpora, choose a larger embedder (e.g., E5‑Large‑V2) or a commercial high‑accuracy model. Small = faster; large = potentially finer nuance.
- Why it isn’t 8 GB: That kind of size is typical for generator LLMs that must predict the next word. Embedders solve a narrower task (map meaning to vectors), so they can be much smaller while still excellent at retrieval.
When not to use it
- For multimodal or large enterprise datasets — use text-embedding-3-large or E5-Large-V2.
🗂️ Section 4 – Building the Vector Store
Before we start writing code, let’s understand what a vector store really is and why we need it.
Think back to our fishing setup: we already have the rod (Python and Ollama), the line (embeddings), the hooks (models), and the reel (small efficient model). Now we need somewhere to keep our gear organized — that’s the vector store.
You can imagine it like a tackle box. Each vector we create (each piece of text turned into numbers) is stored inside with a label (metadata). When you ask a question, the system quickly looks through the tackle box to find which hooks (vectors) best match the bait (your query).
In more technical words, a vector store is a database that holds all the embeddings and lets us search them by similarity instead of exact match. It measures distance between vectors (using cosine or dot product) to find which stored meaning is closest to the one you just asked for.
It’s the part that gives RAG its memory — without it, the system could not recall what it has read.
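To see how that memory is searched, here is a tiny brute‑force sketch (reusing the cosine helper from Section 1); real vector stores do the same ranking, just with clever indexes so it stays fast at scale:
def top_k(query_vec, stored, k=3):
    # stored is a list of (vector, text) pairs; rank by cosine similarity, highest first
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for vec, text in ranked[:k]]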
Example: Qdrant (local)
Before we run this command, let’s explain what Qdrant is and why we’re running it locally.
Qdrant is an open‑source vector database designed for high‑speed similarity search. Think of it as one of the floats and weights from our tackle box — it helps balance and control how the system searches through all the stored vectors. Each time we cast a query (like throwing a fishing line), Qdrant keeps the lines untangled and ensures we pull up the right catch.
We run it locally because it’s lightweight, fast, and requires no external account or Internet access. That means we can test everything offline, learn how the system works, and later move to a remote Qdrant or another vector store if needed.
You can swap Qdrant for other options such as Weaviate, Pinecone, or FAISS, but Qdrant is an excellent starting point for simplicity, transparency, and local development.
qdrant --uri http://localhost:6333
Python code
Here comes our first box of weights to go into the tackle box. These pieces of Python code show how we start storing and using our vectors, how we organise them into collections, and how we finally search through them — we are now walking along the riverbank looking for the place where the fish feed.
- Collections: think of them as different compartments in your tackle box. Each collection can hold vectors of a certain kind — for instance, all product descriptions, all support articles, or all training notes.
- Inserting documents and metadata: this is like putting bait and labels in each compartment. The vector is the bait (it attracts matching questions), and the metadata is the label that tells us what that bait belongs to.
- Searching: when you throw your line, you’re really asking the system to look in the tackle box, measure which bait is closest in meaning to your question, and pull out the best match.
Now, when we read the next three blocks of Python code, we can picture exactly what’s happening and why.
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
collection_name="docs",
vectors_config=VectorParams(size=len(embeds[0]), distance=Distance.COSINE)
)
Insert documents + metadata:
Here the key action is upsert — short for update or insert. It’s like carefully placing a new lure into your tackle box: if it’s already there, we clean and replace it; if it’s new, we add it fresh.
In our code, each point in the upsert command has three parts:
- id – the label or tag for this specific item, so we can find or replace it later.
- vector – the numerical embedding that describes the meaning of the text (our bait).
- payload – extra information or metadata, like the note on the bait packet that says what it catches (in this case, the text itself or a title).
This operation ensures our vector store always stays up‑to‑date: if a document changes, the next upsert refreshes it instead of adding duplicates.
client.upsert(
collection_name="docs",
points=[{"id":i, "vector":embeds[i], "payload":{"text":texts[i]}} for i in range(len(texts))]
)
Search:
Here we query our embeddings to find the best spot where “the fish are feeding.”
- We turn your question into a vector (your bait).
- The vector store measures which stored vectors (our lures) are closest in meaning.
- We take the top‑k closest (e.g., 3–5) as our likely feeding area.
- Optionally, we apply a similarity threshold to ignore weak matches, and re‑rank by metadata (freshness, source, section).
This is how the shiny new tackle works together: question → embedding → nearest vectors → trusted passages we’ll hand to the LLM.
query = ollama.embeddings(model="snowflake-arctic-embed:137m", prompt="What does RAG do?")["embedding"]
result = client.search(collection_name="docs", query_vector=query, limit=3)
print(result)
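If you also want the similarity threshold mentioned above, filter the hits by their score before using them. A small sketch follows; 0.45 is an arbitrary cut‑off to tune against your own data:
good_hits = [hit for hit in result if hit.score >= 0.45]  # drop the boots and weeds
for hit in good_hits:
    print(round(hit.score, 3), hit.payload["text"])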
🤖 Section 5 – Integrating with an LLM (Ollama or llama‑server)
Pull the Qwen model
Before we run this, let’s make sure we understand what we’re pulling here.
This is the part where we bait the hook. Up to now, we’ve prepared everything — rod, line, reel, tackle box — but we haven’t yet attached the part that will speak back to us. The Qwen model is our small, efficient language model that can answer questions and generate responses.
It’s small (only about 522 MB) so it won’t tangle our line or overload our system. That’s why it runs smoothly on a CPU-only setup. We’re not pulling a massive 8 GB deep-sea rig here — this is the light, responsive model perfect for everyday catches, the one we can actually talk to.
ollama pull qwen3:0.6b
Quick sanity check — talk to Qwen without context
Before casting with our full RAG rig, let’s try the bait alone and see that the model answers on its own.
ollama run qwen3:0.6b
You’ll get an interactive prompt. Type a couple of questions (e.g., “What is RAG?”). This proves the reel and bait work by themselves — fast, CPU‑friendly, and cross‑platform. No vectors, no tackle box yet — just Qwen replying.
Now we’ll add the RAG worm 🪱 — the retrieved context — so answers are grounded in your documents.
Query with context
Now our float is in the water — this is the moment we wait and see if the fish bite. We’ve got the full setup ready: rod, line, reel, tackle box, and bait. What we’re doing here is casting out the line with our RAG worm 🪱 attached, waiting for that first tug.
It may sound fancy, but it’s really simple: the model reads the retrieved context (our bait) and then responds based on that information. Many people say building AI like this is expensive or complicated — yet here you are, fishing for knowledge with tools freely available to anyone willing to learn. This is the real catch.
context = "\n".join([hit.payload["text"] for hit in result])
prompt = f"Using the following context, explain what RAG is:\n{context}\n\nAnswer:"
response = ollama.chat(model="qwen3:0.6b", messages=[{"role":"user","content":prompt}])
print(response["message"]["content"])
Why Qwen3 0.6B?
Think of this as the difference between baiting with a RAG worm, a whole squid, or even a livebait — each works, but in its own way.
- The RAG worm 🪱 (our Qwen3 0.6B at 522 MB) is small, quick, and ideal for learning or testing in calm waters. It’s great for local development, offline demos, and low‑power systems. Runs on CPU‑only — fast, nimble, and no GPU required.
- The whole squid 🦑 would be something like llama3.2:1b at 1.3 GB or deepseek‑r1 at 5.2 GB — these give more substance, detail, and richer responses but need a stronger machine.
- The livebait 🐟 represents the giants like gpt‑oss:20b (13 GB) — deep‑sea rigs meant for production workloads and nuanced reasoning. They pull big catches but take power, memory, and patience.
Choosing the right bait is about balance — smaller models mean speed and accessibility, while larger ones mean deeper reasoning and more natural output. The trick is knowing what you’re fishing for.
When not to use it
- Production Q&A with nuanced semantics → use a 7B–14B model instead.
🧾 Section 6 – Backend API (FastAPI Example)
Before we dive into the code, let’s talk about what FastAPI is and why we use it.
FastAPI is the part of our setup that lets everything talk together. It’s like the fishing net that connects our rod, line, and bait to the catch. It takes questions from the frontend (the person holding the rod), passes them to the backend (the tackle and bait), and then returns the answer (the fish!).
Why not call it SlowAPI or FishAPI? Because it’s genuinely fast — built on top of Python’s async engine, it handles many requests at once without waiting around. It’s also easy to understand and works everywhere: Windows, macOS, or Linux.
So, when you run FastAPI, you’re running your very own RAG service that can listen, think, and reply in real time. It’s the communication bridge between your human users and the machine intelligence below.
from fastapi import FastAPI, UploadFile
from qdrant_client import QdrantClient
import ollama, json
app = FastAPI()
client = QdrantClient(host="localhost", port=6333)
@app.post("/ask")
async def ask(question: str):
    qvec = ollama.embeddings(model="snowflake-arctic-embed:137m", prompt=question)["embedding"]
    hits = client.search(collection_name="docs", query_vector=qvec, limit=3)
    ctx = "\n".join([h.payload["text"] for h in hits])
    prompt = f"Context:\n{ctx}\n\nQuestion: {question}\nAnswer:"
    resp = ollama.chat(model="qwen3:0.6b", messages=[{"role":"user","content":prompt}])
    return {"answer": resp["message"]["content"], "context": ctx}
Run:
Before we run this, meet our little Unicorn — actually Uvicorn, a lightning-fast web server built for FastAPI. There’s nothing uni-corny about it; it’s the runner that brings your API to life.
Why use Uvicorn? Because it’s lightweight, asynchronous, and production-ready right out of the box. It listens for web requests and passes them into FastAPI faster than a startled fish darting from a line.
You can even place another web server (like NGINX or Caddy) in front of it when hosting in the wild. For local work, it runs perfectly on ::1 (IPv6 localhost) or 127.0.0.1 (IPv4 localhost) — no cloud needed, no magic required.
uvicorn app:app --host 0.0.0.0 --port 8000
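With the server up, you can poke it from another terminal. Here is a quick sketch using the requests library; the /ask route above takes the question as a query parameter:
import requests
r = requests.post("http://127.0.0.1:8000/ask", params={"question": "What is RAG?"})
print(r.json()["answer"])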
💻 Section 7 – Frontend (React/Vite)
Before diving into the code, let’s explain what Vite is and why we’re using it.
Vite (pronounced “veet,” meaning fast in French) is part of the Node and TypeScript ecosystem. It’s a modern JavaScript build tool and development server that helps us create frontends quickly. Think of it as the gaff — the tool you use to lift the big fish onto the boat once you’ve hooked it.
Why use Vite?
- It’s incredibly fast because it uses native ES module (ESM) imports in the browser during development.
- It provides live reloads, so when you change code, you instantly see updates — no waiting.
- It works seamlessly with React, Vue, Svelte, or vanilla JS.
- It builds lightweight production files that can be served by any web server.
We use Vite here to make our demo frontend quick, responsive, and simple to run anywhere Node works — Windows, macOS, or Linux.
Minimal App.jsx snippet:
import { useState } from 'react';
import axios from 'axios';
export default function App() {
const [q, setQ] = useState('');
const [a, setA] = useState('');
const ask = async () => {
const res = await axios.post('/ask', null, { params: { question: q } });
setA(res.data.answer);
};
return (
<div className="p-8">
<h1 className="text-xl mb-4">Local RAG Demo</h1>
<textarea value={q} onChange={e => setQ(e.target.value)} className="w-full h-24 border" />
<button onClick={ask} className="mt-2 p-2 bg-blue-500 text-white">Ask</button>
<pre className="mt-4 p-2 bg-gray-100">{a}</pre>
</div>
);
}
🔁 Section 8 – Training & Extending Your Own Embeddings
Sometimes, if you don’t go fishing as often as you’d like — work, family, or life getting in the way — it’s easy to forget how to cast properly. In machine learning, that’s the same as forgetting how to tune your model. You might get poor results, no results, or simply stop trying altogether.
Here we learn to practice, to fine‑tune. The more we train, the better we get at knowing which bait to use and where to cast. In model terms, we teach our embedder to understand our own domain — to recognise our pond, our fish, and our language.
To fine‑tune a domain‑specific embedder:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer('snowflake-arctic-embed-137m')  # substitute the matching Hugging Face model id for your arctic-embed checkpoint
train_examples = [InputExample(texts=['RAG combines retrieval', 'RAG merges retrieval and generation'], label=1.0)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save('models/custom-embedder')
Reload and re‑embed documents with your new model.
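Reloading is a one‑liner with sentence‑transformers. This sketch shows the saved embedder producing fresh vectors, which you would then upsert into Qdrant exactly as before:
from sentence_transformers import SentenceTransformer
tuned = SentenceTransformer('models/custom-embedder')
new_vecs = tuned.encode(["RAG integrates retrieval and generation."])
print(new_vecs.shape)  # (1, vector_size)
Remember: questions must be embedded with the same model as the stored documents, or the vectors will not line up.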
📏 Section 8a – Chunking Explained (Slow Down & See)
Before we continue, let’s pause and clearly explain chunking, because it’s one of the most important parts of RAG and easy to misunderstand.
- Why we chunk text: Large models can’t take unlimited input. If you give them a whole book at once, they lose detail or simply refuse it. By breaking text into smaller pieces (chunks), we keep details intact and retrievable. Think of it as slicing bread: easier to chew and easier to serve.
- How to choose chunk size: In practice, chunks are usually 300–1000 tokens (roughly a few paragraphs). If chunks are too large, retrieval is sloppy (like casting with a net too wide). If chunks are too small, context is fragmented (like crumbs instead of slices). Adding a small overlap (e.g., 50–150 tokens) ensures we don’t cut off sentences mid‑thought.
- Simple example:
- Full text: “RAG combines retrieval and generation to make answers more accurate.”
- Chunking into two parts with overlap: [“RAG combines retrieval and generation”], [“generation to make answers more accurate”]. Now, a question about “retrieval” still matches the first chunk, and a question about “accurate answers” matches the second.
So, chunking is not just a code trick — it’s the balance between giving the model context and keeping details sharp. Get it right, and retrieval feels natural; get it wrong, and results drift.
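Here is a minimal character‑based splitter with overlap, as a sketch; production splitters usually count tokens and respect sentence boundaries:
def chunk_text(text, size=500, overlap=100):
    chunks, start = [], 0
    while start < len(text):
        end = min(len(text), start + size)
        chunks.append(text[start:end])
        if end >= len(text):
            break  # reached the end; stop before the overlap walks us backwards
        start = end - overlap
    return chunks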
🌉 Section 9 – Common Pitfalls
Every fisherman learns that using the wrong bait for the wrong fish leads to disappointment. The same is true in RAG — use the wrong model, the wrong chunk size, or ignore the right settings, and you’ll either catch nothing or get bitten by your own mistakes. Sometimes you even hook a shark 🦈 — powerful but dangerous if you don’t know how to handle it.
- Embedding mismatches – like using bait for trout while trying to catch tuna; models with different dimensions or versions will not fit together.
- Too large chunks → poor retrieval – big bait might scare small fish; split your text sensibly. Think of it like using a soil worm in the sea — it won’t survive the salt, and the fish won’t bite. Oversized chunks of text drown the fine detail the model needs to match meaning precisely. If you feed it too much, it stops listening; if you feed it too little, it starves. Find balance in your chunk size: enough to hold context, but light enough to stay alive in the saltwater of retrieval.
- No metadata → ambiguous context – no labels means you won’t remember which pond you were fishing in, and it’s like forgetting to wrap the bait on the hook; it falls off before it even hits the sea. Metadata keeps your bait attached — the data stays connected to its meaning so your model knows what it’s reeling in.
- LLM prompt too short or long – too little and it’s vague; too long and you tangle the line.
When we talk to the model, our words are the conversation’s rhythm — like telling a fishing tale around the campfire. If we mutter a few words, the story makes no sense; if we ramble for hours, we lose the listener. The same with a model: give it just enough to understand, not so little it’s lost or so much it’s overwhelmed.
Here’s how not to talk: “Fish?” (too short, no context).
Better: “Using the pond map and today’s weather, where are trout likely feeding?” (clear, directed, contextual).
That’s why our prompts matter — they are bedtime stories for the machine’s memory, reminders of our own fishing trials so it can retell them properly.
- Missing similarity_threshold in search – without limits, you’ll catch old boots and weeds with your fish. Imagine casting too close to shore with the wrong worm — you’ll pull in seaweed, tin cans, and the occasional boot instead of trout. The similarity_threshold sets a minimum closeness score between your question’s vector and stored vectors.
If the similarity is high (close to 1.0 in cosine similarity), the meanings are nearly identical — the fish is biting. If it’s low (say 0.3 or 0.4), the texts are unrelated — like dangling the wrong bait in the wrong water.
High similarity → 🎯 relevant text (fish!)
Medium similarity → 🤔 maybe relevant, check manually
Low similarity → 🪣 weeds and boots
Tuning this threshold ensures we don’t waste time hauling seaweed — only good, contextually matched content.
🧭 Section 10 – Deployment & Next Steps
- Containerize FastAPI
Here we mention mounting /data for Qdrant persistence — that’s a sign we’re talking about Docker or containerized deployment. While we often avoid Docker when a simple ZFS dataset or loopback mount works better, Docker still has its place. It’s like a well-made cooler box: keeps everything sealed and portable, though sometimes heavier than needed.
Some will prefer to mount /data directly from ZFS for simplicity, others might use Docker volumes, and a few will wrap it in Kubernetes or another orchestrator. There’s no single right way — just many ways to prepare your bait. Some use Elastic for storage, some keep it simple, and others go full gourmet with crab, squid, and bluey cocktails all wrapped up.
Here’s a simple Docker example:
FROM python:3.12-alpine
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
And a minimal NGINX stanza to serve the frontend:
server {
listen 80;
server_name _;
location / {
root /var/www/html;
index index.html;
}
location /ask {
proxy_pass http://127.0.0.1:8000;
}
}
Each fisherman finds their own way to keep bait fresh — Docker, ZFS, or loopback file — all valid, all part of learning the craft (FROM python:3.12-alpine).
- Mount /data for Qdrant persistence.
- Expose port 8000.
- Serve frontend statically via NGINX or Vite build output.
Future expansions (with quick how‑tos):
- Integrate monitoring (Prometheus + Grafana)
Why: Monitoring helps you see how your system behaves under real use—how fast, how often, and how reliably. Latency, hit rate, and error counts reveal if your fishing line is taut or tangled.
How: Add a Prometheus client library to expose runtime metrics at /metrics. Prometheus will regularly scrape that endpoint and record values. Grafana can then visualize them — dashboards that show requests per second, vector‑search durations, or response times.
Example: track counters for /upload and /ask endpoints, histogram buckets for vector search latency, and duration metrics for LLM calls.
Optional: containerize Prometheus and Grafana for local use or small setups; no need for heavy cloud setups. It’s like noting tide, weather, and time for each fishing trip—data that guides improvement without overcomplication.
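As a sketch of what that looks like in the FastAPI backend (assuming the prometheus-client package; metric names like rag_ask_total are illustrative):
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from fastapi import Response
ASK_TOTAL = Counter("rag_ask_total", "Number of /ask requests")
SEARCH_SECONDS = Histogram("rag_vector_search_seconds", "Vector search latency in seconds")
@app.get("/metrics")
def metrics():
    # Prometheus scrapes this endpoint on a schedule and stores the samples
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
# inside /ask: call ASK_TOTAL.inc(), and wrap the search in `with SEARCH_SECONDS.time(): ...`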
Add caching layer for repeated queries
Why: speed + lower cost.
How (simple TTL cache):
from time import time
CACHE, TTL = {}, 120
def get_cache(key):
    v = CACHE.get(key)
    return v[0] if v and time() - v[1] < TTL else None
def set_cache(key, val): CACHE[key] = (val, time())
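Wiring it into the /ask route is then two lines: check before doing the expensive work, store after. A sketch, where question and answer are the variables from the /ask route above:
cached = get_cache(question)
if cached:
    return {"answer": cached, "cached": True}
# ... run retrieval + the LLM as usual, producing `answer` ...
set_cache(question, answer)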
Enable document uploads + automatic re‑embedding
Why: keep the tackle box fresh when new files arrive.
How (FastAPI):
from fastapi import UploadFile, File, Depends
import uuid
@app.post('/upload')
async def upload(file: UploadFile = File(...), token: str = Depends(require_bearer)):
    text = (await file.read()).decode('utf-8', errors='ignore')
    chunks = split_into_chunks(text, max_tokens=300)  # implement splitter
    for i, chunk in enumerate(chunks):
        vec = ollama.embeddings(model="snowflake-arctic-embed:137m", prompt=chunk)["embedding"]
        # Qdrant point ids must be unsigned ints or UUIDs, so derive a stable UUID from filename + chunk index
        pid = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{file.filename}:{i}"))
        client.upsert(collection_name="docs", points=[{"id": pid, "vector": vec, "payload": {"text": chunk, "src": file.filename}}])
    return {"status": "ok", "chunks": len(chunks)}
How (Frontend form):
const fd = new FormData(); fd.append('file', fileInput.files[0]);
fetch('/upload', { method: 'POST', headers: { Authorization: `Bearer ${token}` }, body: fd });
Add user authentication
Why: protect your API.
How (FastAPI):
from fastapi import Header, HTTPException, Depends
def require_bearer(auth: str | None = Header(None, alias="Authorization")):
    if not auth or not auth.startswith("Bearer "):
        raise HTTPException(status_code=401, detail="Missing token")
    token = auth.split(" ",1)[1]
    # TODO: verify token (HMAC/JWT/Keycloak)
    return token
@app.post("/ask")
async def ask(question: str, token: str = Depends(require_bearer)):
    ...
How (Node client header):
// axios
axios.post('/ask', null, { params: { question: q }, headers: { Authorization: `Bearer ${token}` }});
🎣 Section 11 – Gratitude & Invitation
Thank you for reading this far and joining in the story. If any step leaves you puzzled, reach out and ask — we welcome questions and ideas. You can find and contact BreathTechnology, AKADATA LIMITED, or the Director through the website.
If DIY leaves your belly empty and no fish in the pan, we offer DIFY – Do It For You services at a fair cost. Yet our first calling is always to give before we take, to serve before we ask for service. The tools shared here are meant to empower you — to learn, to build, to share. May your nets be full and your mind brighter for what was learned here.
🧰 Appendix A — Complete Python Backend (Auth, Upload, Train)
Below is a single‑file FastAPI app that you can run as‑is. It includes:
- Bearer Authorization header validation
- /ask (RAG answer), /upload (file → chunks → embeddings → Qdrant), /health
- Minimal training example (fine‑tunes an embedder with sentence‑pairs)
Requires: fastapi uvicorn qdrant-client sentence-transformers torch requests python-dotenv, and Ollama running with the models pulled: snowflake-arctic-embed:137m, qwen3:0.6b.
# app.py
import os, io, json, math, uuid
from typing import List, Optional
from fastapi import FastAPI, UploadFile, File, Depends, Header, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance
import requests
OLLAMA = os.getenv("OLLAMA_URL", "http://127.0.0.1:11434")
QDRANT_HOST = os.getenv("QDRANT_HOST", "127.0.0.1")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
COLLECTION = os.getenv("QDRANT_COLLECTION", "docs")
AUTH_TOKEN = os.getenv("AUTH_TOKEN", "dev-token")
EMBED_MODEL = os.getenv("EMBED_MODEL", "snowflake-arctic-embed:137m")
LLM_MODEL = os.getenv("LLM_MODEL", "qwen3:0.6b")
TOP_K = int(os.getenv("TOP_K", "4"))
SIM_THRESHOLD = float(os.getenv("SIM_THRESHOLD", "0.45"))
app = FastAPI(title="RAG Demo (Python)")
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_methods=["*"],
allow_headers=["*"]
)
# --- Auth ---
def require_bearer(auth: Optional[str] = Header(None, alias="Authorization")):
    if not auth or not auth.startswith("Bearer "):
        raise HTTPException(401, "Missing token")
    token = auth.split(" ", 1)[1]
    if token != AUTH_TOKEN:
        raise HTTPException(403, "Bad token")
    return token
# --- Utilities ---
def embed_text(text: str) -> List[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings", json={"model": EMBED_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]
# simple splitter by characters (keep it readable)
def split_into_chunks(text: str, max_chars: int = 1200, overlap: int = 150) -> List[str]:
    parts, i = [], 0
    while i < len(text):
        j = min(len(text), i + max_chars)
        parts.append(text[i:j])
        if j >= len(text):
            break  # reached the end; stepping back by the overlap here would loop forever
        i = j - overlap
        if i < 0:
            i = 0
    return [p.strip() for p in parts if p.strip()]
# --- Vector store ---
client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
# lazy-create collection with the right vector size from a small probe
def ensure_collection(vec_size: int):
    try:
        client.get_collection(COLLECTION)
    except Exception:
        client.recreate_collection(
            collection_name=COLLECTION,
            vectors_config=VectorParams(size=vec_size, distance=Distance.COSINE)
        )
# --- Schemas ---
class AskRequest(BaseModel):
    question: str
class TrainPair(BaseModel):
    a: str
    b: str
    label: float = 1.0  # 1.0 similar, 0.0 dissimilar
class TrainRequest(BaseModel):
    pairs: List[TrainPair]
# --- Routes ---
@app.get("/health")
def health():
return {"ok": True}
@app.post("/upload")
async def upload(file: UploadFile = File(...), token: str = Depends(require_bearer)):
raw = (await file.read()).decode("utf-8", errors="ignore")
chunks = split_into_chunks(raw)
# probe vector size
vec0 = embed_text(chunks[0] if chunks else "sample")
ensure_collection(len(vec0))
points = []
for i, ch in enumerate(chunks):
v = embed_text(ch)
points.append({"id": f"{file.filename}:{i}", "vector": v, "payload": {"text": ch, "src": file.filename}})
if points:
client.upsert(collection_name=COLLECTION, points=points)
return {"status": "ok", "chunks": len(chunks)}
@app.post("/ask")
def ask(body: AskRequest, token: str = Depends(require_bearer)):
q = body.question.strip()
if not q:
raise HTTPException(400, "Empty question")
qvec = embed_text(q)
ensure_collection(len(qvec))
hits = client.search(collection_name=COLLECTION, query_vector=qvec, limit=TOP_K)
# filter by similarity threshold (cosine distance → score is higher when closer in Qdrant)
ctx_snips = []
for h in hits:
# Qdrant returns distance/score depending on API; use payload + limit here
txt = h.payload.get("text", "") if h.payload else ""
if txt:
ctx_snips.append(txt)
context = "
".join(ctx_snips)
prompt = (
"You are a precise assistant. Use ONLY the context to answer.
" \
+ f"Context:
{context}
Question: {q}
Answer:"
)
r = requests.post(f"{OLLAMA}/api/chat", json={
"model": LLM_MODEL,
"messages": [{"role": "user", "content": prompt}],
"stream": False
})
r.raise_for_status()
content = r.json()["message"]["content"]
return {"answer": content, "context": ctx_snips}
# --- Minimal training endpoint (toy fine-tune) ---
# In practice you would run a separate job and version the model. Shown here for completeness.
@app.post("/train")
def train(req: TrainRequest, token: str = Depends(require_bearer)):
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model_name = os.getenv("BASE_EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
model = SentenceTransformer(model_name)
examples = [InputExample(texts=[p.a, p.b], label=p.label) for p in req.pairs]
loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
out_dir = os.getenv("FT_OUT", "models/custom-embedder")
model.save(out_dir)
return {"status": "trained", "path": out_dir, "pairs": len(req.pairs)}
To run the app, we then type:
# Run: uvicorn app:app --host 0.0.0.0 --port 8000
🧰 Appendix B — Complete Node Backend (Auth, Upload, Progress, Report)
This Express server mirrors the Python app and shows SSE progress during uploads.
Requires: express multer axios cors dotenv (and optionally node-fetch if on older Node). Ensure Ollama + Qdrant are running.
// server.js
import 'dotenv/config'
import express from 'express'
import multer from 'multer'
import axios from 'axios'
import cors from 'cors'
import fs from 'fs'
const app = express()
app.use(cors())
app.use(express.json())
const OLLAMA = process.env.OLLAMA_URL || 'http://127.0.0.1:11434'
const QDRANT = process.env.QDRANT_URL || 'http://127.0.0.1:6333'
const COLLECTION = process.env.QDRANT_COLLECTION || 'docs'
const TOKEN = process.env.AUTH_TOKEN || 'dev-token'
const EMBED_MODEL = process.env.EMBED_MODEL || 'snowflake-arctic-embed:137m'
const LLM_MODEL = process.env.LLM_MODEL || 'qwen3:0.6b'
// --- auth middleware ---
function auth(req, res, next){
const h = req.headers['authorization'] || ''
if(!h.startsWith('Bearer ')) return res.status(401).json({error:'Missing token'})
const t = h.split(' ')[1]
if(t !== TOKEN) return res.status(403).json({error:'Bad token'})
next()
}
// --- helpers ---
async function embed(text){
const r = await axios.post(`${OLLAMA}/api/embeddings`, { model: EMBED_MODEL, prompt: text })
return r.data.embedding
}
async function ensureCollection(vecSize){
try {
await axios.get(`${QDRANT}/collections/${COLLECTION}`)
} catch(e){
await axios.put(`${QDRANT}/collections/${COLLECTION}`, {
vectors: { size: vecSize, distance: 'Cosine' }
})
}
}
async function upsertPoints(points){
await axios.put(`${QDRANT}/collections/${COLLECTION}/points?wait=true`, {
points
})
}
async function search(qvec, topK=4){
const r = await axios.post(`${QDRANT}/collections/${COLLECTION}/points/search`, {
vector: qvec, limit: topK
})
return r.data
}
function splitIntoChunks(s, max=1200, overlap=150){
const out=[]; for(let i=0;i<s.length;i+= (max-overlap)) out.push(s.slice(i, i+max).trim())
return out.filter(Boolean)
}
// --- routes ---
app.get('/health', (req,res)=> res.json({ok:true}))
const upload = multer()
app.post('/upload', auth, upload.single('file'), async (req,res)=>{
const buf = req.file.buffer
const text = buf.toString('utf8')
const chunks = splitIntoChunks(text)
const v0 = await embed(chunks[0] || 'sample')
await ensureCollection(v0.length)
const points = []
for(let i=0;i<chunks.length;i++){
const v = await embed(chunks[i])
// Qdrant point ids must be unsigned ints or UUIDs; Node ≥ 20 has crypto.randomUUID() built in
points.push({ id: crypto.randomUUID(), vector: v, payload: { text: chunks[i], src: req.file.originalname } })
}
if(points.length) await upsertPoints(points)
res.json({status:'ok', chunks: chunks.length})
})
app.post('/ask', auth, async (req,res)=>{
const q = (req.body.question||'').trim()
if(!q) return res.status(400).json({error:'Empty question'})
const qvec = await embed(q)
await ensureCollection(qvec.length)
const hits = await search(qvec, 4)
const ctx = (hits?.result||[]).map(h=>h.payload?.text).filter(Boolean).join('\n')
const prompt = `Use ONLY the context to answer.
Context:
${ctx}
Question: ${q}
Answer:`
const r = await axios.post(`${OLLAMA}/api/chat`, { model: LLM_MODEL, messages: [{role:'user', content: prompt}], stream:false })
res.json({ answer: r.data.message.content, context: ctx })
})
app.get('/report', auth, async (req,res)=>{
// simple collection info
const info = await axios.get(`${QDRANT}/collections/${COLLECTION}`)
res.json(info.data)
})
const PORT = process.env.PORT || 8000
app.listen(PORT, ()=> console.log(`Node RAG listening on :${PORT}`))
🧪 Appendix C — End‑to‑End Commands (Alpine/Arch)
These steps assume Ollama, Qdrant, and either Python or Node backend on the same host.
1) Pull models (once)
ollama pull snowflake-arctic-embed:137m
ollama pull qwen3:0.6b
2) Start Qdrant locally
qdrant --uri http://127.0.0.1:6333
# or docker run -p 6333:6333 -v $PWD/qdrant:/qdrant qdrant/qdrant
3) Run the Python API
export AUTH_TOKEN=dev-token
uvicorn app:app --host 0.0.0.0 --port 8000
4) Or run the Node API
export AUTH_TOKEN=dev-token
node server.js
5) Upload a document (adds chunks → embeddings)
curl -H "Authorization: Bearer dev-token" -F file=@README.md http://127.0.0.1:8000/upload
6) Ask a question (RAG)
curl -H "Authorization: Bearer dev-token" -H 'Content-Type: application/json' \
-d '{"question":"Summarise the README"}' http://127.0.0.1:8000/ask
7) Optional: simple NGINX front for UI + API
server {
listen 80;
server_name _;
location / { root /var/www/html; index index.html; }
location /ask { proxy_pass http://127.0.0.1:8000; }
location /upload { proxy_pass http://127.0.0.1:8000; }
}
That’s the whole rod, reel, and tackle box in three parts: Python, Node, and the run‑book. Copy, paste, and then scroll back up to understand why each part exists. The RAG worm 🪱 is ready — may your lines stay tight and your pan sizzle with fresh catch.