How to Build a Node‑based RAG System

Tech‑Scroll 116
Proverb: “Better to show the path than to speak of the mountain.”
Too many posts tell you what RAG is without ever showing you how to build it. Here we walk the path end‑to‑end.
Before diving into commands and code, let’s set the scene. Retrieval‑Augmented Generation (RAG) is a way of giving a large language model live access to your own knowledge base. Instead of relying on what the model learned during its original training, you feed it the exact documents you want it to use and it retrieves only the most relevant passages when you ask a question.
Why bother? Because this keeps answers accurate, grounded and up‑to‑date while dramatically reducing the tokens (and cost) you would otherwise burn by pasting whole documents into every prompt.
In this scroll we will build a complete Node.js implementation: we will set up the environment, create a vector index of documents, show how to retrieve context for any question, and finally wire it to a local Ollama model so you can ask questions of your own data. This is practical, reproducible and ready to adapt for production use.
1. Install and prepare Node
Make sure you have Node 18+ and npm installed. Using nvm:
nvm install 18
nvm use 18
node -v # confirm version
Create a new project:
mkdir node-rag-demo && cd node-rag-demo
npm init -y
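One detail the later scripts rely on: ingest.js and query.js use ES‑module import syntax and top‑level await, so mark the project as an ES module (or name those files with a .mjs extension). A one‑liner does it:
npm pkg set type=module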
2. Add the key dependencies
For a minimal Retrieval‑Augmented Generation pipeline:
- langchain – the framework that stitches everything together. It provides loaders to pull text from PDFs/HTML/CSV, split and clean it, create embeddings, store them and run retrieval + generation chains.
- openai – the client library to call an OpenAI LLM API (or a compatible endpoint). It handles authentication, sending prompts and receiving generated answers.
- @pinecone-database/pinecone – the official Node client for the Pinecone managed vector database. It lets you create and query a vector index to store and search document embeddings. (Any other vector DB driver can be swapped in if you prefer Weaviate, Milvus, pgvector, etc.)
- dotenv – a tiny utility to load environment variables from a local .env file into process.env, so API keys and configuration are kept out of source code and version control.
npm install langchain openai @pinecone-database/pinecone dotenv
Create a .env file and insert your own keys (never hard‑code them).
This file is a simple text list of environment variables that your Node app can read at runtime via process.env. It keeps secrets (API tokens, host URLs, model names) out of source code and version control.
In this guide the variables are named after the OpenAI and Pinecone clients because the code below expects those names. They do not have to point to OpenAI’s own cloud—you can use any API that speaks the same protocol. For example, OPENAI_API_KEY might be the key for a locally hosted OpenAI‑compatible server such as Ollama or another provider.
The PINECONE_ENV (region) is required by the Pinecone client even if you are testing locally; it selects the logical index region for Pinecone’s service. If you are not using Pinecone’s cloud you can still set a dummy region to satisfy the client library.
Why use a .env file at all? Because it:
- keeps credentials out of git and out of the codebase,
- allows different keys/config for dev, UAT and production without code changes,
- follows the 12‑factor‑app convention so tools like Docker or CI/CD pipelines can inject the right values automatically:
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=...
PINECONE_ENV=us-west1-gcp
3. Ingest and index documents
Create ingest.js to split a document into chunks, embed and store it.
Before you paste the code, here’s what’s happening and why:
- Imports – we bring in helper classes from langchain (for splitting text and creating embeddings), the PineconeStore (to write vectors into the database), our own pineconeClient.js (which connects to Pinecone), and Node’s built‑in fs module (to read files).
- const vs let – we use const because these variables are not reassigned; they point to objects or functions that stay constant. It is safer and communicates intent clearly.
- Reading the document – fs.readFileSync loads the raw text we want to index.
- Splitting – RecursiveCharacterTextSplitter breaks the text into overlapping chunks so the embeddings capture context without exceeding token limits.
- Embedding & storing – OpenAIEmbeddings converts each chunk into a vector; PineconeStore.fromDocuments writes those vectors to the Pinecone index returned by pinecone.Index(...). That index is the database table of vectors we will later search.
- await – each call that returns a promise (splitting the text, embedding and writing to Pinecone) is awaited; await pauses execution until the promise resolves, so we know indexing finished before logging.
Now the code:
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";
import { pinecone } from "./pineconeClient.js";
import fs from "fs";

// Read the source document and split it into overlapping chunks
const rawText = fs.readFileSync("./docs/example.txt", "utf8");
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
const docs = await splitter.createDocuments([rawText]);

// Embed each chunk and upsert the vectors into the Pinecone index
const embeddings = new OpenAIEmbeddings();
await PineconeStore.fromDocuments(docs, embeddings, {
pineconeIndex: pinecone.Index("node-rag-demo")
});
console.log("Documents indexed!");
Run:
node ingest.js
After you execute this command you should see the console message:
Documents indexed!
This confirms the documents have been embedded and stored in the vector database and your index is ready for retrieval.
4. Retrieval and Generation
Create query.js to retrieve relevant chunks and generate an answer.
Before pasting the code, a quick what–how–why so you learn more than just copy‑paste:
- Purpose: This script asks the vector database for the most relevant document chunks and then sends those chunks (the context) to the language model to get a grounded answer.
- Imports: We bring in the OpenAI LLM client, the same OpenAIEmbeddings used at indexing (vectors must be in the same space), and the PineconeStore to query the stored vectors. We also import our pinecone client to point to the index.
- const store – an object representing our Pinecone vector index opened for reading. It is const because we are not reassigning the variable; it simply holds the connection.
- const llm – the language model interface; also constant for the same reason.
- const question – grabs the question from the command line (process.argv[2]) or uses a default. Again not reassigned.
- await store.similaritySearch(...) – performs an approximate nearest‑neighbour search and returns an array of the top‑k matching chunks. This is the results set: each element includes pageContent (the text chunk) and metadata.
- const context – joins the text of those chunks into a single string. This is the context you supply to the LLM so it can answer with facts from your own data.
- await llm.call(...) – sends the combined question and context to the model and waits for the generated answer.
- console.log – prints the final answer to your terminal.
Now the code:
import { OpenAI } from "langchain/llms/openai";
import { OpenAIEmbeddings } from "langchain/embeddings/openai";
import { PineconeStore } from "langchain/vectorstores/pinecone";
import { pinecone } from "./pineconeClient.js";

// Open the existing index with the same embedding model used at ingest time
const store = await PineconeStore.fromExistingIndex(
new OpenAIEmbeddings(),
{ pineconeIndex: pinecone.Index("node-rag-demo") }
);

const llm = new OpenAI({ temperature: 0 });
const question = process.argv[2] || "Summarise the document";

// Retrieve the top 3 most similar chunks and build the context string
const results = await store.similaritySearch(question, 3);
const context = results.map(r => r.pageContent).join("\n");

const answer = await llm.call(`Answer the question using this context:\n${context}\n\nQuestion: ${question}`);
console.log("Answer:", answer);
Run a query:
node query.js "What is the key message of the text?"
5. What you have built
- Indexing – the pipeline that turns raw content into searchable vectors:
- Normalise & parse: extract text (PDF/HTML/CSV/HTML), strip boilerplate, detect language/encoding.
- Chunk: split into overlapping chunks (e.g., 800–1,200 tokens with 10–20% overlap) so answers can be grounded in small units.
- Enrich metadata: doc ID, source URL, title, section/page, timestamp, ACL/tenant, tags, checksum — stored alongside each chunk.
- Embed: map each chunk to a high‑dimensional vector using an embedding model (local or API). Record embedding model/version for future re‑index.
- Write to vector store: upsert vectors + metadata into your index (Pinecone/Weaviate/Milvus/pgvector) using stable primary keys.
- Auxiliary indexes: optional BM25/sparse index for hybrid search; keep token counts for cost/latency tuning.
- Quality gates: deduplicate near‑identical chunks (cosine > 0.95; a minimal sketch follows these lists), drop low‑signal/too‑short chunks, compute checksums, and log coverage stats.
- Version & lifecycle: record dataset hash, chunking params, tokenizer, and retention/TTL; support soft‑delete and rollbacks.
- Retrieval – step‑by‑step at query time:
- Normalise the question: lowercase/trim, strip boilerplate, detect language; optionally expand synonyms and spell‑correct. Decide which index/tenant to hit.
- Embed the query: turn the question into a vector using the same embedding model/version used at indexing to keep spaces compatible.
- Candidate search (ANN/k‑NN): ask the vector store for the top‑k nearest neighbours by cosine similarity/dot‑product. Set k (e.g., 20–50) high enough to avoid missing good chunks.
- Apply filters: enforce metadata constraints (doc IDs, time windows, tags, ACL/tenant). Security‑sensitive systems filter before ranking to avoid leaking.
- Hybrid scoring (optional): combine dense scores with sparse (BM25/keyword) to favour exact term matches and recent material (see the sketch after these lists).
- Re‑rank (optional): use a cross‑encoder/re‑ranker to reorder the top candidates for semantic quality and answerability.
- Deduplicate & diversify: drop near‑duplicates, cap per‑document results, and keep coverage across sources to avoid echoing one page.
- Assemble context window: pack the highest‑ranked chunks into a prompt up to a token budget (e.g., 1.5–4k tokens), preserving chunk boundaries and adding separators.
- Attach citations: carry forward source metadata (title, URL, page/section) so the LLM can cite where each snippet came from.
- Cache & observe: cache frequent queries/results and log retrieval metrics (latency, top‑k scores, hit rate) for tuning.
Why: Retrieval limits the LLM’s input to the most relevant, permission‑appropriate facts, which (a) reduces hallucination, (b) improves answer quality, (c) lowers token cost/latency, and (d) enables traceable citations and access‑control. Using the same embedding model for query and index preserves vector geometry; filters and re‑rank strike a practical precision/recall balance.
- Generation – the final stage where the retrieved chunks are given to the LLM to create an answer:
- Prompt construction: merge the question with the retrieved context into a single prompt template (often with system instructions and formatting for citations or references).
- Model call: send the prompt to the chosen LLM (OpenAI, local model, etc.) with parameters such as temperature, max tokens and stop sequences.
- Grounded reasoning: the LLM uses the retrieved context to produce an answer that stays tied to the source documents, reducing hallucinations.
- Post‑processing: clean or re‑format the output, add citations or highlights from the metadata and handle any guardrails (for example content filtering or length trimming).
- Feedback loop: optionally log the query, retrieved context and final output to evaluate quality and feed future improvements.
Why: Generation is where the system turns raw retrieval results into a coherent, human‑readable response while keeping it anchored to verified data.
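To make two of the optional steps above concrete, near‑duplicate filtering at indexing time and hybrid scoring at query time, here is a minimal TypeScript sketch. The Chunk/ScoredChunk shapes, the alpha weight and the 0.95 threshold are illustrative choices following the lists above, not tied to any particular library:
type Chunk = { id: string; text: string; vector: number[] };
type ScoredChunk = Chunk & { dense: number; sparse: number };

// Cosine similarity between two embedding vectors
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Quality gate: drop chunks whose embedding is almost identical to one already kept
function dedupe(chunks: Chunk[], threshold = 0.95): Chunk[] {
  const kept: Chunk[] = [];
  for (const c of chunks) {
    if (!kept.some(k => cosine(k.vector, c.vector) > threshold)) kept.push(c);
  }
  return kept;
}

// Hybrid scoring: blend dense (vector) and sparse (BM25/keyword) scores, highest first
function hybridRank(hits: ScoredChunk[], alpha = 0.7): ScoredChunk[] {
  return [...hits].sort((x, y) =>
    (alpha * y.dense + (1 - alpha) * y.sparse) - (alpha * x.dense + (1 - alpha) * x.sparse));
}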
This is the heart of Retrieval‑Augmented Generation: showing how to build it, not just talking about it. In the next section we extend this into a real‑world Node webapp that talks to a local Ollama model and queries the RAG database. We will lay out the folder structure, required npm installs, how to run in dev on localhost:port, how to build a production dist/ folder, and how to serve it behind nginx for a UAT environment. This illustrates why localhost:port is a developer sandbox, why an nginx reverse‑proxy is typical for UAT, and why the static dist/ output represents a production build ready for deployment.
With this scaffold you can swap vector stores (e.g. Weaviate, Milvus) or LLM providers. The core pattern remains the same—and now you have walked the path instead of just hearing about the mountain.
6. A minimal Node web app that uses local Ollama and your RAG index
End‑to‑end: dev (localhost:port) → UAT (Nginx proxy) → prod (static /dist).
Why we create this structure: A clear folder layout keeps API code, UI code and data scripts separated so each can evolve independently. Node resolves modules relative to these folders, so keeping a predictable structure avoids fragile import paths and makes scaling the project easier. It also helps other developers (or future you) immediately understand where the server, client and indexing logic live. This is not just tidiness; it’s how Node’s module system and build tools (Vite, ts-node) efficiently locate files and hot‑reload during development.
6.1 Folder structure
node-rag-demo/
├─ .env
├─ package.json
├─ vite.config.ts
├─ src/
│ ├─ server.ts # Express app (API + static in dev)
│ ├─ rag.ts # index/query helpers (LanceDB + embeddings)
│ ├─ ollama.ts # local Ollama client wrapper
│ └─ routes.ts # /api/search and /api/ask endpoints
├─ public/ # static assets (dev)
├─ docs/ # source documents to index
├─ web/
│ ├─ index.html
│ ├─ main.tsx # small React UI
│ └─ App.tsx
├─ ingest.ts # CLI to (re)index docs
└─ dist/ # production build output (created by Vite)
6.2 Install (Node 18+)
We already installed Node earlier, but here we call it out again because the build tooling (Vite + modern TypeScript) requires a recent runtime; Node 18+ is the minimum these tools currently support. If you already have Node 18 or newer, simply ensure node -v confirms it; no need to reinstall—this step is to highlight the version requirement before installing the project’s dev dependencies.
npm i -D typescript ts-node @types/node vite @vitejs/plugin-react
npm i express cors dotenv zod
npm i ollama lancedb @lancedb/vectordb @xenova/transformers
npm i react react-dom
npx tsc --init --rootDir src --outDir build --esModuleInterop true
Why LanceDB? It’s a fast, file‑based vector DB that runs locally (no postgres/containers needed) and works well with Node. Swap later for Weaviate/Milvus if desired.
Install Ollama (local LLM runtime)
Ollama lets you run models locally with an OpenAI‑style API surface. Download for your platform at https://ollama.com (macOS/Windows/Linux). On Linux use their install script or a distro package if available.
Why Ollama? Privacy (data stays local), offline capability, fast iteration, and one‑line model management via ollama pull.
Choose your models (what & why):
EMBEDDING_MODEL – maps text to vectors. Default here: Xenova/all-MiniLM-L6-v2 via transformers.js (small, fast). Alternatives: intfloat/e5-small-v2 (very fast, good English retrieval), intfloat/e5-base-v2 (higher quality, slower), bge-small-en-v1.5 (strong small model), nomic-ai/nomic-embed-text (multilingual).
Important: once you index with a given embedding model, the vectors live in that model’s space—changing models requires re‑indexing.
GEN_MODEL – the generator that writes answers. We suggest llama3.1:8b-instruct as a balanced default. Alternatives: qwen2.5:7b-instruct, mistral:7b-instruct, or larger models if your hardware allows. You can also try open‑weight “thinking” style models for deeper reasoning; start small for responsiveness.
About the “OpenAI” naming: Earlier sections used OPENAI_API_KEY to illustrate an OpenAI‑compatible client. In this local Ollama setup we use the Ollama Node SDK directly—no OpenAI key is needed. We keep OpenAI‑style names only when targeting an OpenAI‑compatible endpoint.
Add .env:
OLLAMA_BASE_URL=http://127.0.0.1:11434
EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 # local embedding via transformers.js
GEN_MODEL=llama3.1:8b-instruct # or another pulled Ollama model
LANCEDB_DIR=./.lancedb
Pull your chosen generator model (examples):
# Llama 3.1 8B instruct
ollama pull llama3.1:8b-instruct
# Or Qwen / Mistral
ollama pull qwen2.5:7b-instruct
ollama pull mistral:7b-instruct
# You can also experiment with larger open-weight "thinking" style models,
# for example a 20B-parameter model such as gpt-oss:20b.
# These need a capable GPU with plenty of VRAM for reasonable speed.
# On CPU they will run, but generation will be far slower.
# GPU speed scales roughly with available VRAM and compute cores.
# Example:
# ollama pull gpt-oss:20b
# Note: always match model size to your hardware; choose smaller models for laptops
# and bigger ones only if you have the GPU capacity and need the extra reasoning power.
6.3 src/ollama.ts – local LLM wrapper
Purpose: This small helper module is the bridge between your Node app and the locally running Ollama service on http://127.0.0.1:11434.
Why it exists: It wraps the Ollama client in one place so the rest of the code can simply call a generate() function without repeating connection details or model names.
Security note: By default Ollama listens only on localhost and has no built‑in authentication; it is intended for local development. Do not expose it directly to the internet without your own proxy or access control.
What the generate() function does: sends a prompt to the chosen GEN_MODEL and returns the model’s text response, so API routes or the UI can request answers without caring about the underlying API call.
// The ollama package exports the Ollama class as a named export; the default export is a pre-configured client
import { Ollama } from "ollama";

const baseUrl = process.env.OLLAMA_BASE_URL || "http://127.0.0.1:11434";
const client = new Ollama({ host: baseUrl });
const model = process.env.GEN_MODEL || "llama3.1:8b-instruct";

// Send a single prompt to the generator model and return the full text response
export async function generate(prompt: string) {
const res = await client.generate({ model, prompt, stream: false });
return res.response;
}
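If you later want token‑by‑token output in the UI, the same client can stream. A small variant you could add to src/ollama.ts, assuming the SDK’s streaming mode (this helper is not used elsewhere in the scroll):
// Streamed variant: yields partial text as the model generates it
export async function* generateStream(prompt: string) {
  const stream = await client.generate({ model, prompt, stream: true });
  for await (const part of stream) {
    yield part.response;
  }
}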
6.4 src/rag.ts – embeddings + LanceDB index
import * as lancedb from "lancedb";
import { pipeline } from "@xenova/transformers";
import fs from "fs";

const dir = process.env.LANCEDB_DIR || ".lancedb";

let db: lancedb.Connection;
let tbl: lancedb.Table<any>;
let embedder: any; // lazily created transformers.js feature-extraction pipeline

async function getEmbedder() {
if (!embedder) embedder = await pipeline("feature-extraction", process.env.EMBEDDING_MODEL);
return embedder;
}

export async function openIndex() {
db = await lancedb.connect(dir);
// NOTE: depending on your LanceDB client version, creating a table from an empty array
// may require an explicit schema; if it complains, create the table with the first batch of rows instead.
tbl = await db.openTable("docs").catch(async () => db.createTable("docs", [], { onDisk: true }));
return tbl;
}

// Turn a piece of text into a normalised embedding vector
export async function embed(text: string): Promise<number[]> {
const e = await (await getEmbedder())(text, { pooling: "mean", normalize: true });
return Array.from(e.data);
}

// Read, chunk and embed each file, then write the rows into the "docs" table
export async function reindex(files: string[]) {
await openIndex();
const rows: any[] = [];
for (const f of files) {
const content = fs.readFileSync(f, "utf8");
const chunks = chunk(content, 900, 150);
for (const c of chunks) rows.push({ id: cryptoRandom(), text: c, vector: await embed(c), file: f });
}
if (rows.length) await tbl.add(rows);
}

// Embed the query and return the k nearest chunks from the index
export async function search(query: string, k = 5) {
await openIndex();
const qv = await embed(query);
return tbl.search(qv).limit(k).execute();
}

// Simple fixed-size character chunking with overlap
function chunk(t: string, size: number, overlap: number) {
const out: string[] = []; let i = 0;
while (i < t.length) { out.push(t.slice(i, i + size)); i += size - overlap; }
return out;
}

// Quick unique-ish id; not cryptographically secure despite the name
function cryptoRandom() { return Math.random().toString(36).slice(2) + Date.now().toString(36); }
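A quick smoke test of these helpers, run with ts-node from the project root; the file name is hypothetical and it assumes docs/example.txt exists and your .env is in place:
// smoke.ts – index one file, then query it
import { reindex, search } from "./src/rag";

await reindex(["docs/example.txt"]);
const hits = await search("What is this document about?", 3);
console.log(hits.map((h: any) => h.text.slice(0, 80)));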
6.5 src/routes.ts – API endpoints
What these routes are for & how they work:
GET /health — Liveness check for load balancers and uptime monitors. Returns { ok: true } if the API process is up and able to handle requests. Useful for Docker/K8s health probes and Nginx proxy_next_upstream logic.
POST /search — Pure retrieval. Body: { q: string }. It embeds the query, runs vector search against the RAG index and returns the hits (chunks + metadata). Use this when a client wants to render search results, previews, or build a custom context without calling the LLM.
POST /ask — Retrieval + generation. Body: { q: string }. It first calls the same retrieval as /search, then packs the top chunks into a context string and asks the local LLM (via generate() from ollama.ts) to produce a grounded answer. Response: { answer, sources } where sources lists the originating documents so you can show citations.
How requests flow: Client → Express route → rag.search() (vector DB) → assemble context → ollama.generate(prompt) → JSON response. Each I/O step is awaited to ensure ordering.
Why split /search and /ask? It lets front‑ends implement rich UX (preview the sources, allow the user to add/remove chunks, then call /ask) and enables testing retrieval quality independently of generation.
Security & limits: In production, add input validation (e.g., Zod), rate‑limits, and access control; never expose your vector DB or Ollama port directly. Keep /health lightweight and avoid leaking version/build info in responses.
What this file is for: it defines the Express routes that the frontend and any client code call to interact with your RAG backend.
Why this design: by keeping the retrieval and generation steps behind these API endpoints you isolate the UI from database or model details, making it easy to swap vector stores or LLMs later without touching the frontend.
import { Router } from "express";
import { search } from "./rag";
import { generate } from "./ollama";

export const api = Router();

// Liveness check
api.get("/health", (_req, res) => res.json({ ok: true }));

// Pure retrieval: return the raw matching chunks
api.post("/search", async (req, res) => {
const { q } = req.body as { q: string };
const hits = await search(q, 8);
res.json(hits);
});

// Retrieval + generation: build a context from the hits and ask the local model
api.post("/ask", async (req, res) => {
const { q } = req.body as { q: string };
const hits = await search(q, 5);
const context = hits.map((h: any) => h.text).join("\n---\n");
const prompt = `Use the context to answer accurately with citations.
CONTEXT:
${context}
QUESTION: ${q}`;
const answer = await generate(prompt);
res.json({ answer, sources: hits.map((h: any) => ({ file: h.file })) });
});
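Once the API is running you can exercise the routes directly, for example with curl (port 3000 is the default from server.ts below; the example questions are placeholders):
curl http://127.0.0.1:3000/api/health
curl -X POST http://127.0.0.1:3000/api/search -H "Content-Type: application/json" -d '{"q":"vector databases"}'
curl -X POST http://127.0.0.1:3000/api/ask -H "Content-Type: application/json" -d '{"q":"Summarise the indexed document"}'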
6.6 src/server.ts – Express app
Purpose: This file creates the actual HTTP server for the RAG backend. It wires the Express framework to our API routes and starts listening for requests.
Why Express: Express is a lightweight, battle‑tested Node web framework that makes it easy to define middleware (for JSON parsing, CORS, logging) and route handlers.
How it works:
• Loads environment variables with dotenv.config() so the app can read settings like PORT.
• Creates an Express app instance, adds CORS middleware so browser clients can call it, and express.json() to parse incoming JSON bodies.
• Mounts the /api routes defined in routes.ts so /api/ask and /api/search become live endpoints.
• Finally calls app.listen(port) to start the HTTP server on localhost:port (3000 by default).
Why this matters: This is the final glue—once running, you now have a simple RAG‑based chatbot service running right on your own laptop or desktop. Soon you can point the React frontend at it and begin asking questions of your indexed documents.
import express from "express";
import cors from "cors";
import dotenv from "dotenv";
import { api } from "./routes";
dotenv.config();
const app = express();
app.use(cors());
app.use(express.json({ limit: "2mb" }));
app.use("/api", api);
// Dev: serve Vite dev server separately; in prod we serve /dist via Nginx
const port = process.env.PORT ? Number(process.env.PORT) : 3000;
app.listen(port, () => console.log(`API listening on http://127.0.0.1:${port}`));
6.7 Minimal web UI (Vite + React)
Purpose: This is a very simple browser interface so you can immediately try out your RAG backend without needing any other tools. It shows the entire flow end‑to‑end—ask a question in a textbox, receive a grounded answer from the local Ollama model—right in your browser.
Why include it: With just a few lines of React you can see the system working and share a clickable demo. This is not a full chat experience; it’s the minimal proof that the pipeline works.
Next steps: This code is the foundation for a more complete chatbot. Later you can extend it with conversation history, streaming answers and WebSockets. We’ll cover how to build that richer chat experience in a future scroll.
web/main.tsx
import React from "react"; import { createRoot } from "react-dom/client";
function App(){
const [q,setQ]=React.useState("");
const [a,setA]=React.useState("");
const ask=async()=>{
const r=await fetch("/api/ask",{method:"POST",headers:{"Content-Type":"application/json"},body:JSON.stringify({q})});
const j=await r.json(); setA(j.answer);
};
return (<div style={{maxWidth:720,margin:"2rem auto",fontFamily:"sans-serif"}}>
<h1>Node + Ollama RAG</h1>
<input value={q} onChange={e=>setQ((e.target as HTMLInputElement).value)} placeholder="Ask a question" style={{width:"100%",padding:"0.6rem"}}/>
<button onClick={ask} style={{marginTop:"0.8rem"}}>Ask</button>
<pre style={{whiteSpace:"pre-wrap",marginTop:"1rem"}}>{a}</pre>
</div>);
}
createRoot(document.getElementById("root")!).render(<App/>);
web/index.html
Why it is required: This HTML file is the single entry‑point the browser loads. It provides the <div id="root"> container for React to attach to and the <script type="module"> tag that loads the compiled JavaScript. Without it, Vite cannot bootstrap the app; the browser will simply display a blank page or a 404 because there is no initial HTML to load and React never runs.
<!doctype html><html><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>Node + Ollama RAG</title></head>
<body><div id="root"></div><script type="module" src="/web/main.tsx"></script></body></html>
vite.config.ts
What it is: the build and dev‑server configuration for Vite.
Why we use it: Vite is the fast bundler/dev server that compiles the React code, injects it into index.html and serves it in development. The config file declares which plugins to load (here the React plugin) and where to output the production bundle.
How it is invoked: when you run npx vite or npm run build, Vite automatically looks for vite.config.ts (or .js) at the project root and applies these settings to start the dev server or create the dist/ production build.
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";
export default defineConfig({ plugins:[react()], root:".", build:{ outDir:"dist" } });
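One practical dev note: the UI fetches /api/... with a relative URL, but in dev the Vite server and the Express API run on different ports. A proxy entry keeps the relative URLs working; a sketch of the same config extended with server.proxy (the target assumes the default API port 3000):
import { defineConfig } from "vite";
import react from "@vitejs/plugin-react";

export default defineConfig({
  plugins: [react()],
  root: ".",
  build: { outDir: "dist" },
  server: {
    // forward API calls from the Vite dev server to the Express API
    proxy: { "/api": "http://127.0.0.1:3000" }
  }
});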
6.8 Index your docs
Why we index: Indexing converts your raw documents into high-dimensional vectors so they can be searched semantically. Without this step the RAG system has nothing to retrieve and the LLM would have to guess.
How we index here: The reindex([...]) helper reads each file, chunks the text, generates embeddings with the chosen embedding model and stores the vectors in LanceDB.
Benefit: This makes later queries fast and accurate—similar questions map to nearby vectors, allowing relevant context to be fetched instantly for the model to answer grounded in your own data.
node --loader ts-node/esm ingest.ts docs/example.txt
(or adapt your earlier ingest.js to write into LanceDB via reindex([...]))
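The folder layout lists ingest.ts but the scroll does not spell it out; a minimal sketch that wires the command‑line arguments into reindex([...]) (the argument handling and messages are illustrative):
// ingest.ts – (re)index the files passed on the command line into LanceDB
import "dotenv/config";
import { reindex } from "./src/rag";

const files = process.argv.slice(2);
if (files.length === 0) {
  console.error("Usage: node --loader ts-node/esm ingest.ts <file> [more files...]");
  process.exit(1);
}

await reindex(files);
console.log(`Indexed ${files.length} file(s) into LanceDB`);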
6.9 Run in dev (localhost:port)
# terminal 1 – Vite frontend dev server
npx vite
# terminal 2 – API
node --loader ts-node/esm src/server.ts
Why localhost:port = dev? Hot reload, fast feedback, no reverse proxy. Ideal for building quickly on a single machine.
6.10 Build the dist bundle
npm run build # if you add: "build": "vite build" to package.json scripts
This emits static assets into dist/. That’s your production frontend bundle.
6.11 UAT with Nginx proxy (frontend from dist, API proxied to Node)
/etc/nginx/conf.d/rag-uat.conf:
server {
listen 80;
server_name rag-uat.example.local;
root /opt/node-rag-demo/dist;
index index.html;
location /api/ {
proxy_pass http://127.0.0.1:3000/api/;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
location / {
try_files $uri $uri/ /index.html;
}
}
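After dropping this file into place, validate the syntax and reload Nginx (paths and the service manager vary by distro):
sudo nginx -t
sudo systemctl reload nginx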
Why Nginx = UAT? It mirrors the production topology (reverse proxy, static files, separate API) without exposing the dev servers. Easier TLS, logs, and access control.
6.12 Production
- Serve dist/ from Nginx (or any CDN/object store).
- Run the Express API as a system service (pm2/systemd) behind Nginx/TLS.
/etc/nginx/conf.d/rag-prod.conf:
server {
listen 443 ssl http2;
server_name rag.example.com;
ssl_certificate /etc/ssl/certs/rag.crt;
ssl_certificate_key /etc/ssl/private/rag.key;
root /srv/rag/dist;
index index.html;
location /api/ {
proxy_pass http://127.0.0.1:3000/api/;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Forwarded-Proto $scheme;
}
location / { try_files $uri $uri/ /index.html; }
}
6.13 Why these three stages?
- Dev (localhost:port) – fastest edit‑run loop. No TLS, no proxy, hot reload.
- UAT (Nginx proxy) – realistic routing, TLS termination, logs, and headers; safe place for stakeholders to test.
- Prod (dist + API behind Nginx) – static assets are cacheable and cheap to serve; API is isolated, observable and restartable without breaking the UI.
6.14 Scripts (package.json)
{
"scripts": {
"dev:api": "ts-node src/server.ts",
"dev:web": "vite",
"build": "vite build",
"start": "node build/server.js"
}
}
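One small gap worth noting: npm run build only produces the frontend bundle, while the start script expects the API compiled into build/ (the outDir chosen by tsc --init in 6.2). Adding a separate script closes the loop; the script name is a suggestion:
{
  "scripts": {
    "dev:api": "ts-node src/server.ts",
    "dev:web": "vite",
    "build": "vite build",
    "build:api": "tsc -p .",
    "start": "node build/server.js"
  }
}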
You now have a real web app using local Ollama for generation and a local LanceDB RAG index for retrieval—runnable on a laptop for dev, behind Nginx for UAT, and as a static+API split for production.
Walk‑through complete. Hand held as requested.
Coffee now? Strong, the way coders like it. Or a calming tea for those steering clear of caffeine.
Anything else to be shown next—benchmarks, diagrams, deployment hardening, Arch/Alpine service files?