3/28/2026 · 16 min read
Building a RAG Chatbot in 2026: Claude + LangChain + Pinecone, Step by Step
A working RAG chatbot in under 200 lines of code, deployed and answering questions about your docs. With the prompts, the chunking strategy, and the gotchas.
RAG (Retrieval-Augmented Generation) is the technique that lets a chatbot answer questions about your documents — your help center, your PDFs, your Notion. Without RAG, the bot only knows what was in its training data. With RAG, it knows what's in your knowledge base, in real time.
I've built ~12 of these for clients. Here's the version I now copy-paste as my starting point.
The architecture
Documents → Chunker → Embedder → Pinecone (vector DB)
                                     ↑
User question → Embedder → Similarity search
                                     ↓
Top 5 chunks + question → Claude → Answer
That's it. The whole game.
The code (Node.js / Next.js API route)
// app/api/chat/route.ts
import { Pinecone } from "@pinecone-database/pinecone";
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const index = pinecone.index("knowledge-base");

async function embed(text: string) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });
  return res.data[0].embedding;
}

export async function POST(req: Request) {
  const { question } = await req.json();

  // 1. Embed the question
  const questionVector = await embed(question);

  // 2. Find relevant chunks
  const results = await index.query({
    vector: questionVector,
    topK: 5,
    includeMetadata: true
  });

  const context = results.matches
    .map(m => m.metadata?.text)
    .filter(Boolean)
    .join("\n\n---\n\n");

  // 3. Ask Claude with context
  const response = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: `You are a helpful assistant. Answer the user's question using ONLY the context below. If the context doesn't contain the answer, say "I don't have that information in my knowledge base."

CONTEXT:
${context}`,
    messages: [{ role: "user", content: question }]
  });

  // content[0] isn't guaranteed to be a text block — narrow the type instead of casting to any
  const first = response.content[0];
  const answer = first.type === "text" ? first.text : "";

  return Response.json({
    answer,
    sources: results.matches.map(m => m.metadata?.source)
  });
}

The ingestion pipeline (the part everyone gets wrong)
// scripts/ingest.ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { Pinecone } from "@pinecone-database/pinecone";
import OpenAI from "openai";
import fs from "fs";

// The script runs standalone, so it needs its own clients and embed() helper
// (same setup as the API route)
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const index = pinecone.index("knowledge-base");

async function embed(text: string) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });
  return res.data[0].embedding;
}

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 150,
  separators: ["\n\n", "\n", ". ", " ", ""]
});

async function ingestFile(filepath: string) {
  const text = fs.readFileSync(filepath, "utf-8");
  const chunks = await splitter.splitText(text);

  const vectors = await Promise.all(
    chunks.map(async (chunk, i) => ({
      id: `${filepath}-${i}`,
      values: await embed(chunk),
      metadata: { text: chunk, source: filepath, chunk: i }
    }))
  );

  await index.upsert(vectors);
  console.log(`Ingested ${chunks.length} chunks from ${filepath}`);
}

The four mistakes I made on my first three RAG builds
1. Chunks too big. I started with 2000-char chunks. The retrieval was fuzzy because each chunk covered 5 different topics. Dropped to 800 chars with 150 overlap and accuracy jumped 40%.
2. No metadata. I just stored text. Then when someone asked "what does the refund policy say?", I had no way to filter to just the refund-policy doc. Always store source, section, and date in metadata.
3. No reranking. Top-5 from cosine similarity isn't always the most relevant — it's the most similar. For high-stakes use cases (legal, medical), add a reranker like Cohere's rerank-3. It costs $1/1000 reranks and dramatically improves answer quality.
4. Forgetting hybrid search. Cosine similarity misses on rare keywords (product names, IDs, acronyms). Pinecone supports sparse-dense hybrid search — use it for product catalogs, codebases, anything with specific terminology.
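Mistake 2 has a cheap fix at query time: Pinecone queries accept a metadata filter using MongoDB-style operators like $eq. A sketch, assuming chunks carry the source field the ingestion script writes — sourceFilter is a hypothetical helper, not part of any SDK:

```typescript
// Build a Pinecone metadata filter that pins retrieval to one document.
// (hypothetical helper — the filter object itself is standard Pinecone filter syntax)
function sourceFilter(source: string) {
  return { source: { $eq: source } };
}

// Passed straight into the query from the API route:
//   const results = await index.query({
//     vector: questionVector,
//     topK: 5,
//     includeMetadata: true,
//     filter: sourceFilter("refund-policy.md")
//   });
```

Now "what does the refund policy say?" only ever retrieves refund-policy chunks, instead of whatever happens to be cosine-close.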
The system prompt I now use
You are a helpful assistant for [COMPANY].
Rules:
1. Answer using ONLY the context below. Do not use outside knowledge.
2. If the context is insufficient, say "I don't have that in my knowledge base. Want me to connect you with someone?"
3. Cite the source filename in [brackets] after each claim.
4. Keep answers under 4 sentences unless the user asks for detail.
5. Never make up product features, prices, or policies.
CONTEXT:
{context}
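Wiring the template in is a single substitution. A minimal sketch — SYSTEM_TEMPLATE and buildSystemPrompt are names I'm inventing here, and split/join is used instead of String.replace to avoid JavaScript's $-pattern pitfalls if the context contains dollar signs:

```typescript
// The template above, with {context} as the substitution slot.
const SYSTEM_TEMPLATE = `You are a helpful assistant for [COMPANY].

Rules:
1. Answer using ONLY the context below. Do not use outside knowledge.
2. If the context is insufficient, say you don't have that in the knowledge base.
3. Cite the source filename in [brackets] after each claim.
4. Keep answers under 4 sentences unless the user asks for detail.
5. Never make up product features, prices, or policies.

CONTEXT:
{context}`;

// split/join is a literal substitution, unlike replace() which
// treats $& and friends in the replacement string specially.
function buildSystemPrompt(context: string): string {
  return SYSTEM_TEMPLATE.split("{context}").join(context);
}
```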
The "cite sources" instruction is what makes users trust the bot. They click the citation, see it's real, and stop suspecting hallucination.
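One practical detail on those citations: the route returns one source per retrieved chunk, so the same filename can show up five times in the response. A dedupe sketch before rendering — uniqueSources is a made-up helper:

```typescript
// Collapse per-chunk sources into a unique, defined list for the citation UI.
function uniqueSources(sources: (string | undefined)[]): string[] {
  // Drop undefined entries, then dedupe while preserving first-seen order.
  return [...new Set(sources.filter((s): s is string => Boolean(s)))];
}

// uniqueSources(["faq.md", "faq.md", undefined, "refunds.md"])
// → ["faq.md", "refunds.md"]
```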
Cost (real numbers)
For a knowledge base of ~500 documents, ~10K chunks:
- Pinecone: $0 (free tier covers up to 100K vectors)
- Embeddings (one-time ingestion): ~$0.50
- Re-embeddings on doc updates: ~$0.05/month
- Claude per chat: ~$0.01 (avg 1500 tokens in context, 300 out)
- 1000 chats/month: $10
Total: ~$10/month for a production RAG bot. Hard to argue with.
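The per-chat figure checks out against list prices. A sketch assuming Claude Sonnet's pricing of $3 per million input tokens and $15 per million output tokens — prices change, so treat both constants as assumptions:

```typescript
// Per-chat cost in USD at assumed Sonnet list prices.
const INPUT_USD_PER_TOKEN = 3 / 1_000_000;   // $3 per million input tokens
const OUTPUT_USD_PER_TOKEN = 15 / 1_000_000; // $15 per million output tokens

function chatCostUSD(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_USD_PER_TOKEN + outputTokens * OUTPUT_USD_PER_TOKEN;
}

// 1500 tokens in (context + question) and 300 out:
// chatCostUSD(1500, 300) ≈ 0.009 — about a cent, so ~$9 per 1000 chats.
```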