3/28/2026 · 16 min read
Building a RAG Chatbot in 2026: Claude + LangChain + Pinecone, Step by Step
A working RAG chatbot in under 200 lines of code, deployed and answering questions about your docs. With the prompts, the chunking strategy, and the gotchas.
RAG (Retrieval-Augmented Generation) is the technique that lets a chatbot answer questions about your documents — your help center, your PDFs, your Notion. Without RAG, the bot only knows what was in its training data. With RAG, it knows what's in your knowledge base, in real time.
I've built ~12 of these for clients. Here's the version I now copy-paste as my starting point.
The architecture
Documents → Chunker → Embedder → Pinecone (vector DB)
                                     ↑
User question → Embedder → Similarity search
                                     ↓
Top 5 chunks + question → Claude → Answer
That's it. The whole game.
The code (Node.js / Next.js API route)
// app/api/chat/route.ts
import { Pinecone } from "@pinecone-database/pinecone";
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const claude = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const index = pinecone.index("knowledge-base");

async function embed(text: string) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });
  return res.data[0].embedding;
}

export async function POST(req: Request) {
  const { question } = await req.json();

  // 1. Embed the question
  const questionVector = await embed(question);

  // 2. Find relevant chunks
  const results = await index.query({
    vector: questionVector,
    topK: 5,
    includeMetadata: true
  });

  const context = results.matches
    .map(m => m.metadata?.text)
    .filter(Boolean)
    .join("\n\n---\n\n");

  // 3. Ask Claude with context
  const response = await claude.messages.create({
    model: "claude-sonnet-4-5",
    max_tokens: 1024,
    system: `You are a helpful assistant. Answer the user's question using ONLY the context below. If the context doesn't contain the answer, say "I don't have that information in my knowledge base."

CONTEXT:
${context}`,
    messages: [{ role: "user", content: question }]
  });

  // content[0] isn't guaranteed to be a text block — narrow the type instead of casting to any
  const first = response.content[0];
  const answer = first.type === "text" ? first.text : "";

  return Response.json({
    answer,
    sources: results.matches.map(m => m.metadata?.source)
  });
}

The ingestion pipeline (the part everyone gets wrong)
// scripts/ingest.ts
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { Pinecone } from "@pinecone-database/pinecone";
import OpenAI from "openai";
import fs from "fs";

// The script runs standalone, so it needs its own clients and embed() helper
// (same setup as the API route)
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
const index = pinecone.index("knowledge-base");

async function embed(text: string) {
  const res = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  });
  return res.data[0].embedding;
}

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 800,
  chunkOverlap: 150,
  separators: ["\n\n", "\n", ". ", " ", ""]
});

async function ingestFile(filepath: string) {
  const text = fs.readFileSync(filepath, "utf-8");
  const chunks = await splitter.splitText(text);

  const vectors = await Promise.all(
    chunks.map(async (chunk, i) => ({
      id: `${filepath}-${i}`,
      values: await embed(chunk),
      metadata: { text: chunk, source: filepath, chunk: i }
    }))
  );

  await index.upsert(vectors);
  console.log(`Ingested ${chunks.length} chunks from ${filepath}`);
}

The four mistakes I made on my first three RAG builds
1. Chunks too big. I started with 2000-char chunks. The retrieval was fuzzy because each chunk covered 5 different topics. Dropped to 800 chars with 150 overlap and accuracy jumped 40%.
2. No metadata. I just stored text. Then when someone asked "what does the refund policy say?", I had no way to filter to just the refund-policy doc. Always store source, section, and date in metadata.
3. No reranking. Top-5 from cosine similarity isn't always the most relevant — it's the most similar. For high-stakes use cases (legal, medical), add a reranker like Cohere's rerank-3. It costs $1/1000 reranks and dramatically improves answer quality.
4. Forgetting hybrid search. Cosine similarity misses on rare keywords (product names, IDs, acronyms). Pinecone supports sparse-dense hybrid search — use it for product catalogs, codebases, anything with specific terminology.
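Mistake 2 has a cheap fix at query time: Pinecone queries accept a metadata filter using MongoDB-style operators like $eq. A sketch, assuming chunks carry the source field the ingestion script writes — sourceFilter is a hypothetical helper, not part of any SDK:

```typescript
// Build a Pinecone metadata filter that pins retrieval to one document.
// (hypothetical helper — the filter object itself is standard Pinecone filter syntax)
function sourceFilter(source: string) {
  return { source: { $eq: source } };
}

// Passed straight into the query from the API route:
//   const results = await index.query({
//     vector: questionVector,
//     topK: 5,
//     includeMetadata: true,
//     filter: sourceFilter("refund-policy.md")
//   });
```

Now "what does the refund policy say?" only ever retrieves refund-policy chunks, instead of whatever happens to be cosine-close.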
The system prompt I now use
You are a helpful assistant for [COMPANY].
Rules:
1. Answer using ONLY the context below. Do not use outside knowledge.
2. If the context is insufficient, say "I don't have that in my knowledge base. Want me to connect you with someone?"
3. Cite the source filename in [brackets] after each claim.
4. Keep answers under 4 sentences unless the user asks for detail.
5. Never make up product features, prices, or policies.
CONTEXT:
{context}
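Wiring the template in is a single substitution. A minimal sketch — SYSTEM_TEMPLATE and buildSystemPrompt are names I'm inventing here, and split/join is used instead of String.replace to avoid JavaScript's $-pattern pitfalls if the context contains dollar signs:

```typescript
// The template above, with {context} as the substitution slot.
const SYSTEM_TEMPLATE = `You are a helpful assistant for [COMPANY].

Rules:
1. Answer using ONLY the context below. Do not use outside knowledge.
2. If the context is insufficient, say you don't have that in the knowledge base.
3. Cite the source filename in [brackets] after each claim.
4. Keep answers under 4 sentences unless the user asks for detail.
5. Never make up product features, prices, or policies.

CONTEXT:
{context}`;

// split/join is a literal substitution, unlike replace() which
// treats $& and friends in the replacement string specially.
function buildSystemPrompt(context: string): string {
  return SYSTEM_TEMPLATE.split("{context}").join(context);
}
```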
The "cite sources" instruction is what makes users trust the bot. They click the citation, see it's real, and stop suspecting hallucination.
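One practical detail on those citations: the route returns one source per retrieved chunk, so the same filename can show up five times in the response. A dedupe sketch before rendering — uniqueSources is a made-up helper:

```typescript
// Collapse per-chunk sources into a unique, defined list for the citation UI.
function uniqueSources(sources: (string | undefined)[]): string[] {
  // Drop undefined entries, then dedupe while preserving first-seen order.
  return [...new Set(sources.filter((s): s is string => Boolean(s)))];
}

// uniqueSources(["faq.md", "faq.md", undefined, "refunds.md"])
// → ["faq.md", "refunds.md"]
```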
Cost (real numbers)
For a knowledge base of ~500 documents, ~10K chunks:
- Pinecone: $0 (free tier covers up to 100K vectors)
- Embeddings (one-time ingestion): ~$0.50
- Re-embeddings on doc updates: ~$0.05/month
- Claude per chat: ~$0.01 (avg 1500 tokens in context, 300 out)
- 1000 chats/month: $10
Total: ~$10/month for a production RAG bot. Hard to argue with.
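The per-chat figure checks out against list prices. A sketch assuming Claude Sonnet's pricing of $3 per million input tokens and $15 per million output tokens — prices change, so treat both constants as assumptions:

```typescript
// Per-chat cost in USD at assumed Sonnet list prices.
const INPUT_USD_PER_TOKEN = 3 / 1_000_000;   // $3 per million input tokens
const OUTPUT_USD_PER_TOKEN = 15 / 1_000_000; // $15 per million output tokens

function chatCostUSD(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_USD_PER_TOKEN + outputTokens * OUTPUT_USD_PER_TOKEN;
}

// 1500 tokens in (context + question) and 300 out:
// chatCostUSD(1500, 300) ≈ 0.009 — about a cent, so ~$9 per 1000 chats.
```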