Ask your WhatsApp: build a private RAG with LlamaIndex

Why build a WhatsApp RAG?

I have a very active group chat with my friends on WhatsApp. At the time of writing, it is a bit over half a million messages. Since LLMs became a thing, I always wondered how I could use this data for something useful—or at the very least, prank my friends.

Last year I tried a few different approaches to fine tune a model using the chat data, but it didn’t work all that well. Fine‑tuning a model on commodity hardware is a challenge in itself and the results were underwhelming. So I dropped that idea for a while. While going through the material for the HuggingFace Agents Course though, it became very clear that RAG (Retrieval Augmented Generation) would be a perfect fit for what I was trying to do.

This post shows how easy it is to set up a RAG on top of your WhatsApp chat logs. You are going to export your messages, parse the .txt files, index them with LlamaIndex, generate embeddings and store them in DuckDB, and ask questions locally running Ollama. You’ll end with a small chat application that you can use to ask questions about your conversation log. The best part is that everything can run from your local machine, so you don’t have to upload any of this sensitive data to the cloud.

RAG, embeddings, and vector databases

Before diving into the implementation, here are some definitions of key concepts used in this experiment:

RAG: Retrieval‑Augmented Generation; first retrieves relevant chunks from your data, then generates an answer grounded in those snippets.
Embeddings: Numeric vector representations of text; semantically similar texts map to nearby vectors, enabling semantic similarity search.
Vector database: A store/index optimized for embeddings and fast similarity search (e.g., retrieving the top‑k most relevant chunks). Here LlamaIndex’s DuckDB vector store is used to persist vectors locally.

This is what the workflow looks like once fully implemented:

  flowchart 
  %% Ingestion
  subgraph Ingestion
    A[WhatsApp exports] --> B[Parse/Cleanup]
    B --> C[Chunking]
    C --> D[Generate embeddings]
    D --> E[(DuckDB Vector Store)]
  end

  %% Query
  subgraph Query
    Q[User question] --> Qe[Embed query]
    Qe --> E
    E --> K[Top-k similar chunks]
    K --> L[LLM generation]
    L --> A2[Grounded answer]
  end

Prerequisites

Python 3.10+
One or more WhatsApp chat exports in .txt format
ollama running a local model (e.g., llama3 or gpt-oss)
uv to manage Python dependencies

Create a new project using uv and add the dependencies:

uv init whatsapp-rag
cd whatsapp-rag
uv add \
  llama-index-llms-ollama \
  llama-index-vector-stores-duckdb \
  llama-index-embeddings-huggingface \
  gradio

Export your chats

Create a new directory for the chat data inside your project:

mkdir input

You need to export your chat messages to a text file.

iOS: Chat → Contact info → Export Chat → Without Media → Save/Share .txt
Android: Chat → More → Export chat → Without media
Name each file clearly: family.txt, work.txt, etc., and place them in the ./input folder.

Ingest chat logs

Create a new file named ingest.py and populate with this content:

from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.duckdb import DuckDBVectorStore


vector_store = DuckDBVectorStore("duck.db", persist_dir="./data/")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3")

splitter = TokenTextSplitter(chunk_size=512, separator="\r\n")

documents = SimpleDirectoryReader("./input/").load_data()

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    transformations=[splitter],
    embed_model=embed_model,
    show_progress=True,
)

The script loads your exported .txt chats, splits them into retrieval‑friendly chunks, calculates the embeddings, and persists everything to DuckDB:

Vector store: DuckDBVectorStore("duck.db", persist_dir="./data/") stores both vectors and metadata on disk under ./data/, so you can reuse the index without re‑ingesting.
Embeddings: BAAI/bge-m3 is a strong multilingual embedding model that runs locally via Hugging Face. You can swap it for a smaller/faster model if needed.
Chunking: TokenTextSplitter(chunk_size=512, separator="\r\n") breaks the raw chat text along line breaks, keeping messages together while limiting token length for better retrieval.
Reader: SimpleDirectoryReader("./input/") loads every .txt file in the folder and attaches basic file metadata (e.g., filename).
Index build: VectorStoreIndex.from_documents(...) generates embeddings for each chunk and writes them to DuckDB with progress reporting.

After running this once, the built index is persisted and can be opened later for querying without reprocessing the input files.

To run the script, use:

uv run ingest.py

Main chat app

Next, you have the actual RAG application that uses the index generated on the previous step.

import gradio
from llama_index.core import VectorStoreIndex
from llama_index.core.prompts import ChatMessage
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.duckdb import DuckDBVectorStore

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-m3", device="cpu")
vector_store = DuckDBVectorStore.from_local("./data/duck.db")
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)


llm = Ollama(
    model="gpt-oss:20b",
    request_timeout=300,
    context_window=1024 * 10,
)

engine = index.as_chat_engine(
    llm=llm,
    similarity_top_k=5,
    system_prompt=(
        "You are a helpful assistant that searches WhatsApp "
        "messages to answer questions"
    ),
    streaming=True,
)


def stream(input: str, history: list[dict[str, str]]):
    chat_history = [
        ChatMessage(role=item["role"], content=item["content"]) for item in history
    ]
    content = ""
    for token in engine.stream_chat(input, chat_history=chat_history).response_gen:
        content += token
        yield content


chat = gradio.ChatInterface(
    fn=stream,
    type="messages",
    title="RacinhoGPT",
).launch()

This file wires the stored index to an LLM and a simple chat UI:

Embed model: here the embedding model is loaded on the CPU to save VRAM for the LLM model. Also, only the user’s prompt needs to be processed so using the GPU won’t provide much performance benefit.
Load index: DuckDBVectorStore.from_local("./data/duck.db") reopens the previously persisted vectors, and VectorStoreIndex.from_vector_store(...) prepares a retriever over them using the same embedding model.
Local LLM: Ollama(model="gpt-oss:20b") runs a local model for generation. You can replace it with another Ollama model (e.g., llama3) if preferred. Make sure to configure an appropriate context window that fits in your hardware budget. I’m using a GeForce RTX 4060 Ti 16 GB, so 10k tokens was the right number to fit the model, system prompt, the RAG context, and the user prompt.
Chat engine: index.as_chat_engine(...) handles retrieval‑augmented generation, retrieving 5 similar chunks, with a concise system prompt.
Streaming: engine.stream_chat(...) yields tokens as they are generated; the stream function accumulates and streams them back to Gradio for a live UI.
History: Incoming history messages are converted to ChatMessages so the LLM can keep context across turns.
UI: gradio.ChatInterface provides a minimal chat app you can open in the browser. Title is arbitrary—rename freely.

uv run main.py

Once launched, type a question like “Are there any discussions of a ski trip?” The assistant retrieves relevant messages from your chats and answers grounded in those snippets.

Where to go from here

Here are some ideas on how to improve on this example:

Play around with different LLM and embedding models.
Tune the system_prompt, chunk_size, top_k, and the context_window values according to your hardware, and compare which combinations deliver the most reliable results.
Turn the application into an agent. So far, the application only performs a single‑shot call to the model with the retrieved context. You can improve results by using an agentic loop to retrieve information. This can be achieved by transforming the query engine into a tool using the QueryEngineTool and AgentWorkflow classes, both from LlamaIndex.

Wrap‑up

With a few dozen lines of parsing and LlamaIndex’s indexing/query APIs, you get a private, semantic interface to your WhatsApp history. This example isn’t limited to WhatsApp chats; you can easily adapt it to other file formats using LlamaIndex‑provided parsers. I highly recommend checking it out.

Why build a WhatsApp RAG?#

RAG, embeddings, and vector databases#

Prerequisites#

Export your chats#

Ingest chat logs#

Main chat app#

Where to go from here#

Wrap‑up#