Building a local PDF QA bot with LangChain, Chroma, Ollama, and Gradio

How I refactored an IBM AI Engineering capstone project into a local PDF QA bot using Ollama, Chroma, LangChain, and Gradio.
I recently finished the IBM AI Engineer Professional Certificate, and one part I wanted to keep after the course was the final project.
The issue was that the course version was tied to a specific lab structure and to IBM-hosted model choices. That setup is fine for guided exercises, but I wanted something I could run locally, understand end to end, and reuse outside the course environment.
So I refactored the project into a small local script that answers questions about PDF files using a local RAG pipeline. The code is available on GitHub.
What the project does
The application takes a PDF file, reads its content, splits the text into chunks, creates embeddings for those chunks, stores them in Chroma, retrieves the most relevant parts for a user question, and then sends that context to a local LLM running through Ollama.
To keep it easy to test, I exposed the flow through a small Gradio interface. The result is a simple document QA bot that stays local and is much easier to reproduce than the original course setup.
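As a rough illustration of that interface layer, here is a minimal sketch, assuming a recent Gradio release where `gr.File` accepts `type="filepath"`. The names `handle_request`, `build_ui`, and `answer_fn` are hypothetical and are not taken from the repository; `answer_fn` stands in for whatever function runs the RAG pipeline. Gradio is imported lazily so the input checks can be exercised without it installed.

```python
from typing import Callable, Optional


def handle_request(
    file_path: Optional[str],
    query: str,
    answer_fn: Callable[[str, str], str],
) -> str:
    """Validate the inputs, then delegate to the RAG pipeline."""
    if not file_path:
        return "Please upload a PDF first."
    if not query or not query.strip():
        return "Please enter a question."
    return answer_fn(file_path, query.strip())


def build_ui(answer_fn: Callable[[str, str], str]):
    """Build a two-input Gradio interface (PDF upload plus question box)."""
    import gradio as gr  # imported here so the module loads without Gradio

    return gr.Interface(
        fn=lambda file_path, query: handle_request(file_path, query, answer_fn),
        inputs=[
            gr.File(label="PDF file", file_types=[".pdf"], type="filepath"),
            gr.Textbox(label="Question", lines=2),
        ],
        outputs=gr.Textbox(label="Answer"),
        title="Local PDF QA bot",
    )
```

Calling `build_ui(some_answer_function).launch()` would then start the local web server. Keeping the validation in a plain function means it can be tested without starting the UI.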
Architecture
The flow is straightforward:
- A user uploads a PDF and asks a question through Gradio.
- PyPDFLoader reads the document.
- RecursiveCharacterTextSplitter splits the text into chunks.
- OllamaEmbeddings converts the chunks into vectors.
- Chroma stores the vectors and exposes a retriever.
- RetrievalQA sends the retrieved context to ChatOllama.
- The answer is returned to the UI.
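To make the embed, store, and retrieve steps concrete, here is a toy sketch in plain Python. It is not how OllamaEmbeddings or Chroma work internally: the bag-of-words `embed` function is a deliberately crude stand-in for a neural embedding model, and the in-memory list stands in for the vector store. All function names are hypothetical. What it does show is the core idea of ranking chunks by vector similarity to the question.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (NOT a neural embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vec), reverse=True)
    return ranked[:k]


chunks = [
    "Invoices are due within 30 days of receipt.",
    "The warranty covers parts and labor for two years.",
    "Support is available by email on weekdays.",
]
top = retrieve(chunks, "How long is the warranty?", k=1)
print(top)  # the warranty chunk ranks first
```

A real embedding model would also match paraphrases that share no words with the question, which is exactly what this word-count version cannot do.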
This is the core of the app:
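def build_retriever(file_path: str):
    # Load the PDF, split it into chunks, embed them, and index them in Chroma.
    # load_pdf, split_documents, and create_embeddings are helpers from the same script.
    documents = load_pdf(file_path)
    chunks = split_documents(documents)
    vectordb = Chroma.from_documents(chunks, create_embeddings())
    return vectordb.as_retriever()


def answer_question(file_path: str, query: str):
    # "stuff" places all retrieved chunks into a single prompt for the LLM.
    qa = RetrievalQA.from_chain_type(
        llm=create_llm(),
        chain_type="stuff",
        retriever=build_retriever(file_path),
        return_source_documents=False,
    )
    response = qa.invoke(query)
    return response["result"]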
What I like about this structure is that each responsibility is still visible. Even though LangChain handles the orchestration, the ingestion, chunking, retrieval, and generation steps are still explicit in the script.
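The `chain_type="stuff"` setting is worth a closer look, since it decides how the retrieved context reaches the model: all retrieved chunks are concatenated into one prompt. The sketch below approximates that idea in plain Python. The function name and the prompt wording are my own assumptions for illustration, not LangChain's actual template.

```python
def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate all retrieved chunks into a single prompt ("stuffing")."""
    context = "\n\n".join(chunk.strip() for chunk in chunks)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    ["Chunk about warranties.", "Chunk about invoices."],
    "How long is the warranty?",
)
print(prompt)
```

Because everything goes into one prompt, the combined chunks have to fit in the model's context window, which is one reason the chunk size and the number of retrieved chunks matter.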
What I changed from the course project
The main change was moving from a course-specific project structure to a single local script with a predictable run flow.
I also added environment-based configuration for the model names, chunking values, Ollama host, and Gradio settings. That makes it easier to experiment without changing code every time:
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    # Each setting can be overridden through an environment variable.
    llm_model: str = os.getenv("LLM_MODEL", "llama3.1:8b")
    embedding_model: str = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
    chunk_size: int = int(os.getenv("CHUNK_SIZE", "1000"))
    chunk_overlap: int = int(os.getenv("CHUNK_OVERLAP", "150"))
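One detail of this pattern: class-level defaults built with `os.getenv` are evaluated once, when the class is defined, so environment changes made later in the same process are not picked up. If that ever matters (in tests, for example), a possible variant reads the environment at instantiation time instead. The sketch below is an assumption of mine, not the repository's code; `LazyAppConfig` and `_env_int` are hypothetical names, and the overlap check is an extra safeguard I added.

```python
import os
from dataclasses import dataclass, field


def _env_int(name: str, default: int) -> int:
    """Read an integer from the environment, falling back to a default."""
    raw = os.getenv(name)
    return int(raw) if raw else default


@dataclass(frozen=True)
class LazyAppConfig:
    # default_factory runs at instantiation, so each new LazyAppConfig()
    # sees the current environment rather than the one at import time.
    llm_model: str = field(default_factory=lambda: os.getenv("LLM_MODEL", "llama3.1:8b"))
    embedding_model: str = field(
        default_factory=lambda: os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
    )
    chunk_size: int = field(default_factory=lambda: _env_int("CHUNK_SIZE", 1000))
    chunk_overlap: int = field(default_factory=lambda: _env_int("CHUNK_OVERLAP", 150))

    def __post_init__(self) -> None:
        # An overlap at least as large as the chunk would never advance.
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
```

With this variant, running the app as `CHUNK_SIZE=500 CHUNK_OVERLAP=50 uv run python qa_bot_portfolio.py` would behave the same as before, while a misconfigured overlap fails early with a clear error.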
Another useful improvement was documenting the project like a normal local app instead of a lab exercise. That includes a README, a .env example, and a simple run sequence:
uv sync --python /usr/bin/python3.12
ollama pull llama3.1:8b
ollama pull nomic-embed-text
uv run python qa_bot_portfolio.py
That part matters more than it looks. A project becomes much more useful once it can be installed and started without depending on hidden notebook state or cloud-only defaults.
What I learned
The biggest lesson is that a RAG system is rarely a matter of “just call an LLM.” Most of the quality comes from retrieval and context management.
If the chunking is weak, the embeddings are not a good fit, or the retriever brings back noisy context, the final answer will also be weak. The LLM is only one part of the pipeline.
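To see what the chunk size and overlap settings actually control, here is a simplified sketch of overlapping chunking. It uses fixed-size character windows, which is not the algorithm RecursiveCharacterTextSplitter uses (that one prefers to split on separators such as paragraph and line breaks first). The function name is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size character windows where neighbours share `chunk_overlap` chars."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.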
I also liked seeing how useful local LLMs can be for development. Running the project with Ollama makes the system easier to test privately and easier to reason about when experimenting with prompts, chunk sizes, or different models.
LangChain was helpful here because it reduced the amount of plumbing code, but it also reminded me that abstraction should stay visible. If the boundaries between loading, retrieval, and generation become too implicit, debugging gets harder very quickly.
Limitations
This is still a small portfolio project, not a production-ready QA system.
Some current limitations are:
- There is no evaluation suite yet.
- There are no source citations in the final answer.
- The app does not keep user sessions or document history.
- The output quality depends heavily on the selected local model.
- The PDF parsing is basic and does not handle more advanced layouts or structured extraction.
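On the citation point, one possible direction is to set `return_source_documents=True` and format the returned documents into a short source list. The sketch below assumes the chain's response then carries a `source_documents` list whose items expose a `metadata` dict with `source` and a 0-based `page` key, which is what I would expect from PyPDFLoader but have not verified against the project. `Doc` is a stand-in class and `format_with_sources` is a hypothetical helper.

```python
import os
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_with_sources(response: dict) -> str:
    """Append a de-duplicated 'Sources' list to the answer text."""
    seen = []
    for doc in response.get("source_documents", []):
        name = os.path.basename(str(doc.metadata.get("source", "unknown")))
        page = doc.metadata.get("page")
        # Assumes a 0-based page index in metadata, shown 1-based to readers.
        label = f"{name}, p. {page + 1}" if isinstance(page, int) else name
        if label not in seen:
            seen.append(label)
    if not seen:
        return response["result"]
    sources = "\n".join(f"- {label}" for label in seen)
    return f"{response['result']}\n\nSources:\n{sources}"


fake_response = {
    "result": "The warranty lasts two years.",
    "source_documents": [
        Doc("...", {"source": "/tmp/manual.pdf", "page": 4}),
        Doc("...", {"source": "/tmp/manual.pdf", "page": 4}),
        Doc("...", {"source": "/tmp/manual.pdf", "page": 7}),
    ],
}
text = format_with_sources(fake_response)
print(text)
```

This only lists which pages were retrieved. It does not prove the model actually used them, so it is a first step toward citations rather than a full solution.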
Next improvements
The next things I would like to add are evaluation questions, source citations in responses, and a better chunking strategy.
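For the evaluation questions, even a very small harness would be a start. The sketch below is one possible shape, not something that exists in the project: it checks whether each answer contains a set of expected keywords and reports the fraction of cases that pass. `keyword_hit_rate`, `EvalCase`, and `fake_bot` are hypothetical names, and `fake_bot` returns canned answers in place of the real pipeline.

```python
from typing import Callable

EvalCase = tuple[str, list[str]]  # (question, keywords expected in the answer)


def keyword_hit_rate(answer_fn: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of cases whose answer contains every expected keyword."""
    if not cases:
        return 0.0
    hits = 0
    for question, keywords in cases:
        answer = answer_fn(question).lower()
        if all(keyword.lower() in answer for keyword in keywords):
            hits += 1
    return hits / len(cases)


def fake_bot(question: str) -> str:
    """Canned answers standing in for the real pipeline."""
    if "warranty" in question.lower():
        return "The warranty lasts two years."
    return "I don't know."


cases = [
    ("How long is the warranty?", ["two years"]),
    ("When are invoices due?", ["30 days"]),
]
score = keyword_hit_rate(fake_bot, cases)
print(score)  # 0.5
```

Keyword matching is crude, but it is cheap to run after every change to the chunk size, the embedding model, or the LLM, which makes regressions visible.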
I would also like to package it with Docker and compare the local Ollama-based version with a hosted version using AWS Bedrock. That would make the tradeoffs clearer around setup cost, portability, privacy, and response quality.
Conclusion
This refactor was a good reminder that course projects become much more valuable once they are converted into something local, reproducible, and easier to run without the original teaching environment.
The final result is intentionally simple, but it covers the full loop: PDF ingestion, chunking, embeddings, retrieval, answer generation, and a minimal UI. For me, that made the capstone much more useful than leaving it as a cloud-bound lab exercise.