As part of my work on refstudio, I spent some time prototyping a chatbot that could answer questions about a corpus of PDFs. Tools like LangChain, LlamaIndex, and Haystack all have built-in abstractions to simplify this task, but I find that building a simplified version from scratch helps me understand the underlying concepts better.

A basic version of the PDF Chatbot requires two phases with the following steps:

  1. PDF Ingestion
    • Convert PDFs to text
    • Chunk the text into smaller pieces
    • Optional: Generate embeddings for the text chunks
    • Persist the text chunks (or embeddings) in some way so that we can query them later
  2. Chatbot Interaction
    • Take a question from the user
    • Retrieve the text chunks most similar to the question
      • If we did not create embeddings, we can use a ranking function like BM25 to find the most similar text chunks
      • If we did create embeddings, we can use a nearest neighbors algorithm to find the most similar text chunks
    • Include those chunks as "context" in the prompt we send to the LLM along with the question
    • Return the LLM's response
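To make the non-embedding path concrete, here is a from-scratch sketch of the chunking and BM25 retrieval steps above. This is illustrative code, not the code from the repo: the fixed-size word chunker is deliberately naive, and the scoring is a plain Okapi BM25 implemented with the standard library only.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_words=200):
    """Split extracted PDF text into fixed-size word chunks (a naive strategy)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def bm25_top_k(query, chunks, k=3, k1=1.5, b=0.75):
    """Rank chunks against the query with Okapi BM25, implemented from scratch."""
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    docs = [tokenize(c) for c in chunks]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term

    def score(d):
        tf = Counter(d)
        total = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            total += idf * tf[t] * (k1 + 1) / norm
        return total

    ranked = sorted(range(n), key=lambda i: score(docs[i]), reverse=True)
    return [chunks[i] for i in ranked[:k]]
```

The top-k chunks returned here are what would get interpolated into the prompt as context; in practice a library like rank_bm25 does the same scoring for you.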

While embeddings and a vector database are not strictly necessary for this task, I wanted to get a sense of the ergonomics of working with one, so I used this as an excuse to try out LanceDB. LanceDB is an open-source, embedded vector database with the goal of simplifying retrieval, filtering, and management of embeddings. It's built on Apache Arrow, which I'm a big fan of.

You can find the code for this prototype in this GitHub repo.

Here's a quick demo of the chatbot in action: