Overview
The Epstein court archive is enormous, spans decades, and is split across motions, exhibits, depositions, and unsealed releases. Reading it end to end is not realistic. The point of this project is not to draw new conclusions about anybody but to make the corpus queryable. Retrieval-augmented generation fits that job: pull the relevant excerpts from the documents, hand them to a language model, and require it to cite the source.
How it's built
- Pulled every publicly released PDF: court releases, dockets, exhibits, depositions
- OCR'd the scanned ones; a lot of the older exhibits are image-only and unsearchable without it
- Chunked by structural boundary where the PDF has one, by token window otherwise, with file + page metadata preserved
- Embedded the chunks and stored them in a local vector database
- At query time: retrieve top-K chunks, build a prompt that demands citations, run it through a local LLM via Ollama
- Wired in as a dedicated assistant inside Open WebUI, alongside the other personas
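The chunking and retrieval steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `Chunk`, `chunk_page`, and `top_k` are hypothetical names, blank lines stand in for structural boundaries, whitespace-separated words stand in for model tokens, and a brute-force cosine ranking stands in for the vector database query.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Chunk:
    file: str   # source PDF name, kept so answers can cite it
    page: int   # page number within that PDF
    text: str


def chunk_page(file: str, page: int, text: str,
               window: int = 200, overlap: int = 40) -> list[Chunk]:
    """Split one page into chunks, keeping file and page on every chunk."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    # Structural boundary first (assumption: blank lines separate paragraphs).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[Chunk] = []
    for para in paragraphs:
        words = para.split()
        if len(words) <= window:
            chunks.append(Chunk(file, page, " ".join(words)))
            continue
        # Fallback: sliding window over words, with some overlap between chunks.
        step = window - overlap
        for start in range(0, len(words), step):
            chunks.append(Chunk(file, page, " ".join(words[start:start + window])))
            if start + window >= len(words):
                break
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float],
          indexed: list[tuple[list[float], Chunk]],
          k: int = 5) -> list[Chunk]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    ranked = sorted(indexed, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

In a real pipeline the vectors would come from an embedding model and the ranking from the vector store. The sketch only shows that file and page travel with every chunk, so anything retrieved can be traced back to its source.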
What it does well, what it doesn't
- Finds where something is mentioned across thousands of pages, fast; this is the high-value use case
- Surfaces the file name and page reference with every answer, so you can verify it manually
- OCR quality on older scanned exhibits is the real ceiling on retrieval, not the model
- Cannot draw conclusions the documents don't already support, and shouldn't pretend to; that's not RAG's job
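One way the citation behavior could be enforced is sketched below. It assumes a bracketed `[file, p. N]` citation format and that retrieved excerpts arrive as `(file, page, text)` tuples. The prompt wording and the names `build_prompt` and `unverified_citations` are illustrative, not taken from the project.

```python
from __future__ import annotations

import re

# Assumed citation format: [file, p. N]; file names containing commas would not match.
CITATION = re.compile(r"\[([^\[\],]+), p\. (\d+)\]")


def build_prompt(question: str, excerpts: list[tuple[str, int, str]]) -> str:
    """Assemble a prompt that restricts the model to the retrieved excerpts."""
    lines = [
        "Answer using only the excerpts below.",
        "Cite every claim as [file, p. N].",
        "If the excerpts do not answer the question, say so.",
        "",
    ]
    for file, page, text in excerpts:
        lines.append(f"[{file}, p. {page}] {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def unverified_citations(answer: str,
                         excerpts: list[tuple[str, int, str]]) -> list[tuple[str, int]]:
    """Return citations in the answer that do not match any retrieved excerpt."""
    retrieved = {(file, page) for file, page, _ in excerpts}
    cited = [(m.group(1), int(m.group(2))) for m in CITATION.finditer(answer)]
    return [c for c in cited if c not in retrieved]
```

The second function is a cheap sanity check: a citation that points at a file and page that were never retrieved is a sign the model made it up. It does not replace opening the cited page and reading it.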
Stack
Python
Ollama
Open WebUI
Vector DB
PyMuPDF
Tesseract
Live
Epstein AI runs inside Open WebUI on the same self-hosted stack as the rest of the projects. Access is gated through Patreon or the Telegram bot.