

SemanTrino: RAG for Data


Lately I’ve been very interested in building applied AI systems. My journey is about more than just writing code; it’s about solving real-world, high-stakes problems for clients. That’s why I’m building SemanTrino, an open-source project designed to tackle a common and frustrating issue for data analysts and engineers: the inability to easily discover and understand data within a large-scale data lake.

The Problem: A Data Lake You Can’t Talk To

Modern enterprises store vast amounts of data in distributed systems like Trino, a distributed SQL query engine designed for querying massive datasets. Trino makes data accessible, but it doesn’t make it understandable. A data analyst might know they need “customer order data,” but they have no way of knowing where that data lives among the thousands of schemas and tables in a data lake. Finding the right data is a manual, tedious, and frustrating process. My solution is a semantic search tool that lets you find what you’re looking for by asking a simple question, just like you would with Google.

The Solution: A RAG Pipeline for Data

SemanTrino is a Retrieval-Augmented Generation (RAG) pipeline. A RAG pipeline combines the power of a large language model (LLM) with a custom knowledge base, ensuring the LLM’s responses are grounded in accurate, up-to-date information. In my case, the knowledge base is the metadata and schema of the data lake.

Here’s how the architecture works:

Ingestion:

I connect to a Trino cluster and retrieve its schema (the catalogs, schemas, tables, and columns) along with their data types. I also pull small samples of raw data from key tables.
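
Here is a minimal sketch of that step using the `trino` Python client. The host, port, and user are placeholders for real connection settings, and `tpch.sf1.customer` stands in for a key table.

```python
# Minimal ingestion sketch using the `trino` Python client.
# Connection details are placeholders for real configuration.
import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="semantrino")
cur = conn.cursor()

cur.execute("SHOW CATALOGS")
catalogs = [row[0] for row in cur.fetchall()]

# Walk catalogs -> schemas -> tables -> columns via information_schema.
documents = []
for catalog in catalogs:
    cur.execute(
        f"SELECT table_schema, table_name, column_name, data_type "
        f"FROM {catalog}.information_schema.columns"
    )
    for schema, table, column, dtype in cur.fetchall():
        documents.append(f"{catalog}.{schema}.{table}: column {column} ({dtype})")

# Pull a small sample of raw data from a key table.
cur.execute("SELECT * FROM tpch.sf1.customer LIMIT 5")
samples = cur.fetchall()
```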

Vectorization:

Using a sentence-transformer model, I convert this information into vector embeddings. These vectors are numerical representations that capture the semantic meaning of the text. For example, the vector for “customer_id” would be semantically close to “cust_id,” even though the words are different.
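
As a sketch, here’s what that looks like with the sentence-transformers library; the specific model name is my assumption, not necessarily the one SemanTrino uses.

```python
# Vectorization sketch with sentence-transformers. The model name is an
# assumption; any sentence-embedding model works the same way.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed two differently spelled column descriptions.
embeddings = model.encode(["column customer_id (bigint)", "column cust_id (bigint)"])

# Semantically similar names land close together in vector space.
print(util.cos_sim(embeddings[0], embeddings[1]).item())  # high, despite the different spellings
```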

Vector Store:

These vectors are then stored in a vector store (ChromaDB), which is optimized for fast similarity search. It lets me query the store with a vector and retrieve the entries whose embeddings are most similar.
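
A sketch of that write path with ChromaDB’s Python client; the collection name, ids, and metadata fields are illustrative, and `embeddings` comes from the previous snippet.

```python
# Persisting embeddings to ChromaDB. Collection name, ids, and metadata
# are illustrative; `embeddings` comes from the vectorization sketch.
import chromadb

client = chromadb.PersistentClient(path="./semantrino_db")
collection = client.get_or_create_collection("trino_metadata")

collection.add(
    ids=["tpch.sf1.customer.custkey"],
    embeddings=[embeddings[0].tolist()],
    documents=["tpch.sf1.customer: column custkey (bigint)"],
    metadatas=[{"catalog": "tpch", "schema": "sf1", "table": "customer"}],
)
```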

Retrieval:

When a user asks a question like “Which table has customer order information?”, SemanTrino converts the question into a vector. It then queries the vector store for the most semantically similar pieces of information. This process retrieves the relevant table names and metadata.
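
In code, retrieval boils down to a single similarity query; this sketch reuses the `model` and `collection` objects from the snippets above.

```python
# Retrieval sketch: embed the question, then ask ChromaDB for the nearest
# metadata entries. `model` and `collection` come from the sketches above.
question = "Which table has customer order information?"
query_embedding = model.encode(question).tolist()

results = collection.query(query_embeddings=[query_embedding], n_results=5)
context = "\n".join(results["documents"][0])  # top matches, ready for the LLM
```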

Generation:

Finally, the retrieved context is passed to an LLM, which uses this information to formulate a coherent, natural-language response, such as “Customer order information is available in the tpch.sf1.customer table.”
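
The post doesn’t tie SemanTrino to a specific LLM, so treat this as one possible wiring of the generation step; the OpenAI client and model name here are assumptions, and `context` and `question` come from the retrieval sketch.

```python
# Generation sketch: ground the LLM's answer in the retrieved context.
# The OpenAI client and model name are assumptions, not project details.
from openai import OpenAI

llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Answer using only the provided data lake metadata."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```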

The Workflow: A Multi-Part System

SemanTrino is designed as a series of microservices to ensure modularity and scalability.

The TrinoMetadataRetriever:

This service is responsible for connecting to the Trino cluster and ingesting the metadata and sample data. It’s the first step in building the knowledge base.

The VectorTrino Microservice:

This is the heart of the system. It orchestrates the entire process: taking the raw data from the retriever, vectorizing it, and writing it to the ChromaDB vector store.
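
At its core, that orchestration might look something like this; the function is hypothetical and simply ties together the ingestion, vectorization, and storage sketches above.

```python
# Hypothetical orchestration inside VectorTrino: take the retriever's
# documents, vectorize them, and write them to ChromaDB in one pass.
def build_knowledge_base(documents: list[str]) -> None:
    vectors = model.encode(documents)
    collection.add(
        ids=[f"doc-{i}" for i in range(len(documents))],
        embeddings=[v.tolist() for v in vectors],
        documents=documents,
    )
```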

The QueryProcessor:

This is the front-facing service that takes a user’s natural language query, processes it through the RAG pipeline, and provides a final answer.
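
As a sketch of what that front end could look like, here’s a minimal HTTP endpoint; FastAPI and the route shape are my assumptions, not details from the project.

```python
# Hypothetical QueryProcessor endpoint exposing the RAG pipeline over HTTP.
# Reuses `model` and `collection` from the sketches above.
from fastapi import FastAPI

app = FastAPI()

@app.get("/ask")
def ask(question: str) -> dict:
    """Embed the question, retrieve context, and return it for generation."""
    query_embedding = model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=5)
    context = "\n".join(results["documents"][0])
    # In the full service, this context would be passed to the LLM as above.
    return {"question": question, "context": context}
```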

My next article in this series will focus on the VectorTrino microservice. I’ll dive deep into the technical challenges I faced in building it, from handling abstract classes and Pydantic validation to tackling common typos and distributed-system errors. Building it has taught me invaluable lessons and marks a major milestone in my transition from a systems support role to a forward-deployed engineer who can build and troubleshoot complex AI systems in the field.