SemanTrino: Spotlighting the VectorTrino
My last article introduced the overall RAG pipeline; now, I’ll dive into the heart of the system: the VectorTrino microservice. This is where the magic happens, turning raw data and metadata into a searchable knowledge base.
The VectorTrino Microservice:
The Architect’s View:
- The core purpose of VectorTrino is to orchestrate the vectorization process. It’s a self-contained service that acts as the bridge between the Trino data source and the ChromaDB vector store.
Its architecture is built around three key components:
The Retriever:
- A dedicated component (TrinoMetadataRetriever) that connects to our Trino cluster. Its job is to ingest two types of data:
Metadata:
- Information about the schema, tables, and columns. This helps us understand the structure of the data lake.
Sample Data:
- Small, representative samples from key tables. This helps the LLM understand the content and context of the data.
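To make this concrete, here is a minimal sketch of what such a retriever might look like using the trino Python client. The class name matches the article, but the connection settings, query shapes, and method names are my own illustrative assumptions, not the actual implementation.

```python
# Hypothetical sketch of the retriever; connection settings, method names,
# and query shapes are illustrative assumptions.
import trino

class TrinoMetadataRetriever:
    def __init__(self, host="localhost", port=8080, user="vectortrino"):
        self.conn = trino.dbapi.connect(host=host, port=port, user=user)

    def get_schema_metadata(self, catalog, schema):
        """Describe every table in a schema as a plain-text sentence."""
        cur = self.conn.cursor()
        cur.execute(
            f"SELECT table_name, column_name, data_type "
            f"FROM {catalog}.information_schema.columns "
            f"WHERE table_schema = '{schema}'"
        )
        tables = {}
        for table, column, dtype in cur.fetchall():
            tables.setdefault(table, []).append(f"{column} ({dtype})")
        return {
            table: f"The {table} table contains columns: {', '.join(cols)}."
            for table, cols in tables.items()
        }

    def get_sample_data(self, catalog, schema, table, limit=5):
        """Pull a few representative rows so the LLM can see real content."""
        cur = self.conn.cursor()
        cur.execute(f"SELECT * FROM {catalog}.{schema}.{table} LIMIT {limit}")
        columns = [d[0] for d in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
```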
The Vectorizer:
- This component uses an embedding model, specifically a SentenceTransformer, to convert the ingested text into dense vector embeddings. This is the crucial step that gives our data semantic meaning.
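In practice this step is only a couple of lines. The model checkpoint below is an assumption; any sentence-transformers model works the same way.

```python
# Minimal embedding sketch; the model checkpoint is an illustrative choice.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["The customer table contains a customer_id and a customer_name column."]
embeddings = model.encode(texts)   # one dense vector per input text
print(embeddings.shape)            # (1, 384) for this particular model
```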
The Loader:
- This component takes the vectorized data and loads it into our ChromaDB vector store. It’s responsible for storing the data in a way that’s optimized for fast similarity search.
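Here is a minimal sketch of the loading step, assuming a persistent local ChromaDB instance. The collection name, document ID, and placeholder vector are illustrative; in the real pipeline the vector comes from the SentenceTransformer step above.

```python
# Sketch of the loader; collection name and IDs are assumptions.
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("trino_metadata")

collection.add(
    ids=["hive.sales.customer"],
    documents=["The customer table contains a customer_id and a customer_name column."],
    # Placeholder vector; the real one comes from the SentenceTransformer above.
    embeddings=[[0.0] * 384],
    metadatas=[{"source": "trino_schema", "catalog": "hive", "table": "customer"}],
)
```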
This microservice-based design ensures that VectorTrino is modular, scalable, and easy to maintain. It can be run as a standalone process, allowing us to update our knowledge base on a schedule without affecting other parts of the system.
The Vectorization Workflow
The workflow within VectorTrino is a systematic process designed to handle a large and complex data schema. It’s an elegant dance between data ingestion, vectorization, and storage.
Orchestrating the Ingestion:
- The process begins by iterating through a predefined list of catalogs and schemas in our Trino cluster. For each schema, we call the TrinoMetadataRetriever to perform two key tasks, as shown in the sketch after this list:
- Retrieve the schema metadata for all tables within that schema.
- Retrieve sample data from each table.
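Here is roughly what that orchestration loop could look like, reusing the hypothetical retriever from the earlier sketch. The catalog and schema names are made-up examples.

```python
# Illustrative ingestion loop; catalogs, schemas, and helper names are assumptions.
CATALOGS_AND_SCHEMAS = {
    "hive": ["sales", "marketing"],
    "postgresql": ["public"],
}

retriever = TrinoMetadataRetriever()
schema_texts, sample_rows = [], []

for catalog, schemas in CATALOGS_AND_SCHEMAS.items():
    for schema in schemas:
        # Task 1: structural metadata for every table in the schema
        metadata = retriever.get_schema_metadata(catalog, schema)
        schema_texts.extend(metadata.values())
        # Task 2: representative sample rows from each table
        for table in metadata:
            sample_rows.append(retriever.get_sample_data(catalog, schema, table))
```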
Converting to Documents:
- As the data is retrieved, it’s immediately transformed into LangChain Document objects. This is a crucial step. A Document object is a simple container with two parts: page_content (the text to be vectorized, e.g., “The customer table contains a customer_id and a customer_name column.”) and metadata (a dictionary of attributes, e.g., {'source': 'trino_schema', 'catalog': 'hive', 'table': 'customer'}).
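For example, a single table description becomes a Document like this (using the langchain-core package; the field values simply mirror the example above):

```python
from langchain_core.documents import Document

doc = Document(
    page_content="The customer table contains a customer_id and a customer_name column.",
    metadata={"source": "trino_schema", "catalog": "hive", "table": "customer"},
)
```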
The Vectorization and Loading Loop:
- For each Document object:
- VectorTrino sends its page_content to the SentenceTransformer model, which converts the text into a numerical vector.
- The vector, along with its associated Document and metadata, is then sent to the ChromaDB loader.
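Put together, the loop is only a few lines. This sketch reuses the model, collection, and Document objects from the earlier snippets; the ID scheme is an assumption.

```python
# Minimal vectorize-and-load loop over LangChain Documents.
documents = [doc]  # in reality, every Document produced during ingestion

for i, d in enumerate(documents):
    vector = model.encode(d.page_content).tolist()
    collection.add(
        ids=[f"doc-{i}"],
        embeddings=[vector],
        documents=[d.page_content],
        metadatas=[d.metadata],
    )
```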
The ChromaDB Collections:
- The loader intelligently splits the data into two distinct ChromaDB collections:
Metadata Collection:
- Stores the vectorized schema information. This collection is for high-level, structural queries like “What is in the orders table?”
Raw Data Collection:
- Stores the vectorized sample data. This is for content-based queries like “Find all documents related to the customer table.”
This two-collection logic is a deliberate architectural choice. It lets us perform targeted, efficient searches: we can first query the metadata collection for relevant table names, then run a second, more focused search against the raw data collection. This coarse-to-fine retrieval can also be combined with techniques like HyDE (Hypothetical Document Embeddings), where a hypothetical answer is generated, embedded, and used as the search query.
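A rough sketch of that two-stage search, assuming the two collections are named trino_metadata and trino_raw_data and that each entry carries a table field in its metadata (both are assumptions consistent with the earlier loader sketch):

```python
# Illustrative two-stage search; collection and metadata field names are assumptions.
metadata_collection = client.get_or_create_collection("trino_metadata")
raw_data_collection = client.get_or_create_collection("trino_raw_data")

question = "Which tables store customer information?"
q_vec = model.encode(question).tolist()

# Stage 1: structural search over the schema descriptions
meta_hits = metadata_collection.query(query_embeddings=[q_vec], n_results=3)
candidate_tables = [m["table"] for m in meta_hits["metadatas"][0]]

# Stage 2: focused content search, restricted to the candidate tables
raw_hits = raw_data_collection.query(
    query_embeddings=[q_vec],
    n_results=5,
    where={"table": {"$in": candidate_tables}},
)
```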
The Next Steps: Troubleshooting the System
Building VectorTrino was an invaluable experience, but it wasn’t without its challenges. The journey from idea to working code was a continuous cycle of trial and error. My next two articles will dive deep into this process, providing a behind-the-scenes look at the problems I faced and the systematic approach I used to solve them. We’ll cover everything from tricky Pydantic validation errors to frustrating distributed system failures. It’s a real-world look at the debugging process that defines a forward-deployed engineer’s role.