
SemanTrino: Spotlighting the VectorTrino

My last article introduced the overall RAG pipeline; now, I’ll dive into the heart of the system: the VectorTrino microservice. This is where the magic happens, turning raw data and metadata into a searchable knowledge base.

The VectorTrino Microservice:

The Architect’s View:

  • The core purpose of VectorTrino is to orchestrate the vectorization process. It’s a self-contained service that acts as the bridge between the Trino data source and the ChromaDB vector store.

Its architecture is built around three key components:

The Retriever:

  • A dedicated component (TrinoMetadataRetriever) that connects to our Trino cluster. Its job is to ingest two types of data (a minimal retriever sketch follows this list):

Metadata:

  • Information about the schema, tables, and columns. This helps us understand the structure of the data lake.

Sample Data:

  • Small, representative samples from key tables. This helps the LLM understand the content and context of the data.
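The retriever’s actual code isn’t shown here, but a minimal sketch of what a TrinoMetadataRetriever could look like, built on the trino Python client, is below. The host, port, user, and queries are assumptions for illustration, not the real implementation.

```python
# Hypothetical retriever sketch; connection details and queries are assumed.
import trino

class TrinoMetadataRetriever:
    def __init__(self, host: str = "localhost", port: int = 8080, user: str = "vectortrino"):
        self.conn = trino.dbapi.connect(host=host, port=port, user=user)

    def get_schema_metadata(self, catalog: str, schema: str) -> list[tuple]:
        """Return (table_name, column_name, data_type) rows for every table in the schema."""
        cur = self.conn.cursor()
        cur.execute(
            f"SELECT table_name, column_name, data_type "
            f"FROM {catalog}.information_schema.columns "
            f"WHERE table_schema = '{schema}'"
        )
        return cur.fetchall()

    def get_sample_rows(self, catalog: str, schema: str, table: str, limit: int = 5) -> list[tuple]:
        """Return a small, representative sample of rows from a single table."""
        cur = self.conn.cursor()
        cur.execute(f"SELECT * FROM {catalog}.{schema}.{table} LIMIT {limit}")
        return cur.fetchall()
```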

The Vectorizer:

  • This component uses an embedding model, specifically a SentenceTransformer, to convert the ingested text into dense vector embeddings. This is the crucial step that gives our data semantic meaning.
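To make that concrete, the vectorizer boils down to a few lines of sentence-transformers code. The specific model checkpoint below is an assumption for illustration, not necessarily the one VectorTrino uses.

```python
# Minimal vectorizer sketch; the model checkpoint is assumed.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["The customer table contains a customer_id and a customer_name column."]
embeddings = model.encode(texts)   # numpy array of shape (len(texts), embedding_dim)
print(embeddings.shape)            # e.g. (1, 384) for this checkpoint
```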

The Loader:

  • This component takes the vectorized data and loads it into our ChromaDB vector store. It’s responsible for storing the data in a way that’s optimized for fast similarity search.
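Against the ChromaDB client API, a loader of this kind can be sketched as follows; the persistence path, collection name, and document ID convention are assumptions.

```python
# Loader sketch: persist already-vectorized text into ChromaDB.
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.PersistentClient(path="./chroma_db")               # assumed storage path
collection = client.get_or_create_collection(name="trino_metadata")  # assumed collection name

text = "The customer table contains a customer_id and a customer_name column."
vector = SentenceTransformer("all-MiniLM-L6-v2").encode(text).tolist()

collection.add(
    ids=["hive.sales.customer.schema"],                              # assumed ID convention
    documents=[text],
    embeddings=[vector],
    metadatas=[{"source": "trino_schema", "catalog": "hive", "table": "customer"}],
)
```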

  • This microservice-based design ensures that VectorTrino is modular, scalable, and easy to maintain. It can be run as a standalone process, allowing us to update our knowledge base on a schedule without affecting other parts of the system.

The Vectorization Workflow

  • The workflow within VectorTrino is a systematic process designed to handle a large and complex data schema. It’s an elegant dance between data ingestion, vectorization, and storage.

  • Orchestrating the Ingestion:

    • The process begins by iterating through a predefined list of catalogs and schemas in our Trino cluster. For each schema, we call the TrinoMetadataRetriever to perform two key tasks (see the orchestration sketch after this list):

    • Retrieve the schema metadata for all tables within that schema.

    • Retrieve sample data from each table.
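A minimal sketch of that orchestration loop, reusing the hypothetical TrinoMetadataRetriever from earlier, might look like this; the catalog and schema names are illustrative only.

```python
# Hypothetical ingestion orchestration; catalog/schema names are illustrative.
CATALOGS_AND_SCHEMAS = {
    "hive": ["sales", "marketing"],
    "postgresql": ["public"],
}

retriever = TrinoMetadataRetriever(host="trino.internal", port=8080, user="vectortrino")

schema_metadata, sample_data = {}, {}

for catalog, schemas in CATALOGS_AND_SCHEMAS.items():
    for schema in schemas:
        # Task 1: schema metadata for all tables in this schema.
        rows = retriever.get_schema_metadata(catalog, schema)
        schema_metadata[(catalog, schema)] = rows

        # Task 2: sample data from each table discovered above.
        for table in {table_name for table_name, _, _ in rows}:
            sample_data[(catalog, schema, table)] = retriever.get_sample_rows(catalog, schema, table)
```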

  • Converting to Documents:

    • As the data is retrieved, it’s immediately transformed into LangChain Document objects. This is a crucial step. A Document object is a simple container with two parts: page_content (the text to be vectorized, e.g., “The customer table contains a customer_id and a customer_name column.”) and metadata (a dictionary of attributes, e.g., {'source': 'trino_schema', 'catalog': 'hive', 'table': 'customer'}).
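Constructing such a Document is a one-liner. The import path below assumes a recent LangChain layout (langchain_core); older versions expose the same class from a different module.

```python
# Sketch of wrapping retrieved metadata in a LangChain Document.
from langchain_core.documents import Document

doc = Document(
    page_content="The customer table contains a customer_id and a customer_name column.",
    metadata={"source": "trino_schema", "catalog": "hive", "table": "customer"},
)
```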

The Vectorization and Loading Loop:

  • For each Document object:

    • VectorTrino sends its page_content to the SentenceTransformer model, which converts the text into a numerical vector.

    • The vector, along with its associated Document and metadata, is then sent to the ChromaDB loader.
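Tying the vectorizer and loader together, the loop might look roughly like this; the ID scheme and collection name are assumptions, and documents stands for the list of Document objects built in the previous step.

```python
# Hypothetical vectorize-and-load loop over the prepared Document objects.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("trino_metadata")

for i, doc in enumerate(documents):  # documents: list of LangChain Document objects
    vector = model.encode(doc.page_content).tolist()
    collection.add(
        ids=[f"doc-{i}"],
        documents=[doc.page_content],
        embeddings=[vector],
        metadatas=[doc.metadata],
    )
```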

The ChromaDB Collections:

  • The loader intelligently splits the data into two distinct ChromaDB collections:

Metadata Collection:

  • Stores the vectorized schema information. This collection is for high-level, structural queries like “What is in the orders table?”

Raw Data Collection:

  • Stores the vectorized sample data. This is for content-based queries like “Find all documents related to the customer table.”

This two-collection logic is a deliberate architectural choice. It allows us to perform targeted, efficient searches: we can query the metadata collection for relevant table names, then run a second, more focused search on the raw data collection, a two-stage, coarse-to-fine retrieval pattern.
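A sketch of that two-stage search, with assumed collection names (trino_metadata and trino_raw_data) and an assumed question:

```python
# Hypothetical two-stage (coarse-to-fine) query across both collections.
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./chroma_db")
metadata_col = client.get_or_create_collection("trino_metadata")
raw_data_col = client.get_or_create_collection("trino_raw_data")

question = "Which table holds customer names?"
q_vec = model.encode(question).tolist()

# Stage 1: structural search over schema metadata to find candidate tables.
meta_hits = metadata_col.query(query_embeddings=[q_vec], n_results=3)
candidate_tables = [m["table"] for m in meta_hits["metadatas"][0]]

# Stage 2: focused content search over sample data, restricted to those tables.
raw_hits = raw_data_col.query(
    query_embeddings=[q_vec],
    n_results=5,
    where={"table": {"$in": candidate_tables}},
)
print(raw_hits["documents"][0])
```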

The Next Steps: Troubleshooting the System

Building VectorTrino was an invaluable experience, but it wasn’t without its challenges. The journey from idea to working code was a continuous cycle of trial and error. My next two articles will dive deep into this process, providing a behind-the-scenes look at the problems I faced and the systematic approach I used to solve them. We’ll cover everything from tricky Pydantic validation errors to frustrating distributed system failures. It’s a real-world look at the debugging process that defines a forward-deployed engineer’s role.