
LLMOps Transform: Chunking, Metadata, and Vector Generation


The LLMOps Transform: Chunking, Metadata, and Vector Generation

This is the third article in this series, and we’ll be delving into the crucial data transformation step, where data is prepped for the LLM using embeddings and structured metadata.

Transformation Logic: _get_relevant_documents

The _get_relevant_documents method performs the Transformation (T) step of the vectorization ETL:

  • Data-to-Text Chunking
    • It iterates over the Pandas DataFrames (schema_df, raw_data_df) and converts structured rows into unstructured text strings (content). This is the unit of information the embedding model can process
  • Metadata Enrichment
    • For each document, we attach rich, structured metadata. This serves two purposes
      • It allows the RAG query to filter search results before the vector search runs
      • It provides the LLM with structured facts about the source of the text chunk
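A minimal sketch of this chunking-and-enrichment step. The column names and sample rows below are hypothetical stand-ins for the real schema_df/raw_data_df contents, but the pattern is the same: flatten each row into a text string and carry its structured fields alongside as metadata.

```python
import pandas as pd

# Illustrative stand-in for raw_data_df (real columns may differ)
raw_data_df = pd.DataFrame({
    "table": ["orders", "customers"],
    "column": ["order_id", "email"],
    "description": ["Primary key for orders", "Customer contact email"],
})

documents, metadatas = [], []
for _, row in raw_data_df.iterrows():
    # Data-to-text chunking: flatten the structured row into one string,
    # the unit of information the embedding model can process
    content = f"Table {row['table']}, column {row['column']}: {row['description']}"
    documents.append(content)
    # Metadata enrichment: structured fields that support pre-filtering
    # and give the LLM facts about the chunk's source
    metadatas.append({"table": row["table"], "column": row["column"]})
```

Keeping the metadata as a parallel list (rather than baking it into the text) is what makes the later filtered search possible.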

Data Loading Logic: Vectorization.LoadData method

  • This method is where the abstract concepts of data lineage and semantic representation converge.
  • Data Lineage and Reproducibility
    • We assign a unique ID to every ingested document chunk for data governance and auditability
      • Action
        • The method uses Python’s uuid.uuid4() to generate a universally unique identifier for every text chunk
      • Value in RAG
        • This UUID serves as the non-negotiable primary key for the entire lifecycle of that knowledge fragment
      • Audit Trail
        • If the Semantrino API generates a SQL query based on a retrieved chunk, the query result can be traced backward using this UUID.
  • Vector Insertion and Semantic Finalization
    • The collection.add() call is the final, atomic step where the data is transformed into a usable resource for semantic search
      • Transformation
        • When the add() method is called on the ChromaDB collection, two things happen
          • The collection client passes the incoming documents to the pre-configured Embedding Model
          • The model converts each string into a high-dimensional numeric array(the vector)
      • Storage
        • ChromaDB stores the following three components together as a single record
          • Vector: the numerical embedding
          • Metadata: the structured information that enables filtered search later
          • ID: the generated UUID
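The loading step above can be sketched as follows. The document and metadata values are illustrative, and the collection.add() call assumes a ChromaDB collection configured elsewhere with an embedding function, so it is shown but left commented out here.

```python
import uuid

# Chunks and metadata produced by the transformation step (illustrative values)
documents = [
    "Table orders, column order_id: Primary key for orders",
    "Table customers, column email: Customer contact email",
]
metadatas = [
    {"table": "orders", "column": "order_id"},
    {"table": "customers", "column": "email"},
]

# Data lineage: a uuid4 per chunk serves as the primary key and audit handle
# for that knowledge fragment's entire lifecycle
ids = [str(uuid.uuid4()) for _ in documents]

# Final atomic insert -- the collection embeds each document string and
# stores vector, metadata, and ID together as a single record:
# collection.add(ids=ids, documents=documents, metadatas=metadatas)
```

Because the IDs are generated before insertion, the caller can log them immediately, which is what makes the backward trace from a query result to its source chunk possible.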

By performing vector insertion in this way, the system guarantees that the search space is consistently generated and that every piece of information has the necessary structured context attached. Thanks for reading, and I’ll see you in part 4 of this series!