Home

Published

- 2 min read

Trino Troubleshooting: Runtime Adhoc Schema Inference

img of Trino Troubleshooting: Runtime Adhoc Schema Inference

How to deal with runtime adhoc schema inference in Starburst/Trino

True runtime adhoc schema inference is not inherently available for formats like parquet,ORC, JSON , or CSV in trino. The reason for this is that when creating tables from these file formats, it’s necessary to explicitly declare the schema in the CREATE TABLE statement. THis means that the user must specify column names and types.

Starbursts Schema Discovery Solution

The discover_schema procedure in Starburst is designed to address the challenges of schema inference when querying data stored in specific formats within object storage. The discover_schema procedure automatically scans files located in objet storage to infer the schema.

Workflow

Procedure Call: To use the discover_schema, you initiate it with a SQL command like:

CALL hive.system.discover_schema(
    catalog => 'hive',               -- Target catalog
    schema => 'my_schema',           -- Target schema/database
    table => 'my_table',             -- Desired name for the new table
    location => 's3a://my-bucket/path/',  -- S3 location to analyze
    format => 'PARQUET'              -- Data file format (supports PARQUET, CSV, JSON, etc.)
);

Return Value: The discover_schema procedure returns a CREATE TABLE Data Definition Language (DDL) statement, which contains inferred columns and types based on the scanned data files. Manual Execution Required: After receiving the DDL statement, users must manually execute it to register the table in the catalog. This step is critical for making the inferred schema available for querying.

Key Points to consider

The Discover_schema procedure supports formats such as Parquet, ORC, JSON and CSV It’s important to note that schema inference is not dynamic. If the structure of the underlying files changes, the schema discovery process will then have to be re-run and the DDL executed manually to reflect the updates Conclusion

The discover schema procedure streamlines the process of schema management within Starburst by automatically inferring file schemas.