Mage AI - DB profiling - Parquet file generation for Quality profiling

Goals:

Define the payload for the new approach with the following attributes.

URL of the object storage repository from which the data is obtained. The object storage will be internal to Datamite; using external object storage still needs to be discussed internally, so for now we start with the internal one.
Additional attributes that help with processing (for example, the file format).

Update the current DB implementation, which is based on the column approach, so that it generates Parquet files instead.

Payload examples:

Parquet file:

{
  "dataset_id": "123e4567-e89b-12d3-a456-426614174000",
  "object_storage_url": "https://example.com/data/mydata.parquet",
  "mime_type": "application/vnd.apache.parquet",
  "file_name": "mydata.parquet",
  "column_types": {
    "column_1": "string",
    "column_2": "numeric"
  },
  "charset": "UTF_8"
}

CSVs:

{
  "dataset_id": "123e4567-e89b-12d3-a456-426614174001",
  "object_storage_url": "https://example.com/data/myfile.csv",
  "mime_type": "text/csv",
  "file_name": "myfile.csv",
  "csv_options": {
    "field_delimiter": ",",
    "decimal_delimiter": "."
  },
  "column_types": {
    "column_1": "string",
    "column_2": "numeric"
  },
  "charset": "UTF_8"
}
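
As an illustration of how `csv_options` and `column_types` could drive parsing, here is a minimal Python sketch using only the standard library. The function name `parse_csv_rows` and the choice of `Decimal` for `numeric` columns are assumptions made for this example, not part of the current implementation; thousands separators and `date` columns are not handled.

```python
import csv
import io
from decimal import Decimal


def parse_csv_rows(text, payload):
    """Parse CSV text into rows, converting `numeric` columns to Decimal.

    Hypothetical helper: it only reads `csv_options` and `column_types`
    from the payload and assumes the file content is already decoded.
    """
    options = payload.get("csv_options", {})
    field_delimiter = options.get("field_delimiter", ",")
    decimal_delimiter = options.get("decimal_delimiter", ".")
    column_types = payload["column_types"]

    reader = csv.DictReader(io.StringIO(text), delimiter=field_delimiter)
    rows = []
    for raw in reader:
        row = {}
        for name, value in raw.items():
            if column_types.get(name) == "numeric" and value != "":
                # Normalise the decimal separator before converting.
                row[name] = Decimal(value.replace(decimal_delimiter, "."))
            else:
                # Strings (and empty values) are kept as they are.
                row[name] = value
        rows.append(row)
    return rows


payload = {
    "csv_options": {"field_delimiter": ";", "decimal_delimiter": ","},
    "column_types": {"column_1": "string", "column_2": "numeric"},
}
sample = "column_1;column_2\nabc;1,5\ndef;2\n"
rows = parse_csv_rows(sample, payload)
```

With a semicolon as field delimiter and a comma as decimal delimiter, the value `1,5` in `column_2` is read as the number 1.5, while `column_1` stays a string.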

Summary of the attributes for the different formats

| Attribute | Type | Description | Mandatory |
| --- | --- | --- | --- |
| `dataset_id` | `string` | Unique identifier for the dataset (UUID). | ✅ Yes |
| `object_storage_url` | `string` | URL pointing to the file in object storage. | ✅ Yes |
| `mime_type` | `string` | MIME type of the file (e.g., `text/csv`, `application/vnd.apache.parquet`, `application/json`). It must be a standard MIME type as registered at https://www.iana.org/assignments/media-types/media-types.xhtml | ✅ Yes |
| `file_name` | `string` | Name of the file, including extension. | ✅ Yes |
| `csv_options` | `object` | Options specific to CSV files (only applicable if `mime_type = text/csv`). | ❌ No |
| ├ `field_delimiter` | `string` | Character used to separate fields in CSV (e.g., `,`, `;`). | ✅ If CSV |
| ├ `decimal_delimiter` | `string` | Character used as the decimal separator (e.g., `.`, `,`). | ✅ If CSV |
| `parquet_options` | `object` | Options specific to Parquet files (only applicable if `mime_type = application/vnd.apache.parquet`). | ❌ No |
| ├ `compression` | `string` | Compression codec for the Parquet file (`snappy`, `gzip`, `brotli`, etc.). | ❌ No |
| `json_options` | `object` | Options specific to JSON files (only applicable if `mime_type = application/json`). | ❌ No |
| ├ `compression` | `string` | Compression format for the JSON file (`none`, `gzip`, `brotli`). | ❌ No |
| ├ `json_format` | `string` | Defines the JSON structure: `"records"` (array of objects) or `"lines"` (newline-separated objects). | ❌ No |
| `column_types` | `object` | Mapping of column names to their data types. | ✅ Yes |
| ├ `column_1`, `column_2`, ... | `string` | Type of data in the column (`string`, `numeric`, `date`). These types must be used to compute the KPIs relevant to each data type. | ✅ Yes |
| `charset` | `string` | Character encoding of the file (e.g., `UTF_8`). | ✅ Yes |
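
The attribute rules above could be checked on the receiving side before any processing starts. The following Python sketch is one possible way to do that; `validate_payload` and the error messages are hypothetical, and it assumes that `csv_options` with both delimiters is required whenever `mime_type` is `text/csv`.

```python
import uuid

# Attributes marked as mandatory in the summary table.
REQUIRED = (
    "dataset_id",
    "object_storage_url",
    "mime_type",
    "file_name",
    "column_types",
    "charset",
)
ALLOWED_COLUMN_TYPES = {"string", "numeric", "date"}


def validate_payload(payload):
    """Return a list of problems; an empty list means the checks passed."""
    errors = [f"missing attribute: {key}" for key in REQUIRED if key not in payload]
    if errors:
        return errors

    try:
        uuid.UUID(payload["dataset_id"])
    except (ValueError, AttributeError, TypeError):
        errors.append("dataset_id is not a valid UUID")

    # Assumption: the CSV delimiters are mandatory only for CSV files.
    if payload["mime_type"] == "text/csv":
        csv_options = payload.get("csv_options") or {}
        for key in ("field_delimiter", "decimal_delimiter"):
            if key not in csv_options:
                errors.append(f"missing csv_options.{key}")

    column_types = payload["column_types"]
    if not isinstance(column_types, dict) or not column_types:
        errors.append("column_types must be a non-empty object")
    else:
        for column, kind in column_types.items():
            if kind not in ALLOWED_COLUMN_TYPES:
                errors.append(f"unsupported type for {column}: {kind}")
    return errors


csv_payload = {
    "dataset_id": "123e4567-e89b-12d3-a456-426614174001",
    "object_storage_url": "https://example.com/data/myfile.csv",
    "mime_type": "text/csv",
    "file_name": "myfile.csv",
    "csv_options": {"field_delimiter": ",", "decimal_delimiter": "."},
    "column_types": {"column_1": "string", "column_2": "numeric"},
    "charset": "UTF_8",
}
problems = validate_payload(csv_payload)
```

Returning a list of messages instead of raising on the first problem lets the caller report every issue in a rejected payload at once.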

Sequence Diagram

Edited Feb 25, 2025 by Jerónimo Pla