Mage AI - DB profiling - Parquet file generation for Quality profiling

Goals:

  • Define the payload for the new approach with the following attributes.
  1. URL of the object storage repository from which the data is obtained. The object storage will be internal to Datamite. The use of external object storage for Datamite still needs to be discussed internally; for the moment we will start with the internal one.
  2. Additional attributes that help with the processing (for example, the file format).
  • Update the current implementation for DBs, moving from the column-based approach to Parquet file generation.

Payload examples:

  • Parquet file:
{
  "dataset_id": "123e4567-e89b-12d3-a456-426614174000",
  "object_storage_url": "https://example.com/data/mydata.parquet",
  "mime_type": "application/vnd.apache.parquet",
  "file_name": "mydata.parquet",
  "column_types": {
    "column_1": "string",
    "column_2": "numeric"
  },
  "charset": "UTF_8"
}

  • CSV file:

{
  "dataset_id": "123e4567-e89b-12d3-a456-426614174001",
  "object_storage_url": "https://example.com/data/myfile.csv",
  "mime_type": "text/csv",
  "file_name": "myfile.csv",
  "csv_options": {
    "field_delimiter": ",",
    "decimal_delimiter": "."
  },
  "column_types": {
    "column_1": "string",
    "column_2": "numeric"
  },
  "charset": "UTF_8"
}
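
As an illustration of how csv_options, column_types and charset from a payload like the CSV example above might be applied when reading the file, here is a minimal sketch using only the Python standard library. The function name parse_csv_rows is hypothetical, the mapping from "UTF_8" to a Python codec name is an assumption, and only the "numeric" type is converted; this is not the actual Mage AI implementation.

```python
import csv
import io
from decimal import Decimal


def parse_csv_rows(raw_bytes, payload):
    """Decode and parse CSV content using the options carried in the payload.

    Hypothetical helper: the name and behaviour are illustrative only.
    """
    # Assumption: charset names such as "UTF_8" map to Python codec names
    # once underscores are replaced by hyphens ("UTF_8" -> "utf-8").
    encoding = payload["charset"].replace("_", "-").lower()

    options = payload.get("csv_options", {})
    field_delimiter = options.get("field_delimiter", ",")
    decimal_delimiter = options.get("decimal_delimiter", ".")

    text = raw_bytes.decode(encoding)
    reader = csv.DictReader(io.StringIO(text), delimiter=field_delimiter)

    rows = []
    for record in reader:
        row = {}
        for column, declared_type in payload["column_types"].items():
            value = record.get(column)
            if value is None or value == "":
                # Treat empty or absent cells as missing values.
                row[column] = None
            elif declared_type == "numeric":
                # Normalise the decimal separator before converting.
                row[column] = Decimal(value.replace(decimal_delimiter, "."))
            else:
                # "string" and "date" values are left as text in this sketch.
                row[column] = value
        rows.append(row)
    return rows


# Usage with a semicolon-separated file that uses a decimal comma.
example_payload = {
    "charset": "UTF_8",
    "csv_options": {"field_delimiter": ";", "decimal_delimiter": ","},
    "column_types": {"column_1": "string", "column_2": "numeric"},
}
example_rows = parse_csv_rows(b"column_1;column_2\nabc;1,5\nxyz;\n", example_payload)
```

Using Decimal rather than float keeps the numeric values exactly as written in the file, which may matter for quality KPIs; thousands separators are not handled here.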

Summary of the attributes for the different formats

| Attribute | Type | Description | Mandatory |
|---|---|---|---|
| dataset_id | string | Unique identifier for the dataset (UUID). | Yes |
| object_storage_url | string | URL pointing to the file in object storage. | Yes |
| mime_type | string | MIME type of the file (e.g., text/csv, application/vnd.apache.parquet, application/json). It must be a standard IANA media type: https://www.iana.org/assignments/media-types/media-types.xhtml | Yes |
| file_name | string | Name of the file, including extension. | Yes |
| csv_options | object | Options specific to CSV files (only applicable if mime_type = text/csv). | No |
| field_delimiter (in csv_options) | string | Character used to separate fields in CSV (e.g., "," or ";"). | If CSV |
| decimal_delimiter (in csv_options) | string | Character used as the decimal separator (e.g., "." or ","). | If CSV |
| parquet_options | object | Options specific to Parquet files (only applicable if mime_type = application/vnd.apache.parquet). | No |
| compression (in parquet_options) | string | Compression codec for the Parquet file (snappy, gzip, brotli, etc.). | No |
| json_options | object | Options specific to JSON files (only applicable if mime_type = application/json). | No |
| compression (in json_options) | string | Compression format for the JSON file (none, gzip, brotli). | No |
| json_format (in json_options) | string | Defines the JSON structure: "records" (array of objects) or "lines" (newline-separated objects). | No |
| column_types | object | Mapping of column names to their data types. | Yes |
| column_1, column_2, ... (in column_types) | string | Type of data in the column (string, numeric, date). These types must be used to compute the relevant KPIs for each data type. | Yes |
| charset | string | Character encoding of the file (e.g., UTF_8). | Yes |
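
The mandatory and conditional attributes summarized above could be checked before a payload is processed. The following is a minimal sketch under stated assumptions: the function name validate_payload is hypothetical, "If CSV" is interpreted as "required when mime_type is text/csv", and the allowed column types are limited to the three listed in the summary. It is not the actual validation used by the service.

```python
import uuid

# Attributes marked "Yes" in the Mandatory column of the summary.
MANDATORY_ATTRIBUTES = (
    "dataset_id",
    "object_storage_url",
    "mime_type",
    "file_name",
    "column_types",
    "charset",
)
# Assumption: these are the only column types accepted.
ALLOWED_COLUMN_TYPES = ("string", "numeric", "date")
# Assumption: "If CSV" means required when mime_type is text/csv.
CSV_REQUIRED_OPTIONS = ("field_delimiter", "decimal_delimiter")


def validate_payload(payload):
    """Return a list of problems found; an empty list means the payload passed."""
    errors = []

    for name in MANDATORY_ATTRIBUTES:
        if name not in payload:
            errors.append(f"missing mandatory attribute: {name}")

    if "dataset_id" in payload:
        try:
            uuid.UUID(str(payload["dataset_id"]))
        except ValueError:
            errors.append("dataset_id is not a valid UUID")

    column_types = payload.get("column_types", {})
    if not isinstance(column_types, dict):
        errors.append("column_types must be an object")
    else:
        for column, declared_type in column_types.items():
            if declared_type not in ALLOWED_COLUMN_TYPES:
                errors.append(f"unsupported type for column {column}: {declared_type}")

    if payload.get("mime_type") == "text/csv":
        csv_options = payload.get("csv_options") or {}
        for name in CSV_REQUIRED_OPTIONS:
            if name not in csv_options:
                errors.append(f"missing CSV option: {name}")

    return errors
```

Returning a list of problems instead of raising on the first one lets the caller report every issue with a payload in a single response.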

Sequence Diagram

[Image: Datamite_Discovery_Connectors_10_]

Edited by Jerónimo Pla