Mage AI - DB pipeline improvements: DB sampling

Goals:

  • Define a sampling strategy (SQL, NoSQL, and Object Storage).
  • Implement a sampling solution based on the strategy above.
  • Pass the sampled data to the Quality Evaluator service and store the resulting information through the Governance API (sketched below).
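
A minimal sketch of this hand-off, assuming both services expose plain HTTP endpoints; the URLs, routes, and payload fields below are illustrative placeholders, not the actual DATAMITE contracts:

import requests

# Hypothetical endpoints; the real routes are defined by the DATAMITE services.
QUALITY_EVALUATOR_URL = "http://quality-evaluator:8080/evaluate"
GOVERNANCE_API_URL = "http://governance-api:8080/profiles"

def evaluate_and_store(dataset_id, sample_rows):
    # 1. Send the sampled rows to the Quality Evaluator for profiling.
    evaluation = requests.post(
        QUALITY_EVALUATOR_URL,
        json={"dataset_id": dataset_id, "rows": sample_rows},
        timeout=60,
    )
    evaluation.raise_for_status()

    # 2. Persist the resulting quality report through the Governance API.
    stored = requests.post(
        GOVERNANCE_API_URL,
        json={"dataset_id": dataset_id, "report": evaluation.json()},
        timeout=60,
    )
    stored.raise_for_status()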

In this scenario, a strategy is needed to manage large volumes of data; the default approach will be sampling, profiling a representative subset instead of the full dataset.
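
For SQL sources, sampling can often be pushed down to the database itself. A minimal sketch, assuming a PostgreSQL source accessed through psycopg2 (the connection string and table name are placeholders; NoSQL and Object Storage sources would need equivalent mechanisms):

import psycopg2
from psycopg2 import sql

SAMPLING = 10  # percentage of rows to sample

def fetch_sql_sample(conn_str, table):
    # TABLESAMPLE SYSTEM picks a random subset of pages, so it is fast
    # but returns only approximately SAMPLING% of the rows (PostgreSQL syntax).
    query = sql.SQL("SELECT * FROM {} TABLESAMPLE SYSTEM (%s)").format(
        sql.Identifier(table)
    )
    with psycopg2.connect(conn_str) as conn:
        with conn.cursor() as cur:
            cur.execute(query, (SAMPLING,))
            return cur.fetchall()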

DATAMITE strategy:

datamite-sampling-strategy.md

Pseudo-code proposal:

# Configuration parameters (defined in the framework)
LOWER_THRESHOLD = <value>  # e.g., 10,000 rows
UPPER_THRESHOLD = <value>  # e.g., 500,000 rows
SAMPLING = <percentage>    # Percentage of data to sample (e.g., 10%)

# User-defined parameters via interface
ALL_DATA = <true/false>  # User choice to force full profiling
SCOPE = <"NEWEST" or "ALL">  # Profiling scope
IF SCOPE == "NEWEST":
    NEWEST = <percentage>  # Percentage of newest data to profile

# Determine the data size based on the selected scope
IF SCOPE == "NEWEST":
    COUNT = GET_NEWEST_DATA_COUNT(NEWEST%)  # Count rows within the newest NEWEST% window
ELSE:
    COUNT = GET_TOTAL_DATA_COUNT()  # Use full dataset

# Processing logic based on COUNT
IF COUNT < LOWER_THRESHOLD:
    PROFILE_ALL_DATA()

ELSE IF COUNT < UPPER_THRESHOLD:
    IF ALL_DATA:
        PROFILE_ALL_DATA()
    ELSE:
        PROFILE_SAMPLED_DATA(SAMPLING% * COUNT)  # Sample size as a row count

ELSE:  # COUNT ≥ UPPER_THRESHOLD
    IF ALL_DATA:
        # Execute the advanced mode (not implemented yet)
        PROFILE_ADVANCED_MODE()
    ELSE:
        # Profile a limited sample of the data to prevent excessive processing.
        # The number of profiled rows is either:
        # - The upper threshold (to cap processing time)
        # - Or the calculated sample size (SAMPLING% * COUNT), whichever is smaller.
        PROFILE_SAMPLED_DATA(MIN(UPPER_THRESHOLD, SAMPLING% * COUNT))
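
For reference, the same branching logic as a runnable Python sketch; the function name and threshold values are illustrative, and the real profiling calls would replace the returned mode strings:

LOWER_THRESHOLD = 10_000    # rows
UPPER_THRESHOLD = 500_000   # rows
SAMPLING = 0.10             # 10% sample

def choose_profiling(count, all_data):
    """Return (mode, rows_to_profile) for a dataset of `count` rows."""
    if count < LOWER_THRESHOLD:
        # Small dataset: always profile everything.
        return ("all", count)
    if count < UPPER_THRESHOLD:
        # Medium dataset: full profiling only if the user forces it.
        return ("all", count) if all_data else ("sampled", int(SAMPLING * count))
    # Large dataset (count >= UPPER_THRESHOLD).
    if all_data:
        return ("advanced", count)  # advanced mode, not implemented yet
    # Cap the sample at UPPER_THRESHOLD rows to bound processing time.
    return ("sampled", min(UPPER_THRESHOLD, int(SAMPLING * count)))

# Example: 2,000,000 rows without forced full profiling profiles
# min(500,000, 10% * 2,000,000) = 200,000 rows.
assert choose_profiling(2_000_000, all_data=False) == ("sampled", 200_000)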

/cc @jpla
