Mage AI - DB pipeline improvements: DB sampling
Goals:
- Define a sampling strategy (SQL, NoSQL, and Object Storage).
- Implement a sampling solution based on the above strategy.
- Pass the extracted data to the Quality Evaluator service and store all the resulting information through the Governance API (see the sketch below).
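As a rough illustration of the last goal only, here is a minimal sketch of the handoff. The endpoint URLs, payload fields, and function name are all hypothetical assumptions for illustration; the real Quality Evaluator and Governance API contracts are not defined in this issue:

import requests

# Hypothetical endpoints; the actual Quality Evaluator and Governance API
# URLs and payload schemas are not specified in this issue.
QUALITY_EVALUATOR_URL = "http://quality-evaluator:8080/evaluate"
GOVERNANCE_API_URL = "http://governance-api:8080/profiles"

def evaluate_and_store(sample_rows: list[dict], dataset_id: str) -> None:
    """Send sampled rows to the Quality Evaluator, then persist the result."""
    # 1. Ask the Quality Evaluator to profile the sampled data.
    evaluation = requests.post(
        QUALITY_EVALUATOR_URL,
        json={"dataset_id": dataset_id, "rows": sample_rows},
        timeout=60,
    )
    evaluation.raise_for_status()

    # 2. Store the evaluation result through the Governance API.
    response = requests.post(
        GOVERNANCE_API_URL,
        json={"dataset_id": dataset_id, "report": evaluation.json()},
        timeout=60,
    )
    response.raise_for_status()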
In this scenario, a strategy must be defined to manage large volumes of data. The default approach will be sampling.
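For SQL sources, sampling can often be pushed down to the database engine itself. A minimal sketch, assuming a PostgreSQL source (TABLESAMPLE is PostgreSQL 9.5+ syntax; NoSQL and object-storage sources would need their own mechanisms, and the connection string and function name here are illustrative only):

import psycopg2

# Assumed connection parameters; adjust to the actual source.
conn = psycopg2.connect("dbname=datamite user=profiler")

def fetch_sample(table: str, percentage: float) -> list[tuple]:
    """Fetch roughly `percentage`% of the rows using PostgreSQL's TABLESAMPLE.

    BERNOULLI scans the whole table but yields a statistically uniform sample;
    SYSTEM is faster but returns whole pages, so rows come in clusters.
    """
    with conn.cursor() as cur:
        # The sampling percentage is passed directly, e.g. 10 for 10%.
        # The table name cannot be parameterized and is assumed trusted here.
        cur.execute(
            f"SELECT * FROM {table} TABLESAMPLE BERNOULLI (%s)",
            (percentage,),
        )
        return cur.fetchall()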
DATAMITE strategy:
Pseudo-code proposal:
# Configuration parameters (defined in the framework)
LOWER_THRESHOLD = <value>   # e.g., 10,000 rows
UPPER_THRESHOLD = <value>   # e.g., 500,000 rows
SAMPLING = <percentage>     # Percentage of data to sample (e.g., 10%)

# User-defined parameters via the interface
ALL_DATA = <true/false>     # User choice to force full profiling
SCOPE = <"NEWEST" or "ALL"> # Profiling scope
IF SCOPE == "NEWEST":
    NEWEST = <percentage>   # Percentage of the newest data to profile

# Determine the data size based on the selected scope
IF SCOPE == "NEWEST":
    COUNT = GET_TOP_NEWEST_DATA_COUNT(NEWEST%)  # Count only the newest NEWEST% of the data
ELSE:
    COUNT = GET_TOTAL_DATA_COUNT()              # Count the full dataset

# Processing logic based on COUNT
IF COUNT < LOWER_THRESHOLD:
    PROFILE_ALL_DATA()
ELSE IF COUNT < UPPER_THRESHOLD:
    IF ALL_DATA:
        PROFILE_ALL_DATA()
    ELSE:
        PROFILE_SAMPLED_DATA(SAMPLING% * COUNT)  # Sample size in rows
ELSE:  # COUNT >= UPPER_THRESHOLD
    IF ALL_DATA:
        # Execute the advanced mode (not implemented yet)
        PROFILE_ADVANCED_MODE()
    ELSE:
        # Profile a limited sample of the data to prevent excessive processing.
        # The number of profiled rows is the smaller of:
        #   - the upper threshold (to cap processing time), or
        #   - the calculated sample size (SAMPLING% * COUNT).
        PROFILE_SAMPLED_DATA(MIN(UPPER_THRESHOLD, SAMPLING% * COUNT))
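A minimal Python sketch of the same decision logic, with the profiling modes reduced to return values and the function name chosen here for illustration:

# Framework configuration (example values from above).
LOWER_THRESHOLD = 10_000    # rows
UPPER_THRESHOLD = 500_000   # rows
SAMPLING = 0.10             # fraction of data to sample (10%)

def choose_profiling(count: int, all_data: bool) -> tuple[str, int]:
    """Return the profiling mode and the number of rows to profile.

    `count` is the row count for the selected scope (NEWEST or ALL);
    `all_data` is the user's choice to force full profiling.
    """
    if count < LOWER_THRESHOLD:
        return ("full", count)
    if count < UPPER_THRESHOLD:
        if all_data:
            return ("full", count)
        return ("sampled", int(SAMPLING * count))
    # count >= UPPER_THRESHOLD
    if all_data:
        return ("advanced", count)  # advanced mode: not implemented yet
    # Cap the sample at UPPER_THRESHOLD rows to bound processing time.
    return ("sampled", min(UPPER_THRESHOLD, int(SAMPLING * count)))

# Example: 2,000,000 rows without forcing full profiling
# -> ("sampled", 200000); the cap would apply only above 5,000,000 rows.
print(choose_profiling(2_000_000, all_data=False))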
/cc @jpla