[Quality Evaluator] module missing in Spark Job when processing Parquet files
We're having issues processing Parquet files in the Spark job. pandas cannot find a Parquet engine: neither pyarrow nor fastparquet is installed in the job's Python environment.
Spark Job logs:
Proceeding to load the file...
Entered the parquet branch.
Traceback (most recent call last):
File "/app/spark_jobs/evaluator_kpi_library_data_payload.py", line 397, in <module>
main(sys.argv)
File "/app/spark_jobs/evaluator_kpi_library_data_payload.py", line 312, in main
df = pd.read_parquet(io.BytesIO(response.content))
File "/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 493, in read_parquet
impl = get_engine(engine)
File "/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 60, in get_engine
raise ImportError(
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
- Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
- Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
25/03/14 12:53:44 INFO SparkContext: Invoking stop() from shutdown hook
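To confirm the diagnosis inside the job image, a quick check like the following could be run with the same Python interpreter the Spark job uses. This is only a sketch: the helper name `available_parquet_engines` is hypothetical and not part of the evaluator code. It checks the two engines that pandas reports having tried in the traceback above.

```python
import importlib.util


def available_parquet_engines(candidates=("pyarrow", "fastparquet")):
    """Return the candidate modules that can be imported in this environment.

    Hypothetical helper for debugging; it only looks the modules up,
    it does not import them.
    """
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    engines = available_parquet_engines()
    if engines:
        print(f"Parquet engines available: {engines}")
    else:
        # Same situation as in the traceback: pd.read_parquet will fail.
        print("No Parquet engine found; install pyarrow or fastparquet.")
```

If the list comes back empty, adding pyarrow (or fastparquet) to the job's dependencies should let `pd.read_parquet` find an engine.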
You can reproduce it yourself with the curl request below. Please forward your local ports to your infrastructure, and remember that the presigned MinIO URL expires (X-Amz-Expires=600, i.e. 10 minutes), so you need to refresh it before testing. The curl command to get a new URL for the object is further down.
curl --request POST \
--url http://172.17.0.1:18001/evaluate-data \
--header 'content-type: application/json' \
--data '{
"object_storage_url": "http://91.235.109.231:9000/default/51e73b5c-76fb-450a-b5a6-5951eaa25ecd/streaming.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20250314T124709Z&X-Amz-SignedHeaders=host&X-Amz-Credential=minioadmin%2F20250314%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Expires=600&X-Amz-Signature=c85498dddd8116eede19dc7a97d06bb023c4c3618c4f752fac2322d3368a19f5",
"mime_type": "application/vnd.apache.parquet",
"resource_name": "streaming.parquet",
"dataset_id": "2601098b-cc93-4d02-8aaf-079dcaff545a",
"column_types": {
"id": "text",
"name": "text",
"description": "text",
"file_type": "text",
"consolidation_type": "text",
"consolidation_value": "numeric",
"resource_name": "text",
"status": "text",
"status_message": "text",
"dataset_id": "text",
"field_delimiter": "text",
"decimal_delimiter": "text",
"source_type": "text",
"source_data": "text",
"created_at": "date",
"updated_at": "date"
}
}'
curl command to request a new URL for the MinIO object (Parquet file):
curl --request GET \
--url http://91.235.109.231:8089/storage/artifacts/51e73b5c-76fb-450a-b5a6-5951eaa25ecd/files/streaming.parquet/location \
--header 'accept: */*'
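Since the presigned URL is only valid for a short time, it can help to check whether it has expired before sending the evaluate-data request. The sketch below reads the `X-Amz-Date` and `X-Amz-Expires` query parameters that appear in the URL above. The function names are hypothetical, and it assumes the URL always carries both parameters.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url):
    """Return the UTC time at which a SigV4 presigned URL stops being valid."""
    params = parse_qs(urlparse(url).query)
    # X-Amz-Date is the signing time, formatted like 20250314T124709Z.
    signed_at = datetime.strptime(
        params["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity window in seconds.
    return signed_at + timedelta(seconds=int(params["X-Amz-Expires"][0]))


def is_expired(url, now=None):
    """Return True if the presigned URL has already expired."""
    now = now or datetime.now(timezone.utc)
    return now >= presigned_url_expiry(url)
```

For the URL in this issue (signed at 12:47:09 UTC with a 600 second window), the expiry is 12:57:09 UTC on 2025-03-14, so any later test needs a fresh URL from the location endpoint.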
/cc @jpla
Edited by Antoni Gimeno