[Quality Evaluator] module missing in Spark Job when processing Parquet files

We're having issues processing Parquet files in the Spark job. The job fails because pandas cannot find a Parquet engine: neither pyarrow nor fastparquet is installed in the job environment.

Spark Job logs:

Proceeding to load the file...
Entered the parquet branch.
Traceback (most recent call last):
  File "/app/spark_jobs/evaluator_kpi_library_data_payload.py", line 397, in <module>
    main(sys.argv)
  File "/app/spark_jobs/evaluator_kpi_library_data_payload.py", line 312, in main
    df = pd.read_parquet(io.BytesIO(response.content))
  File "/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    impl = get_engine(engine)
  File "/.local/lib/python3.10/site-packages/pandas/io/parquet.py", line 60, in get_engine
    raise ImportError(
ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
25/03/14 12:53:44 INFO SparkContext: Invoking stop() from shutdown hook
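
Going by the ImportError above, the fix is most likely to add pyarrow (or fastparquet) to the dependencies of the Spark job image. Independently of that, the job could check for an engine at startup so it fails with a clearer message before downloading the file. A minimal sketch, assuming it would be called at the start of main(); the helper names are hypothetical and not part of the current job code:

```python
import importlib.util

# Engines pandas tries for Parquet support, in the order shown in the error message.
PARQUET_ENGINES = ("pyarrow", "fastparquet")


def available_parquet_engines(candidates=PARQUET_ENGINES):
    """Return the candidate module names that can be imported in this environment."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


def require_parquet_engine(candidates=PARQUET_ENGINES):
    """Return the first importable engine name, or raise if none is installed."""
    engines = available_parquet_engines(candidates)
    if not engines:
        raise RuntimeError(
            "No Parquet engine installed; tried: " + ", ".join(candidates)
        )
    return engines[0]
```

The returned name could then be passed explicitly, e.g. pd.read_parquet(..., engine=require_parquet_engine()), so the engine in use is visible in the code.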

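You can reproduce it yourself with the cURL commands below. Please forward your local ports to your infra, and remember that the MinIO URL is presigned and expires (X-Amz-Expires=600, i.e. 10 minutes after signing), so you need to refresh it before sending the request. The cURL command to get a new URL for the object is at the end of this description.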

curl --request POST \
  --url http://172.17.0.1:18001/evaluate-data \
  --header 'content-type: application/json' \
  --data '{
  "object_storage_url": "http://91.235.109.231:9000/default/51e73b5c-76fb-450a-b5a6-5951eaa25ecd/streaming.parquet?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20250314T124709Z&X-Amz-SignedHeaders=host&X-Amz-Credential=minioadmin%2F20250314%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Expires=600&X-Amz-Signature=c85498dddd8116eede19dc7a97d06bb023c4c3618c4f752fac2322d3368a19f5",
  "mime_type": "application/vnd.apache.parquet",
  "resource_name": "streaming.parquet",
  "dataset_id": "2601098b-cc93-4d02-8aaf-079dcaff545a",
  "column_types": {
    "id": "text",
    "name": "text",
    "description": "text",
    "file_type": "text",
    "consolidation_type": "text",
    "consolidation_value": "numeric",
    "resource_name": "text",
    "status": "text",
    "status_message": "text",
    "dataset_id": "text",
    "field_delimiter": "text",
    "decimal_delimiter": "text",
    "source_type": "text",
    "source_data": "text",
    "created_at": "date",
    "updated_at": "date"
  }
}'
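
Since the presigned URL in the request above is only valid for 600 seconds after its X-Amz-Date, it can help to check whether it has already expired before retrying. A minimal sketch, assuming the URL carries the standard X-Amz-Date and X-Amz-Expires query parameters as the one above does; the function names and the example host are hypothetical:

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def presigned_url_expiry(url):
    """Return the UTC datetime at which a SigV4 presigned URL stops being valid."""
    query = parse_qs(urlparse(url).query)
    # X-Amz-Date is the signing time in UTC, formatted as YYYYMMDDTHHMMSSZ.
    signed_at = datetime.strptime(query["X-Amz-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    # X-Amz-Expires is the validity period in seconds, counted from X-Amz-Date.
    return signed_at + timedelta(seconds=int(query["X-Amz-Expires"][0]))


def is_expired(url, now=None):
    """Return True if the presigned URL is no longer valid at `now` (default: current time)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= presigned_url_expiry(url)


# Example with the date parameters from the request above (placeholder host and path).
example = "http://example.invalid/streaming.parquet?X-Amz-Date=20250314T124709Z&X-Amz-Expires=600"
print(presigned_url_expiry(example))  # 2025-03-14 12:57:09+00:00
```

If is_expired(...) returns True, request a fresh URL before calling /evaluate-data again.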

cURL command to request a new URL for the MinIO object (Parquet file):

curl --request GET \
  --url http://91.235.109.231:8089/storage/artifacts/51e73b5c-76fb-450a-b5a6-5951eaa25ecd/files/streaming.parquet/location \
  --header 'accept: */*'

/cc @jpla

Edited Mar 14, 2025 by Antoni Gimeno