DESCRIPTION
This project will host the development of the Data Discovery and Ingestion component, included in the Data Support Tools, as can be seen in the architecture figure below.
This project is intended to connect to different pilot storage solutions.
Once the connection is established, it extracts all the metadata information into a .json file named after the storage solution type.
For now, it is possible to connect to ElasticSearch or PostgreSQL databases. In future work, new scripts will be added to connect to other databases.
This is the main flow (shown below). Check the file `architecture_storage_connector.drawio` inside the `doc` folder for further info about the whole functionality.
INSTALLATION & REQUIREMENTS
Although the main language of the project is Python, it is not necessary to have it installed in order to test the project in your local environment: the project uses Docker for containerized execution.
Download the recommended Docker version from the Docker Official Site.
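To double-check that Docker and the Compose plugin (used in the steps below) are available, you can run:

```bash
# Verify the Docker installation and the Compose plugin
docker --version
docker compose version
```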
USAGE
1. Download or clone this repository on your local machine.
2. Fill in the config file with the right values for the storage solution you want to connect to: set `elasticsearch` or `postgresql` as the value of the `db_type` parameter, and the proper connection data in the rest of the parameters of the config file (a hypothetical example is sketched after this list).
3. Open a terminal pointing to the downloaded folder of step 1 and run `docker compose up`.
4. Wait until the process finishes and find the resulting metadata `.json` file inside the `/app/metadata` folder. Notice that the file is named with the `db_type` value filled in the config file (for example, `elasticsearch_metadata_info.json`).
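As an illustration of step 2, a filled-in config could look like the sketch below. Only `db_type` and its `elasticsearch`/`postgresql` values come from this README; the file format, the remaining key names, and all values are placeholders, so check the config file shipped with the repository for the exact keys.

```ini
; Hypothetical config sketch -- format and key names are assumptions
db_type = elasticsearch
host = localhost
port = 9200
user = my_user
password = my_password
```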
Note: If you don't have any storage solution to test the application with, you can use the samples inside the `devdb` folder (not available under the `main` branch). In that folder you will find a specific `README.md` file with further information.
ROADMAP
Future versions of this project have to address the following points:
- Resolve pending issues. More info on the main board.
- Allow connection to the following storage solutions:
- PostgreSQL
- ElasticSearch
- Cassandra
- MongoDB
- Cloudera Data Platform
- InfluxDB
- Azure Data Lake Storage Gen2
- MinIO
- Redis
- Arkimet
- Db-all.e
To do so, the idea is to replicate the script of one of the already developed connectors (located in the `connectors` folder) and adapt the code to the storage solution it has to connect to.
Take into account that this new connector has to have a simple `db_type` identifier to be used in the main script.
In that script, pay attention to the selector of connectors and add the new piece of code that points to the new connector script.
For example, this is the code for elasticsearch in the main script:
```python
match db_type:
    case "elasticsearch":
        from connectors import elastic_conn
        result = elastic_conn.db_conn(config_conn)
```
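As a sketch only, a new connector could mirror the existing ones by exposing a `db_conn(config_conn)` function and writing its output to `/app/metadata`. The `mongodb` identifier, the `mongo_conn` module, the config keys, and the collected fields below are assumptions for illustration, not existing code:

```python
# connectors/mongo_conn.py -- hypothetical skeleton mirroring the existing connectors
import json

from pymongo import MongoClient  # assumed dependency for this sketch


def db_conn(config_conn):
    """Connect to MongoDB and dump basic collection metadata to a .json file."""
    client = MongoClient(
        host=config_conn["host"],
        port=int(config_conn["port"]),
        username=config_conn.get("user"),
        password=config_conn.get("password"),
    )
    db = client[config_conn["database"]]

    # Collect collection names and the field names of one sample document each.
    metadata = {}
    for name in db.list_collection_names():
        sample = db[name].find_one() or {}
        metadata[name] = {"fields": sorted(sample.keys())}

    # Follow the <db_type>_metadata_info.json naming convention described above.
    out_path = "/app/metadata/mongodb_metadata_info.json"
    with open(out_path, "w") as fp:
        json.dump(metadata, fp, indent=2, default=str)
    return out_path
```

The selector in the main script would then gain one more case pointing to the new module:

```python
match db_type:
    case "elasticsearch":
        from connectors import elastic_conn
        result = elastic_conn.db_conn(config_conn)
    case "mongodb":  # hypothetical identifier for the new connector
        from connectors import mongo_conn
        result = mongo_conn.db_conn(config_conn)
```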