
DESCRIPTION

This project will host the developments for the Data Discovery and Ingestion component, included in the Data Support Tools, as can be seen in the architecture figure below.

(Figure: architecture_DDTools)

This project is intended to connect to different pilot storage solutions.

Once the connection is established, it extracts all the metadata information into a storage_solution_type.json file.

For now, it is possible to connect to ElasticSearch or PostgreSQL databases. In future work, new scripts will be added to connect to other databases.

This is the main flow:

(Figure: main_flow_chart)

Check the file architecture_storage_connector.drawio inside the doc folder for further info about the whole functionality.
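The main flow described above can be sketched as follows. The function names and metadata fields here are illustrative placeholders, not the project's actual API; real connectors would query the database catalog or index mappings:

```python
import json

def extract_postgresql_metadata(config):
    # Placeholder: a real connector would query the PostgreSQL catalog.
    return {"db_type": "postgresql", "tables": []}

def extract_elasticsearch_metadata(config):
    # Placeholder: a real connector would list indices and mappings.
    return {"db_type": "elasticsearch", "indices": []}

def run(config):
    """Pick a connector by db_type, extract metadata, dump it to JSON."""
    db_type = config["db_type"]
    extractors = {
        "postgresql": extract_postgresql_metadata,
        "elasticsearch": extract_elasticsearch_metadata,
    }
    if db_type not in extractors:
        raise ValueError(f"Unsupported db_type: {db_type}")
    metadata = extractors[db_type](config)

    # The output file is named after the db_type value from the config.
    out_path = f"{db_type}_metadata_info.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return out_path
```

This mirrors the selection logic in the real main script, which dispatches on the db_type value from the config file.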

INSTALLATION & REQUIREMENTS

Although the main language of the project is Python, it is not necessary to have it installed to test the project in your local environment. For that purpose, the project uses Docker for a containerized execution.

Download the recommended Docker version from the Docker official site.

USAGE

  1. Download or clone this repository on your local machine.

  2. Fill in the config file with the right values for the storage solution you want to connect to. For example, set elasticsearch or postgresql as the value of the db_type parameter, and the proper connection values in the rest of the parameters of the config file.

  3. Open a terminal pointing to the downloaded folder of step 1 and run docker compose up.

  4. Wait until the process finishes and find the resulting metadata .json file inside the /app/metadata folder. Notice that the file is named with the db_type value from the config file (for example, elasticsearch_metadata_info.json).

Note: If you don't have any storage solution to test the application against, you can use the samples inside the devdb folder (not available under the main branch). In that folder you will find a specific README.md file with further information.
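A config file for a PostgreSQL connection might look like the fragment below. The field names other than db_type are illustrative; check the actual config file in the repository for the exact keys it expects:

```json
{
  "db_type": "postgresql",
  "host": "localhost",
  "port": 5432,
  "user": "pilot_user",
  "password": "changeme",
  "database": "pilot_db"
}
```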

ROADMAP

Future versions of this project have to address the following points:

  1. Resolve pending issues. More info on the main board.

  2. Allow connection to the following storage solutions:

  • PostgreSQL
  • ElasticSearch
  • Cassandra
  • MongoDB
  • Cloudera Data Platform
  • InfluxDB
  • Azure Data Lake Storage Gen2
  • MinIO
  • Redis
  • Arkimet
  • Db-all.e

To do so, the idea is to replicate the script of one of the developed connectors (located in the connectors folder) and adapt the code to the storage solution it has to connect to.

Take into account that this new connector must have a simple db_type identifier to be used in the main script. In that script, pay attention to the connector selector and add the new piece of code that points to the new connector script.

For example, this is the code for elasticsearch on the main script:

match db_type:
    case "elasticsearch":
        from connectors import elastic_conn
        result = elastic_conn.db_conn(config_conn)