Commit 30395e4d authored by Sangamithra Panneer Selvam's avatar Sangamithra Panneer Selvam

Merge branch 'main' of gitlab.eclipse.org:eclipse/graphene/tutorials into main

parents 1390a1de 934e20cf
# Graphene Container Format
## Table of Contents
......@@ -19,7 +17,7 @@
### [1. Introduction](#1-introduction)
This document specifies the Docker container format for tools and models that can be onboarded on Graphene instances so they can be used in the visual composition editor as reusable, highly interoperable building blocks for AI pipelines.
![image](src/images/intro_pipeline.PNG)
......@@ -194,7 +192,7 @@ In this way, expanding the container specification for the databroker's data inp
Note:
1. Please be aware that logging should be enabled in the Python script (`logging.basicConfig(level=logging.INFO)`). For instance, logging must be enabled in the script containing the function/method that reads the dataset metadata.
1. Please refer to the [10. Metrics Aggregation](#8-metrics-aggregation) section to understand the changes implemented in the playground-app.
......@@ -334,7 +332,7 @@ The model provider should ensure the following additions in order to accomplish
1. Have the metrics collected and updated after the training process.
For this purpose, a gRPC routine/method, `get_metrics_metadata(self, request, context)`, is subsequently called after the training process concludes.
1. Python logging should be enabled in the script that contains the metrics aggregation in order to see all of the logs.
Please refer to the Additional Information section to further comprehend the topics mentioned above.
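A hypothetical sketch of this flow (plain Python with the gRPC request/response plumbing omitted; the class name, attribute names, and metric values are assumptions) might cache the metrics at the end of `startTraining` and replay them from `get_metrics_metadata`:

```python
class NewsClassifierServicer:
    """Sketch of a model servicer that exposes metrics after training."""

    def __init__(self):
        self.last_metrics = None  # filled in once training concludes

    def startTraining(self, request, context):
        # ... run the actual training here, then cache the final metrics ...
        self.last_metrics = {
            'type': 'classification-metrics/v1',
            'accuracy': 0.92,
            'validation_loss': 0.91,
            'status_text': 'success',
        }
        return self.last_metrics

    def get_metrics_metadata(self, request, context):
        # Called after startTraining completes; simply return the cached
        # metrics so they can be aggregated into the metadata file.
        return self.last_metrics
```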
......@@ -344,7 +342,7 @@ When the Pipeline is started, the initial logs indicated in 2. are captured, all
A sample representation of the metrics logs for the news_training pipeline is as follows,
`INFO:root:{'metrics': {'date_time': '2023-09-15 08:17:22', 'type': 'classification-metrics/v1', 'accuracy': 0.9247897267341614, 'validation_loss': 0.9067514538764954, 'status_text': 'success'}}`
A sample representation of the output in the metadata file is as shown below,
......@@ -369,7 +367,7 @@ A sample representation of the output in the metadata file is as shown below,
"checksum": "docker-pullable://cicd.ai4eu-dev.eu:7444/training_pipeline/news_classifier@sha256:3b4c88e571abbb0e536d8048ffb82ae4126d52dafa1e03eb13b6d8c22bf3e859",
"metrics": {
"date_time": "2023-09-12 06:00:09",
"type": "classificationMetrics/v1",
"type": "classification-metrics/v1",
"accuracy": 0.9239237904548645,
"validation_loss": 0.9049152731895447,
"status_text": "success"
......@@ -377,10 +375,173 @@ A sample representation of the output in the metadata file is as shown below,
}
```
<details>
<summary>Appendix: Information on metadata</summary>
In summary, we extract metadata such as dataset features and model metrics. This section describes the different types of model metrics and dataset features.
**Model Metrics:**
The models in the playground can be broadly classified into two types: regression and classification. Metrics are used to evaluate the performance of your model and to make improvements toward the desired result.
Classification Model:
<div align="center">
<table >
<thead>
<tr>
<th align="center">S.no</th>
<th align="center"> Metrics</th>
<th align="center">Functionality</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1</td>
<td align="left">Accuracy</td>
<td align="left">Accuracy is a metric that measures the percentage of correctly predicted instances in your dataset. It's a straightforward way to evaluate how well your model is performing in terms of getting predictions right.</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">validation_loss</td>
<td align="left">Validation loss is a metric that quantifies how well your machine learning model fits the training data.</td>
</tr>
</tbody>
</table>
<p></p>
</div>
A sample:
```json
"metrics": {
"date_time": "2023-09-15 08:17:22",
"type": "classification-metrics/v1",
"accuracy": 0.9247897267341614,
"validation_loss": 0.9067514538764954,
"status_text": "success"
}
```
In conclusion, accuracy can be used as a guiding metric during model development and hyperparameter tuning. validation_loss is typically used during the training process to monitor whether the model is improving or overfitting. A decreasing validation loss indicates that the model is learning and improving its ability to make predictions. However, if the validation loss starts increasing, it may be a sign of overfitting.
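As an illustrative sketch (the heuristic and its `patience` threshold are assumptions, not part of the platform), the overfitting signal described above can be checked by looking at the trend of the validation loss across epochs:

```python
def is_overfitting(val_losses, patience=3):
    # Heuristic: if the validation loss has risen for `patience`
    # consecutive epochs, treat that as a sign of overfitting.
    if len(val_losses) <= patience:
        return False
    tail = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))
```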
Regression Model:
<div align="center">
<table >
<thead>
<tr>
<th align="center">S.no</th>
<th align="center"> Metrics</th>
<th align="center">Functionality</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1</td>
<td align="left">Root Mean Square Error</td>
<td align="left">RMSE is a measure of the average deviation (error) between predicted values and actual values in a regression model. RMSE provides a sense of how much the model's predictions deviate from the true values, with lower RMSE values indicating better model performance.</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">Mean Squared Error</td>
<td align="left">MSE is another metric that quantifies the average squared difference between predicted values and actual values. Unlike RMSE, MSE does not take the square root, so it is sensitive to large errors.</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">R2 Score</td>
<td align="left">R2 (R-squared) is a metric that represents the proportion of the variance in the dependent variable (target) that is explained by the independent variables (features) in a regression model. It ranges between 0 and 1, with higher values indicating a better fit of the model to the data. A value of 1 indicates a perfect fit.</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">Adjusted R2 Score</td>
<td align="left">Adjusted R2 is a modification of the R2 score that accounts for the number of independent variables (features) in the model.It penalizes the inclusion of irrelevant features and generally provides a more accurate assessment of model performance, especially in multiple regression</td>
</tr>
</tbody>
</table>
<p></p>
</div>
A sample:
```json
"metrics": {
"date_time": "2023-09-28 13:56:21",
"type": "regression-metrics/v1",
"mse": 0.0025680503998073007,
"rmse": 0.05067593511527242,
"r_squared": 0.831178119365146,
"adjusted_r_squared": 0.8270170166734419,
"status_text": "success"
}
```
In conclusion, RMSE and MSE measure the accuracy of predictions, R2 quantifies the goodness of fit, and Adjusted R2 adjusts R2 for the complexity of the model. These metrics are crucial for evaluating regression models and selecting the most appropriate model for a given dataset.
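For reference, the relationship between these four metrics can be sketched in a few lines of pure Python (the helper function is illustrative; the result keys mirror the sample JSON above):

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    # MSE: average squared error between predictions and targets.
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = ss_res / n
    # RMSE: square root of MSE, in the same units as the target.
    rmse = math.sqrt(mse)
    # R2: 1 minus the ratio of residual variance to total variance.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r_squared = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes the number of features in the model.
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - n_features - 1)
    return {'mse': mse, 'rmse': rmse,
            'r_squared': r_squared, 'adjusted_r_squared': adjusted}
```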
**Dataset Features:**
<div align="center">
<table >
<thead>
<tr>
<th align="center">S.no</th>
<th align="center">Features</th>
<th align="center">Functionality</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">1</td>
<td align="left">Type</td>
<td align="left">Specifies the type or category information that the dataset belongs to.</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">Dataset name</td>
<td align="left">This is the name or label given to a specific dataset, often indicating its content, source, or purpose.</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">Description</td>
<td align="left">A brief or detailed explanation of what the dataset contains, its origin, format, and any other relevant information for potential users.</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">Size</td>
<td align="left">The size of the dataset, typically measured in terms of the number of records, rows, columns, or the total file size in bytes or other appropriate units.</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">DOI (Digital Object Identifier) or ID</td>
<td align="left">A unique and persistent identifier assigned to the dataset, often used for citation and reference purposes, ensuring its accessibility and traceability.</td>
</tr>
</tbody>
</table>
<p></p>
</div>
A sample:
```json
"dataset_features": {
"type": "aiod-dataset/v1(Kaggle dataset)",
"datasetname": "House Prices dataset",
"description": "https://www.kaggle.com/datasets/lespin/house-prices-dataset",
"size": "204 kB",
"DOI_ID": "Not available"
}
```
In conclusion, dataset features help provide an overview of the dataset that the model uses.
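A minimal sketch of assembling this block (the helper name is an assumption; the keys mirror the `DatasetFeatues` message used later in this document):

```python
def build_dataset_features(type_, name, description, size, doi_id="Not available"):
    # Assemble the dataset_features block in the same shape as the sample above.
    return {
        'type': type_,
        'datasetname': name,
        'description': description,
        'size': size,
        'DOI_ID': doi_id,
    }
```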
</details>
<details>
<summary>Additional Information - Type of metrics</summary>
Note: Another field in the metadata is 'type', which represents the dataset type and the model's metrics type. This field is critical not only for model evaluation but also for aligning the model's performance with the precise goals and requirements of the application. Adding this field allows one to group the metrics into a primary group or sub-group indicating the training and testing phases. The dataset type likewise helps discern the source of the dataset.
A sample extension of the metadata for the metrics is shown below. The metrics can be grouped into training and testing phases using the "type" parameter.
```json
{
......
......@@ -10,9 +10,10 @@ message TrainingConfig {
}
message TrainingStatus {
string type = 1;
double accuracy = 2;
double validation_loss = 3;
string status_text = 4;
}
message NewsText {
......
......@@ -134,6 +134,7 @@ class NewsClassifier(news_classifier_pb2_grpc.NewsClassifierServicer):
print(f"model is available here: {request.model_filename + '.onnx'}")
response = news_classifier_pb2.TrainingStatus()
response.type = 'classification-metrics/v1' # Type-in the type of metrics here
response.accuracy = history.history['accuracy'][-1]
response.validation_loss = history.history['val_loss'][-1]
response.status_text = 'success'
......
......@@ -19,7 +19,7 @@ DESCRIPTOR = _descriptor.FileDescriptor(
syntax='proto3',
serialized_options=None,
create_key=_descriptor._internal_create_key,
serialized_pb=b'\n\x15news_classifier.proto\"\xa8\x01\n\x0eTrainingConfig\x12\x1e\n\x16training_data_filename\x18\x01 \x01(\t\x12 \n\x18training_labels_filename\x18\x02 \x01(\t\x12\x0e\n\x06\x65pochs\x18\x03 \x01(\x05\x12\x12\n\nbatch_size\x18\x04 \x01(\x05\x12\x18\n\x10validation_ratio\x18\x05 \x01(\x01\x12\x16\n\x0emodel_filename\x18\x06 \x01(\t\"^\n\x0eTrainingStatus\x12\x0c\n\x04type\x18\x01 \x01(\t\x12\x10\n\x08\x61\x63\x63uracy\x18\x02 \x01(\x01\x12\x17\n\x0fvalidation_loss\x18\x03 \x01(\x01\x12\x13\n\x0bstatus_text\x18\x04 \x01(\t\"\x18\n\x08NewsText\x12\x0c\n\x04text\x18\x01 \x01(\t\"<\n\x0cNewsCategory\x12\x15\n\rcategory_code\x18\x01 \x01(\x05\x12\x15\n\rcategory_text\x18\x02 \x01(\t2\xa3\x01\n\x0eNewsClassifier\x12\x31\n\rstartTraining\x12\x0f.TrainingConfig\x1a\x0f.TrainingStatus\x12$\n\x08\x63lassify\x12\t.NewsText\x1a\r.NewsCategory\x12\x38\n\x14get_metrics_metadata\x12\x0f.TrainingStatus\x1a\x0f.TrainingStatusb\x06proto3'
)
......@@ -101,22 +101,29 @@ _TRAININGSTATUS = _descriptor.Descriptor(
create_key=_descriptor._internal_create_key,
fields=[
_descriptor.FieldDescriptor(
name='type', full_name='TrainingStatus.type', index=0,
number=1, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='accuracy', full_name='TrainingStatus.accuracy', index=1,
number=2, type=1, cpp_type=5, label=1,
has_default_value=False, default_value=float(0),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='validation_loss', full_name='TrainingStatus.validation_loss', index=2,
number=3, type=1, cpp_type=5, label=1,
has_default_value=False, default_value=float(0),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='status_text', full_name='TrainingStatus.status_text', index=3,
number=4, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
......@@ -134,7 +141,7 @@ _TRAININGSTATUS = _descriptor.Descriptor(
oneofs=[
],
serialized_start=196,
serialized_end=290,
)
......@@ -165,8 +172,8 @@ _NEWSTEXT = _descriptor.Descriptor(
extension_ranges=[],
oneofs=[
],
serialized_start=292,
serialized_end=316,
)
......@@ -204,8 +211,8 @@ _NEWSCATEGORY = _descriptor.Descriptor(
extension_ranges=[],
oneofs=[
],
serialized_start=318,
serialized_end=378,
)
DESCRIPTOR.message_types_by_name['TrainingConfig'] = _TRAININGCONFIG
......@@ -251,8 +258,8 @@ _NEWSCLASSIFIER = _descriptor.ServiceDescriptor(
index=0,
serialized_options=None,
create_key=_descriptor._internal_create_key,
serialized_start=381,
serialized_end=544,
methods=[
_descriptor.MethodDescriptor(
name='startTraining',
......
type,aiod-dataset/v1(TensorFlow Dataset(tfds))
datasetname,The Reuters Dataset
description,http://kdd.ics.uci.edu/databases/reuters21578/README.txt
size,4MB
......
......@@ -8,10 +8,11 @@ message NewsText {
}
message DatasetFeatues {
string type = 1;
string datasetname = 2;
string description = 3;
string size = 4;
string DOI_ID = 5;
}
service NewsDatabroker {
......
......@@ -19,7 +19,7 @@ DESCRIPTOR = _descriptor.FileDescriptor(
syntax='proto3',
serialized_options=None,
create_key=_descriptor._internal_create_key,
serialized_pb=b'\n\x15news_databroker.proto\"\x07\n\x05\x45mpty\"\x18\n\x08NewsText\x12\x0c\n\x04text\x18\x01 \x01(\t\"f\n\x0e\x44\x61tasetFeatues\x12\x0c\n\x04type\x18\x01 \x01(\t\x12\x13\n\x0b\x64\x61tasetname\x18\x02 \x01(\t\x12\x13\n\x0b\x64\x65scription\x18\x03 \x01(\t\x12\x0c\n\x04size\x18\x04 \x01(\t\x12\x0e\n\x06\x44OI_ID\x18\x05 \x01(\t2`\n\x0eNewsDatabroker\x12\x1d\n\x08get_next\x12\x06.Empty\x1a\t.NewsText\x12/\n\x14get_dataset_metadata\x12\x06.Empty\x1a\x0f.DatasetFeatuesb\x06proto3'
)
......@@ -91,33 +91,40 @@ _DATASETFEATUES = _descriptor.Descriptor(
create_key=_descriptor._internal_create_key,
fields=[
_descriptor.FieldDescriptor(
name='type', full_name='DatasetFeatues.type', index=0,
number=1, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='datasetname', full_name='DatasetFeatues.datasetname', index=1,
number=2, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='description', full_name='DatasetFeatues.description', index=2,
number=3, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='size', full_name='DatasetFeatues.size', index=3,
number=4, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
_descriptor.FieldDescriptor(
name='DOI_ID', full_name='DatasetFeatues.DOI_ID', index=4,
number=5, type=9, cpp_type=9, label=1,
has_default_value=False, default_value=b"".decode('utf-8'),
message_type=None, enum_type=None, containing_type=None,
is_extension=False, extension_scope=None,
serialized_options=None, file=DESCRIPTOR, create_key=_descriptor._internal_create_key),
],
extensions=[
],
......@@ -131,7 +138,7 @@ _DATASETFEATUES = _descriptor.Descriptor(
oneofs=[
],
serialized_start=60,
serialized_end=162,
)
DESCRIPTOR.message_types_by_name['Empty'] = _EMPTY
......@@ -169,8 +176,8 @@ _NEWSDATABROKER = _descriptor.ServiceDescriptor(
index=0,
serialized_options=None,
create_key=_descriptor._internal_create_key,
serialized_start=164,
serialized_end=260,
methods=[
_descriptor.MethodDescriptor(
name='get_next',
......