We need a way to get the list of repositories for individual projects. Currently, we use this both to gather project metrics and to identify repositories that require review by the IP Team. As we move to implement metrics via an external provider, we're going to need a consistent and fully-supported means of getting the list of repositories.
Here's what the scripts that I maintain currently do.
We start by getting the project metadata via API call to projects.eclipse.org (e.g., https://projects.eclipse.org/json/project/adoptium.aqavit). From that, we get the following (sketched in code after this list):
A list of repository URLs from the source_repositories field;
Zero or more GitHub organisations from the github_org field; and
Zero or more GitLab groups from the gl_project_group field and excluded groups from the gl_excl_sub_groups field.
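A minimal sketch of this first step in PHP, assuming the payload is keyed by a top-level projects object (the exact array paths are an assumption; only the field names are confirmed above):

```php
<?php
// Fetch project metadata from the Eclipse Projects API and pull out the
// four fields of interest. ASSUMPTION: the payload is shaped as
// {"projects": {"<project-id>": {...}}}; adjust the paths if it is not.
$projectId = 'adoptium.aqavit';
$json = file_get_contents("https://projects.eclipse.org/json/project/{$projectId}");
$project = json_decode($json, true)['projects'][$projectId] ?? [];

$sourceRepositories = $project['source_repositories'] ?? [];
$githubOrgs         = $project['github_org'] ?? [];
$gitlabGroups       = $project['gl_project_group'] ?? [];
$excludedSubGroups  = $project['gl_excl_sub_groups'] ?? [];
```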
There is potentially some duplication between what is listed in the source_repositories field and what we find via the GitHub organisations, so I filter for that.
For each of the GitHub organisations, I use the GitHub API to get the list of repositories.
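For the GitHub side, the listing step looks roughly like this (GET /orgs/{org}/repos is the documented GitHub REST endpoint; token and error handling are simplified for this sketch):

```php
<?php
// List every repository in a GitHub organisation, following pagination
// (the API caps per_page at 100).
function githubOrgRepos(string $org, string $token): array {
    $repos = [];
    $page = 1;
    do {
        $context = stream_context_create(['http' => ['header' =>
            "User-Agent: repo-lister\r\nAuthorization: Bearer {$token}\r\n"]]);
        $url = "https://api.github.com/orgs/{$org}/repos?per_page=100&page={$page}";
        $batch = json_decode(file_get_contents($url, false, $context), true);
        if (!is_array($batch)) {
            break; // request failed; a real implementation would retry or log
        }
        foreach ($batch as $repo) {
            $repos[] = $repo['html_url'];
        }
        $page++;
    } while (count($batch) === 100);
    return $repos;
}
```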
For each of the GitLab groups, I use the GitLab API to get the list of repositories. I do this recursively, so that subgroups are included, and I prune subtrees rooted in excluded groups during the recursion. Note that, AFAICT, this "excluded groups" feature isn't actually exploited by any projects currently.
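The recursive GitLab walk could look like this (the groups/:id/projects and groups/:id/subgroups endpoints are documented GitLab v4 API calls; pagination and authentication are elided for brevity):

```php
<?php
// Recursively gather project URLs from a GitLab group and its subgroups,
// pruning any subtree whose full path appears in the excluded list.
function gitlabGroupRepos(string $groupPath, array $excluded): array {
    $base = 'https://gitlab.eclipse.org/api/v4/groups/' . rawurlencode($groupPath);

    $projects = json_decode(file_get_contents("{$base}/projects?per_page=100"), true);
    $repos = array_column($projects ?: [], 'web_url');

    $subgroups = json_decode(file_get_contents("{$base}/subgroups?per_page=100"), true);
    foreach ($subgroups ?: [] as $sub) {
        if (in_array($sub['full_path'], $excluded, true)) {
            continue; // prune the recursion at excluded groups
        }
        $repos = array_merge($repos, gitlabGroupRepos($sub['full_path'], $excluded));
    }
    return $repos;
}
```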
With a full list of repositories gathered in this manner, I exclude some repositories that I know to be mirrors or that otherwise do not include Eclipse project code. This includes a number of OpenJDK repositories from the Adoptium subprojects and a bunch of mirror/third-party repositories under Oniro. I also skip all "website" repositories under the assumption that they do not contain project code. These exclusions are all done with a nasty bit of hard-coded regular expressions that an earlier version of myself decided would be a temporary hack.
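For illustration, the exclusion filter amounts to something like the following; the patterns shown are hypothetical stand-ins, not the actual list from the script:

```php
<?php
// Drop repositories matching hard-coded exclusion patterns. The patterns
// below are invented examples of the kinds of rules described above.
function filterRepos(array $repos): array {
    $exclusions = [
        '#/adoptium/jdk[^/]*$#',  // OpenJDK forks under Adoptium (example)
        '#mirror#',               // known mirror repositories (example)
        '#website#',              // "website" repos, assumed to hold no project code
    ];
    return array_values(array_filter($repos, function (string $url) use ($exclusions) {
        foreach ($exclusions as $pattern) {
            if (preg_match($pattern, $url)) {
                return false;
            }
        }
        return true;
    }));
}
```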
I currently have this all implemented twice: once in PHP, and once in Java.
I would very much like to have this all implemented once and maintained by the same team that decides how all of this information is represented.
I totally agree that this functionality would be useful!
As you've highlighted, the IP Team needs this and I can see this endpoint being used across multiple web services and properties that we manage, including the PMI and some of our WG sites.
From my perspective, the best location to create this API is within our existing API infrastructure for projects in the PMI (Drupal).
However, I currently don't see the benefit of implementing this in Drupal 7. This implies that if we decide on Drupal as our platform, we could only start working on this task once the D9 version of the PMI is deployed to production.
@wbeaton , are your existing solutions good enough to buy us enough time to allow us to work on this next year or do you need this sooner?
With this said, from a performance perspective, we'll need to evaluate whether we should build a separate tool for data aggregation, which may be independent of Drupal.
If we proceed in this direction, the Drupal setup could be relatively straightforward. It would simply serve data stored in a database, while the task of data aggregation would be delegated to an independent service that operates at regular intervals.
If we decide to go in that direction, we may be able to start working on that data aggregation tool before the D9 instance of the PMI is deployed.
I understand that it's a bit soon for me to talk about implementation details but wanted to share my preliminary thoughts with the team to help us define the priority level for this task.
Unless I am missing something, my understanding is that deliverable 1 will be solved once we deploy the PMI on Drupal 10?
Can you guys confirm that in the Drupal 10 project API, the project endpoint now returns the GitLab, Gerrit, and GitHub URLs by merging the data we have in the PMI and the Dashboard database?
For deliverable 2, we should discuss whether we want to maintain a separate sync script or create cron jobs in the PMI to sync that data.
> Can you guys confirm that in the Drupal 10 project API, the project endpoint now returns the GitLab, Gerrit, and GitHub URLs by merging the data we have in the PMI and the Dashboard database?
But remember that this is not complete information. You actually have to query GitLab/GitHub to get all of the repos in the group/org.
> Provide an API that exposes all the Git repository URLs for all our projects.
The GitHub Org and GitLab Group information in the PMI only tell us how to find the Git repositories, not the URLs of the Git repositories themselves.
I'm thinking that it needs to be moved and made an official API that is accessible from the usual place.
I have both a PHP and a Java implementation.
I use the PHP implementation to populate a database table as a cache, because the computation is potentially time consuming with multiple calls to GitHub and GitLab APIs.
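For illustration, the caching step is just a table refresh along these lines (the table and column names are hypothetical; the real table is the GitRepo table discussed below):

```php
<?php
// Replace the cached repository list for one project in a single
// transaction. Table and column names are invented for this sketch.
function cacheRepos(PDO $db, string $projectId, array $repos): void {
    $db->beginTransaction();
    $db->prepare('DELETE FROM git_repo_cache WHERE project_id = ?')
       ->execute([$projectId]);
    $insert = $db->prepare('INSERT INTO git_repo_cache (project_id, url) VALUES (?, ?)');
    foreach ($repos as $url) {
        $insert->execute([$projectId, $url]);
    }
    $db->commit();
}
```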
Note that we also agreed on the implementation details based on my comment here and your thumbs up reaction:
#217 (comment 1523099)
In my opinion, the next step would be to implement a solution to take over the synchronization of git URLs in the dashboard database which I identified as deliverable 2:
> Replace/take ownership of @wbeaton's sync script for syncing repo URLs in the dashboard database.
I will admit that having these similar issues is very confusing. The way I see it, deliverable 1 is being done via #217 (closed) and we can use this issue to focus on deliverable 2.
> Can you guys confirm that in the Drupal 10 project API, the project endpoint now returns the GitLab, Gerrit, and GitHub URLs by merging the data we have in the PMI and the Dashboard database?
Yes, using the MR you mention in your comment, I'm adding the missing GitHub and GitLab repos to the projects API.
> The way I see it, deliverable 1 is being done via #217 (closed) and we can use this issue to focus on deliverable 2.
We can also close this issue as a duplicate of #217 (closed) and create a new issue that solely focuses on deliverable 2 since the topic of this issue is "Add an API to get a list of repositories for a project".
I don't think that this is currently a problem, but there's a circular relationship in the current solution that may bite us at some point.
The PMI specifies GitHub repositories using three fields:
GitHub org, which manifests in the API as github/org
Ignored GitHub repositories, which manifests in the API as github/ignored_repos
GitHub Repositories, which manifests in the API as github_repos.
The "Dashboard" process uses this information in the following way:
The value github/org is sent to the GitHub API to get the list of repositories. We add the repositories listed in github_repos to this list. Then, we remove every repository that is inaccessible, archived, or is in github/ignored_repos from the list. The list is further pruned to remove, for example, forks of JDK content from Adoptium projects (this is just hard-coded).
The resulting list of repositories is stored in the GitRepo table of the dashboard database as the repositories associated with the project.
The API adds the contents of the GitRepo table to the Git repositories that are specified in the PMI, and serves the combined results in the github_repos field.
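In outline, the first step is just set arithmetic over URL lists (a sketch; the archived/inaccessible and hard-coded checks are elided):

```php
<?php
// Merge the repos discovered via the GitHub org with those declared in
// github_repos, then remove the exclusions listed in github/ignored_repos.
function dashboardRepoList(array $discovered, array $declared, array $ignored): array {
    $candidates = array_unique(array_merge($discovered, $declared));
    return array_values(array_diff($candidates, $ignored));
}
```

The result is what lands in the GitRepo table, and it is this same list that the API later folds back into github_repos, which is the feedback loop described below.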
There's no general means in the API output of separating the GitHub repositories that are specified in the PMI from those that are discovered in the org.
If you look in the PMI, you'll see only the first repository explicitly listed in the GitHub Repositories field.
The potential issue is that every time the process runs, the contents of the github_repos field are used to generate the contents of the github_repos field.
Since I'm identifying when a repository no longer exists or is archived, I think that most of the bad things that might happen are unlikely to happen. The one case that may bite us is if the owner of a repository is changed (but the repository is not itself moved). I'm pretty sure that magic should just happen if a repository is moved.
Again, I don't think that this is an imminent problem, but it is certainly something that we should keep an eye on. If something does go wrong, debugging this will be a special sort of fun.
I ran into the special sort of fun that I had anticipated today.
The https://github.com/adoptium/openj9-systemtest repository is claimed by both the Eclipse Aqavit and Eclipse OpenJ9 projects. I'm not sure how we arrived at this situation. There is no mention of this repository in the OpenJ9 project's metadata, but I assume that there must've been a reference at some point. Regardless, the database has this association, and the circular relationship that I described above is feeding it back into the script, which causes the relationship to persist.
I've hacked the script to remove the association, so it should disappear when the script runs tonight.
The PMI doesn't do any validation to check if another project has "claimed" a repo, group, or organization, but if you think we should, we could add it to the form validation.
No current configuration in the PMI accounts for this. I assume that this association existed at one point, and -- because of the way that we populate the github_repos field -- it will never go away without intervention.
I've intervened. The tweaks that I've made to the script should cause it to disappear overnight.
To address this issue, the next step would probably be for the PMI itself to fetch all of the additional repositories. Since the PMI should already have the necessary data, the challenge here lies in doing this efficiently.
This should allow us to move away from relying on data from the dashboard and ultimately eliminate this circular relationship.
This could be something the Drupal team takes on in Q1 2025 once we are done with the accounts/packages migration.
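One efficient shape for this, if it does land with the Drupal team, would be to push the slow GitHub/GitLab calls into Drupal's queue system. A sketch, where the module name pmi_repos, the queue id, and the project content type are all assumptions:

```php
<?php
// pmi_repos.module -- enqueue one fetch job per project on each cron run,
// so the expensive API calls happen incrementally rather than in one pass.

/**
 * Implements hook_cron().
 */
function pmi_repos_cron() {
  $nids = \Drupal::entityQuery('node')
    ->condition('type', 'project')
    ->accessCheck(FALSE)
    ->execute();
  $queue = \Drupal::queue('pmi_repos_fetch');
  foreach ($nids as $nid) {
    // Each item is picked up by a QueueWorker plugin (id: pmi_repos_fetch)
    // that queries the GitHub/GitLab APIs for that one project.
    $queue->createItem(['nid' => $nid]);
  }
}
```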
Looking at the settings.php file, I can see that we connect to the database located in dbapi. @mward Is this the right place? Or is there a more up to date version of the database somewhere else? Thanks.
@wbeaton after looking at the code, I'm seeing that there is a typo that prevents repos from dash being added to the list of GitLab repos.
After fixing the typo, I now end up with more than 25 repos, since the code is also fetching repos from the field_project_gitlab_repos field, which is hidden in the form because it was archived in favor of using the GitLab project group field. Note that the code takes care of duplicates and lists URLs only once.
Is this still the correct behaviour we want? Or should we only get repos from Dashboard and ignore data from the hidden field_project_gitlab_repos field?
We should not be listing repositories when we have no control over the list. So, let's not include the items in field_project_gitlab_repos. IMHO, this field should just be deleted.
> Is this still the correct behaviour we want?
The output that we get by grabbing the GitLab repositories from dash should be correct.
The results are correct, but the behaviour is wrong. The circular relationship with the Dash process is going to hurt us at some point. The solution that I want is to generate this list using the data available and the GitLab API, and not get this from Dash.
If I recall correctly, the code that searches repositories uses a simple pattern search for "github" and "gitlab". This will break if anybody creates a repository on GitLab with "github" in the name, or vice versa.
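A more robust classification would key off the URL's host rather than a substring match; a minimal sketch:

```php
<?php
// Classify a repository URL by its host, so a GitLab project with
// "github" in its name (or vice versa) is not misclassified.
function forgeFor(string $url): string {
    $host = parse_url($url, PHP_URL_HOST) ?: '';
    if ($host === 'github.com') {
        return 'github';
    }
    if ($host === 'gitlab.eclipse.org' || $host === 'gitlab.com') {
        return 'gitlab';
    }
    return 'other';
}
```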
Ok sounds good, I'll update my MR to remove the field_project_gitlab_repos field. I'll also create a separate issue to clean up the field_project_gitlab_repos field from the database and code.
> The results are correct, but the behaviour is wrong. The circular relationship with the Dash process is going to hurt us at some point. The solution that I want is to generate this list using the data available and the GitLab API, and not get this from Dash.
As @cguindon was saying above, this is something we will be able to tackle in Q1 of 2025.
@wbeaton - I'm trying to clarify the current status of this issue.
My understanding is that our project API now exposes all repository URLs, so the first deliverable is technically complete. Would you agree? #351 (comment 2748471)
To close the issue, I believe the next step is to develop a mechanism to fetch this data automatically, so we can stop relying on your script to manually add or remove URLs from the dashboard database.
To set expectations, I don't believe my team has cycles to take this on in Q4, but I would like to confirm what is left for us to do here.
The output is generally (but not entirely) correct, but the behaviour is wrong.
The circular relationship has bitten us at least once, requiring hours of my time to untangle and resolve. That we cannot distinguish between GitHub repositories set in the PMI and those that were discovered means that this will almost certainly happen again.
I'd like to withdraw this request. There are a number of mitigating issues that make this less useful than I'd originally hoped.
Before we close this, however, we need to undo what's been done already. Specifically, we need to undo the coupling with the dashboard database, and return the API to a state where it is just returning the data as it is represented in the PMI.
I'm pretty sure that the impact of undoing the change should be minimal. Consumers should be interpreting the Git metadata answered by the Eclipse Projects API (e.g., using GitHub APIs to get lists of repositories based on the GitHub org specified in the API data). If they are doing so, then they should not be impacted by this change.
Before we undo anything, though, we need to make sure that we limit the impact on consumers. AFAIK, the only external consumer of this data is the Bitergia dashboard. @bbaldassari2kd can you please confirm with our Bitergia friends that they are interpreting the Git metadata in the Eclipse Project API in its entirety and not just reading the list of GitHub repositories?
Are there other external consumers that we're aware of?
FYI, I've attempted to capture how one should interpret the Git metadata provided by the project API as an example specification document.
Note that the intention of the document is to provide an example of a specification document; the intention is not to produce an actual specification, and whether or not the actual content is correct is not a priority for the project. Having said that, I'm definitely interested in trying to make the content accurate, so comments/issues and merge requests are welcome.
@wbeaton From what I know, our Bitergia friends use many git-related fields of the API: github or gitlab org, list of github and gitlab repos, and even the ignored_sub_groups/ignore_repos. If I understand correctly, once the changes have been rolled back to the initial pure-PMI data, only the github/gitlab repos with the ignore* fields will be filled. Is that correct?
I can start a thread with them to inform / prepare them for the change, if needed.
PS: For the record, but unrelated: they also use other fields, like technology types and working groups. I'm assuming there will be no change to the non-git fields, right?