We need a way to get the list of repositories for individual projects. Currently, we use this both to gather project metrics and to identify repositories that require review by the IP Team. As we move to implement metrics via an external provider, we're going to need a consistent and fully-supported means of getting the list of repositories.
Here's what the scripts that I maintain currently do.
We start by getting the project metadata via API call to projects.eclipse.org (e.g., https://projects.eclipse.org/json/project/adoptium.aqavit). From that, we get the following (sketched in code after this list):
A list of repository URLs from the source_repositories field;
Zero or more GitHub organisations from the github_org field; and
Zero or more GitLab groups from the gl_project_group field and excluded groups from the gl_excl_sub_groups field.
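A minimal sketch of this first step in PHP, assuming the payload is keyed by a top-level projects object (the exact array paths are an assumption; only the field names are confirmed above):

```php
<?php
// Fetch project metadata from the Eclipse Projects API and pull out the
// four fields of interest. ASSUMPTION: the payload is shaped as
// {"projects": {"<project-id>": {...}}}; adjust the paths if it is not.
$projectId = 'adoptium.aqavit';
$json = file_get_contents("https://projects.eclipse.org/json/project/{$projectId}");
$project = json_decode($json, true)['projects'][$projectId] ?? [];

$sourceRepositories = $project['source_repositories'] ?? [];
$githubOrgs         = $project['github_org'] ?? [];
$gitlabGroups       = $project['gl_project_group'] ?? [];
$excludedSubGroups  = $project['gl_excl_sub_groups'] ?? [];
```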
There is potentially some duplication between what is listed in the source_repositories field and what we find via the GitHub organisations, so I filter for that.
For each of the GitHub organisations, I use the GitHub API to get the list of repositories.
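For the GitHub side, the listing step looks roughly like this (GET /orgs/{org}/repos is the documented GitHub REST endpoint; token and error handling are simplified for this sketch):

```php
<?php
// List every repository in a GitHub organisation, following pagination
// (the API caps per_page at 100).
function githubOrgRepos(string $org, string $token): array {
    $repos = [];
    $page = 1;
    do {
        $context = stream_context_create(['http' => ['header' =>
            "User-Agent: repo-lister\r\nAuthorization: Bearer {$token}\r\n"]]);
        $url = "https://api.github.com/orgs/{$org}/repos?per_page=100&page={$page}";
        $batch = json_decode(file_get_contents($url, false, $context), true);
        if (!is_array($batch)) {
            break; // request failed; a real implementation would retry or log
        }
        foreach ($batch as $repo) {
            $repos[] = $repo['html_url'];
        }
        $page++;
    } while (count($batch) === 100);
    return $repos;
}
```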
For each of the GitLab groups, I use the GitLab API to get the list of repositories. I do this recursively, so that subgroups are included, and I prune subtrees rooted in excluded groups during the recursion. Note that, AFAICT, this "excluded groups" feature isn't actually exploited by any projects currently.
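The recursive GitLab walk could look like this (the groups/:id/projects and groups/:id/subgroups endpoints are documented GitLab v4 API calls; pagination and authentication are elided for brevity):

```php
<?php
// Recursively gather project URLs from a GitLab group and its subgroups,
// pruning any subtree whose full path appears in the excluded list.
function gitlabGroupRepos(string $groupPath, array $excluded): array {
    $base = 'https://gitlab.eclipse.org/api/v4/groups/' . rawurlencode($groupPath);

    $projects = json_decode(file_get_contents("{$base}/projects?per_page=100"), true);
    $repos = array_column($projects ?: [], 'web_url');

    $subgroups = json_decode(file_get_contents("{$base}/subgroups?per_page=100"), true);
    foreach ($subgroups ?: [] as $sub) {
        if (in_array($sub['full_path'], $excluded, true)) {
            continue; // prune the recursion at excluded groups
        }
        $repos = array_merge($repos, gitlabGroupRepos($sub['full_path'], $excluded));
    }
    return $repos;
}
```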
With a full list of repositories gathered in this manner, I exclude some repositories that I know to be mirrors or that otherwise do not include Eclipse project code. This includes a number of OpenJDK repositories from the Adoptium subprojects and a bunch of mirror/third-party repositories under Oniro. I also skip all "website" repositories under the assumption that they do not contain project code. These exclusions are all done with a nasty bit of hard-coded regular expressions that an earlier version of myself decided would be a temporary hack.
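For illustration, the exclusion filter amounts to something like the following; the patterns shown are hypothetical stand-ins, not the actual list from the script:

```php
<?php
// Drop repositories matching hard-coded exclusion patterns. The patterns
// below are invented examples of the kinds of rules described above.
function filterRepos(array $repos): array {
    $exclusions = [
        '#/adoptium/jdk[^/]*$#',  // OpenJDK forks under Adoptium (example)
        '#mirror#',               // known mirror repositories (example)
        '#website#',              // "website" repos, assumed to hold no project code
    ];
    return array_values(array_filter($repos, function (string $url) use ($exclusions) {
        foreach ($exclusions as $pattern) {
            if (preg_match($pattern, $url)) {
                return false;
            }
        }
        return true;
    }));
}
```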
I currently have this all implemented twice: once in PHP, and once in Java.
I would very much like to have this all implemented once and maintained by the same team that decides how all of this information is represented.
I totally agree that this functionality would be useful!
As you've highlighted, the IP Team needs this and I can see this endpoint being used across multiple web services and properties that we manage, including the PMI and some of our WG sites.
From my perspective, the best location to create this API is within our existing API infrastructure for projects in the PMI (Drupal).
However, I currently don't see the benefit of implementing this in Drupal 7. This implies that if we decide on Drupal as our platform, we could only start working on this task once the D9 version of the PMI is deployed to production.
@wbeaton , are your existing solutions good enough to buy us enough time to allow us to work on this next year or do you need this sooner?
With this said, from a performance perspective, we'll need to evaluate whether we should build a separate tool for data aggregation, which may be independent of Drupal.
If we proceed in this direction, the Drupal setup could be relatively straightforward. It would simply serve data stored in a database, while the task of data aggregation would be delegated to an independent service that operates at regular intervals.
If we decide to go in that direction, we may be able to start working on that data aggregation tool before the D9 instance of the PMI is deployed.
I understand that it's a bit soon for me to talk about implementation details but wanted to share my preliminary thoughts with the team to help us define the priority level for this task.
Unless I am missing something, my understanding is that deliverable 1 will be solved once we deploy the PMI on Drupal 10?
Can you guys confirm that in the Drupal 10 project API, the project endpoint now returns the GitLab, Gerrit, and GitHub URLs by merging the data we have in the PMI and the Dashboard database?
For deliverable 2, we should discuss whether we want to maintain a separate sync script or create cron jobs in the PMI to sync that data.
> Can you guys confirm that in the Drupal 10 project API, the project endpoint now returns the GitLab, Gerrit, and GitHub URLs by merging the data we have in the PMI and the Dashboard database?
But remember that this is not complete information. You actually have to query GitLab/GitHub to get all of the repos in the group/org.
> Provide an API that exposes all the Git repository URLs for all our projects.
The GitHub Org and GitLab Group information in the PMI only tell us how to find the Git repositories, not the URLs of the Git repositories themselves.
I'm thinking that it needs to be moved and made an official API that is accessible from the usual place.
I have both a PHP and a Java implementation.
I use the PHP implementation to populate a database table as a cache, because the computation is potentially time consuming with multiple calls to GitHub and GitLab APIs.
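For illustration, the caching step is just a table refresh along these lines (the table and column names are hypothetical; the real table is the GitRepo table discussed below):

```php
<?php
// Replace the cached repository list for one project in a single
// transaction. Table and column names are invented for this sketch.
function cacheRepos(PDO $db, string $projectId, array $repos): void {
    $db->beginTransaction();
    $db->prepare('DELETE FROM git_repo_cache WHERE project_id = ?')
       ->execute([$projectId]);
    $insert = $db->prepare('INSERT INTO git_repo_cache (project_id, url) VALUES (?, ?)');
    foreach ($repos as $url) {
        $insert->execute([$projectId, $url]);
    }
    $db->commit();
}
```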
Note that we also agreed on the implementation details based on my comment here and your thumbs up reaction:
#217 (comment 1523099)
In my opinion, the next step would be to implement a solution to take over the synchronization of git URLs in the dashboard database which I identified as deliverable 2:
> Replace/take ownership of @wbeaton's sync script for syncing repo URLs in the dashboard database.
I will admit that having these similar issues is very confusing. The way I see it, deliverable 1 is being done via #217 (closed) and we can use this issue to focus on deliverable 2.
> Can you guys confirm that in the Drupal 10 project API, the project endpoint now returns the GitLab, Gerrit, and GitHub URLs by merging the data we have in the PMI and the Dashboard database?
Yes, using the MR you mention in your comment, I'm adding the missing GitHub and GitLab repos to the projects API.
> The way I see it, deliverable 1 is being done via #217 (closed) and we can use this issue to focus on deliverable 2.
We can also close this issue as a duplicate of #217 (closed) and create a new issue that solely focuses on deliverable 2 since the topic of this issue is "Add an API to get a list of repositories for a project".
I don't think that this is currently a problem, but there's a circular relationship in the current solution that may bite us at some point.
The PMI specifies GitHub repositories using three fields:
GitHub org, which manifests in the API as github/org
Ignored GitHub repositories, which manifests in the API as github/ignored_repos
GitHub Repositories, which manifests in the API as github_repos.
The "Dashboard" process uses this information in the following way:
The value github/org is sent to the GitHub API to get the list of repositories. We add the repositories listed in github_repos to this list. Then, we remove every repository that is inaccessible, archived, or is in github/ignored_repos from the list. The list is further pruned to remove, for example, forks of JDK content from Adoptium projects (this is just hard-coded).
The resulting list of repositories is stored in the GitRepo table of the dashboard database as the repositories associated with the project.
The API adds the contents of the GitRepo table to the Git repositories that are specified in the PMI, and serves the combined results in the github_repos field.
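In outline, the first step is just set arithmetic over URL lists (a sketch; the archived/inaccessible and hard-coded checks are elided):

```php
<?php
// Merge the repos discovered via the GitHub org with those declared in
// github_repos, then remove the exclusions listed in github/ignored_repos.
function dashboardRepoList(array $discovered, array $declared, array $ignored): array {
    $candidates = array_unique(array_merge($discovered, $declared));
    return array_values(array_diff($candidates, $ignored));
}
```

The result is what lands in the GitRepo table, and it is this same list that the API later folds back into github_repos, which is the feedback loop described below.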
There's no general means in the API output of separating the GitHub repositories that are specified in the PMI from those that are discovered in the org.
If you look in the PMI, you'll see only the first repository explicitly listed in the GitHub Repositories field.
The potential issue is that every time the process runs, the contents of the github_repos field are used to generate the contents of the github_repos field.
Since I'm identifying when a repository no longer exists or is archived, I think that most of the bad things that might happen are unlikely to happen. The one case that may bite us is if the owner of a repository is changed (but the repository is not itself moved). I'm pretty sure that magic should just happen if a repository is moved.
Again, I don't think that this is an imminent problem, but it is certainly something that we should keep an eye on. If something does go wrong, debugging this will be a special sort of fun.
I ran into the special sort of fun that I had anticipated today.
The https://github.com/adoptium/openj9-systemtest repository is claimed by both the Eclipse Aqavit and Eclipse OpenJ9 projects. I'm not sure how we arrived at this situation. There is no mention of this repository in the OpenJ9 project's metadata, but I assume that there must've been a reference at some point. Regardless, the database has this association, and the circular relationship that I described above is feeding it back into the script, which causes the relationship to persist.
I've hacked the script to remove the association, so it should disappear when the script runs tonight.
The PMI doesn't do any validation to check if another project has "claimed" a repo, group, or organization, but if you think we should, we could add it to the form validation.
No current configuration in the PMI accounts for this. I assume that this association existed at one point, and -- because of the way that we populate the github_repos field -- it will never go away without intervention.
I've intervened. The tweaks that I've made to the script should cause it to disappear overnight.
To address this issue, the next step would probably be for the PMI itself to fetch all of the additional repositories. Since the PMI should already have the necessary data, the challenge here lies in doing this efficiently.
This should allow us to move away from relying on data from the dashboard and ultimately eliminate this circular relationship.
This could be something the Drupal team takes on in Q1 2025 once we are done with the accounts/packages migration.
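One efficient shape for this, if it does land with the Drupal team, would be to push the slow GitHub/GitLab calls into Drupal's queue system. A sketch, where the module name pmi_repos, the queue id, and the project content type are all assumptions:

```php
<?php
// pmi_repos.module -- enqueue one fetch job per project on each cron run,
// so the expensive API calls happen incrementally rather than in one pass.

/**
 * Implements hook_cron().
 */
function pmi_repos_cron() {
  $nids = \Drupal::entityQuery('node')
    ->condition('type', 'project')
    ->accessCheck(FALSE)
    ->execute();
  $queue = \Drupal::queue('pmi_repos_fetch');
  foreach ($nids as $nid) {
    // Each item is picked up by a QueueWorker plugin (id: pmi_repos_fetch)
    // that queries the GitHub/GitLab APIs for that one project.
    $queue->createItem(['nid' => $nid]);
  }
}
```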
Looking at the settings.php file, I can see that we connect to the database located in dbapi. @mward Is this the right place? Or is there a more up to date version of the database somewhere else? Thanks.
@wbeaton after looking at the code, I'm seeing that there is a typo that prevents repos from dash being added to the list of GitLab repos.
After fixing the typo, I now end up with more than 25 repos, since the code is also fetching repos from the field_project_gitlab_repos field, which is hidden in the form because it was archived in favor of using the GitLab project group field. Note that the code takes care of duplicates and lists URLs only once.
Is this still the correct behaviour we want? Or should we only get repos from Dashboard and ignore data from the hidden field_project_gitlab_repos field?
We should not be listing repositories when we have no control over the list. So, let's not include the items in field_project_gitlab_repos. IMHO, this field should just be deleted.
> Is this still the correct behaviour we want?
The output that we get by grabbing the GitLab repositories from dash should be correct.
The results are correct, but the behaviour is wrong. The circular relationship with the Dash process is going to hurt us at some point. The solution that I want is to generate this list using the data available and the GitLab API, and not get this from Dash.
If I recall correctly, the code that searches repositories uses a simple pattern search for "github" and "gitlab". This will break if anybody creates a repository on GitLab with "github" in the name, or vice versa.
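A more robust classification would key off the URL's host rather than a substring match; a minimal sketch:

```php
<?php
// Classify a repository URL by its host, so a GitLab project with
// "github" in its name (or vice versa) is not misclassified.
function forgeFor(string $url): string {
    $host = parse_url($url, PHP_URL_HOST) ?: '';
    if ($host === 'github.com') {
        return 'github';
    }
    if ($host === 'gitlab.eclipse.org' || $host === 'gitlab.com') {
        return 'gitlab';
    }
    return 'other';
}
```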
Ok sounds good, I'll update my MR to remove the field_project_gitlab_repos field. I'll also create a separate issue to clean up the field_project_gitlab_repos field from the database and code.
> The results are correct, but the behaviour is wrong. The circular relationship with the Dash process is going to hurt us at some point. The solution that I want is to generate this list using the data available and the GitLab API, and not get this from Dash.
As @cguindon was saying above, this is something we will be able to tackle in Q1 of 2025.
@wbeaton - I'm trying to clarify the current status of this issue.
My understanding is that our project API now exposes all repository URLs, so the first deliverable is technically complete. Would you agree? #351 (comment 2748471)
To close the issue, I believe the next step is to develop a mechanism to fetch this data automatically, so we can stop relying on your script to manually add or remove URLs from the dashboard database.
To set expectations, I don't believe my team has cycles to take this on in Q4, but I would like to confirm what is left for us to do here.
The output is generally (but not entirely) correct, but the behaviour is wrong.
The circular relationship has bitten us at least once, requiring hours of my time to untangle and resolve. That we cannot distinguish between GitHub repositories set in the PMI and those that were discovered means that this will almost certainly happen again.
I'd like to withdraw this request. There are a number of mitigating issues that make this less useful than I'd originally hoped.
Before we close this, however, we need to undo what's been done already. Specifically, we need to undo the coupling with the dashboard database, and return the API to a state where it is just returning the data as it is represented in the PMI.
I'm pretty sure that the impact of undoing the change should be minimal. Consumers should be interpreting the Git metadata answered by the Eclipse Projects API (e.g., using GitHub APIs to get lists of repositories based on the GitHub org specified in the API data). If they are doing so, then they should not be impacted by this change.
Before we undo anything, though, we need to make sure that we limit the impact on consumers. AFAIK, the only external consumer of this data is the Bitergia dashboard. @bbaldassari2kd can you please confirm with our Bitergia friends that they are interpreting the Git metadata in the Eclipse Project API in its entirety and not just reading the list of GitHub repositories?
Are there other external consumers that we're aware of?
FYI, I've attempted to capture how one should interpret the Git metadata provided by the project API as an example specification document.
Note that the intention of the document is to provide an example of a specification document; the intention is not to produce an actual specification, and whether or not the actual content is correct is not a priority for the project. Having said that, I'm definitely interested in trying to make the content accurate, so comments/issues and merge requests are welcome.
@wbeaton From what I know, our Bitergia friends use many git-related fields of the API: github or gitlab org, list of github and gitlab repos, and even the ignored_sub_groups/ignore_repos. If I understand correctly, once the changes have been rolled back to the initial pure-PMI data, only the github/gitlab repos with the ignore* fields will be filled. Is that correct?
I can start a thread with them to inform / prepare them for the change, if needed.
PS: For the record, but unrelated: they also use other fields, like technology types and working groups. I'm assuming there will be no change to the non-git fields, right?