To better make sure that we don't miss any of the projects, and keep clean and accurate data, we will create a scheduled task in Quarkus (see Scheduler) to reconcile data that is either missing or out of date. This task should use the Gitlab Projects API with a visibility filter in place to fetch all private projects in the instance. For each page of results, we will make SQL queries to fetch the rows in the database in batches (using something like an ids filter in the DtoFilter) and check that they are up to date.
The different operations we care about:
Current path is up to date (create entry if none match exactly)
If not found, mark all associated entries as deleted
We want to be able to update records as deleted if the system hook endpoint somehow missed a deletion event.
Currently the way Martin and I discussed to do this is to request the GL API for all private projects (which we already do in the scheduled task), grab all ids, query the DB for all PrivateProjectEvent records that do not contain the ids using NOT IN. This would mean they've been deleted. And we could update the record with a deletion date.
This solution might not be ideal, since there could potentially be 100+ ids in the query string, but it shouldn't become a problem unless the number of private projects reaches a significant size. Also, the scheduled task will only run once a week.
Do you have any ideas on how to go about this? Do you expect there to be more than 100-200 private projects at one time?
Another more expensive way of doing this is to query the DB for all PrivateProjectEvent records, request the GL API for all private projects, check each DB record and see if it exists in the list of private repos. If it exists in the DB but not on GL, update the record with a deletion date.
This solution might not be ideal, since there could potentially be 100+ ids in the query string, but it shouldn't become a problem unless the number of private projects reaches a significant size. Also, the scheduled task will only run once a week.
I think you can optimize the query by adding an index to your id column. However, that not needED if the id column is a primary key. Those are automatically indexed.
How big is this table?
Is this the table with all the events from the GL webhook or is this a table with only private repo data?
The number of rows in the table does make a difference here and how often we write to that table should also be considered. With that said, it will be important to use the secondary (read-only) databases for this query.
You can also optimize your query by adding conditions such as deletetion_date is not NULL.
Do you have any ideas on how to go about this? Do you expect there to be more than 100-200 private projects at one time?
Do we know how many we have now? It's hard to tell, we are building this reporting tool to give us some more insight into all of this. However, I would expect the number to increase overtime as more and more projects migrate to Eclipse Gitlab.
The table is for all the hook events. It isn't very large, only 6 columns. Here's what it looks like:
CREATE TABLE IF NOT EXISTS `PrivateProjectEvent` ( `userId` int(10) NOT NULL, `projectId` int(10) NOT NULL, `projectPath` varchar(255) NOT NULL, `parentProject` int(10) DEFAULT NULL, `creationDate` datetime NOT NULL, `deletionDate` datetime DEFAULT NULL, PRIMARY KEY (`userId`, `projectId`, `projectPath`) );
This currently isn't live, and I have no idea how quickly this table will fill. I'll do what I can do what I can to optimize the query for now.
I can't currently do this without a valid token, but this is how we would check on our private projects: https://gitlab.eclipse.org/api/v4/projects?visibility=private&per_page=100
I think Chris is asking about the projected size in rows, not the columns. This could be a moderately sized table or large depending on how many projects go through it, but it isn't very large or in the realms of BigData by any means. We'll have 1->*, where each rename event we catch will add another row for tracking history, but most private projects will probably only have 1 row.
Reopening as this isn't live, and is awaiting a key. The key has been requested on the parent issue, and have pinged the people involved for generating the key!