Add a scheduled job to reconcile data that is missed by system hooks

added statetodo label

assigned to @zacharysabourin

added to epic eclipsefdn/it&5 (closed)

marked this issue as related to eclipsefdn/helpdesk#1717 (closed)

added statewip label and removed statetodo label

added #93 (closed) as child task

added #94 (closed) as child task

added #95 (closed) as child task

added #96 (closed) as child task

added #97 (closed) as child task

mentioned in merge request !110 (merged)

@cguindon

We want to be able to update records as deleted if the system hook endpoint somehow missed a deletion event. Currently the way Martin and I discussed to do this is to request the GL API for all private projects (which we already do in the scheduled task), grab all ids, query the DB for all PrivateProjectEvent records that do not contain the ids using NOT IN. This would mean they've been deleted. And we could update the record with a deletion date.

This solution might not be ideal, since there could potentially be 100+ ids in the query string, but it shouldn't become a problem unless the number of private projects reaches a significant size. Also, the scheduled task will only run once a week.

Do you have any ideas on how to go about this? Do you expect there to be more than 100-200 private projects at one time?

Another more expensive way of doing this is to query the DB for all PrivateProjectEvent records, request the GL API for all private projects, check each DB record and see if it exists in the list of private repos. If it exists in the DB but not on GL, update the record with a deletion date.

This solution might not be ideal, since there could potentially be 100+ ids in the query string, but it shouldn't become a problem unless the number of private projects reaches a significant size. Also, the scheduled task will only run once a week.

I think you can optimize the query by adding an index to your id column. However, that not needED if the id column is a primary key. Those are automatically indexed.

How big is this table?

Is this the table with all the events from the GL webhook or is this a table with only private repo data?

The number of rows in the table does make a difference here and how often we write to that table should also be considered. With that said, it will be important to use the secondary (read-only) databases for this query.

You can also optimize your query by adding conditions such as deletetion_date is not NULL.

Do you have any ideas on how to go about this? Do you expect there to be more than 100-200 private projects at one time?

Do we know how many we have now? It's hard to tell, we are building this reporting tool to give us some more insight into all of this. However, I would expect the number to increase overtime as more and more projects migrate to Eclipse Gitlab.

The table is for all the hook events. It isn't very large, only 6 columns. Here's what it looks like:

CREATE TABLE IF NOT EXISTS `PrivateProjectEvent` ( `userId` int(10) NOT NULL, `projectId` int(10) NOT NULL, `projectPath` varchar(255) NOT NULL, `parentProject` int(10) DEFAULT NULL, `creationDate` datetime NOT NULL, `deletionDate` datetime DEFAULT NULL, PRIMARY KEY (`userId`, `projectId`, `projectPath`) );

This currently isn't live, and I have no idea how quickly this table will fill. I'll do what I can do what I can to optimize the query for now.

I can't currently do this without a valid token, but this is how we would check on our private projects: https://gitlab.eclipse.org/api/v4/projects?visibility=private&per_page=100

I think Chris is asking about the projected size in rows, not the columns. This could be a moderately sized table or large depending on how many projects go through it, but it isn't very large or in the realms of BigData by any means. We'll have 1->*, where each rename event we catch will add another row for tracking history, but most private projects will probably only have 1 row.

So we believe/estimate it going to take years before we reach 100k rows?

Unless we get botted supremely hard for a long time, yes

I think we can do some performance testing to see how much time these requests will take. I also think there is an option here to split the queries.

added statereview label and removed statewip label

mentioned in commit ceb411b2

closed with merge request !110 (merged)

reopened

Reopening as this isn't live, and is awaiting a key. The key has been requested on the parent issue, and have pinged the people involved for generating the key!

this is live

closed

Add a scheduled job to reconcile data that is missed by system hooks

Designs

Child items ...

Activity

Add a scheduled job to reconcile data that is missed by system hooks

Relates to

Activity