```sql
-- Reset migration 1.23
DROP TABLE public.extract_resources_migration_item;

-- Reset migration 1.24
ALTER TABLE extension DROP COLUMN published_date;
ALTER TABLE extension DROP COLUMN last_updated_date;

-- Rerun database migrations
DELETE FROM flyway_schema_history WHERE version = '1.23' OR version = '1.24';
```
If the deployment to staging runs without exception (maybe let it run for 5 min), then it can be deployed to production.
Before deploying to production:
Please make a database backup (just to be safe).
If migrations 1.23 and 1.24 have run to completion on production (`SELECT * FROM flyway_schema_history WHERE version = '1.23' OR version = '1.24';`), then use the same database script as above.
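A quick sketch of that check, additionally looking at the `success` column (this assumes the default Flyway history table name):

```sql
-- Confirm both migrations are recorded and completed successfully;
-- success = false would indicate a partially applied migration.
SELECT version, description, success, installed_on
FROM flyway_schema_history
WHERE version IN ('1.23', '1.24');
```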
I ran the SQL commands above (on the staging db) and then re-ran the build/deployment job for staging, but the pod failed to start, apparently due to errors in the db migration:
```
SQL State  : 42P07
Error Code : 0
Message    : ERROR: relation "extract_resources_migration_item" already exists
Location   : db/migration/V1_23__FileResource_Extract_Resources.sql (/home/openvsx/server/BOOT-INF/classes/db/migration/V1_23__FileResource_Extract_Resources.sql)
Line       : 1
Statement  : CREATE TABLE public.extract_resources_migration_item (
    id bigint NOT NULL,
    extension_id bigint NOT NULL,
    migration_scheduled boolean NOT NULL
)

2022-07-20 15:33:39.766  INFO 1 --- [           main] o.apache.catalina.core.StandardService   : Stopping service [Tomcat]
2022-07-20 15:33:39.770  WARN 1 --- [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat
2022-07-20 15:33:39.772  INFO 1 --- [           main] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown initiated...
2022-07-20 15:33:39.775  INFO 1 --- [           main] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown completed.
2022-07-20 15:33:39.848  INFO 1 --- [           main] ConditionEvaluationReportLoggingListener : Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2022-07-20 15:33:39.868  WARN 1 --- [           main] o.s.boot.SpringApplication               : Unable to close ApplicationContext
java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [db/migration/V1_23__FileResource_Extract_Resources.sql]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
```
The SQL error repeats a few times (with slightly different stack-trace details) and the container enters the CrashLoopBackOff state.
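For what it's worth, a minimal sketch (assuming PostgreSQL and the `public` schema) for checking which of the 1.23/1.24 leftovers actually exist before running the drop statements:

```sql
-- Returns NULL if the table from migration 1.23 does not exist.
SELECT to_regclass('public.extract_resources_migration_item');

-- Lists whichever of the two 1.24 columns are still present on extension.
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'extension'
  AND column_name IN ('published_date', 'last_updated_date');
```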
I've run the updated commands and re-run the build/deploy and the pod is now up and running.
I'm not absolutely certain about the process here, but I'm going to suggest we let this updated instance run overnight; if it's still happy in the morning, we can merge your PR, back up the production db, run whichever of the above commands are required, then run the production build/deploy.
I've created https://github.com/EclipseFdn/open-vsx.org/pull/1136 to try and merge the main branch into production (my Git skills are not that advanced, so I may have gotten this 'backwards'). @amvanbaren can you confirm the merge is going in the right 'direction'?
The build of the production branch has succeeded, one new pod is running, and the logs look OK (so far). At this time I think we just need to wait for the db 'update' process to finish.
Looks like the new deployment has finished and the pods have been stable since my last update. A quick peek at the logs reveals mostly what look like HTTP errors:
```
2022-07-21 15:53:53.601 ERROR 1 --- [.0-8080-exec-47] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.http.InvalidMediaTypeException: Invalid mime type "api-version=6.1-preview.1": does not contain '/'] with root cause

org.springframework.util.InvalidMimeTypeException: Invalid mime type "api-version=6.1-preview.1": does not contain '/'
```
But since things are operational I think we're done here.
@kineticSquid tagged me on a different channel. Just to test, I restarted the pods; the error he reported has gone away (presumably temporarily), and I don't see a 503 when I request the main page, at least.
The last error entry (in my shell history, at least) that looks relevant is:
```
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Connection reset by peer
```
@mward Can you see how many concurrent requests there are for the /api/-/publish endpoint?
I think some people got very excited that target platforms are now supported.
In that case you should also see in the logs that Hikari (the database connection pool) runs out of connections.
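One way to double-check that from the database side (a sketch, assuming direct PostgreSQL access to the database the pool connects to):

```sql
-- Count connections per state for the current database; an exhausted pool
-- usually shows up as the pool's maximum number of connections, many of
-- them 'active' or 'idle in transaction'.
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY connections DESC;
```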
Based on the gateway logs there have been only 70 requests for that URL in the last 16 hours.
If I capture the logs I see a lot of:
```
2022-07-21 19:58:03.935 ERROR 1 --- [0-8080-exec-160] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.data.elasticsearch.UncategorizedElasticsearchException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [737730]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [737730]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]];] with root cause

org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [737730]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]
```
I also see:
```
2022-07-21 20:07:23.184 ERROR 1 --- [taskScheduler-1] o.e.o.storage.AzureDownloadCountService  : Failed to process BlobItem: resourceId=/subscriptions/c09b2dac-dbcd-4efa-b9cb-97aaeb57dba8/resourceGroups/open-vsx.org/providers/Microsoft.Storage/storageAccounts/openvsxorg/blobServices/default/y=2022/m=03/d=15/h=20/m=00/PT1H.json

java.lang.IllegalArgumentException: Illegal character in path at index 105:
```
@mward PR #485 fixes the ElasticSearch issue. I can merge it into master and we can then deploy it.
The AzureDownloadCountService hasn't changed in this release. Did the AzureDownloadCountService exceptions start to pop up today after the release?
@mward And the logs of old pods are deleted immediately after the pod is shut down?
OK, I'll merge the PR. If that doesn't fix things, then try reverting.
Please do save the log of the pod, so that I can take a look at it.
I'm seeing that as well. From the gateway's perspective it's as if the HTTP server process on the pods has crashed but hasn't actually died: it's still accepting connections, even if it isn't answering them.
Tons of these queries:
```sql
select versions0_.extension_id as extensi24_7_0_, versions0_.id as id1_7_0_, versions0_.id as id1_7_1_, versions0_.active as active2_7_1_, versions0_.bugs as bugs3_7_1_, versions0_.bundled_extensions as bundled_4_7_1_, versions0_.categories as categori5_7_1_, versions0_.dependencies as dependen6_7_1_, versions0_.description as descript7_7_1_, versions0_.display_name as display_8_7_1_, versions0_.engines as engines9_7_1_, versions0_.extension_id as extensi24_7_1_, versions0_.extension_kind as extensi10_7_1_, versions0_.gallery_color as gallery11_7_1_, versions0_.gallery_theme as gallery12_7_1_, versions0_.homepage as homepag13_7_1_, versions0_.license as license14_7_1_, versions0_.markdown as markdow15_7_1_, versions0_.pre_release as pre_rel16_7_1_, versions0_.preview as preview17_7_1_, versions0_.published_with_id as publish25_7_1_, versions0_.qna as qna18_7_1_, versions0_.repository as reposit19_7_1_, versions0_.tags as tags20_7_1_, versions0_.target_platform as target_21_7_1_, versions0_.timestamp as times
```
I think part of the issue is that, out of naivete, I presumed that if I re-ran the older builds they would 're-run' from the codebase as it was at that point in time. That clearly isn't the case, so I'm currently deleting/recreating the deployment (and the db as well) using the various already-published containers, trying to find one that works.
@fgurr That would indicate NGINX is closing the connection, because Open VSX takes too long to respond.
Are there also 'upstream timed out' errors in the NGINX error.log file?
Do you get this exception for the current reverted release or is this from yesterday evening's logs?
I've updated the deployment to use the image tagged as '9ab8c88-2', and that seems to have gotten things back into a running state without any DB upgrades.
I'm still seeing the ElasticSearch window size errors, but I'm going to let this run for a few hours instead of continuing to experiment with different images.
Yes, I did try that image, but I also tried the 'newer' images as well in case they contained fixes, and the one I settled on seemed to be the newest image that worked without issue (although I didn't test all of them). I was specifically looking for an image that didn't 'update' the db.
I swear I searched Slack to try and find his comments, but clearly I can't operate the search feature, or I keep picking lousy search terms.
At this time things seem stable, so we're just going to let it run over the weekend. We'll need to get the log data to @amvanbaren so he can try and figure out what's failing with the latest updates. But that will have to wait until Monday at the earliest.
I've requested a review from @mbarbero of the PR Aart linked to, just in case I'm missing some details.
Beyond that, I've reset the staging db and tried to roll out a deployment with the latest image, but I'm getting errors on image pull, so I'm not sure if Aart's PR is meant to address the image pull error or something else.
@kineticSquid Yes, I got a performance baseline for staging.
@mward I deployed PR #1186 to staging this morning. It contains fixes that should improve the performance of the new release. I just checked if everything is running OK, but I'm getting internal server errors. Can you check for exceptions in the staging logs?
The container restarted about 6h ago, but I've captured the current logs and sent them to you, since it's probably faster for you to figure out what the problem is.
@mward I've opened PR #1197. The EHCACHE_CONFIG environment variable needs to point to the configuration/ehcache.xml file, so that the path to the file is included in the application.properties file. It should be similar to the way the DEPLOYMENT_CONFIG variable is configured.