```sql
-- Reset migration 1.23
DROP TABLE public.extract_resources_migration_item;

-- Reset migration 1.24
ALTER TABLE extension DROP COLUMN published_date;
ALTER TABLE extension DROP COLUMN last_updated_date;

-- Rerun database migrations
DELETE FROM flyway_schema_history WHERE version = '1.23' OR version = '1.24';
```
If the deployment to staging runs without exception (maybe let it run for 5 min), then it can be deployed to production.
Before deploying to production:
Please make a database backup (just to be safe).
If migrations 1.23 and 1.24 have run to completion on production (`SELECT * FROM flyway_schema_history WHERE version = '1.23' OR version = '1.24';`), then use the same database script as above.
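A quick sketch of that check, additionally looking at the `success` column (this assumes the default Flyway history table name):

```sql
-- Confirm both migrations are recorded and completed successfully;
-- success = false would indicate a partially applied migration.
SELECT version, description, success, installed_on
FROM flyway_schema_history
WHERE version IN ('1.23', '1.24');
```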
I ran the SQL commands above (on the staging db) and then re-ran the build/deployment job for staging, but the pod failed to start, apparently due to errors in the db migration:
```
SQL State  : 42P07
Error Code : 0
Message    : ERROR: relation "extract_resources_migration_item" already exists
Location   : db/migration/V1_23__FileResource_Extract_Resources.sql (/home/openvsx/server/BOOT-INF/classes/db/migration/V1_23__FileResource_Extract_Resources.sql)
Line       : 1
Statement  : CREATE TABLE public.extract_resources_migration_item (
    id bigint NOT NULL,
    extension_id bigint NOT NULL,
    migration_scheduled boolean NOT NULL
)

2022-07-20 15:33:39.766  INFO 1 --- [           main] o.apache.catalina.core.StandardService   : Stopping service [Tomcat]
2022-07-20 15:33:39.770  WARN 1 --- [           main] ConfigServletWebServerApplicationContext : Exception encountered during context initialization - cancelling refresh attempt: org.springframework.context.ApplicationContextException: Unable to start web server; nested exception is org.springframework.boot.web.server.WebServerException: Unable to start embedded Tomcat
2022-07-20 15:33:39.772  INFO 1 --- [           main] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown initiated...
2022-07-20 15:33:39.775  INFO 1 --- [           main] com.zaxxer.hikari.HikariDataSource       : HikariPool-1 - Shutdown completed.
2022-07-20 15:33:39.848  INFO 1 --- [           main] ConditionEvaluationReportLoggingListener : Error starting ApplicationContext. To display the conditions report re-run your application with 'debug' enabled.
2022-07-20 15:33:39.868  WARN 1 --- [           main] o.s.boot.SpringApplication               : Unable to close ApplicationContext
java.lang.IllegalStateException: Illegal access: this web application instance has been stopped already. Could not load [db/migration/V1_23__FileResource_Extract_Resources.sql]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
```
The SQL error repeats a few times (with slightly different stack-trace details) and the container enters the CrashLoopBackOff state.
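For what it's worth, a minimal sketch (assuming PostgreSQL and the `public` schema) for checking which of the 1.23/1.24 leftovers actually exist before running the drop statements:

```sql
-- Returns NULL if the table from migration 1.23 does not exist.
SELECT to_regclass('public.extract_resources_migration_item');

-- Lists whichever of the two 1.24 columns are still present on extension.
SELECT column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'extension'
  AND column_name IN ('published_date', 'last_updated_date');
```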
I've run the updated commands and re-run the build/deploy and the pod is now up and running.
I'm not absolutely certain about the process here, but I'm going to suggest we let this updated instance run overnight; if it's still happy in the morning, we can merge your PR, back up the production db, run whichever of the above commands are required, then run the production build/deploy.
I've created https://github.com/EclipseFdn/open-vsx.org/pull/1136 to try and merge the main branch into production (my Git skills are not that advanced, so I may have gotten this 'backwards'). @amvanbaren can you confirm the merge is going in the right 'direction'?
The build of the production branch has succeeded, one new pod is running, and the logs look OK (so far). At this time I think we just need to wait for the db 'update' process to finish.
Looks like the new deployment has finished and the pods have been stable since my last update. A quick peek at the logs reveals mostly what look like HTTP errors:
```
2022-07-21 15:53:53.601 ERROR 1 --- [.0-8080-exec-47] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.http.InvalidMediaTypeException: Invalid mime type "api-version=6.1-preview.1": does not contain '/'] with root cause

org.springframework.util.InvalidMimeTypeException: Invalid mime type "api-version=6.1-preview.1": does not contain '/'
```
But since things are operational I think we're done here.
@kineticSquid tagged me on a different channel. Just to test, I restarted the pods; the error he reported has gone away (presumably temporarily), and I don't see a 503 when I request the main page, at least.
The last error entry (in my shell history, at least) that looks relevant is:
```
org.apache.catalina.connector.ClientAbortException: java.io.IOException: Connection reset by peer
```
@mward Can you see how many concurrent requests there are for the /api/-/publish endpoint?
I think some people got very excited that target platforms are now supported.
In that case you should also see in the logs that Hikari (the database connection pool) runs out of connections.
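One way to double-check that from the database side (a sketch, assuming direct PostgreSQL access to the database the pool connects to):

```sql
-- Count connections per state for the current database; an exhausted pool
-- usually shows up as the pool's maximum number of connections, many of
-- them 'active' or 'idle in transaction'.
SELECT state, count(*) AS connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY connections DESC;
```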
Based on the gateway logs there have been only 70 requests for that URL in the last 16 hours.
If I capture the logs I see a lot of:
```
2022-07-21 19:58:03.935 ERROR 1 --- [0-8080-exec-160] o.a.c.c.C.[.[.[/].[dispatcherServlet]    : Servlet.service() for servlet [dispatcherServlet] in context with path [] threw exception [Request processing failed; nested exception is org.springframework.data.elasticsearch.UncategorizedElasticsearchException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [737730]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [737730]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]];] with root cause

org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [737730]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]
```
I also see:
```
2022-07-21 20:07:23.184 ERROR 1 --- [taskScheduler-1] o.e.o.storage.AzureDownloadCountService  : Failed to process BlobItem: resourceId=/subscriptions/c09b2dac-dbcd-4efa-b9cb-97aaeb57dba8/resourceGroups/open-vsx.org/providers/Microsoft.Storage/storageAccounts/openvsxorg/blobServices/default/y=2022/m=03/d=15/h=20/m=00/PT1H.json

java.lang.IllegalArgumentException: Illegal character in path at index 105:
```
@mward PR #485 fixes the ElasticSearch issue. I can merge it into master and we can then deploy it.
The AzureDownloadCountService hasn't changed in this release. Did the AzureDownloadCountService exceptions start to pop up today after the release?
@mward And the logs of old pods are deleted immediately after the pod is shut down?
OK, I'll merge the PR. If that doesn't fix things, then try reverting.
Please do save the log of the pod, so that I can take a look at it.
I'm seeing that as well. From the gateway's perspective it's as if the HTTP server process on the pods has crashed but hasn't actually died: it's still accepting connections, even if it isn't answering them.
Tons of these queries:
```sql
select versions0_.extension_id as extensi24_7_0_, versions0_.id as id1_7_0_, versions0_.id as id1_7_1_, versions0_.active as active2_7_1_, versions0_.bugs as bugs3_7_1_, versions0_.bundled_extensions as bundled_4_7_1_, versions0_.categories as categori5_7_1_, versions0_.dependencies as dependen6_7_1_, versions0_.description as descript7_7_1_, versions0_.display_name as display_8_7_1_, versions0_.engines as engines9_7_1_, versions0_.extension_id as extensi24_7_1_, versions0_.extension_kind as extensi10_7_1_, versions0_.gallery_color as gallery11_7_1_, versions0_.gallery_theme as gallery12_7_1_, versions0_.homepage as homepag13_7_1_, versions0_.license as license14_7_1_, versions0_.markdown as markdow15_7_1_, versions0_.pre_release as pre_rel16_7_1_, versions0_.preview as preview17_7_1_, versions0_.published_with_id as publish25_7_1_, versions0_.qna as qna18_7_1_, versions0_.repository as reposit19_7_1_, versions0_.tags as tags20_7_1_, versions0_.target_platform as target_21_7_1_, versions0_.timestamp as times
```
I think part of the issue is that, out of naivete, I presumed that if I re-ran the older builds they would 're-run' from the codebase as it was at that point in time. That clearly isn't the case, so I'm currently deleting/recreating the deployment (and the db as well) using the various already-published containers, trying to find one that works.
@fgurr That would indicate NGINX is closing the connection, because Open VSX takes too long to respond.
Are there also 'upstream timed out' errors in the NGINX error.log file?
Do you get this exception for the current reverted release or is this from yesterday evening's logs?
I've updated the deployment to use the image tagged as '9ab8c88-2', and that seems to have gotten things back into a running state without any DB upgrades.
I'm still seeing the ElasticSearch window size errors, but I'm going to let this run for a few hours instead of continuing to experiment with different images.
Yes, I did try that image, but I also tried the 'newer' images as well in case they contained fixes, and the one I settled on seemed to be the newest image that worked without issue (although I didn't test all of them). I was specifically looking for an image that didn't 'update' the db.
I swear I searched Slack to try and find his comments, but clearly I can't operate the search feature, or I keep picking lousy search terms.
At this time things seem stable, so we're just going to let it run over the weekend. We'll need to get the log data to @amvanbaren so he can try and figure out what's failing with the latest updates. But that will have to wait until Monday at the earliest.
I've requested a review from @mbarbero of the PR Aart linked to, just in case I'm missing some details.
Beyond that, I've reset the staging db and tried to roll out a deployment with the latest image, but I'm getting errors on image pull, so I'm not sure if Aart's PR is meant to address the image pull error or something else.
@kineticSquid Yes, I got a performance baseline for staging.
@mward I deployed PR #1186 to staging this morning. It contains fixes that should improve the performance of the new release. I just checked if everything is running OK, but I'm getting internal server errors. Can you check for exceptions in the staging logs?
The container restarted about 6h ago, but I've captured the current logs and sent them to you, since it's probably faster for you to figure out what the problem is.
@mward I've opened PR #1197. The EHCACHE_CONFIG environment variable needs to point to the configuration/ehcache.xml file, so that the path to the file is included in the application.properties file. It should be similar to the way the DEPLOYMENT_CONFIG variable is configured.