Upgrade from 13.0 to 13.1 breaks the box
SMA Principal Developer here. The dev team is following this issue closely, but we cannot reproduce it in-house and are unable to diagnose it without investigating an affected system pre- and post-upgrade. There are only 2 customers in this thread with the same upgrade issue/symptoms (described in the original post) so far, and both are in touch with Support to continue investigating. If you are affected and have not yet reached out to our Support team, we encourage you to do so. We'll need your cooperation to get to the root cause.
This appears to be an isolated issue, as hundreds of customers have upgraded to 13.1 without issue. Unfortunately, until we understand what is going on here we cannot offer any sort of solution.
Rest assured we are taking this issue very seriously, and we apologize for the inconvenience caused to our affected customers.
EDIT 5/11/23 @ 12:19PM CDT:
Root cause discovered with assistance from dstarrisom (thanks so much!). This issue is being tracked as K1-34089, which can be referenced when discussing with Support. The issue is caused by the secure remote database access certificate (configurable under Control Panel -> Security Settings) being either expired or too weak for the upgraded versions of OpenSSL and MySQL in SMA 13.1.74. Customers can preemptively regenerate these certificates prior to upgrade (including those who hit this and reverted snapshots or restored from backups): on the Control Panel -> Security Settings page, choose the "Override Default Certificates" option under "Enable Secure database access (SSL)", check the box labeled "Reset to Default Certificate Files", and click "Save and Restart Services". This regenerates the certificates prior to upgrade, and everything should come up fine after your next upgrade attempt. If you are unable to or would rather not revert/restore, the issue can also be resolved post-upgrade by Support staff, but that requires backend access and is not something you can DIY.
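For anyone who wants to check whether their certificate is at risk before attempting the upgrade, here is a minimal sketch using standard `openssl` commands to test the two conditions named above (expiry and key strength). The file paths are placeholders, not the SMA's real certificate locations (on the appliance the certificate is managed from Control Panel -> Security Settings, not edited by hand); the block generates a throwaway self-signed certificate purely so the checks are runnable.

```shell
# Placeholder files for illustration -- not real SMA paths.
CERT=demo-cert.pem
KEY=demo-key.pem

# For illustration only: create a throwaway 2048-bit self-signed certificate.
openssl req -x509 -newkey rsa:2048 -keyout "$KEY" -out "$CERT" \
    -days 365 -nodes -subj "/CN=sma-db-demo" 2>/dev/null

# Expiry check: will the certificate still be valid in 30 days? (exit 0 = yes)
openssl x509 -in "$CERT" -noout -checkend $((30*24*3600)) \
    && echo "certificate valid for at least 30 more days"

# Strength check: newer OpenSSL/MySQL builds reject short RSA keys.
BITS=$(openssl rsa -in "$KEY" -noout -text 2>/dev/null | grep -o '[0-9]* bit')
echo "key size: $BITS"
```

If either check fails on your own certificate, regenerating via the "Reset to Default Certificate Files" option described above is the supported path.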
Thank you to everyone for your patience and assistance in finding a swift solution for this issue. We will take steps to prevent this from occurring with future upgrades.
Can we send you a backup of our DB? You could restore it to a 13.0 VM and then attempt an upgrade.
Our Case number is 02097115 - ager01 2 weeks ago
I've added a lengthy edit to my original post and want to make sure you see the details. Support should be reaching out to you as well. - airwolf 2 weeks ago
I have an 11:30 AM (eastern) call set up with support (case 02104841) and one of your engineers to GoToMeeting into the system with root credentials. Are you interested in joining? It's a VM and I take a snapshot of it powered off, so we can do all sorts of testing on it afterwards. We have already declared a maintenance window to facilitate this meeting. - dstarrisom 2 weeks ago
I am happy to report that "Override Default Certificates" prior to upgrade resolved our upgrade issue and we are now running version 13.1. Please post this guidance on the download page to save us hours of downtime. Thanks! - ager01 2 weeks ago
I'm glad you were able to upgrade successfully! Support is working on publishing the details of this issue and workaround to the portal / knowledge base. - airwolf 2 weeks ago
This is really unusual, but Support is the only one who can help.
Is this a physical or virtual appliance?
Usually, restoring is the fastest way, so verify with Support what happened.
It is also helpful to open a tether before the update (since afterwards it seems not to be possible, according to what you wrote).
Thanks for posting your findings team,
These screenshots point to the mysqld daemon being down/not running.
Without MySQL, many other services will not start.
We could start by checking the upgrade.log file to see how far it got; if it failed midway, it might be a good idea to restore from backups and perform a mysqlcheck.
Contact support for this.
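For reference, checking how far the upgrade got can be sketched like this. The log filename and location are assumptions (ask Support for the actual path on your appliance); the block fabricates a short log matching the format posted later in this thread, just so the commands are runnable.

```shell
# Assumed log name -- the real path on the appliance may differ.
LOG=upgrade.log

# Fabricated sample matching the format posted in this thread,
# so the inspection commands below have something to run against.
cat > "$LOG" <<'EOF'
[Sun May 7 11:59:50 CEST 2023] [notice] restore_report_schedules done
[Sun May 7 11:59:50 CEST 2023] [notice] DB update completed.
[Sun May 7 11:59:50 CEST 2023] [notice] Starting software updates ...
[Sun May 7 11:59:56 CEST 2023] [notice] applying Infrastructure Upgrades...
EOF

# The last line tells you which step the upgrade reached before it stalled.
tail -n 1 "$LOG"

# Any explicit errors, if the run failed outright rather than hanging:
grep -i 'error\|fail' "$LOG" || echo "no errors logged -- the upgrade hung silently"
```

If the log ends mid-step with no error, the failure has to be diagnosed on the hung system, which is why Support asks for tether access.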
Also to everyone here, snapshots are not supported:
See this post:
Snapshotting a VM while mysqld is writing transactions into the DB is a fast track to filesystem and DB issues. Ideally you should power off the VM, take the snapshot, and then power the VM back on... but this scenario rarely happens in real life, where business runs 24/7 and downtime windows are limited.
Back to the first link.
See this post for a related topic:
You might want to share your case numbers here, since part of the support team is monitoring ITNinja.
(The outcome might be the same, i.e. the upgrade failed/crashed the SMA, but the underlying reason might be different for some of you.)
My snapshots were taken while the system was off (during a maintenance window for the upgrade).
The problem has been trying to find a coordinated time for support to review my appliance while it's in the broken state. This session with support is now scheduled for late morning eastern time, tomorrow.
My case number (02104841) is also in my first reply to this question from earlier today. I've also specifically asked support to review this question/posting. - dstarrisom 2 weeks ago
We had the same problem over the weekend. (VMware)
[Sun May 7 11:59:56 CEST 2023] [notice] applying Infrastructure Upgrades...
[Sun May 7 11:59:50 CEST 2023] [notice] Starting software updates ...
[Sun May 7 11:59:50 CEST 2023] [notice] DB update completed.
[Sun May 7 11:59:50 CEST 2023] [notice] restore_report_schedules done
After this, nothing more happened, so I contacted Support. After they looked at the appliance, they said that we need to reinstall the appliance and restore the last backup. I hope the next try runs fine, but this time with a snapshot of the VM.
And this is why we snapshot before applying patches. If your virtualisation environment can snapshot, always snapshot before applying updates. Just remember to remove them afterwards, otherwise you will cause yourself other issues. - Norlag 2 weeks ago
We used to do that all the time, but for some reason we forgot this time. - CarstenBuscher 2 weeks ago
This issue is not related to the issue the other customers are having in this thread. If your upgrade is hanging in the middle like that, it's not something that can be diagnosed without inspecting the system while it's hung. - airwolf 2 weeks ago
I reported this, and a support technician got in touch and restarted the VM. After that, we set the SMA up again and restored the backup, after which the agents stopped updating their inventory.
All in all, I have to say that the quality of the releases has dropped significantly in the last two years. Such problems did not exist before. - CarstenBuscher 2 weeks ago
Bugs exist, have always existed, and will always exist in all software. There has not been any increase in agent communication issues, or bugs in general, between recent versions. In fact, we fixed a record-breaking number of bugs in the 13.0 release.
Please continue to work with support and feel free to have them pass along any feedback you'd like to the product management team. - airwolf 2 weeks ago
I restored the backup last night and everything seemed to be working fine again. However, over the course of the day it became apparent that the computers create the inventory locally, but this is not updated in the SMA. We uninstalled and reinstalled the agent on several computers as a test, but that didn't bring any improvement either. - CarstenBuscher 2 weeks ago
You'll have to work with support to diagnose that issue, as there are countless factors that come into play with agent inventory uploading and processing, including environmental concerns. - airwolf 2 weeks ago
I believe we have the same issue. Case#02104841
Our system is virtual, so I did a snapshot pre-upgrade and just kept rolling back whenever it'd fail.
Support was given tether access to the system. They did something to the DB and asked us to try upgrading again. Still broken. Support gave up and asked us to start with a new OVF deployment and restore the backup.
I originally asked Support if multiple customers were complaining about this and was told, "No, it's just you." I'll be replying with a link to this thread in my case, suggesting they have a systemic issue.
Absolutely identical! The errors start appearing when MySQL is getting initialized. The box retains its network connection and can be SSH'd into, but when I initially contacted Support hoping they'd want to SSH in and see what's happening, they instead suggested starting with a new 13.0 image and a DB restore. The restore worked as expected, but another update to 13.1 failed with the same symptoms. What was your experience with Quest Support?
I cannot leave it broken for any length of time and coordinating with support could take hours. After a failed attempt, I usually just roll back. Might have to consider some additional coordination or just rolling out the 13.1 OVF and trying to reload the 13.0 backups.
Per Support, we updated to 13.0 to see if an issue with shell script dependencies NOT pulling from replication shares would be fixed... (it was NOT).
Then we updated to 13.1 (had NO ISSUES with the update itself), but the shell script dependencies issue is still BROKEN (and Support confirmed their test lab has the same issue).
However, after the update to 13.0 (and 13.1), if we try to email in a process-starting ticket to our employment queue, the process does NOT launch properly!!
And Support confirmed that they are having the same issue on their end... it has been over a week... many, many logs, tether enabled, etc.
just FYI to anyone else that might trigger major processes through email...
Luckily the first ticket (the parent ticket) is created, and we can start a manual process with the info from it, but what a pain in the arse, considering all the custom work (and paid professional services work) that is now not 100% functional because of their crappy update(s).
Also, if you have reports scheduled to email their results (in HTML or TXT format) into a ticket queue, the .html or .txt file is somehow removed during the process (ugh!) and you get a ticket without the attachment.
Hope they get this fixed soon!