Following the scheduled chain halt at block height 182 000, the Dry Run chain stopped producing blocks from 15:17 UTC Thursday until ~16:00 UTC Friday, so ~26 hours.
It may seem bad, but it actually went incredibly well. We wanted to squeeze a state migration test in before the Dec 3 mainnet launch, so there wasn’t much time to do it. Thanks to everyone who stuck it out amidst the challenges.
Some observations:
1. This upgrade was pretty short-notice
The notice also wasn’t widespread (e.g. via email or the announcements channel), and the upgrade fell on US Thanksgiving Day.
- unless it’s an emergency, let’s coordinate upgrades well in advance for Tuesdays/Wednesdays/Thursdays that are not major holidays
- let’s use a governance proposal to decide yes / no, followed by a second governance proposal to decide date/time
- let’s use the validator alerts mailing list (here), the validator alerts Discord channel, and seek readiness confirmation from validators that did not vote on the second governance proposal
2. One validator dominated the consensus set & signed incorrectly
The evening prior to the upgrade, a validator increased their stake and dominated the set with 18.75% of the voting power. Because that validator then signed incorrectly, we needed to rally a lot more voting power from the tail to progress.
- at mainnet we anticipate a substantial increase in stake, which should better distribute voting power
3. Operator struggles
The lack of correctly-signing voting power revealed struggling validator operators.
- some validators learned the importance of careful verification (and carefully following the guide)
  - verify the sha256sum of the migration json file before launching
  - verify scripts in advance
  - verify (and maybe rename) the binaries that are being replaced (so that they aren’t accidentally used)
- the Heliax engineers learned about some weaknesses that needed to be addressed in the migration code and a minor issue in the protocol itself
4. Recovery was very slow
- the state was 211 GB, which meant that people had to make and distribute large snapshots of the pre-upgrade chain → v0.46.0 fixes this size issue
- the client corrupted the database when it was restarted
- the migration process and code have been improved to prevent this problem
Note: if you can spare the hard drive space, making a snapshot of the pre-upgrade chain before advancing to the upgrade could help save the day.
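For the checksum step mentioned under "Operator struggles", here is a minimal Python sketch of what verifying the migration file before launch could look like. The file name and expected digest in the usage note are placeholders, not the real values from the upgrade guide, and the helper names are hypothetical; running `sha256sum` by hand and comparing against the published value does the same job.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so a large file is never fully held in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """True only if the file's digest equals the published one."""
    actual = sha256_of_file(path)
    # Normalise whitespace and case from a copy-pasted digest before comparing.
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

Usage would be something like `matches_expected("migration.json", "<digest published in the guide>")`, refusing to start the node if it returns `False`.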
@brentstone does the above seem about right? Let me know if I missed anything. Great job organizing this!
Given the above, it’s amazing to me that we had ~84% of voting power online and the vast majority of these validators voting correctly, yet still didn’t have enough voting power signing correctly.
This was ++helpful! Thanks to our large set of helpful and capable validators.
This event gave us the chance to work through a challenge together, and some people really shone. We were able to identify how and where people are slipping through the cracks, and now we’re ready for any state migration that may be needed post-mainnet.
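A rough back-of-the-envelope check shows how this can happen. This sketch assumes the usual BFT rule that a block needs signatures from strictly more than two-thirds of total voting power, and it assumes the 18.75% validator was part of the ~84% online and was the only one signing incorrectly (in reality a few others were too, so the real gap was somewhat larger). The function name is just for illustration.

```python
from fractions import Fraction

# Assumption: commit threshold is strictly more than 2/3 of total voting power.
QUORUM = Fraction(2, 3)


def has_quorum(correct_power):
    """True if correctly-signing power (as a share of total) exceeds 2/3."""
    return Fraction(correct_power) > QUORUM


online = Fraction("0.84")        # ~84% of voting power online (from the post)
dominant = Fraction("0.1875")    # the validator holding 18.75% (from the post)

# Best case: every other online validator signs correctly.
correct_at_most = online - dominant
shortfall = QUORUM - correct_at_most

print(f"correct power at most: {float(correct_at_most):.4f}")
print(f"quorum reached: {has_quorum(correct_at_most)}")
print(f"extra power needed: more than {float(shortfall):.4f}")
```

So even in the best case, the correctly-signing share stays just under the two-thirds line, which is why extra voting power from the tail had to come online and sign correctly before the chain could move.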