Coordinating the Dry Run upgrade to v0.46.0

Following the scheduled chain halt at block height 182 000, the Dry Run chain stopped producing blocks from 15:17 UTC Thursday until ~16:00 UTC Friday, roughly 25 hours of downtime.

It may seem bad, but it actually went incredibly well. We wanted to squeeze a state migration test in before the Dec 3 mainnet launch, so there wasn’t much time to prepare. Thanks to everyone who stuck it out amidst the challenges :raised_hands:

Some observations:

1. This upgrade was pretty short-notice

The notice also wasn’t widespread (i.e. no email or announcements-channel post), and it fell on the US Thanksgiving holiday. :hourglass:

  • unless it’s an emergency, let’s coordinate upgrades well in advance for Tuesdays/Wednesdays/Thursdays that are not major holidays
  • let’s use a governance proposal to decide yes / no, followed by a second governance proposal to decide date/time
  • let’s use the validator alerts mailing list (here) and the validator alerts Discord channel, and seek readiness confirmation from validators who did not vote on the second governance proposal

2. One validator dominated the consensus set & signed incorrectly

The evening prior to the upgrade, a validator increased their stake and dominated the set with 18.75% of the voting power. Because that validator was signing incorrectly, its share counted against the commit threshold, so we needed to rally a lot more voting power from the tail to progress (a quick arithmetic sketch follows the bullet below).

  • at mainnet we anticipate a substantial increase in stake, which should better distribute voting power :bar_chart:
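For anyone curious about the arithmetic: here’s a rough sketch (illustrative Python, not part of any upgrade tooling) of why a single 18.75% validator signing incorrectly forces nearly all of the remaining stake to sign correctly, assuming the usual CometBFT-style requirement of more than 2/3 of total voting power to commit a block.

```python
# Back-of-the-envelope: how much of the *remaining* voting power must sign
# correctly once an 18.75% validator is signing incorrectly, assuming the
# usual requirement of >2/3 of total voting power to commit blocks.

from fractions import Fraction

total = Fraction(1)                  # all voting power
misbehaving = Fraction(1875, 10000)  # 18.75% signing incorrectly
threshold = Fraction(2, 3)           # >2/3 of total power must sign correctly

remaining = total - misbehaving      # 81.25% that could still sign correctly
required_share_of_remaining = threshold / remaining

print(f"Correct signatures needed: >{float(threshold):.1%} of total power")
print(f"That is >{float(required_share_of_remaining):.1%} of the remaining "
      f"{float(remaining):.2%} of stake")
# -> roughly 82% of the non-misbehaving validators (by stake) had to be
#    online and signing correctly for the chain to keep making blocks.
```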

3. Operator struggles

The shortfall in correctly-signing voting power revealed which validator operators were struggling.

  • some validators learned the importance of careful verification (and carefully following the guide)
    • verify the sha256sum of the migration json file before launching (a quick sketch of this check follows this list) :rotating_light:
    • verify scripts in advance :rotating_light:
    • verify (and maybe rename) the binaries that are being replaced (so that they aren’t accidentally used) :rotating_light:
  • the Heliax engineers learned about some weaknesses in the migration code that needed to be addressed, plus a minor issue in the protocol itself :muscle:
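On the sha256sum point above: a minimal sketch of the kind of check that would have helped, in Python. The file name and expected digest below are placeholders, not real values from this upgrade; substitute the ones published alongside the migration file.

```python
import hashlib
import sys

# Placeholders: use the actual file name and the sha256 published with the
# upgrade instructions.
MIGRATION_FILE = "migration.json"
EXPECTED_SHA256 = "<published sha256 hex digest>"

def sha256sum(path: str) -> str:
    """Stream the file so even large artifacts hash without much memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

actual = sha256sum(MIGRATION_FILE)
if actual != EXPECTED_SHA256:
    sys.exit(f"sha256 mismatch: got {actual}, expected {EXPECTED_SHA256}")
print("migration file checksum OK, safe to launch")
```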

4. Recovery was very slow

  • the state was 211 GB, which meant that people had to make and distribute large snapshots of the pre-upgrade chain → v0.46.0 fixes this size issue
  • restarting the client corrupted the database
    • the migration process and code have been improved to prevent this problem

Note: if you can spare the hard drive space, making a snapshot of the pre-upgrade chain before advancing to the upgrade could help save the day.
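As a rough illustration of that note, something along these lines (generic Python with assumed paths; the data directory shown is just an example, point it at wherever your node keeps its state) would copy the pre-upgrade data aside:

```python
import shutil
import time
from pathlib import Path

# Assumed locations: adjust to wherever your node actually keeps its data.
DATA_DIR = Path.home() / ".local/share/namada"
SNAPSHOT_DIR = Path.home() / f"pre-upgrade-snapshot-{time.strftime('%Y%m%d-%H%M%S')}"

# Take the copy only while the node is stopped, otherwise it may be inconsistent.
shutil.copytree(DATA_DIR, SNAPSHOT_DIR)

size_gb = sum(p.stat().st_size for p in SNAPSHOT_DIR.rglob("*") if p.is_file()) / 1e9
print(f"Snapshot written to {SNAPSHOT_DIR} ({size_gb:.1f} GB)")
```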

@brentstone does the above seem about right? Let me know if I missed anything. Great job organizing this!

Given the above, it’s amazing to me that we had ~84% of voting power online, and the vast majority of those validators were signing correctly, yet the correctly-signing share still fell short of the >2/3 threshold needed to progress.

This was extremely helpful! Thanks to our large set of capable and supportive validators :raised_hands:

This event gave us the chance to work through a challenge together, and some people really shone :high_brightness: We were able to identify how and where people were slipping through the cracks, and now we’re ready for any state migration that may be needed post-mainnet. :rocket:
