This thread is to coordinate the upgrade of the dry-run network from the current v0.45.1 to v0.46.0.
I am proposing we target block 182000 as the hard-fork height, which should be reached roughly around Thursday Nov 28 @ 16:00 UTC.
In order to accommodate this, let’s tentatively extend the dry-run network by one day to Friday Nov 29 @ 15:00 UTC. We can also choose to extend it through the weekend now that the mainnet date has been announced.
Performing the upgrade will require a few steps for validator operators and involve a state migration to update the VPs (validity predicates) in storage:
Preparation
Download the binaries for v0.46.0 from the release page and have the current binaries for v0.45.1 handy.
Download the state migration JSON file from here. Verify that you have the correct file by running sha256sum dryrun_migration46.json, which should yield acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6.
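For example, assuming the file was saved in your current directory:

```bash
# verify the migration file before doing anything else
sha256sum dryrun_migration46.json
# expected:
# acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6  dryrun_migration46.json
```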
Performing the upgrade
It might first be helpful to set some environment variables:
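For example (the migration file path below is a placeholder; point it at wherever you saved the file):

```bash
# hard-fork height agreed in the governance proposal
export BLOCK_HEIGHT=182000
# path to the downloaded migration file (adjust to your own setup)
export MIGRATION_JSON=$HOME/dryrun_migration46.json
```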
Stop your node and restart it with namadan-0.45.1 ledger run-until --block-height $BLOCK_HEIGHT --halt, where namadan-0.45.1 is the namadan binary for v0.45.1. Once block height $BLOCK_HEIGHT is reached, the node will halt automatically.
Then restart your node with the v0.46.0 binaries, running: namadan-0.46.0 ledger run --height 182000 --path $MIGRATION_JSON --hash acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6
Once more than 2/3 of the total voting power has come back online, the chain should resume with the migrated state and the new v0.46.0 runtime. Then, please test to ensure that all functionality works as expected.
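A quick way to confirm your node is applying blocks again is to query the local CometBFT RPC; this sketch assumes the default RPC port 26657 and that jq is installed:

```bash
# latest block height from the local CometBFT RPC
curl -s localhost:26657/status | jq '.result.sync_info.latest_block_height'
# confirm the running binary is the v0.46.0 one
namadan-0.46.0 --version
```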
In the future, and for mainnet, the process for orchestrating a software upgrade and hard fork will be planned more meticulously and discussed over a longer period of time. Since we are limited in time here, we ask for your quick response in executing this upgrade.
The proposal for gathering social consensus is on-chain with an ID of 14!
```
> namadac query-proposal --proposal-id 14 --node $RPC_TUDUDES
Last committed epoch: 174
Proposal Id: 14
Type: Default with Wasm
Author: tnam1qqgll8x8rz9fvtdv8q6n985za0vjvgyu0udxh7fp
Content: {"abstract": "Upgrade the dry-run network to namada v0.46.0", "details": "Attempting to reach social consensus to upgrade the dry-run network to namada v0.46.0. This requires a small state migration and then restarting nodes and clients with the v0.46.0 binaries. Proposing that the hard fork block height be 182000.", "discussions-to": "https://forum.namada.net/t/coordinating-the-dry-run-upgrade-to-v0-46-0/1189", "title": "Upgrade to namada v0.46.0"}
Start Epoch: 175
End Epoch: 187
Activation Epoch: 188
Status: pending
Data: Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
```
I did not attach any code to the proposal, so I am not sure why it is of type Default with Wasm. Perhaps a small client bug that I will look into fixing soon.
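For validators who still need to vote, the client command looks roughly like the following; the flag names should be double-checked against namadac vote-proposal --help, and the validator address is a placeholder:

```bash
# cast a governance vote on the upgrade proposal (adjust the address to your validator account)
namadac vote-proposal \
  --proposal-id 14 \
  --vote yay \
  --address $YOUR_VALIDATOR_ADDRESS \
  --node $RPC_TUDUDES
```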
By my calculation, I estimate that block height 182,000 will be reached around 18:18 UTC today.
I used 7.9-second block times for my calculation.
The block time originally used to calculate the target time was 7.5 seconds per block, while the actual average was 7.895 seconds per block, hence the variance.
The original target calculation was done at 20:51 UTC on Tuesday, which shows how much variance we can get between block heights and target times. Target times will only ever be rough estimates; it's always the block height that we use.
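For reference, the estimate is just linear extrapolation from the current height. A quick sketch, assuming jq, bc, and GNU date are available and a local RPC on the default port 26657:

```bash
# estimate when the halt height will be reached, assuming a constant average block time
CURRENT_HEIGHT=$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
TARGET_HEIGHT=182000
AVG_BLOCK_TIME=7.9   # seconds per block, measured over a recent window

REMAINING=$((TARGET_HEIGHT - CURRENT_HEIGHT))
ETA_SECONDS=$(echo "$REMAINING * $AVG_BLOCK_TIME" | bc)
date -u -d "+${ETA_SECONDS%.*} seconds"   # estimated UTC time of the halt
```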
FTR:
The issue with Message: Merkle tree should be restored: Custom(CustomError(MerkleTree("Invalid MerkleTreeStoresRead"))) was caused by nodes restarting after they had reached the block height for the agreed halt. It could have been avoided by letting the node stop as instructed, setting the service template to Restart=no, and then performing the rest of the upgrade!
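If the node runs under systemd, the relevant setting is the Restart= directive. A minimal sketch using a drop-in override, assuming the service is named namadan (adjust the unit name to your setup):

```bash
# prevent the service from auto-restarting after the planned halt
sudo mkdir -p /etc/systemd/system/namadan.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/namadan.service.d/no-restart.conf
[Service]
Restart=no
EOF
sudo systemctl daemon-reload
```

The drop-in can be removed (and daemon-reload run again) once the upgrade is complete and normal restart behaviour is wanted back.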
Following the scheduled chain halt at block height 182,000, the dry-run chain stopped making blocks from 15:17 UTC Thursday until ~16:00 UTC Friday, roughly 25 hours.
It may seem bad, but it actually went incredibly well. We wanted to squeeze a state migration test in before the Dec 3 mainnet launch, so there wasn't much time to do it. Thanks to everyone who stuck it out amidst the challenges.
Some observations:
1. This upgrade was pretty short-notice
and the notice wasn't widespread (i.e., no email or announcements-channel post); it was also the USA's Thanksgiving day.
unless it’s an emergency, let’s coordinate upgrades well in advance for Tuesdays/Wednesdays/Thursdays that are not major holidays
let’s use a governance proposal to decide yes / no, followed by a second governance proposal to decide date/time
let’s use the validator alerts mailing list (here), the validator alerts Discord channel, and seek readiness confirmation from validators that did not vote on the second governance proposal
2. One validator dominated the consensus set & signed incorrectly
The evening prior to the upgrade, a validator increased their stake and came to dominate the set with 18.75% of the voting power. As a result, we needed to rally a lot more voting power from the tail to make progress.
at mainnet we anticipate a substantial increase in stake, which should better distribute voting power
3. Operator struggles
The lack of correctly-signing voting power revealed struggling validator operators.
some validators learned the importance of careful verification (and carefully following the guide)
verify the sha256sum of the migration json file before launching
verify scripts in advance
verify (and maybe rename) the binaries that are being replaced, so that they aren't accidentally used (see the sketch after this list)
the Heliax engineers learned about some weaknesses that needed to be addressed in the migration code and a minor issue in the protocol itself
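A concrete version of that binary check, assuming the binaries live in /usr/local/bin and the v0.46.0 release was extracted into the current directory (adjust paths to your setup):

```bash
# keep version-suffixed copies so the old binary can't be picked up by accident
sudo mv /usr/local/bin/namadan /usr/local/bin/namadan-0.45.1
sudo cp ./namadan /usr/local/bin/namadan-0.46.0

# confirm each binary reports the version you expect
namadan-0.45.1 --version
namadan-0.46.0 --version
```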
4. Recovery was very slow
the state was 211 GB, which meant that people had to make and distribute large snapshots of the pre-upgrade chain → v0.46.0 fixes this size issue
the client corrupted the database when it was restarted
the migration process and code have been improved to prevent this problem
Note: if you can spare the hard drive space, making a snapshot of the pre-upgrade chain before advancing to the upgrade could help save the day.
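A minimal sketch of that, assuming a systemd-managed node named namadan and the default base directory of $HOME/.local/share/namada (adjust both to your setup):

```bash
# snapshot the pre-upgrade chain data while the node is stopped
sudo systemctl stop namadan
tar -czf namada-pre-upgrade-182000.tar.gz -C "$HOME/.local/share/namada" .
```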
@brentstone does the above seem about right? Let me know if I missed anything. Great job organizing this!
Given the above, it's amazing to me that we had ~84% of the voting power online, and the vast majority of these validators were signing correctly, but we still didn't have enough voting power signing correctly to progress.
This was ++helpful! Thanks to our large set of helpful and capable validators.
This event gave us the chance to work through a challenge together, and some people really shone. We were able to identify how and where people were slipping through the cracks, and now we're ready for any state migration that may be needed post-mainnet.