Coordinating the Dry Run upgrade to v0.46.0

This thread is to coordinate the upgrade of the dry-run network from the current v0.45.1 to v0.46.0.

I am proposing we target block 182000 as the hard-fork height, which should be reached roughly around Thursday Nov 28 @ 16:00 UTC.

In order to accommodate this, let’s tentatively extend the dry-run network by one day to Friday Nov 29 @ 15:00 UTC. We can also choose to extend it through the weekend now that the mainnet date has been announced.

Performing the upgrade will require a few steps for validator operators and involve a state migration to update the VPs (validity predicates) in storage:

Preparation

  1. Download the binaries for v0.46.0 from the release page and have the current binaries for v0.45.1 handy.
  2. Download the state migration json file from here. Verify that you have the correct file by running sha256sum dryrun_migration46.json, which should yield acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6 (see the check below).
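
For example, assuming the file was saved to your working directory (filename as in step 2 above):

sha256sum dryrun_migration46.json
# expected output:
# acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6  dryrun_migration46.json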

Performing the upgrade
It might first be helpful to set some environment variables:

export BLOCK_HEIGHT=182000
export MIGRATION_JSON=</path/to/downloaded/migration/file>

Then, execute the following steps:

  1. Stop your node and restart it with
    namadan-0.45.1 ledger run-until --block-height $BLOCK_HEIGHT --halt
    where namadan-0.45.1 is the namadan binary for v0.45.1. Once block height $BLOCK_HEIGHT is reached, the node will halt automatically.
  2. Then restart your node with the 0.46.0 binaries, doing the following:
    namadan-0.46.0 ledger run --height $BLOCK_HEIGHT --path $MIGRATION_JSON --hash acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6
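
Put together (and assuming the environment variables above are set), the whole sequence is roughly:

# Convenience outline only, reusing the variables, commands, and hash from the steps above.
namadan-0.45.1 ledger run-until --block-height $BLOCK_HEIGHT --halt   # halts automatically at the target height
namadan-0.46.0 ledger run --height $BLOCK_HEIGHT --path $MIGRATION_JSON --hash acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6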

Once 2/3 of the total voting power comes back online, the chain should resume with the altered state and new v0.46.0 runtime. Then, please test to ensure that all functionality works as expected.
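
A quick sanity check once your node is back, assuming the default CometBFT RPC port 26657 is exposed locally:

# Confirm the node is producing/receiving new blocks and not stuck catching up.
curl -s localhost:26657/status | grep -E 'latest_block_height|catching_up'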

13 Likes

In the future, and for mainnet, the process for orchestrating a software upgrade and hard fork will be more meticulous and discussed over a longer period, but since time is limited here, we ask for your quick response to execute this upgrade.

3 Likes

The proposal for gathering social consensus is on-chain with an ID of 14!

> namadac query-proposal --proposal-id 14 --node $RPC_TUDUDES
Last committed epoch: 174
Proposal Id: 14
Type: Default with Wasm
Author: tnam1qqgll8x8rz9fvtdv8q6n985za0vjvgyu0udxh7fp
Content: {"abstract": "Upgrade the dry-run network to namada v0.46.0", "details": "Attempting to reach social consensus to upgrade the dry-run network to namada v0.46.0. This requires a small state migration and then restarting nodes and clients with the v0.46.0 binaries. Proposing that the hard fork block height be 182000.", "discussions-to": "https://forum.namada.net/t/coordinating-the-dry-run-upgrade-to-v0-46-0/1189", "title": "Upgrade to namada v0.46.0"}
Start Epoch: 175
End Epoch: 187
Activation Epoch: 188
Status: pending
Data: Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855

I did not attach any code to the proposal, so I am not sure why it is of type Default with Wasm. Perhaps a small client bug that I will look into fixing soon.
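
If it helps with the debugging: that Data hash is the SHA-256 digest of empty input, which is consistent with no wasm code actually being attached. A quick way to check:

printf '' | sha256sum
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855  -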

All set, see you there.

2 Likes

Ok, ready for the halt! We also staged a service template update to continue from that height after we run the update to 0.46.0 :+1:

Is this being rescheduled? Or should we really keep the halt height in place and upgrade, even though the gov proposal did not gather enough turnout? :thinking:

Ready to execute the upgrade steps for v0.46.0 as outlined.
Block 182000 it is!

Let’s keep the momentum going! :muscle:

By my calculation, I estimate that block height 182000 will happen around 18:18 UTC today.

I used 7.9-second block times for my calc.

The block time originally used to calculate the target time was 7.5 seconds per block, and it has actually been 7.895 seconds per block, hence the variance.

The original target calc was done at 20:51 UTC on Tuesday, so that shows how much variance we can have when using block heights versus target times. Target times will be fairly rough estimates only! It’s always the block height we use.
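
For anyone who wants to redo the estimate as the height approaches, a minimal sketch (the current height is a placeholder you’d read from your own node, and date -u -d assumes GNU coreutils):

TARGET_HEIGHT=182000
CURRENT_HEIGHT=181200      # placeholder; substitute your node's latest committed height
AVG_BLOCK_TIME=7.9         # observed average seconds per block
ETA_SECONDS=$(printf '%.0f' "$(echo "($TARGET_HEIGHT - $CURRENT_HEIGHT) * $AVG_BLOCK_TIME" | bc)")
date -u -d "now + ${ETA_SECONDS} seconds"   # estimated UTC arrival time of the target height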

1 Like

Standing by with halt height set.

Ready for the update

Upgrade complete! We are moving again :rocket:

FTR:
The issue with Message: Merkle tree should be restored: Custom(CustomError(MerkleTree("Invalid MerkleTreeStoresRead"))) was caused by nodes restarting after they had reached the agreed halt height. It could have been avoided by letting the node stop as instructed, with the service template set to not auto-restart (e.g. Restart=no under systemd), and then performing the rest of the upgrade!
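
A minimal sketch of that service tweak for a systemd-managed node (the unit name namadan.service is an assumption; adjust to your own setup):

# Create a drop-in that stops systemd from restarting the node after it halts.
sudo mkdir -p /etc/systemd/system/namadan.service.d
sudo tee /etc/systemd/system/namadan.service.d/no-restart.conf <<'EOF'
[Service]
Restart=no
EOF
sudo systemctl daemon-reload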

2 Likes

Following the scheduled chain halt at block height 182000, the Dry Run chain stopped making blocks from 15:17 UTC Thursday until ~16:00 UTC Friday, so roughly 25 hours

It may seem bad, but it actually went incredibly well. We wanted to squeeze a state migration test in before the Dec 3 mainnet launch, so there wasn’t much time to do it. Thanks to everyone who stuck it out amidst the challenges :raised_hands:

Some observations:

1. This upgrade was pretty short-notice

and the notice wasn’t widespread (i.e. no email or announcements-channel post); it was also the USA’s Thanksgiving Day. :hourglass:

  • unless it’s an emergency, let’s coordinate upgrades well in advance for Tuesdays/Wednesdays/Thursdays that are not major holidays
  • let’s use a governance proposal to decide yes / no, followed by a second governance proposal to decide date/time
  • let’s use the validator alerts mailing list (here) and the validator alerts Discord channel, and seek readiness confirmation from validators that did not vote on the second governance proposal

2. One validator dominated the consensus set & signed incorrectly

The evening prior to the upgrade, a validator increased their stake and dominated the set with 18.75% of the voting power, and then signed incorrectly. As a result, we needed to rally a lot more voting power from the tail to progress.

  • at mainnet we anticipate a substantial increase in stake, which should better distribute voting power :bar_chart:

3. Operator struggles

The lack of correctly-signing voting power revealed struggling validator operators.

  • some validators learned the importance of careful verification (and carefully following the guide)
    • verify the sha256sum of the migration json file before launching :rotating_light:
    • verify scripts in advance :rotating_light:
    • verify (and maybe rename) the binaries that are being replaced so that they aren’t accidentally used (see the sketch after this list) :rotating_light:
  • the Heliax engineers learned about some weaknesses that needed to be addressed in the migration code and a minor issue in the protocol itself :muscle:
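
For the renaming point above, a rough sketch (paths and binary locations are assumptions; adjust to your own layout):

# Keep both versions under explicit names so a bare namadan binary can't be launched by mistake.
sudo mv /usr/local/bin/namadan /usr/local/bin/namadan-0.45.1
sudo install -m 755 ~/Downloads/namada-v0.46.0/namadan /usr/local/bin/namadan-0.46.0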

4. Recovery was very slow

  • the state was 211 GB, which meant that people had to make and distribute large snapshots of the pre-upgrade chain → v0.46.0 fixes this size issue
  • the client corrupted the database when it was restarted
    • the migration process and code have been improved to prevent this problem

Note: if you can spare the hard drive space, making a snapshot of the pre-upgrade chain before advancing to the upgrade could help save the day.
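
A rough sketch of what such a snapshot could look like (the unit name and base directory are assumptions; stop the node first so the database isn’t written to while copying):

sudo systemctl stop namadan.service      # assumed unit name
tar -czf namada-pre-upgrade-snapshot.tar.gz -C $HOME/.local/share namada   # default Linux base dir is an assumption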

@brentstone does the above seem about right? Let me know if I missed anything. Great job organizing this!

Given the above, it’s amazing to me that we had ~84% of the voting power online, with the vast majority of those validators signing correctly, even though the correctly-signing voting power still fell short of what was needed.

This was extremely helpful! Thanks to our large set of capable and dedicated validators :raised_hands:

This event gave us the chance to work through a challenge together, and some people really shone. :high_brightness: We were able to identify how and where people were slipping through the cracks, and now we’re ready for any state migration that may be needed post-mainnet. :rocket:

1 Like