This thread is to coordinate the upgrade of the dry-run network from the current v0.45.1 to v0.46.0.
I am proposing we target block 182000 as the hard-fork height, which should be reached roughly around Thursday Nov 28 @ 16:00 UTC.
In order to accommodate this, let’s tentatively extend the dry-run network by one day to Friday Nov 29 @ 15:00 UTC. We can also choose to extend it through the weekend now that the mainnet date has been announced.
Performing the upgrade will require a few steps for validator operators and involve a state migration to update the VPs (validity predicates) in storage:
Preparation
Download the binaries for v0.46.0 from the release page and have the current binaries for v0.45.1 handy.
Download the state migration JSON file from here. Verify that you have the correct file by running sha256sum dryrun_migration46.json, which should yield acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6.
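For example, assuming the file was saved in your current directory:

```bash
# verify the migration file before doing anything else
sha256sum dryrun_migration46.json
# expected:
# acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6  dryrun_migration46.json
```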
Performing the upgrade
It might first be helpful to set some environment variables:
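For example (the migration file path below is a placeholder; point it at wherever you saved the file):

```bash
# hard-fork height agreed in the governance proposal
export BLOCK_HEIGHT=182000
# path to the downloaded migration file (adjust to your own setup)
export MIGRATION_JSON=$HOME/dryrun_migration46.json
```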
Stop your node and restart it with namadan-0.45.1 ledger run-until --block-height $BLOCK_HEIGHT --halt, where namadan-0.45.1 is the namadan binary for v0.45.1. Once block height $BLOCK_HEIGHT is reached, the node will halt automatically.
Then restart your node with the v0.46.0 binaries, running: namadan-0.46.0 ledger run --height 182000 --path $MIGRATION_JSON --hash acadd7d124804448b3c7f7b3dc0c3bc536fe6dbd1594757d5a992d47ecacc5c6
Once more than 2/3 of the total voting power has come back online, the chain should resume with the migrated state and the new v0.46.0 runtime. Then, please test to ensure that all functionality works as expected.
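A quick way to confirm your node is applying blocks again is to query the local CometBFT RPC; this sketch assumes the default RPC port 26657 and that jq is installed:

```bash
# latest block height from the local CometBFT RPC
curl -s localhost:26657/status | jq '.result.sync_info.latest_block_height'
# confirm the running binary is the v0.46.0 one
namadan-0.46.0 --version
```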
In the future, and for mainnet, the process for orchestrating a software upgrade and hard fork will be planned more meticulously and discussed over a longer period of time. Since we are limited in time here, we ask for your quick response in executing this upgrade.
The proposal for gathering social consensus is on-chain with an ID of 14!
```
> namadac query-proposal --proposal-id 14 --node $RPC_TUDUDES
Last committed epoch: 174
Proposal Id: 14
Type: Default with Wasm
Author: tnam1qqgll8x8rz9fvtdv8q6n985za0vjvgyu0udxh7fp
Content: {"abstract": "Upgrade the dry-run network to namada v0.46.0", "details": "Attempting to reach social consensus to upgrade the dry-run network to namada v0.46.0. This requires a small state migration and then restarting nodes and clients with the v0.46.0 binaries. Proposing that the hard fork block height be 182000.", "discussions-to": "https://forum.namada.net/t/coordinating-the-dry-run-upgrade-to-v0-46-0/1189", "title": "Upgrade to namada v0.46.0"}
Start Epoch: 175
End Epoch: 187
Activation Epoch: 188
Status: pending
Data: Hash: E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855
```
I did not attach any code to the proposal, so I am not sure why it is of type Default with Wasm. Perhaps a small client bug that I will look into fixing soon.
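For validators who still need to vote, the client command looks roughly like the following; the flag names should be double-checked against namadac vote-proposal --help, and the validator address is a placeholder:

```bash
# cast a governance vote on the upgrade proposal (adjust the address to your validator account)
namadac vote-proposal \
  --proposal-id 14 \
  --vote yay \
  --address $YOUR_VALIDATOR_ADDRESS \
  --node $RPC_TUDUDES
```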
By my calculation, I estimate that block height 182,000 will be reached around 18:18 UTC today.
I used 7.9-second block times for my calculation.
The block time originally used to calculate the target time was 7.5 seconds per block, while the actual average was 7.895 seconds per block, hence the variance.
The original target calculation was done at 20:51 UTC on Tuesday, which shows how much variance we can get between block heights and target times. Target times will only ever be rough estimates; it's always the block height that we use.
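For reference, the estimate is just linear extrapolation from the current height. A quick sketch, assuming jq, bc, and GNU date are available and a local RPC on the default port 26657:

```bash
# estimate when the halt height will be reached, assuming a constant average block time
CURRENT_HEIGHT=$(curl -s localhost:26657/status | jq -r '.result.sync_info.latest_block_height')
TARGET_HEIGHT=182000
AVG_BLOCK_TIME=7.9   # seconds per block, measured over a recent window

REMAINING=$((TARGET_HEIGHT - CURRENT_HEIGHT))
ETA_SECONDS=$(echo "$REMAINING * $AVG_BLOCK_TIME" | bc)
date -u -d "+${ETA_SECONDS%.*} seconds"   # estimated UTC time of the halt
```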
FTR:
The issue with Message: Merkle tree should be restored: Custom(CustomError(MerkleTree("Invalid MerkleTreeStoresRead"))) was caused by nodes restarting after they had reached the block height for the agreed halt. It could have been avoided by letting the node stop as instructed, setting the service template to Restart=no, and then performing the rest of the upgrade!
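If the node runs under systemd, the relevant setting is the Restart= directive. A minimal sketch using a drop-in override, assuming the service is named namadan (adjust the unit name to your setup):

```bash
# prevent the service from auto-restarting after the planned halt
sudo mkdir -p /etc/systemd/system/namadan.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/namadan.service.d/no-restart.conf
[Service]
Restart=no
EOF
sudo systemctl daemon-reload
```

The drop-in can be removed (and daemon-reload run again) once the upgrade is complete and normal restart behaviour is wanted back.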
Following the scheduled chain halt at block height 182,000, the dry-run chain stopped making blocks from 15:17 UTC Thursday until ~16:00 UTC Friday, roughly 25 hours.
It may seem bad, but it actually went incredibly well. We wanted to squeeze a state migration test in before the Dec 3 mainnet launch, so there wasn't much time to do it. Thanks to everyone who stuck it out amidst the challenges.
Some observations:
1. This upgrade was pretty short-notice
and the notice wasn't widespread (i.e., no email or announcements-channel post); it was also the USA's Thanksgiving day.
unless it’s an emergency, let’s coordinate upgrades well in advance for Tuesdays/Wednesdays/Thursdays that are not major holidays
let’s use a governance proposal to decide yes / no, followed by a second governance proposal to decide date/time
let’s use the validator alerts mailing list (here), the validator alerts Discord channel, and seek readiness confirmation from validators that did not vote on the second governance proposal
2. One validator dominated the consensus set & signed incorrectly
The evening prior to the upgrade, a validator increased their stake and came to dominate the set with 18.75% of the voting power. As a result, we needed to rally a lot more voting power from the tail to make progress.
at mainnet we anticipate a substantial increase in stake, which should better distribute voting power
3. Operator struggles
The lack of correctly-signing voting power revealed struggling validator operators.
some validators learned the importance of careful verification (and carefully following the guide)
verify the sha256sum of the migration json file before launching
verify scripts in advance
verify (and maybe rename) the binaries that are being replaced, so that they aren't accidentally used (see the sketch after this list)
the Heliax engineers learned about some weaknesses that needed to be addressed in the migration code and a minor issue in the protocol itself
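A concrete version of that binary check, assuming the binaries live in /usr/local/bin and the v0.46.0 release was extracted into the current directory (adjust paths to your setup):

```bash
# keep version-suffixed copies so the old binary can't be picked up by accident
sudo mv /usr/local/bin/namadan /usr/local/bin/namadan-0.45.1
sudo cp ./namadan /usr/local/bin/namadan-0.46.0

# confirm each binary reports the version you expect
namadan-0.45.1 --version
namadan-0.46.0 --version
```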
4. Recovery was very slow
the state was 211 GB, which meant that people had to make and distribute large snapshots of the pre-upgrade chain → v0.46.0 fixes this size issue
the client corrupted the database when it was restarted
the migration process and code have been improved to prevent this problem
Note: if you can spare the hard drive space, making a snapshot of the pre-upgrade chain before advancing to the upgrade could help save the day.
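A minimal sketch of that, assuming a systemd-managed node named namadan and the default base directory of $HOME/.local/share/namada (adjust both to your setup):

```bash
# snapshot the pre-upgrade chain data while the node is stopped
sudo systemctl stop namadan
tar -czf namada-pre-upgrade-182000.tar.gz -C "$HOME/.local/share/namada" .
```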
@brentstone does the above seem about right? Let me know if I missed anything. Great job organizing this!
Given the above, it's amazing to me that we had ~84% of the voting power online, and the vast majority of these validators were signing correctly, but we still didn't have enough voting power signing correctly to progress.
This was ++helpful! Thanks to our large set of helpful and capable validators.
This event gave us the chance to work through a challenge together, and some people really shone. We were able to identify how and where people were slipping through the cracks, and now we're ready for any state migration that may be needed post-mainnet.