Jailing/unjailing issue - post-mortem
After the patch and chain restart on Feb 19, there were reports that nodes were timing out and restarting repeatedly: Discord
We discovered that once the database has grown substantially in size, the node takes longer to start up. The client was hard-coded to time out after 30 seconds, and a patch had to be issued to allow a longer timeout period: fix: increase timeout from 30 to 180 by rllola · Pull Request #2668 · anoma/namada · GitHub
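To illustrate why a fixed 30-second limit fails once startup slows down, here is a minimal sketch of a wait-with-deadline loop. It is not the Namada client code: the function `wait_until_ready`, its parameters, and the simulated 60-second startup are all hypothetical, chosen only to show the effect of raising the limit from 30 to 180 seconds.

```python
import time


def wait_until_ready(is_ready, timeout_s=180.0, poll_interval_s=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Hypothetical helper for illustration only. Returns True if the node
    became ready in time, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


if __name__ == "__main__":
    # Simulated clock so the demo finishes instantly (assumption: the node
    # needs 60 seconds to start because its database has grown).
    now = [0.0]

    def fake_clock():
        return now[0]

    def fake_sleep(seconds):
        now[0] += seconds

    def node_ready():
        return now[0] >= 60.0

    for limit in (30.0, 180.0):
        now[0] = 0.0
        ok = wait_until_ready(node_ready, timeout_s=limit, poll_interval_s=1.0,
                              clock=fake_clock, sleep=fake_sleep)
        print(f"timeout={limit:.0f}s -> ready in time: {ok}")
```

With the assumed 60-second startup, the 30-second limit gives up before the node is ready, while the 180-second limit succeeds. Making the limit a parameter rather than a constant is a design choice of this sketch, not a description of the actual patch.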
Operators had to either change the code and recompile, or await the patch release. Some operators had to sync from the beginning. All affected operators were jailed because all [May 27 edit: reasonably apparent] options available to them took too much time to prevent being jailed. We estimate that approximately 100 validators were jailed because of this client issue. These circumstances caused affected validators to be offline for 2 to 3 of the 90 epochs, or for 2.22% to 3.33% of the chain’s total uptime.
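The downtime percentages above follow directly from the epoch counts. A small sketch of the arithmetic, where the function name `downtime_percent` is ours and the 90-epoch total is the figure quoted in this post:

```python
TOTAL_EPOCHS = 90  # total epochs, as stated in the post


def downtime_percent(epochs_offline, total_epochs=TOTAL_EPOCHS):
    """Share of the chain's total epochs spent offline, in percent."""
    return 100.0 * epochs_offline / total_epochs


for epochs in (2, 3):
    print(f"{epochs} epochs offline: {downtime_percent(epochs):.2f}% of {TOTAL_EPOCHS} epochs")
# 2 epochs offline: 2.22% of 90 epochs
# 3 epochs offline: 3.33% of 90 epochs
```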
However, the jailed validators then could not unjail: Rejected by VP after trying to unjail-validator caused by ``change-consensus-key`` command · Issue #2642 · anoma/namada · GitHub
Heliax engineers told us: “The unjailing issue is not related to the broadcaster timeout. There’s a bug in the user VP that incorrectly rejects this if the validator has enough stake to be in consensus set after unjailing. This can occur even if the validator didn’t change their consensus key”
The fix for unjailing required a hard fork. On March 26, following the hard fork, the operators with jailed validators were able to initiate unjailing (which took two epochs to complete): Discord
In total, 86 validators were reported as jailed (plus an unknown number that went unreported) from Feb 19 to Mar 26: Rejected by VPs - Google Sheets