Concerns about uptime mission removals

Two Pilot uptime missions were removed for three reasons, listed here in order of importance:

  1. Primary reason: Client-restart bug → jailed ~100 validators and kept them jailed for a month. I’ve asked for more details about the bug to share here, because I don’t know why some validators were not affected.
    Edit: please see this post-mortem for more info

  2. Secondary reason: Asymmetric communications → coordination for restarting the chain was done exclusively in the private SE100 channels at least twice. This affected Pilots who were not in the SE100 channels, because they were not notified about what client changes needed to be made until after the chain had resumed making blocks.

  3. Tertiary reason: Uptime calculation → blocks_validator_signed / total_blocks essentially meant that you could only compete for the uptime missions if you were in the active set for the entire competition (see the sketch after this list). This affected the Pilots that were not in the genesis validator set, because the active set had 256 slots and there were thousands of Pilots with the same number of tokens who had to find ways to get enough NAAN to out-compete one another to get into and stay in the active set. Genesis validators did not have to compete for a slot because 115 of them had substantially more NAAN.
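
To make the objection concrete, here is a minimal sketch of how a blocks_signed / total_blocks metric penalizes any Pilot who was not in the active set for the whole competition. The block counts and join point below are invented for illustration; this is not the official scoring code.

```python
# Illustration only: numbers are invented, and this is not the official scoring code.

TOTAL_BLOCKS = 1_000_000  # assumed total blocks produced over the whole competition

def uptime(blocks_signed: int, total_blocks: int = TOTAL_BLOCKS) -> float:
    """Uptime as described above: blocks signed over ALL blocks in the competition."""
    return blocks_signed / total_blocks

# A genesis validator in the active set from block 0, signing almost everything:
genesis = uptime(blocks_signed=995_000)                      # 99.5% -> clears both missions

# A post-genesis Pilot who only won an active-set slot 350,000 blocks in,
# then signed every block afterwards:
post_genesis = uptime(blocks_signed=TOTAL_BLOCKS - 350_000)  # 65.0%

print(f"genesis: {genesis:.1%}, post-genesis: {post_genesis:.1%}")
# Even with perfect signing after joining, the post-genesis Pilot cannot reach the
# 95% or 99% thresholds, because every block produced before they entered the
# active set still counts in the denominator.
```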

We think that the full set of participants could not reasonably compete for these two missions, so we (Knowable) proposed that the Anoma Foundation remove them.

We recognize that it is bad to intervene after the competition has ended, but we decided that it would be worse not to intervene. We also know that operators went to extreme lengths to maintain their uptime, and we think it’s really important to recognize this, regardless of the Shielded Expedition’s flaws. We will propose rewarding participants that hit 95% and 99% uptime.

Please use this topic to express concerns about the uptime mission removals. Please try to be clear and concise, and please do not conflate this with other competition decisions. There are many people with concerns, so it will be difficult to address yours if you are not clear, concise, and on topic.

1 Like

Hi Gavin, it seems to me that a big reason is missing above, namely the unjail bug (which is not irrelevant, contrary to assertions made by some). I’m also a little doubtful about whether we should comment here or on the “proposal” made by some of the SE-100?

Not everyone affected by the restart bug experienced significant issues. Operators with a backup node and proper monitoring would be promptly alerted to the problem and could switch to the running node, minimizing any disruptions. The fact that 100 people did not take these precautions does not mean that the issue was unfixable.

From the outset, the SE-100 cohort was intentionally kept small to facilitate initial operations and reduce the need for extensive coordination. Despite the PvP nature of this testnet, some SE-100 participants posted updates in the general channel, demonstrating engagement. Additionally, some post-genesis validators successfully met the uptime metrics, proving that maintaining these standards was achievable. On a personal note, I had requested the team to stop using @here in Discord. While it’s important to retrospectively acknowledge the team’s misses, punishing participants for issues beyond their control is not a reasonable approach.

This design flaw has been present from the beginning, and retroactively punishing participants for it is unfair. Instead, I propose taking a broader approach: calculate misses from the block at which a validator posted the create-validator transaction to the marked end block of the testnet (a sketch of this calculation follows below). However, this raises questions about when to determine eligibility, especially if a validator joined in a later epoch like 10, 30, or 40. This task was heavily incentivized, and that motivation is what got the bounty completed; Namada received prompt responses from validators throughout the testnet. Validators cannot submit more tasks once the testnet has ended, nor should the rules be changed after the fact.
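
A minimal sketch of the per-validator window proposed above. The block heights and signed count are made up for illustration, and the field names are hypothetical rather than Namada’s actual data model.

```python
# Hypothetical per-validator window, as proposed above: count uptime only from the
# block of the create-validator transaction to the testnet's marked end block.
# Block heights and the signed count below are made up for illustration.

def windowed_uptime(blocks_signed: int,
                    create_validator_block: int,
                    end_block: int) -> float:
    """Uptime measured only over the blocks the validator could possibly have signed."""
    window = end_block - create_validator_block + 1
    return blocks_signed / window

# A Pilot who joined at block 300,000 of a 1,000,000-block testnet and signed
# 690,000 of the 700,001 blocks in their window:
print(f"{windowed_uptime(690_000, 300_000, 1_000_000):.2%}")  # ~98.57%

# The eligibility question raised above remains: how late a join (epoch 10, 30,
# 40, ...) should still count for the mission?
```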

In most jurisdictions, it is generally considered unethical and potentially illegal to change the rules of a contest after it has ended in a way that prevents winners from claiming their prizes. Contest rules form a binding agreement between the organizer and the participants.

3 Likes

Here are our 2 cents:

→ The way the Shielded Expedition was run mostly involved waiting for the chain to start again; maybe the build was not of a quality that allowed things to be measured. This should have been better.

→ Because it was very buggy, it demanded attention for far too long. Given that we are all here as validators (Pilots), one would think uptime should have been the primary measure. Encouraging other tasks is all good, but they are secondary the way we see it.

I could go on, but what is the solution? There is none that can satisfy everyone.
So we would go with whatever the majority agrees on, but our vote is not to get rid of it.

Cheers.

2 Likes

Jailing/unjailing issue - post-mortem

After the patch and chain restart on Feb 19, there were reports that nodes were timing out and restarting repeatedly: Discord

We discovered that once the database has grown substantially in size, it takes the node longer to process it when starting up. The client was hard-coded to time out after 30 seconds, and a patch needed to be issued to enable a longer time-out period: fix: increase timeout from 30 to 180 by rllola · Pull Request #2668 · anoma/namada · GitHub
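
Conceptually, the failure mode looks like this (a sketch, not Namada’s actual startup code; the 90-second startup figure is invented):

```python
# Conceptual illustration of the failure mode described above; this is not
# Namada's actual startup code. The node is killed if startup exceeds a
# hard-coded timeout, and a large database pushes startup past that limit.

OLD_TIMEOUT_S = 30    # hard-coded value before the patch
NEW_TIMEOUT_S = 180   # value after PR #2668

def starts_in_time(estimated_startup_seconds: float, timeout_s: float) -> bool:
    return estimated_startup_seconds <= timeout_s

print(starts_in_time(20, OLD_TIMEOUT_S))   # True  -> fine while the database is small
print(starts_in_time(90, OLD_TIMEOUT_S))   # False -> node times out and restarts, repeatedly
print(starts_in_time(90, NEW_TIMEOUT_S))   # True  -> the patched timeout gives it room
```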

Operators had to either change the code and recompile, or await the patch release. Some operators had to sync from the beginning. All affected operators were jailed because all [May 27 edit: reasonably apparent] options available to them took too much time to prevent being jailed. We estimate that approximately 100 validators were jailed from this client issue. These circumstances caused affected validators to be offline for 2-3 of the 90 epochs, or for 2.22 to 3.33% of the chain’s total uptime.

However, the jailed validators then could not unjail: Rejected by VP after trying to unjail-validator caused by ``change-consensus-key`` command · Issue #2642 · anoma/namada · GitHub

Heliax engineers told us: “The unjailing issue is not related to the broadcaster timeout. There’s a bug in the user VP that incorrectly rejects this if the validator has enough stake to be in consensus set after unjailing. This can occur even if the validator didn’t change their consensus key”

The fix for unjailing required a hard fork. On March 26, following the hard fork, the operators with jailed validators were able to initiate unjailing (which took two epochs to complete): Discord

In total, we have 86 reported (and unknown unreported) validators that were jailed from Feb 19 to Mar 26: Rejected by VPs - Google Sheets

So this could have been easily fixed by using one of the many public pruned snapshots posted all the time in Discord, or operators’ own snapshots run through cosmprund (if they kept them). Restoring from a snapshot is one of the most common things validators do, besides governance.

If we are going to excuse an easily fixable issue, then tally the validators confirmed to have been jailed due to the bug and remove the epochs in which they could not unjail from their uptime calculation (a sketch of this adjustment follows below).

i.e. from 2/19.
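
A minimal sketch of that adjustment, assuming uptime is scored per epoch; the epoch numbers used for the Feb 19 - Mar 26 window are placeholders, not the real boundaries.

```python
# Hypothetical epoch-based adjustment: for validators confirmed to have been
# jailed by the unjail bug, drop the epochs between the Feb 19 restart and the
# Mar 26 hard fork from both sides of the calculation. The epoch numbers used
# for that window are placeholders, not the real boundaries.

BUG_EPOCHS = set(range(40, 66))   # assumed epochs covering Feb 19 - Mar 26

def adjusted_uptime(live_by_epoch: dict[int, bool], bug_jailed: bool) -> float:
    """live_by_epoch maps epoch number -> whether the validator stayed live."""
    counted = {e: live for e, live in live_by_epoch.items()
               if not (bug_jailed and e in BUG_EPOCHS)}
    return sum(counted.values()) / len(counted)

# A validator that was live in every epoch except the bug window:
history = {e: (e not in BUG_EPOCHS) for e in range(91)}
print(f"{adjusted_uptime(history, bug_jailed=True):.1%}")   # 100.0% with the adjustment
print(f"{adjusted_uptime(history, bug_jailed=False):.1%}")  # ~71.4% without it
```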

1 Like

Sorry, but I don’t understand this reason at all. It was obvious before the start of the SE that SE-100 Pilots had an advantage here, and it was discussed in validator calls and chat. That’s why everyone was eager to get onto the SE-100 list. The team chose the SE-100 Pilots for a reason (not a random selection): it was based on previous contributions to Namada, overall experience, and validator CV. And obviously genesis validators always have an advantage over all other participants in any network. Imagine giving out an equal amount of voting power to thousands of Pilots at genesis; I think the chain would not even have moved. That’s why you choose trusted validators at genesis!
We are not the ones who created the rules for the SE. The rules were announced before the SE started, and everyone agreed to participate; now we end up in a situation where the rules are changed (with the removal of a category). That means this can also be applied to other SE categories, especially the S class, and even Governance (the faucet was down from the start and not everyone could vote, which means they cannot get 99% governance participation). I see that you ask us not to conflate this with other competition decisions, so I had better stop here and not speak about the S class rules. As always, someone wins and someone loses from decisions that are made after the competition by changing the rules. I believe such a big competition was not prepared in one week, and all the rules were thought out more than once, which is why it is strange to see so many changes to the rules. Anyway, I was happy to contribute and participate, I met so many great people, and I wish everyone all the best in their future journey.

1 Like

Pruning means passing the cometbft data directory through cosmprund, which makes it a little lighter. With cosmprund, the data can be slimmed down, which is what LavenderFive did.
Everything together doesn’t weigh much; I think the SE2 data directory was less than 20 GB when the SE ended.

Nodeify shared Discord evidence of someone posting how they grabbed a snapshot on 2/19 to get back up. Not much else to say.

@cwgoes @adrian @awa Linked here is all the evidence and screenshots gathered from 19/02 onwards from the Shielded Expedition channels (shielded-expedition, se-community-support, and se-100), organised into 6 sections summarised below:

  1. Resyncs fixing the issue: several validators discussing how they managed to fix the issue
  2. Increasing the timeout fixing the issue: Citadel advised increasing the timeouts to fix the issue, and several validators thanked him since it fixed their issues, BEFORE the patch release was announced. The timeout issue was also related to a ‘not fast enough machine’

  3. Issues fixed with a resync from snapshot: validators confirming that the issue was fixed by resyncing from a snapshot

  4. Validators with low-spec VPS cloud servers having timeout issues: Citadel, Gavin and Adrian saying that if a machine isn’t fast enough then operators hit the timeout issue, and advising a migration to a faster machine

  5. Validators having issues because of lack of knowledge and inexperience, low-spec VPS, or being late for restarts, and suggesting it would be better to wait for the more skilled SE-100 to get the chain up first:
    -Pretoro claims he uses a VPS and asked what a bare metal server is; when Gavin says the timeout issue is related to a machine not being fast enough, Pretoro says his machine is very powerful (not true, since he is running on a cloud VPS, as self-declared)
    -Pretoro worries and asked whether the upgrade will be an easy ‘install and run’ or complicated, and self-declares as non-skilled
    -Pretoro suggests it is better to wait for the SE-100 validators to get the network up and running
    -Pretoro was not paying attention and was late for the upgrade; Labisque tagged him and asked if he was sleeping. Pretoro came back and rushed to catch up with what was going on, being late for the upgrade

  6. Cosmostation provides a tool for post-genesis validators to check missed liveness: seeing the inexperience of many post-genesis validators, Cosmostation provided a tool to easily check missed liveness, similar to Tenderduty

For all the evidence and screenshots of the above 6 sections, check the link included above; I didn’t add all the screenshots here to keep this brief and clear.

100% not true, as proven by the evidence above: there were many options available, many validators used them, and none of those validators was jailed.

Not true, as proven by the evidence above, because:
-Citadel suggested a fix for the timeout issue before the patch release, and many validators fixed their issues this way before the patch release. Moreover, the timeout issue was related to hardware and ‘not fast enough’ machines
-Many validators synced very quickly from a pruned snapshot

There are 91 epochs in the Shielded Expedition testnet (0-90), and for post-genesis validators 89 countable epochs, since the first 2 epochs, in which they cannot be part of the active set, are not counted. The minimum jail period for downtime is 2 epochs, which means it was already not possible to achieve the 99% uptime mission after any jailing, even if normal unjailing had worked and there were no unjail bug. This is because more than 88.11 epochs was the minimum needed for 99% uptime, but after a jail event the maximum possible is 87 epochs, which is lower than 88.11. So the unjail bug was not a reason for failing the 99% uptime mission.
Therefore any discussion about the uptime missions should concern only the 95% uptime mission, and the 99% uptime mission should remain as it was. The 95% uptime could be calculated after removing the Feb 19 - March 26 period, if that is the only issue.
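
The epoch arithmetic above, written out as a quick check (the counts are taken directly from the post; nothing else is assumed):

```python
# Worked check of the epoch arithmetic above (all counts come from the post itself).

countable_epochs = 91 - 2                     # 89 epochs count for a post-genesis validator
needed_for_99 = 0.99 * countable_epochs       # 88.11 epochs needed for the 99% mission
needed_for_95 = 0.95 * countable_epochs       # 84.55 epochs needed for the 95% mission
best_after_one_jail = countable_epochs - 2    # 87 epochs at best after a minimum 2-epoch jail

print(best_after_one_jail >= needed_for_99)   # False -> 99% impossible after any jailing
print(best_after_one_jail >= needed_for_95)   # True  -> 95% still reachable after one jailing
```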

In addition to the issues you mentioned, I would like to add another one to the list.

There was also a case on February 7th when the chain was halted. Even though we were explicitly told there would be a 24-hour heads-up before releasing anything (by Awa, message link), 5.5 hours later, v0.31.2 was suddenly released without prior or proper announcement, leaving us completely surprised.

This unexpected release devastated us. We updated our server at 4 A.M., but it was already too late, and our validator was jailed in the next epoch.

Although this was a significant communication issue, it received little attention. This may be because it affected a very small number of validators, mostly those in different timezones. The patch was released during working hours in Europe, so operators there had a higher chance of checking Discord and finding out about the new version compared to people in Asian timezones who were resting.

We raised our concerns about this issue many times in Discord, but we couldn’t get traction because only a few of us were affected. We are happy with the removal of the uptime mission, but now I see some are trying to bring it back again. I wanted to ask you to include this forgotten issue in your final decision.

I believe such a big competition was not prepared in one week, and all the rules were thought out more than once, which is why it is strange to see so many changes to the rules.

agreed

The AMA may be helpful to answer some of the concerns, it started here: Discord

It’s impressive how adaptable some of the validators were, and their operators will be recognized and rewarded. I should clarify that our assertion is that the two uptime missions couldn’t be reasonably competed for by the entire set of Pilots, and thus shouldn’t be considered valid missions in the competition.

The fact that it was possible is a strong argument against moving the goalposts. The same “not everyone could compete” argument applies to various categories, but I believe it’s not a valid reason to eliminate any of them. How many teams had full-stack developers to create new Namada applications? How many teams had the expertise to identify genuine security vulnerabilities? Certainly not the entire set of participants. However, instead of calling for the removal of those categories due to their inability to complete them, people accepted their limitations, focused on their strengths, and excelled in what they could do well.

Could the communication have been improved? Absolutely. However, I don’t believe this justifies sitting idle, demanding that developers intervene while blocks tick away, and abandoning self-reliance. Not knowing how to proceed without instruction is okay; we are all on different paths in our learning journeys. But pulling others down because they might have been lucky, knew what to do, or had prepared in advance for such situations is absolutely crazy to me.

I do fear for the future validator set, as they have been pitted against each other from the outset in the name of personal enrichment. This is further compounded by not sticking to what was agreed upon from the beginning, further dividing the community. While sticking to the terms might make more people angry, at least the integrity remains intact, which is something you cannot buy back.

There were also issues with governance voting: the voting duration is 2 epochs, but we were able to vote in only 1 epoch; in the next epoch we got voting-power issues. We hit this issue while voting with a non-validator account, and I think it persisted for the entire SE; you can check the related messages in the Discord se-100 group from the first week. If there were also bugs related to governance, and not all participants could compete for the governance tasks, why is governance kept unchanged? There are also many reports of people being unable to vote when the faucet was not working for an extended period of time. Is it ethical to ignore all these issues for governance while addressing them for uptime?

The faucet still didn’t work, so Adrian started sending tokens himself (after validators had already passed the deadline to vote on dozens of proposals). I’m not sure exactly how many proposals validators missed the opportunity to vote on because the faucet wasn’t working.

@adrian 's response regarding missing the vote caused by faucet not working. I think there were about 30 proposals that were missed by validators who did not receive tokens due to faucet issues.

  1. The faucet did not work, so validators could not vote on proposals with a voting period of epoch 3 to epoch 4.
  2. Adrian then manually sent tokens.
  3. In epoch 4, some validators received tokens and tried bonding with a non-validator account.
  4. Bonding takes effect in the next epoch, so they could vote starting from epoch 5. In the end, neither validator nor non-validator accounts could vote on proposals whose voting deadline was epoch 4 (see the sketch after this list).
    So these validators could vote on 90%+ of proposals, but 99%+ was not achievable.
    Honestly, these are much better reasons to remove governance, since for uptime there were ways to avoid getting jailed, while for governance there was no way to get tokens to vote on many proposals.
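
A rough sketch of the ceiling this puts on governance participation, assuming ~30 proposals had voting deadlines before the manually sent tokens could take effect, and using an invented total proposal count:

```python
# Rough ceiling on governance participation for validators hit by the faucet
# outage. The ~30 unreachable proposals comes from the discussion above; the
# total proposal count is invented for illustration.

total_proposals = 300   # assumed total number of SE governance proposals
unreachable = 30        # proposals whose voting ended before the sent tokens could be bonded

max_participation = (total_proposals - unreachable) / total_proposals
print(f"{max_participation:.1%}")   # 90.0% -> above 90% is possible, 99% is not
```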

I’m just crosslinking my (co-signed) remarks under the proposal, for consideration here as well: Proposal to Reinstate the Uptime Mission in the Shielded Expedition Incentivised Testnet - #6 by preto