Comments (19)

steviez avatar steviez commented on September 21, 2024 1

Here is the snippet where the EAH gets mixed into the bank hash; I was looking it up for my own understanding as well:

solana/runtime/src/bank.rs

Lines 6963 to 7004 in 9db4e84

let mut hash = hashv(&[
    self.parent_hash.as_ref(),
    accounts_delta_hash.0.as_ref(),
    &signature_count_buf,
    self.last_blockhash().as_ref(),
]);

let epoch_accounts_hash = self.should_include_epoch_accounts_hash().then(|| {
    let epoch_accounts_hash = self.wait_get_epoch_accounts_hash();
    hash = hashv(&[hash.as_ref(), epoch_accounts_hash.as_ref().as_ref()]);
    epoch_accounts_hash
});

let buf = self
    .hard_forks
    .read()
    .unwrap()
    .get_hash_data(slot, self.parent_slot());
if let Some(buf) = buf {
    let hard_forked_hash = extend_and_hash(&hash, &buf);
    warn!("hard fork at slot {slot} by hashing {buf:?}: {hash} => {hard_forked_hash}");
    hash = hard_forked_hash;
}

let bank_hash_stats = self
    .rc
    .accounts
    .accounts_db
    .get_bank_hash_stats(slot)
    .expect("No bank hash stats were found for this bank, that should not be possible");
info!(
    "bank frozen: {slot} hash: {hash} accounts_delta: {} signature_count: {} last_blockhash: {} capitalization: {}{}, stats: {bank_hash_stats:?}",
    accounts_delta_hash.0,
    self.signature_count(),
    self.last_blockhash(),
    self.capitalization(),
    if let Some(epoch_accounts_hash) = epoch_accounts_hash {
        format!(", epoch_accounts_hash: {:?}", epoch_accounts_hash.as_ref())
    } else {
        "".to_string()
    }
);
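
For anyone following along, here is a rough Python sketch of the mixing step in the snippet above. It is not the validator's code: it assumes `hashv` behaves like SHA-256 over the concatenated slices and that the signature count is serialized as 8 little-endian bytes, it leaves out the hard-fork step, and the 32-byte inputs are placeholders rather than real chain data.

```python
import hashlib


def hashv(parts):
    """SHA-256 over the byte slices in order (assumed stand-in for hashv)."""
    hasher = hashlib.sha256()
    for part in parts:
        hasher.update(part)
    return hasher.digest()


def bank_hash(parent_hash, accounts_delta_hash, signature_count,
              last_blockhash, epoch_accounts_hash=None):
    # Assumption: the signature count is serialized as 8 little-endian bytes.
    signature_count_buf = signature_count.to_bytes(8, "little")
    digest = hashv([parent_hash, accounts_delta_hash,
                    signature_count_buf, last_blockhash])
    # Only a bank that should include the EAH performs this second round.
    if epoch_accounts_hash is not None:
        digest = hashv([digest, epoch_accounts_hash])
    return digest


# Placeholder 32-byte inputs, not real chain data.
common = dict(
    parent_hash=bytes([1]) * 32,
    accounts_delta_hash=bytes([2]) * 32,
    signature_count=500,
    last_blockhash=bytes([3]) * 32,
)
hash_a = bank_hash(**common, epoch_accounts_hash=bytes([4]) * 32)
hash_b = bank_hash(**common, epoch_accounts_hash=bytes([5]) * 32)
print(hash_a != hash_b)  # True: same inputs, different EAH, different bank hash
```

The point of the sketch is only that two nodes agreeing on every other input will still produce different bank hashes if their EAH values differ.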

from solana.

steviez avatar steviez commented on September 21, 2024 1

bad

[2024-01-21T03:16:02.013644297Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615827727708, epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

good

[2024-01-21T03:16:02.022355358Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615826502748, epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

Hmm yeah, it looks like the only things that differ here are the epoch_accounts_hash and the capitalization. The fact that your node did not diverge previously would suggest that the account that caused the EAH to diverge did NOT appear as part of any bank hashes recently. So, I don't think replaying the slot will give us any useful information. Rather, I think we would have to examine each account to find the offending one.
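
A small sketch of how the two "bank frozen" lines above could be compared field by field instead of by eye. The helper names and the regular expression are made up for illustration; the values are copied from the logs in this thread, with the timestamp and the (identical) stats omitted.

```python
import re

# Matches "key: value" pairs; a value ends at whitespace or a comma.
FIELD_RE = re.compile(r"(\w+): ([^\s,]+)")


def parse_bank_frozen(line):
    """Return the key/value fields that precede the stats struct."""
    head = line.split("stats:")[0]
    # Note: "bank frozen: <slot>" is captured under the key "frozen".
    return dict(FIELD_RE.findall(head))


def differing_fields(a, b):
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


BAD = (
    "bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ "
    "accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 "
    "signature_count: 500 "
    "last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs "
    "capitalization: 567558615827727708, "
    "epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC, "
)
GOOD = (
    "bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ "
    "accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 "
    "signature_count: 500 "
    "last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs "
    "capitalization: 567558615826502748, "
    "epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be, "
)

diff = differing_fields(parse_bank_frozen(BAD), parse_bank_frozen(GOOD))
print(sorted(diff))  # ['capitalization', 'epoch_accounts_hash', 'hash']
```

The accounts delta hash, signature count and last blockhash all match, so the bank hash difference comes from the EAH being mixed in.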

from solana.

steviez avatar steviez commented on September 21, 2024 1

And just making sure, you were running 6a9f729 with no other modifications ?

I think this is because the reward PDA account was created in the previous epoch when I ran my node with partitioned rewards enabled (#34809). That PR is incomplete.

I have pushed fixes for this just now.

So to confirm, you believe that an account from a PR that has not landed yet altered your account state and caused your node to diverge? And the fixes were pushed to your PR?

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

243107999 matches
good: bank frozen: 243107999 hash: 7X1s4w65yBnwz18fKg5tamNrSHnK1AZyBhQEZRjctDhc
bad: bank frozen: 243107999 hash: 7X1s4w65yBnwz18fKg5tamNrSHnK1AZyBhQEZRjctDhc

243108000 mismatches
good bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ
bad bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ

from solana.

steviez avatar steviez commented on September 21, 2024

And you have confirmed it with and without the commit you linked above, 6a9f729 ?

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

bad

[2024-01-21T03:16:02.013644297Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615827727708, epoch_accounts_hash: xwb6iXsG3vdHgUTAFkgUiACYPwYkW8F7465CNL9WVaC, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

good

[2024-01-21T03:16:02.022355358Z INFO  solana_runtime::bank] 
bank frozen: 243108000 hash: HMk3tMMympeHyBRpoDrqMLxfLjSUZvNiLYx6i2JCuNGZ 
accounts_delta: GCj8CFVaqeHwkVgaLb6f9PaPYeZMofBsKL1op3LXUJJ8 signature_count: 500 last_blockhash: Ant9w6LfnGbG4Jpm5VMxmyRpzN3myx3dDpAQSE3kTwcs 
capitalization: 567558615826502748, epoch_accounts_hash: 824tUYuwAKFv2kKz5m2Xf8YHNYhYUhsqwohjmrvTp3Be, 
stats: BankHashStats { num_updated_accounts: 1465, num_removed_accounts: 14, num_lamports_stored: 39484939731971, total_data_len: 10492554, num_executable_accounts: 1 }

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

And you have confirmed it with and without the commit you linked above, 6a9f729 ?

Yes. #34623 is not the problem.
I have verified with the ledger tool that with and without #34623 we get the same bad bank hash 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

It seems on 243108000 the epoch_accounts_hash mismatched, which could cause the bank hash mismatch?

from solana.

steviez avatar steviez commented on September 21, 2024

Yes. #34623 is not the problem.
I have verified with the ledger tool that with and without #34623 we get the same bad bank hash 7h4iAoXfX6KwitVhazib3fCnr3J4koFu7ArJZTY5heFZ

Gotcha, maybe we need to bisect then. We seemingly don't have a last known good commit, but I also would have expected canaries to hit this if the issue had been introduced long ago.

And just making sure, you were running 6a9f729 with no other modifications ?

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

@brooksprumo My understanding is that if the epoch_accounts_hash mismatches, then at the bank where we include the epoch_accounts_hash (i.e. 3/4 of the way into the epoch?), we will get a bank hash mismatch. Is that right?

from solana.

brooksprumo avatar brooksprumo commented on September 21, 2024

The epoch accounts hash is a (full) accounts hash calculation taken at the rooted slot 1/4 into the epoch. That value is saved into accounts-db and then hashed into the bank (when freezing) that's 3/4 into the epoch.

If the EAH values are different, that would imply an account hash calculation mismatch.

And yes, if the EAH values are different, that will cause the bank hashes to be different.
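
To make the 1/4 and 3/4 points concrete, a quick sketch. It assumes 432,000 slots per epoch and that the offsets are plain integer quarters of the epoch length; `eah_slots` is a made-up helper, and 242784000 is the epoch start slot quoted elsewhere in this thread.

```python
SLOTS_PER_EPOCH = 432_000  # assumption: epoch length on this cluster


def eah_slots(epoch_first_slot, slots_per_epoch=SLOTS_PER_EPOCH):
    """Return (calculation slot, inclusion slot) for an epoch.

    Assumes the EAH is calculated 1/4 into the epoch and hashed into the
    bank 3/4 into the epoch, as described above.
    """
    start = epoch_first_slot + slots_per_epoch // 4
    stop = epoch_first_slot + slots_per_epoch // 4 * 3
    return start, stop


start, stop = eah_slots(242_784_000)
print(start, stop)  # 242892000 243108000
print((stop - 242_784_000) / SLOTS_PER_EPOCH)  # 0.75
```

Under those assumptions the inclusion slot lands on 243108000, the slot where the bank hashes diverged.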

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

Yeah. 243108000 is at 75% of the epoch.

>>> x = 243108000
>>> begin = 242784000
>>> end = 243215999
>>> (x-begin)/(end-begin)
0.7500017361151299

from solana.

brooksprumo avatar brooksprumo commented on September 21, 2024

Yeah. 243108000 is at 75% of the epoch.

Also the EAH is only part of the "bank frozen" log line for the one[1] bank that's including the EAH.

Footnotes

  1. There can be multiple forks such that more than one bank includes the EAH, since this is only "frozen". Eventually only one of these banks will be rooted.

from solana.

brooksprumo avatar brooksprumo commented on September 21, 2024

Rather, I think we would have to examine each account to find the offending one.

Yeah, that's what I would think too. The accounts hash calculation happens way after the bank hash is calculated (the eah start slot), so replaying the slot won't include anything about the EAH. We occasionally see accounts hash mismatches on snapshots, and it hasn't been reproducible before. Often theorized to be a HW issue, esp disk related.

from solana.

steviez avatar steviez commented on September 21, 2024

Note that I see this at startup when unpacking the snapshot:

[2024-01-22T20:02:23.357305050Z WARN  solana_runtime::bank]
verify failed: slot: 243107789, 4n2M6Y7tgkjUfahST7DFPLL1Su4V3cWGxo3gduJSv99N (calculated) != 95ahPUghb36UjDVtAmpLHGhYWFSwZ22spWW6k78AgGHQ (expected)

[2024-01-22T20:02:23.357326060Z INFO  solana_metrics::metrics]
datapoint: verify_snapshot_bank clean_us=4i shrink_us=1i verify_accounts_us=577i verify_bank_us=9414i
thread 'main' panicked at /solana-labs/solana-secondary/runtime/src/snapshot_bank_utils.rs:374:9:
Snapshot bank for slot 243107789 failed to verify

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

yeah. I see that too.

[2024-01-22T21:29:57.829048056Z INFO  solana_metrics::metrics] datapoint: accounts_db_active hash=0i
[2024-01-22T21:29:57.834151537Z WARN  solana_accounts_db::accounts_db] Mismatched total lamports: 567558617892842802 calculated: 567558617891617842
[2024-01-22T21:29:57.834178147Z WARN  solana_accounts_db::accounts] verify_accounts_hash failed: MismatchedTotalLamports(567558617891617842, 567558617892842802), slot: 243107789
[2024-01-22T21:29:57.834662541Z INFO  solana_runtime::bank] Initial background accounts hash verification has stopped

I am working on a fix for this.
d76a5a3

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

I think this is because the reward PDA account was created in the previous epoch when I ran my node with partitioned rewards enabled (#34809). That PR is incomplete. It only patches the bank hash to ignore the PDA accounts, but it doesn't take care of the epoch accounts hash or the bank lamport adjustment. Therefore, we fail at the slot where the epoch accounts hash is included in the bank hash.

I have pushed fixes for this just now.
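
As a sanity check on the lamport side of this, a quick calculation using only numbers already posted in this thread (the variable names are mine). It shows that the capitalization gap between the bad and good "bank frozen" lines is the same size as the gap in the "Mismatched total lamports" warning, which is consistent with (though not proof of) one fixed extra balance being counted on one side.

```python
# Capitalization from the "bank frozen" lines for slot 243108000.
bad_capitalization = 567558615827727708
good_capitalization = 567558615826502748

# Totals from the "Mismatched total lamports" warning at slot 243107789.
reported_total = 567558617892842802
calculated_total = 567558617891617842

bank_delta = bad_capitalization - good_capitalization
verify_delta = reported_total - calculated_total

print(bank_delta, verify_delta)  # 1224960 1224960
```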

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

So to confirm, you believe that an account from a PR that has not landed yet altered your account state and caused your node to diverge? And the fixes were pushed to your PR?

Yes, that's correct. All of this is specific to my node, which was run with --partitioned-epoch-rewards-force-enable-single-slot toggled on and off, which then messed up the accounts stored on my local node.

from solana.

HaoranYi avatar HaoranYi commented on September 21, 2024

Closing the issue since this is specific to my node.

from solana.
