My question here only applies to songs where both the tempo and the time signature are known, but that should be most of the songs out there.
Take a song in 4/4 at 120 qpm (quarter notes per minute) that changes chord every 4 bars: there would be a change every 8 seconds (a quarter note is 60 s / 120 = 0.5 s, a bar is 0.5 s * 4 = 2 s, a chord lasts 2 s * 4 = 8 s). The ideal output would then be, for example:
0.0 8.0 E:min
8.0 16.0 A:maj
16.0 24.0 E:min
[.....]
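To make the arithmetic above concrete, here is a minimal Python sketch that builds that ideal grid from the tempo and time signature. The function names `bar_seconds` and `ideal_segments` are made up for this post and are not part of your library.

```python
def bar_seconds(qpm: float, beats_per_bar: int, beat_unit: int = 4) -> float:
    """Length of one bar in seconds, with the tempo given in quarter notes per minute."""
    quarter = 60.0 / qpm
    # A beat of unit n lasts 4/n quarter notes (an eighth note is half a quarter).
    return quarter * (4.0 / beat_unit) * beats_per_bar


def ideal_segments(chords, bars_per_chord, qpm, beats_per_bar, beat_unit=4):
    """Return (start, end, label) tuples laid out on an exact bar grid."""
    length = bar_seconds(qpm, beats_per_bar, beat_unit) * bars_per_chord
    return [(i * length, (i + 1) * length, c) for i, c in enumerate(chords)]


# The 4/4, 120 qpm, 4-bars-per-chord example from above.
for start, end, label in ideal_segments(["E:min", "A:maj", "E:min"], 4, 120, 4):
    print(f"{start:.1f} {end:.1f} {label}")
```

This assumes a constant tempo and a fixed number of bars per chord, which is only true for the simplified example.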
In reality, the time offsets in the predictions are a bit off, probably because in real audio there is no exact instant at which a chord starts. I have also tested this on a .wav render of a MIDI file.
If the tempo is low and the bars are long, the durations can be crudely quantised by rounding to the closest bar, but when the tempo is high enough (100+ qpm?) the timing error becomes too large relative to the bar length, making it impossible to pin down exactly where in the score the chord changes.
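For reference, the crude quantisation I mean looks roughly like the sketch below. `snap_to_bars` is a hypothetical helper of mine, and the `predicted` values are invented for illustration, not real output from the model.

```python
def snap_to_bars(segments, bar_len):
    """Round each segment boundary to the nearest bar line.

    segments: list of (start, end, label) tuples in seconds, sorted by start.
    Segments that collapse to zero length after rounding are dropped, and
    adjacent segments that end up with the same label are merged.
    """
    snapped = []
    for start, end, label in segments:
        # round() gives the nearest bar index (exact .5 ties go to the even index).
        s = round(start / bar_len) * bar_len
        e = round(end / bar_len) * bar_len
        if e <= s:
            continue
        if snapped and snapped[-1][2] == label and snapped[-1][1] == s:
            snapped[-1] = (snapped[-1][0], e, label)
        else:
            snapped.append((s, e, label))
    return snapped


# Made-up noisy boundaries for the 120 qpm example (bar length 2.0 s).
predicted = [(0.0, 7.8, "E:min"), (7.8, 16.3, "A:maj"), (16.3, 23.9, "E:min")]
print(snap_to_bars(predicted, 2.0))
```

Rounding like this only lands on the right bar while the boundary error stays under half a bar, which is why it stops working as the bars get shorter.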
I don't know much about how your NN works, but perhaps this is because the waveform is analysed "continuously"? Could it be made to analyse segments that are aligned with bars instead? In the example above, could the prediction function be made to guess the chord from 0.0 to 2.0, then from 2.0 to 4.0, and so on?
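The closest I can get from the outside is to post-process the existing output: for each bar, pick the chord that overlaps it the most. This is only a sketch of the behaviour I am asking about, not a change to how the network analyses the audio. `label_per_bar` is a hypothetical name, and the use of `"N"` for a bar with no prediction is my assumption.

```python
from collections import defaultdict


def label_per_bar(segments, bar_len, n_bars):
    """Assign one chord per bar: the label covering the largest share of that bar.

    segments: list of (start, end, label) tuples in seconds.
    Returns a list of n_bars labels; "N" marks a bar with no overlapping segment.
    """
    labels = []
    for i in range(n_bars):
        bar_start, bar_end = i * bar_len, (i + 1) * bar_len
        overlap = defaultdict(float)
        for start, end, label in segments:
            # Seconds shared between this segment and this bar.
            shared = min(end, bar_end) - max(start, bar_start)
            if shared > 0:
                overlap[label] += shared
        labels.append(max(overlap, key=overlap.get) if overlap else "N")
    return labels


# Invented example: the change is predicted 0.3 s early, but bar 0 is still mostly E:min.
print(label_per_bar([(0.0, 1.7, "E:min"), (1.7, 4.2, "A:maj")], 2.0, 2))
```

Compared with rounding boundaries, this votes over the whole bar, so a single misplaced boundary matters less, but it still depends on the continuous predictions being mostly right.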
Thanks!