While x64 appears to work without fuss, x86 builds seem to rely on some MSVCRT functio

Fix out-of-the-box vs2019 support for x86 about wavesabre HOT 22 CLOSED

logicomacorp commented on May 31, 2024

Fix out-of-the-box vs2019 support for x86

from wavesabre.

Comments (22)

yupferris commented on May 31, 2024

Note that I didn't actually explore where/why these are imported now and weren't before, and perhaps there's a simple workaround if we can figure that out and avoid it.

from wavesabre.

yupferris commented on May 31, 2024

_alldiv and _allmul both come from this line.
_allshl comes from this line.

Again, I'm not entirely sure why these didn't cause issues in previous vs versions, and perhaps it's worth looking into what the compiler used to generate and see if we can get it to do the same for vs2019.

from wavesabre.

kusma commented on May 31, 2024

These are about signed 64 integer mismatch between the C/C++ ABI and x86. I believe we can fix them by casting to unsigned long long in the right places, at least in the first case.

from wavesabre.

kusma commented on May 31, 2024

The second case seems like it's a performance disaster to end up in the _allshl codepath in the first place.

I would say some optimization rules has changed in the compiler, meaning it's no longer able to avoid these calls. I wouldn't be too surprised, x86 is no longer an important optimization target for most types of applications, so it might be a risky venture to use a new compiler for it...

from wavesabre.

yupferris commented on May 31, 2024

I also wouldn’t be entirely against not officially supporting x86 for vs2019 onwards and pushing x64 as the default everywhere once squishy x64 is available, fwiw. I really want x64 to be default for 64k moving forward even if it’s slightly larger today.

That said, it’s still probably worth looking for workaround(s) and I also have a patch to turn these errors at runtime into link errors. I’ll explore casting to unsigned and see where that gets us.

from wavesabre.

yupferris commented on May 31, 2024

re _allmul/_alldiv, I think it's enough just to make DirectSoundRenderThread::GetPlayPositionMs return unsigned int and perform the mul/div unsigned as follows (Edit: reading your comments above, @kusma, I see that this is exactly what you suggested!):

diff --git a/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h b/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
index f54d1d7..3d68c30 100644
--- a/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
+++ b/WaveSabrePlayerLib/include/WaveSabrePlayerLib/DirectSoundRenderThread.h
@@ -17,7 +17,7 @@ namespace WaveSabrePlayerLib
                DirectSoundRenderThread(RenderCallback callback, void *callbackData, int sampleRate, int bufferSizeMs = 1000);
                ~DirectSoundRenderThread();

-               int GetPlayPositionMs();
+               unsigned int GetPlayPositionMs();

        private:
                static DWORD WINAPI threadProc(LPVOID lpParameter);
diff --git a/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp b/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
index fd50a11..dae28eb 100644
--- a/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
+++ b/WaveSabrePlayerLib/src/DirectSoundRenderThread.cpp
@@ -26,7 +26,7 @@ namespace WaveSabrePlayerLib
                WaitForSingleObject(thread, INFINITE);
        }

-       int DirectSoundRenderThread::GetPlayPositionMs()
+       unsigned int DirectSoundRenderThread::GetPlayPositionMs()
        {
                if (!buffer)
                        return 0;
@@ -49,7 +49,7 @@ namespace WaveSabrePlayerLib
                        totalBytesRead += bufferSizeBytes;
                totalBytesRead += currentBytesRendered;

-               return (int)(totalBytesRead / SongRenderer::BlockAlign * 1000 / sampleRate);
+               return (unsigned int)((unsigned long long)totalBytesRead / SongRenderer::BlockAlign * 1000 / sampleRate);
        }

        DWORD WINAPI DirectSoundRenderThread::threadProc(LPVOID lpParameter)

This function should never return negative values, as it represents how many ms have been rendered in total. This is in contrast to its caller, RealtimePlayer::GetSongPos, which is meant to compensate for output buffer latency, and thus may be negative (before the initial buffer capacity has played out). For that, we just use double and operate in seconds. It might also be an idea, if it's simpler, to use the same type/precision down the whole stack - though I suspect this may be more prone to (cumulative) rounding errors over time.

Alternatively, instead of keeping total bytes rendered in 64 bits, we could keep track of two 32-bit unsigned quantities, ms rendered, and fractional ms rendered, with enough precision to match our sample rate. This is more complicated/error-prone, though (edit: it may not be; we have 53 bits in double, so we could likely just keep track of how many samples have been rendered with that type, and return that all the way until the function that returns seconds, and just convert there, and we likely have several bits more than we need).

from wavesabre.

yupferris commented on May 31, 2024

Getting rid of _allshl is, surprisingly, a bit more tricky. It doesn't seem to matter whether or not phaseAsInt is signed or unsigned (which makes sense, since it's a left shift); so far I haven't been able to work around it without decidedly losing precision.

(side note: there's a bit of irony in that reverting some of the sinusoid operations that I dug into last night may have allowed us to work around this easier, but I don't think that's a good enough reason to back out of the change still :) .)

from wavesabre.

yupferris commented on May 31, 2024

re my comment earlier:

edit: it may not be; we have 53 bits in double, so we could likely just keep track of how many samples have been rendered with that type, and return that all the way until the function that returns seconds, and just convert there, and we likely have several bits more than we need

I looked into this, and it turns out it's not actually simpler overall, since PreRenderPlayer also does some calculations in ms, so it's simpler to have the conversion from bytes rendered -> ms rendered in one place.

from wavesabre.

yupferris commented on May 31, 2024

Hey wait a minute! Since we already normalized the double phase to be in the [1, 2) range, wouldn’t that mean its exponent is fixed? So we actually shouldn’t need the left shift at all (or at the very least that it’s constant and we can factor it another way)?

from wavesabre.

yupferris commented on May 31, 2024

Ah, that would be true but we actually only guarantee the lower bound - and ofc something like fmod to enforce the upper bound is abysmal performance-wise.

from wavesabre.

yupferris commented on May 31, 2024

OK, I was wrong about the GetPlayPositionMs patch above - once I hacked a workaround for _allshl it actually failed at runtime again, this time with __aulldiv, again in GetPlayPositionMs. This is feeling a bit futile.

from wavesabre.

yupferris commented on May 31, 2024

ha, ok, I actually found a configuration that works. Turns out _allshl isn't required with /O2 (optimize for speed), and if I use double for all of the time tracking in DirectSoundRenderThread, neither are the other long-related ops. It feels a bit nasty, but it's enough that I wouldn't mind putting up a PR for further feedback.

from wavesabre.

kusma commented on May 31, 2024

Nice catch!
Move the code to the end of the source-file and do
#pragma optimize( "s", on) right before? :P

from wavesabre.

kusma commented on May 31, 2024

More seriously, we should probably look at what code is generated in the case that don't need it; maybe the compiler realize we don't really need this shift to be 64-bits for some reason?

from wavesabre.

yupferris commented on May 31, 2024

This is with /O2:

		auto significand = (unsigned int)((phaseAsInt << exponent) >> (52 - 32));
00406A78  movq        xmm1,mmword ptr [phaseAsInt]  
00406A7D  shr         eax,14h  
00406A80  sub         eax,3FFh  
00406A85  movd        xmm0,eax  
00406A89  psllq       xmm1,xmm0  
00406A8D  psrlq       xmm1,14h  
00406A92  movd        ecx,xmm1

The first couple lines determine the exponent, then there's the shifts. It seems both shifts (left and right) are still there, it's just that they use ~~SSE2~~ MMX.

from wavesabre.

yupferris commented on May 31, 2024

UGH, after making the PR I decided to test x64 again, and ofc, it's broken, likely due to /O2: unresolved external symbol __vdecl_cos2. And if I change back to /01, __imp_floorf and __imp_sqrtf are missing instead, which is actually strange, as I thought I had this working a few months ago...

from wavesabre.

yupferris commented on May 31, 2024

Yeah so master works, yet my branch doesn't, even if I manually change WaveSabreCore to use /O1 again and rebuild. This is really fucked.

from wavesabre.

yupferris commented on May 31, 2024

I guess we could make the /O1//O2 option dependent on the target platform...

from wavesabre.

yupferris commented on May 31, 2024

OK - I've been mixing up /O1 and /Oi options in CMakeLists.txt, which explains why those __imp_* fn's were missing in the x64 build after my latest changes. With both /O2 and /Oi, it works for x86; we still need /O1//Oi for x64.

Or, perhaps we can shim in fpuCos for x64 as well?

from wavesabre.

yupferris commented on May 31, 2024

FWIW I did get this to work by enabling MASM support and defining fpuCos for x64 (as inline assembly isn't supported by MSVC for x64), and in theory that could be made to work with cmake nicely, and also be conditionally included for x64 only. However, the resulting squished executable appears to be about 1kb larger than its /01 counterpart. So simply choosing the optimization parameter based on the target architecture might actually be the "best" solution.

from wavesabre.

yupferris commented on May 31, 2024

Or, perhaps we can implement our own shift function in this case, just for x86. I think we can make a fair number of assumptions about the input parameters, and we can use inline asm, so we shouldn't incur any performance hits. In fact, just using MMX like the optimized version did is probably a great strategy.

This is, of course, not too far off of just implementing the original intrinsic that's missing, however it can be slightly smaller/simpler to just support our single case (just like the math functions we currently implement), and avoid potential copyright issues that we were blissfully violating before by shipping the old msvcrt.lib binary blob.

That said, were we actually violating any copyright there? Could it be a valid solution to bring msvcrt.lib back for x86 and generate it for x64?

from wavesabre.

yupferris commented on May 31, 2024

You know what, I don’t know why I didn’t think of it sooner - we can just use MMX intrinsics to implement the shifts in a way that matches what the optimized code was doing. This way we shouldn’t have to change any optimization opts or anything, and we’ll always get faster code regardless of x86/x64.

from wavesabre.

Fix out-of-the-box vs2019 support for x86 about wavesabre HOT 22 CLOSED

Comments (22)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent