Comments (24)
@aganea thank you so much for running more tests.
You're very welcome! You folks have been very helpful so far :)
Just wanted to confirm I am doing the same as you
In essence, you have to do a two-stage LLVM build.
git clone https://github.com/llvm/llvm-project -or- git pull, then git apply https://reviews.llvm.org/D71786
- The first stage builds LLVM with the bootstrap compiler (any compiler). You could use the allocator at this point if you wish.
- The second stage builds LLVM with the first stage. At this point cmake uses ThinLTO, the allocator, O3 and -march=skylake (or whatever your CPU is) to ensure maximum performance.
- Once everything is built, delete buildninjaStage2\bin\clang.exe, then re-run ninja clang -v. While it's linking (it should last a bit), go into the folder buildninjaStage2\CMakeFiles and copy clang.rsp (which is a temp file created during the link) to clang2.rsp. You can cancel the link at that point.
- Put the following line in a new file buildninjaStage2\link.rsp:
  /nologo @CMakeFiles\clang2.rsp /out:bin\clang.exe /implib:lib\clang.lib /pdb:bin\clang.pdb /version:0.0 /machine:x64 -fuse-ld=lld /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /opt:lldltojobs=all
  All these gymnastics are needed because there's no option to disable the ThinLTO cache from cmake, nor an option to use all hardware threads (by default, only one thread per core is used).
- You only need to do the above steps once. You can now run the test with:
> cd buildninjaStage2
> bin\lld-link @link.rsp
So it's using the stage2 LLD to relink the stage2 clang.
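To turn the relink step into a repeatable measurement, a small timing harness could look like the sketch below. It is only an illustration: the helper names `time_command` and `summarize`, the run count, and the choice of Python are assumptions and not part of the original setup. The only thing taken from the steps above is the `bin\lld-link @link.rsp` command.

```python
import statistics
import subprocess
import time


def time_command(cmd, runs=3):
    """Run `cmd` `runs` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError when the command fails,
        # so a failed link is never recorded as a valid sample.
        subprocess.run(cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) for a list of timings."""
    mean = statistics.mean(durations)
    # A standard deviation needs at least two samples.
    stdev = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return mean, stdev
```

For the relink above, one would call `time_command([r"bin\lld-link", "@link.rsp"], runs=5)` from inside `buildninjaStage2` and pass the result to `summarize`. Repeating the run a few times gives a rough idea of whether a difference between two allocators is larger than the run-to-run noise.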
I use Bruce Dawson's UIforETW to take profile traces: https://github.com/google/UIforETW - make sure to check 'trace to file' on the right side first. Click 'Start Tracing' at the top before running the above cmd-line, then 'Save Trace Buffers' once it ends. After it is done compressing the trace, double-clicking on it opens WPA. If the traces are too big and you get out-of-memory crashes, set the following in the wpa.exe.config file next to wpa.exe:
<configuration>
...
<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>
</configuration>
I've attached my build script:
make_llvm_snmalloc.zip
You need to run it from a VS 2017 or 2019 x64 Native Tools Command Prompt:
D:\llvm-project> make_llvm_snmalloc.bat buildninjaStage1
D:\llvm-project> ninja check-all -C buildninjaStage1
D:\llvm-project> make_llvm_snmalloc.bat buildninjaStage2
D:\llvm-project> ninja check-all -C buildninjaStage2
GnuWin32, Python 3.8, ninja are also needed.
Please let me know if there're difficulties along the way.
from snmalloc.
So I ran our benchmark with and without 1mib and I couldn't see any significant difference.
from snmalloc.
Figures with latest master:
Allocator | Wall clock | Page ranges committed/decommitted | Total touched pages | Peak mem |
---|---|---|---|---|
default | 2 min 21 sec | 73,611 | 43.8 GB | 42.6 GB |
+IS_ADDRESS_SPACE_CONSTRAINED | 2 min 21 sec | 48,836 | 92.7 GB | 21.6 GB |
It seems the latest snmalloc checkout is a tad slower but the commit is now much better with David's suggestion.
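To make the trade-off in these figures concrete, the relative changes can be computed directly. This is plain arithmetic on the numbers reported in the table; the function and variable names are illustrative only.

```python
def relative_change(before, after):
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


# Figures copied from the table: (default, +IS_ADDRESS_SPACE_CONSTRAINED).
figures = {
    "page ranges committed/decommitted": (73_611, 48_836),
    "total touched pages (GB)": (43.8, 92.7),
    "peak memory (GB)": (42.6, 21.6),
}

for name, (before, after) in figures.items():
    # The '+' flag prints an explicit sign so increases and decreases are obvious.
    print(f"{name}: {relative_change(before, after):+.1%}")
```

In other words, the constrained configuration performs roughly a third fewer commit/decommit operations and about halves peak memory, while a bit more than doubling the total touched pages, at the same wall-clock time.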
from snmalloc.
We're running it with default-features = false so it won't affect us, but I'll try to get in some benchmarks on Monday of the impact of 1mb vs no features :)
from snmalloc.
@Licenser thanks for doing this.
from snmalloc.
@aganea I have replicated the experiment so far. I have checked out a Standard F72s_v2 (72 vCPUs, 144 GiB memory) instance on Azure with Windows 10, and am getting about 32GiB PWS with the 16MiB chunk size, and 16GiB PWS with the 1MiB chunk size. The times look like they might be slightly faster with 16MiB, but I'm not sure, so I am running some statistically meaningful tests. I have also re-tested the tree_index branch.
One minor tip: you can do ninja -v -d keeprsp, then you don't have to worry about the rsp file being deleted by ninja in step 3 of your instructions.
from snmalloc.
@davidchisnall mentioned: "By default, on Windows, snmalloc only decommits memory when the kernel notifies it that memory is constrained. If you've got loads of spare memory, there's no problem letting the commit size grow a lot, it's only a negative if the memory could be usefully used for something else."
The test machine has 128 GB of RAM and memory is far from being constrained. I will nevertheless re-test with IS_ADDRESS_SPACE_CONSTRAINED and the latest master branch. My test was using the tree_index branch.
from snmalloc.
That looks a lot more plausible. We should probably rename IS_ADDRESS_SPACE_CONSTRAINED: it's a bit misleading. Since @mjp41's recent work, we rarely see a performance advantage from using 16MiB super slabs, so I wonder if we should consider adjusting the default to 1MiB (or 2MiB, which would play nicely with superpages on x86).
from snmalloc.
@aganea thank you so much for running more tests. I wonder if the small regression in performance is due to not using the tree_index branch. That should have improved Windows performance a bit. I have rebased the tree_index branch onto master, so we can test those changes again to see if they account for the regression.
I have been setting up an LLVM Windows build, so I can test the link time. Just wanted to confirm I am doing the same as you
cmake ..\llvm-project\llvm \
-DLLVM_ENABLE_LTO=On \
-DCMAKE_LINKER=c:/src/malloc-llvm/build/lld-link.exe \
-DLLVM_ENABLE_PROJECTS=clang \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=c:/src/malloc-llvm/build/bin/clang-cl.exe \
-DCMAKE_CXX_COMPILER=c:/src/malloc-llvm/build/bin/clang-cl.exe \
-G Ninja
When building this, I assume you are measuring the final step:
Linking CXX executable bin\clang.exe
I am building with the latest master, but I haven't tried to apply your patch yet. Just getting the very slow version with lld-link so far.
from snmalloc.
@davidchisnall I think moving to the 1MiB size as default would make a lot of sense. I think 1MiB could work well with a little fiddling around huge pages, so we can put two threads into one huge page, for low-memory multi-threaded scenarios.
Agreed, IS_ADDRESS_SPACE_CONSTRAINED is a terrible name; I named it after why I needed it originally, rather than what it does.
@Licenser, @darach, @SchrodingerZhu any thoughts on changing the default to 1MiB?
from snmalloc.
This comment contains some benchmarking using the microbenchmarks for mimalloc for the different chunk sizes.
@plietar any thoughts on changing the default to 1MiB?
from snmalloc.
On a small Linux OpenVZ instance (2 GiB in total), after upgrading to snmalloc-rs==0.2.16, I run into Out of memory on initialisation, with and without the 1mib flag.
The following Rust backtrace is not very useful, but I may have time to check the problem further:
Out of memory
/bin/utopia(+0x62ff7b)[0x56062d9fdf7b]
/bin/utopia(+0x6304b3)[0x56062d9fe4b3]
/bin/utopia(+0x630511)[0x56062d9fe511]
/bin/utopia(+0x630b72)[0x56062d9feb72]
/bin/utopia(+0x6340fc)[0x56062da020fc]
/bin/utopia(+0x3c75e3)[0x56062d7955e3]
/bin/utopia(+0x17c2a7)[0x56062d54a2a7]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f27569406a3]
/bin/utopia(+0xe8ede)[0x56062d4b6ede]
I confirmed that this is caused by snmalloc, since using the system default malloc solves the memory issue. This problem occurs right after upgrading to 0.2.16 (94a2ba4).
from snmalloc.
This is how I reproduce a similar problem on my PC:
- firejail --noprofile --rlimit-as=2147483648 bash: no problem
- firejail --noprofile --rlimit-as=2147483648 --env=LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash: dead
- firejail --noprofile --env=LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash: no problem
from snmalloc.
[schrodinger@Monad utopia]$ ulimit -Sv 500000
[schrodinger@Monad utopia]$ strace env LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash
execve("/usr/bin/env", ["env", "LD_PRELOAD=/tmp/snmalloc/test/li"..., "bash"], 0x7ffe909ace00 /* 69 vars */) = 0
brk(NULL) = 0x55a2b674d000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe9daa37e0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK) = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=458620, ...}) = 0
mmap(NULL, 458620, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0ca548a000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2146832, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0ca5488000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
mmap(NULL, 1860456, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ca52c1000
mprotect(0x7f0ca52e6000, 1671168, PROT_NONE) = 0
mmap(0x7f0ca52e6000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0ca52e6000
mmap(0x7f0ca5433000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x172000) = 0x7f0ca5433000
mmap(0x7f0ca547e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f0ca547e000
mmap(0x7f0ca5484000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ca5484000
close(3) = 0
arch_prctl(ARCH_SET_FS, 0x7f0ca5489580) = 0
mprotect(0x7f0ca547e000, 12288, PROT_READ) = 0
mprotect(0x55a2b5e45000, 4096, PROT_READ) = 0
mprotect(0x7f0ca5525000, 4096, PROT_READ) = 0
munmap(0x7f0ca548a000, 458620) = 0
brk(NULL) = 0x55a2b674d000
brk(0x55a2b676e000) = 0x55a2b676e000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=6187360, ...}) = 0
mmap(NULL, 6187360, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0ca4cda000
close(3) = 0
execve("/home/schrodinger/.opam/default/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/var/lib/snapd/snap/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.idris2/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/mpich/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.local/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/intel/system_studio_2020/compilers_and_libraries/linux/bin/intel64/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/testa/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.cargo/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = 0
brk(NULL) = 0x557ec3ddb000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe95b408f0) = -1 EINVAL (Invalid argument)
openat(AT_FDCWD, "/tmp/snmalloc/test/libsnmallocshim.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\21\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1880712, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1659e35000
mmap(NULL, 16998432, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658dfe000
mmap(0x7f1658dff000, 57344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f1658dff000
mmap(0x7f1658e0d000, 147456, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f1658e0d000
mmap(0x7f1658e31000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x32000) = 0x7f1658e31000
mmap(0x7f1658e33000, 16781344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658e33000
close(3) = 0
access("/etc/ld.so.preload", R_OK) = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=458620, ...}) = 0
mmap(NULL, 458620, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f1658d8e000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libreadline.so.8", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 `\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=326416, ...}) = 0
mmap(NULL, 334344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658d3c000
mmap(0x7f1658d52000, 163840, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f1658d52000
mmap(0x7f1658d7a000, 40960, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3e000) = 0x7f1658d7a000
mmap(0x7f1658d84000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x47000) = 0x7f1658d84000
mmap(0x7f1658d8d000, 2568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658d8d000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\22\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=18608, ...}) = 0
mmap(NULL, 20624, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658d36000
mmap(0x7f1658d37000, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f1658d37000
mmap(0x7f1658d39000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658d39000
mmap(0x7f1658d3a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658d3a000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2146832, ...}) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
mmap(NULL, 1860456, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b6f000
mprotect(0x7f1658b94000, 1671168, PROT_NONE) = 0
mmap(0x7f1658b94000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f1658b94000
mmap(0x7f1658ce1000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x172000) = 0x7f1658ce1000
mmap(0x7f1658d2c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f1658d2c000
mmap(0x7f1658d32000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658d32000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\201\0\0\0\0\0\0"..., 832) = 832
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0:(A\261\254\325W\2768O\340i9\4#\234"..., 68, 824) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=161024, ...}) = 0
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0:(A\261\254\325W\2768O\340i9\4#\234"..., 68, 824) = 68
mmap(NULL, 135600, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b4d000
mmap(0x7f1658b54000, 65536, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7f1658b54000
mmap(0x7f1658b64000, 20480, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f1658b64000
mmap(0x7f1658b69000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b000) = 0x7f1658b69000
mmap(0x7f1658b6b000, 12720, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b6b000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libatomic.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 \0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=167952, ...}) = 0
mmap(NULL, 36936, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b43000
mmap(0x7f1658b45000, 12288, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f1658b45000
mmap(0x7f1658b48000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7f1658b48000
mmap(0x7f1658b4a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f1658b4a000
mmap(0x7f1658b4c000, 72, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b4c000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@`\t\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=20945112, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1658b41000
mmap(NULL, 1951744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658964000
mprotect(0x7f16589fa000, 1269760, PROT_NONE) = 0
mmap(0x7f16589fa000, 966656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x96000) = 0x7f16589fa000
mmap(0x7f1658ae6000, 299008, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x182000) = 0x7f1658ae6000
mmap(0x7f1658b30000, 57344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1cb000) = 0x7f1658b30000
mmap(0x7f1658b3e000, 10240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b3e000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\363\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1328000, ...}) = 0
mmap(NULL, 1327128, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f165881f000
mmap(0x7f165882e000, 634880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f165882e000
mmap(0x7f16588c9000, 626688, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xaa000) = 0x7f16588c9000
mmap(0x7f1658962000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x142000) = 0x7f1658962000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libncursesw.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 p\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=457736, ...}) = 0
mmap(NULL, 462072, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f16587ae000
mmap(0x7f16587c5000, 245760, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f16587c5000
mmap(0x7f1658801000, 98304, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x53000) = 0x7f1658801000
mmap(0x7f1658819000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6a000) = 0x7f1658819000
close(3) = 0
openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 0\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=595552, ...}) = 0
mmap(NULL, 103144, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658794000
mmap(0x7f1658797000, 69632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658797000
mmap(0x7f16587a8000, 16384, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f16587a8000
mmap(0x7f16587ac000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f16587ac000
close(3) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1658792000
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f165878f000
arch_prctl(ARCH_SET_FS, 0x7f165878f780) = 0
mprotect(0x7f1658d2c000, 12288, PROT_READ) = 0
mprotect(0x7f16587ac000, 4096, PROT_READ) = 0
mprotect(0x7f1658819000, 20480, PROT_READ) = 0
mprotect(0x7f1658962000, 4096, PROT_READ) = 0
mprotect(0x7f1658b30000, 53248, PROT_READ) = 0
mprotect(0x7f1658b69000, 4096, PROT_READ) = 0
mprotect(0x7f1658b4a000, 4096, PROT_READ) = 0
mprotect(0x7f1658d3a000, 4096, PROT_READ) = 0
mprotect(0x7f1658d84000, 12288, PROT_READ) = 0
mprotect(0x7f1658e31000, 4096, PROT_READ) = 0
mprotect(0x557ec257d000, 12288, PROT_READ) = 0
mprotect(0x7f1659e62000, 4096, PROT_READ) = 0
munmap(0x7f1658d8e000, 458620) = 0
set_tid_address(0x7f165878fa50) = 2446087
set_robust_list(0x7f165878fa60, 24) = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7f1658b54bf0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f1658b61960}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7f1658b54c90, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f1658b61960}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0xb), ...}) = 0
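The last `mmap` call in the trace asks for 4294967296 bytes (4 GiB) of anonymous memory, which cannot fit under the 500000 KiB address-space limit set with `ulimit -Sv`. The same failure can be provoked without snmalloc. The sketch below is a hypothetical stand-alone check for Linux (the function name `can_reserve` is made up for illustration) that temporarily lowers the soft `RLIMIT_AS` and attempts a reservation of the same size.

```python
import mmap
import resource


def can_reserve(size_bytes, soft_limit_bytes):
    """Try an anonymous private mapping of `size_bytes` under a temporary soft RLIMIT_AS."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # The soft limit may not exceed a finite hard limit.
    if hard != resource.RLIM_INFINITY:
        soft_limit_bytes = min(soft_limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft_limit_bytes, hard))
    try:
        # fileno=-1 requests anonymous memory, like MAP_ANONYMOUS in the trace.
        region = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    except OSError:
        # ENOMEM surfaces as OSError, matching the failed call in the strace output.
        return False
    else:
        region.close()
        return True
    finally:
        # Always restore the previous soft limit for the rest of the process.
        resource.setrlimit(resource.RLIMIT_AS, (old_soft, hard))
```

With the values from the trace, `can_reserve(4294967296, 500000 * 1024)` should return `False`, since `ulimit -v` counts in KiB and 4 GiB is well above that limit. Whether the same reservation succeeds without a limit depends on the machine's overcommit settings, so that case is deliberately not asserted here.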
from snmalloc.
@SchrodingerZhu I have raised an issue (#224) for what you have reported. I believe it is independent of the Windows commit issue.
from snmalloc.
@Licenser thanks. Did you monitor RSS, or just throughput? If you did monitor RSS, do you have transparent huge pages enabled?
from snmalloc.
I just looked at throughput; we don't have any benchmarks that look at memory, sorry.
from snmalloc.
@aganea I have also got rpmalloc and mimalloc working in the way your patch describes.
Initially, I am observing rpmalloc as slightly slower than snmalloc, but mimalloc is quite a bit slower. Is there anything I might be missing in building mimalloc? I manually applied your patch, and then did
msbuild mimalloc.sln /m /P:Configuration=Release /t:rebuild
from the ide\vs2019 directory.
Obviously, the machines are different, so we should expect different results. As this is running in the cloud, the costs of various operations are different, and there may be contention in Hyper-V.
It is definitely not running the system heap, as it is getting up to a reasonable percentage CPU utilization, which the system allocator does not.
from snmalloc.
@mjp41 What Windows 10 version is the underlying cloud system? It might definitely be something related to allocating hardware pages on the underlying system. mimalloc makes a lot more calls to VirtualAlloc than rpmalloc, which in turn makes more than snmalloc. Please take an ETW trace, then in WPA go to the RandomAscii inclusive view, right-click "Filter to selection" on lld-link, then add two columns, Module and Function, in this order: Process, Module, Function. You'll be able to tell pretty quickly where the bottleneck is. Normally, ntdll.dll & ntoskrnl.exe combined shouldn't take more than 0.5-0.8% of CPU, and most of the time is spent by xperf Rtl functions capturing the callstacks.
from snmalloc.
@aganea it is running in Azure, so I assume HyperV at the bottom, and Windows 10 version 1809 as the OS.
Looking at the traces, it is spending a lot of time inside ntoskrnl.exe inside spin locks, about 45%. I haven't drilled into the traces much, but it seems to be around page handling.
from snmalloc.
If the instance is running on 1809, then the behavior you're seeing is 'normal'.
There's a known issue in the NT kernel: there was contention in the page zero-out mechanism: https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc
This was fixed after version 1903.
Same dataset, same LLD linker:
However after 1909 there's a new contention issue in the large page allocation -- I don't know if it was fixed in version 2004: https://twitter.com/alex_toresh/status/1215125422226231297
from snmalloc.
Okay, I'll try to update the VM. Though, Windows update wants to go to 1909.
Looking at the numbers on the machine today, rpmalloc and the 16MiB configuration were about the same, and the 1MiB was slightly slower, but all were pretty close and within the level of noise, so I would actually have to do some statistics to draw a conclusion. On the machine Azure gave me yesterday, rpmalloc was slightly slower, but I didn't run enough tests to see if it was statistically significant.
Memory usage was approximately as you saw but off by a factor.
- 16MiB configuration - 32.4 GB
- 1MiB configuration - 16.2 GB
- rpmalloc - 28 GB (Not tried the array-cache branch yet)
- mimalloc - 14.3 GB
from snmalloc.
So my VM upgraded to 1909 and now mimalloc is even worse. On this machine it is giving rpmalloc about 5% faster than snmalloc 1MiB, with the 16MiB in the middle of them. Memory usage looks about the same.
I am going to move to 1MiB as the default. It works much better in terms of RSS/PWS, and there are very few scenarios where the reduced throughput seems too costly.
from snmalloc.
We're running it with default-features = false so it won't affect us, but I'll try to get in some benchmarks on Monday of the impact of 1mb vs no features :)
If you are using the Rust crate, it has just been updated and now requires setting either the 1mib feature or the 16mib feature.
This is a breaking change if you are using default-features=false.
from snmalloc.