
Comments (24)

aganea avatar aganea commented on May 22, 2024 3

@aganea thank you so much for running more tests.

You're very welcome! You folks have been very helpful so far :)

Just wanted to confirm I am doing the same as you

In essence, you have to do a two-stage LLVM build.

  1. git clone https://github.com/llvm/llvm-project -or- git pull, then git apply the patch from https://reviews.llvm.org/D71786
  2. The first stage builds LLVM with the bootstrap compiler (any compiler). You could use the allocator at this point if you wish.
  3. The second stage builds LLVM with the stage 1 compiler. At this point cmake uses ThinLTO & the allocator & O3 & -march=skylake (or whatever matches your CPU) to ensure maximum performance.
  4. Once everything is built, delete buildninjaStage2\bin\clang.exe, then re-run ninja clang -v. While it's linking (this should take a while), go into the folder buildninjaStage2\CMakeFiles and copy clang.rsp (a temp file created during the link) to clang2.rsp. You can cancel the link at that point.
  5. Put the following line in a new file buildninjaStage2\link.rsp: /nologo @CMakeFiles\clang2.rsp /out:bin\clang.exe /implib:lib\clang.lib /pdb:bin\clang.pdb /version:0.0 /machine:x64 -fuse-ld=lld /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /opt:lldltojobs=all. All these gymnastics are needed because there's no option to disable the ThinLTO cache from cmake, nor an option to use all hardware threads (by default, only one thread per core is used).
  6. You only need to do the above steps once. You can now run the test with:
> cd buildninjaStage2
> bin\lld-link @link.rsp

So it's using the stage2 LLD to relink the stage2 clang.
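Steps 5 and 6 can be scripted so that the same relink is timed several times. This is only a minimal sketch, assuming Python 3 is available and clang2.rsp was already saved as in step 4. The helper names `write_link_rsp` and `time_relink` are hypothetical; the flags are copied from step 5.

```python
import subprocess
import time
from pathlib import Path

# Flags copied verbatim from step 5; clang2.rsp is the copy saved in step 4.
LINK_FLAGS = (
    r"/nologo @CMakeFiles\clang2.rsp /out:bin\clang.exe /implib:lib\clang.lib "
    r"/pdb:bin\clang.pdb /version:0.0 /machine:x64 -fuse-ld=lld /STACK:10000000 "
    r"/DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /opt:lldltojobs=all"
)


def write_link_rsp(build_dir):
    """Write link.rsp into build_dir (e.g. buildninjaStage2) and return its path."""
    path = Path(build_dir) / "link.rsp"
    path.write_text(LINK_FLAGS + "\n")
    return path


def time_relink(build_dir, runs=3):
    """Relink clang with the stage2 lld-link `runs` times; return seconds per run."""
    build_dir = Path(build_dir).resolve()
    linker = str(build_dir / "bin" / "lld-link.exe")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Same command as step 6, executed from inside the build directory.
        subprocess.run([linker, "@link.rsp"], cwd=build_dir, check=True)
        timings.append(time.perf_counter() - start)
    return timings
```

Calling `write_link_rsp("buildninjaStage2")` once and then `time_relink("buildninjaStage2")` gives one wall-clock figure per run, which makes run-to-run noise visible.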

I use Bruce Dawson's UIforETW to take profile traces: https://github.com/google/UIforETW - make sure to check 'trace to file' first on the right side. Click 'Start Tracing' at the top before running the above cmd-line, then 'Save Trace Buffers' once it ends. After it is done compressing the trace, double-clicking on it opens WPA. If the traces are too big and you get out-of-memory crashes, set the following in the wpa.exe.config file next to wpa.exe:

<configuration>
  ...
  <runtime>
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>

I've attached my build script:
make_llvm_snmalloc.zip

You need to run it from a VS 2017 or 2019 x64 Native Tools Command Prompt:

D:\llvm-project> make_llvm_snmalloc.bat buildninjaStage1
D:\llvm-project> ninja check-all -C buildninjaStage1
D:\llvm-project> make_llvm_snmalloc.bat buildninjaStage2
D:\llvm-project> ninja check-all -C buildninjaStage2

GnuWin32, Python 3.8, ninja are also needed.

Please let me know if there're difficulties along the way.

from snmalloc.

Licenser avatar Licenser commented on May 22, 2024 2

So I ran our benchmark with and without 1mib and I couldn't see any significant difference.

from snmalloc.

aganea avatar aganea commented on May 22, 2024 1

Figures with latest master:

Allocator                     | Wall clock   | Page ranges committed/decommitted | Total touched pages | Peak Mem
default                       | 2 min 21 sec | 73,611                            | 43,8 GB             | 42,6 GB
+IS_ADDRESS_SPACE_CONSTRAINED | 2 min 21 sec | 48,836                            | 92,7 GB             | 21,6 GB

It seems the latest snmalloc checkout is a tad slower but the commit is now much better with David's suggestion.
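To make the comparison concrete, the ratios between the two rows can be worked out directly. This is a small sketch; it assumes that "73,611" uses a thousands separator while "43,8 GB" uses a decimal comma, and the dictionary keys are names chosen here for illustration.

```python
# Figures copied from the two rows above (decimal commas rewritten as points).
default = {"page_ranges": 73611, "touched_gb": 43.8, "peak_gb": 42.6}
constrained = {"page_ranges": 48836, "touched_gb": 92.7, "peak_gb": 21.6}


def ratio(key):
    """Value with IS_ADDRESS_SPACE_CONSTRAINED divided by the default value."""
    return constrained[key] / default[key]


for key in default:
    print(f"{key}: {ratio(key):.2f}x of default")
```

Under those assumptions, peak memory roughly halves (about 0.51x) and the number of committed/decommitted page ranges drops to about 0.66x, while total touched pages roughly double (about 2.12x).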

[Image: snmalloc_is_constrained]

from snmalloc.

Licenser avatar Licenser commented on May 22, 2024 1

We're running it with default-features = false so it won't affect us but I'll try to get in some benchmarks on Monday of the impact of 1mb vs no features :)

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024 1

@Licenser thanks for doing this.

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024 1

@aganea I have replicated the experiment so far. I have set up a Standard F72s_v2 (72 vCPUs, 144 GiB memory) Windows 10 instance on Azure, and am getting about 32GiB PWS with the 16MiB chunk size, and 16GiB PWS with the 1MiB chunk size. The times look like they might be slightly faster with 16MiB, but I am not sure, so I am running some statistically meaningful tests. I have also re-tested the tree_index branch.

One minor tip: you can run ninja -v -d keeprsp, so you don't have to worry about ninja deleting the rsp file in step 4 of your instructions.

from snmalloc.

aganea avatar aganea commented on May 22, 2024

@davidchisnall mentioned: "By default, on Windows, snmalloc only decommits memory when the kernel notifies it that memory is constrained. If you've got loads of spare memory, there's no problem letting the commit size grow a lot, it's only a negative if the memory could be usefully used for something else."

The test machine has 128 GB of RAM and memory is far from being constrained. I will nevertheless re-test with IS_ADDRESS_SPACE_CONSTRAINED and the latest master branch. My test was using the tree_index branch.

from snmalloc.

davidchisnall avatar davidchisnall commented on May 22, 2024

That looks a lot more plausible. We should probably rename IS_ADDRESS_SPACE_CONSTRAINED: it's a bit misleading. Since @mjp41's recent work, we rarely see a performance advantage from using 16MiB super slabs. I wonder if we should consider adjusting the default to 1MiB (or 2MiB, which would play nicely with superpages on x86).

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

@aganea thank you so much for running more tests. I wonder if the small regression in performance comes from not using the tree_index branch. That should have improved Windows performance a bit. I have rebased the tree_index branch onto master, so we can test those changes again to see if they account for the regression.

I have been setting up an LLVM Windows build, so I can test the link time. Just wanted to confirm I am doing the same as you:

cmake ..\llvm-project\llvm \
   -DLLVM_ENABLE_LTO=On \
   -DCMAKE_LINKER=c:/src/malloc-llvm/build/lld-link.exe \
   -DLLVM_ENABLE_PROJECTS=clang \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_C_COMPILER=c:/src/malloc-llvm/build/bin/clang-cl.exe \
   -DCMAKE_CXX_COMPILER=c:/src/malloc-llvm/build/bin/clang-cl.exe \
   -G Ninja

When building this, I assume you are measuring the final step:

Linking CXX executable bin\clang.exe  

I am building with the latest master, but I haven't tried to apply your patch yet. Just getting the very slow version with lld-link so far.

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

@davidchisnall I think moving to the 1MiB size as default would make a lot of sense. I think 1MiB could work well with a little fiddling around huge pages, so we can put two threads into one huge page, for low-memory multi-threaded scenarios.
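The relationship between chunk size and huge pages is simple arithmetic. The sketch below assumes the 2MiB x86 superpage size mentioned above; the function names are hypothetical and it does not reflect how snmalloc lays out memory internally.

```python
MIB = 1024 * 1024
HUGE_PAGE = 2 * MIB  # x86 superpage size mentioned in the thread


def chunks_per_huge_page(chunk_size):
    """How many chunks of chunk_size fit in one huge page (0 if the chunk is larger)."""
    return HUGE_PAGE // chunk_size


def huge_pages_per_chunk(chunk_size):
    """How many huge pages a single chunk spans, rounded up."""
    return -(-chunk_size // HUGE_PAGE)


for size in (1 * MIB, 2 * MIB, 16 * MIB):
    print(f"{size // MIB}MiB chunk: {chunks_per_huge_page(size)} per huge page, "
          f"spans {huge_pages_per_chunk(size)} huge page(s)")
```

With 1MiB chunks, two chunks share one huge page, which is the "two threads into one huge page" case; a 16MiB chunk spans eight huge pages.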

Agreed, IS_ADDRESS_SPACE_CONSTRAINED is a terrible name; I named it after why I needed it originally, rather than what it does.

@Licenser, @darach, @SchrodingerZhu any thoughts on changing the default to 1MiB?

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

This comment contains some benchmarking using the microbenchmarks for mimalloc for the different chunk sizes.

@plietar any thoughts on changing the default to 1MiB?

from snmalloc.

SchrodingerZhu avatar SchrodingerZhu commented on May 22, 2024

On a small Linux OpenVZ instance (2GiB in total), after upgrading to snmalloc-rs==0.2.16, I ran into "Out of memory" on initialisation, with and without the 1mib flag.

The following backtrace is not very useful, but I may have time to check the problem further:

Out of memory
/bin/utopia(+0x62ff7b)[0x56062d9fdf7b]
/bin/utopia(+0x6304b3)[0x56062d9fe4b3]
/bin/utopia(+0x630511)[0x56062d9fe511]
/bin/utopia(+0x630b72)[0x56062d9feb72]
/bin/utopia(+0x6340fc)[0x56062da020fc]
/bin/utopia(+0x3c75e3)[0x56062d7955e3]
/bin/utopia(+0x17c2a7)[0x56062d54a2a7]
/lib64/libc.so.6(__libc_start_main+0xf3)[0x7f27569406a3]
/bin/utopia(+0xe8ede)[0x56062d4b6ede]

I confirmed that this is caused by snmalloc, since using the system default malloc solves the memory issue. The problem occurs right after upgrading to 0.2.16 (94a2ba4).

from snmalloc.

SchrodingerZhu avatar SchrodingerZhu commented on May 22, 2024

This is how I reproduce a similar problem on my PC:

  • firejail --noprofile --rlimit-as=2147483648 bash: no problem
  • firejail --noprofile --rlimit-as=2147483648 --env=LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash: dead
  • firejail --noprofile --env=LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash: no problem
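The same kind of address-space cap can be tried without firejail. The sketch below is a Linux-only illustration using Python's `resource` module: a child process lowers RLIMIT_AS and then attempts one large anonymous mapping. The helper name `try_reserve` is hypothetical. Whether snmalloc fails for exactly this reason is an assumption here, although the strace later in the thread does show a failing 4294967296-byte mmap.

```python
import subprocess
import sys

# Code run in a child process so the limit never affects the caller.
CHILD = r"""
import errno, mmap, resource, sys
limit, request = int(sys.argv[1]), int(sys.argv[2])
if limit > 0:
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit = min(limit, hard)  # the soft limit cannot exceed the hard limit
    resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
try:
    # Private anonymous mapping of `request` bytes.
    mmap.mmap(-1, request, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    print("ok")
except OSError as exc:
    print(errno.errorcode.get(exc.errno, str(exc.errno)))
"""


def try_reserve(limit_bytes, request_bytes):
    """Map request_bytes in a child capped at limit_bytes (0 means no cap)."""
    out = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes), str(request_bytes)],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()
```

For example, `try_reserve(2147483648, 4294967296)` uses the same 2 GiB cap as `--rlimit-as=2147483648` with a request twice that size; on Linux the mapping is expected to be refused with ENOMEM.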

from snmalloc.

SchrodingerZhu avatar SchrodingerZhu commented on May 22, 2024

[schrodinger@Monad utopia]$ ulimit -Sv 500000
[schrodinger@Monad utopia]$ strace env LD_PRELOAD=/tmp/snmalloc/test/libsnmallocshim.so bash
execve("/usr/bin/env", ["env", "LD_PRELOAD=/tmp/snmalloc/test/li"..., "bash"], 0x7ffe909ace00 /* 69 vars */) = 0
brk(NULL)                               = 0x55a2b674d000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe9daa37e0) = -1 EINVAL (Invalid argument)
access("/etc/ld.so.preload", R_OK)      = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=458620, ...}) = 0
mmap(NULL, 458620, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0ca548a000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2146832, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0ca5488000
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
mmap(NULL, 1860456, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f0ca52c1000
mprotect(0x7f0ca52e6000, 1671168, PROT_NONE) = 0
mmap(0x7f0ca52e6000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f0ca52e6000
mmap(0x7f0ca5433000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x172000) = 0x7f0ca5433000
mmap(0x7f0ca547e000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f0ca547e000
mmap(0x7f0ca5484000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ca5484000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f0ca5489580) = 0
mprotect(0x7f0ca547e000, 12288, PROT_READ) = 0
mprotect(0x55a2b5e45000, 4096, PROT_READ) = 0
mprotect(0x7f0ca5525000, 4096, PROT_READ) = 0
munmap(0x7f0ca548a000, 458620)          = 0
brk(NULL)                               = 0x55a2b674d000
brk(0x55a2b676e000)                     = 0x55a2b676e000
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=6187360, ...}) = 0
mmap(NULL, 6187360, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0ca4cda000
close(3)                                = 0
execve("/home/schrodinger/.opam/default/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/var/lib/snapd/snap/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.idris2/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/mpich/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.local/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/intel/system_studio_2020/compilers_and_libraries/linux/bin/intel64/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/opt/testa/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/home/schrodinger/.cargo/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/local/sbin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = -1 ENOENT (No such file or directory)
execve("/usr/bin/bash", ["bash"], 0x55a2b674e550 /* 70 vars */) = 0
brk(NULL)                               = 0x557ec3ddb000
arch_prctl(0x3001 /* ARCH_??? */, 0x7ffe95b408f0) = -1 EINVAL (Invalid argument)
openat(AT_FDCWD, "/tmp/snmalloc/test/libsnmallocshim.so", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@\21\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1880712, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1659e35000
mmap(NULL, 16998432, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658dfe000
mmap(0x7f1658dff000, 57344, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f1658dff000
mmap(0x7f1658e0d000, 147456, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f1658e0d000
mmap(0x7f1658e31000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x32000) = 0x7f1658e31000
mmap(0x7f1658e33000, 16781344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658e33000
close(3)                                = 0
access("/etc/ld.so.preload", R_OK)      = 0
openat(AT_FDCWD, "/etc/ld.so.preload", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
close(3)                                = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=458620, ...}) = 0
mmap(NULL, 458620, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f1658d8e000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libreadline.so.8", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 `\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=326416, ...}) = 0
mmap(NULL, 334344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658d3c000
mmap(0x7f1658d52000, 163840, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x16000) = 0x7f1658d52000
mmap(0x7f1658d7a000, 40960, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3e000) = 0x7f1658d7a000
mmap(0x7f1658d84000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x47000) = 0x7f1658d84000
mmap(0x7f1658d8d000, 2568, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658d8d000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\20\22\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=18608, ...}) = 0
mmap(NULL, 20624, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658d36000
mmap(0x7f1658d37000, 8192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1000) = 0x7f1658d37000
mmap(0x7f1658d39000, 4096, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658d39000
mmap(0x7f1658d3a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658d3a000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@q\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=2146832, ...}) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0\20\0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0", 32, 848) = 32
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0\346~\200\347\6\31qw\t\343\30\16U*\21\242"..., 68, 880) = 68
mmap(NULL, 1860456, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b6f000
mprotect(0x7f1658b94000, 1671168, PROT_NONE) = 0
mmap(0x7f1658b94000, 1363968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x25000) = 0x7f1658b94000
mmap(0x7f1658ce1000, 303104, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x172000) = 0x7f1658ce1000
mmap(0x7f1658d2c000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bc000) = 0x7f1658d2c000
mmap(0x7f1658d32000, 13160, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658d32000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\220\201\0\0\0\0\0\0"..., 832) = 832
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0:(A\261\254\325W\2768O\340i9\4#\234"..., 68, 824) = 68
fstat(3, {st_mode=S_IFREG|0755, st_size=161024, ...}) = 0
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0:(A\261\254\325W\2768O\340i9\4#\234"..., 68, 824) = 68
mmap(NULL, 135600, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b4d000
mmap(0x7f1658b54000, 65536, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x7000) = 0x7f1658b54000
mmap(0x7f1658b64000, 20480, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f1658b64000
mmap(0x7f1658b69000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b000) = 0x7f1658b69000
mmap(0x7f1658b6b000, 12720, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b6b000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libatomic.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0  \0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=167952, ...}) = 0
mmap(NULL, 36936, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658b43000
mmap(0x7f1658b45000, 12288, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f1658b45000
mmap(0x7f1658b48000, 8192, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x5000) = 0x7f1658b48000
mmap(0x7f1658b4a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f1658b4a000
mmap(0x7f1658b4c000, 72, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b4c000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libstdc++.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0@`\t\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=20945112, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1658b41000
mmap(NULL, 1951744, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658964000
mprotect(0x7f16589fa000, 1269760, PROT_NONE) = 0
mmap(0x7f16589fa000, 966656, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x96000) = 0x7f16589fa000
mmap(0x7f1658ae6000, 299008, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x182000) = 0x7f1658ae6000
mmap(0x7f1658b30000, 57344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1cb000) = 0x7f1658b30000
mmap(0x7f1658b3e000, 10240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f1658b3e000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\363\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=1328000, ...}) = 0
mmap(NULL, 1327128, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f165881f000
mmap(0x7f165882e000, 634880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xf000) = 0x7f165882e000
mmap(0x7f16588c9000, 626688, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xaa000) = 0x7f16588c9000
mmap(0x7f1658962000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x142000) = 0x7f1658962000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libncursesw.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 p\1\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=457736, ...}) = 0
mmap(NULL, 462072, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f16587ae000
mmap(0x7f16587c5000, 245760, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f16587c5000
mmap(0x7f1658801000, 98304, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x53000) = 0x7f1658801000
mmap(0x7f1658819000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6a000) = 0x7f1658819000
close(3)                                = 0
openat(AT_FDCWD, "/usr/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 0\0\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0644, st_size=595552, ...}) = 0
mmap(NULL, 103144, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f1658794000
mmap(0x7f1658797000, 69632, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0x7f1658797000
mmap(0x7f16587a8000, 16384, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x14000) = 0x7f16587a8000
mmap(0x7f16587ac000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x17000) = 0x7f16587ac000
close(3)                                = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f1658792000
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f165878f000
arch_prctl(ARCH_SET_FS, 0x7f165878f780) = 0
mprotect(0x7f1658d2c000, 12288, PROT_READ) = 0
mprotect(0x7f16587ac000, 4096, PROT_READ) = 0
mprotect(0x7f1658819000, 20480, PROT_READ) = 0
mprotect(0x7f1658962000, 4096, PROT_READ) = 0
mprotect(0x7f1658b30000, 53248, PROT_READ) = 0
mprotect(0x7f1658b69000, 4096, PROT_READ) = 0
mprotect(0x7f1658b4a000, 4096, PROT_READ) = 0
mprotect(0x7f1658d3a000, 4096, PROT_READ) = 0
mprotect(0x7f1658d84000, 12288, PROT_READ) = 0
mprotect(0x7f1658e31000, 4096, PROT_READ) = 0
mprotect(0x557ec257d000, 12288, PROT_READ) = 0
mprotect(0x7f1659e62000, 4096, PROT_READ) = 0
munmap(0x7f1658d8e000, 458620)          = 0
set_tid_address(0x7f165878fa50)         = 2446087
set_robust_list(0x7f165878fa60, 24)     = 0
rt_sigaction(SIGRTMIN, {sa_handler=0x7f1658b54bf0, sa_mask=[], sa_flags=SA_RESTORER|SA_SIGINFO, sa_restorer=0x7f1658b61960}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {sa_handler=0x7f1658b54c90, sa_mask=[], sa_flags=SA_RESTORER|SA_RESTART|SA_SIGINFO, sa_restorer=0x7f1658b61960}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0xb), ...}) = 0
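The failing call in the trace is the 4294967296-byte mmap. A quick arithmetic check shows why it cannot succeed under either limit used in this thread. This is a sketch; the variable names are chosen here, and the only outside fact relied on is that bash's `ulimit -v` takes its value in units of 1024 bytes.

```python
KIB = 1024
GIB = 1024 ** 3

request = 4294967296        # size of the failing mmap in the trace
ulimit_sv = 500000 * KIB    # `ulimit -Sv 500000`: bash counts in 1024-byte units
firejail_as = 2147483648    # --rlimit-as value from the firejail repro

print(f"request:     {request / GIB:.2f} GiB")
print(f"ulimit -Sv:  {ulimit_sv / GIB:.2f} GiB")
print(f"firejail:    {firejail_as / GIB:.2f} GiB")

# A single mapping larger than the whole address-space limit must fail,
# regardless of what else the process has mapped.
print(request > ulimit_sv, request > firejail_as)
```

The request is 4 GiB, while the limits are about 0.48 GiB and 2 GiB, so the mapping is refused in both setups.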

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

@SchrodingerZhu I have raised an issue (#224) for what you have reported. I believe it is independent of the Windows commit issue.

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

@Licenser thanks. Did you monitor RSS, or just throughput? If you did monitor RSS, do you have transparent huge pages enabled?

from snmalloc.

Licenser avatar Licenser commented on May 22, 2024

I just looked at throughput; we don't have any benchmarks that look at memory, sorry.

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

@aganea I have also got rpmalloc and mimalloc working in the way your patch describes.

Initially, I am observing rpmalloc as slightly slower than snmalloc, but mimalloc is quite a bit slower. Is there anything I might be missing in building mimalloc? I manually applied your patch, and then ran

msbuild mimalloc.sln /m /P:Configuration=Release /t:rebuild

from the ide\vs2019 directory.

Obviously, the machines are different, so we should expect different results. As this is running in the cloud, the costs of various operations are different, and there may be contention in Hyper-V.

It is definitely not running the system heap, as it is getting up to a reasonable percentage CPU utilization, which the system allocator does not.

from snmalloc.

aganea avatar aganea commented on May 22, 2024

@mjp41 What Windows 10 version is the underlying cloud system? It might definitely be something related to allocating hardware pages on the underlying system. mimalloc makes a lot more calls to VirtualAlloc than rpmalloc or snmalloc. Please take an ETW trace, then in WPA go to the RandomAscii inclusive view, right-click lld-link and choose "Filter to selection", then add two columns, Module and Function, so that the order is: Process, Module, Function. You'll be able to tell pretty quickly where the bottleneck is. Normally, ntdll.dll & ntoskrnl.exe combined shouldn't take more than 0.5-0.8% of CPU, and most of that time is spent by xperf Rtl functions capturing the callstacks.

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

@aganea it is running in Azure, so I assume Hyper-V at the bottom, and Windows 10 version 1809 as the OS.

Looking at the traces, it is spending a lot of time, about 45%, inside spin locks in ntoskrnl.exe. I haven't drilled into the traces much, but it seems to be around page handling.

from snmalloc.

aganea avatar aganea commented on May 22, 2024

If the instance is running on 1809, then the behavior you're seeing is 'normal'.

There's a known issue in the NT kernel, namely contention in the page zero-out mechanism: https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc
This was fixed after version 1903.

Same dataset, same LLD linker:
[Image: 6140_ThinLTO_1709_vs_1909]

However, after 1909 there's a new contention issue in large page allocation -- I don't know if it was fixed in version 2004: https://twitter.com/alex_toresh/status/1215125422226231297

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

Okay, I'll try to update the VM. Though, Windows update wants to go to 1909.

Looking at the numbers on the machine today, rpmalloc and the 16MiB configuration were about the same, and the 1MiB configuration was slightly slower, but all were pretty close and within the level of noise, so I would actually have to do some statistics to draw a conclusion. The machine Azure gave me yesterday had rpmalloc as slightly slower, but I didn't run enough tests to see if that was statistically significant.
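One common way to "do some statistics" on repeated link times is Welch's t statistic, which compares two sets of timings without assuming equal variance. The sketch below is only an illustration of that technique, not the method actually used in this thread; `welch_t` is a hypothetical helper and the numbers in the usage line are toy values, not measurements.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (e.g. link times in seconds)."""
    # Each term is the sample variance divided by the sample size.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    return (mean(a) - mean(b)) / sqrt(va + vb)


# Toy values only, to show the call shape.
print(welch_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

A value close to zero means the difference between the two means is small relative to the run-to-run noise. Turning the statistic into a p-value additionally needs the Welch-Satterthwaite degrees of freedom, which are left out here.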

Memory usage was approximately as you saw but off by a factor.

  • 16MiB configuration - 32.4 GB
  • 1MiB configuration - 16.2 GB
  • rpmalloc - 28 GB (not tried the array-cache branch yet)
  • mimalloc - 14.3 GB
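The "off by a factor" remark can be checked against the peak-memory column of the table earlier in the thread. This sketch assumes that the "default" row there corresponds to the 16MiB configuration and the IS_ADDRESS_SPACE_CONSTRAINED row to the 1MiB configuration, as the discussion suggests, and that "42,6 GB" uses a decimal comma.

```python
# Peak memory in GB: earlier table (128 GB test machine) vs. the list above (Azure VM).
earlier = {"16MiB": 42.6, "1MiB": 21.6}
azure = {"16MiB": 32.4, "1MiB": 16.2}

for config in earlier:
    factor = earlier[config] / azure[config]
    print(f"{config}: earlier/Azure = {factor:.2f}")

# Ratio between the two configurations on each machine.
print(f"earlier 16MiB/1MiB = {earlier['16MiB'] / earlier['1MiB']:.2f}")
print(f"azure   16MiB/1MiB = {azure['16MiB'] / azure['1MiB']:.2f}")
```

Under those assumptions both configurations differ between the machines by a factor of about 1.3, and on each machine the 16MiB configuration uses about twice the memory of the 1MiB one.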

from snmalloc.

mjp41 avatar mjp41 commented on May 22, 2024

So my VM upgraded to 1909 and now mimalloc is even worse. On this machine, rpmalloc is about 5% faster than snmalloc 1MiB, with the 16MiB configuration in between. Memory usage looks about the same.

I am going to move to 1MiB as the default. It works much better in terms of RSS/PWS, and there are very few scenarios where the reduced throughput seems too costly.

from snmalloc.

SchrodingerZhu avatar SchrodingerZhu commented on May 22, 2024

@Licenser

We're running it with default-features = false so it won't affect us but I'll try to get in some benchmarks on Monday of the impact of 1mb vs no features :)

If you are using the Rust crate, it has just been updated and it now requires setting either the 1mib feature or the 16mib feature.
This is a breaking change if you are using default-features=false.

from snmalloc.
