Comments (17)
The way I see it (and perf record confirms), the slowdown is exactly what I described in the original issue post: the allocations are made in TCPHandler in ClickHouse and the memory is not reused (the case of repeating the same allocation in a std::unique_ptr).
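For illustration, a minimal sketch of that allocation pattern (hypothetical code, not the actual TCPHandler source): each chunk of work allocates a fresh large buffer and drops it, so the allocator sees a steady stream of large alloc/free pairs with no reuse.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical sketch of the pattern described above: a fresh ~1 MiB buffer
// is allocated and freed per chunk, so the allocator must either cache such
// blocks internally or pay for mmap/munmap on every iteration.
void process_chunks(std::size_t chunk_count) {
    constexpr std::size_t kBufSize = 1 << 20;  // ~1 MiB, well above 512 KB
    for (std::size_t i = 0; i < chunk_count; ++i) {
        auto buffer = std::make_unique<char[]>(kBufSize);
        buffer[0] = 1;  // stand-in for decompressing/processing into the buffer
        // buffer is destroyed here; the memory goes back to the allocator
    }
}
```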
jemalloc does a great job without issuing a huge number of syscalls in this case (the SQL query above):
$ sudo strace -fe mmap,munmap -p 360474
strace: Process 360474 attached with 58 threads
[pid 360527] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff0675f7000
[pid 360527] munmap(0x7ff0675f7000, 2097152) = 0
[pid 360527] mmap(NULL, 4190208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff0673f8000
[pid 360527] munmap(0x7ff0673f8000, 32768) = 0
[pid 360527] munmap(0x7ff067600000, 2060288) = 0
[pid 360527] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff067200000
[pid 360523] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff067000000
[pid 360523] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066e00000
[pid 360523] mmap(NULL, 2621440, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066b80000
[pid 360523] mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066880000
[pid 360513] mmap(NULL, 2621440, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066600000
[pid 360513] mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066300000
[pid 360513] mmap(NULL, 3670016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff065f80000
[pid 360513] mmap(NULL, 5242880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff065a80000
And with mimalloc I have:
$ sudo strace -fe mmap,munmap -p 376681 2>out
$ wc -l out
6576 out
The sample is exactly what I wrote about -- many >512 KB allocations with many munmaps (an almost identical sample of nearly a thousand lines: https://pastebin.com/xcfcWV1e):
[pid 376728] munmap(0x7feb88800000, 635576) = 0
[pid 376728] munmap(0x7feb88c00000, 1048784) = 0
[pid 376728] mmap(NULL, 1048784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febf885c000
[pid 376728] munmap(0x7febf885c000, 1048784) = 0
[pid 376728] mmap(NULL, 5243088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb88aff000
[pid 376728] munmap(0x7feb88aff000, 1052672) = 0
[pid 376728] munmap(0x7feb88d01000, 3137744) = 0
[pid 376728] munmap(0x7feb88c00000, 1048784) = 0
[pid 376728] mmap(NULL, 1048784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febf885c000
[pid 376728] munmap(0x7febf885c000, 1048784) = 0
[pid 376728] mmap(NULL, 5243088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb88aff000
[pid 376728] munmap(0x7feb88aff000, 1052672) = 0
[pid 376728] munmap(0x7feb88d01000, 3137744) = 0
[pid 376728] mmap(NULL, 660344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febf88bb000
[pid 376728] munmap(0x7febf88bb000, 660344) = 0
Thanks Danila -- super helpful. I am traveling, but will try this soon next week. I can already see it is due to many "huge" (>1MB) allocations being allocated and freed (using expensive mmap's). This is not quite the use case for mimalloc (being built for many short-lived small allocations :-) ) -- but I have ideas on how to fix this: there is already code to do pooled huge page allocations, and I'll experiment with that.
Sorry for necroposting, but I wanted to clarify this message -- does
"mimalloc (being built for many short lived small allocations :-) )"
mean that mimalloc is built for a lot of short-lived small allocations, or the opposite -- that it is built for big allocations that live long?
Thank you for your answer; I hope we can collaborate a lot. My email, if you have any questions: [email protected]. And I definitely need to read everything about mimalloc before drawing any conclusions :)
Reproduction instructions for the ClickHouse performance issue (overall, it can be a bit hard to get all the settings right from scratch):
- You can use these instructions (trust me, they are simple) to build and run ClickHouse: https://clickhouse.yandex/docs/en/development/build/, https://clickhouse.yandex/docs/en/development/tests/
- Then you should download one of our anonymized datasets: https://clickhouse.yandex/docs/en/getting_started/example_datasets/metrica/
- I made a branch especially for this case, called mimalloc.
- Then you should comment out the function in contrib/ssl/crypto/compat/reallocarray.c because it is ambiguous (I will investigate this issue separately). The build uses mimalloc by default; to turn it off (which turns jemalloc back on), pass -D ENABLE_MIMALLOC=0 to cmake and uncomment the function in ssl. The debug build type is -D CMAKE_BUILD_TYPE=Debug.
- Then I changed the code locally to turn on statistics in mimalloc-types.h.
The query I executed was even a single-threaded query:
SELECT count(*)
FROM danlark_table
WHERE NOT ignore(URL)
SETTINGS max_threads = 1
From a big dataset that can show some more information (though I can't share the source). This is not quite the end of the execution -- it is just before the end, I believe -- but during execution I already see a huge slowdown:
heap stats: peak total freed unit count
normal 1: 98.2 kb 338.7 kb 337.1 kb 8 b 43.4 k not all freed!
normal 2: 29.6 kb 133.3 kb 121.8 kb 16 b 8.5 k not all freed!
normal 4: 2.7 mb 3.3 mb 3.2 mb 32 b 107.8 k not all freed!
normal 6: 591.5 kb 3.4 mb 3.4 mb 48 b 75.2 k not all freed!
normal 8: 873.9 kb 6.7 mb 6.5 mb 64 b 109.1 k not all freed!
normal 9: 3.7 mb 9.1 mb 9.0 mb 80 b 119.1 k not all freed!
normal 10: 8.4 kb 165.5 kb 159.1 kb 96 b 1.8 k not all freed!
normal 11: 3.1 mb 4.3 mb 4.3 mb 112 b 40.3 k not all freed!
normal 12: 195.4 kb 365.1 kb 172.6 kb 128 b 2.9 k not all freed!
normal 13: 4.5 kb 64.2 kb 60.0 kb 160 b 411 not all freed!
normal 14: 425.2 kb 1.6 mb 1.3 mb 192 b 8.9 k not all freed!
normal 15: 111.1 kb 125.1 kb 38.5 kb 224 b 572 not all freed!
normal 16: 16.8 kb 90.2 kb 80.5 kb 256 b 361 not all freed!
normal 17: 3.1 kb 35.6 kb 33.8 kb 320 b 114 not all freed!
normal 18: 30.0 kb 75.4 kb 52.9 kb 384 b 201 not all freed!
normal 19: 35.4 kb 49.0 kb 40.7 kb 448 b 112 not all freed!
normal 20: 10.0 kb 267.0 kb 260.5 kb 512 b 534 not all freed!
normal 21: 74.4 kb 87.5 kb 14.4 kb 640 b 140 not all freed!
normal 22: 16.5 kb 27.0 kb 26.2 kb 768 b 36 not all freed!
normal 23: 6.1 kb 37.6 kb 36.8 kb 896 b 43 not all freed!
normal 24: 8.0 kb 11.0 kb 7.0 kb 1.0 kb 11 not all freed!
normal 25: 25.0 kb 26.2 kb 1.2 kb 1.2 kb 21 not all freed!
normal 26: 13.5 kb 24.0 kb 24.0 kb 1.5 kb 16 ok
normal 27: 7.0 kb 71.8 kb 70.0 kb 1.8 kb 41 not all freed!
normal 28: 24.0 kb 28.0 kb 14.0 kb 2.0 kb 14 not all freed!
normal 29: 207.5 kb 492.5 kb 290.0 kb 2.5 kb 197 not all freed!
normal 30: 24.0 kb 33.0 kb 33.0 kb 3.0 kb 11 ok
normal 31: 7.0 kb 24.5 kb 21.0 kb 3.5 kb 7 not all freed!
normal 32: 37.6 mb 37.6 mb 37.6 mb 4.0 kb 9.6 k not all freed!
normal 33: 20.0 kb 45.0 kb 45.0 kb 5.0 kb 9 ok
normal 34: 144.0 kb 204.0 kb 66.0 kb 6.0 kb 34 not all freed!
normal 36: 32.0 kb 32.0 kb 24.0 kb 8.0 kb 4 not all freed!
normal 37: 10.0 kb 10.0 kb 10.0 kb 10.0 kb 1 ok
normal 38: 36.0 kb 60.0 kb 48.0 kb 12.0 kb 5 not all freed!
normal 39: 28.0 kb 56.0 kb 28.0 kb 14.0 kb 4 not all freed!
normal 40: 48.0 kb 80.0 kb 48.0 kb 16.0 kb 5 not all freed!
normal 41: 20.0 kb 40.0 kb 20.0 kb 20.0 kb 2 not all freed!
normal 42: 24.0 kb 24.0 kb 24.0 kb 24.0 kb 1 ok
normal 44: 64.0 kb 64.0 kb 32.0 kb 32.0 kb 2 not all freed!
normal 45: 680.0 kb 920.0 kb 920.0 kb 40.0 kb 23 ok
normal 48: 256.0 kb 256.0 kb 64.0 kb 64.0 kb 4 not all freed!
normal 50: 96.0 kb 96.0 kb 0 b 96.0 kb 1 not all freed!
normal 52: 87.0 mb 902.2 mb 902.2 mb 128.0 kb 7.2 k ok
normal 54: 192.0 kb 192.0 kb 192.0 kb 192.0 kb 1 ok
normal 55: 896.0 kb 896.0 kb 896.0 kb 224.0 kb 4 ok
normal 56: 90.0 mb 605.8 mb 605.2 mb 256.0 kb 2.4 k not all freed!
normal 57: 10.9 mb 10.9 mb 10.6 mb 320.0 kb 35 not all freed!
normal 58: 18.4 mb 18.4 mb 18.4 mb 384.0 kb 49 ok
normal 59: 17.1 mb 17.1 mb 17.1 mb 448.0 kb 39 ok
normal 60: 23.0 mb 23.0 mb 22.5 mb 512.0 kb 46 not all freed!
normal 61: 30.6 mb 30.6 mb 30.6 mb 640.0 kb 49 ok
normal 62: 19.5 mb 19.5 mb 19.5 mb 768.0 kb 26 ok
normal 63: 7.0 mb 7.0 mb 7.0 mb 896.0 kb 8 ok
normal 64: 4.6 gb 4.6 gb 0 b 1.0 mb 4.7 k not all freed!
heap stats: peak total freed unit count
normal: 5.0 gb 6.3 gb 1.7 gb 1 b not all freed!
huge: 737.0 mb 13.2 gb 17.9 gb 1 b ok
total: 5.7 gb 19.5 gb 19.5 gb 1 b not all freed!
malloc requested: 19.5 gb
committed: 1.1 gb 17.9 gb 17.9 gb 1 b not all freed!
reserved: 1.3 gb 18.1 gb 17.9 gb 1 b not all freed!
reset: 0 0 0
segments: 591 9.5 k 9.4 k
-abandoned: 0 0 0
pages: 939 9.8 k 9.4 k
-abandoned: 0 0 0
-extended: 10.4 k
mmaps: 19.5 k
mmap fast: 50
mmap slow: 9.7 k
threads: 15
searches: 0.1 avg
elapsed: 35.540 s
From test.hits (the dataset you will download):
heap stats: peak total freed unit count
normal 1: 12.8 kb 69.4 kb 67.8 kb 8 b 8.9 k not all freed!
normal 2: 18.8 kb 87.9 kb 75.5 kb 16 b 5.6 k not all freed!
normal 4: 342.2 kb 908.5 kb 854.3 kb 32 b 29.1 k not all freed!
normal 6: 188.1 kb 2.3 mb 2.2 mb 48 b 50.8 k not all freed!
normal 8: 335.5 kb 1.8 mb 1.7 mb 64 b 30.2 k not all freed!
normal 9: 420.2 kb 1.1 mb 1.1 mb 80 b 15.0 k not all freed!
normal 10: 8.1 kb 169.7 kb 164.0 kb 96 b 1.8 k not all freed!
normal 11: 333.0 kb 560.5 kb 547.6 kb 112 b 5.1 k not all freed!
normal 12: 199.2 kb 314.5 kb 118.9 kb 128 b 2.5 k not all freed!
normal 13: 6.4 kb 70.9 kb 64.8 kb 160 b 454 not all freed!
normal 14: 429.6 kb 1.7 mb 1.4 mb 192 b 9.2 k not all freed!
normal 15: 111.3 kb 126.0 kb 40.2 kb 224 b 576 not all freed!
normal 16: 17.5 kb 92.2 kb 81.5 kb 256 b 369 not all freed!
normal 17: 3.4 kb 38.1 kb 36.2 kb 320 b 122 not all freed!
normal 18: 31.5 kb 78.8 kb 55.5 kb 384 b 210 not all freed!
normal 19: 35.9 kb 49.4 kb 40.7 kb 448 b 113 not all freed!
normal 20: 11.5 kb 40.5 kb 35.5 kb 512 b 81 not all freed!
normal 21: 76.9 kb 94.4 kb 20.6 kb 640 b 151 not all freed!
normal 22: 21.8 kb 36.8 kb 30.8 kb 768 b 49 not all freed!
normal 23: 7.0 kb 40.2 kb 38.5 kb 896 b 46 not all freed!
normal 24: 9.0 kb 13.0 kb 9.0 kb 1.0 kb 13 not all freed!
normal 25: 27.5 kb 31.2 kb 5.0 kb 1.2 kb 25 not all freed!
normal 26: 15.0 kb 25.5 kb 25.5 kb 1.5 kb 17 ok
normal 27: 7.0 kb 71.8 kb 70.0 kb 1.8 kb 41 not all freed!
normal 28: 24.0 kb 28.0 kb 14.0 kb 2.0 kb 14 not all freed!
normal 29: 212.5 kb 497.5 kb 290.0 kb 2.5 kb 199 not all freed!
normal 30: 30.0 kb 36.0 kb 36.0 kb 3.0 kb 12 ok
normal 31: 7.0 kb 24.5 kb 21.0 kb 3.5 kb 7 not all freed!
normal 32: 3.7 mb 3.7 mb 3.7 mb 4.0 kb 951 not all freed!
normal 33: 20.0 kb 45.0 kb 45.0 kb 5.0 kb 9 ok
normal 34: 150.0 kb 210.0 kb 66.0 kb 6.0 kb 35 not all freed!
normal 36: 48.0 kb 48.0 kb 24.0 kb 8.0 kb 6 not all freed!
normal 37: 10.0 kb 10.0 kb 10.0 kb 10.0 kb 1 ok
normal 38: 36.0 kb 60.0 kb 48.0 kb 12.0 kb 5 not all freed!
normal 39: 28.0 kb 56.0 kb 28.0 kb 14.0 kb 4 not all freed!
normal 40: 64.0 kb 96.0 kb 48.0 kb 16.0 kb 6 not all freed!
normal 41: 20.0 kb 40.0 kb 20.0 kb 20.0 kb 2 not all freed!
normal 42: 24.0 kb 24.0 kb 24.0 kb 24.0 kb 1 ok
normal 44: 64.0 kb 64.0 kb 32.0 kb 32.0 kb 2 not all freed!
normal 45: 640.0 kb 960.0 kb 960.0 kb 40.0 kb 24 ok
normal 48: 256.0 kb 256.0 kb 64.0 kb 64.0 kb 4 not all freed!
normal 50: 96.0 kb 96.0 kb 0 b 96.0 kb 1 not all freed!
normal 52: 12.6 mb 101.0 mb 100.9 mb 128.0 kb 808 not all freed!
normal 53: 160.0 kb 160.0 kb 160.0 kb 160.0 kb 1 ok
normal 54: 192.0 kb 192.0 kb 192.0 kb 192.0 kb 1 ok
normal 55: 224.0 kb 224.0 kb 224.0 kb 224.0 kb 1 ok
normal 56: 16.0 mb 32.0 mb 31.8 mb 256.0 kb 128 not all freed!
normal 57: 960.0 kb 960.0 kb 960.0 kb 320.0 kb 3 ok
normal 58: 2.6 mb 2.6 mb 2.6 mb 384.0 kb 7 ok
normal 59: 7.9 mb 7.9 mb 7.9 mb 448.0 kb 18 ok
normal 60: 17.0 mb 19.5 mb 19.0 mb 512.0 kb 39 not all freed!
normal 61: 20.0 mb 20.6 mb 20.6 mb 640.0 kb 33 ok
normal 62: 12.0 mb 12.0 mb 12.0 mb 768.0 kb 16 ok
normal 63: 14.0 mb 14.0 mb 14.0 mb 896.0 kb 16 ok
normal 64: 482.0 mb 482.0 mb 0 b 1.0 mb 482 not all freed!
heap stats: peak total freed unit count
normal: 593.6 mb 709.2 mb 224.4 mb 1 b not all freed!
huge: 182.0 mb 1.1 gb 1.6 gb 1 b ok
total: 775.6 mb 1.8 gb 1.8 gb 1 b not all freed!
malloc requested: 1.8 gb
committed: 243.6 mb 1.6 gb 1.6 gb 1 b not all freed!
reserved: 460.2 mb 1.8 gb 1.6 gb 1 b not all freed!
reset: 0 0 0
segments: 168 1.1 k 1010
-abandoned: 0 0 0
pages: 520 1.4 k 1010
-abandoned: 0 0 0
-extended: 1.8 k
mmaps: 2.2 k
mmap fast: 48
mmap slow: 1.0 k
threads: 15
searches: 0.3 avg
elapsed: 7.402 s
process: user: 1.072 s, system: 2.920 s, faults: 0, reclaims: 621804, rss: 827.8 mb
And in the end:
mimalloc
SELECT count(*)
FROM test.hits
WHERE NOT ignore(URL)
SETTINGS max_threads = 1
┌─count()─┐
│ 8873898 │
└─────────┘
1 rows in set. Elapsed: 0.693 sec. Processed 8.87 million rows, 762.68 MB (12.80 million rows/s., 1.10 GB/s.)
jemalloc
SELECT count(*)
FROM test.hits
WHERE NOT ignore(URL)
SETTINGS max_threads = 1
┌─count()─┐
│ 8873898 │
└─────────┘
1 rows in set. Elapsed: 0.388 sec. Processed 8.87 million rows, 762.68 MB (22.84 million rows/s., 1.96 GB/s.)
I put mi_stats_print(nullptr) in dbms/src/Compression/LZ4_decompress_faster.cpp (it is commented out in the branch) to get the stats of the execution thread (the logs will be big, though, because we decompress in chunks). Maybe we should add a function that can print the stats of all threads (am I correct that it prints only the current thread's stats?).
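For anyone reproducing this, a minimal sketch of pulling stats around a region of interest, assuming the public mi_stats_reset/mi_stats_print API from mimalloc.h (passing nullptr selects the default output stream):

```cpp
#include <mimalloc.h>

// Sketch: reset the counters, run the workload, then dump the statistics.
// Note the caveat above: the printed stats may cover only the calling
// thread's heap, so place the call on the thread doing the work.
void run_and_print_stats() {
    mi_stats_reset();          // start counting from zero
    // ... the workload to profile, e.g. one decompression pass ...
    mi_stats_print(nullptr);   // nullptr = default output (stderr)
}
```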
Thanks Danila for your benchmarking! I definitely want to fix this issue on the ClickHouse benchmark and I hope we can work together to figure out what happens. 2x is too much on a real-world application!
Did you read the technical report? We describe that over all our (intense) 12 benchmarks, and all SpecMark benchmarks, we perform very well -- except suddenly for the GCC benchmark. That kind of shows that for every allocator there can be workloads where it suddenly does not do well; for example, jemalloc is 3x slower on Larson, or 19x on cache-scratch, etc. In the GCC case, it turned out to be the allocation of many long-lived full pages, and we fixed that.
I am hoping that we can find something similar for the ClickHouse bench -- especially since this seems to be a real-world benchmark? Feel free to write me an email so we can figure out in detail what is going on there. Perhaps you can build the DEBUG version and run with MIMALLOC_STATS=1 to gain more insight. Also, can you show what you are testing exactly -- maybe we can include it in mimalloc-bench? (I wonder if it is indeed a re-use case where you happen to allocate only large pages -- in that case we can fix it by just increasing the constants in mimalloc-types.h.)
With regard to the "mmap" calls above, this is due to the 4MiB alignment -- with good reasons, as discussed in the tech report (jemalloc does the same for arena allocations, and FreeBSD provides aligned mmap). The benchmarks you show above are not very illustrative, as they don't read or write the memory, which is not what regular applications do -- and allocators amortize such costs (i.e. efficient small allocation vs. mmap for large ones). See for example the alloc-test in mimalloc-bench for a more realistic test of allocation (described here).
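To make the alignment cost concrete, here is a rough sketch (an illustration, not mimalloc's or jemalloc's actual code) of the standard over-allocate-and-trim technique for getting an aligned region from mmap; the head/tail munmap calls are exactly the trimming pattern visible in the straces above:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Illustrative only: obtain a `size`-byte region aligned to `alignment`
// (a power of two) by over-allocating and unmapping the excess. The extra
// munmap syscalls are the price of alignment over plain mmap.
void* mmap_aligned(std::size_t size, std::size_t alignment) {
    std::size_t over = size + alignment;  // enough slack to guarantee alignment
    auto* p = static_cast<std::uint8_t*>(
        mmap(nullptr, over, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (p == MAP_FAILED) return nullptr;
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    auto aligned = (addr + alignment - 1) & ~(alignment - 1);
    std::size_t head = aligned - addr;       // unaligned prefix to trim
    std::size_t tail = over - head - size;   // leftover suffix to trim
    if (head > 0) munmap(p, head);
    if (tail > 0) munmap(p + head + size, tail);
    return reinterpret_cast<void*>(aligned);
}
```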
Ah, I wasn't able to build the ClickHouse benchmark yet :-( However, I pushed new changes to the dev branch, where huge page segments are now part of the segment cache. This should make a big difference, I think -- can you give it a try on the larger benchmark?
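To sketch the concept for readers (an illustration with made-up names, not the actual mimalloc implementation): a segment cache keeps a bounded pool of freed large segments mapped, so the next allocation of the same size can skip mmap entirely.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative segment cache: freed large segments are kept mapped in a
// small pool and handed out again before falling back to mmap.
struct Segment { void* ptr; std::size_t size; };

static std::mutex g_cache_mutex;
static std::vector<Segment> g_cache;          // freed segments, still mapped
constexpr std::size_t kMaxCached = 16;        // bound the retained memory

void* segment_alloc(std::size_t size) {
    {
        std::lock_guard<std::mutex> lock(g_cache_mutex);
        for (std::size_t i = 0; i < g_cache.size(); ++i) {
            if (g_cache[i].size == size) {    // fast path: reuse, no syscall
                void* p = g_cache[i].ptr;
                g_cache.erase(g_cache.begin() + i);
                return p;
            }
        }
    }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,   // slow path
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

void segment_free(void* ptr, std::size_t size) {
    std::lock_guard<std::mutex> lock(g_cache_mutex);
    if (g_cache.size() < kMaxCached) {
        g_cache.push_back({ptr, size});       // keep mapped for reuse
    } else {
        munmap(ptr, size);                    // cache full: really release
    }
}
```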
Seems much better. I will test it with a slower but more reliable perf test and show you the results.
Good to hear -- let me know how it goes and I'll push the fix to the main branch. It is quite interesting to find this case, as it does not occur across the current range of benchmarks. As such, I want to include an analysis in the tech report (much like the "full pages" case for the specmark gcc benchmark). One hope for mimalloc is that by having a small codebase, we can more easily identify these kinds of cases.
Handling allocations bigger than 512KB is critical for many server applications; it is not uncommon to see arenas of several MBs being allocated. It would be great if this allocator could scale better for huge allocations.
@junhuaw: mimalloc handles large allocations, of course, and the new cache should improve the performance. However, the ClickHouse benchmark is a bit special (in the sense that we haven't encountered such a program across our wide range of benchmarks yet) in that it does large allocations without doing much with the data in them... and then frees them. It turns out that the mmap system call is so expensive that it starts to dominate in those cases. The new cache avoids calling mmap too often and fixes that (at least, we are still waiting for the new benchmark results).
All in all very interesting -- it shows that even after testing on a wide range of benchmarks and programs you can still encounter situations that need special strategies -- there is no silver bullet. If you read the tech report, you can see that we had this before with the SpecMark gcc benchmark.
For now, I have some benchmark results, and I can say that mimalloc (from the dev branch) is now OK, but not better than jemalloc for some of our purposes; the 2x slowdown disappears.
I tested on a big variety of queries. The average loss against jemalloc is 3-5% in rather simple cases, and mimalloc is sometimes even better than jemalloc in more complicated ones. I can't show all the results because mimalloc crashed after a long period of work (investigating; maybe it is because of our intensive allocation, but that will be a completely separate issue). A good result from the start!
Some queries, allocator/time (each test runs many times and we take the min time):
- https://pastebin.com/5JBpjpdp -- string search, less is better
- https://pastebin.com/QRyap9cK -- string processing, less is better
- https://pastebin.com/U3tPeZnC -- join queries, more is better
I believe we can close this issue as soon as you merge the dev branch into master.
See also rust-lang/rust#62073 for a benchmark of mimalloc in another workload. It didn't crash, but made almost no difference. That's probably without the large-allocation caching.
@danlark1: Thanks for re-running the benchmarks! Great to see that we perform as well as jemalloc now.
For the future, I have ideas for how the main techniques mimalloc uses to speed up small allocations can be reused for larger allocations too, so stay tuned for mimalloc2 :-) One thing I learned here, which I didn't expect, is how expensive an mmap call really is -- I was assuming that for large allocations (of size N) the access to the memory (as N reads/writes) would dominate the mmap call, but that is clearly not the case for allocations between, say, 512KiB and 10MiB. The cache fixes this for now, but I think there are opportunities to do better in the future.
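A rough way to see this effect directly (a sketch; absolute numbers depend entirely on the machine and kernel): touch a fresh 1 MiB mmap per iteration versus reusing one long-lived mapping, doing the same writes in both loops.

```cpp
#include <sys/mman.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Sketch: the first loop pays mmap/munmap plus first-touch page faults on
// every iteration; the second does the same writes against one reused
// mapping. Error checks omitted for brevity.
int main() {
    constexpr std::size_t kSize = 1 << 20;  // 1 MiB
    constexpr int kIters = 10000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        std::memset(p, 1, kSize);           // touch every page
        munmap(p, kSize);
    }
    auto t1 = std::chrono::steady_clock::now();

    void* cached = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < kIters; ++i) {
        std::memset(cached, 1, kSize);      // same work, no syscalls/faults
    }
    auto t2 = std::chrono::steady_clock::now();
    munmap(cached, kSize);

    using ms = std::chrono::milliseconds;
    std::printf("fresh mmap per iteration: %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("reused mapping:           %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    return 0;
}
```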
Thanks again for helping to improve mimalloc :-)
@inicola, thanks for testing. Note that on many workloads most modern allocators will perform very closely -- a good thing, especially of course if the load is not dominated by allocation. So, similar results are very common.
As said in the readme, there will always be programs where some allocator suddenly does not do so well, and the main goal of a good allocator is to guard against such "edge" cases.
In the end, though, there is never an optimal strategy in general, and all allocators need to make assumptions about typical program behaviour and optimize for that. That is why one can usually construct artificial benchmarks where allocators trip up (like cache-scratch). In the end, real-world behavior is what really matters -- which is why I am quite happy we could fix the initial perf problem observed by @danlark1.
Closing the issue, as for now mimalloc does not have such problems.
Btw, we used mimalloc in secure mode in ClickHouse for internal caches and are happy with it.
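For context on that setup, a minimal sketch of what using mimalloc for a dedicated internal cache (rather than as the global allocator) might look like; hypothetical code, not ClickHouse's actual implementation, and note that secure mode is a build-time option (the MI_SECURE cmake setting), not an API call:

```cpp
#include <mimalloc.h>
#include <cstddef>
#include <cstring>

// Hypothetical cache entry allocated through mimalloc's explicit mi_malloc/
// mi_free API, leaving the process-wide malloc untouched. Build mimalloc
// with MI_SECURE enabled to get guard pages and encoded free lists.
struct CacheEntry {
    char* data;
    std::size_t size;
};

CacheEntry cache_store(const char* bytes, std::size_t n) {
    auto* p = static_cast<char*>(mi_malloc(n));  // mimalloc, not global malloc
    std::memcpy(p, bytes, n);
    return CacheEntry{p, n};
}

void cache_evict(CacheEntry& e) {
    mi_free(e.data);
    e.data = nullptr;
}
```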
That is good to hear -- glad you found this issue and that there was an easy fix!
If possible, could you amend your comment on Hacker News? It is the highest comment now and may give the wrong impression :-)
Also, I am working on applying the techniques of mimalloc to large allocations too, where we may get further improvements beyond avoiding mmap calls. Stay tuned :-)
Ah, I can't edit (or delete) a comment from 5 days ago because of HN restrictions, but I added a reply that anyone can see.