Comments (17)
The way I see it (and perf record confirms), the slowdown is exactly what I described in the original issue post: the allocations are made in TCPHandler in ClickHouse and the memory is not reused (the case of repeating the same allocation in a std::unique_ptr).
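For illustration, a minimal sketch of that allocation pattern (hypothetical code, not the actual TCPHandler source): each chunk of work allocates a fresh large buffer and drops it, so the allocator sees a steady stream of large alloc/free pairs with no reuse.

```cpp
#include <cstddef>
#include <memory>

// Hypothetical sketch of the pattern described above: a fresh ~1 MiB buffer
// is allocated and freed per chunk, so the allocator must either cache such
// blocks internally or pay for mmap/munmap on every iteration.
void process_chunks(std::size_t chunk_count) {
    constexpr std::size_t kBufSize = 1 << 20;  // ~1 MiB, well above 512 KB
    for (std::size_t i = 0; i < chunk_count; ++i) {
        auto buffer = std::make_unique<char[]>(kBufSize);
        buffer[0] = 1;  // stand-in for decompressing/processing into the buffer
        // buffer is destroyed here; the memory goes back to the allocator
    }
}
```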
jemalloc does a great job without issuing a huge number of syscalls in this case (the SQL query above):
$ sudo strace -fe mmap,munmap -p 360474
strace: Process 360474 attached with 58 threads
[pid 360527] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff0675f7000
[pid 360527] munmap(0x7ff0675f7000, 2097152) = 0
[pid 360527] mmap(NULL, 4190208, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff0673f8000
[pid 360527] munmap(0x7ff0673f8000, 32768) = 0
[pid 360527] munmap(0x7ff067600000, 2060288) = 0
[pid 360527] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff067200000
[pid 360523] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff067000000
[pid 360523] mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066e00000
[pid 360523] mmap(NULL, 2621440, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066b80000
[pid 360523] mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066880000
[pid 360513] mmap(NULL, 2621440, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066600000
[pid 360513] mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff066300000
[pid 360513] mmap(NULL, 3670016, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff065f80000
[pid 360513] mmap(NULL, 5242880, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7ff065a80000
And with mimalloc I have:
$ sudo strace -fe mmap,munmap -p 376681 2>out
$ wc -l out
6576 out
The sample is exactly what I wrote about -- many >512 KB allocations with many munmaps (an almost identical sample of nearly a thousand lines: https://pastebin.com/xcfcWV1e):
[pid 376728] munmap(0x7feb88800000, 635576) = 0
[pid 376728] munmap(0x7feb88c00000, 1048784) = 0
[pid 376728] mmap(NULL, 1048784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febf885c000
[pid 376728] munmap(0x7febf885c000, 1048784) = 0
[pid 376728] mmap(NULL, 5243088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb88aff000
[pid 376728] munmap(0x7feb88aff000, 1052672) = 0
[pid 376728] munmap(0x7feb88d01000, 3137744) = 0
[pid 376728] munmap(0x7feb88c00000, 1048784) = 0
[pid 376728] mmap(NULL, 1048784, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febf885c000
[pid 376728] munmap(0x7febf885c000, 1048784) = 0
[pid 376728] mmap(NULL, 5243088, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7feb88aff000
[pid 376728] munmap(0x7feb88aff000, 1052672) = 0
[pid 376728] munmap(0x7feb88d01000, 3137744) = 0
[pid 376728] mmap(NULL, 660344, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7febf88bb000
[pid 376728] munmap(0x7febf88bb000, 660344) = 0
Thanks Danila -- super helpful. I am traveling, but will try this soon next week. I can already see it is due to many "huge" (>1MB) allocations being allocated and freed (using expensive mmap's). This is not quite the use case for mimalloc (being built for many short-lived small allocations :-) ) -- but I have ideas on how to fix this: there is already code to do pooled huge page allocations, and I'll experiment with that.
Sorry for necroposting, but I wanted to clarify this message -- does
"mimalloc (being built for many short lived small allocations :-) )"
mean that mimalloc is built for a lot of short-lived small allocations, or the opposite -- that it is built for big allocations that live long?
Thank you for your answer; I hope we can collaborate a lot. My email, if you have any questions: [email protected]. And I definitely need to read everything about mimalloc before drawing any conclusions :)
Reproduction instructions for the ClickHouse performance issue (overall, it can be a bit hard to get all the settings right from scratch):
- You can use these instructions (trust me, they are simple) to build and run ClickHouse: https://clickhouse.yandex/docs/en/development/build/, https://clickhouse.yandex/docs/en/development/tests/
- Then you should download one of our anonymized datasets: https://clickhouse.yandex/docs/en/getting_started/example_datasets/metrica/
- I made a branch especially for this case, called mimalloc.
- Then you should comment out the function in contrib/ssl/crypto/compat/reallocarray.c because it is ambiguous (I will investigate this issue separately). The build uses mimalloc by default; to turn it off (which turns jemalloc back on), pass -D ENABLE_MIMALLOC=0 to cmake and uncomment the function in ssl. The debug build type is -D CMAKE_BUILD_TYPE=Debug.
- Then I changed the code locally to turn on statistics in mimalloc-types.h.
The query I executed was even a single-threaded query:
SELECT count(*)
FROM danlark_table
WHERE NOT ignore(URL)
SETTINGS max_threads = 1
From a big dataset that can show some more information (though I can't share the source). This is not quite the end of the execution -- it is just before the end, I believe -- but during execution I already see a huge slowdown:
heap stats: peak total freed unit count
normal 1: 98.2 kb 338.7 kb 337.1 kb 8 b 43.4 k not all freed!
normal 2: 29.6 kb 133.3 kb 121.8 kb 16 b 8.5 k not all freed!
normal 4: 2.7 mb 3.3 mb 3.2 mb 32 b 107.8 k not all freed!
normal 6: 591.5 kb 3.4 mb 3.4 mb 48 b 75.2 k not all freed!
normal 8: 873.9 kb 6.7 mb 6.5 mb 64 b 109.1 k not all freed!
normal 9: 3.7 mb 9.1 mb 9.0 mb 80 b 119.1 k not all freed!
normal 10: 8.4 kb 165.5 kb 159.1 kb 96 b 1.8 k not all freed!
normal 11: 3.1 mb 4.3 mb 4.3 mb 112 b 40.3 k not all freed!
normal 12: 195.4 kb 365.1 kb 172.6 kb 128 b 2.9 k not all freed!
normal 13: 4.5 kb 64.2 kb 60.0 kb 160 b 411 not all freed!
normal 14: 425.2 kb 1.6 mb 1.3 mb 192 b 8.9 k not all freed!
normal 15: 111.1 kb 125.1 kb 38.5 kb 224 b 572 not all freed!
normal 16: 16.8 kb 90.2 kb 80.5 kb 256 b 361 not all freed!
normal 17: 3.1 kb 35.6 kb 33.8 kb 320 b 114 not all freed!
normal 18: 30.0 kb 75.4 kb 52.9 kb 384 b 201 not all freed!
normal 19: 35.4 kb 49.0 kb 40.7 kb 448 b 112 not all freed!
normal 20: 10.0 kb 267.0 kb 260.5 kb 512 b 534 not all freed!
normal 21: 74.4 kb 87.5 kb 14.4 kb 640 b 140 not all freed!
normal 22: 16.5 kb 27.0 kb 26.2 kb 768 b 36 not all freed!
normal 23: 6.1 kb 37.6 kb 36.8 kb 896 b 43 not all freed!
normal 24: 8.0 kb 11.0 kb 7.0 kb 1.0 kb 11 not all freed!
normal 25: 25.0 kb 26.2 kb 1.2 kb 1.2 kb 21 not all freed!
normal 26: 13.5 kb 24.0 kb 24.0 kb 1.5 kb 16 ok
normal 27: 7.0 kb 71.8 kb 70.0 kb 1.8 kb 41 not all freed!
normal 28: 24.0 kb 28.0 kb 14.0 kb 2.0 kb 14 not all freed!
normal 29: 207.5 kb 492.5 kb 290.0 kb 2.5 kb 197 not all freed!
normal 30: 24.0 kb 33.0 kb 33.0 kb 3.0 kb 11 ok
normal 31: 7.0 kb 24.5 kb 21.0 kb 3.5 kb 7 not all freed!
normal 32: 37.6 mb 37.6 mb 37.6 mb 4.0 kb 9.6 k not all freed!
normal 33: 20.0 kb 45.0 kb 45.0 kb 5.0 kb 9 ok
normal 34: 144.0 kb 204.0 kb 66.0 kb 6.0 kb 34 not all freed!
normal 36: 32.0 kb 32.0 kb 24.0 kb 8.0 kb 4 not all freed!
normal 37: 10.0 kb 10.0 kb 10.0 kb 10.0 kb 1 ok
normal 38: 36.0 kb 60.0 kb 48.0 kb 12.0 kb 5 not all freed!
normal 39: 28.0 kb 56.0 kb 28.0 kb 14.0 kb 4 not all freed!
normal 40: 48.0 kb 80.0 kb 48.0 kb 16.0 kb 5 not all freed!
normal 41: 20.0 kb 40.0 kb 20.0 kb 20.0 kb 2 not all freed!
normal 42: 24.0 kb 24.0 kb 24.0 kb 24.0 kb 1 ok
normal 44: 64.0 kb 64.0 kb 32.0 kb 32.0 kb 2 not all freed!
normal 45: 680.0 kb 920.0 kb 920.0 kb 40.0 kb 23 ok
normal 48: 256.0 kb 256.0 kb 64.0 kb 64.0 kb 4 not all freed!
normal 50: 96.0 kb 96.0 kb 0 b 96.0 kb 1 not all freed!
normal 52: 87.0 mb 902.2 mb 902.2 mb 128.0 kb 7.2 k ok
normal 54: 192.0 kb 192.0 kb 192.0 kb 192.0 kb 1 ok
normal 55: 896.0 kb 896.0 kb 896.0 kb 224.0 kb 4 ok
normal 56: 90.0 mb 605.8 mb 605.2 mb 256.0 kb 2.4 k not all freed!
normal 57: 10.9 mb 10.9 mb 10.6 mb 320.0 kb 35 not all freed!
normal 58: 18.4 mb 18.4 mb 18.4 mb 384.0 kb 49 ok
normal 59: 17.1 mb 17.1 mb 17.1 mb 448.0 kb 39 ok
normal 60: 23.0 mb 23.0 mb 22.5 mb 512.0 kb 46 not all freed!
normal 61: 30.6 mb 30.6 mb 30.6 mb 640.0 kb 49 ok
normal 62: 19.5 mb 19.5 mb 19.5 mb 768.0 kb 26 ok
normal 63: 7.0 mb 7.0 mb 7.0 mb 896.0 kb 8 ok
normal 64: 4.6 gb 4.6 gb 0 b 1.0 mb 4.7 k not all freed!
heap stats: peak total freed unit count
normal: 5.0 gb 6.3 gb 1.7 gb 1 b not all freed!
huge: 737.0 mb 13.2 gb 17.9 gb 1 b ok
total: 5.7 gb 19.5 gb 19.5 gb 1 b not all freed!
malloc requested: 19.5 gb
committed: 1.1 gb 17.9 gb 17.9 gb 1 b not all freed!
reserved: 1.3 gb 18.1 gb 17.9 gb 1 b not all freed!
reset: 0 0 0
segments: 591 9.5 k 9.4 k
-abandoned: 0 0 0
pages: 939 9.8 k 9.4 k
-abandoned: 0 0 0
-extended: 10.4 k
mmaps: 19.5 k
mmap fast: 50
mmap slow: 9.7 k
threads: 15
searches: 0.1 avg
elapsed: 35.540 s
From test.hits (the dataset you will download):
heap stats: peak total freed unit count
normal 1: 12.8 kb 69.4 kb 67.8 kb 8 b 8.9 k not all freed!
normal 2: 18.8 kb 87.9 kb 75.5 kb 16 b 5.6 k not all freed!
normal 4: 342.2 kb 908.5 kb 854.3 kb 32 b 29.1 k not all freed!
normal 6: 188.1 kb 2.3 mb 2.2 mb 48 b 50.8 k not all freed!
normal 8: 335.5 kb 1.8 mb 1.7 mb 64 b 30.2 k not all freed!
normal 9: 420.2 kb 1.1 mb 1.1 mb 80 b 15.0 k not all freed!
normal 10: 8.1 kb 169.7 kb 164.0 kb 96 b 1.8 k not all freed!
normal 11: 333.0 kb 560.5 kb 547.6 kb 112 b 5.1 k not all freed!
normal 12: 199.2 kb 314.5 kb 118.9 kb 128 b 2.5 k not all freed!
normal 13: 6.4 kb 70.9 kb 64.8 kb 160 b 454 not all freed!
normal 14: 429.6 kb 1.7 mb 1.4 mb 192 b 9.2 k not all freed!
normal 15: 111.3 kb 126.0 kb 40.2 kb 224 b 576 not all freed!
normal 16: 17.5 kb 92.2 kb 81.5 kb 256 b 369 not all freed!
normal 17: 3.4 kb 38.1 kb 36.2 kb 320 b 122 not all freed!
normal 18: 31.5 kb 78.8 kb 55.5 kb 384 b 210 not all freed!
normal 19: 35.9 kb 49.4 kb 40.7 kb 448 b 113 not all freed!
normal 20: 11.5 kb 40.5 kb 35.5 kb 512 b 81 not all freed!
normal 21: 76.9 kb 94.4 kb 20.6 kb 640 b 151 not all freed!
normal 22: 21.8 kb 36.8 kb 30.8 kb 768 b 49 not all freed!
normal 23: 7.0 kb 40.2 kb 38.5 kb 896 b 46 not all freed!
normal 24: 9.0 kb 13.0 kb 9.0 kb 1.0 kb 13 not all freed!
normal 25: 27.5 kb 31.2 kb 5.0 kb 1.2 kb 25 not all freed!
normal 26: 15.0 kb 25.5 kb 25.5 kb 1.5 kb 17 ok
normal 27: 7.0 kb 71.8 kb 70.0 kb 1.8 kb 41 not all freed!
normal 28: 24.0 kb 28.0 kb 14.0 kb 2.0 kb 14 not all freed!
normal 29: 212.5 kb 497.5 kb 290.0 kb 2.5 kb 199 not all freed!
normal 30: 30.0 kb 36.0 kb 36.0 kb 3.0 kb 12 ok
normal 31: 7.0 kb 24.5 kb 21.0 kb 3.5 kb 7 not all freed!
normal 32: 3.7 mb 3.7 mb 3.7 mb 4.0 kb 951 not all freed!
normal 33: 20.0 kb 45.0 kb 45.0 kb 5.0 kb 9 ok
normal 34: 150.0 kb 210.0 kb 66.0 kb 6.0 kb 35 not all freed!
normal 36: 48.0 kb 48.0 kb 24.0 kb 8.0 kb 6 not all freed!
normal 37: 10.0 kb 10.0 kb 10.0 kb 10.0 kb 1 ok
normal 38: 36.0 kb 60.0 kb 48.0 kb 12.0 kb 5 not all freed!
normal 39: 28.0 kb 56.0 kb 28.0 kb 14.0 kb 4 not all freed!
normal 40: 64.0 kb 96.0 kb 48.0 kb 16.0 kb 6 not all freed!
normal 41: 20.0 kb 40.0 kb 20.0 kb 20.0 kb 2 not all freed!
normal 42: 24.0 kb 24.0 kb 24.0 kb 24.0 kb 1 ok
normal 44: 64.0 kb 64.0 kb 32.0 kb 32.0 kb 2 not all freed!
normal 45: 640.0 kb 960.0 kb 960.0 kb 40.0 kb 24 ok
normal 48: 256.0 kb 256.0 kb 64.0 kb 64.0 kb 4 not all freed!
normal 50: 96.0 kb 96.0 kb 0 b 96.0 kb 1 not all freed!
normal 52: 12.6 mb 101.0 mb 100.9 mb 128.0 kb 808 not all freed!
normal 53: 160.0 kb 160.0 kb 160.0 kb 160.0 kb 1 ok
normal 54: 192.0 kb 192.0 kb 192.0 kb 192.0 kb 1 ok
normal 55: 224.0 kb 224.0 kb 224.0 kb 224.0 kb 1 ok
normal 56: 16.0 mb 32.0 mb 31.8 mb 256.0 kb 128 not all freed!
normal 57: 960.0 kb 960.0 kb 960.0 kb 320.0 kb 3 ok
normal 58: 2.6 mb 2.6 mb 2.6 mb 384.0 kb 7 ok
normal 59: 7.9 mb 7.9 mb 7.9 mb 448.0 kb 18 ok
normal 60: 17.0 mb 19.5 mb 19.0 mb 512.0 kb 39 not all freed!
normal 61: 20.0 mb 20.6 mb 20.6 mb 640.0 kb 33 ok
normal 62: 12.0 mb 12.0 mb 12.0 mb 768.0 kb 16 ok
normal 63: 14.0 mb 14.0 mb 14.0 mb 896.0 kb 16 ok
normal 64: 482.0 mb 482.0 mb 0 b 1.0 mb 482 not all freed!
heap stats: peak total freed unit count
normal: 593.6 mb 709.2 mb 224.4 mb 1 b not all freed!
huge: 182.0 mb 1.1 gb 1.6 gb 1 b ok
total: 775.6 mb 1.8 gb 1.8 gb 1 b not all freed!
malloc requested: 1.8 gb
committed: 243.6 mb 1.6 gb 1.6 gb 1 b not all freed!
reserved: 460.2 mb 1.8 gb 1.6 gb 1 b not all freed!
reset: 0 0 0
segments: 168 1.1 k 1010
-abandoned: 0 0 0
pages: 520 1.4 k 1010
-abandoned: 0 0 0
-extended: 1.8 k
mmaps: 2.2 k
mmap fast: 48
mmap slow: 1.0 k
threads: 15
searches: 0.3 avg
elapsed: 7.402 s
process: user: 1.072 s, system: 2.920 s, faults: 0, reclaims: 621804, rss: 827.8 mb
And in the end:
mimalloc
SELECT count(*)
FROM test.hits
WHERE NOT ignore(URL)
SETTINGS max_threads = 1
┌─count()─┐
│ 8873898 │
└─────────┘
1 rows in set. Elapsed: 0.693 sec. Processed 8.87 million rows, 762.68 MB (12.80 million rows/s., 1.10 GB/s.)
jemalloc
SELECT count(*)
FROM test.hits
WHERE NOT ignore(URL)
SETTINGS max_threads = 1
┌─count()─┐
│ 8873898 │
└─────────┘
1 rows in set. Elapsed: 0.388 sec. Processed 8.87 million rows, 762.68 MB (22.84 million rows/s., 1.96 GB/s.)
I put mi_stats_print(nullptr) in dbms/src/Compression/LZ4_decompress_faster.cpp (it is commented out in the branch) to get the stats of the execution thread (the logs will be big, though, because we decompress in chunks). Maybe we should add a function that can print the stats of all threads (am I correct that it prints only the current thread's stats?).
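For anyone reproducing this, a minimal sketch of pulling stats around a region of interest, assuming the public mi_stats_reset/mi_stats_print API from mimalloc.h (passing nullptr selects the default output stream):

```cpp
#include <mimalloc.h>

// Sketch: reset the counters, run the workload, then dump the statistics.
// Note the caveat above: the printed stats may cover only the calling
// thread's heap, so place the call on the thread doing the work.
void run_and_print_stats() {
    mi_stats_reset();          // start counting from zero
    // ... the workload to profile, e.g. one decompression pass ...
    mi_stats_print(nullptr);   // nullptr = default output (stderr)
}
```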
Thanks Danila for your benchmarking! I definitely want to fix this issue on the ClickHouse benchmark and I hope we can work together to figure out what happens. 2x is too much on a real-world application!
Did you read the technical report? We describe that over all our (intense) 12 benchmarks, and all SpecMark benchmarks, we perform very well -- except suddenly for the GCC benchmark. That kind of shows that for every allocator there can be workloads where it suddenly does not do well; for example, jemalloc is 3x slower on Larson, or 19x on cache-scratch, etc. In the GCC case, it turned out to be the allocation of many long-lived full pages, and we fixed that.
I am hoping that we can find something similar for the ClickHouse bench -- especially since this seems to be a real-world benchmark? Feel free to write me an email so we can figure out in detail what is going on there. Perhaps you can build the DEBUG version and run with MIMALLOC_STATS=1 to gain more insight. Also, can you show what you are testing exactly -- maybe we can include it in mimalloc-bench? (I wonder if it is indeed a re-use case where you happen to allocate only large pages -- in that case we can fix it by just increasing the constants in mimalloc-types.h.)
With regard to the "mmap" calls above, this is due to the 4MiB alignment -- with good reasons, as discussed in the tech report (jemalloc does the same for arena allocations, and FreeBSD provides aligned mmap). The benchmarks you show above are not very illustrative, as they don't read or write the memory, which is not what regular applications do -- and allocators amortize such costs (i.e. efficient small allocation vs. mmap for large ones). See for example the alloc-test in mimalloc-bench for a more realistic test of allocation (described here).
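To make the alignment cost concrete, here is a rough sketch (an illustration, not mimalloc's or jemalloc's actual code) of the standard over-allocate-and-trim technique for getting an aligned region from mmap; the head/tail munmap calls are exactly the trimming pattern visible in the straces above:

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Illustrative only: obtain a `size`-byte region aligned to `alignment`
// (a power of two) by over-allocating and unmapping the excess. The extra
// munmap syscalls are the price of alignment over plain mmap.
void* mmap_aligned(std::size_t size, std::size_t alignment) {
    std::size_t over = size + alignment;  // enough slack to guarantee alignment
    auto* p = static_cast<std::uint8_t*>(
        mmap(nullptr, over, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    if (p == MAP_FAILED) return nullptr;
    auto addr = reinterpret_cast<std::uintptr_t>(p);
    auto aligned = (addr + alignment - 1) & ~(alignment - 1);
    std::size_t head = aligned - addr;       // unaligned prefix to trim
    std::size_t tail = over - head - size;   // leftover suffix to trim
    if (head > 0) munmap(p, head);
    if (tail > 0) munmap(p + head + size, tail);
    return reinterpret_cast<void*>(aligned);
}
```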
Ah, I wasn't able to build the ClickHouse benchmark yet :-( However, I pushed new changes to the dev branch, where huge page segments are now part of the segment cache. This should make a big difference, I think -- can you give it a try on the larger benchmark?
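To sketch the concept for readers (an illustration with made-up names, not the actual mimalloc implementation): a segment cache keeps a bounded pool of freed large segments mapped, so the next allocation of the same size can skip mmap entirely.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <mutex>
#include <vector>

// Illustrative segment cache: freed large segments are kept mapped in a
// small pool and handed out again before falling back to mmap.
struct Segment { void* ptr; std::size_t size; };

static std::mutex g_cache_mutex;
static std::vector<Segment> g_cache;          // freed segments, still mapped
constexpr std::size_t kMaxCached = 16;        // bound the retained memory

void* segment_alloc(std::size_t size) {
    {
        std::lock_guard<std::mutex> lock(g_cache_mutex);
        for (std::size_t i = 0; i < g_cache.size(); ++i) {
            if (g_cache[i].size == size) {    // fast path: reuse, no syscall
                void* p = g_cache[i].ptr;
                g_cache.erase(g_cache.begin() + i);
                return p;
            }
        }
    }
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,   // slow path
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

void segment_free(void* ptr, std::size_t size) {
    std::lock_guard<std::mutex> lock(g_cache_mutex);
    if (g_cache.size() < kMaxCached) {
        g_cache.push_back({ptr, size});       // keep mapped for reuse
    } else {
        munmap(ptr, size);                    // cache full: really release
    }
}
```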
Seems much better. I will test it with a slower but more reliable perf test and show you the results.
Good to hear -- let me know how it goes and I'll push the fix to the main branch. It is quite interesting to find this case, as it does not occur across the current range of benchmarks. As such, I want to include an analysis in the tech report (much like the "full pages" case for the specmark gcc benchmark). One hope for mimalloc is that by having a small codebase, we can more easily identify these kinds of cases.
Handling allocations bigger than 512KB is critical for many server applications; it is not uncommon to see arenas of several MBs being allocated. It would be great if this allocator could scale better for huge allocations.
@junhuaw: mimalloc handles large allocations, of course, and the new cache should improve the performance. However, the ClickHouse benchmark is a bit special (in the sense that we haven't encountered such a program across our wide range of benchmarks yet) in that it does large allocations without doing much with the data in them... and then frees them. It turns out that the mmap system call is so expensive that it starts to dominate in those cases. The new cache avoids calling mmap too often and fixes that (at least, we are still waiting for the new benchmark results).
All in all very interesting -- it shows that even after testing on a wide range of benchmarks and programs you can still encounter situations that need special strategies -- there is no silver bullet. If you read the tech report, you can see that we had this before with the SpecMark gcc benchmark.
For now, I have some benchmark results, and I can say that mimalloc (from the dev branch) is now OK, but not better than jemalloc for some of our purposes; the 2x slowdown disappears.
I tested on a big variety of queries. The average loss against jemalloc is 3-5% in rather simple cases, and mimalloc is sometimes even better than jemalloc in more complicated ones. I can't show all the results because mimalloc crashed after a long period of work (investigating; maybe it is because of our intensive allocation, but that will be a completely separate issue). A good result from the start!
Some queries, allocator/time (each test runs many times and we take the min time):
- https://pastebin.com/5JBpjpdp -- string search, less is better
- https://pastebin.com/QRyap9cK -- string processing, less is better
- https://pastebin.com/U3tPeZnC -- join queries, more is better
I believe we can close this issue as soon as you merge the dev branch into master.
See also rust-lang/rust#62073 for a benchmark of mimalloc in another workload. It didn't crash, but made almost no difference. That's probably without the large-allocation caching.
@danlark1: Thanks for re-running the benchmarks! Great to see that we perform as well as jemalloc now.
For the future, I have ideas for how the main techniques mimalloc uses to speed up small allocations can be reused for larger allocations too, so stay tuned for mimalloc2 :-) One thing I learned here, which I didn't expect, is how expensive an mmap call really is -- I was assuming that for large allocations (of size N) the access to the memory (as N reads/writes) would dominate the mmap call, but that is clearly not the case for allocations between, say, 512KiB and 10MiB. The cache fixes this for now, but I think there are opportunities to do better in the future.
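A rough way to see this effect directly (a sketch; absolute numbers depend entirely on the machine and kernel): touch a fresh 1 MiB mmap per iteration versus reusing one long-lived mapping, doing the same writes in both loops.

```cpp
#include <sys/mman.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstring>

// Sketch: the first loop pays mmap/munmap plus first-touch page faults on
// every iteration; the second does the same writes against one reused
// mapping. Error checks omitted for brevity.
int main() {
    constexpr std::size_t kSize = 1 << 20;  // 1 MiB
    constexpr int kIters = 10000;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        std::memset(p, 1, kSize);           // touch every page
        munmap(p, kSize);
    }
    auto t1 = std::chrono::steady_clock::now();

    void* cached = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < kIters; ++i) {
        std::memset(cached, 1, kSize);      // same work, no syscalls/faults
    }
    auto t2 = std::chrono::steady_clock::now();
    munmap(cached, kSize);

    using ms = std::chrono::milliseconds;
    std::printf("fresh mmap per iteration: %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count());
    std::printf("reused mapping:           %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
    return 0;
}
```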
Thanks again for helping to improve mimalloc :-)
@inicola, thanks for testing. Note that on many workloads most modern allocators will perform very closely -- a good thing, especially of course if the load is not dominated by allocation. So, similar results are very common.
As said in the readme, there will always be programs where some allocator suddenly does not do so well, and the main goal of a good allocator is to guard against such "edge" cases.
In the end, though, there is never an optimal strategy in general, and all allocators need to make assumptions about typical program behaviour and optimize for that. That is why one can usually construct artificial benchmarks where allocators trip up (like cache-scratch). In the end, real-world behavior is what really matters -- which is why I am quite happy we could fix the initial perf problem observed by @danlark1.
Closing the issue, as for now mimalloc does not have such problems.
Btw, we used mimalloc in secure mode in ClickHouse for internal caches and are happy with it.
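For context on that setup, a minimal sketch of what using mimalloc for a dedicated internal cache (rather than as the global allocator) might look like; hypothetical code, not ClickHouse's actual implementation, and note that secure mode is a build-time option (the MI_SECURE cmake setting), not an API call:

```cpp
#include <mimalloc.h>
#include <cstddef>
#include <cstring>

// Hypothetical cache entry allocated through mimalloc's explicit mi_malloc/
// mi_free API, leaving the process-wide malloc untouched. Build mimalloc
// with MI_SECURE enabled to get guard pages and encoded free lists.
struct CacheEntry {
    char* data;
    std::size_t size;
};

CacheEntry cache_store(const char* bytes, std::size_t n) {
    auto* p = static_cast<char*>(mi_malloc(n));  // mimalloc, not global malloc
    std::memcpy(p, bytes, n);
    return CacheEntry{p, n};
}

void cache_evict(CacheEntry& e) {
    mi_free(e.data);
    e.data = nullptr;
}
```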
That is good to hear -- glad you found this issue and that there was an easy fix!
If possible, could you amend your comment on Hacker News? It is the highest comment now and may give the wrong impression :-)
Also, I am working on applying the techniques of mimalloc to large allocations too, where we may get further improvements beyond avoiding mmap calls. Stay tuned :-)
Ah, I can't edit (or delete) a comment from 5 days ago because of HN restrictions, but I added a reply that anyone can see.