fastmemcpy's Introduction

Build SSE

with gcc:

gcc -O3 -msse2 FastMemcpy.c -o FastMemcpy

with msvc:

cl -nologo -arch:SSE2 -O2 FastMemcpy.c

Build AVX

with gcc:

gcc -O3 -mavx FastMemcpy_Avx.c -o FastMemcpy_Avx

with msvc:

cl -nologo -arch:AVX -O2 FastMemcpy_Avx.c

Features

  • About 50% faster on average than the standard memcpy of MSVC 2012 or GCC 4.9
  • Small copies are optimized with a jump table
  • Medium copies are optimized with SSE2 vector copies
  • Huge copies are optimized with cache prefetch and movntdq
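
The three tiers listed above could be wired together roughly as follows. This is a minimal sketch, not the code in FastMemcpy.c: the names sketch_copy and sketch_copy_tiny, the 2 MiB streaming threshold, and the 256-byte prefetch distance are assumptions made for illustration, and the two buffers are assumed not to overlap. _mm_stream_si128 is the intrinsic that compiles to movntdq.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; also provides _mm_prefetch and _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Assumed threshold for the streaming path; not taken from the project. */
#define SKETCH_STREAM_MIN ((size_t)2 * 1024 * 1024)

/* Tier 1: sizes below 16 bytes. A dense switch like this is typically
   compiled to a jump table; each case falls through to the next one. */
static void sketch_copy_tiny(unsigned char *d, const unsigned char *s, size_t n)
{
    assert(n < 16);
    switch (n) {
    case 15: d[14] = s[14]; /* fall through */
    case 14: d[13] = s[13]; /* fall through */
    case 13: d[12] = s[12]; /* fall through */
    case 12: d[11] = s[11]; /* fall through */
    case 11: d[10] = s[10]; /* fall through */
    case 10: d[9] = s[9];   /* fall through */
    case 9:  d[8] = s[8];   /* fall through */
    case 8:  d[7] = s[7];   /* fall through */
    case 7:  d[6] = s[6];   /* fall through */
    case 6:  d[5] = s[5];   /* fall through */
    case 5:  d[4] = s[4];   /* fall through */
    case 4:  d[3] = s[3];   /* fall through */
    case 3:  d[2] = s[2];   /* fall through */
    case 2:  d[1] = s[1];   /* fall through */
    case 1:  d[0] = s[0];   /* fall through */
    default: break;
    }
}

static void *sketch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    if (n >= SKETCH_STREAM_MIN) {
        /* Tier 3: copy a short head so that d is 16-byte aligned,
           which _mm_stream_si128 requires. */
        size_t head = (16 - ((uintptr_t)d & 15)) & 15;
        sketch_copy_tiny(d, s, head);
        d += head; s += head; n -= head;
        for (; n >= 64; n -= 64, d += 64, s += 64) {
            if (n > 256) /* only prefetch addresses inside the source */
                _mm_prefetch((const char *)(s + 256), _MM_HINT_NTA);
            __m128i a = _mm_loadu_si128((const __m128i *)(s + 0));
            __m128i b = _mm_loadu_si128((const __m128i *)(s + 16));
            __m128i c = _mm_loadu_si128((const __m128i *)(s + 32));
            __m128i e = _mm_loadu_si128((const __m128i *)(s + 48));
            _mm_stream_si128((__m128i *)(d + 0), a);  /* non-temporal store */
            _mm_stream_si128((__m128i *)(d + 16), b);
            _mm_stream_si128((__m128i *)(d + 32), c);
            _mm_stream_si128((__m128i *)(d + 48), e);
        }
        _mm_sfence(); /* make the streaming stores globally visible */
    }
    /* Tier 2: unaligned 16-byte vector copies. */
    for (; n >= 16; n -= 16, d += 16, s += 16)
        _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
    /* Fewer than 16 bytes remain. */
    sketch_copy_tiny(d, s, n);
    return dst;
}
```

A call such as sketch_copy(dst, src, len) returns dst, like memcpy. Whether the streaming tier pays off depends on the cache sizes of the machine, so the threshold would need to be tuned by measurement.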

Reference

Using Block Prefetch for Optimized Memory Performance

The article focuses only on aligned huge memory copies. You need to handle the other cases yourself.

Results

Results with GCC 4.9 (MSVC 2012 gives similar results):
 
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=81ms memcpy=281 ms
result(dst aligned, src unalign): memcpy_fast=88ms memcpy=254 ms
result(dst unalign, src aligned): memcpy_fast=87ms memcpy=245 ms
result(dst unalign, src unalign): memcpy_fast=81ms memcpy=258 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=91ms memcpy=364 ms
result(dst aligned, src unalign): memcpy_fast=95ms memcpy=336 ms
result(dst unalign, src aligned): memcpy_fast=96ms memcpy=353 ms
result(dst unalign, src unalign): memcpy_fast=99ms memcpy=346 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=124ms memcpy=242 ms
result(dst aligned, src unalign): memcpy_fast=166ms memcpy=555 ms
result(dst unalign, src aligned): memcpy_fast=168ms memcpy=602 ms
result(dst unalign, src unalign): memcpy_fast=174ms memcpy=614 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=119ms memcpy=171 ms
result(dst aligned, src unalign): memcpy_fast=182ms memcpy=442 ms
result(dst unalign, src aligned): memcpy_fast=163ms memcpy=466 ms
result(dst unalign, src unalign): memcpy_fast=168ms memcpy=472 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=68ms memcpy=82 ms
result(dst aligned, src unalign): memcpy_fast=94ms memcpy=226 ms
result(dst unalign, src aligned): memcpy_fast=134ms memcpy=216 ms
result(dst unalign, src unalign): memcpy_fast=84ms memcpy=188 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=55ms memcpy=70 ms
result(dst aligned, src unalign): memcpy_fast=75ms memcpy=192 ms
result(dst unalign, src aligned): memcpy_fast=79ms memcpy=223 ms
result(dst unalign, src unalign): memcpy_fast=91ms memcpy=219 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=181ms memcpy=165 ms
result(dst aligned, src unalign): memcpy_fast=192ms memcpy=303 ms
result(dst unalign, src aligned): memcpy_fast=218ms memcpy=310 ms
result(dst unalign, src unalign): memcpy_fast=183ms memcpy=307 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=263ms memcpy=398 ms
result(dst aligned, src unalign): memcpy_fast=269ms memcpy=433 ms
result(dst unalign, src aligned): memcpy_fast=306ms memcpy=497 ms
result(dst unalign, src unalign): memcpy_fast=285ms memcpy=417 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=287ms memcpy=421 ms
result(dst aligned, src unalign): memcpy_fast=288ms memcpy=430 ms
result(dst unalign, src aligned): memcpy_fast=285ms memcpy=510 ms
result(dst unalign, src unalign): memcpy_fast=291ms memcpy=440 ms

benchmark random access:
memcpy_fast=487ms memcpy=1000ms
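
Each benchmark row uses a different repetition count, so the raw millisecond figures are not directly comparable across sizes. One way to normalize them is to convert each row to throughput: size × times bytes divided by the elapsed time. The helper below is a hypothetical illustration and is not part of the project.

```c
#include <assert.h>

/* Hypothetical helper: converts one benchmark row to MiB per second. */
static double sketch_throughput_mib_s(double size_bytes, double times,
                                      double elapsed_ms)
{
    assert(elapsed_ms > 0.0);
    double total_mib = size_bytes * times / (1024.0 * 1024.0);
    return total_mib / (elapsed_ms / 1000.0);
}
```

For the first row above (size=32, times=16777216), the total is 512 MiB, so 81 ms corresponds to roughly 6321 MiB/s for memcpy_fast and 281 ms to roughly 1822 MiB/s for memcpy.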

About

skywind

http://www.skywind.me

fastmemcpy's People

Contributors

skywind3000

fastmemcpy's Issues

gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04) results

➜ FastMemcpy git:(master) ./FastMemcpy
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=37ms memcpy=39 ms
result(dst aligned, src unalign): memcpy_fast=35ms memcpy=38 ms
result(dst unalign, src aligned): memcpy_fast=34ms memcpy=39 ms
result(dst unalign, src unalign): memcpy_fast=42ms memcpy=51 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=34ms memcpy=43 ms
result(dst aligned, src unalign): memcpy_fast=60ms memcpy=60 ms
result(dst unalign, src aligned): memcpy_fast=33ms memcpy=40 ms
result(dst unalign, src unalign): memcpy_fast=34ms memcpy=39 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=121ms memcpy=77 ms
result(dst aligned, src unalign): memcpy_fast=102ms memcpy=63 ms
result(dst unalign, src aligned): memcpy_fast=88ms memcpy=79 ms
result(dst unalign, src unalign): memcpy_fast=88ms memcpy=51 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=82ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=87ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=88ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=87ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=47ms memcpy=27 ms
result(dst aligned, src unalign): memcpy_fast=40ms memcpy=20 ms
result(dst unalign, src aligned): memcpy_fast=38ms memcpy=23 ms
result(dst unalign, src unalign): memcpy_fast=40ms memcpy=22 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=22 ms
result(dst aligned, src unalign): memcpy_fast=62ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=57ms memcpy=46 ms
result(dst unalign, src unalign): memcpy_fast=41ms memcpy=47 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=60ms memcpy=53 ms
result(dst aligned, src unalign): memcpy_fast=61ms memcpy=48 ms
result(dst unalign, src aligned): memcpy_fast=61ms memcpy=52 ms
result(dst unalign, src unalign): memcpy_fast=62ms memcpy=51 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=70ms memcpy=53 ms
result(dst aligned, src unalign): memcpy_fast=66ms memcpy=53 ms
result(dst unalign, src aligned): memcpy_fast=67ms memcpy=57 ms
result(dst unalign, src unalign): memcpy_fast=66ms memcpy=57 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=63ms memcpy=69 ms
result(dst aligned, src unalign): memcpy_fast=64ms memcpy=68 ms
result(dst unalign, src aligned): memcpy_fast=64ms memcpy=71 ms
result(dst unalign, src unalign): memcpy_fast=80ms memcpy=69 ms

benchmark random access:
memcpy_fast=600ms memcpy=531ms

➜ FastMemcpy git:(master) gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-17ubuntu1~20.04' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-9-HskZEa/gcc-9-9.3.0/debian/tmp-nvptx/usr,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)

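A few minor issues

I compared this against the current MCFCRT. Since MCFCRT does not plan to support AVX, I only tested the SSE version. (In truth I was too lazy to change it. It would be fairly simple: the copy operations are currently packed as two consecutive movups instructions, and adjusting that one spot would be enough to support AVX.)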

https://github.com/lhmouse/MCF/blob/master/MCFCRT/src/stdc/string/_memcpy_impl.h#L292

gcc (gcc-7-branch HEAD with MCF thread model, built by LH_Mouse.) 7.3.1 20180125
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

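The processor is an Intel Xeon E3 1230v3 (Haswell architecture).

I looked at your benchmark and found a few minor problems:

  1. As mentioned in #1, the overhead of the extra nested function call has a fairly large effect on the 32-byte copy test, on the order of ten-odd milliseconds.
  2. Only sizes such as 32, 64 and 512 are copied; no size leaves a remainder.
  3. timeGetTime() is too imprecise (the error is often ten-odd milliseconds). I suggest QueryPerformanceCounter(), which also avoids linking against winmm.

There are also some problems with the implementation:

  1. Making every function static looks suspiciously like gaming the results through the compiler's aggressive optimization of functions with internal linkage. (runs away)
  2. The memcpy_tiny() call at the end of memcpy_fast() is not actually necessary: because the if (size <= 128) { condition above did not hold, the final xmmword can be handled directly with _mm_storeu_si128().
  3. In my measurements, prefetching made little difference.
  4. I do not recommend the integer SSE instructions. They require SSE2 and have longer encodings, so cache locality is worse. By contrast, _mm_{loadu,load,storeu,store,stream}_ps() require only SSE and have shorter encodings. (Because of a latency problem on Core2, GCC optimizes movups into a movlps + movhps pair. That is a side note, though; Haswell no longer has this latency problem.)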
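
A benchmark harness along these lines might address the three points. This is only a sketch under stated assumptions: sketch_now_ms, sketch_bench, copy_fn and the sketch_sizes list are hypothetical names, and clock_gettime(CLOCK_MONOTONIC) is swapped in on non-Windows systems where QueryPerformanceCounter() is not available. Calling both candidates through the same function pointer means each pays the same call overhead, and the size list includes values that are not multiples of 16.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 199309L /* for clock_gettime under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
/* High-resolution timestamp in milliseconds; no winmm needed. */
static double sketch_now_ms(void)
{
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart * 1000.0 / (double)freq.QuadPart;
}
#else
#include <time.h>
/* Monotonic timestamp in milliseconds (POSIX stand-in for the above). */
static double sketch_now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}
#endif

/* Both memcpy and the candidate are called through this pointer type,
   so neither side gets an inlining advantage. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical size list that includes sizes leaving a remainder. */
static const size_t sketch_sizes[] = {31, 33, 63, 65, 511, 513, 1000, 4097};

/* Returns the elapsed milliseconds for `times` copies of `size` bytes. */
static double sketch_bench(copy_fn fn, void *dst, const void *src,
                           size_t size, size_t times)
{
    assert(fn != NULL);
    double start = sketch_now_ms();
    for (size_t i = 0; i < times; i++)
        fn(dst, src, size);
    return sketch_now_ms() - start;
}
```

For example, sketch_bench(memcpy, dst, src, sketch_sizes[1], 1000000) times one million 33-byte copies with the library memcpy; passing a pointer to the function under test gives the figure to compare against.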
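
Points 2 and 4 can be illustrated together. The sketch below is not the project's code: sketch_copy_ge16 is a hypothetical name, the buffers are assumed not to overlap, and the size is assumed to be at least 16 bytes. It copies whole 16-byte blocks with the SSE-only _ps intrinsics, then finishes with one unaligned store of the last 16 bytes, which may overlap bytes already written, so no separate small-size routine is needed for the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h> /* SSE only: _mm_loadu_ps, _mm_storeu_ps */

static void *sketch_copy_ge16(void *dst, const void *src, size_t n)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    size_t i;

    assert(n >= 16); /* the overlapping tail store needs one full xmmword */

    /* Copy every complete 16-byte block with unaligned moves. */
    for (i = 0; i + 16 <= n; i += 16) {
        __m128 v = _mm_loadu_ps((const float *)(const void *)(s + i));
        _mm_storeu_ps((float *)(void *)(d + i), v);
    }

    /* Finish with the last 16 bytes of the range. When n is not a
       multiple of 16 this rewrites a few bytes with identical data. */
    {
        __m128 tail = _mm_loadu_ps((const float *)(const void *)(s + n - 16));
        _mm_storeu_ps((float *)(void *)(d + n - 16), tail);
    }
    return dst;
}
```

The unaligned loads and stores move the 128 bits unchanged, so the data does not need to be valid floating-point values.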

Much slower than the builtin memcpy

The GCC version is 4.4.7.

OP:

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=278ms memcpy=85 ms
result(dst aligned, src unalign): memcpy_fast=295ms memcpy=175 ms
result(dst unalign, src aligned): memcpy_fast=299ms memcpy=85 ms
result(dst unalign, src unalign): memcpy_fast=283ms memcpy=179 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=398ms memcpy=123 ms
result(dst aligned, src unalign): memcpy_fast=403ms memcpy=211 ms
result(dst unalign, src aligned): memcpy_fast=386ms memcpy=104 ms
result(dst unalign, src unalign): memcpy_fast=384ms memcpy=195 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=815ms memcpy=223 ms
result(dst aligned, src unalign): memcpy_fast=819ms memcpy=296 ms
result(dst unalign, src aligned): memcpy_fast=924ms memcpy=324 ms
result(dst unalign, src unalign): memcpy_fast=1082ms memcpy=304 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=881ms memcpy=265 ms
result(dst aligned, src unalign): memcpy_fast=748ms memcpy=292 ms
result(dst unalign, src aligned): memcpy_fast=792ms memcpy=271 ms
result(dst unalign, src unalign): memcpy_fast=918ms memcpy=282 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=365ms memcpy=40 ms
result(dst aligned, src unalign): memcpy_fast=374ms memcpy=81 ms
result(dst unalign, src aligned): memcpy_fast=369ms memcpy=69 ms
result(dst unalign, src unalign): memcpy_fast=378ms memcpy=80 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=391ms memcpy=35 ms
result(dst aligned, src unalign): memcpy_fast=369ms memcpy=67 ms
result(dst unalign, src aligned): memcpy_fast=374ms memcpy=60 ms
result(dst unalign, src unalign): memcpy_fast=417ms memcpy=68 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=597ms memcpy=162 ms
result(dst aligned, src unalign): memcpy_fast=618ms memcpy=166 ms
result(dst unalign, src aligned): memcpy_fast=604ms memcpy=154 ms
result(dst unalign, src unalign): memcpy_fast=587ms memcpy=155 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=409ms memcpy=195 ms
result(dst aligned, src unalign): memcpy_fast=450ms memcpy=205 ms
result(dst unalign, src aligned): memcpy_fast=440ms memcpy=189 ms
result(dst unalign, src unalign): memcpy_fast=438ms memcpy=200 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=420ms memcpy=219 ms
result(dst aligned, src unalign): memcpy_fast=426ms memcpy=212 ms
result(dst unalign, src aligned): memcpy_fast=431ms memcpy=210 ms
result(dst unalign, src unalign): memcpy_fast=426ms memcpy=209 ms

benchmark random access:
memcpy_fast=6917ms memcpy=1862ms

Slower on later GCC

This actually appears to be slower on GCC 5.4

benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=42ms memcpy=48 ms
result(dst aligned, src unalign): memcpy_fast=46ms memcpy=54 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=53 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=55 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=44ms memcpy=57 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=60 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=65 ms
result(dst unalign, src unalign): memcpy_fast=43ms memcpy=62 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=77ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=82ms memcpy=61 ms
result(dst unalign, src aligned): memcpy_fast=81ms memcpy=61 ms
result(dst unalign, src unalign): memcpy_fast=79ms memcpy=61 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=79ms memcpy=45 ms
result(dst aligned, src unalign): memcpy_fast=77ms memcpy=47 ms
result(dst unalign, src aligned): memcpy_fast=77ms memcpy=50 ms
result(dst unalign, src unalign): memcpy_fast=76ms memcpy=55 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=39ms memcpy=33 ms
result(dst aligned, src unalign): memcpy_fast=47ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=40ms memcpy=46 ms
result(dst unalign, src unalign): memcpy_fast=45ms memcpy=49 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=30 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=31 ms
result(dst unalign, src aligned): memcpy_fast=48ms memcpy=43 ms
result(dst unalign, src unalign): memcpy_fast=48ms memcpy=43 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=82ms memcpy=68 ms
result(dst aligned, src unalign): memcpy_fast=84ms memcpy=68 ms
result(dst unalign, src aligned): memcpy_fast=82ms memcpy=67 ms
result(dst unalign, src unalign): memcpy_fast=81ms memcpy=72 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=114ms memcpy=110 ms
result(dst aligned, src unalign): memcpy_fast=101ms memcpy=107 ms
result(dst unalign, src aligned): memcpy_fast=103ms memcpy=103 ms
result(dst unalign, src unalign): memcpy_fast=101ms memcpy=105 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=113ms memcpy=108 ms
result(dst aligned, src unalign): memcpy_fast=100ms memcpy=107 ms
result(dst unalign, src aligned): memcpy_fast=104ms memcpy=107 ms
result(dst unalign, src unalign): memcpy_fast=101ms memcpy=107 ms

benchmark random access:
memcpy_fast=647ms memcpy=593ms

GCC 10.2.1 Results

gcc version 10.2.1 20201007 releases/gcc-10.2.0-350-g136256c32d (Clear Linux OS for Intel Architecture)

./FastMemcpy
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=48ms memcpy=35 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=33 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=34 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst aligned, src unalign): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src aligned): memcpy_fast=54ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=54ms memcpy=34 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=52 ms
result(dst unalign, src aligned): memcpy_fast=93ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=94ms memcpy=51 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=85ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=91ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=91ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=90ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=20 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=20 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=20 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=40ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=43ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=43ms memcpy=33 ms
result(dst unalign, src unalign): memcpy_fast=43ms memcpy=34 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=54ms memcpy=43 ms
result(dst aligned, src unalign): memcpy_fast=55ms memcpy=44 ms
result(dst unalign, src aligned): memcpy_fast=55ms memcpy=47 ms
result(dst unalign, src unalign): memcpy_fast=55ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=70 ms
result(dst aligned, src unalign): memcpy_fast=88ms memcpy=78 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=74 ms
result(dst unalign, src unalign): memcpy_fast=91ms memcpy=75 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=96ms memcpy=90 ms
result(dst aligned, src unalign): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=92 ms

benchmark random access:
memcpy_fast=802ms memcpy=662ms

./FastMemcpy_Avx
benchmark(size=32 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=64 bytes, times=16777216):
result(dst aligned, src aligned): memcpy_fast=49ms memcpy=29 ms
result(dst aligned, src unalign): memcpy_fast=49ms memcpy=29 ms
result(dst unalign, src aligned): memcpy_fast=49ms memcpy=30 ms
result(dst unalign, src unalign): memcpy_fast=49ms memcpy=29 ms

benchmark(size=512 bytes, times=8388608):
result(dst aligned, src aligned): memcpy_fast=64ms memcpy=56 ms
result(dst aligned, src unalign): memcpy_fast=64ms memcpy=51 ms
result(dst unalign, src aligned): memcpy_fast=66ms memcpy=56 ms
result(dst unalign, src unalign): memcpy_fast=66ms memcpy=52 ms

benchmark(size=1024 bytes, times=4194304):
result(dst aligned, src aligned): memcpy_fast=43ms memcpy=41 ms
result(dst aligned, src unalign): memcpy_fast=44ms memcpy=43 ms
result(dst unalign, src aligned): memcpy_fast=44ms memcpy=44 ms
result(dst unalign, src unalign): memcpy_fast=44ms memcpy=44 ms

benchmark(size=4096 bytes, times=524288):
result(dst aligned, src aligned): memcpy_fast=20ms memcpy=19 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=21 ms
result(dst unalign, src aligned): memcpy_fast=21ms memcpy=21 ms
result(dst unalign, src unalign): memcpy_fast=21ms memcpy=21 ms

benchmark(size=8192 bytes, times=262144):
result(dst aligned, src aligned): memcpy_fast=21ms memcpy=23 ms
result(dst aligned, src unalign): memcpy_fast=22ms memcpy=23 ms
result(dst unalign, src aligned): memcpy_fast=22ms memcpy=34 ms
result(dst unalign, src unalign): memcpy_fast=22ms memcpy=33 ms

benchmark(size=1048576 bytes, times=2048):
result(dst aligned, src aligned): memcpy_fast=90ms memcpy=45 ms
result(dst aligned, src unalign): memcpy_fast=90ms memcpy=45 ms
result(dst unalign, src aligned): memcpy_fast=89ms memcpy=48 ms
result(dst unalign, src unalign): memcpy_fast=88ms memcpy=48 ms

benchmark(size=4194304 bytes, times=512):
result(dst aligned, src aligned): memcpy_fast=88ms memcpy=72 ms
result(dst aligned, src unalign): memcpy_fast=92ms memcpy=79 ms
result(dst unalign, src aligned): memcpy_fast=88ms memcpy=76 ms
result(dst unalign, src unalign): memcpy_fast=87ms memcpy=77 ms

benchmark(size=8388608 bytes, times=256):
result(dst aligned, src aligned): memcpy_fast=95ms memcpy=91 ms
result(dst aligned, src unalign): memcpy_fast=98ms memcpy=92 ms
result(dst unalign, src aligned): memcpy_fast=94ms memcpy=91 ms
result(dst unalign, src unalign): memcpy_fast=95ms memcpy=95 ms

benchmark random access:
memcpy_fast=796ms memcpy=687ms
