Comments (26)
Finally I found why "from skimage import io" is so slow in SGX: it indirectly calls ctypes to scan for dynamic libraries, which forks many processes. A workaround is to add the following code before "from skimage import io" to disable the scanning.
# ctypes workaround for Graphene: skimage's import machinery calls
# ctypes.util.find_library(), which scans for shared libraries by
# spawning helper processes; stub it out so the scan is skipped.
import ctypes.util

def fl(name):
    raise PermissionError("No permission")

ctypes.util.find_library = fl
from graphene.
I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?
from graphene.
I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?
I ran cat /proc/<pid>/status | grep Threads
and found it's using 49 threads without Graphene (see the top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7388 root 20 0 23.572g 993216 133064 R 2933 0.4 2:43.90 python3
) and 51 threads with graphene-sgx (see the top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
62493 root 20 0 32.299g 202800 34620 S 4720 0.1 177:41.91 loader
). In the manifest, sgx.thread_num is 512; setting it to 128 causes a "Please try to increase sgx.thread_num in the manifest" error.
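As a cross-check from inside the interpreter (rather than reading /proc), Python can report the threads it knows about. Note this misses threads created by native libraries such as PyTorch's OpenMP/BLAS pools, so /proc remains the authoritative count; this is just a minimal sketch.

```python
import threading

def python_thread_count():
    # Counts only threads the Python runtime is aware of; native threads
    # spawned by C extensions (OpenMP, BLAS, ...) do not appear here.
    return threading.active_count()

print(python_thread_count())  # at least 1 (the main thread)
```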
There are 2 physical CPUs, each with 24 cores, for 96 logical CPUs in total.
Here's a snippet of cat /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping : 6
microcode : 0xd0002a0
cpu MHz : 2400.000
cache size : 36864 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 4800.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:
...
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping : 6
microcode : 0xd0002a0
cpu MHz : 2400.000
cache size : 36864 KB
physical id : 1
siblings : 48
core id : 23
cpu cores : 24
apicid : 111
initial apicid : 111
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 4802.23
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:
Also, /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor is set to "performance".
from graphene.
What's your enclave size? There is an initial loading time for graphene-sgx that depends on the enclave size. Also, do you see the same problem with graphene-direct?
from graphene.
What's your enclave size? There is an initial loading time for graphene-sgx that depends on the enclave size. Also, do you see the same problem with graphene-direct?
The enclave size is 32G; 16G causes a "Cannot allocate memory" error. Yes, the larger the enclave size, the longer the loading time, but a 32G enclave takes about 1 hour to load. Is that normal?
Besides, graphene-direct works fine with the example; it runs as fast as without Graphene.
from graphene.
It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible:
Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.
from graphene.
It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible:
Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.
So an 18x performance drop is very common? I've enabled preheat and used tcmalloc, and also enabled performance mode. Are there any other ways to optimize?
from graphene.
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
from graphene.
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
90 seconds is just one training step, as below:
for batch_idx, (data, target) in enumerate(train_loader):
    logger.info("start step")
    start_time = time.time()
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = client_model(data)
    loss = soft_dice_loss(output, target)
    loss.backward()
    optimizer.step()
    end_time = time.time()
    time_spend = end_time - start_time
    logger.info("end step, time spend: %.3f s" % time_spend)
Each loop iteration takes about 90 seconds in SGX, but only 5 seconds without SGX.
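The per-step timing in the loop above can be factored into a small context manager; this is an independent sketch (the logger setup is assumed), not part of the original training script:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def step_timer(label="step"):
    # Wraps a block of code and logs its wall-clock duration,
    # mirroring the start/end logging in the training loop.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3f s", label, elapsed)

# Hypothetical usage inside the training loop:
# with step_timer("train step"):
#     ...forward/backward/optimizer step...
```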
But from when graphene-sgx starts until the training starts, it takes 1 hour. Is that also possible?
from graphene.
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX.
@Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?
from graphene.
Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.
from graphene.
Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.
Do you mean running the command "numactl --cpunodebind=0 --membind=0 graphene-sgx python scripts/train.py"? I've tried it, and each step improves by about 10 seconds; see the logs:
2021-08-31 09:46:23,137 INFO __main__: Start training
2021-08-31 09:46:24,584 INFO __main__: start step
2021-08-31 09:46:55,982 INFO __main__: end step, time spend: 31.396 s
2021-08-31 09:46:56,234 INFO __main__: start step
2021-08-31 09:47:56,709 INFO __main__: end step, time spend: 60.472 s
2021-08-31 09:47:56,931 INFO __main__: start step
2021-08-31 09:49:12,775 INFO __main__: end step, time spend: 75.841 s
2021-08-31 09:49:13,015 INFO __main__: start step
2021-08-31 09:50:40,776 INFO __main__: end step, time spend: 87.758 s
2021-08-31 09:50:41,036 INFO __main__: start step
2021-08-31 09:52:04,008 INFO __main__: end step, time spend: 82.969 s
2021-08-31 09:52:04,240 INFO __main__: start step
2021-08-31 09:53:28,593 INFO __main__: end step, time spend: 84.350 s
2021-08-31 09:53:28,842 INFO __main__: start step
2021-08-31 09:54:53,361 INFO __main__: end step, time spend: 84.517 s
2021-08-31 09:54:53,609 INFO __main__: start step
2021-08-31 09:56:20,133 INFO __main__: end step, time spend: 86.521 s
...
It's still a huge performance drop compared to not using SGX. And after Graphene starts, it still takes about 1 hour before training starts.
from graphene.
Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX.
@Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?
I only know that DataLoader starts multiple worker processes when num_workers > 0. But specifying multiple workers causes a "Function Not Implemented" error in Graphene, so I've set it to 0, see:
training_set = dataloader.KvasirDataset(DATA_PATH, collaborator_count=1, collaborator_num=0, is_validation=False)
train_loader = DataLoader(training_set, num_workers=0, batch_size=6, shuffle=True)
which means it only starts one process.
And the loop for batch_idx, (data, target) in enumerate(train_loader): itself runs fast, as shown in the logs.
from graphene.
@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file and attach it here, so we can take a closer look at what Graphene tries to do and where the performance bottleneck may come from.
from graphene.
We tried your workload at our end using the same manifest files and only see about 20% overhead. This is probably a setup issue at your end. Will send you the details separately.
from graphene.
We tried your workload at our end using the same manifest files and only see about 20% overhead. This is probably a setup issue at your end. Will send you the details separately.
Thanks Anees! I've seen that.
Jitender found that the PRMRR size of my machine was too small. After I increased it to the maximum of 64GB and added some optimizations, including OpenMPI and libiomp5.so, the performance with SGX is even a little better than without SGX! See:
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
2021-09-01 16:18:36,486 INFO __main__: Start training
2021-09-01 16:18:37,478 INFO __main__: start step
2021-09-01 16:18:50,832 INFO __main__: end step, time spend: 13.353 s
2021-09-01 16:18:50,928 INFO __main__: start step
2021-09-01 16:18:55,652 INFO __main__: end step, time spend: 4.724 s
2021-09-01 16:18:55,770 INFO __main__: start step
2021-09-01 16:18:58,442 INFO __main__: end step, time spend: 2.672 s
2021-09-01 16:18:58,530 INFO __main__: start step
2021-09-01 16:19:01,091 INFO __main__: end step, time spend: 2.561 s
2021-09-01 16:19:01,179 INFO __main__: start step
2021-09-01 16:19:03,866 INFO __main__: end step, time spend: 2.687 s
2021-09-01 16:19:03,956 INFO __main__: start step
2021-09-01 16:19:06,010 INFO __main__: end step, time spend: 2.053 s
2021-09-01 16:19:06,100 INFO __main__: start step
2021-09-01 16:19:08,168 INFO __main__: end step, time spend: 2.068 s
...
However, it still takes about 1 hour before the training starts, I mean between the log line "error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!" and the UserWarning line from /venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52. But on your machine, it didn't take so much time.
from graphene.
I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?
from graphene.
@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file and attach it here, so we can take a closer look at what Graphene tries to do and where the performance bottleneck may come from.
I've run it with trace level and got the log. The whole log file is too large, about 1.7G, so I split out 100M and compressed it here: pytorch-log.zip
from graphene.
I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?
I've tried disabling preheat and it had no effect.
from graphene.
@Kanakoo You can try a couple of things. Please refer to PR 2530.
Before running the workload:
Pre-system setting
Linux systems have a CPU frequency scaling governor that scales the CPU frequency either for best performance or to save power, based on the requirement. To achieve the best performance, set the CPU frequency scaling governor to performance mode:
for ((i=0; i<$(nproc); i++)); do echo 'performance' > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
Performance considerations
- The preheat manifest option pre-faults the enclave memory and moves the performance penalty to graphene-sgx invocation (before the workload starts executing). To use the preheat option, add sgx.preheat_enclave = 1 to the manifest template.
- TCMalloc and mimalloc are memory allocator libraries from Google and Microsoft that can improve performance significantly depending on the workload. At any point, only one of these allocators can be used.
  - TCMalloc (please update the binary location and name if different from the default):
    - Install tcmalloc: sudo apt-get install google-perftools
    - Add these to the manifest template:
      loader.env.LD_PRELOAD = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
      sgx.trusted_files.libtcmalloc = "file:/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
      sgx.trusted_files.libunwind = "file:/usr/lib/x86_64-linux-gnu/libunwind.so.8"
    - Save the template and rebuild.
  - mimalloc (please update the binary location and name if different from the default):
    - Install mimalloc using the steps from https://github.com/microsoft/mimalloc
    - Add these to the manifest template:
      loader.env.LD_PRELOAD = "/usr/local/lib/mimalloc-1.7/libmimalloc.so.1.7"
      sgx.trusted_files.libmimalloc = "file:/usr/local/lib/mimalloc-1.7/libmimalloc.so.1.7"
    - Save the template and rebuild.
from graphene.
@Kanakoo You can try a couple of things. Please refer to PR 2530.
I've already applied these settings but they had no effect.
Besides, I have discovered which part causes the extremely slow loading: it is from skimage import io.
I wrote a simple example to test:
import logging
logging.basicConfig(level=logging.INFO,format='%(asctime)s %(levelname)s %(name)s: %(message)s')
logger = logging.getLogger(__name__)
logger.info("start import skimage.io")
from skimage import io
logger.info("finish import, start imread")
image = io.imread('/tcf/image.jpg')
logger.info("finish imread")
assert(image.shape[2] == 3)
and found that when running with graphene-sgx on bare metal, it took about 40 minutes to import (per the timestamps below):
2021-09-16 09:27:24,769 INFO __main__: start import skimage.io
2021-09-16 10:07:15,932 INFO __main__: finish import, start imread
2021-09-16 10:07:15,999 INFO __main__: finish imread
I was wondering if there was a problem with my Python environment, so I packaged Graphene into an image and ran it in a container. However, in the container, when running with graphene-sgx, it took about 10 minutes to import, see:
graphene-worker | 2021-09-16 10:34:55,266 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:44:50,683 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:44:50,711 INFO __main__: finish imread
Also, in the container, when running with graphene-direct, it took about 30 seconds to import, see:
graphene-worker | 2021-09-16 10:32:38,944 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:33:06,027 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:33:06,054 INFO __main__: finish imread
In the container, when running without Graphene, that is, with plain python3, it took about 1 second to import, see:
graphene-worker | 2021-09-16 10:32:04,014 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:32:04,997 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:32:05,021 INFO __main__: finish imread
So there's also a huge performance drop between running with and without SGX in the container.
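To time any single import in isolation (so the skimage cost can be separated from the rest of startup), a minimal pattern is the following sketch; the module name used here is just a stand-in example:

```python
import importlib
import time

def timed_import(module_name):
    # Imports a module by name and returns (module, seconds spent).
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start

# Stand-in for timing "skimage.io":
mod, seconds = timed_import("json")
print("imported %s in %.3f s" % (mod.__name__, seconds))
```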
To reproduce it in a container, here are the related files: graphene_docker.zip. Unzip it to /root/graphene_docker, then:
cd /root/graphene_docker/pytorch-fl
python3.6 -m venv venv
source venv/bin/activate
python3.6 -m pip install --upgrade pip
python3.6 -m pip install scikit-image
deactivate
cd /root/graphene_docker
docker-compose -f docker-compose-graphene.yaml build
To run with graphene-sgx:
./graphene-worker.sh start sgx
To run with graphene-direct:
./graphene-worker.sh start direct
To run without graphene:
./graphene-worker.sh start nographene
To stop container:
./graphene-worker.sh stop
from graphene.
Based on my quick analysis with strace, this module import is heavy on I/O operations (stat/read/fstat), hence the performance impact with Graphene-SGX.
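To confirm from inside Python how much of that activity comes from library scanning (without strace), one can wrap ctypes.util.find_library with a counter before importing the suspect module. This sketch counts calls against two stand-in lookups rather than importing skimage itself (which may not be installed):

```python
import ctypes.util
import functools

call_count = 0
_original = ctypes.util.find_library

@functools.wraps(_original)
def counting_find_library(name):
    # Forward to the real lookup, but record how often we are called.
    global call_count
    call_count += 1
    return _original(name)

ctypes.util.find_library = counting_find_library

# Stand-in for "from skimage import io": trigger two lookups directly.
ctypes.util.find_library("m")
ctypes.util.find_library("c")
print("find_library was called %d times" % call_count)  # prints 2
```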
from graphene.
@Kanakoo @svenkata9 Actually, what are the sizes of the *.manifest.template, *.manifest, and *.manifest.sgx files for your application? I recently learned that people sometimes have manifest.sgx files which are ~70MB in size (which obviously makes Graphene slow).
from graphene.
@dimakuv When running on bare metal, they are:
8.0K python.manifest
13M python.manifest.sgx
8.0K python.manifest.template
When running in the container, I remade the example, and they are:
8.0K python.manifest
1.3M python.manifest.sgx
8.0K python.manifest.template
Is 13M too large?
from graphene.
Is 13M too large?
No, this is a normal value. And 1.3M in the container is a very good value.
So the manifest-file size is not the reason for the performance slowdown. I guess import skimage.io is simply slow by itself because it performs a lot of file I/O (as @svenkata9 debugged). Unfortunately, there is probably not much that can be done here.
We could look into these performance bottlenecks, but this would be very low priority... We prioritize functionality issues in Graphene over performance for now.
from graphene.
Seems this is resolved, closing.
from graphene.