Comments (26)
Finally I found why "from skimage import io" is so slow in SGX: it indirectly calls ctypes to scan for dynamic libraries, which forks many processes. A workaround is to add the following code before "from skimage import io" to disable the scanning.
# ctypes workaround for Graphene: skimage's import machinery calls
# ctypes.util.find_library(), which scans for shared libraries by
# spawning helper processes; stub it out so the scan is skipped.
import ctypes.util

def fl(name):
    raise PermissionError("No permission")

ctypes.util.find_library = fl
from graphene.
I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?
from graphene.
I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?
I ran cat /proc/<pid>/status | grep Threads
and found it's using 49 threads without Graphene (see the top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7388 root 20 0 23.572g 993216 133064 R 2933 0.4 2:43.90 python3
) and 51 threads with graphene-sgx (see the top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
62493 root 20 0 32.299g 202800 34620 S 4720 0.1 177:41.91 loader
). In the manifest, sgx.thread_num is 512; setting it to 128 causes a "Please try to increase sgx.thread_num in the manifest" error.
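As a cross-check from inside the interpreter (rather than reading /proc), Python can report the threads it knows about. Note this misses threads created by native libraries such as PyTorch's OpenMP/BLAS pools, so /proc remains the authoritative count; this is just a minimal sketch.

```python
import threading

def python_thread_count():
    # Counts only threads the Python runtime is aware of; native threads
    # spawned by C extensions (OpenMP, BLAS, ...) do not appear here.
    return threading.active_count()

print(python_thread_count())  # at least 1 (the main thread)
```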
There are 2 physical CPUs, each with 24 cores, for 96 logical CPUs in total.
Here's a snippet of cat /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping : 6
microcode : 0xd0002a0
cpu MHz : 2400.000
cache size : 36864 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 4800.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:
...
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping : 6
microcode : 0xd0002a0
cpu MHz : 2400.000
cache size : 36864 KB
physical id : 1
siblings : 48
core id : 23
cpu cores : 24
apicid : 111
initial apicid : 111
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 4802.23
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:
Also, /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor is set to "performance".
from graphene.
What's your enclave size? There is an initial loading time for graphene-sgx that depends on the enclave size. Also, do you see the same problem with graphene-direct?
from graphene.
What's your enclave size? There is an initial loading time for graphene-sgx that depends on the enclave size. Also, do you see the same problem with graphene-direct?
The enclave size is 32G; 16G causes a "Cannot allocate memory" error. Yes, the larger the enclave size, the longer the loading time, but a 32G enclave takes about 1 hour to load. Is that normal?
Besides, graphene-direct works fine with the example; it runs as fast as without Graphene.
from graphene.
It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible:
Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.
from graphene.
It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible:
Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.
So an 18x performance drop is very common? I've enabled preheat and used tcmalloc, and also enabled performance mode. Are there any other ways to optimize?
from graphene.
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
from graphene.
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
90 seconds is just one training step, as below:
for batch_idx, (data, target) in enumerate(train_loader):
    logger.info("start step")
    start_time = time.time()
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = client_model(data)
    loss = soft_dice_loss(output, target)
    loss.backward()
    optimizer.step()
    end_time = time.time()
    time_spend = end_time - start_time
    logger.info("end step, time spend: %.3f s" % time_spend)
Each loop iteration takes about 90 seconds in SGX, but only 5 seconds without SGX.
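The per-step timing in the loop above can be factored into a small context manager; this is an independent sketch (the logger setup is assumed), not part of the original training script:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def step_timer(label="step"):
    # Wraps a block of code and logs its wall-clock duration,
    # mirroring the start/end logging in the training loop.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s took %.3f s", label, elapsed)

# Hypothetical usage inside the training loop:
# with step_timer("train step"):
#     ...forward/backward/optimizer step...
```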
But from when graphene-sgx starts until the training starts, it takes 1 hour. Is that also possible?
from graphene.
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX.
@Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?
from graphene.
Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.
from graphene.
Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.
Do you mean running the command "numactl --cpunodebind=0 --membind=0 graphene-sgx python scripts/train.py"? I've tried it, and each step improves by about 10 seconds; see the logs:
2021-08-31 09:46:23,137 INFO __main__: Start training
2021-08-31 09:46:24,584 INFO __main__: start step
2021-08-31 09:46:55,982 INFO __main__: end step, time spend: 31.396 s
2021-08-31 09:46:56,234 INFO __main__: start step
2021-08-31 09:47:56,709 INFO __main__: end step, time spend: 60.472 s
2021-08-31 09:47:56,931 INFO __main__: start step
2021-08-31 09:49:12,775 INFO __main__: end step, time spend: 75.841 s
2021-08-31 09:49:13,015 INFO __main__: start step
2021-08-31 09:50:40,776 INFO __main__: end step, time spend: 87.758 s
2021-08-31 09:50:41,036 INFO __main__: start step
2021-08-31 09:52:04,008 INFO __main__: end step, time spend: 82.969 s
2021-08-31 09:52:04,240 INFO __main__: start step
2021-08-31 09:53:28,593 INFO __main__: end step, time spend: 84.350 s
2021-08-31 09:53:28,842 INFO __main__: start step
2021-08-31 09:54:53,361 INFO __main__: end step, time spend: 84.517 s
2021-08-31 09:54:53,609 INFO __main__: start step
2021-08-31 09:56:20,133 INFO __main__: end step, time spend: 86.521 s
...
It's still a huge performance drop compared to not using SGX. And after Graphene starts, it still takes about 1 hour before training starts.
from graphene.
Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX.
@Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?
I only know that DataLoader starts multiple worker processes when num_workers > 0. But specifying multiple workers causes a "Function Not Implemented" error in Graphene, so I've set it to 0, see:
training_set = dataloader.KvasirDataset(DATA_PATH, collaborator_count=1, collaborator_num=0, is_validation=False)
train_loader = DataLoader(training_set, num_workers=0, batch_size=6, shuffle=True)
which means it only starts one process.
And the loop for batch_idx, (data, target) in enumerate(train_loader): itself runs fast, as shown in the logs.
from graphene.
@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file and attach it here, so we can take a closer look at what Graphene tries to do and where the performance bottleneck may come from.
from graphene.
We tried your workload at our end using the same manifest files and only see about 20% overhead. This is probably a setup issue at your end. Will send you the details separately.
from graphene.
We tried your workload at our end using the same manifest files and only see about 20% overhead. This is probably a setup issue at your end. Will send you the details separately.
Thanks Anees! I've seen that.
Jitender found that the PRMRR size of my machine was too small. After I increased it to the maximum of 64GB and added some optimizations, including OpenMPI and libiomp5.so, the performance with SGX is even a little better than without SGX! See:
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
2021-09-01 16:18:36,486 INFO __main__: Start training
2021-09-01 16:18:37,478 INFO __main__: start step
2021-09-01 16:18:50,832 INFO __main__: end step, time spend: 13.353 s
2021-09-01 16:18:50,928 INFO __main__: start step
2021-09-01 16:18:55,652 INFO __main__: end step, time spend: 4.724 s
2021-09-01 16:18:55,770 INFO __main__: start step
2021-09-01 16:18:58,442 INFO __main__: end step, time spend: 2.672 s
2021-09-01 16:18:58,530 INFO __main__: start step
2021-09-01 16:19:01,091 INFO __main__: end step, time spend: 2.561 s
2021-09-01 16:19:01,179 INFO __main__: start step
2021-09-01 16:19:03,866 INFO __main__: end step, time spend: 2.687 s
2021-09-01 16:19:03,956 INFO __main__: start step
2021-09-01 16:19:06,010 INFO __main__: end step, time spend: 2.053 s
2021-09-01 16:19:06,100 INFO __main__: start step
2021-09-01 16:19:08,168 INFO __main__: end step, time spend: 2.068 s
...
However, it still takes about 1 hour before the training starts, I mean between the log line "error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!" and the UserWarning line from /venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52. But on your machine, it didn't take so much time.
from graphene.
I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?
from graphene.
@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file and attach it here, so we can take a closer look at what Graphene tries to do and where the performance bottleneck may come from.
I've run it with trace level and got the log. The whole log file is too large, about 1.7G, so I split out 100M and compressed it here: pytorch-log.zip
from graphene.
I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?
I've tried disabling preheat and it had no effect.
from graphene.
@Kanakoo You can try a couple of things. Please refer to PR 2530.
Before running the workload:
Pre-system setting
Linux systems have a CPU frequency scaling governor that scales the CPU frequency either for best performance or to save power, based on the requirement. To achieve the best performance, set the CPU frequency scaling governor to performance mode:
for ((i=0; i<$(nproc); i++)); do echo 'performance' > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
Performance considerations
- The preheat manifest option pre-faults the enclave memory and moves the performance penalty to graphene-sgx invocation (before the workload starts executing). To use the preheat option, add sgx.preheat_enclave = 1 to the manifest template.
- TCMalloc and mimalloc are memory allocator libraries from Google and Microsoft that can improve performance significantly depending on the workload. At any point, only one of these allocators can be used.
  - TCMalloc (please update the binary location and name if different from the default):
    - Install tcmalloc: sudo apt-get install google-perftools
    - Add these to the manifest template:
      loader.env.LD_PRELOAD = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
      sgx.trusted_files.libtcmalloc = "file:/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
      sgx.trusted_files.libunwind = "file:/usr/lib/x86_64-linux-gnu/libunwind.so.8"
    - Save the template and rebuild.
  - mimalloc (please update the binary location and name if different from the default):
    - Install mimalloc using the steps from https://github.com/microsoft/mimalloc
    - Add these to the manifest template:
      loader.env.LD_PRELOAD = "/usr/local/lib/mimalloc-1.7/libmimalloc.so.1.7"
      sgx.trusted_files.libmimalloc = "file:/usr/local/lib/mimalloc-1.7/libmimalloc.so.1.7"
    - Save the template and rebuild.
from graphene.
@Kanakoo You can try a couple of things. Please refer to PR 2530.
I've already applied these settings but they had no effect.
Besides, I have discovered which part causes the extremely slow loading: it is from skimage import io.
I wrote a simple example to test:
import logging
logging.basicConfig(level=logging.INFO,format='%(asctime)s %(levelname)s %(name)s: %(message)s')
logger = logging.getLogger(__name__)
logger.info("start import skimage.io")
from skimage import io
logger.info("finish import, start imread")
image = io.imread('/tcf/image.jpg')
logger.info("finish imread")
assert(image.shape[2] == 3)
and found that when running with graphene-sgx on bare metal, it took about 40 minutes to import (per the timestamps below):
2021-09-16 09:27:24,769 INFO __main__: start import skimage.io
2021-09-16 10:07:15,932 INFO __main__: finish import, start imread
2021-09-16 10:07:15,999 INFO __main__: finish imread
I was wondering if there was a problem with my Python environment, so I packaged Graphene into an image and ran it in a container. However, in the container, when running with graphene-sgx, it took about 10 minutes to import, see:
graphene-worker | 2021-09-16 10:34:55,266 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:44:50,683 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:44:50,711 INFO __main__: finish imread
Also, in the container, when running with graphene-direct, it took about 30 seconds to import, see:
graphene-worker | 2021-09-16 10:32:38,944 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:33:06,027 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:33:06,054 INFO __main__: finish imread
In the container, when running without Graphene, that is, with plain python3, it took about 1 second to import, see:
graphene-worker | 2021-09-16 10:32:04,014 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:32:04,997 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:32:05,021 INFO __main__: finish imread
So there's also a huge performance drop between running with and without SGX in the container.
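To time any single import in isolation (so the skimage cost can be separated from the rest of startup), a minimal pattern is the following sketch; the module name used here is just a stand-in example:

```python
import importlib
import time

def timed_import(module_name):
    # Imports a module by name and returns (module, seconds spent).
    start = time.perf_counter()
    module = importlib.import_module(module_name)
    return module, time.perf_counter() - start

# Stand-in for timing "skimage.io":
mod, seconds = timed_import("json")
print("imported %s in %.3f s" % (mod.__name__, seconds))
```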
To reproduce it in a container, here are the related files: graphene_docker.zip. Unzip it to /root/graphene_docker, then:
cd /root/graphene_docker/pytorch-fl
python3.6 -m venv venv
source venv/bin/activate
python3.6 -m pip install --upgrade pip
python3.6 -m pip install scikit-image
deactivate
cd /root/graphene_docker
docker-compose -f docker-compose-graphene.yaml build
To run with graphene-sgx:
./graphene-worker.sh start sgx
To run with graphene-direct:
./graphene-worker.sh start direct
To run without graphene:
./graphene-worker.sh start nographene
To stop container:
./graphene-worker.sh stop
from graphene.
Based on my quick analysis with strace, this module import is heavy on I/O operations (stat/read/fstat), hence the performance impact with Graphene-SGX.
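To confirm from inside Python how much of that activity comes from library scanning (without strace), one can wrap ctypes.util.find_library with a counter before importing the suspect module. This sketch counts calls against two stand-in lookups rather than importing skimage itself (which may not be installed):

```python
import ctypes.util
import functools

call_count = 0
_original = ctypes.util.find_library

@functools.wraps(_original)
def counting_find_library(name):
    # Forward to the real lookup, but record how often we are called.
    global call_count
    call_count += 1
    return _original(name)

ctypes.util.find_library = counting_find_library

# Stand-in for "from skimage import io": trigger two lookups directly.
ctypes.util.find_library("m")
ctypes.util.find_library("c")
print("find_library was called %d times" % call_count)  # prints 2
```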
from graphene.
@Kanakoo @svenkata9 Actually, what are the sizes of the *.manifest.template, *.manifest, and *.manifest.sgx files for your application? I recently learned that people sometimes have manifest.sgx files which are ~70MB in size (which obviously makes Graphene slow).
from graphene.
@dimakuv When running on bare metal, they are:
8.0K python.manifest
13M python.manifest.sgx
8.0K python.manifest.template
When running in the container, I remade the example, and they are:
8.0K python.manifest
1.3M python.manifest.sgx
8.0K python.manifest.template
Is 13M too large?
from graphene.
Is 13M too large?
No, this is a normal value. And 1.3M in the container is a very good value.
So the manifest-file size is not the reason for the performance slowdown. I guess import skimage.io is simply slow by itself because it performs a lot of file I/O (as @svenkata9 debugged). Unfortunately, there is probably not much that can be done here.
We could look into these performance bottlenecks, but this would be very low priority... We prioritize functionality issues in Graphene over performance for now.
from graphene.
Seems this is resolved, closing.
from graphene.