Comments (15)
@llly Could you try it on the latest Graphene? After the v1.2-rc release, we had a bunch of bug fixes and small changes to the IPC (Inter-Process Communication) subsystem in Graphene.
In particular, when you move to the latest Graphene, you will notice more meaningful process IDs (instead of your current P4346, P5015, there will be P1 for the main process, P2 for the first child process, etc.).
Hopefully this bug will disappear after you update. If not, please post a similar debug log for the updated run.
@boryspoplawski Please keep an eye on this too. Also, it looks like this message:
debug: ipc_change_id_owner: sending a request (524..5015)
should actually be:
debug: ipc_change_id_owner: sending a request (524 -> 5015)
(Meaning that TID 524 should get a new owner with PID 5015.)
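For anyone grepping logs across versions, here is a minimal Python sketch that accepts both notations of this message. Only the two message formats come from the lines above; the regular expression and the function name `parse_change_owner` are illustrative assumptions and not part of Graphene.

```python
import re

# Accepts both the current "(524..5015)" and the proposed "(524 -> 5015)" notation.
CHANGE_OWNER_RE = re.compile(
    r"ipc_change_id_owner: sending a request \((\d+)\s*(?:\.\.|->)\s*(\d+)\)"
)


def parse_change_owner(line):
    """Return (id, new_owner) from a change-owner request line, or None."""
    match = CHANGE_OWNER_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```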
from graphene.
@dimakuv I can reproduce this failure on the latest Graphene trunk.
This time it is triggered by cloning a process at the very beginning of the Python program.
[P2:T21:python] trace: ---- shim_clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0x0, 0x0, 0xcf740ca10, 0x0) ...
[P2:T21:python] debug: ipc_get_new_vmid: sending a request
[P2:T21:python] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: sending a request: [33..36]
[P2:T21:python] debug: Waiting for a response to 3
[P2:shim] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: ipc_send_message: 0
[P1:shim] debug: IPC worker: received IPC message from 2: code=1 size=34 seq=3
[P1:shim] debug: ipc_get_new_vmid_callback: 3
[P1:shim] debug: Sending ipc message to 2
[P1:shim] debug: IPC worker: received IPC message from 2: code=4 size=42 seq=0
[P2:shim] debug: IPC worker: received IPC message from 1: code=0 size=38 seq=3
[P1:shim] debug: ipc_release_id_range_callback: release_id_range(33..36)
[P2:T21:python] debug: Waiting finished: 0
[P2:shim] debug: Got an IPC response from 1, seq: 3
[P2:T21:python] debug: ipc_get_new_vmid: got a response: 3
......
[P2:T21:python] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P2:T21:python] debug: complete checkpointing data
[P2:T21:python] debug: checkpoint of 178440 bytes created
......
[P3:T37:python] debug: successfully restored checkpoint at 0xfb89fe000 - 0xfb8a29908
[P3:T37:python] debug: Creating pipe: pipe.srv:3
debug: sock_getopt (fd = 86, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 87, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
[P3:shim] debug: IPC worker started
debug: sock_getopt (fd = 330, sockopt addr = 0x7f5a84f25f70) is not implemented and always returns 0
[P3:T37:python] debug: ipc_change_id_owner: sending a request (37..3)
debug: sock_getopt (fd = 88, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 263, sockopt addr = 0x7f91a51eaf70) is not implemented and always returns 0
[P3:T37:python] debug: Sending ipc message to 1
[P3:T37:python] debug: Waiting for a response to 1
[P1:shim] debug: IPC worker: received IPC message from 3: code=5 size=42 seq=1
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
error: Unknown or illegal instruction executed
[P1:shim] error: Illegal instruction during Graphene internal execution at 0xfe4f7e91d (IP = +0x2891d, VMID = 1, TID = 0)
debug: DkProcessExit: Returning exit code 1
[P2:shim] debug: IPC leader disconnected
[P2:shim] debug: Unknown process (vmid: 0x1) disconnected
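To triage a log like the one above, it can help to group lines by the process that emitted them. The following is a minimal Python sketch. The `[P<n>:<thread>] <level>:` prefix format is taken from the log itself; the function names and the choice to skip lines without that prefix (such as the `sock_getopt` messages) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Prefix as it appears in the log above, e.g. "[P1:shim] error: ..." or
# "[P2:T21:python] debug: ...".
PREFIX_RE = re.compile(r"^\[P(\d+):([^\]]+)\]\s+(\w+):\s*(.*)$")


def group_by_process(lines):
    """Map process number -> list of (thread, level, message) tuples."""
    grouped = defaultdict(list)
    for line in lines:
        match = PREFIX_RE.match(line.strip())
        if match is None:
            continue  # line without a [P...] prefix; skipped in this sketch
        pid, thread, level, message = match.groups()
        grouped[int(pid)].append((thread, level, message))
    return grouped


def errors_of(grouped, pid):
    """Return the error-level messages emitted by one process."""
    return [msg for _, level, msg in grouped.get(pid, []) if level == "error"]
```

Applied to the excerpt above, `errors_of(grouped, 1)` would return the `BUG()` line and the illegal-instruction line reported by `[P1:shim]`.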
@llly How can we quickly reproduce it? What is the Python script you tried, and is there anything special you did in python.manifest.template?
Also what is the exact commit hash you've run this on?
The commit hash is fb71e43.
The same issue also happens on 1.2-rc1.
On the previous commit 1b8848b, which predates 1.2-rc1, it works well.
Regarding how to reproduce: we actually use bash to start the Java/Python programs. Please refer to this manifest:
https://github.com/intel-analytics/analytics-zoo/blob/master/ppml/trusted-big-data-ml/python/docker-graphene/bash.manifest.template
Any comments and help will be appreciated!
That commit is 5 months old...
Please try the newest master from the new repository (at the time of writing: gramineproject/gramine@ff5a2da)
Sorry, I pasted the wrong commit. We are actually running on fb71e43 now and encounter this issue.
@glorysdj Unfortunately, the manifest you provided requires a very specific environment on the machine (Python 3.6, Java, Spark, something called zinc, etc.). So we won't be able to reproduce this.
Anyway, @boryspoplawski, do you have an idea what the debug log above could indicate? In particular, why this could trigger:
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
In other words, why the main process P1 wouldn't know all ancestor IDs?
@dimakuv No idea. It sounds like a serious bug, but we've never seen it before...
@glorysdj @llly Could you please post a trace log of a full execution? Maybe then we could tell more (the provided snippet skips too many details).
The trace log is attached.
You can also reproduce it with the Docker image below:
export ENCLAVE_KEY=xxx/enclave-key.pem
export LOCAL_IP=x.x.x.x
docker run -itd \
--privileged \
--net=host \
--cpuset-cpus="26-30" \
--oom-kill-disable \
--device=/dev/gsgx \
--device=/dev/sgx/enclave \
--device=/dev/sgx/provision \
-v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
--name=spark-local \
-e LOCAL_IP=$LOCAL_IP \
-e SGX_MEM_SIZE=32G \
intelanalytics/analytics-zoo-ppml-trusted-big-data-ml-python-graphene:0.12-k8s bash
Attach to the container:
docker exec -it spark-local bash
cd /ppml/trusted-big-data-ml/
Initialize the SGX manifest:
./init.sh
Run the workload:
graphene-sgx ./bash -c "/opt/jdk8/bin/java -cp \
'/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar:${SPARK_HOME}/conf/:${SPARK_HOME}/jars/*' \
-Xmx3g \
org.apache.spark.deploy.SparkSubmit \
--master 'local[4]' \
--conf spark.driver.memory=3g \
--conf spark.executor.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--conf spark.driver.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--properties-file /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/conf/spark-analytics-zoo.conf \
--jars /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--py-files /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-python-api.zip \
--executor-memory 3g \
--executor-cores 2 \
--driver-cores 2 \
/ppml/trusted-big-data-ml/work/examples/pyzoo/orca/learn/tf/basic_text_classification/basic_text_classification.py \
--cluster_mode local" | tee test-orca-tf-text-sgx.log
@dimakuv @boryspoplawski any insights?
@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?
Yes, we will try this.
We have tried the latest Gramine but encountered a different issue when running a very simple Java program. We will try to summarize it.