Giter Club home page Giter Club logo

Comments (15)

dimakuv avatar dimakuv commented on July 3, 2024

@llly Could you try it on the latest Graphene? After the v1.2-rc release, we had a bunch of bug fixes and small changes to the IPC (Inter-Process Communication) subsystem in Graphene.

In particular, when you move to the latest Graphene, you will notice more meaningful Process IDs (instead of your current P4346, P5015 there will be more meaningful P1 for main process, P2 for the first child process, etc).

Hopefully this bug will disappear after you update. If not, please post a similar debug log for the updated run.

@boryspoplawski Please keep an eye on this too. Also, looks like this message:

debug: ipc_change_id_owner: sending a request (524..5015)

should actually be:

debug: ipc_change_id_owner: sending a request (524 -> 5015)

(Meaning that TID 524 should get a new owner with PID 5015.)

from graphene.

llly avatar llly commented on July 3, 2024

@dimakuv I can reproduce this failure on latest Graphene trunk.
This time it's caused by clone a process at the very beginning of Python program.

[P2:T21:python] trace: ---- shim_clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0x0, 0x0, 0xcf740ca10, 0x0) ...
[P2:T21:python] debug: ipc_get_new_vmid: sending a request
[P2:T21:python] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: sending a request: [33..36]
[P2:T21:python] debug: Waiting for a response to 3
[P2:shim] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: ipc_send_message: 0
[P1:shim] debug: IPC worker: received IPC message from 2: code=1 size=34 seq=3
[P1:shim] debug: ipc_get_new_vmid_callback: 3
[P1:shim] debug: Sending ipc message to 2
[P1:shim] debug: IPC worker: received IPC message from 2: code=4 size=42 seq=0
[P2:shim] debug: IPC worker: received IPC message from 1: code=0 size=38 seq=3
[P1:shim] debug: ipc_release_id_range_callback: release_id_range(33..36)
[P2:T21:python] debug: Waiting finished: 0
[P2:shim] debug: Got an IPC response from 1, seq: 3
[P2:T21:python] debug: ipc_get_new_vmid: got a response: 3
......
[P2:T21:python] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P2:T21:python] debug: complete checkpointing data
[P2:T21:python] debug: checkpoint of 178440 bytes created
......
[P3:T37:python] debug: successfully restored checkpoint at 0xfb89fe000 - 0xfb8a29908
[P3:T37:python] debug: Creating pipe: pipe.srv:3
debug: sock_getopt (fd = 86, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 87, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
[P3:shim] debug: IPC worker started
debug: sock_getopt (fd = 330, sockopt addr = 0x7f5a84f25f70) is not implemented and always returns 0
[P3:T37:python] debug: ipc_change_id_owner: sending a request (37..3)
debug: sock_getopt (fd = 88, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 263, sockopt addr = 0x7f91a51eaf70) is not implemented and always returns 0
[P3:T37:python] debug: Sending ipc message to 1
[P3:T37:python] debug: Waiting for a response to 1
[P1:shim] debug: IPC worker: received IPC message from 3: code=5 size=42 seq=1
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
error: Unknown or illegal instruction executed
[P1:shim] error: Illegal instruction during Graphene internal execution at 0xfe4f7e91d (IP = +0x2891d, VMID = 1, TID = 0)
debug: DkProcessExit: Returning exit code 1
[P2:shim] debug: IPC leader disconnected
[P2:shim] debug: Unknown process (vmid: 0x1) disconnected

from graphene.

dimakuv avatar dimakuv commented on July 3, 2024

@llly How can we quickly reproduce it? What is the Python script you tried and is there anything special you did in python.manifest.template?

from graphene.

boryspoplawski avatar boryspoplawski commented on July 3, 2024

Also what is the exact commit hash you've run this on?

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

Also what is the exact commit hash you've run this on?

the commit hash is fb71e43
and also on 1.2-rc1, same issue happens
and on previous commit 1b8848b which is before 1.2-rc1, it works well

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

@llly How can we quickly reproduce it? What is the Python script you tried and is there anything special you did in python.manifest.template?

actually we use bash to start java/python programs
please refer to this manifest
https://github.com/intel-analytics/analytics-zoo/blob/master/ppml/trusted-big-data-ml/python/docker-graphene/bash.manifest.template

any comments and help will be appreciated!

from graphene.

boryspoplawski avatar boryspoplawski commented on July 3, 2024

That commit is 5 months old...
Please try the newest master from the new repository (at the time of writing: gramineproject/gramine@ff5a2da)

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

That commit is 5 months old...
Please try the newest master from the new repository (at the time of writing: gramineproject/gramine@ff5a2da)

sorry, I pasted the wrong commit, actually we are now running on fb71e43, and encounter this issue.

from graphene.

dimakuv avatar dimakuv commented on July 3, 2024

@glorysdj Unfortunately, the manifest you provided requires a very specific environment on the machine (Python3.6, Java, Spark, something called zinc, etc). So we won't be able to reproduce this.

Anyway, @boryspoplawski, do you have an idea what the debug log above could indicate? In particular, why this could trigger:

[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153

In other words, why the main process P1 wouldn't know all ancestor IDs?

from graphene.

boryspoplawski avatar boryspoplawski commented on July 3, 2024

@dimakuv no idea, sounds like a serious bug, but we've never seen that before...

@glorysdj @llly Could you please post a trace log of a full execution? Maybe then we could tell more (the provided snipped skips too much details).

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

test-orca-tf-text-sgx.log

trace log is attached.

also you can run it with below docker image:

export ENCLAVE_KEY=xxx/enclave-key.pem
export LOCAL_IP=x.x.x.x

docker run -itd \
--privileged \
--net=host \
--cpuset-cpus="26-30" \
--oom-kill-disable \
--device=/dev/gsgx \
--device=/dev/sgx/enclave \
--device=/dev/sgx/provsion \
-v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
--name=spark-local \
-e LOCAL_IP=$LOCAL_IP \
-e SGX_MEM_SIZE=32G \
intelanalytics/analytics-zoo-ppml-trusted-big-data-ml-python-graphene:0.12-k8s bash

attach the container

docker exec -it spark-local bash
cd /ppml/trusted-big-data-ml/

init the sgx manifest

./init.sh

run the workload

graphene-sgx ./bash -c "/opt/jdk8/bin/java -cp \
  '/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar:${SPARK_HOME}/conf/:${SPARK_HOME}/jars/*' \
  -Xmx3g \
  org.apache.spark.deploy.SparkSubmit \
  --master 'local[4]' \
  --conf spark.driver.memory=3g \
  --conf spark.executor.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
  --conf spark.driver.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
  --properties-file /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/conf/spark-analytics-zoo.conf \
  --jars /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
  --py-files /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-python-api.zip \
  --executor-memory 3g \
  --executor-cores 2 \
  --driver-cores 2 \
  /ppml/trusted-big-data-ml/work/examples/pyzoo/orca/learn/tf/basic_text_classification/basic_text_classification.py \
  --cluster_mode local" | tee test-orca-tf-text-sgx.log

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

@dimakuv @boryspoplawski any insights?

from graphene.

boryspoplawski avatar boryspoplawski commented on July 3, 2024

@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?

yes, we will try this

from graphene.

glorysdj avatar glorysdj commented on July 3, 2024

@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?

have tried with latest gramine, but encountered another issues, when run a very simple java program. will try to summarize the issue.

from graphene.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.