Comments (7)

arai-fortanix commented on August 20, 2024

I've verified the same behavior with the latest DCAP driver source code (v1.36). The stack trace is the same.

from sgxdatacenterattestationprimitives.

haitaohuang commented on August 20, 2024

Could you provide the container image?

arai-fortanix commented on August 20, 2024

@haitaohuang I can share the image with you, but we prefer not to share the image publicly. Can you please contact me directly? My email address is in my github profile.

dimakuv commented on August 20, 2024

+1. I have started hitting the same issue that @arai-fortanix reports, but with Graphene (https://github.com/oscarlab/graphene) and a workload that spawns ~1000 SGX enclave processes. I encounter this issue with seemingly all recent Linux DCAP drivers (tested on v1.33, v1.35, and the latest available v1.36) and on different SGX-enabled machines running Ubuntu 18.04 with Linux 4.15 and Linux 5.4.

It definitely feels like a data race leading to a deadlock in the SGX driver on too-fast forks (i.e., too-fast enclave creations/destructions). Here is my stack trace obtained today from dmesg:

$ dmesg
...
[2469400.795579] INFO: task ksgxswapd:459 blocked for more than 362 seconds.
[2469400.795581]       Tainted: G           OE     5.4.0-fsgsbase #1
[2469400.795582] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2469400.795582] ksgxswapd       D    0   459      2 0x80004000
[2469400.795584] Call Trace:
[2469400.795587]  __schedule+0x293/0x730
[2469400.795588]  ? __switch_to_asm+0x40/0x70
[2469400.795589]  schedule+0x33/0xa0
[2469400.795591]  schedule_timeout+0x1d3/0x2f0
[2469400.795591]  ? __schedule+0x29b/0x730
[2469400.795593]  wait_for_completion+0xba/0x140
[2469400.795594]  ? wake_up_q+0x80/0x80
[2469400.795595]  __synchronize_srcu.part.21+0x91/0xc0
[2469400.795597]  ? __bpf_trace_rcu_utilization+0x10/0x10
[2469400.795598]  ? cpumask_any_but+0x25/0x50
[2469400.795599]  synchronize_srcu_expedited+0x27/0x30
[2469400.795599]  ? synchronize_srcu_expedited+0x27/0x30
[2469400.795600]  synchronize_srcu+0xce/0xe0
[2469400.795602]  sgx_mmu_notifier_release+0x9c/0xb0 [intel_sgx]
[2469400.795604]  __mmu_notifier_release+0x47/0xd0
[2469400.795605]  exit_mmap+0x15d/0x1b0
[2469400.795606]  ? _cond_resched+0x19/0x40
[2469400.795607]  mmput+0x57/0x140
[2469400.795609]  sgx_reclaim_pages+0x3e7/0x7f0 [intel_sgx]
[2469400.795612]  ksgxswapd+0x14d/0x2e0 [intel_sgx]
[2469400.795613]  ? wait_woken+0x80/0x80
[2469400.795614]  kthread+0x121/0x140
[2469400.795615]  ? sgx_alloc_epc_page+0x100/0x100 [intel_sgx]
[2469400.795616]  ? kthread_park+0x90/0x90
[2469400.795617]  ret_from_fork+0x35/0x40

I could provide the workload internally at Intel, but cannot share the details here.

Interestingly, when I artificially decrease the frequency of enclave creations (by delaying each one by 1 second), the SGX driver seems to survive. At least, my workload has now been running for an hour (previously, it failed after 5-10 minutes).
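
For anyone who wants to try the same workaround, here is a minimal sketch of throttled process creation. It is not the actual workload: the function name `spawn_throttled`, its parameters, and the placeholder command are assumptions made for illustration, and a real test would launch the enclave processes instead.

```python
import subprocess
import sys
import time


def spawn_throttled(commands, min_interval_s=1.0):
    """Start each command, leaving at least min_interval_s between launches.

    Hypothetical helper: it only spaces out process creation and knows
    nothing about SGX or the driver.
    """
    procs = []
    last_start = None
    for cmd in commands:
        if last_start is not None:
            # Sleep only for whatever part of the interval has not elapsed yet.
            remaining = min_interval_s - (time.monotonic() - last_start)
            if remaining > 0:
                time.sleep(remaining)
        last_start = time.monotonic()
        procs.append(subprocess.Popen(cmd))
    return procs


# Placeholder command standing in for an enclave launcher (assumption).
PLACEHOLDER_CMD = [sys.executable, "-c", "pass"]
```

Calling `spawn_throttled([PLACEHOLDER_CMD] * 1000)` would start one process per second. This only lowers the rate of creations, so it hides the race rather than fixing it.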

haitaohuang commented on August 20, 2024

Thanks for the info. Can you try this fix for kernel 5.4 and above?
haitaohuang@54b4b0b

haitaohuang commented on August 20, 2024

This is still pending further code review, but it should fix the issue:
#135
It adds a further fix for kernels 5.3 and older and backports the xas lock fix from upstream.

haitaohuang commented on August 20, 2024

Closing this, as the fix has been merged. The only limitation is that the fix cannot cover kernels that export neither mmput_async nor kallsyms_lookup_name. The kernels known to export neither are mainline 5.7 and above. Ubuntu kernels export mmput_async.
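
To check whether a given kernel falls under this limitation, one option is to look for the two symbols in the kernel's Module.symvers file. The sketch below is not part of the driver. It assumes the usual tab-separated layout with the symbol name in the second column, and the helper names `exported_symbols` and `fix_is_applicable` are made up for this example.

```python
def exported_symbols(symvers_text):
    """Return the set of symbol names found in Module.symvers-style text.

    Assumes tab-separated fields with the symbol name in the second column.
    """
    names = set()
    for line in symvers_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1]:
            names.add(fields[1])
    return names


def fix_is_applicable(symvers_text):
    """True if at least one of the two symbols the fix relies on is exported."""
    names = exported_symbols(symvers_text)
    return "mmput_async" in names or "kallsyms_lookup_name" in names
```

The text would typically be read from the Module.symvers file shipped with the kernel headers for the running kernel. Its exact location differs between distributions, so no path is hard-coded here.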
