Comments (12)

Neo-Outis commented on May 23, 2024

Hi, I have tried the kernel_irq_handler you mentioned last time, and it works fine. I have now run bench several times with your new code, and so far no screen freeze has occurred. I will let you know the result after using it for several days (since the freeze used to happen only 'sometimes'). Thanks for the update!

from sgx-step.

jovanbulck commented on May 23, 2024

Hi Neo!

Nice to hear you are experimenting with single-stepping. The problem you describe does indeed sound like a known, infamous issue :/ Unfortunately, I have also seen the system occasionally crash with a complete freeze, for a reason I have not yet identified, after which the machine has to be rebooted. In my case, however, this happens relatively infrequently, and I am able to single-step enclaves with several millions of instructions without crashes ^^

I have a few suspicions about which bug might be causing these crashes, but I still have to investigate further at some point. I think it is related to a race condition between the user-space code and the kernel, so from experience it really helps to use the isolcpus kernel option together with CPU pinning via the claim_cpu() function. It also helps to disable the NMI watchdog with the kernel option nmi_watchdog=0.

Hope this helps! It's already great that it works perfectly for you when it doesn't crash: that means at least that your setup is correct! I'm not sure what you mean by:

> Besides, once I load the sgx-step kernel, System always warn me that System program problem detected.

I think Ubuntu systems might sometimes show such a GUI warning, but I'm not sure why it is triggered or whether it is relevant. If you can provide more details, that would be helpful. If you suspect this is a kernel problem, could you check and provide the output of dmesg | tail after loading the driver? Also make sure to pass iomem=relaxed no_timer_check to the kernel, as described in the README, to suppress some warnings :)

Neo-Outis commented on May 23, 2024

shujiecui commented on May 23, 2024

Hi Neo,
Thanks for your reply. I tried enabling HT; the system still crashes, but less frequently.

Shujie

jovanbulck commented on May 23, 2024

Hi Shujie,

Too bad you ran into this issue. HT should normally not interfere too much with single-stepping; I'd even expect things to work more stably without HT. It is important, however, to isolate the victim CPU 1 with the isolcpus=1 Linux kernel parameter, as described in the README. You can check the kernel parameters with dmesg or cat /proc/cmdline.
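To make the check of the kernel parameters less error-prone, it could be scripted roughly as follows. This is only a minimal sketch: the helper names has_kernel_param and count_missing_params are made up for illustration and are not part of libsgxstep, and the exact-token comparison is a simplification (a setting such as isolcpus=1,2 would be reported as missing when looking for isolcpus=1).

```c
#include <stdio.h>
#include <string.h>

/* A token boundary is whitespace or the end of the string. */
static int is_sep(char c)
{
    return c == '\0' || c == ' ' || c == '\t' || c == '\n';
}

/* Hypothetical helper: return 1 if `param` occurs in `cmdline` as a complete
 * whitespace-delimited token, so "isolcpus=1" does not match "isolcpus=11". */
int has_kernel_param(const char *cmdline, const char *param)
{
    size_t len = strlen(param);
    const char *p = cmdline;

    if (len == 0)
        return 0;

    while ((p = strstr(p, param)) != NULL) {
        int starts_token = (p == cmdline) || is_sep(p[-1]);
        int ends_token = is_sep(p[len]);

        if (starts_token && ends_token)
            return 1;
        p++; /* keep searching after a partial match */
    }
    return 0;
}

/* Hypothetical helper: read the first line of `path` (normally /proc/cmdline)
 * and print every expected parameter that is absent. Returns the number of
 * missing parameters, or -1 if the file cannot be read. */
int count_missing_params(const char *path, const char *const params[], size_t n)
{
    char buf[4096];
    size_t i;
    int missing = 0;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);

    for (i = 0; i < n; i++) {
        if (!has_kernel_param(buf, params[i])) {
            printf("missing kernel parameter: %s\n", params[i]);
            missing++;
        }
    }
    return missing;
}
```

Calling count_missing_params("/proc/cmdline", ...) with the parameters named in this thread (isolcpus=1, nmi_watchdog=0, iomem=relaxed, no_timer_check) should return 0 on a machine configured as discussed here.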

The error messages you posted seem to indicate that something is going wrong with the page-table remapping. Linux may complain when it detects user-space tampering with PTEs. Which kernel version are you using, as reported by uname -a?

What exactly do you mean by:

> Everytime after testing bench, idt and cpl, the system crashes and I have to reboot my machine

Do you mean that the example first works and produces the expected outputs, and the system crashes after that? Or does it not work at all? In the first case, it might be related to teardown.

shujiecui commented on May 23, 2024

> Hi Shujie,
>
> Too bad you run into this issue. HT should normally not interfere too much with single-stepping--I'd even expect things work more stable w/o HT. Important however is to affinitize the victim CPU 1 with the isolcpus=1 Linux kernel param, as described in the README. You can check the kernel params with dmesg or cat /proc/cmdline

I did that. The system is configured with all the parameters mentioned in README.

> The error messages you posted seem to indicate something is going wrong with the page-table remapping. Linux may complain when it detects the user-space tampering with PTEs. What kernel version are you using as specified by uname -a?

Sorry, the error messages I posted can be ignored. It is shown only when testing foreshadow, and the system doesn't crash after testing foreshadow.

> What do you mean exactly with:
>
> > Everytime after testing bench, idt and cpl, the system crashes and I have to reboot my machine
>
> Do you mean the example first works and produces expected outputs, and after that the system crashes? Or does it not work at all? In the first case, it might be related to tear down.

Both bench and idt work and produce expected outputs, but the system crashes after that.

jovanbulck commented on May 23, 2024

Okay, so the fact that it works and produces the expected outputs is great, but it seems to indicate that there is an issue with the teardown. As I mentioned above, I think it might have something to do with a race condition or an unexpected interrupt point between the kernel and libsgxstep when configuring privilege levels and call gates.

Unfortunately, I'm not sure where the bug is or how to fix it. Some suggestions:

  1. Which kernel version are you using? It could be that the GDT/IDT vectors being overwritten are already in use by the kernel, and the current code does not properly back up and restore them (I'm aware of that, but haven't yet found the time to do it properly and just hacked in some vectors that are unused on my kernel). Vectors can be changed here, and you should be able to inspect the GDT and IDT via the dump_gdt/dump_idt functions, e.g., by modifying app/idt and app/cpl to only print, without making modifications, after a fresh reboot.

     jo@breuer:~$ uname -a
     Linux breuer 5.3.0-40-generic #32~18.04.1-Ubuntu SMP Mon Feb 3 14:05:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  2. There are config switches in app/idt and app/cpl to only do a subset of things (e.g., only software IRQs without the timer, or only IRQ gates and no call gates). Try toggling these to further narrow down exactly which functionality causes the crash.

  3. There's a USER_IDT_ENABLE switch that you can disable here, which is used by app/bench and may fix the problem by not relying on custom IRQ gates and falling back to the "old" approach of directly hooking the existing Linux APIC timer handler. This option apparently broke with some recent changes, but I'll push a commit to fix it again in case it might help you.

  4. Try configuring SGX-Step without user-space interrupt handlers.

jovanbulck commented on May 23, 2024

I fixed the USER_IDT_ENABLE=0 option for app/bench in the latest commit. This works stably on my machine and allows single-stepping with minimal intervention in the kernel data structures (i.e., without having to change IDT or GDT entries), which may fix your issue(?)

This commit also fixes a possible issue where the APIC timer vector got improperly restored (zeroed) that you can run into in corner cases (when restoring the APIC without having reconfigured it). I don't think this is the root cause of the troubles in this thread, but it might help ^^
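The corner case described above can be guarded against with a "saved" flag, roughly as sketched below. This is not the code from the commit: the struct and function names are hypothetical, the register is modeled as a plain variable instead of the memory-mapped APIC, and the values used are arbitrary.

```c
#include <stdint.h>

/* Stand-in for the local APIC LVT timer register. A real implementation
 * would access the memory-mapped APIC instead (assumption for illustration). */
struct fake_apic {
    uint32_t lvt_timer;
};

/* Backup of the original register value, plus a flag recording whether a
 * value has actually been saved. */
struct timer_backup {
    uint32_t lvt_timer;
    int valid;
};

/* Save the original value once, then install the new configuration. */
void timer_configure(struct fake_apic *apic, struct timer_backup *bk,
                     uint32_t new_lvt)
{
    if (!bk->valid) {
        bk->lvt_timer = apic->lvt_timer;
        bk->valid = 1;
    }
    apic->lvt_timer = new_lvt;
}

/* Restore only if something was saved. Without this check, restoring a
 * never-configured timer would write the zero-initialized backup value. */
void timer_restore(struct fake_apic *apic, struct timer_backup *bk)
{
    if (!bk->valid)
        return;
    apic->lvt_timer = bk->lvt_timer;
    bk->valid = 0;
}
```

With this guard, calling timer_restore before any timer_configure leaves the register untouched rather than zeroing it.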

shujiecui commented on May 23, 2024

Unluckily, it doesn't help.
But enabling HT really helps.

jovanbulck commented on May 23, 2024

So I just thought of one more thing that you might try: it could be that the issue arises from the user-space handler code being interrupted and the kernel somehow not expecting that. This is possible because the user-space handlers run as a trap gate, which allows them to be interrupted (as they otherwise don't properly restore interrupts on the user-space iret).

You might want to try replacing the install_user_irq_handler with install_kernel_irq_handler so the handler will run as a proper interrupt gate with ring-0 privileges and without being interrupted. See for instance app/cpl for an example of install_kernel_irq_handler and make sure to disable SMAP/SMEP for this to work(!)
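For reference, the difference between the two gate kinds comes down to the type field in the IDT entry. The sketch below follows the x86 gate descriptor encoding (type 0xE for an interrupt gate, 0xF for a trap gate); the helper names are made up for illustration and this is not how libsgxstep itself builds its gates.

```c
#include <stdint.h>

#define GATE_TYPE_INTERRUPT 0xE /* CPU clears IF on entry to the handler */
#define GATE_TYPE_TRAP      0xF /* CPU leaves IF unchanged on entry */

/* Build the type/attribute byte of an IDT gate descriptor:
 * bit 7 = present, bits 6:5 = gate DPL, bits 3:0 = gate type. */
uint8_t gate_type_attr(uint8_t type, uint8_t dpl)
{
    return (uint8_t)(0x80 | ((dpl & 0x3) << 5) | (type & 0xF));
}

/* Return 1 if a handler behind this gate starts with interrupts masked,
 * i.e. the gate is an interrupt gate rather than a trap gate. */
int gate_masks_interrupts(uint8_t type_attr)
{
    return (type_attr & 0xF) == GATE_TYPE_INTERRUPT;
}
```

A handler behind a trap gate can itself be interrupted right away, whereas an interrupt gate keeps further maskable interrupts out until the handler returns or re-enables them.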

Hope it helps, let me know if you see any improvements or not ^^

jovanbulck commented on May 23, 2024

So I've worked a bit more on this issue. Currently, the most likely hypothesis is that the kernel may sometimes interrupt the user-space application after a timer IRQ has been scheduled but before the timer has fired. This happens infrequently and depends on the target system configuration, which would explain why the issue only arises occasionally. Consequently, the CPU may raise a #GP exception when attempting to vector to our user-space ring-3 timer IRQ handler while currently executing in ring 0. The sequence is as follows:

  1. SGX-Step schedules an APIC timer IRQ in user space
  2. CPU may switch into kernel space for some reason
  3. External timer IRQ arrives in kernel space
  4. CPU locates the handler for the timer IRQ in the IDT and finds that the handler's code segment is the user-space code segment with index 0x6 in the GDT
  5. When attempting to load the user-mode segment selector, the CPU detects a privilege level violation and generates a #GP
  6. The kernel doesn't expect a #GP for a timer IRQ and crashes
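To relate the GDT index in step 4 to the privilege level in step 5, a segment selector can be split into its fields as sketched below. The decoding follows the x86 selector layout; the value 0x33 used in the usage note is an assumption based on the usual 64-bit Linux user code selector (GDT index 6, RPL 3), and the names are made up for illustration.

```c
#include <stdint.h>

/* Fields of an x86 segment selector. */
struct selector_fields {
    uint16_t index; /* bits 15:3, entry number in the descriptor table */
    uint8_t ti;     /* bit 2, 0 = GDT, 1 = LDT */
    uint8_t rpl;    /* bits 1:0, requested privilege level */
};

/* Split a raw 16-bit selector into its three fields. */
struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;

    f.index = (uint16_t)(sel >> 3);
    f.ti = (uint8_t)((sel >> 2) & 0x1);
    f.rpl = (uint8_t)(sel & 0x3);
    return f;
}
```

Under that assumption, decode_selector(0x33) yields index 6, TI 0, and RPL 3, i.e. a ring-3 code segment in the GDT, which is what the CPU refuses to enter from ring 0.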

The relevant section is in Intel SDM "6.12.1.1 Protection of Exception- and Interrupt-Handler Procedures":

The processor does not permit transfer of execution to an exception- or interrupt-handler procedure in a
less privileged code segment (numerically greater privilege level) than the CPL.
An attempt to violate this rule results in a general-protection exception (#GP).

I was aware that this scenario is exotic and definitely not recommended in non-adversarial deployments, but I was not aware that the processor apparently offers no way to allow it at all.

So I did some coding and managed to reproduce the above hypothesis in the updated app/idt program on the irq_cpl branch:

https://github.com/jovanbulck/sgx-step/tree/irq_cpl

For reference, the following matrix summarizes whether code with privilege level my_cpl can be interrupted by a handler with privilege level irq_cpl.

| my_cpl \ irq_cpl | 0  | 3    |
|------------------|----|------|
| 0                | OK | FAIL |
| 3                | OK | OK   |
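The matrix follows directly from the SDM rule quoted above and can be written down as a one-line check. This is a simplified model for illustration only (it treats irq_cpl as the privilege level of the handler's code segment, and the function names are made up):

```c
#include <stdio.h>

/* Return 1 if code running at my_cpl can be interrupted by a handler whose
 * code segment has privilege level irq_cpl, or 0 if the CPU raises #GP.
 * The handler must not be less privileged (numerically greater) than the CPL. */
int irq_delivery_ok(int my_cpl, int irq_cpl)
{
    return irq_cpl <= my_cpl;
}

/* Print the matrix for rings 0 and 3. */
void print_cpl_matrix(void)
{
    static const int levels[] = { 0, 3 };
    size_t i, j;

    printf("my_cpl \\ irq_cpl");
    for (j = 0; j < 2; j++)
        printf("  %-4d", levels[j]);
    printf("\n");

    for (i = 0; i < 2; i++) {
        printf("%-16d", levels[i]);
        for (j = 0; j < 2; j++)
            printf("  %-4s",
                   irq_delivery_ok(levels[i], levels[j]) ? "OK" : "FAIL");
        printf("\n");
    }
}
```

The only failing combination is my_cpl = 0 with irq_cpl = 3, which is exactly the case where the timer IRQ arrives while the CPU is in kernel mode.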

All the OK entries run smoothly without any problems on my machine, but for the FAIL entry I get an immediate system freeze and have to reboot the machine. So the solution seems to be to simply never use ring-3 IRQ handlers, so that things keep working even if the processor somehow happens to be in kernel mode. (The original motivation for ring-3 handlers was to avoid a privilege switch in the interrupt path and improve Nemesis IRQ latency measurements, but I expect that an added CPL switch will not significantly affect Nemesis.) The new code sets the IRQ gate DPL and segment to the kernel and adds some custom asm in the handler to directly write the APIC end-of-interrupt register and return without clobbering registers.

I pushed some preliminary commits to the irq_cpl branch and also updated app/bench on that branch to make use of the new ring-0 handlers. For me the new code seems to run very stably now, and I haven't encountered any #GP so far! So I hope this has pinpointed the problem and that the app/idt and app/bench code on the irq_cpl branch also works for you.

I'd be curious to hear your experiences! If things work out to be stable, I'll later merge the new code to master after doing some more refactoring and duplicate code removal etc ^^

jovanbulck commented on May 23, 2024

Fixed and merged to master in #31
