
Comments (26)

lyakh commented on August 16, 2024

First of all I think the topic of this bug is not quite correct: this isn't a kernel panic, this is a kernel WARNING. Still, it has to be fixed of course.

I've tried to reproduce the bug but have failed so far. The difference is that my Up^2 has no codec. I tried both topology variants, one with a codec and a "nocodec" one. In both cases the kernel reported some errors, so apparently the drivers didn't complete registration. That might explain why a different path was taken when unloading and reloading the modules, so no warnings were generated. I also noticed that not all snd-soc modules were unloaded by the "remove" script. These modules were kept:

snd_soc_sst_bxt_pcm512x    16384  0
snd_soc_pcm512x_i2c    16384  0
snd_soc_pcm512x        32768  1 snd_soc_pcm512x_i2c
snd_seq_midi           16384  0
snd_seq_midi_event     16384  1 snd_seq_midi
snd_rawmidi            36864  1 snd_seq_midi
snd_sof                98304  0
snd_sof_xtensa_dsp     20480  0
snd_soc_core          258048  3 snd_soc_pcm512x,snd_soc_sst_bxt_pcm512x,snd_sof

I modified the script to also unload those and still got no warning. @ZhendanYang could you try to reproduce the problem with the codec board unplugged? That would show whether I've done something wrong or whether it really only occurs with a codec.
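
To check which sound modules survive the "remove" script, the lsmod output can be filtered mechanically. The following is a minimal sketch, not part of the original script: the function name and the SAMPLE text are illustrative assumptions (SAMPLE is abridged from the list above, with a header line and one unrelated module added so the filter has something to skip).

```python
def loaded_sound_modules(lsmod_text):
    """Return {module: [users]} for every listed module whose name starts with 'snd'."""
    modules = {}
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        name = fields[0]
        # The optional fourth column is a comma-separated list of users.
        users = fields[3].split(",") if len(fields) > 3 else []
        if name.startswith("snd"):
            modules[name] = users
    return modules


# Illustrative input: abridged from the thread, plus one non-sound module.
SAMPLE = """\
Module                  Size  Used by
snd_soc_pcm512x_i2c    16384  0
snd_soc_pcm512x        32768  1 snd_soc_pcm512x_i2c
snd_sof                98304  0
snd_soc_core          258048  3 snd_soc_pcm512x,snd_soc_sst_bxt_pcm512x,snd_sof
i2c_i801               32768  0
"""
```

On a real system the text would come from running lsmod after the "remove" script; an empty result would mean nothing with an snd prefix is still loaded.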

from linux.

ZhendanYang commented on August 16, 2024

@lyakh
I modified the "remove" script to unload those snd-soc modules.
I added these lines:
remove_module snd_soc_hdac_hdmi
remove_module snd_soc_hdac_hda
remove_module snd_soc_core
remove_module snd_seq_midi
remove_module snd_seq_midi_event
remove_module snd_rawmidi
When I then ran sof_remove.sh, the system crashed. I had to reboot with the power button.
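
A module can normally only be removed once nothing else uses it, so the "Used by" column from lsmod implies an unload order. The sketch below derives such an order from the module list quoted earlier in this thread. It is an illustration only: unload_order and USERS_OF are hypothetical names, and whether ordering played any part in the crash is not established here.

```python
def unload_order(users_of):
    """Order modules so that each one comes after all of its users."""
    order, visiting, done = [], set(), set()

    def visit(mod):
        if mod in done:
            return
        if mod in visiting:
            raise ValueError(f"dependency cycle at {mod}")
        visiting.add(mod)
        # Every user of this module has to be removed first.
        for user in users_of.get(mod, []):
            visit(user)
        visiting.discard(mod)
        done.add(mod)
        order.append(mod)

    for mod in sorted(users_of):
        visit(mod)
    return order


# "Used by" relations taken from the lsmod output quoted in this thread.
USERS_OF = {
    "snd_soc_sst_bxt_pcm512x": [],
    "snd_soc_pcm512x_i2c": [],
    "snd_soc_pcm512x": ["snd_soc_pcm512x_i2c"],
    "snd_seq_midi": [],
    "snd_seq_midi_event": ["snd_seq_midi"],
    "snd_rawmidi": ["snd_seq_midi"],
    "snd_sof": [],
    "snd_sof_xtensa_dsp": [],
    "snd_soc_core": ["snd_soc_pcm512x", "snd_soc_sst_bxt_pcm512x", "snd_sof"],
}
```

With this input, snd_soc_core lands after snd_soc_pcm512x, snd_soc_sst_bxt_pcm512x and snd_sof, and snd_seq_midi lands before snd_seq_midi_event and snd_rawmidi.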

lyakh commented on August 16, 2024

@ZhendanYang do you have a kernel log from that crash? The logs attached to this issue contain "only" a kernel warning. Have you tried with no codec?

lyakh commented on August 16, 2024

I've got a "nocodec" Up^2 configuration that seems to work: at least aplay and arecord return after some time and I'm seeing interrupts, although they take longer than they should. And I've got an Oops! Investigating...

lyakh commented on August 16, 2024

I think I've found the reason for this Oops. It seems to be a case of use-after-free. First the DAI driver gets freed along the lines of

sof_pcm_free()
snd_sof_free_topology()
snd_soc_tplg_component_remove()
remove_dai()

Then, as resources are freed, the driver's .remove method is checked. Initially it was NULL, but since the memory has been released, it often seems to be reused in our use cases, and the .remove pointer then contains random data (if that memory is still accessible at all, of course). That happens as

devm_card_release()
snd_soc_unregister_card()
soc_cleanup_card_resources()
soc_remove_dai_links()
soc_remove_link_dais()
soc_remove_dai()
dai->driver->remove()

I'm looking at ways to fix this.

mengdonglin commented on August 16, 2024

@ZhendanYang Could you check if this issue is solved?

ZhendanYang commented on August 16, 2024

@mengdonglin
Checked. This issue still exists.
Test env:
kernel: sof-dev 160b45f
sof: master 81eae
sof tool: 57b5212
tplg: sof-apl-nocodec.tplg on APL and test-ssp0-mclk-0-I2S-volume-s16le-s16le-48k-24000k-nocodec.tplg on CNL

lgirdwood commented on August 16, 2024

@ZhendanYang is it the same oops message or a different one now?

ZhendanYang commented on August 16, 2024

@lgirdwood
The oops message is the same as before.

lyakh commented on August 16, 2024

@lgirdwood I think it's a different Oops. As I mentioned above, this whole "issue" is confusing, because the attached and quoted logs don't actually contain Oops logs, only some kernel warnings. I was able to reproduce an Oops again, upon the second reloading of the drivers. This time it's a NULL cpu_dai->driver pointer. While I'm investigating it, I wanted to ask whether the actual behaviour is correct even when no Oops is happening. I added some printk()s to the driver and here's what I'm seeing:

Sep 27 13:49:48 UP-APL01 kernel: [    6.392837] soc_bind_dai_link(): 879 00000000742f5da3 SSP0 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392842] soc_bind_dai_link(): 879 00000000dd7bb785 SSP1 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392847] soc_bind_dai_link(): 879 00000000fb5b92c4 SSP2 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392849] soc_bind_dai_link(): 879 000000002601e15b SSP3 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392855] soc_bind_dai_link(): 879 000000006d4fd906 SSP4 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392857] soc_bind_dai_link(): 879 000000002c00faf7 SSP5 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392861] soc_bind_dai_link(): 879 000000009532faa9 DMIC01 Pin
Sep 27 13:49:48 UP-APL01 kernel: [    6.392864] soc_bind_dai_link(): 879 00000000fe1882de DMIC16k Pin

First round of DAI linking

Sep 27 13:49:48 UP-APL01 kernel: [    6.406062] soc_bind_dai_link(): 879 0000000065557754 Port0 0
Sep 27 13:49:48 UP-APL01 kernel: [    6.406072] soc_bind_dai_link(): 879 00000000bf579ef9 Port1 1
Sep 27 13:49:48 UP-APL01 kernel: [    6.406076] soc_bind_dai_link(): 879 00000000e37aa70b Port2 2
Sep 27 13:49:48 UP-APL01 kernel: [    6.406081] soc_bind_dai_link(): 879 0000000000c6d7bb Port3 3

Second round of DAI linking

Sep 27 13:49:51 UP-APL01 kernel: [    9.679756] soc_pcm_trigger: 1100 (00000000eee08c57:subdevice #0) 0000000065557754 Port0 0 1
Sep 27 13:49:51 UP-APL01 kernel: [    9.679997] soc_pcm_trigger: 1100 (000000002341c4cc:subdevice #0) 00000000742f5da3 SSP0 Pin 1
Sep 27 13:49:51 UP-APL01 kernel: [    9.715892] soc_pcm_trigger: 1100 (00000000aac26c92:subdevice #0) 0000000065557754 Port0 0 1
Sep 27 13:49:51 UP-APL01 kernel: [    9.716281] soc_pcm_trigger: 1100 (000000005909a405:subdevice #0) 00000000742f5da3 SSP0 Pin 1

Above, soc_pcm_trigger() is called 4 times with cmd=SNDRV_PCM_TRIGGER_START. Every call is for a different substream (the first pointer in parentheses), but all substreams have the same name, "subdevice #0". Also note that the first and third substreams, as well as the second and fourth, link to the same cpu_dai.
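
The pairing of substreams and cpu_dai pointers can also be extracted from the log automatically. The sketch below parses the debug lines in the format shown above and groups substream pointers by cpu_dai pointer for one trigger command. The printk format is the ad-hoc one from this comment, and the regular expression and function name are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches e.g. "soc_pcm_trigger: 1100 (<substream>:subdevice #0) <dai> Port0 0 1"
TRIGGER_RE = re.compile(
    r"soc_pcm_trigger: \d+ "
    r"\((?P<substream>[0-9a-f]+):(?P<name>[^)]*)\) "
    r"(?P<dai>[0-9a-f]+) (?P<dai_name>.+) (?P<cmd>\d+)$"
)

TRIGGER_STOP, TRIGGER_START = 0, 1  # values of the trailing cmd field in the log


def substreams_by_dai(log_text, cmd):
    """Map each cpu_dai pointer to the substream pointers triggered with cmd."""
    groups = defaultdict(list)
    for line in log_text.splitlines():
        match = TRIGGER_RE.search(line.strip())
        if match and int(match.group("cmd")) == cmd:
            groups[match.group("dai")].append(match.group("substream"))
    return dict(groups)


# The four START lines quoted above.
SAMPLE_LOG = """\
Sep 27 13:49:51 UP-APL01 kernel: [    9.679756] soc_pcm_trigger: 1100 (00000000eee08c57:subdevice #0) 0000000065557754 Port0 0 1
Sep 27 13:49:51 UP-APL01 kernel: [    9.679997] soc_pcm_trigger: 1100 (000000002341c4cc:subdevice #0) 00000000742f5da3 SSP0 Pin 1
Sep 27 13:49:51 UP-APL01 kernel: [    9.715892] soc_pcm_trigger: 1100 (00000000aac26c92:subdevice #0) 0000000065557754 Port0 0 1
Sep 27 13:49:51 UP-APL01 kernel: [    9.716281] soc_pcm_trigger: 1100 (000000005909a405:subdevice #0) 00000000742f5da3 SSP0 Pin 1
"""
```

For the sample, substreams_by_dai(SAMPLE_LOG, TRIGGER_START) yields two cpu_dai pointers with two substreams each, which matches the observation that the first and third, and the second and fourth, calls share a cpu_dai.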

Sep 27 13:49:56 UP-APL01 kernel: [   14.725543] soc_pcm_trigger: 1100 (00000000aac26c92:subdevice #0) 0000000065557754 Port0 0 0
Sep 27 13:49:56 UP-APL01 kernel: [   14.726164] soc_pcm_trigger: 1100 (000000005909a405:subdevice #0) 00000000742f5da3 SSP0 Pin 0
Sep 27 13:49:56 UP-APL01 kernel: [   14.728767] soc_pcm_trigger: 1100 (00000000eee08c57:subdevice #0) 0000000065557754 Port0 0 0
Sep 27 13:49:56 UP-APL01 kernel: [   14.729151] soc_pcm_trigger: 1100 (000000002341c4cc:subdevice #0) 00000000742f5da3 SSP0 Pin 0

Now the command is SNDRV_PCM_TRIGGER_STOP. The order is reversed, but otherwise everything looks good.

Sep 27 13:51:44 UP-APL01 kernel: [  122.174468] soc_bind_dai_link(): 879 000000003534241d SSP0 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174478] soc_bind_dai_link(): 879 000000004ae88687 SSP1 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174485] soc_bind_dai_link(): 879 00000000f1a271ee SSP2 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174490] soc_bind_dai_link(): 879 00000000d6d652bd SSP3 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174496] soc_bind_dai_link(): 879 000000005f91f001 SSP4 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174502] soc_bind_dai_link(): 879 00000000c819f31d SSP5 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174509] soc_bind_dai_link(): 879 0000000044ea8ea5 DMIC01 Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.174516] soc_bind_dai_link(): 879 000000006046078f DMIC16k Pin
Sep 27 13:51:44 UP-APL01 kernel: [  122.187528] soc_bind_dai_link(): 879 00000000ab938673 Port0 0
Sep 27 13:51:44 UP-APL01 kernel: [  122.187534] soc_bind_dai_link(): 879 00000000bfeeaf3e Port1 1
Sep 27 13:51:44 UP-APL01 kernel: [  122.187538] soc_bind_dai_link(): 879 0000000063ced702 Port2 2
Sep 27 13:51:44 UP-APL01 kernel: [  122.187541] soc_bind_dai_link(): 879 00000000d2c3a3d7 Port3 3

The linking stage again upon loading of the modules.

Sep 27 13:51:44 UP-APL01 kernel: [  122.310478] soc_pcm_trigger: 1100 (00000000a57839e2:subdevice #0) 00000000ab938673 Port0 0 1
Sep 27 13:51:44 UP-APL01 kernel: [  122.310912] soc_pcm_trigger: 1100 (00000000da02ed2e:subdevice #0) 000000003534241d SSP0 Pin 1
Sep 27 13:51:44 UP-APL01 kernel: [  122.347331] soc_pcm_trigger: 1100 (00000000bb37370b:subdevice #0) 00000000ab938673 Port0 0 1
Sep 27 13:51:44 UP-APL01 kernel: [  122.347839] soc_pcm_trigger: 1100 (000000004cd8f939:subdevice #0) 000000003534241d SSP0 Pin 1

START still looks good.

Sep 27 13:51:45 UP-APL01 kernel: [  123.369430] soc_pcm_trigger: 1100 (00000000a57839e2:subdevice #0) 00000000ab938673 Port0 0 0
Sep 27 13:51:45 UP-APL01 kernel: [  123.369751] soc_pcm_trigger(): 1125           (null)           (null)

Right after the first STOP command, driver=NULL... Investigating.

lyakh commented on August 16, 2024

The difference between the first and the second module unloading runs is the order in which remove_dai() and snd_pcm_do_stop() are executed. During the first module unloading the order is

snd_pcm_do_stop()
remove_dai()

During the second unloading it's the opposite. remove_dai() is always called when the sof-pci-dev driver is unloaded. snd_pcm_do_stop(), on the other hand, is called upon the first unloading for the SNDRV_PCM_IOCTL_DROP ioctl(), which seems to come from the pulseaudio daemon, whereas in the second case it's called from snd_pcm_release(). @lgirdwood any ideas what might be going on there with pulseaudio? As far as I understand, the problem isn't that one of the orders is not allowed; both seem to be legitimate, so handling of the second one has to be fixed...

Update: in the first unload case the call to snd_pcm_do_stop() is unrelated to the unloading itself. It is indeed just called from an ioctl() during user-space (pulseaudio daemon) initialisation. That ioctl() comes 5 seconds after loading the modules. So if a 6 second delay is added to the test loop, the kernel Oops disappears.

lyakh commented on August 16, 2024

I added a bunch of checks for "cpu_dai->driver == NULL" and that seems to avoid that type of Oops, but I don't think that's a correct solution. It really looks like the driver gets freed too early, and I'm not sure how to avoid that. Furthermore, after eliminating those bugs I hit another one, beginning with an "error: unexpected fault 0x00000000 trace 0x00004000" and leading to "probe of sof-nocodec failed with error -22" and an Oops.

lgirdwood commented on August 16, 2024

@lyakh sorry, not following - is there a stream playing when you try to remove? If so, we probably need to increase the module ref count.

bardliao commented on August 16, 2024

@mengdonglin @ZhendanYang PR #311 can solve this error:

[   62.430791] sof-audio sof-audio: HDA codec #2 probed OK: response: 8086280a
[   62.431343] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:0e.0/ehdaudio0D2'

And I can get a "boot success" result on Up2 with the pcm512x codec. But the device will sometimes hang for other reasons.

plbossart commented on August 16, 2024

@bardliao I had a similar issue with the sof device #259 and I 'fixed' it by changing the order in which the inits were done. We may be doing something in the PCI device that doesn't belong there?

What I did was look in /sys/devices/pci0000:00/0000:00:0e.0 and check whether the subdevices were freed when the modules were removed.
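
That sysfs check is easy to script. The sketch below lists entries under the PCI device directory whose names start with "ehdaudio", the prefix seen in the duplicate-filename error quoted earlier. The function name is an assumption, and the prefix is a parameter because other child devices may be named differently.

```python
import os
import tempfile


def leftover_subdevices(pci_dev_dir, prefix="ehdaudio"):
    """Return the sorted child entries of pci_dev_dir that start with prefix."""
    try:
        entries = os.listdir(pci_dev_dir)
    except FileNotFoundError:
        # The PCI device itself is absent, so nothing can be left over.
        return []
    return sorted(entry for entry in entries if entry.startswith(prefix))


def demo():
    """Exercise the helper against a throwaway directory instead of real sysfs."""
    with tempfile.TemporaryDirectory() as fake_sysfs:
        os.mkdir(os.path.join(fake_sysfs, "ehdaudio0D2"))
        os.mkdir(os.path.join(fake_sysfs, "power"))
        return leftover_subdevices(fake_sysfs)
```

After removing the modules, leftover_subdevices("/sys/devices/pci0000:00/0000:00:0e.0") returning a non-empty list would suggest that a subdevice was not freed, which is the situation in which a later reload would hit the "cannot create duplicate filename" error.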

bardliao commented on August 16, 2024

@mengdonglin @keqiaozhang Could QA do a test for this issue? It is not stable yet, but we made some progress.

bardliao commented on August 16, 2024

@plbossart I found that we allocate hdac_dev with devm_kzalloc, but it ends up being freed with kfree. I believe that is the root cause of the current issue. The same code is also used in intel/skylake/skl.c, so the same issue may also exist in the SST driver.
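
Why mixing the two is a problem can be shown with a toy model of device-managed allocations. This is Python, not kernel code: the Device class and its methods are hypothetical stand-ins, and the model captures only one aspect, namely that a managed allocation is released again when the driver detaches.

```python
class DoubleFree(Exception):
    """Raised when the model releases the same allocation twice."""


class Device:
    """Toy stand-in for a device that tracks managed allocations."""

    def __init__(self):
        self.managed = []   # handles released automatically on detach
        self.live = set()   # handles that are still valid
        self._next = 0

    def devm_alloc(self):
        # Managed allocation: remembered so that detach() can release it.
        self._next += 1
        self.live.add(self._next)
        self.managed.append(self._next)
        return self._next

    def free(self, handle):
        # Plain free: knows nothing about the managed list.
        if handle not in self.live:
            raise DoubleFree(handle)
        self.live.remove(handle)

    def devm_free(self, handle):
        # Managed free: drops the bookkeeping entry as well.
        self.managed.remove(handle)
        self.free(handle)

    def detach(self):
        # On driver detach every remaining managed allocation is released.
        while self.managed:
            self.free(self.managed.pop())
```

In the model, devm_alloc() followed by free() and then detach() raises DoubleFree, whereas using devm_free() instead, or simply leaving the release to detach(), does not. In the kernel the managed counterpart of kfree() is devm_kfree().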

plbossart commented on August 16, 2024

mengdonglin commented on August 16, 2024

@stevyan Could you help verify the solution on these platforms: BYT, APL, CNL, WHL?

markyang commented on August 16, 2024

Summary:
This issue cannot be reproduced on BYT (more than 10 times).

But it can still be reproduced on WHL (1st time), CNL (2nd time), and APL (5th time) after unloading the sof modules.

Test env:
sof master: b5d6c71
kernel sof-dev: a71221d

Log:
dmesg-apl.txt
dmesg-cnl.txt
dmesg-whl.txt

mengdonglin commented on August 16, 2024

@markyang Thanks for your verification! I need your help to file 2 new bugs for APL and WHL respectively, based on the log. Please also clarify the topology file used in the new bugs.

  • For APL, please file a new bug, since this bug tracks the unloading/reloading failure that happens the 1st time, e.g. "Kernel panic after unloading and reloading driver for multiple times on APL".

  • For WHL, it fails for a different reason, so please file a separate bug.

  • For CNL, please don't test this case for now, since it fails on power management, and PM is a feature still in progress for CNL/WHL.

markyang commented on August 16, 2024

@mengdonglin The following topologies were used:
APL: sof-apl-pcm512x.tplg
CNL: test-ssp0-mclk-0-I2S-volume-s16le-s16le-48k-24000k-nocodec.tplg
WHL: sof-hda-generic.tplg

markyang commented on August 16, 2024

Summary:
This issue cannot be reproduced on APL with PCM512x.
Test steps:

1. remove sound modules:
  sudo modprobe -r sof_pci_dev snd_sof_nocodec snd_sof_intel_hda_common snd_sof_intel_hda snd_sof_intel_byt snd_sof_xtensa_dsp snd_sof snd_soc_acpi_intel_match snd_soc_acpi snd_soc_dmic snd_soc_wm8804_i2c snd_soc_wm8804 snd_soc_pcm512x_i2c snd_soc_pcm512x snd_soc_sst_bxt_pcm512x snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_core snd_pcm
2. load sound modules:
  sudo modprobe snd_soc_pcm512x_i2c 
  sudo modprobe sof_pci_dev
3. repeat step 1 and 2 (10 times)
4. aplay -l
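
The test steps above can be sketched as a small driver loop. The command lists are copied from steps 1, 2 and 4. Everything else is an assumption made for illustration: the function names, the injectable run and sleep callables (so the loop can be dry-run without touching modules), and the settle_s parameter, which models the delay between load and unload mentioned earlier in this thread.

```python
import subprocess
import time

# Step 1: the module list from the modprobe -r command above.
REMOVE_CMD = (
    "sudo modprobe -r sof_pci_dev snd_sof_nocodec snd_sof_intel_hda_common "
    "snd_sof_intel_hda snd_sof_intel_byt snd_sof_xtensa_dsp snd_sof "
    "snd_soc_acpi_intel_match snd_soc_acpi snd_soc_dmic snd_soc_wm8804_i2c "
    "snd_soc_wm8804 snd_soc_pcm512x_i2c snd_soc_pcm512x snd_soc_sst_bxt_pcm512x "
    "snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_core snd_pcm"
).split()

# Step 2: load the codec driver, then the SOF PCI driver.
LOAD_CMDS = [
    ["sudo", "modprobe", "snd_soc_pcm512x_i2c"],
    ["sudo", "modprobe", "sof_pci_dev"],
]


def run_checked(cmd):
    """Run one command for real and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


def reload_cycle(run, rounds=10, settle_s=0.0, sleep=time.sleep):
    """Unload and reload the modules `rounds` times, then list playback devices."""
    for _ in range(rounds):
        run(REMOVE_CMD)
        for cmd in LOAD_CMDS:
            run(cmd)
        if settle_s:
            # Optional pause so user space can finish its own initialisation.
            sleep(settle_s)
    run(["aplay", "-l"])  # step 4
```

A dry run such as reload_cycle(print, rounds=1) only prints the commands, while reload_cycle(run_checked, settle_s=6) would execute them with a 6 second pause after each load.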

Test env:
sof master: ba4054b
kernel sof-dev: f9b2f98
tplg: sof-apl-pcm512x.tplg

markyang commented on August 16, 2024

Summary:
Closing the issue. A new issue (/issues/466) has been created to follow the panic on WHL.

bardliao commented on August 16, 2024

@markyang How about #46 ?

markyang commented on August 16, 2024

@bardliao I have not retested it on BYT nocodec yet. It's okay on BYT with ALC5651 at least.
