Comments (26)
First of all I think the topic of this bug is not quite correct: this isn't a kernel panic, this is a kernel WARNING. Still, it has to be fixed of course.
I've tried to reproduce the bug but have failed so far. The difference is that my Up^2 has no codec. I tried both topology variants: one with a codec and a "nocodec" one. In both cases the kernel reported some errors, so apparently the drivers didn't complete registration. That might explain why a different path was taken when unloading and reloading the modules, so no warnings were generated. I also noticed that not all snd-soc modules were unloaded by the "remove" script. These modules were kept:
snd_soc_sst_bxt_pcm512x 16384 0
snd_soc_pcm512x_i2c 16384 0
snd_soc_pcm512x 32768 1 snd_soc_pcm512x_i2c
snd_seq_midi 16384 0
snd_seq_midi_event 16384 1 snd_seq_midi
snd_rawmidi 36864 1 snd_seq_midi
snd_sof 98304 0
snd_sof_xtensa_dsp 20480 0
snd_soc_core 258048 3 snd_soc_pcm512x,snd_soc_sst_bxt_pcm512x,snd_sof
I modified the script to unload those as well and still got no warning. @ZhendanYang could you try to reproduce the problem with the codec board unplugged, to check whether I've done something wrong or whether it indeed only occurs with a codec?
from linux.
@lyakh
I modified the "remove" script to unload those snd-soc modules by adding these lines:
remove_module snd_soc_hdac_hdmi
remove_module snd_soc_hdac_hda
remove_module snd_soc_core
remove_module snd_seq_midi
remove_module snd_seq_midi_event
remove_module snd_rawmidi
When I then ran sof_remove.sh, the system crashed and I had to reboot using the power button.
from linux.
@ZhendanYang do you have a kernel log from that crash? The logs attached to this issue "only" contain a kernel warning. Have you tried with no codec?
from linux.
I've got a "nocodec" Up^2 configuration that seems to work - at least aplay and arecord return after some time and I'm seeing interrupts. Although the time is larger than it should be. And I've got an Oops! Investigating...
from linux.
I think I've found the reason for this Oops. It seems to be a case of a use after free. First the DAI driver gets freed along the lines of
sof_pcm_free()
snd_sof_free_topology()
snd_soc_tplg_component_remove()
remove_dai()
And then, as resources are freed, the driver's .remove method is checked. Initially it was NULL, but since that memory has been released, it often seems to be reused in our use cases, and the .remove pointer then contains random data (provided the memory is still accessible at all). That happens in
devm_card_release()
snd_soc_unregister_card()
soc_cleanup_card_resources()
soc_remove_dai_links()
soc_remove_link_dais()
soc_remove_dai()
dai->driver->remove()
I'm looking at ways to fix this.
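The two call chains above can be modelled in plain userspace C to make the ordering problem easier to see. This is only a sketch: the `model_*` names, the `live` flag that stands in for freed memory, and the `clear_reference` option are all hypothetical and are not taken from the kernel sources or from any fix discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: these structs mimic the shape of the objects
 * named in the thread, they are NOT the kernel definitions. */
struct model_dai_driver {
    int (*remove)(void);   /* NULL when the driver has no .remove hook */
    int live;              /* 1 while the driver memory is still valid */
};

struct model_dai {
    struct model_dai_driver *driver;
};

/* Models remove_dai(): the topology code releases the DAI driver.
 * Dropping the DAI's reference at the same time is a hypothetical
 * mitigation, not the fix that was eventually applied. */
static void model_remove_dai(struct model_dai *dai, int clear_reference)
{
    dai->driver->live = 0;          /* stands in for freeing the memory */
    if (clear_reference)
        dai->driver = NULL;
}

/* Models soc_remove_dai(): returns 0 if the teardown was safe,
 * -1 if it would have dereferenced released memory. */
static int model_soc_remove_dai(struct model_dai *dai)
{
    if (dai->driver == NULL)
        return 0;                   /* nothing left to call */
    if (!dai->driver->live)
        return -1;                  /* the use after free described above */
    if (dai->driver->remove)
        return dai->driver->remove();
    return 0;
}

/* Runs both teardown steps in the order reported in the thread. */
static int model_teardown(int clear_reference)
{
    struct model_dai_driver drv = { .remove = NULL, .live = 1 };
    struct model_dai dai = { .driver = &drv };

    assert(dai.driver->live);
    model_remove_dai(&dai, clear_reference);
    return model_soc_remove_dai(&dai);
}
```

In this model, `model_teardown(0)` reports the stale access, while `model_teardown(1)` does not, because the second stage no longer sees a pointer to released memory.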
from linux.
@ZhendanYang Could you check if this issue is solved?
from linux.
@mengdonglin
Checked. This issue still exists.
Test env:
kernel: sof-dev 160b45f
sof: master 81eae
sof tool: 57b5212
tplg: sof-apl-nocodec.tplg on APL and test-ssp0-mclk-0-I2S-volume-s16le-s16le-48k-24000k-nocodec.tplg on CNL
from linux.
@ZhendanYang is it the same oops message or different now ?
from linux.
@lgirdwood
The oops message is the same as before.
from linux.
@lgirdwood I think it's a different Oops. As I mentioned above, this whole "issue" is confusing, because attached and quoted logs don't actually provide Oops logs, but some kernel warnings. I was able to reproduce an Oops again - upon the second reloading of the drivers... This time it's a NULL cpu_dai->driver pointer. While I'm investigating it, I wanted to ask whether the actual behaviour is correct even when no Oops is happening. I added some printk()s to the driver and here's what I'm seeing:
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392837] soc_bind_dai_link(): 879 00000000742f5da3 SSP0 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392842] soc_bind_dai_link(): 879 00000000dd7bb785 SSP1 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392847] soc_bind_dai_link(): 879 00000000fb5b92c4 SSP2 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392849] soc_bind_dai_link(): 879 000000002601e15b SSP3 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392855] soc_bind_dai_link(): 879 000000006d4fd906 SSP4 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392857] soc_bind_dai_link(): 879 000000002c00faf7 SSP5 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392861] soc_bind_dai_link(): 879 000000009532faa9 DMIC01 Pin
Sep 27 13:49:48 UP-APL01 kernel: [ 6.392864] soc_bind_dai_link(): 879 00000000fe1882de DMIC16k Pin
First round of DAI linking
Sep 27 13:49:48 UP-APL01 kernel: [ 6.406062] soc_bind_dai_link(): 879 0000000065557754 Port0 0
Sep 27 13:49:48 UP-APL01 kernel: [ 6.406072] soc_bind_dai_link(): 879 00000000bf579ef9 Port1 1
Sep 27 13:49:48 UP-APL01 kernel: [ 6.406076] soc_bind_dai_link(): 879 00000000e37aa70b Port2 2
Sep 27 13:49:48 UP-APL01 kernel: [ 6.406081] soc_bind_dai_link(): 879 0000000000c6d7bb Port3 3
Second round of DAI linking
Sep 27 13:49:51 UP-APL01 kernel: [ 9.679756] soc_pcm_trigger: 1100 (00000000eee08c57:subdevice #0) 0000000065557754 Port0 0 1
Sep 27 13:49:51 UP-APL01 kernel: [ 9.679997] soc_pcm_trigger: 1100 (000000002341c4cc:subdevice #0) 00000000742f5da3 SSP0 Pin 1
Sep 27 13:49:51 UP-APL01 kernel: [ 9.715892] soc_pcm_trigger: 1100 (00000000aac26c92:subdevice #0) 0000000065557754 Port0 0 1
Sep 27 13:49:51 UP-APL01 kernel: [ 9.716281] soc_pcm_trigger: 1100 (000000005909a405:subdevice #0) 00000000742f5da3 SSP0 Pin 1
Above, soc_pcm_trigger() is called 4 times with cmd=SNDRV_PCM_TRIGGER_START. Every call is for a different substream (the first pointer in parentheses), but all substreams have the same name, "subdevice #0". Also note that the first and third substreams link to the same cpu_dai, as do the second and fourth.
Sep 27 13:49:56 UP-APL01 kernel: [ 14.725543] soc_pcm_trigger: 1100 (00000000aac26c92:subdevice #0) 0000000065557754 Port0 0 0
Sep 27 13:49:56 UP-APL01 kernel: [ 14.726164] soc_pcm_trigger: 1100 (000000005909a405:subdevice #0) 00000000742f5da3 SSP0 Pin 0
Sep 27 13:49:56 UP-APL01 kernel: [ 14.728767] soc_pcm_trigger: 1100 (00000000eee08c57:subdevice #0) 0000000065557754 Port0 0 0
Sep 27 13:49:56 UP-APL01 kernel: [ 14.729151] soc_pcm_trigger: 1100 (000000002341c4cc:subdevice #0) 00000000742f5da3 SSP0 Pin 0
Now the command = SNDRV_PCM_TRIGGER_STOP, the order is reversed but otherwise everything looks good.
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174468] soc_bind_dai_link(): 879 000000003534241d SSP0 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174478] soc_bind_dai_link(): 879 000000004ae88687 SSP1 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174485] soc_bind_dai_link(): 879 00000000f1a271ee SSP2 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174490] soc_bind_dai_link(): 879 00000000d6d652bd SSP3 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174496] soc_bind_dai_link(): 879 000000005f91f001 SSP4 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174502] soc_bind_dai_link(): 879 00000000c819f31d SSP5 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174509] soc_bind_dai_link(): 879 0000000044ea8ea5 DMIC01 Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.174516] soc_bind_dai_link(): 879 000000006046078f DMIC16k Pin
Sep 27 13:51:44 UP-APL01 kernel: [ 122.187528] soc_bind_dai_link(): 879 00000000ab938673 Port0 0
Sep 27 13:51:44 UP-APL01 kernel: [ 122.187534] soc_bind_dai_link(): 879 00000000bfeeaf3e Port1 1
Sep 27 13:51:44 UP-APL01 kernel: [ 122.187538] soc_bind_dai_link(): 879 0000000063ced702 Port2 2
Sep 27 13:51:44 UP-APL01 kernel: [ 122.187541] soc_bind_dai_link(): 879 00000000d2c3a3d7 Port3 3
The linking stage again upon loading of the modules.
Sep 27 13:51:44 UP-APL01 kernel: [ 122.310478] soc_pcm_trigger: 1100 (00000000a57839e2:subdevice #0) 00000000ab938673 Port0 0 1
Sep 27 13:51:44 UP-APL01 kernel: [ 122.310912] soc_pcm_trigger: 1100 (00000000da02ed2e:subdevice #0) 000000003534241d SSP0 Pin 1
Sep 27 13:51:44 UP-APL01 kernel: [ 122.347331] soc_pcm_trigger: 1100 (00000000bb37370b:subdevice #0) 00000000ab938673 Port0 0 1
Sep 27 13:51:44 UP-APL01 kernel: [ 122.347839] soc_pcm_trigger: 1100 (000000004cd8f939:subdevice #0) 000000003534241d SSP0 Pin 1
START still looks good.
Sep 27 13:51:45 UP-APL01 kernel: [ 123.369430] soc_pcm_trigger: 1100 (00000000a57839e2:subdevice #0) 00000000ab938673 Port0 0 0
Sep 27 13:51:45 UP-APL01 kernel: [ 123.369751] soc_pcm_trigger(): 1125 (null) (null)
On the first STOP command, driver=NULL... Investigating.
from linux.
The difference between the first and the second module unloading runs is the order of remove_dai() and snd_pcm_do_stop() execution. During the first module unloading the order is
snd_pcm_do_stop()
remove_dai()
During the second unloading it's the opposite. remove_dai() is always called when the sof-pci-dev driver is unloaded. snd_pcm_do_stop(), on the other hand, is called during the first unloading for the SNDRV_PCM_IOCTL_DROP ioctl(), which seems to come from the pulseaudio daemon, whereas in the second case it's called from snd_pcm_release(). @lgirdwood any ideas what might be going on there with pulseaudio? As far as I understand, the problem isn't that one of the orders is not allowed: both seem to be legitimate, so handling of the second one has to be fixed...
Update: In the first unload case the call to snd_pcm_do_stop() is unrelated to the unloading itself. It is indeed just called from an ioctl() during user-space (pulseaudio daemon) initialisation. That ioctl() comes 5 seconds after loading the modules. So, if a 6-second delay is added to the test loop, the kernel Oops disappears.
from linux.
I added a bunch of checks for "cpu_dai->driver == NULL", and that seems to avoid that type of Oops, but I don't think it's a correct solution. It really looks like the driver gets freed too early, and I'm not sure how to avoid that. Furthermore, after eliminating those bugs I hit another one, beginning with an "error: unexpected fault 0x00000000 trace 0x00004000" and leading to "probe of sof-nocodec failed with error -22" and an Oops.
from linux.
@lyakh sorry, not following - is there a stream playing when you try and remove ? if so we probably need to increase mod ref count.
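The reference-count idea in the comment above can be illustrated with a small userspace model. This is a sketch under assumptions: the `model_*` names are hypothetical, and the counter only imitates the general pattern of refusing removal while a user holds a reference. It does not reproduce the kernel's module reference API.

```c
#include <assert.h>
#include <errno.h>

/* Userspace model of a module with a usage counter. */
struct model_module {
    int refcount;   /* number of open streams holding the module */
    int loaded;     /* 1 while the module is present */
};

/* Opening a stream takes a reference on the module. */
static int model_stream_open(struct model_module *m)
{
    if (!m->loaded)
        return -ENODEV;
    m->refcount++;
    return 0;
}

/* Closing a stream drops that reference again. */
static void model_stream_close(struct model_module *m)
{
    assert(m->refcount > 0);
    m->refcount--;
}

/* Removal is refused while any stream still holds a reference. */
static int model_module_remove(struct model_module *m)
{
    if (m->refcount > 0)
        return -EBUSY;
    m->loaded = 0;
    return 0;
}

/* Opens a stream, optionally closes it, then attempts removal. */
static int model_try_remove(int close_stream_first)
{
    struct model_module mod = { .refcount = 0, .loaded = 1 };
    int ret = model_stream_open(&mod);

    if (ret)
        return ret;
    if (close_stream_first)
        model_stream_close(&mod);
    return model_module_remove(&mod);
}
```

With a stream still open, `model_try_remove(0)` returns `-EBUSY`; once the stream is closed first, `model_try_remove(1)` returns 0.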
from linux.
@mengdonglin @ZhendanYang PR #311 can solve the following error:
[ 62.430791] sof-audio sof-audio: HDA codec #2 probed OK: response: 8086280a
[ 62.431343] sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:0e.0/ehdaudio0D2'
With it I get a "boot success" result on the Up2 with the pcm512x codec. However, the device still sometimes hangs for other reasons.
from linux.
@bardliao I had a similar issue with the sof device #259 and I 'fixed' it by changing the order in which the inits were done. We may be doing something in the PCI device that doesn't belong there?
What I did was look in /sys/devices/pci0000:00/0000:00:0e.0 and check whether the subdevices were freed when the modules were removed.
from linux.
@mengdonglin @keqiaozhang Could QA do a test for this issue? It is not stable yet, but we made some progress.
from linux.
@plbossart I found that we allocate hdac_dev with devm_kzalloc, but it is eventually freed with kfree. I believe that is the root cause of the current issue. The same code is also used in intel/skylake/skl.c, so the same issue may also exist in the SST driver.
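Why mixing a device-managed allocation with a manual free is a problem can be shown with a userspace model. This is a minimal sketch, assuming a simplified managed-resource list: the `model_*` names are hypothetical, and the model counts the would-be double free instead of performing it. In the kernel, the matching early release for devm_kzalloc() memory is devm_kfree(), which also drops the entry from the device's resource list.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_MAX_RES 8

/* Userspace model of a device that tracks its managed allocations. */
struct model_device {
    void *managed[MODEL_MAX_RES];
    int count;
};

/* Models a device-managed allocation: the pointer is recorded so that
 * the device can free it automatically when it is released. */
static void *model_devm_alloc(struct model_device *dev, size_t size)
{
    void *p;

    if (dev->count >= MODEL_MAX_RES)
        return NULL;
    p = calloc(1, size);
    if (p)
        dev->managed[dev->count++] = p;
    return p;
}

/* Returns 1 if the pointer is still tracked by the device. */
static int model_is_managed(const struct model_device *dev, const void *p)
{
    int i;

    for (i = 0; i < dev->count; i++)
        if (dev->managed[i] == p)
            return 1;
    return 0;
}

/* Models device release: frees every tracked allocation. */
static void model_device_release(struct model_device *dev)
{
    while (dev->count > 0)
        free(dev->managed[--dev->count]);
}

/* Counts how many double frees a manual free would cause. */
static int model_double_frees(int free_manually)
{
    struct model_device dev = { .count = 0 };
    void *hdev = model_devm_alloc(&dev, 64);
    int doubles = 0;

    assert(hdev != NULL);
    /* A plain manual free does not remove the entry from the managed
     * list, so the later device release would free the pointer again. */
    if (free_manually && model_is_managed(&dev, hdev))
        doubles++;
    model_device_release(&dev);   /* the only real free() in this model */
    return doubles;
}
```

Here `model_double_frees(1)` returns 1 and `model_double_frees(0)` returns 0, which mirrors the mismatch described in the comment above.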
from linux.
@stevyan Could you help verify the solution on these platforms: BYT, APL, CNL, WHL?
from linux.
Summary:
This issue cannot be reproduced on BYT (more than 10 tries).
But it can still be reproduced on WHL (on the 1st try), CNL (on the 2nd try), and APL (on the 5th try) after unloading the sof modules.
Test env:
sof master: b5d6c71
kernel sof-dev: a71221d
Log:
dmesg-apl.txt
dmesg-cnl.txt
dmesg-whl.txt
from linux.
@markyang Thanks for your verification! I need your help to file 2 new bugs, for APL and WHL respectively, based on the logs. Please also state the topology file used in the new bugs.
- For APL, please file a new bug, since this bug tracks the unloading/reloading failure that happens the 1st time, e.g. "Kernel panic after unloading and reloading driver for multiple times on APL".
- For WHL, it fails for a different reason, so please file a separate bug.
- For CNL, please don't test this case for now, since it fails in power management, and PM is a feature still in progress for CNL/WHL.
from linux.
@mengdonglin The following topologies were used:
APL: sof-apl-pcm512x.tplg
CNL: test-ssp0-mclk-0-I2S-volume-s16le-s16le-48k-24000k-nocodec.tplg
WHL: sof-hda-generic.tplg
from linux.
Summary:
This issue cannot be reproduced on APL with PCM512x.
Test steps:
1. remove sound modules:
sudo modprobe -r sof_pci_dev snd_sof_nocodec snd_sof_intel_hda_common snd_sof_intel_hda snd_sof_intel_byt snd_sof_xtensa_dsp snd_sof snd_soc_acpi_intel_match snd_soc_acpi snd_soc_dmic snd_soc_wm8804_i2c snd_soc_wm8804 snd_soc_pcm512x_i2c snd_soc_pcm512x snd_soc_sst_bxt_pcm512x snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_core snd_pcm
2. load sound modules:
sudo modprobe snd_soc_pcm512x_i2c
sudo modprobe sof_pci_dev
3. repeat step 1 and 2 (10 times)
4. aplay -l
Test env:
sof master: ba4054b
kernel sof-dev: f9b2f98
tplg: sof-apl-pcm512x.tplg
from linux.
Summary:
Closing the issue. A new issue (/issues/466) has been created to follow the panic on WHL.
from linux.
@bardliao I have not retested it on BYT nocodec yet. It's okay on BYT with ALC5651 at least.
from linux.