Created attachment 135445
-- Platform: CHT
-- System architecture: x86_64
-- Kernel version: 4.14.0-041400-generic
-- Linux distribution: Ubuntu 16.04 LTS
-- Motherboard model: CHT T3 RVP with CHT-T3 D1 SoC (reproducible on both T3 and T4 RVPs)
-- Display connector: HDMI
Detailed bug description:
During a CPU thread online/offline test, the system hangs when an offlined thread is brought back online. The issue has been reproducible since kernel 4.11-rc1 and is still present in kernel 4.14. It was bisected to commit eef57324d926f0d8c7a40069e7d26e0cb0651b47. After further debugging, the following kernel panic in setup_vector_irq() was captured, together with debug messages:
------------[ cut here ]------------
[ 87.353072] irq 298 idata->chip->name hdmi_lpe_audio_irqchip
[ 87.353072] irq 298 apic_chip_data
[ 87.353073] irq 298 data->domain is NULL
[ 87.353120] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 87.353132] IP: setup_vector_irq+0x1ba/0x230
[ 87.353133] PGD 0
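Steps to reproduce:
1. Check the available CPUs: $ grep "processor" /proc/cpuinfo
2. Take cpu1-cpu3 offline (as root): $ for i in 1 2 3; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
3. Bring cpu1-cpu3 back online: $ for i in 1 2 3; do echo 1 > /sys/devices/system/cpu/cpu$i/online; done <-- the issue happens here; the system hangs.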
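The steps above can be scripted for repeated runs. This is a minimal Python sketch, not a tool from this report: the function names (`list_processors`, `set_online`, `hotplug_cycle`) are hypothetical, and the sysfs directory is a parameter so the logic can be exercised against a fake directory tree. On a real system it must run as root.

```python
import re
from pathlib import Path
from typing import Iterable, List

# Default location of the CPU hotplug controls (overridable for testing).
SYSFS_CPU_DIR = "/sys/devices/system/cpu"


def list_processors(cpuinfo_text: str) -> List[int]:
    """Step 1: return the logical CPU numbers found in /proc/cpuinfo text."""
    return [
        int(m.group(1))
        for m in re.finditer(r"^processor\s*:\s*(\d+)", cpuinfo_text, re.MULTILINE)
    ]


def set_online(cpus: Iterable[int], online: bool, sysfs_cpu_dir: str = SYSFS_CPU_DIR) -> None:
    """Write "1" (online) or "0" (offline) to cpuN/online for each CPU."""
    value = "1" if online else "0"
    for cpu in cpus:
        Path(sysfs_cpu_dir, f"cpu{cpu}", "online").write_text(value)


def hotplug_cycle(cpus: Iterable[int], sysfs_cpu_dir: str = SYSFS_CPU_DIR) -> None:
    """Steps 2 and 3: take the CPUs offline, then bring them back online."""
    cpus = list(cpus)  # materialize so the CPUs can be iterated twice
    set_online(cpus, False, sysfs_cpu_dir)
    # On the affected kernels the hang is observed during this step.
    set_online(cpus, True, sysfs_cpu_dir)
```

On the test machine, `hotplug_cycle([1, 2, 3])` matches steps 2 and 3; cpu0 is left alone, as in the manual steps.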
First bad commit (bisect result):
commit eef57324d926f0d8c7a40069e7d26e0cb0651b47
Author: Jerome Anand <email@example.com>
Date: Wed Jan 25 04:27:49 2017 +0530
drm/i915: setup bridge for HDMI LPE audio driver
Enable support for HDMI LPE audio mode on Baytrail and
Cherrytrail when HDaudio controller is not detected
Setup minimum required resources during i915_driver_load:
1. Create a platform device to share MMIO/IRQ resources
2. Make the platform device child of i915 device for runtime PM.
3. Create IRQ chip to forward HDMI LPE audio irqs.
HDMI LPE audio driver (a standalone sound driver) probes the
LPE audio device and creates a new sound card.
Signed-off-by: Pierre-Louis Bossart <firstname.lastname@example.org>
Signed-off-by: Jerome Anand <email@example.com>
Acked-by: Jani Nikula <firstname.lastname@example.org>
Signed-off-by: Takashi Iwai <email@example.com>
I can reproduce the system hang on a Zotac PI330 device (CHT).
Can you clarify which debug options you used? I only see a system hang when one core is put back online, and I can't get a dmesg as detailed as yours.
FWIW, with a regular Fedora 4.12 kernel I also see the problem, but I get an additional message when taking core 1 offline:
[ 163.799497] Cannot set affinity for irq 158
[ 163.801408] smpboot: CPU 1 is now offline
This looks more like a bug on the x86 CPU hotplug side to me.
Can anyone check whether it's reproducible with 4.15-rc1? There have been quite a lot of fixes and cleanups in the x86 code in this area.
Two findings here:
First, the HDMI LPE audio IRQ has no chip_data that the x86 APIC code recognizes, so the x86 APIC driver ends up dereferencing an invalid pointer, which causes the kernel panic.
Second, disabling CONFIG_CPUMASK_OFFSTACK makes cpumask_var_t an array type instead of a pointer; with that configuration, the invalid pointer dereference in the x86 APIC driver no longer happens.
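The first finding can be pictured with a simplified Python model. This is not kernel code: the names `ApicVectorData`, `ForeignChipData`, `IrqDesc` and `vectors_for_cpu` are hypothetical, and the model only assumes that, when a CPU comes online, the vector setup walks every IRQ descriptor and interprets its chip_data as APIC vector data. The ownership check is one way to express the missing guard; it is not the change that went into 4.15.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass
class ApicVectorData:
    """Hypothetical stand-in for the APIC driver's per-IRQ vector data."""
    vector: int
    domain: FrozenSet[int]  # CPUs this vector is routed to


@dataclass
class ForeignChipData:
    """Hypothetical stand-in for chip_data owned by another irqchip driver."""
    owner: str


@dataclass
class IrqDesc:
    irq: int
    chip_data: object  # may belong to any irqchip driver


def vectors_for_cpu(cpu: int, descs: List[IrqDesc], check_owner: bool) -> Dict[int, int]:
    """Build a vector -> IRQ table for a CPU that is coming online."""
    table: Dict[int, int] = {}
    for desc in descs:
        data = desc.chip_data
        if data is None:
            continue
        if check_owner and not isinstance(data, ApicVectorData):
            # Skip IRQs whose chip_data is not APIC vector data.
            continue
        # Without the ownership check, foreign chip_data has no 'domain':
        # Python raises AttributeError, whereas C reads an invalid pointer.
        if cpu not in data.domain:
            continue
        table[data.vector] = desc.irq
    return table
```

In C the foreign chip_data is simply reinterpreted, so reading `domain` yields whatever is stored at that offset (NULL in the debug output) instead of raising an error. The model does not capture the second finding; presumably, without CONFIG_CPUMASK_OFFSTACK the mask is stored inline rather than behind a pointer, so the same bad read returns garbage bits instead of dereferencing NULL.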
This issue cannot be reproduced in v4.15-rc1, because the code in setup_vector_irq() that made the improper reference through the invalid pointer has since been patched. After discussion, there is no need to modify this driver now.