Bug 88433

Summary: [all Bisected]igt/drv_module_reload causes major memory corruption and system hang
Product: DRI Reporter: lu hua <huax.lu>
Component: DRM/IntelAssignee: Matt Roper <matthew.d.roper>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: conselvan2, intel-gfx-bugs
Version: unspecified   
Hardware: All   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg none

Description lu hua 2015-01-15 02:33:53 UTC
Created attachment 112260 [details]
dmesg

==System Environment==
--------------------------
Regression: yes
good commit: 985850bd145655d10dfcd5a03a3fc38540794ca7
bad commit:cd52471999ef08bca35568525c1d85d883ffedb6

Non-working platforms: all

==kernel==
--------------------------
drm-intel-nightly/9c4bdce37d09c0682f04bb5e6d0567def5c8d786
commit 9c4bdce37d09c0682f04bb5e6d0567def5c8d786
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Jan 13 23:27:51 2015 +0100

    drm-intel-nightly: 2015y-01m-13d-22h-27m-23s UTC integration manifest

==Bug detailed description==
-----------------------------
It causes system hang on drm-intel-nightly and drm-intel-next-queued kernel, works well on drm-intel-fixes kernel.

output:
unbinding /sys/class/vtconsole/vtcon0/: (M) frame buffer device

dmesg
[   47.332841] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[   47.332847] IP: [<ffffffff8110e90b>] kmem_cache_alloc+0xd9/0x113
[   47.332853] PGD 1195a5067 PUD 11937d067 PMD 0
[   47.332856] Oops: 0000 [#1] SMP
[   47.332860] Modules linked in: netconsole configfs ip6table_filter ip6_tables ipv6 iptable_filter ip_tables ebtable_nat ebtables x_tables dm_mod snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi iTCO_wdt iTCO_vendor_support dcdbas i2c_i801 pcspkr serio_raw snd_hda_controller snd_hda_codec snd_hwdep lpc_ich shpchp mfd_core snd_pcm snd_timer snd soundcore battery acpi_cpufreq i915(-) button video drm_kms_helper drm cfbfillrect cfbimgblt cfbcopyarea [last unloaded: snd_hda_intel]
[   47.332896] CPU: 3 PID: 1078 Comm: kworker/u16:3 Not tainted 3.19.0-rc4_drm-intel-nightly_9c4bdc_20150114+ #410
[   47.332899] Hardware name: Dell Inc. OptiPlex 9020/0DNKMN, BIOS A03 09/17/2013
[   47.332904] Workqueue: khelper __call_usermodehelper
[   47.332906] task: ffff880119690000 ti: ffff880114490000 task.ti: ffff880114490000
[   47.332908] RIP: 0010:[<ffffffff8110e90b>]  [<ffffffff8110e90b>] kmem_cache_alloc+0xd9/0x113
[   47.332912] RSP: 0018:ffff880114493c18  EFLAGS: 00010282
[   47.332926] RAX: 0000000000000000 RBX: ffff880115d59de0 RCX: 0000000000001282
[   47.332928] RDX: 0000000000001281 RSI: 00000000000000d0 RDI: ffff88011a003900
[   47.332930] RBP: 0000000000000001 R08: 0000000000015520 R09: ffffffff8104a2ae
[   47.332932] R10: ffff8800db07f4d8 R11: ffffffff81bd1d18 R12: ffff88011a003900
[   47.332934] R13: 00000000000000d0 R14: ffffffff8104f7cb R15: 0000000000800111
[   47.332936] FS:  0000000000000000(0000) GS:ffff88011eac0000(0000) knlGS:0000000000000000
[   47.332938] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.332940] CR2: 0000000000000001 CR3: 00000001167d0000 CR4: 00000000001407e0
[   47.332994] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   47.332996] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   47.332998] Stack:
[   47.332999]  0000000000000000 ffff880115d59de0 ffffffff81bc1e00 0000000000000000
[   47.333003]  ffff8800025e4000 ffff8800daecc000 0000000000800111 ffffffff8104f7cb
[   47.333007]  ffff880114493d54 ffffffff8110e859 0000000000000000 0000000000000000
[   47.333011] Call Trace:
[   47.333014]  [<ffffffff8104f7cb>] ? alloc_pid+0x22/0x3b7
[   47.333017]  [<ffffffff8110e859>] ? kmem_cache_alloc+0x27/0x113
[   47.333020]  [<ffffffff8110e859>] ? kmem_cache_alloc+0x27/0x113
[   47.333024]  [<ffffffff8103a948>] ? copy_process.part.36+0xd98/0x160a
[   47.333027]  [<ffffffff8104a2ae>] ? __call_usermodehelper+0x3e/0x3e
[   47.333030]  [<ffffffff8103b337>] ? do_fork+0xb4/0x2e0
[   47.333032]  [<ffffffff8103b581>] ? kernel_thread+0x1e/0x20
[   47.333035]  [<ffffffff8104a29f>] ? __call_usermodehelper+0x2f/0x3e
[   47.333038]  [<ffffffff8104d108>] ? process_one_work+0x1ad/0x31a
[   47.333041]  [<ffffffff8104d4ef>] ? worker_thread+0x255/0x350
[   47.333044]  [<ffffffff8104d29a>] ? process_scheduled_works+0x25/0x25
[   47.333046]  [<ffffffff81050dc2>] ? kthread+0xc5/0xcd
[   47.333049]  [<ffffffff81050cfd>] ? kthread_freezable_should_stop+0x40/0x40
[   47.333053]  [<ffffffff8179ef2c>] ? ret_from_fork+0x7c/0xb0
[   47.333056]  [<ffffffff81050cfd>] ? kthread_freezable_should_stop+0x40/0x40
[   47.333058] Code: e9 4d 89 f8 4c 89 f1 48 89 ea 48 89 de 41 ff 14 24 49 83 c4 10 49 83 3c 24 00 eb 36 eb 38 49 63 44 24 20 4d 8b 04 24 48 8d 4a 01 <48> 8b 5c 05 00 48 89 e8 65 49 0f c7 08 0f 94 c0 84 c0 0f 85 78
[   47.333098] RIP  [<ffffffff8110e90b>] kmem_cache_alloc+0xd9/0x113
[   47.333101]  RSP <ffff880114493c18>
[   47.333102] CR2: 0000000000000001
[   47.333105] ---[ end trace 0e850f2f346708bd ]---


==Reproduce steps==
---------------------------- 
1. ./drv_module_reload
Comment 1 Chris Wilson 2015-01-15 07:47:37 UTC
Big-bada-boom. Please, please bisect.
Comment 2 Li Xu 2015-01-16 03:25:32 UTC
ea2c67bb4affa84080c616920f3899f123786e56 is the first bad commit
commit ea2c67bb4affa84080c616920f3899f123786e56
Author: Matt Roper <matthew.d.roper@intel.com>
Date:   Tue Dec 23 10:41:52 2014 -0800

    drm/i915: Move to atomic plane helpers (v9)

    Switch plane handling to use the atomic plane helpers.  This means that
    rather than provide our own implementations of .update_plane() and
    .disable_plane(), we expose the lower-level check/prepare/commit/cleanup
    entrypoints and let the DRM core implement update/disable for us using
    those entrypoints.

    The other main change that falls out of this patch is that our
    drm_plane's will now always have a valid plane->state that contains the
    relevant plane state (initial state is allocated at plane creation).
    The base drm_plane_state pointed to holds the requested source/dest
    coordinates, and the subclassed intel_plane_state holds the adjusted
    values that our driver actually uses.

    v2:
     - Renamed file from intel_atomic.c to intel_atomic_plane.c (Daniel)
     - Fix a copy/paste comment mistake (Bob)

    v3:
     - Use prepare/cleanup functions that we've already factored out
     - Use newly refactored pre_commit/commit/post_commit to avoid sleeping
       during vblank evasion

    v4:
     - Rebase to latest di-nightly requires adding an 'old_state' parameter
       to atomic_update;

    v5:
     - Must have botched a rebase somewhere and lost some work.  Restore
       state 'dirty' flag to let begin/end code know which planes to
       run the pre_commit/post_commit hooks for.  This would have actually
       shown up as broken in the next commit rather than this one.

    v6:
     - Squash kerneldoc patch into this one.
     - Previous patches have now already taken care of most of the
       infrastructure that used to be in this patch.  All we're adding here
       now is some thin wrappers.

    v7:
     - Check return of intel_plane_duplicate_state() for allocation
       failures.

    v8:
     - Drop unused drm_plane_state -> intel_plane_state cast.  (Ander)
     - Squash in actual transition to plane helpers.  Significant
       refactoring earlier in the patchset has made the combined
       prep+transition much easier to swallow than it was in earlier
       iterations. (Ander)

    v9:
     - s/track_fbs/disabled_planes/ in the atomic crtc flags.  The only fb's
       we need to update frontbuffer tracking for are those on a plane about
       to be disabled (since the atomic helpers never call prepare_fb() when
       disabling a plane), so the new name more accurately describes what
       we're actually tracking.

    Testcase: igt/kms_plane
    Testcase: igt/kms_universal_plane
    Testcase: igt/kms_cursor_crc
    Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
    Reviewed-by: Ander Conselvan de Oliveira <conselvan2@gmail.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 a2d045dfd034bfb9ec7a5ffb799fa148ff4047c3 1f7f88d3e9fe9cbe7b8409c1d868                                                                                                                 52c65808d7a3 M  Documentation
:040000 040000 7660ab19978e64dac501827a5b4b1e050e88b529 68cd800423301596e3245b0410dc                                                                                                                 1d1a19fc3b35 M  drivers
Comment 3 lu hua 2015-01-16 04:44:08 UTC
(In reply to Li Xu from comment #2)
> ea2c67bb4affa84080c616920f3899f123786e56 is the first bad commit
> commit ea2c67bb4affa84080c616920f3899f123786e56
> Author: Matt Roper <matthew.d.roper@intel.com>
> Date:   Tue Dec 23 10:41:52 2014 -0800
> 
>     drm/i915: Move to atomic plane helpers (v9)
> 

commit ea2c67bb4affa84080c616920f3899f123786e56
Author:     Matt Roper <matthew.d.roper@intel.com>
AuthorDate: Tue Dec 23 10:41:52 2014 -0800
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Mon Jan 12 23:59:31 2015 +0100

    drm/i915: Move to atomic plane helpers (v9)
Comment 4 Jani Nikula 2015-01-16 10:55:07 UTC
Please always assign to bisected bad commit author! I'm also CC'ing Ander for being the reviewer.

At a very quick glance, only intel_plane_duplicate_state looks suspicious. Is plane->state always valid when non-NULL?

Matt, Ander?
Comment 5 Ander Conselvan de Oliveira 2015-01-16 12:41:29 UTC
(In reply to Jani Nikula from comment #4)
> Please always assign to bisected bad commit author! I'm also CC'ing Ander
> for being the reviewer.
> 
> At a very quick glance, only intel_plane_duplicate_state looks suspicious.
> Is plane->state always valid when non-NULL?
> 
> Matt, Ander?

It should, but maybe I missed something in the review. But the error checking in intel_plane_duplicate_state() and callers seems to be correct. Perhaps the bug is in the plane helpers.
Comment 6 Ander Conselvan de Oliveira 2015-01-16 13:14:48 UTC
(In reply to Ander Conselvan de Oliveira from comment #5)
> (In reply to Jani Nikula from comment #4)
> > Please always assign to bisected bad commit author! I'm also CC'ing Ander
> > for being the reviewer.
> > 
> > At a very quick glance, only intel_plane_duplicate_state looks suspicious.
> > Is plane->state always valid when non-NULL?
> > 
> > Matt, Ander?
> 
> It should, but maybe I missed something in the review. But the error
> checking in intel_plane_duplicate_state() and callers seems to be correct.
> Perhaps the bug is in the plane helpers.

Actually the problem is somewhere with intel_plane_destroy. intel_plane_state_destroy doesn't set the plane to NULL and then something bad happens when drm_plane_cleanup tries to destroy it again.
Comment 7 Matt Roper 2015-01-16 15:08:39 UTC
(In reply to Ander Conselvan de Oliveira from comment #6)
> (In reply to Ander Conselvan de Oliveira from comment #5)
> > (In reply to Jani Nikula from comment #4)
> > > Please always assign to bisected bad commit author! I'm also CC'ing Ander
> > > for being the reviewer.
> > > 
> > > At a very quick glance, only intel_plane_duplicate_state looks suspicious.
> > > Is plane->state always valid when non-NULL?
> > > 
> > > Matt, Ander?
> > 
> > It should, but maybe I missed something in the review. But the error
> > checking in intel_plane_duplicate_state() and callers seems to be correct.
> > Perhaps the bug is in the plane helpers.
> 
> Actually the problem is somewhere with intel_plane_destroy.
> intel_plane_state_destroy doesn't set the plane to NULL and then something
> bad happens when drm_plane_cleanup tries to destroy it again.

Yep, Ander's analysis looks correct.  When I first wrote the patchset, the DRM core wasn't cleaning up plane state, so I had to call the state destruction in intel_plane_destroy() so that we wouldn't leak it on driver unload.  But the core got updated to do the destruction before my patches actually landed and I didn't notice before this series got merged, so we're now doing a double kfree() (and possibly a double framebuffer unreference as well).

The fix is to just not cleanup the plane state in the driver anymore since the core will handle it for us.  I'll send a patch for that shortly.
Comment 8 Matt Roper 2015-01-16 15:34:00 UTC
The following patch should solve the problem:

  [PATCH] drm/i915: Don't cleanup plane state in intel_plane_destroy()
Comment 9 Jani Nikula 2015-01-16 15:58:38 UTC
(In reply to Matt Roper from comment #8)
> The following patch should solve the problem:
> 
>   [PATCH] drm/i915: Don't cleanup plane state in intel_plane_destroy()

Also available at http://patchwork.freedesktop.org/patch/40566/ - please test.
Comment 10 lu hua 2015-01-19 08:11:17 UTC
(In reply to Jani Nikula from comment #9)
> (In reply to Matt Roper from comment #8)
> > The following patch should solve the problem:
> > 
> >   [PATCH] drm/i915: Don't cleanup plane state in intel_plane_destroy()
> 
> Also available at http://patchwork.freedesktop.org/patch/40566/ - please
> test.

Apply this patch on the latest -nightly kernel, the hang goes away. output shows "module successfully unloaded" but not reload, report bug 88573 to track it.
I think your patch fixed this bug.
Comment 11 Jani Nikula 2015-01-19 11:44:36 UTC
commit fd7bf8b626e1f9cfb9cd5c3104dcedae2569899b
Author: Matt Roper <matthew.d.roper@intel.com>
Date:   Fri Jan 16 07:25:24 2015 -0800

    drm/i915: Don't cleanup plane state in intel_plane_destroy()
Comment 12 lu hua 2015-01-20 07:27:38 UTC
Verified.Fixed on the latest -nightly kernel.
[root@x-hsw27 tests]# ./drv_module_reload
unbinding /sys/class/vtconsole/vtcon0/: (M) frame buffer device
module successfully unloaded
module successfully loaded again
Comment 13 Jari Tahvanainen 2017-07-03 08:46:55 UTC
Closing old verified+fixed

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.