Bug 53566 - bisected regression 3.5+ kernel
Summary: bisected regression 3.5+ kernel
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: All Linux (All)
Importance: medium major
Assignee: Ben Skeggs
QA Contact: Xorg Project Team
 
Reported: 2012-08-16 02:39 UTC by Vlad K
Modified: 2013-08-31 02:19 UTC

Attachments
kern.log (640.38 KB, text/x-log)
2012-08-16 02:40 UTC, Vlad K
restrict pce usage based on punits values (1.28 KB, patch)
2012-08-27 06:34 UTC, Ben Skeggs
crash picture, stack trace shows nv50_fb_vram_del at top (151.41 KB, image/jpeg)
2012-12-28 00:04 UTC, Jonathan Vasquez

Description Vlad K 2012-08-16 02:39:08 UTC
It seems that there is a regression in 3.5+ kernels with nouveau on a GTX 560 card: X does not start, and an artifact (a rectangle in the top right) is present during boot. Reverting the commit below resolves the issue; X can be started and the artifact is gone. Attached is the kern.log from the machine booted with drm.debug=0x04.


1a46098e910b96337f0fe3838223db43b923bad4 is the first bad commit
commit 1a46098e910b96337f0fe3838223db43b923bad4
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Fri May 4 15:17:28 2012 +1000

    drm/nvc0/ttm: use copy engines for async buffer moves

    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>
Comment 1 Vlad K 2012-08-16 02:40:46 UTC
Created attachment 65623
kern.log
Comment 2 mog55356 2012-08-18 23:39:51 UTC
Based on your description of the problem and the attached kern.log, this seems to be a duplicate of my bug, number 53101. Thanks for fixing the regression. Hopefully your patch gets accepted upstream and gets integrated into distro kernels sooner rather than later.

https://bugs.freedesktop.org/show_bug.cgi?id=53101

*** This bug has been marked as a duplicate of bug 53101 ***
Comment 3 Ben Skeggs 2012-08-27 06:34:24 UTC
Created attachment 66162
restrict pce usage based on punits values

Re-opening as the bug this was marked a duplicate of is a mess and could possibly be multiple issues.

I've attached a patch which should help at least users with NVCE (GF114) chipsets.
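
As a rough illustration of the idea suggested by the patch title (this is not the patch itself): copy engines (PCE) would only be used when the PUNITS value reports them as present. A minimal Python sketch, assuming PUNITS is a bitmask in which a set bit marks a disabled unit. The bit positions and the name `usable_copy_engines` are hypothetical and are not taken from the driver.

```python
from typing import Dict, List

# Hypothetical mapping of copy engine index -> bit in the PUNITS value.
# The real bit layout is defined by the hardware and the driver; these
# numbers are placeholders for illustration only.
HYPOTHETICAL_PCE_BITS: Dict[int, int] = {0: 8, 1: 9}


def usable_copy_engines(
    punits: int, pce_bits: Dict[int, int] = HYPOTHETICAL_PCE_BITS
) -> List[int]:
    """Return the copy engine indices whose (assumed) disable bit is clear."""
    return [
        engine
        for engine, bit in sorted(pce_bits.items())
        if not punits & (1 << bit)
    ]


all_present = usable_copy_engines(0x000)   # no disable bits set
one_disabled = usable_copy_engines(0x200)  # bit 9 set: engine 1 treated as absent
```

Under these assumptions, a driver applying such a check would simply never create the engines that are reported as absent.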
Comment 4 Vlad K 2012-08-29 01:05:21 UTC
Ben, I am able to boot with 3.5.3 + this patch and so far see no other issues. If I can be of any other help, please let me know.
Comment 5 Kelly Doran 2012-08-29 19:33:58 UTC
I am using 3.6.0-rc3 and the related patch that showed up on git recently and everything is working great on my card now.
Comment 6 Vlad K 2012-09-05 21:24:27 UTC
Unfortunately this is not completely fixed. I am still getting the same problem after 1-2 days of uptime: X crashes and gets stuck in a restart loop, and I am forced to reboot.

dmesg:


[225619.491763] [drm] nouveau 0000:01:00.0: PFIFO: read fault at 0x0008028000 [PAGE_NOT_PRESENT] from PFIFO/PFIFO on channel 0x000013a000
[225623.000984] [drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon
[225626.049326] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[225629.047337] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
[225634.044033] [drm] nouveau 0000:01:00.0: Failed to idle channel 4.
[225637.042033] [drm] nouveau 0000:01:00.0: Failed to idle channel 3.
[225646.703625] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[225649.701636] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
[225659.263372] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[225662.261307] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
[225671.806976] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
[225674.804998] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
Comment 7 Vlad K 2012-09-16 17:15:24 UTC
I tried the latest git and hit a bug in under 24 hours :(


Sep 16 11:19:32 desktop kernel: [33219.647922] nouveau W[   PFIFO][0000:01:00.0] unknown status 0x40000000
Sep 16 12:04:47 desktop kernel: [35932.327284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000012
Sep 16 12:04:47 desktop kernel: [35932.327317] IP: [<ffffffffa0498cc5>] nouveau_mm_free+0x85/0x180 [nouveau]
Sep 16 12:04:47 desktop kernel: [35932.327357] PGD 222c3d067 PUD 2220d6067 PMD 0 
Sep 16 12:04:47 desktop kernel: [35932.327375] Oops: 0002 [#1] PREEMPT SMP 
Sep 16 12:04:47 desktop kernel: [35932.327392] Modules linked in: tun bnep rfcomm bluetooth rfkill pci_stub vboxpci(O) vboxnetadp(O) cpufreq_stats parport_pc vboxnetflt(O) ppdev lp parport vboxdrv(O) binfmt_misc zram(C) zsmalloc(C) nfsd exportfs auth_rpcgss nfs_acl nfs lockd fscache sunrpc fuse ext3 jbd sha256_generic aes_x86_64 aes_generic cbc dm_crypt sbs sbshc max6650 loop firewire_sbp2 snd_hda_codec_hdmi joydev powernow_k8 hid_generic mperf snd_hda_codec_realtek snd_usb_audio snd_usbmidi_lib snd_seq_midi snd_seq_midi_event snd_rawmidi kvm_amd kvm evdev edac_mce_amd microcode pcspkr edac_core psmouse serio_raw k10temp nouveau snd_hda_intel mxm_wmi snd_hda_codec video i2c_piix4 ttm snd_hwdep drm_kms_helper snd_pcm drm snd_page_alloc snd_seq i2c_algo_bit snd_seq_device i2c_core snd_timer snd soundcore nvidiafb vgastate processor wmi button thermal_sys ext4 crc16 jbd2 mbcache btrfs libcrc32c zlib_deflate dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq raid1 raid0 mult
Sep 16 12:04:47 desktop kernel: ipath linear md_mod usbhid hid firewire_ohci firewire_core r8169 sg sd_mod crc_t10dif ohci_hcd crc32c_intel xhci_hcd crc_itu_t mii ehci_hcd ahci libahci usbcore libata usb_common scsi_mod
Sep 16 12:04:47 desktop kernel: [35932.327873] CPU 6 
Sep 16 12:04:47 desktop kernel: [35932.327884] Pid: 4607, comm: kwin Tainted: G         C O 3.6.0-rc5-drmgit02+ #1 To be filled by O.E.M. To be filled by O.E.M./SABERTOOTH 990FX
Sep 16 12:04:47 desktop kernel: [35932.327907] RIP: 0010:[<ffffffffa0498cc5>]  [<ffffffffa0498cc5>] nouveau_mm_free+0x85/0x180 [nouveau]
Sep 16 12:04:47 desktop kernel: [35932.327943] RSP: 0018:ffff88021db27c28  EFLAGS: 00010246
Sep 16 12:04:47 desktop kernel: [35932.327956] RAX: 0000000000000000 RBX: ffff8802241ad498 RCX: ffff88021dfb08c0
Sep 16 12:04:47 desktop kernel: [35932.327970] RDX: 000000000000000a RSI: dead000000100100 RDI: dead000000200200
Sep 16 12:04:47 desktop kernel: [35932.327985] RBP: ffff880045fff180 R08: 00000000000165a0 R09: ffff88022ed965a0
Sep 16 12:04:47 desktop kernel: [35932.328000] R10: ffffea00016f74c0 R11: ffffffffa0498da9 R12: ffff88021dfb08c0
Sep 16 12:04:47 desktop kernel: [35932.328014] R13: ffff88021db27c60 R14: ffff8802241ad420 R15: ffff88014af16040
Sep 16 12:04:47 desktop kernel: [35932.328030] FS:  00007fba186d0780(0000) GS:ffff88022ed80000(0000) knlGS:00000000f10fbb70
Sep 16 12:04:47 desktop kernel: [35932.328046] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 12:04:47 desktop kernel: [35932.328058] CR2: 0000000000000012 CR3: 0000000222fe1000 CR4: 00000000000407e0
Sep 16 12:04:47 desktop kernel: [35932.328080] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 16 12:04:47 desktop kernel: [35932.328097] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep 16 12:04:47 desktop kernel: [35932.328115] Process kwin (pid: 4607, threadinfo ffff88021db26000, task ffff88021ca9d100)
Sep 16 12:04:47 desktop kernel: [35932.328133] Stack:
Sep 16 12:04:47 desktop kernel: [35932.328141]  0000000000000000 ffff88014a07a900 ffff88014a07a9c0 ffff8802241ad498
Sep 16 12:04:47 desktop kernel: [35932.328185]  ffff8802241ad400 ffffffffa04ac29f 0000000000008000 ffff88005bdd3e00
Sep 16 12:04:47 desktop kernel: [35932.328215]  ffff88014af16000 ffff880221df1178 ffff880221df1580 ffff8802245f7378
Sep 16 12:04:47 desktop kernel: [35932.328237] Call Trace:
Sep 16 12:04:47 desktop kernel: [35932.328270]  [<ffffffffa04ac29f>] ? nv50_fb_vram_del+0x9f/0xe0 [nouveau]
Sep 16 12:04:47 desktop kernel: [35932.328296]  [<ffffffffa043f786>] ? ttm_bo_cleanup_memtype_use+0x66/0xa0 [ttm]
Sep 16 12:04:47 desktop kernel: [35932.328321]  [<ffffffffa044091c>] ? ttm_bo_release+0x1dc/0x220 [ttm]
Sep 16 12:04:47 desktop kernel: [35932.328344]  [<ffffffffa0440995>] ? ttm_bo_unref+0x35/0x60 [ttm]
Sep 16 12:04:47 desktop kernel: [35932.328388]  [<ffffffffa05037b2>] ? nouveau_gem_object_del+0x52/0x80 [nouveau]
Sep 16 12:04:47 desktop kernel: [35932.328416]  [<ffffffffa040ec18>] ? drm_gem_handle_delete+0xd8/0x120 [drm]
Sep 16 12:04:47 desktop kernel: [35932.328444]  [<ffffffffa040f000>] ? drm_gem_destroy+0x40/0x40 [drm]
Sep 16 12:04:47 desktop kernel: [35932.328468]  [<ffffffffa040d164>] ? drm_ioctl+0x3c4/0x460 [drm]
Sep 16 12:04:47 desktop kernel: [35932.328492]  [<ffffffff81302e17>] ? sys_recvfrom+0xf7/0x140
Sep 16 12:04:47 desktop kernel: [35932.328508]  [<ffffffff81153a01>] ? do_vfs_ioctl+0x81/0x540
Sep 16 12:04:47 desktop kernel: [35932.328524]  [<ffffffff811545eb>] ? poll_select_copy_remaining+0xab/0x120
Sep 16 12:04:47 desktop kernel: [35932.328540]  [<ffffffff81153f48>] ? sys_ioctl+0x88/0xa0
Sep 16 12:04:47 desktop kernel: [35932.328555]  [<ffffffff8140c8f9>] ? system_call_fastpath+0x16/0x1b
Sep 16 12:04:47 desktop kernel: [35932.328569] Code: 89 45 34 8b 41 38 01 45 38 80 79 30 00 75 2b 48 8b 51 10 48 8b 41 18 48 be 00 01 10 00 00 00 ad de 48 bf 00 02 20 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 71 10 48 89 79 18 48 8b 11 48 8b 41 
Sep 16 12:04:47 desktop kernel: [35932.328894] RIP  [<ffffffffa0498cc5>] nouveau_mm_free+0x85/0x180 [nouveau]
Sep 16 12:04:47 desktop kernel: [35932.328930]  RSP <ffff88021db27c28>
Sep 16 12:04:47 desktop kernel: [35932.328940] CR2: 0000000000000012
Sep 16 12:04:47 desktop kernel: [35932.369824] ---[ end trace e66067c6ec707dbe ]---
Comment 8 Vlad K 2012-09-18 17:52:31 UTC
Had another GPU lockup with 3.5.3 + Ben's patch; it happened after ~2 days, this time with some different error messages.



Sep 18 11:55:27 desktop kernel: [171976.674871] [drm] nouveau 0000:01:00.0: multiple instances of buffer 134 on validation list
Sep 18 11:55:27 desktop kernel: [171976.674906] [drm] nouveau 0000:01:00.0: validate_init
Sep 18 11:55:27 desktop kernel: [171976.674910] [drm] nouveau 0000:01:00.0: validate: -22
Sep 18 11:55:27 desktop kernel: [171976.677029] [drm] nouveau 0000:01:00.0: PFIFO: read fault at 0xab00000000 [PT_NOT_PRESENT] from PGRAPH/GPC1/(unknown enum 0x00000008) on channel 0x0000ea8000
Sep 18 11:55:31 desktop kernel: [171980.505052] [drm] nouveau 0000:01:00.0: GPU lockup - switching to software fbcon
Sep 18 11:55:34 desktop kernel: [171983.512347] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
Sep 18 11:55:36 desktop kernel: [171985.511208] [drm] nouveau 0000:01:00.0: PFIFO - playlist update failed
Sep 18 11:55:39 desktop kernel: [171988.509023] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
Sep 18 11:55:41 desktop kernel: [171990.507779] [drm] nouveau 0000:01:00.0: 0x2634 != chid: 0x00100002
Sep 18 11:55:41 desktop kernel: [171990.507890] [drm] nouveau 0000:01:00.0: PFIFO: unknown status 0x000
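
For comparing lockups like the ones in comments 6 and 8, the PFIFO fault lines can be pulled out of a log mechanically. The following is a minimal Python sketch, not part of any nouveau tooling. The regular expression is modelled only on the two "read fault" lines quoted in this report; the `write` alternative and the names `Fault` and `parse_faults` are assumptions for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Pattern modelled on the fault lines quoted in this bug report.
# "write" is an assumed variant; only "read" appears in the logs above.
FAULT_RE = re.compile(
    r"PFIFO: (?P<kind>read|write) fault at (?P<addr>0x[0-9a-fA-F]+) "
    r"\[(?P<reason>\w+)\] from (?P<unit>.+?) "
    r"on channel (?P<chan>0x[0-9a-fA-F]+)"
)


class Fault(NamedTuple):
    kind: str      # "read" or "write"
    addr: int      # faulting address
    reason: str    # e.g. PAGE_NOT_PRESENT
    unit: str      # reporting unit, e.g. PFIFO/PFIFO
    channel: int   # channel value as printed in the log


def parse_faults(lines: Iterable[str]) -> Iterator[Fault]:
    """Yield one Fault per log line that matches the fault pattern."""
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            yield Fault(
                m.group("kind"),
                int(m.group("addr"), 16),
                m.group("reason"),
                m.group("unit"),
                int(m.group("chan"), 16),
            )


# The two fault lines from comments 6 and 8, plus one unrelated line.
SAMPLE = [
    "[225619.491763] [drm] nouveau 0000:01:00.0: PFIFO: read fault at "
    "0x0008028000 [PAGE_NOT_PRESENT] from PFIFO/PFIFO on channel 0x000013a000",
    "[225623.000984] [drm] nouveau 0000:01:00.0: GPU lockup - switching to "
    "software fbcon",
    "Sep 18 11:55:27 desktop kernel: [171976.677029] [drm] nouveau 0000:01:00.0: "
    "PFIFO: read fault at 0xab00000000 [PT_NOT_PRESENT] from "
    "PGRAPH/GPC1/(unknown enum 0x00000008) on channel 0x0000ea8000",
]

faults = list(parse_faults(SAMPLE))
```

Feeding it the output of `dmesg` or a saved kern.log, line by line, gives a list that can be grouped by `reason` or `unit` to see whether separate lockups share a cause.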
Comment 9 Jonathan Vasquez 2012-12-28 00:02:45 UTC
This never happened to me before but my computer just crashed on Gentoo Linux running vanilla kernel 3.7.1 with the nouveau driver. I believe this bug report might be similar to the problem I experienced. 

I don't have the log since I can't find it in /var/log, but I have pictures I took off my monitor. Hopefully this helps.

I uploaded the image to imgur

http://i.imgur.com/P4hjL.jpg
Comment 10 Jonathan Vasquez 2012-12-28 00:04:31 UTC
Created attachment 72200
crash picture, stack trace shows nv50_fb_vram_del at top
Comment 11 Aleksi Torhamo 2013-01-10 09:17:39 UTC
The nv50_fb_vram_del kernel crashes are probably fixed by the patch in 
http://lists.freedesktop.org/archives/nouveau/2013-January/011996.html

That issue is probably totally unrelated to the original bug report, though.
Comment 12 Lucas Stach 2013-06-06 10:25:48 UTC
(In reply to comment #6)
> Unfortunately this is not completely fixed. I am still getting same problem
> after 1-2 days of uptime, X crashes and stuck in restart loop - forced to
> reboot.
> 
> dmesg:
> 
> 
> [225619.491763] [drm] nouveau 0000:01:00.0: PFIFO: read fault at
> 0x0008028000 [PAGE_NOT_PRESENT] from PFIFO/PFIFO on channel 0x000013a000
> [225623.000984] [drm] nouveau 0000:01:00.0: GPU lockup - switching to
> software fbcon
> [225626.049326] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
> [225629.047337] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
> [225634.044033] [drm] nouveau 0000:01:00.0: Failed to idle channel 4.
> [225637.042033] [drm] nouveau 0000:01:00.0: Failed to idle channel 3.
> [225646.703625] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
> [225649.701636] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
> [225659.263372] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
> [225662.261307] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.
> [225671.806976] [drm] nouveau 0000:01:00.0: Failed to idle channel 1.
> [225674.804998] [drm] nouveau 0000:01:00.0: Failed to idle channel 2.

I can reproduce this problem on a desktop NVe7 with kernel 3.10-rc3 and nouveau git changes on top. About once every one or two days of uptime, the GPU hangs with a PFIFO read or write fault.
Comment 13 Ilia Mirkin 2013-08-31 02:19:08 UTC
This bug has devolved into "I have various issues with nouveau", so I'm closing it. The original problem that Vlad had appears to be fixed, and the logic not to instantiate bogus copy engines remains in the current code. That's not to say that all problems with nouveau are closed, but bugs have to be about one at a time :)

Feel free to open new issues if bugs remain, but please look at the existing bug list and follow http://nouveau.freedesktop.org/wiki/Bugs/.

