Bug 28454

Summary: Intel Clarkdale Hard Lockup
Product: DRI
Reporter: Daniel Kasak <dan>
Component: DRM/Intel
Assignee: Jesse Barnes <jbarnes>
Status: CLOSED INVALID
QA Contact:
Severity: major
Priority: medium
CC: gordon.jin, john
Version: XOrg git
Keywords: NEEDINFO
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:

Description Daniel Kasak 2010-06-08 17:00:05 UTC
I just got an Intel Clarkdale system, and installed Sabayon Linux-5.3-rc on it. This has the following components:

 - 2.6.34-sabayon kernel ( Linux kakak 2.6.34-sabayon #1 SMP Mon May 31 16:00:15 UTC 2010 x86_64 Intel(R) Core(TM) i5 CPU 650 @ 3.20GHz GenuineIntel GNU/Linux )

 - mesa-7.8.1

 - xf86-video-intel-2.11.0

 - libdrm-2.4.19

With just gnome running, with NO compositing manager, I am getting random hard lockups. CTRL-ALT-SysRq magic system requests don't work. I am unable to get a batch buffer.

I have posted at the phoronix forums: http://www.phoronix.com/forums/showthread.php?t=24205 and it was recommended that I file a bug report. Also see the phoronix review which mentions hard lockups: http://www.phoronix.com/scan.php?page=article&item=intel_clarkdale_gpu&num=1

I have set the severity as 'blocker' as this system is wildly unstable - I had 3 hard lockups in 4 hours yesterday, and my first one this morning happened 3 minutes after powering on.
Comment 1 Jesse Barnes 2010-06-09 10:33:30 UTC
Can you try the latest stable kernel, 2.6.34.x?  Or 2.6.35-rc2...
Comment 2 Daniel Kasak 2010-06-10 18:54:09 UTC
I've tested with a couple of different combinations. Firstly, upgrading to 2.6.35-rc2-git2 made lockups about half as frequent ... about 1 every 4 hours now.

Next I activated ecomorph ( compiz port for enlightenment-0.17 ). There were rendering issues with mesa-7.8.1, so I built mesa from git ( plus upgraded to libdrm-2.4.20 ). This fixed the rendering issues, but I started getting more frequent lockups again.

Next I upgraded xf86-video-intel from 2.11.0 to git. This gives me a hard lockup when X starts ... so I've had to downgrade that.

In all, I'm in a slightly better position than before ... still getting regular lockups, but not *quite* as many.
Comment 3 Daniel Kasak 2010-06-10 19:17:57 UTC
Actually, scratch that. Just had another 2 in quick succession ( power cycling between lockups ). Back to FBDEV :(
Comment 4 Jesse Barnes 2010-06-11 09:12:43 UTC
Strange, I've run with this configuration for quite some time.  I haven't seen hangs since the PIPE_CONTROL patchset landed.

You could also try the drm-intel-next branch from git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel.git, it has a couple of fixes that haven't landed upstream yet, including one that can cause a panic if page flipping is in use.
Comment 5 Jesse Barnes 2010-07-01 13:56:38 UTC
Any update here?  I'm still using this config pretty happily...
Comment 6 Daniel Kasak 2010-07-01 16:28:34 UTC
It still locks my system hard.

I've tested with anholt's drm-intel-next kernel ( 2.6.34 ) as per your recommendation above, and also kept up with the latest git sources of libdrm, mesa and xf86-video-intel. This kernel has some issues, including udev consuming 100% of one of my cores. That aside ( I can kill it and it doesn't break too much ), I get anywhere from a couple of seconds to a couple of hours of usage, and then the system locks hard again.

Since this is a work desktop, I can't afford to have it crashing at all, and in particular leaving filesystem corruption. So I'm using fbdev. But I'm not impressed.
Comment 7 Jesse Barnes 2010-07-01 16:37:03 UTC
Hm ok, if you get a chance (I know what you mean about work machines, I hate when mine fails too), can you collect some additional information about the crash?  http://intellinuxgraphics.org/how_to_report_bug.html has a good overview of what we like to see in order to narrow things down.
Comment 8 Daniel Kasak 2010-07-07 21:53:42 UTC
I'm not able to get *anything* out of my system after one of these lockups, so obtaining the last batchbuffer is not possible. If you want other info on my system, I'll have to rebuild my kernel with debugfs support ( or something like this ). I can do it if you think this will be handy.
Comment 9 Jesse Barnes 2010-07-08 08:54:17 UTC
debugfs is mainly handy after the GPU hangs (assuming the machine is still alive).  Since your machine hard locks, it might be better to set up netconsole so you can capture any kernel log output that occurs around the time of the hang.
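
For readers following along, a minimal sketch of the netconsole setup suggested above. It assumes the kernel was built with netconsole as a module and uses the module's documented parameter format, [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]. The helper name netconsole_param and every port, address, and interface below are placeholders, not values from this report. The script only prints the modprobe command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Build the parameter string for the netconsole module.
# Arguments: src-port src-ip device target-port target-ip [target-mac]
# An empty target MAC leaves the field blank so the module falls back
# to its default (broadcast).
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values: replace with the crashing machine's address and
# interface, and the address of the machine that will collect the log.
PARAM=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 "")

# Print the command instead of running it; loading the module needs root.
echo "modprobe netconsole netconsole=$PARAM"
```

On the receiving machine, a UDP listener on the target port (for example netcat, whose flags differ between variants: "nc -u -l 6666" or "nc -u -l -p 6666") should then show kernel messages as they are emitted, including any printed just before the lockup.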
Comment 10 Jesse Barnes 2010-07-23 13:36:58 UTC
Any news?  Another option would be to try an Ubuntu or Fedora live disc and see if they're as unstable as your current config.  There are a lot of aspects of the configuration that could go wrong; using a whole separate stack would help rule some of those out.
Comment 11 Jesse Barnes 2010-07-23 13:37:43 UTC
Oh and what platform is this specifically?  A Dell, HP or other?  Or a home made system with a specific board?
Comment 12 Daniel Kasak 2010-07-25 17:07:48 UTC
No positive news at this point. I updated to 2.6.35-rc4+ ... d44a78e83f7549b3c4ae611e667a0db265cf2e00 in anholt's kernel tree, and also c20a3628c7c6b7c41efe309b712bf93eb4e92039 in mesa on Friday, and this didn't help things at all. I also selected debugfs and netconsole ( as a module ), but netconsole didn't build and I don't have time to figure out why at this point.

As for trying other distros - yeah I suppose I can try that. I'm currently using Sabayon, with some packages ( mesa, libdrm, kernel, etc, built via portage or from git ).

This system is a Lenovo 'M Series' ThinkCentre, bought by my employer. It is in its stock-standard state - I'm not allowed to open the box at all.

Also, I'm starting to wonder if the issue is actually in xf86-video-intel. I get lockups whether I have any 3D clients running or not.
Comment 13 John Perkins 2010-08-02 15:15:15 UTC
(In reply to comment #11)
> Oh and what platform is this specifically?  A Dell, HP or other?  Or a home
> made system with a specific board?

FYI: we're having the same issues with a new batch of Dell Optiplex 980 systems here at our site.  All have the Clarkdale chip in them (according to the X-server).

We're running 64-bit RHEL 5.4, kernel 2.6.18-194.3.1.el5.  X-server and Intel device driver RPMs:

xorg-x11-server-Xorg-1.1.1-48.76.el5_5.1.x86_64
xorg-x11-drv-i810-1.6.5-9.36.el5.x86_64

If there is any specific debug data that might be helpful, I can see if I can generate that and submit it for review.

John
Comment 14 John Perkins 2010-08-02 15:38:32 UTC
(In reply to comment #7)
> Hm ok, if you get a chance (I know what you mean about work machines, I hate
> when mine fails too), can you collect some additional information about the
> crash?  http://intellinuxgraphics.org/how_to_report_bug.html has a good
> overview of what we like to see in order to narrow things down.

-- chipset: IGDNG_D
-- system architecture: x86_64
-- xf86-video-intel/xserver/mesa/libdrm version: 
    xf86-video-intel: xorg-x11-drv-i810-1.6.5-9.36.el5.x86_64
    xserver: xorg-x11-server-Xorg-1.1.1-48.76.el5_5.1.x86_64
    mesa: mesa-libGL-6.5.1-7.8.el5.x86_64
    libdrm: libdrm-2.0.2-1.1.x86_64
-- kernel version: 2.6.18-194.3.1.el5
-- Linux distribution: Redhat Enterprise Linux 5.4
-- Machine or mobo model: Dell Optiplex 980
-- Display connector: displayport with DP->DVI adapter

Lockups are such that I cannot recover without poweroff/poweron.  No errors related to crash are logged to local or remote syslog host, nor are there any errors reported in /var/log/Xorg.0.log.  I can submit them (or other debugging output) if needed.

I am not able to get a dump of the VBIOS; the instructions for getting this dump do not work:

notam(su): lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02)
notam(su): echo 1 > /sys/devices/pci0000:00/0000:00:02.0/rom
/sys/devices/pci0000:00/0000:00:02.0/rom: Permission denied.
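
For reference, the usual sequence for the sysfs rom file is: write 1 to enable reads, copy the contents out, then write 0 to disable it again. The sketch below wraps that sequence and checks writability first, which is the step that fails in the transcript above. The function names rom_path and dump_rom and the output file name are illustrative only, and the sketch does not explain why the write was refused on this particular kernel.

```shell
#!/bin/sh
# Path of the sysfs ROM file for a PCI device given as domain:bus:dev.fn.
rom_path() {
    printf '/sys/bus/pci/devices/%s/rom\n' "$1"
}

# Enable the ROM, copy it to a file, then disable it again.
# Arguments: rom-file output-file
dump_rom() {
    rom=$1
    dst=$2
    if [ ! -w "$rom" ]; then
        echo "cannot write to $rom (not root, or ROM not exposed)" >&2
        return 1
    fi
    echo 1 > "$rom" || return 1   # enable reads of the ROM contents
    cat "$rom" > "$dst"
    status=$?
    echo 0 > "$rom"               # disable reads again
    return $status
}

# The device address below is the one lspci reports in this comment.
echo "ROM file: $(rom_path 0000:00:02.0)"
# As root: dump_rom "$(rom_path 0000:00:02.0)" vbios.rom
```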

Unfortunately, there is no /sys/kernel/debug/dri directory present on our systems, either, making it difficult to get a "last batch buffer before GPU hang" dump.
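
A missing /sys/kernel/debug/dri directory can mean that debugfs is not mounted, that the kernel was built without it, or that the driver in use does not export anything there. The sketch below only distinguishes the first two cases; the function names are illustrative, and it makes no claim about which case applies to the RHEL kernel in this comment.

```shell
#!/bin/sh
# Succeeds if a debugfs mount appears in the given mount table
# (defaults to /proc/mounts; the third field is the filesystem type).
debugfs_mounted() {
    awk '$3 == "debugfs" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}

# Succeeds if the running kernel lists debugfs as a known filesystem.
debugfs_supported() {
    grep -qw debugfs "${1:-/proc/filesystems}"
}

if debugfs_mounted; then
    ls /sys/kernel/debug/dri 2>/dev/null || echo "debugfs is mounted, but there is no dri directory"
elif debugfs_supported; then
    echo "debugfs is available but not mounted; as root: mount -t debugfs none /sys/kernel/debug"
else
    echo "this kernel does not appear to support debugfs"
fi
```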
Comment 15 Dave Airlie 2010-08-02 15:50:11 UTC
John,

if you are actually running RHEL (not CentOS), you should probably report the bug via Issue Tracker or by calling Red Hat support. It's unlikely we'll backport fixes unless there are customers raising the issues via the appropriate channels.

Dave.
Comment 16 Chris Wilson 2010-09-10 07:32:46 UTC
(In reply to comment #12)
> Also, I'm starting to wonder if the issue is actually in xf86-video-intel. I
> get lockups whether I have any 3D clients running or not.

Daniel, it's likely to be a hard lockup inside the kernel. There were a *lot* of Ironlake fixes that we have pushed recently, can you test

git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-fixes

which is 2.6.36-rc3 + regression fixes?
Comment 17 Daniel Kasak 2010-11-02 21:45:30 UTC
> Daniel, it's likely to be a hard lockup inside the kernel. There were a *lot*
> of Ironlake fixes that we have pushed recently

Just tested 2.6.36 and this still locks hard every 2 hours or so.
Comment 18 Jesse Barnes 2011-01-31 10:17:35 UTC
Assuming we've fixed this in 2.6.37 or current Linus, please re-open if not.
Comment 19 Daniel Kasak 2011-03-10 18:29:47 UTC
Just did a fresh install with a 2.6.37 kernel and tried the Intel driver again. Result: constant hard lockups :(

Re-opening and switching back to fbdev once more ...
Comment 20 Jesse Barnes 2011-06-08 11:11:32 UTC
We might just have to buy an affected machine to root cause this one, it worries me.
Comment 21 Jesse Barnes 2011-06-15 10:00:18 UTC
Gordon, is this something you can reproduce with your systems?
Comment 22 Gordon Jin 2011-06-16 02:08:44 UTC
(In reply to comment #21)
> Gordon, is this something you can reproduce with your systems?

No, Clarkdale works quite solidly on my side.
Comment 23 Jesse Barnes 2011-06-16 10:24:39 UTC
Daniel, can you reproduce this on your system with a different distro, e.g. a recent Ubuntu livecd or something?  I suspect a configuration issue somehow...
Comment 24 Chris Wilson 2011-07-18 07:39:58 UTC
Downgrading priority; if it wasn't fixed in the last 4 releases, it's not going to be fixed by tomorrow.
Comment 25 Jesse Barnes 2011-08-01 12:35:36 UTC
timeout, hopefully this isn't an issue anymore anyway (and please confirm with another distro before re-opening too, to rule out config)
