Bug 7770

Summary: PCI domain mismatch between X server and kernel, leaving clients unable to use direct rendering
Product: DRI
Reporter: Émeric Maschino <emeric.maschino>
Component: DRM/other
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED WORKSFORME
QA Contact:
Severity: major
Priority: high
CC: alexdeucher, emeric.maschino, idr, morgoth6, oystein, plasm
Version: unspecified
Hardware: IA64 (Itanium)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:

Attachments:

Description	Flags
xorg.conf file	none
Xorg.0.log file	none
glxinfo output	none
Output of strace glxinfo revealing the failed ioctl call	none
Output of strace glxinfo with /sys/module/drm/parameters/debug set to 1	none
Kernel output obtained with dmesg	none
glxinfo kernel output with /sys/module/drm/parameters/debug set to 1	none
Possible workaround	none
dmesg.log with DRM git patch 205c573e449b38d759273f6a51eb8c1131585ece applied	none
Xorg.0.log file with DRM git patch 205c573e449b38d759273f6a51eb8c1131585ece applied	none
log from drm compiled without hardcoded domain	none
quick drm domain patch	none

Description Émeric Maschino 2006-08-04 11:55:32 UTC

This error occurs when starting the X.org server, when invoking glxinfo, or when
starting glxgears. libGL then falls back to (slow) indirect rendering.

I can reproduce this issue on Fedora Core Rawhide (X.org 7.1, kernel 2.6.17,
Mesa 6.5), Debian GNU/Linux Testing "Etch" (X.org 7.0, kernel 2.6.16, Mesa
6.4.2) and openSUSE Factory (X.org 6.9, kernel 2.6.16, Mesa 6.5).

This issue isn't related to the graphics adapter: I've tested ATI FireGL X1
(using the r300 module) and MGA Matrox Millennium G400 DualHead (using the mga
module) graphics adapters, with the same problem. All access rights are fine,
and this error also occurs as root.

Stracing glxinfo reveals that an ioctl call fails:
open("/dev/dri/card0", O_RDWR)          = 4
ioctl(4, DECODER_SET_PICTURE, 0x60000fffff632df0) = -1 EACCES (Permission denied)

Google finds a few bug reports regarding this issue, but they are all closed.
Generally, they were reported against x86 or x86_64 architectures. The most
common cause was a missing cmpxchg CPU instruction (something related to SSE
IIRC). Itanium CPUs, like several other platforms, don't support SSE, so the
problem probably isn't there, since DRI also works on non-SSE architectures.

This is my first bug report on freedesktop.org, so I don't know whether it's
possible to attach logs (xorg.conf, Xorg.0.log, glxinfo strace output) or not.
I will try. Otherwise, drop me an email at emeric dot maschino at jouy dot inra
dot fr and I will send them to you.

Comment 1 Émeric Maschino 2006-08-04 11:57:17 UTC

Created attachment 6456 [details]
xorg.conf file

Comment 2 Émeric Maschino 2006-08-04 11:57:58 UTC

Created attachment 6457 [details]
Xorg.0.log file

Comment 3 Émeric Maschino 2006-08-04 11:59:10 UTC

Created attachment 6458 [details]
glxinfo output

Comment 4 Émeric Maschino 2006-08-04 12:00:03 UTC

Created attachment 6459 [details]
Output of strace glxinfo revealing the failed ioctl call

Comment 5 Michel Dänzer 2006-08-06 12:38:48 UTC

We need to find out which ioctl is failing. The kernel output might give a hint
if you set /sys/module/drm/parameters/debug to 1 before running glxinfo.

Comment 6 Émeric Maschino 2006-08-07 10:24:05 UTC

Created attachment 6487 [details]
Output of strace glxinfo with /sys/module/drm/parameters/debug set to 1

Comment 7 Michel Dänzer 2006-08-08 04:24:37 UTC

Thanks, but I only see strace output in there, not kernel output (as obtainable
with dmesg).

Comment 8 Émeric Maschino 2006-08-08 12:09:09 UTC

Created attachment 6501 [details]
Kernel output obtained with dmesg

Comment 9 Émeric Maschino 2006-08-08 12:12:34 UTC

(In reply to comment #7)
> Thanks, but I only see strace output in there, not kernel output (as obtainable
> with dmesg).

Sorry, I didn't know what information you were looking for. BTW, having
/sys/module/drm/parameters/debug set to 1 doesn't change the kernel output
produced by dmesg. Am I missing something trivial here? Does this kernel output
contain the information you were expecting?

Comment 10 Michel Dänzer 2006-08-09 00:44:42 UTC

(In reply to comment #9)
> BTW, having /sys/module/drm/parameters/debug set to 1 doesn't change the kernel
> output produced by dmesg. Am I missing something trivial here? Does this kernel
> output contain the information you were expecting?

No, did you set it before running glxinfo? If so, does reading the file return 1
after setting it?
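
The debugging step discussed in comments 5 to 10 (set the DRM debug parameter, confirm it reads back, then run glxinfo and collect the kernel log) can be sketched in Python as follows. This is only an illustration: the helper names `set_and_verify` and `capture_kernel_log` are made up for this sketch, and it assumes root privileges, a loaded drm module, and `glxinfo` and `dmesg` on the PATH.

```python
import subprocess
from pathlib import Path

# Path quoted in the comments above; it only exists while the drm module is loaded.
DRM_DEBUG = Path("/sys/module/drm/parameters/debug")


def set_and_verify(param, value="1"):
    """Write `value` to a module parameter file and check that it reads back.

    Hypothetical helper: returns True only if the file contains `value`
    after the write, which is the check asked for in comment 10.
    """
    param.write_text(value + "\n")
    return param.read_text().strip() == value


def capture_kernel_log(command):
    """Run `command` (its own output is discarded), then return the dmesg text."""
    subprocess.run(command, check=False, capture_output=True)
    result = subprocess.run(["dmesg"], check=True, capture_output=True, text=True)
    return result.stdout
```

As root, calling `set_and_verify(DRM_DEBUG)` and then `capture_kernel_log(["glxinfo"])` would reproduce the sequence requested here. Nothing runs at import time, so the functions can also be tried against an ordinary file first.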

Comment 11 Émeric Maschino 2006-08-09 12:20:36 UTC

Created attachment 6508 [details]
glxinfo kernel output with /sys/module/drm/parameters/debug set to 1

I have the strong impression that the buffer isn't big enough to catch all the
produced logs...

Comment 12 Émeric Maschino 2006-08-09 12:23:18 UTC

(In reply to comment #10)
> No, did you set it before running glxinfo? If so, does reading the file return 1
> after setting it?

Well, hum. I simply forgot to invoke glxinfo once the debug parameter was set to
1. Hope the output will be helpful this time.

Comment 13 Michel Dänzer 2006-08-15 01:27:04 UTC

(In reply to comment #12)
> 
> Hope the output will be helpful this time.

Yes, thanks. It looks like the failing ioctl is a red herring - it's the
setversion ioctl, which only succeeds for the X server, but failure is ignored.
It looks like the real problem is that the X server thinks the device is on PCI
domain 1 whereas the kernel thinks it's on domain 0. Changing bug fields to
reflect this. If you could try xorg-server 1.1 from X.Org 7.1, this problem
might be fixed there with some luck.

It may be possible to work around this in the X server DRI module though, I'll
attach a test patch.
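
For readers wondering how a raw ioctl request maps to "the setversion ioctl", here is a small Python sketch of the request-number encoding. It assumes the generic Linux `_IOC()` bit layout (8 bits of number, 8 bits of type, 14 bits of size, 2 bits of direction) and that the DRM set-version request is a read/write ioctl of type 'd', number 0x07, carrying a 16-byte argument. The function names are invented for the sketch. One plausible reason strace printed DECODER_SET_PICTURE is that it matched an unrelated ioctl sharing the same type and number, but that explanation is an assumption, not something established in this report.

```python
# Field widths of the generic Linux _IOC() encoding (assumed to apply here).
NR_BITS, TYPE_BITS, SIZE_BITS = 8, 8, 14
NR_SHIFT = 0
TYPE_SHIFT = NR_SHIFT + NR_BITS      # 8
SIZE_SHIFT = TYPE_SHIFT + TYPE_BITS  # 16
DIR_SHIFT = SIZE_SHIFT + SIZE_BITS   # 30

DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}


def encode_iowr(type_char, nr, size):
    """Build a read/write request number, like the C _IOWR() macro."""
    return (3 << DIR_SHIFT) | (size << SIZE_SHIFT) | (ord(type_char) << TYPE_SHIFT) | nr


def decode_ioctl(request):
    """Split a request number into direction, argument size, type and number."""
    return {
        "dir": DIRECTIONS[(request >> DIR_SHIFT) & 0x3],
        "size": (request >> SIZE_SHIFT) & 0x3FFF,
        "type": chr((request >> TYPE_SHIFT) & 0xFF),
        "nr": (request >> NR_SHIFT) & 0xFF,
    }


# Set-version: type 'd', number 0x07, four 4-byte ints as the argument.
SET_VERSION = encode_iowr("d", 0x07, 16)
```

Under these assumptions `SET_VERSION` comes out as 0xC0106407, and `decode_ioctl` recovers type 'd' and number 7 from it, the pair a tracing tool would use to pick a symbolic name.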

Comment 14 Michel Dänzer 2006-08-15 03:07:40 UTC

Created attachment 6560 [details] [review]
Possible workaround

This patch for the X server dri module might serve as a workaround. Arguably,
it should really refuse to enable the DRI in this case though.

Comment 15 Émeric Maschino 2006-08-16 13:09:07 UTC

> Yes, thanks. It looks like the failing ioctl is a red herring - it's the
> setversion ioctl, which only succeeds for the X server, but failure is ignored.
> It looks like the real problem is that the X server thinks the device is on PCI
> domain 1 whereas the kernel thinks it's on domain 0. Changing bug fields to
> reflect this. If you could try xorg-server 1.1 from X.Org 7.1, this problem
> might be fixed there with some luck.

Unfortunately, this bug is still present with X.org 7.1.1 and xorg-server 1.1.1.
Well, at least this is how the packages are numbered with current Fedora Core
Rawhide. xdpyinfo and glxinfo seem to confirm these release numbers.

Comment 16 Émeric Maschino 2006-08-16 13:20:43 UTC

(In reply to comment #14)
> Created an attachment (id=6560) [edit]
> Possible workaround
> 
> This patch for the X server dri module might serve as a workaround. Arguably,
> it should really refuse to enable the DRI in this case though.

I've tried it with X.org 7.1.1 as shipped with Fedora Core Rawhide. Great job:
glxinfo now reports "Direct rendering: Yes". Many thanks :-)

I don't know if this is related, but with your patch, glxinfo and glxgears
report that visual 0x4b isn't supported. BTW, glxgears gives me the following
output:

libGL warning: 3D driver claims to not support visual 0x4b
3452 frames in 5.0 seconds = 690.264 FPS

Two comments. First, I don't know if DRI is *really* working, since these
numbers seem a bit low. Second, these two lines are the only ones that are
displayed. Shortly after, the system locks hard: the screen freezes, I can still
move the mouse, and there is heavy HDD activity. Unfortunately, I can't
access the system remotely to diagnose what's going wrong.

Anyway, thanks for your time and consideration.

Comment 17 Michel Dänzer 2006-08-21 09:52:18 UTC

> libGL warning: 3D driver claims to not support visual 0x4b
> 3452 frames in 5.0 seconds = 690.264 FPS
> 
> Two comments. First, I don't know if DRI is *really* working, since these
> numbers seem a bit low. 

The first line (which is reported in another entry but harmless, BTW) wouldn't
be there if it wasn't using direct rendering.

> Second, these two lines are the only ones that are displayed. Shortly after,
> the system locks hard, [...]

Make sure you're running current xf86-video-ati git, a stability fix for some
R300 family cards went in there only recently.

Comment 18 Émeric Maschino 2006-08-23 13:54:57 UTC

(In reply to comment #17)
> Make sure you're running current xf86-video-ati git, a stability fix for some
> R300 family cards went in there only recently.

OK, I'll look at this. But I presume it would be preferable to open a separate
bug to track stability problems.

BTW, back to the initial problem. What's the "status" of the patch you proposed?
Will it be integrated into mainstream, or is it a "quick and dirty" hack that
also requires deeper changes elsewhere in the code and thus won't be
integrated as is?

Comment 19 Michel Dänzer 2006-08-24 07:10:20 UTC

(In reply to comment #18)
> BTW, back to the initial problem. What's the "status" of the patch you proposed?
> Will it be integrated into mainstream or is it a "quick and dirty" hack that
> also requires deeper changes elsewhere in the code and thus won't be
> integrated as is?

As I said, it's just a workaround, and IMO the X server should really refuse to
enable the DRI in the first place in this situation, as it can't assume the
different PCI IDs refer to the same device. The real fix would be to make the X
server's PCI domain numbering consistent with the kernel's. With some luck, the
pci-rework branch will take care of this. Adding Ian Romanick to the CC list.
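
To make the mismatch concrete, the following Python sketch compares two bus ID strings field by field. It assumes the `pci:domain:bus:dev.func` string form (hexadecimal domain, bus and device) used for DRM bus IDs. The example values in the usage note are hypothetical and are not taken from the attached logs, and `parse_busid` and `mismatched_fields` are names invented for this sketch.

```python
import re
from typing import NamedTuple

# Assumed layout: pci:DDDD:BB:dd.f with hexadecimal domain, bus and device.
BUSID_RE = re.compile(
    r"^pci:([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.(\d)$"
)


class PciBusId(NamedTuple):
    domain: int
    bus: int
    device: int
    function: int


def parse_busid(text):
    """Parse a 'pci:0000:01:00.0' style string into its numeric fields."""
    match = BUSID_RE.match(text.strip())
    if match is None:
        raise ValueError("unrecognised bus ID: %r" % text)
    domain, bus, device = (int(part, 16) for part in match.group(1, 2, 3))
    return PciBusId(domain, bus, device, int(match.group(4)))


def mismatched_fields(a, b):
    """Return the names of the fields on which two bus IDs disagree."""
    return [name for name, x, y in zip(PciBusId._fields, a, b) if x != y]
```

With a server-side ID of `pci:0001:80:00.0` and a kernel-side ID of `pci:0000:80:00.0` (both made up), `mismatched_fields` reports only `domain`. That is the situation described here: the two strings cannot be assumed to name the same device, even though every other field agrees.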

Comment 20 Michel Dänzer 2006-09-14 01:48:35 UTC

Can you try again without the workaround but with the DRM from current git?
Commit 205c573e449b38d759273f6a51eb8c1131585ece might have an impact on this.

Comment 21 Émeric Maschino 2006-09-14 14:29:35 UTC

(In reply to comment #20)
> Can you try again without the workaround but with the DRM from current git?
> Commit 205c573e449b38d759273f6a51eb8c1131585ece might have an impact on this.

Thanks for the update. I tried it without the workaround, but the permission
problem is still there. Do you need some kind of backtrace with the git changes
applied? Just let me know. BTW, I just noticed that the current openSUSE FACTORY
10.2 Alpha4Plus X.org 7.1.0 doesn't exhibit the permission issue but states that
DRI is disabled (although it's listed in the "Module" section of the
/etc/X11/xorg.conf file). The current Debian Etch X.org 7.0 and Fedora Core
Rawhide X.org 7.1.1 still have the permission issue.

Comment 22 Michel Dänzer 2006-09-15 06:55:23 UTC

(In reply to comment #21)
> Do you need some kind of backtrace with the git changes applied?

Just the usual X server log file and kernel output would be nice for a start.

Comment 23 Émeric Maschino 2006-09-19 11:02:10 UTC

Created attachment 7076 [details]
dmesg.log with DRM git patch 205c573e449b38d759273f6a51eb8c1131585ece applied

Comment 24 Émeric Maschino 2006-09-19 11:04:21 UTC

Created attachment 7077 [details]
Xorg.0.log file with DRM git patch 205c573e449b38d759273f6a51eb8c1131585ece applied

Comment 25 Émeric Maschino 2006-09-19 11:06:38 UTC

(In reply to comment #22)

> Just the usual X server log file and kernel output would be nice for a start.

Sorry for the delay. Please have a look at attachments id 7076 and 7077.

Comment 26 Tim Yamin 2006-10-11 10:13:13 UTC

That workaround patch makes it work for me (Gentoo xorg-server 1.1.1-r1). If you
need any information off me please just ask :)

Comment 27 Émeric Maschino 2006-10-12 13:30:50 UTC

(In reply to comment #26)
> That workaround patch makes it work for me (Gentoo xorg-server 1.1.1-r1). If you
> need any information off me please just ask :)

Just to clarify the situation, Tim was talking about the workaround id=6560, not
about the DRM git patch 205c573e449b38d759273f6a51eb8c1131585ece. So this issue
is unfortunately still present :-(

Comment 28 Benjamin Close 2008-01-11 02:38:12 UTC

Bugzilla Upgrade Mass Bug Change

NEEDSINFO state was removed in Bugzilla 3.x, reopening any bugs previously listed as NEEDSINFO.

  - benjsc
    fd.o Wrangler

Comment 29 Tim Yamin 2008-02-01 09:45:58 UTC

This bug is solved for me (w/ xorg-server 1.4.0.90) by attachment #14066 from bug #14326, which I think is the correct fix for this problem, so I'm marking this bug as a dupe.

*** This bug has been marked as a duplicate of bug 14326 ***

Comment 30 Michel Dänzer 2008-04-08 10:17:04 UTC

*** Bug 15404 has been marked as a duplicate of this bug. ***

Comment 31 Michel Dänzer 2008-04-08 10:19:28 UTC

I think this was incorrectly resolved as duplicate.

Comment 32 Michel Dänzer 2008-04-08 10:21:34 UTC

The xserver side should be fixed with pciaccess; now it's probably up to the DRM to use a real implementation of drm_get_pci_domain() instead of hardcoding it to 0.

Comment 33 Marcin Kurek 2008-04-08 14:58:35 UTC

Created attachment 15767 [details]
log from drm compiled without hardcoded domain

It seems you are right, as there is no crash when I change the hardcoded domain from 0 to pci_domain_nr() in drmP.h. But GL still doesn't work, as for some reason it doesn't add the required visuals.

[morgoth6@pegasos ~]$ glxinfo 
name of display: :0.0
Error: couldn't find RGB GLX visual

   visual  x  bf lv rg d st colorbuffer ax dp st accumbuffer  ms  cav
 id dep cl sp sz l  ci b ro  r  g  b  a bf th cl  r  g  b  a ns b eat
----------------------------------------------------------------------
0x21 24 tc  1  0  0 c  .  .  0  0  0  0  0  0  0  0  0  0  0  0 0 None
0x22 24 dc  1  0  0 c  .  .  0  0  0  0  0  0  0  0  0  0  0  0 0 None
0x6f 32 tc  1  0  0 c  .  .  0  0  0  0  0  0  0  0  0  0  0  0 0 None

Comment 34 Michel Dänzer 2008-04-09 00:03:48 UTC

(In reply to comment #33)
> It seems you are right, as there is no crash when I change the hardcoded
> domain from 0 to pci_domain_nr() in drmP.h.

Can you provide a patch for that?

Unfortunately, this change will probably break older X servers that hardcoded the domain to 0, so a full solution may require some DRM interface versioning magic.


> Error: couldn't find RGB GLX visual

That's a different issue which should be fixed now in Mesa Git.
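
The "interface versioning" idea mentioned in this comment could look roughly like the Python model below: clients that negotiated an older interface keep seeing domain 0, and only newer clients get the real domain. This is a sketch of the concept, not the DRM code. The cut-over version `DOMAIN_AWARE_SINCE` and the function name are assumptions chosen for illustration.

```python
# Hypothetical interface version (major, minor) from which clients are
# assumed to understand real PCI domains. Chosen for illustration only.
DOMAIN_AWARE_SINCE = (1, 4)


def domain_for_client(real_domain, client_interface):
    """Return the PCI domain to report to a client.

    `client_interface` is the (major, minor) interface version the client
    negotiated. Older clients hardcode domain 0, so they keep getting 0.
    """
    if client_interface < DOMAIN_AWARE_SINCE:
        return 0  # legacy behaviour, keeps older X servers working
    return real_domain
```

For a device that really sits on domain 1, a client on interface (1, 2) would still be told 0, while one on (1, 4) or later would be told 1. Gating on the negotiated version is what lets the kernel side change without breaking servers that were built around the hardcoded value.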

Comment 35 Marcin Kurek 2008-04-09 00:39:24 UTC

Sure. It was just a quick hack to see if this would fix the problem here.

Comment 36 Marcin Kurek 2008-04-09 00:39:53 UTC

Created attachment 15774 [details] [review]
quick drm domain patch

Comment 37 Marcin Kurek 2008-04-13 03:25:54 UTC

True. With the recent Mesa patch, GL works just fine here; only the number of visuals has decreased a lot:

   visual  x  bf lv rg d st colorbuffer ax dp st accumbuffer  ms  cav
 id dep cl sp sz l  ci b ro  r  g  b  a bf th cl  r  g  b  a ns b eat
----------------------------------------------------------------------
0x21 24 tc  0 32  0 r  y  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None
0x22 24 dc  0 32  0 r  y  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None
0x6f 32 tc  0 32  0 r  .  .  8  8  8  8  0 24  0  0  0  0  0  0 0 None

Comment 38 Émeric Maschino 2009-02-27 14:13:58 UTC

Hi,

I'm currently running Debian GNU/Linux Testing "Squeeze" on my hp workstation zx6000 sporting an ATI FireGL X1 graphics adapter.

I've upgraded the X Window system with the (Debian Experimental?) X.org 7.4~5, X server 1.5.99.902 (i.e. 1.6.0 RC2), Mesa 7.3.1 and open source radeon driver 6.11.0 packages available on the Debian FTP mirrors, and DRI is back again on ia64/Itanium :-)

The bad news is that it's highly unstable. I mean, you can query information with glxinfo without a problem, but don't try a GL screensaver or play with glxgears: you'll lock your system hard within seconds.

I'm now trying to find out how to provide valuable debug information to the X.org/Debian developers. It would be nice to make 3D hardware acceleration on ia64/Itanium a reality again.

Thank you for the work accomplished so far.

     Émeric

Comment 39 Michel Dänzer 2009-02-28 02:39:42 UTC

(In reply to comment #38)
> The bad news is that it's highly unstable. I mean, you can query information
> with glxinfo without a problem, but don't try a GL screensaver or play with
> glxgears: you'll lock your system hard within seconds.

Does it work better with different values for Option "AGPMode", or with Option "BusType" "PCI"?

Comment 40 Émeric Maschino 2009-03-01 06:03:09 UTC

(In reply to comment #39)
> Does it work better with different values for Option "AGPMode", or with Option
> "BusType" "PCI"?

Yes, indeed!

Downgrading (well, performance-wise, there's no big difference) to AGP 2x (rather than the default AGP 4x) fixed the issue. I'm now getting ~2790fps with glxgears.

Adding BusType "AGP" with the default AGP 4x setting makes the system more stable... for roughly a minute. I'm getting ~2800fps with glxgears but the system will eventually lock hard.

With the AGP Fast Writes option enabled, X can't be started at all and the system locks hard, even in AGP 2x mode.

Both XAA and EXA acceleration architectures work properly.

Many thanks to all people involved in this great job.

    Émeric

Comment 41 Alex Deucher 2009-03-01 09:41:05 UTC

I've gone ahead and added an AGP quirk for your system:
a7f465f73363fce409870f62173d518b1bc02ae6

Comment 42 Émeric Maschino 2009-03-06 13:39:23 UTC

Hello Alex,

(In reply to comment #41)
> I've gone ahead and added an AGP quirk for your system:
> a7f465f73363fce409870f62173d518b1bc02ae6

Thank you, but could you remove this AGP quirk, please?

Here are the reasons.

"Downgrading" to AGP 2x or even AGP 1x only makes the problem appear later. I mean, rather than locking the system within seconds, it will take several minutes, but the system will eventually lock. Well, it doesn't really lock in fact; I was mistaken in my previous post. I have had some free time to perform tests since then.

At AGP 2x/1x, simple GL-applications (glxgears, GL screensavers and even Quake 2) "usually" run without a problem. I say "usually", because if you enable the shadows in Quake 2 (gl_shadows variable set to 1 in config.cfg), you will experience the issue that I will now describe.

Independently of AGP speed, serious GL applications like the SPECviewperf 7.1.1 suite (which needs to be recompiled for Linux ia64) completely flood the system within seconds. It's not hard locked as I initially thought. Indeed, I can ssh into it, and the top command reveals that the Xorg process eats all the CPU (and sometimes more, with a whopping 320% CPU utilization peak!). At this stage, I can't restart the X server locally or kill it remotely, and a reboot is needed.

Is there something I can try to help figure out what is the cause of this outrageous CPU utilization?

Thanks,

     Émeric
