Description
Jan "Yenya" Kasprzak
2008-10-21 14:50:09 UTC
I see similar problems with a bunch of radeon cards here, but was told this was due to current drivers not supporting soft-booting secondary cards at all since the libpciaccess changes. Have I misunderstood something here?

Similar problem for me, except that I was going FC6 -> FC10. I was asked to add details of my hardware here:

VideoCard0: NV: Found NVIDIA GeForce4 MX 440
VideoCard1: Chipset is SiS6326 AGP (H0) (revision 0x0b)

Relevant thread at: http://lists.freedesktop.org/archives/xorg/2008-December/040961.html

Here's some more interesting info from my X log.

(--) PCI: (0@1:0:0) nVidia Corporation NV17 [GeForce4 MX 440] rev 163, Mem @ 0xf0000000/0, 0xe0000000/0, 0xe8000000/0, BIOS @ 0x????????/131072
(--) PCI:*(0@2:2:0) Silicon Integrated Systems [SiS] 86C326 5598/6326 rev 11, Mem @ 0xf4000000/0, 0xf3000000/0, I/O @ 0x00009800/0, BIOS @ 0x????????/65536
...
(II) Primary Device is: PCI 02@00:02:0
(--) NV: Found NVIDIA GeForce4 MX 440 at 01@00:00:0

I have no idea where that "Primary Device is" part comes from, but I note that it's neither of the previously mentioned PCI IDs.

...ok, maybe it is, just formatted differently. Confusing :).

I think the same bug hit me too. I have two cards plugged in:
---
00:05.0 VGA compatible controller: S3 Inc. 86c764/765 [Trio32/64/64V+] (rev 54) [old PCI s3]
01:00.0 VGA compatible controller: nVidia Corporation NV44A [GeForce 6200] (rev a1) [newer agp card]
---
Until I added Option "NoINT10" "1" to the s3 "Device" section, I had just two black screens and one working button: poweroff. X log and config will follow ....

Created attachment 20835 [details]
X.org log with s3 and nouveau
At least nouveau screen is working ...
Created attachment 20836 [details]
xorg conf: two screens with s3 and nouveau
My s3 only has 1 MB (and works fine as the primary/only device), so I must use two separate screens, with different modes and bit depths, correct?
All of the following applies to the stock linux 2.6 kernel from a fresh installation of Fedora 10.

I have been looking into the int10 hang when initializing the BIOS of a secondary card. Since the thread on xorg@lists.freedesktop.org suggested libpciaccess as the faulty component, I checked the code. This is what I found:

- The function responsible for reading the ROM of the PCI video card is pci_device_linux_sysfs_read_rom() for the Fedora 10 case.
- This function pci_device_linux_sysfs_read_rom() is *not* exercised at all when using (only) the primary display, even when an option such as UseBIOS is in effect. So this function might as well be broken and nobody with a single display would notice.
- pci_device_linux_sysfs_read_rom() is exercised when initializing a secondary display (using "vesa" in my case) and its ROM needs to boot up.

I introduced a bit of logging in the patch libpciaccess-partial-fix-with-debug.patch that outputs messages to a file in /tmp. The basic problem is that, despite all the sysfs dance to enable the ROM, the kernel terminates the read with 0 bytes when trying to read the ROM:

Reading ROM from /sys/bus/pci/devices/0000:00:09.0/rom into address 0xb7f4a008
ROM size for /sys/bus/pci/devices/0000:00:09.0/rom is 32768 using 32768
Reading ROM from /sys/bus/pci/devices/0000:00:09.0/rom reached 0-sized read (EOF?) at offset 0
Dump of ROM from /sys/bus/pci/devices/0000:00:09.0/rom (0 bytes):
Reading ROM failed with short read, using /dev/mem to read from 0xdffe0000

I introduced an attempt at a fallback that calls pci_device_linux_devmem_read_rom() when the total amount read is less than the expected ROM size. In current git for libpciaccess, the buffer remains uninitialized and the machine hangs. I hoped that the fallback would be enough to read the ROM and fix this problem.

However, I ran into another problem. The attempted fallback ends up using pread() on /dev/mem at the offset matching the one reported for the ROM. This failed with EINVAL (Invalid argument). By using strace on the stock X server and the modified libpciaccess library, I saw that the pread implementation calls into pread64() with a very large offset of 18446744073172549632 (0xffffffffdffe0000), which is the required offset sign-extended into 64 bits instead of zero-extended as required. This might point to a bug in glibc headers or code, but I worked around it by replacing the call with a pread64() call, as seen in libpciaccess-partial-fix-without-debug.patch.

Now, here comes the third problem: the passed address makes pread64() return EFAULT (Bad address). I did not have time to find out whether this address is intended or not. However, libpciaccess-partial-fix-without-debug.patch is enough to replace the hang with a graceful exit that allows the user to sort-of regain control of the machine. Final strace is attached, search for EFAULT in the text.

Please comment on this.

Created attachment 20837 [details] [review]
Patch to add error checking and debugging to file on libpciaccess
This is the patch that I used to create the debug log. Notice that the kernel terminates the read from ROM at offset 0.
Created attachment 20838 [details]
Log file created with debug patch.
Created attachment 20839 [details] [review]
Patch to add error checking and attempted fallback on libpciaccess, cleaned up.
This patch is enough to turn the hang into the error it should have been.
Created attachment 20840 [details]
strace output with debug patch
This is the strace output. You can search for EFAULT when trying to read using pread64() on /dev/mem .
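
For anyone checking the offset arithmetic from the strace, here is a small illustrative sketch in Python (not code from libpciaccess or glibc). It only reproduces the two ways a 32-bit value such as 0xdffe0000 can be widened to 64 bits; the helper names are made up for this example.

```python
ROM_PHYS_ADDR = 0xDFFE0000  # /dev/mem offset taken from the debug log above


def sign_extend_32_to_64(value):
    """Treat value as a signed 32-bit integer, then view it as unsigned 64-bit."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:          # top bit set: negative as a signed 32-bit int
        value -= 1 << 32
    return value & 0xFFFFFFFFFFFFFFFF


def zero_extend_32_to_64(value):
    """Treat value as an unsigned 32-bit integer; the upper 32 bits stay zero."""
    return value & 0xFFFFFFFF


print(hex(sign_extend_32_to_64(ROM_PHYS_ADDR)))  # 0xffffffffdffe0000
print(sign_extend_32_to_64(ROM_PHYS_ADDR))       # 18446744073172549632
print(hex(zero_extend_32_to_64(ROM_PHYS_ADDR)))  # 0xdffe0000
```

The sign-extended value matches the offset seen in the pread64() call, which is consistent with the offset passing through a signed 32-bit type somewhere before reaching the syscall.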
Wonderful! I was hoping to get onto this sometime, but you got further than I would've been able to. Am I correct in understanding that the real problem is probably that /sys/bus/pci/devices/0000:00:09.0/rom (or whatever) actually returns a 0 size, so this is really a kernel problem, rather than an Xorg problem?

(In reply to comment #13)
> Wonderful! I was hoping to get onto this sometime, but you got further than I
> would've been able to. Am I correct in understanding that the real problem is
> probably that /sys/bus/pci/devices/0000:00:09.0/rom (or whatever) actually
> returns a 0 size, so this is really a kernel problem, rather than an Xorg
> problem?
>

Apparently it is. Assuming that the sysfs interface is supposed to give access to any PCI ROM, not just the ones from VGA chipsets, then the interface is not (always) working as documented. My work machine has three chipsets with ROMs, as declared by sysfs:

[root@srv64 ~]# cd /sys/devices/
[root@srv64 devices]# find . -name rom
./pci0000:00/0000:00:01.0/0000:01:05.0/rom
./pci0000:00/0000:00:11.0/rom
./pci0000:00/0000:00:12.0/rom

These devices match the following declarations in the output of lspci -v:

00:01.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge (prog-if 00 [Normal decode])
    Flags: bus master, 66MHz, medium devsel, latency 99
    Bus: primary=00, secondary=01, subordinate=01, sec-latency=68
    I/O behind bridge: 0000e000-0000efff
    Memory behind bridge: fde00000-fdefffff
    Prefetchable memory behind bridge: d8000000-dfffffff
    Capabilities: [b0] Subsystem: Intel Corporation Unknown device d600

01:05.0 VGA compatible controller: ATI Technologies Inc RC410 [Radeon Xpress 200] (prog-if 00 [VGA controller])
    Subsystem: Intel Corporation Unknown device d600
    Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 17
    Memory at d8000000 (32-bit, prefetchable) [size=128M]
    I/O ports at ee00 [size=256]
    Memory at fdef0000 (32-bit, non-prefetchable) [size=64K]
    [virtual] Expansion ROM at fde00000 [disabled] [size=128K]
    Capabilities: [50] Power Management version 2
    Capabilities: [80] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
    Kernel driver in use: radeon

00:11.0 IDE interface: ATI Technologies Inc 437A Serial ATA Controller (rev 80) (prog-if 8f [Master SecP SecO PriP PriO])
    Subsystem: Intel Corporation Unknown device d600
    Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 23
    I/O ports at ff00 [size=8]
    I/O ports at fe00 [size=4]
    I/O ports at fd00 [size=8]
    I/O ports at fc00 [size=4]
    I/O ports at fb00 [size=16]
    Memory at fe02f000 (32-bit, non-prefetchable) [size=512]
    Expansion ROM at 40000000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 2
    Capabilities: [50] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
    Kernel driver in use: sata_sil
    Kernel modules: sata_sil

00:12.0 IDE interface: ATI Technologies Inc 4379 Serial ATA Controller (rev 80) (prog-if 8f [Master SecP SecO PriP PriO])
    Subsystem: Intel Corporation Unknown device d600
    Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 22
    I/O ports at fa00 [size=8]
    I/O ports at f900 [size=4]
    I/O ports at f800 [size=8]
    I/O ports at f700 [size=4]
    I/O ports at f600 [size=16]
    Memory at fe02e000 (32-bit, non-prefetchable) [size=512]
    Expansion ROM at 40080000 [disabled] [size=512K]
    Capabilities: [60] Power Management version 2
    Capabilities: [50] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
    Kernel driver in use: sata_sil
    Kernel modules: sata_sil

So I have a radeon chipset with ROM, and two SATA controllers.
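
The `find . -name rom` step above can also be sketched in Python, for anyone who wants to script the check across several machines. This is only an illustration: the function name is made up here, and the default root of /sys/devices is simply the directory used in the shell session above.

```python
import os


def find_rom_attributes(sysfs_devices_root="/sys/devices"):
    """Return the sorted paths of every file named "rom" below the given root."""
    hits = []
    # os.walk does not follow symlinks by default, which avoids sysfs link loops.
    for dirpath, _dirnames, filenames in os.walk(sysfs_devices_root):
        if "rom" in filenames:
            hits.append(os.path.join(dirpath, "rom"))
    return sorted(hits)
```

Calling `find_rom_attributes()` with no argument should list the same rom files that `find` reports, one per device that declares an expansion ROM.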
Now, consider the following Perl snippet:

[root@srv64 0000:01:05.0]# perl -w -e '
use IO::Handle;
sysopen(ROM, "rom", 2) or die($!);   # 2 == O_RDWR
binmode(ROM);
syswrite(ROM, "1", 1);               # enable reading of the ROM
sysseek(ROM, 0, 0);
$buffer = "";
while (sysread(ROM, $buffer, 1024) > 0) { print $buffer; };
sysseek(ROM, 0, 0);
syswrite(ROM, "0", 1);               # disable reading of the ROM again
close(ROM);' > /tmp/radeon_rom.bin

On my work machine, if I change directory to /sys/devices/pci0000:00/0000:00:01.0/0000:01:05.0/ (the radeon chipset) and paste the script, I get:

[root@srv64 0000:01:05.0]# ls -l /tmp/radeon_rom.bin
-rw-r--r-- 1 root root 49152 dic 5 16:33 /tmp/radeon_rom.bin

...proving that the Radeon ROM is readable. However, if I try the same with either SATA ROM, I get a 0-sized file. So the kernel behavior (2.6.26.6-49.fc8) is at least inconsistent. I have opened a kernel bug at http://bugzilla.kernel.org/show_bug.cgi?id=12168 for this issue.

So, am I right in thinking that this is not a bug in any X.org code? It's a blocker for the 1.6 release currently, and so it needs to be dealt with.

The short version: apply patch from comment #11.

The long version:

No. It's a combination of bugs. So far, we've identified bugs in:
- Xorg
- libc possibly
- kernel PCI

Comment #11 has a patch attached which, if I understand correctly, solves at least one Xorg problem, and works around the libc problem. That leaves only the kernel PCI problem. We may identify more bugs when the kernel PCI problem is solved, but that's the only known problem.

(In reply to comment #17)
> The short version: apply patch from comment #11.
>
> The long version:
>
> No. It's a combination of bugs. So far, we've identified bugs in:
> - Xorg
> - libc possibly
> - kernel PCI
>
> Comment #11 has a patch attached which, if I understand correctly, solves at
> least one Xorg problem, and works around the libc problem. That leaves only
> the kernel PCI problem. We may identify more bugs when the kernel PCI problem
> is solved, but that's the only known problem.
>

The patch actually solves the libpciaccess problems, but there is still the issue that (apparently) xorg provides an invalid address for the ROM copy buffer, which botched the attempt at a fallback. I was too tired to check the provenance of the buffer address (the one that reports EFAULT in the strace). I will try to check that in the next few days, unless somebody beats me to it. It is either an invalid address, or my poor understanding of the arguments to the pread64 function as provided by libc from Fedora 10.

Created attachment 21246 [details]
Xorg.0.log of xserver with recent git
News for this bug: at least on my home machine (ProSavageDDR + OAK Spitfire), a recent git tree for the xserver appears to be successful in reading the Oak video BIOS, but still hangs at startup. My scripts still fail at reading the same ROM.
I notice that the strace output ends up with a couple of vm86old() calls failing with ENOSYS. My kernel is 2.6.28-rc7 on a Pentium-4 machine running 32-bit code only. No 64-bit support. So vm86 mode should be usable, right?
News for my setup (Box 2 from the original report: x86_64 with 2x Radeon HD 3450 PCIe cards):

xorg-x11-server-Xorg-1.5.3-5.fc10.x86_64
xorg-x11-drv-ati-6.9.0-61.fc10.x86_64
kernel 2.6.28-rc8 (from kernel.org, not from Fedora)

I am now able to start the X server on both cards and use the dual-seat setup, with the following problems:

1. When the secondary X server is started, the _primary_ card also gets rebooted (it displays a blank screen, then switches to text mode, and the first row shows something that looks like a VGA BIOS version). The workaround is to start the primary X server _after_ the secondary card gets booted and initialized. Fortunately xdm can do this.

2. When I kill the secondary X server, the computer locks up (no response to ping, no reaction to pressing NumLock, etc.). It happens no matter how the X server is killed. I tried Ctrl-Alt-Backspace and sending SIGTERM; the computer stops responding to ping about two seconds after that.

Other than that, it seems that my configuration is mostly usable as dual-seat now.

With a fairly recent kernel ("kernel-2.6.27.4-47.rc3.fc10.x86_64") and Xorg server, I'm still unable to use more than one screen (although it appears I can now use one of the secondary cards on its own). Will attach a log in a moment or two ...

[root@bill ~]# Xorg -version

X.Org X Server 1.5.3
Release Date: 5 November 2008
X Protocol Version 11, Revision 0
Build Operating System: Linux 2.6.18-92.1.18.el5 x86_64
Current Operating System: Linux bill.wcn.co.uk 2.6.24.7-92.fc8 #1 SMP Wed May 7 16:26:02 EDT 2008 x86_64
Build Date: 11 December 2008 05:27:30PM
Build ID: xorg-x11-server 1.5.3-6.fc10
    Before reporting problems, check http://wiki.x.org
    to make sure that you have the latest version.

Created attachment 21318 [details]
Xorg.0.log from failed attempt to use secondary card
Oops. No, what actually happened was, attempting to use the primary plus one secondary card led to a working display, but only the primary card working. The secondary card still fails (in fact it appears from the log to misidentify the secondary cards, too).
Created attachment 21319 [details]
Config file related to previous log
Created attachment 21321 [details]
Hardware concerned with previous log and conf file.
Created attachment 21323 [details]
Log from attempt to use primary and a secondary card together
This led to a display working, but only the primary card displaying anything (and mouse was bound to the one screen, so I didn't get the impression the second was working at all, as opposed to just failure to display anything on the screen).
To transfer the knowledge back from the kernel bug: the file /sys/bus/pci/devices/<deviceID>/enable needs to have a 1 echoed into it before reading the ROM. The libpciaccess code needs to do this. In combination with the patch from comment #11, this should get things working either the same or better (depending on the setup).

I'm not likely to get around to that for 2-3 weeks, unfortunately; if someone else wants to supply such a patch, that would be wonderful.

HTH,

This won't get fixed in server-1.6 unless someone attaches a patch in the next week.

Created attachment 21937 [details] [review]
Add enable/disable through sysfs around actual reading of ROM.
Could you please check this patch for correctness? It includes the previous fixes plus an attempt to enable and disable the card around the reading of the ROM.

Looks fine on a quick skim-over; I'll unfortunately have no chance to review it properly until at least Tuesday. Hopefully someone else can.

Created attachment 22033 [details] [review]
Add enable/disable through sysfs around actual reading of ROM (try 2).
Disregard the previous patch. The previous patch assumed that the ROM read had failed if the total amount read was less than the size reported via fstat(). However, it seems the kernel makes no attempt to calculate the actual size of the ROM on a simple listing, but only on an actual read. So that check always flagged a failure, because the actual size is always less than the reported size (the reported size seems to be a multiple of 32 KB). This one considers a read failed only when no data was read at all (total size == 0).

Removing from the 1.6 blocker list now that it's been identified to be libpciaccess. Hopefully jbarnes or idr can take a look.

Yeah, enabling the device is necessary, otherwise the ROM read won't work (there's actually a kernel patch for this queued too; we don't make the "enable required" obvious enough). Aside from whitespace issues the proposed patch seems ok to me.
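
For readers who want to try the sequence outside of X, the steps discussed above can be sketched in Python. This is a rough illustration based only on what this thread describes (write 1 to `enable`, write 1 to `rom`, read, then undo both, and treat only a zero-byte result as failure). It is not the libpciaccess patch itself; the function names are made up, and whether writing 0 to `enable` afterwards is safe for a device already bound to a driver has not been checked here.

```python
import os


def read_until_eof(fd, reported_size, chunk=4096):
    """Read at most reported_size bytes from fd, stopping at the first empty read."""
    data = bytearray()
    while len(data) < reported_size:
        block = os.read(fd, min(chunk, reported_size - len(data)))
        if not block:               # 0-byte read: EOF, or the kernel refusing the ROM
            break
        data += block
    return bytes(data)


def read_pci_rom(device_dir):
    """Hypothetical helper: enable the device and its ROM, read it, then undo both."""
    enable_path = os.path.join(device_dir, "enable")
    rom_path = os.path.join(device_dir, "rom")
    with open(enable_path, "w") as enable_file:
        enable_file.write("1")      # enable the device before touching the ROM
    try:
        fd = os.open(rom_path, os.O_RDWR)
        try:
            reported_size = os.fstat(fd).st_size   # upper bound, not the real size
            os.write(fd, b"1")                     # enable ROM reads
            os.lseek(fd, 0, os.SEEK_SET)
            data = read_until_eof(fd, reported_size)
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, b"0")                     # disable ROM reads again
        finally:
            os.close(fd)
    finally:
        with open(enable_path, "w") as enable_file:
            enable_file.write("0")
    # A short read is normal (reported size is rounded up); only 0 bytes is a failure.
    if len(data) == 0:
        raise OSError("ROM read returned no data for " + rom_path)
    return data
```

It would be run as root with a device directory such as /sys/bus/pci/devices/0000:00:09.0 (the path from the debug log earlier in this bug).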
Just so it's all clear: the kernel patch is not required for this to work properly, but it does fix some problems we ran into while debugging the problem.

I'm still hoping to get around to testing the patch sometime; if someone decides we don't need more confirmation though, I'd be glad not to need to do that. :)

In order to test Alex's patch, I applied it to the sources contained in Fedora 10's libpciaccess. It conflicted with a patch called libpciaccess-fd-cache.patch, so I got rid of that. I also had to modify the file paths to get it to apply cleanly.

After application of the patch, I tried starting Xorg in 3 different configurations. None of the three resulted in a server lockup during the loading of the int10 module, so we appear to have overcome that particular problem.

Primary video card: sis driver
Secondary video card: nv driver

When I started Xorg with just the primary configured, it worked fine just like it always has.

When I started Xorg with just the secondary configured, the primary screen went black, and I couldn't get any output. However, looking at the Xorg.0.log (yes, this was the correct logfile), it appeared to have done everything appropriate except actually display something. Logging in remotely also showed all the appropriate processes running. However, I was unable to switch to a text-based virtual terminal.

When I started Xorg with both cards configured, the primary screen displayed that startup thing with the alternating black & white pixels, and then it appeared to get stuck there.

It seems to me that the appropriate action is to apply the patch, mark the bug fixed, and then open a new bug for this new problem. Thoughts anyone?

Incidentally, a message from Steven J. Newbury may be relevant here: http://lists.freedesktop.org/archives/xorg/2009-February/043918.html

Steven J. Newbury (see comment #35) posted a follow-up basically stating that he agrees that his problems are like those I continue to get. It seems that the wrong ROM is being invoked for his video card; both are read, but the wrong one is run. I can't confirm this myself, because the error that shows this appears to come from the Radeon driver, which I'm not using.

Anyway, if we get that, we'll be one step closer to having multi-video-card Xorg working again.

(In reply to comment #36)
> Steven J. Newbury (see comment #35) posted a follow-up basically stating that
> he agrees that his problems are like those I continue to get. It seems that
> the wrong ROM is being invoked for his video card; both are read, but the wrong
> one is run. I can't confirm this myself, because the error that shows this
> appears to come from the Radeon driver, which I'm not using.
>
> Anyway, if we get that, we'll be one step closer to having multi-video-card
> Xorg working again.
>

It does kind of work for me despite the log message regarding the wrong card; it may be that the BIOS in the X800 is capable of bringing up the 9250. Interestingly they both report the same timings despite being rather different hardware, although that would be expected if it's using the timings from the same BIOS to program both cards!

I'm still seeing a complete lockup (nothing further written to log, even with disk mounted using sync option; power button has to be held down to switch off) on my system. Three identical radeon PCI cards (same part numbers, etc) so it's not down to "the wrong BIOS" as far as I can tell. Primary card works fine as a single head, and I've swapped cards around to verify it's not that the other cards are "broken". What can I do to help debug this?

Given the move towards KMS, perhaps we should concentrate on getting cards initialised outside of the X server using a standalone int10 utility? Has anybody got something like that working yet?

Bill: The only thing this patch fixes is the lockup at the int10 point. It now gets further than it did, and we possibly need a new bug report because the new bug may not be in libpciaccess, but as far as real results go there's no difference, at least from my point of view.

Stephen: No, it hadn't occurred to me. I didn't even know that KMS was kernel modesetting until I Googled it.

Created attachment 23616 [details]
Xorg log attempt with both screens enabled
The attached log illustrates what happens on my computer when I try to start Xorg. One interesting thing to note is that both screen cards seem to be attempting to use int10. Is that a problem? I was under the impression that int10 was by default used only for the secondary card.
Tim: my problem was that I was still seeing the lockup as of Friday evening, and it looked from the changelogs in Fedora's rawhide packages as though this patch was included ... and it was still locking up for me. I'll have another try tonight ...

Some additional work has been done on this by some others. If I understand correctly, what's needed at the moment is something called a "VGA arbiter". Presumably equivalent functionality was removed along with the libpciaccess update. There has been some work at replacing it. The work is documented here:

http://www.x.org/wiki/VgaArbiter

To summarise, three things are needed:
- kernel VGA arbiter
- userspace library that uses the arbiter
- xserver patch that makes the xserver use the library and new kernel interface

Unfortunately, while that summary is still accurate, the page above is a little out of date, and the code was never included in any of the appropriate projects. The most up-to-date version (ie. the version that works with the current code) is elsewhere (see below), but has not been tested, hence this message.

The following patch first needs to be applied to your Linux kernel (with apologies to non-Linux types):
http://people.freedesktop.org/~airlied/kernel-vga-arbiter.patch

Then this patch needs to be applied to your xserver:
http://cgit.freedesktop.org/~airlied/xserver/log/?h=vga-arbiter

I'm unaware of any copy of the userspace library more up-to-date than the one on this page:
http://git.c3sl.ufpr.br/pub/scm/multiseat/libvgaaccess.git/

Instructions for using that last link are on the wiki page above.

If anyone has a chance to test this, feedback could be useful. Yes, I know this isn't directly related to the original problem, but most of the people really interested in multi-card xorg are already watching this bug.

(In reply to comment #43)
> Some additional work has been done on this by some others. If I understand
> correctly, what's needed at the moment is something called a "VGA arbiter".
> Presumably equivalent functionality was removed along with the libpciaccess
> update. There has been some work at replacing it. The work is documented
> here:

The VGA arbiter is a second step towards getting multiple cards working correctly. Not all video cards need to be programmed through the crappy VGA legacy registers. It's worse: it seems that there are some drivers that could avoid the VGA interface entirely but still use it.

So a good plan is to first solve the problem of secondary card initialization and then tackle the VGA arbitration.

Based on the visible garbage occasionally appearing on the primary screen whilst the BIOS is trying to POST the secondary card(s), I'd guess that the arbiter, or simply disabling the other card(s) while doing int10, would probably help a lot. I'll try the patches, but I don't have a huge amount of time. If you have a .src.rpm available for libvgawhatsit, patched kernel etc. that would be marvellous :o)

Ok, since the problem described in this bug (ie. the int10 one) is now, if I understand recent mailing list posts correctly, fixed, I've created a few more bugs:

#20816 is a master bug for getting multi-card xorg working
#20817 is a bug specifically for the VGA arbiter

Anyone who wants to discuss what needs to be done to help with/etc the VGA arbiter should from now on discuss that at bug #20817. Discussing whether the VGA arbiter is the best thing to do next should be done at #20816.

Thanks,

I'm using the patches for libpciaccess and my system is still hanging (sis and ati cards). My Xorg locks up when it tries to initialize the int10 module, specifically when it starts to emulate the operations, running into a problem that I described in 'http://bugs.freedesktop.org/show_bug.cgi?id=20816'. It does not seem fixed for me.

I have the same problem with two PCI nVidia cards using Xinerama. I described my problems on bug 20816[1]. I also added my files (Xorg.0.log, xorg.conf, and lspci output) there.

My problem is simple: originally the system hung when it tried to load the int10 module. Now, with the fix to the pciaccess library, the system doesn't hang, but I get this error:

(II) LoadModule: "int10"
(II) Reloading /usr/lib/xorg/modules//libint10.so
(II) NV(1): Initializing int10
(EE) NV(1): Cannot read V_BIOS (3) Input/output error

and X doesn't come up.

I found this:
+ If I use two nVidia PCI cards, the system hangs because I have an Intel card too (although I don't use it, Xorg can see it).
+ If I use an Nvidia PCI card and one Matrox ATI AGP, this card (the Matrox) inhibits the Intel one, but if I choose it as primary the result is the same (Cannot read V_BIOS (3) Input/output error). But if I choose the Intel as primary (in the BIOS), I can load both, but only as two individual monitors; I can't load the xinerama support.

Please, how can I fix this problem? I can't apply patches on my system, because I am still using Ubuntu Hardy while this problem remains (two new releases have been frozen).

Regards.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=20816

Mass closure: This bug has been untouched for more than six years, and is not obviously still valid. Please reopen this bug or file a new report if you continue to experience issues with current releases.