Mesa CI reports a low error rate (2/700k); however, the number of intermittent failures is consistently nonzero. This is worse than our historical results. The rarity of the failures makes it difficult to pinpoint the regression, but several errors keep recurring:

i965: Failed to submit batchbuffer: Bad address
piglit.spec.!opengl 1_1.copypixels-draw-sync ivb
piglit.spec.!opengl 1_3.gl-1_3-texture-env snb

intel_batchbuffer.c:937: submit_batch: Assertion `entry->handle == batch->batch.bo->gem_handle' failed.
piglit.spec.!opengl 1_3.gl-1_3-texture-env.bdwm64
piglit.shaders.glsl-fs-raytrace-bug27060 skl

HSW tessellation failures

deqp-gles31': corrupted double-linked list: 0x0000561e9518be50 ***
dEQP-GLES31.functional.debug.error_filters.case_0.bdwm64
Do we know what kernel version is running on the machines with failures? We do slightly different things on v4.13 and later. Wondering if it's only happening on machines with older kernels, or newer ones, or both.
Unfortunately I saw this recently on 4.14 and 4.11:

http://otc-mesa-ci.jf.intel.com/job/Leeroy/1934476/ - ivbgt2-01 4.14
http://otc-mesa-ci.jf.intel.com/job/Leeroy/1934454/ - sklgt2-04 4.11
Running piglit.shaders.glsl-fs-raytrace-bug27060, I found this valgrind warning : https://patchwork.freedesktop.org/patch/212413/
The Broadwell failure is interesting as it's clearly a memory corruption issue. Running the dEQP-GLES31.functional.debug.* tests under valgrind, I can see a few errors from the CTS suite:

Test case 'dEQP-GLES31.functional.debug.negative_coverage.callbacks.state.get_nuniformfv'..
==12081== Use of uninitialised value of size 8
==12081==    at 0x59B505E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==12081==    by 0x59B55A8: std::ostreambuf_iterator<char, std::char_traits<char> > std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>(std::ostreambuf_iterator<char, std::char_traits<char> >, std::ios_base&, char, long) const (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==12081==    by 0x59C1178: std::ostream& std::ostream::_M_insert<long>(long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.25)
==12081==    by 0x71CAF1: std::ostream& tcu::Format::operator<< <int const*>(std::ostream&, tcu::Format::Array<int const*> const&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xD9F922: std::ostream& tcu::Format::operator<< <int>(std::ostream&, tcu::Format::ArrayPointer<int> const&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xEE0160: tcu::MessageBuilder& tcu::MessageBuilder::operator<< <tcu::Format::ArrayPointer<int> >(tcu::Format::ArrayPointer<int> const&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xE9E45F: glu::CallLogWrapper::glGetIntegerv(unsigned int, int*) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xA611D2: deqp::gles31::Functional::NegativeTestShared::get_nuniformfv(deqp::gles31::Functional::NegativeTestShared::NegativeTestContext&) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0x7230DF: deqp::gles31::Functional::(anonymous namespace)::TestFunctionWrapper::call(deqp::gles31::Functional::(anonymous namespace)::DebugMessageTestContext&) const (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0x725DA1: deqp::gles31::Functional::(anonymous namespace)::CallbackErrorCase::iterate() (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0x6DCAD3: deqp::gles31::TestCaseWrapper::iterate(tcu::TestCase*) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)
==12081==    by 0xF9E157: tcu::TestSessionExecutor::iterateTestCase(tcu::TestCase*) (in /home/djdeath/src/mesa-src/VK-GL-CTS/build-es31/modules/gles31/deqp-gles31)

I'm not sure whether that's related, might be worth fixing though (trying to write some patches).
I just had my Gnome desktop crash and the only info in the log was:

i965: Failed to submit batchbuffer: Bad address

This is on Fedora 27, kernel 4.15.9, mesa 17.3.6.
It's clear to me that this bug is not simply "CI ghosts". We have a bug in Mesa which is hard to trigger, and we hit it very occasionally with the exhaustive CI infrastructure. What we need is ideas on how to narrow down the failure. Perhaps one of the branches that performs additional memory verification could help? I got nothing out of valgrind. I'm eager to get suggestions on what to do next.
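For a sense of scale when planning any attempt to narrow this down, a back-of-the-envelope calculation may help. The sketch below assumes that every run fails independently at the 2/700k rate quoted at the top of this bug. That is an assumption made purely for illustration (the real failures may well depend on machine state), and the function name `runs_needed` is made up for this example.

```python
import math


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest n such that 1 - (1 - failure_rate)**n >= confidence.

    Assumes independent runs with a constant failure rate. That is an
    illustrative assumption, not something established in this bug.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # log1p(-p) evaluates log(1 - p) accurately for very small p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-failure_rate))


if __name__ == "__main__":
    rate = 2 / 700_000  # rate quoted in the bug description
    print(runs_needed(rate, 0.95))
```

Under that assumption it takes on the order of a million runs to be reasonably sure of seeing a single failure, so any bisection or instrumentation experiment needs either a much more failure-prone test or a lot of machine time.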
I've also encountered some desktop crashes lately with

May 03 14:15:15 ossy /usr/lib/gdm3/gdm-x-session[5995]: i965: Failed to submit batchbuffer: Cannot allocate memory

but it's intermittent and yeah this sounds like a tough problem to solve.

Ubuntu 18.04; GNOME 3.28.1; Kernel 4.15.0-20-lowlatency; Intel HD Graphics 630 with modesetting on Xorg 1.19.6 (but might try the old intel driver)
*** Bug 106621 has been marked as a duplicate of this bug. ***
One of the tests that seems to reproduce this more often than others:

dEQP-GLES31.functional.debug.negative_coverage.get_error.vertex_array.draw_arrays_instanced_incomplete_primitive

produces on stderr:

corrupted size vs. prev_size

or

corrupted double-linked list

Seen on bxt, bdw, bsw, ivb
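Since this test fails more often than the others, one way to use it is to run it in a loop and stop at the first run whose stderr contains one of the messages above. The following is a minimal sketch of such a loop, not part of any Mesa or dEQP tooling: the function name `run_until_failure` is hypothetical, and the command to run (for example a deqp-gles31 invocation restricted to this test case) is left to the caller.

```python
import subprocess
import sys

# stderr fragments quoted in this bug report
MARKERS = (
    "corrupted size vs. prev_size",
    "corrupted double-linked list",
)


def run_until_failure(cmd, max_runs, markers=MARKERS):
    """Run cmd up to max_runs times.

    Returns (run_number, stderr_text) for the first run whose stderr
    contains one of the markers, or None if no run matched.
    """
    for run_number in range(1, max_runs + 1):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            text=True,
            errors="replace",  # tolerate non-UTF-8 bytes in crash output
        )
        if any(marker in proc.stderr for marker in markers):
            return run_number, proc.stderr
    return None
```

Calling `run_until_failure(sys.argv[1:], max_runs=1000)` from a small wrapper script would let the same loop be reused on each of the affected platforms. Matching on stderr rather than on the exit status keeps unrelated test failures from stopping the loop.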
We hit this bug twice in a week, and then nothing since then (5 months and 1 week). I wonder if newer kernels fixed this issue. What is the most up to date kernel that has shown this issue?
4.18. If you have a suggestion for what to run, I'll update.
(In reply to Mark Janes from comment #11)
> 4.18. If you have a suggestion for what to run, I'll update.

Our CI last saw it on Linux: 4.17.0-rc6. So I guess we are just lucky...
This issue was last seen on our CI system 8 months, 3 weeks / 4968 runs ago. Can we close this issue?
The problematic tests have been disabled in Mesa CI since June 2018. If you think this is fixed, then I can re-enable them. Mesa CI updated its kernels to 4.19 recently, but otherwise there has been no change to affect this bug.
Mesa CI reproduces these test failures immediately:

https://mesa-ci.01.org/mesa_master/builds/15252/group/63a9f0ea7bb98050796b649e85481845

Builds have fairly recent kernels:

Linux otc-gfxtest-sklgt2-01 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1 (2018-12-22) x86_64 GNU/Linux
(In reply to Mark Janes from comment #15)
> Mesa CI reproduce these test failures immediately:
>
> https://mesa-ci.01.org/mesa_master/builds/15252/group/63a9f0ea7bb98050796b649e85481845
>
> Builds have fairly recent kernels:
>
> Linux otc-gfxtest-sklgt2-01 4.19.0-1-amd64 #1 SMP Debian 4.19.12-1
> (2018-12-22) x86_64 GNU/Linux

Thanks for the info! I'll treat this as a mesa bug and since we are using your blacklist, we should be safe to just ignore it from our side. I'll close our kernel issue.

Thanks to everyone involved!
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.
I saw this once.

[Environment]
CPU: SkyLake(core i5 6500TE)
Distribution: debian(customised)
Kernel: 4.14.98
Mesa: 18.3.3
libdrm: 2.4.89

The message on stdout of the drawing module was
----
i965: Failed to submit batchbuffer: Bad address
----
and the backtrace was as follows
----
:
:
#5 0x00007f4496240b35 in exit () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007f44864d1a5d in submit_batch (out_fence_fd=0x0, in_fence_fd=<optimized out>, brw=0x47ee030) at intel_batchbuffer.c:838
#7 _intel_batchbuffer_flush_fence (line=<optimized out>, file=<optimized out>, out_fence_fd=0x0, in_fence_fd=<optimized out>, brw=0x47ee030) at intel_batchbuffer.c:891
#8 _intel_batchbuffer_flush_fence (brw=0x47ee030, in_fence_fd=<optimized out>, out_fence_fd=0x0, file=<optimized out>, line=<optimized out>) at intel_batchbuffer.c:852
#9 0x00007f44864a558a in brw_draw_single_prim (stream=<optimized out>, xfb_obj=0x0, prim_id=0, prim=0x7ffff9aa77d0, ctx=0x47ee030, indirect=<optimized out>) at brw_draw.c:898
#10 brw_draw_prims (ctx=0x47ee030, prims=<optimized out>, nr_prims=1, ib=<optimized out>, index_bounds_valid=<optimized out>, min_index=0, max_index=3, gl_xfb_obj=0x0, stream=0, indirect=0x0) at brw_draw.c:1107
#11 0x00007f448608063c in _mesa_draw_arrays (drawID=0, baseInstance=0, numInstances=1, count=4, start=0, mode=6, ctx=0x47ee030) at main/draw.c:408
#12 _mesa_draw_arrays (ctx=0x47ee030, mode=6, start=0, count=4, numInstances=1, baseInstance=0, drawID=0) at main/draw.c:385
#13 0x00007f4486081344 in _mesa_exec_DrawArrays (mode=6, start=0, count=4) at main/draw.c:565
:
:
----
(In reply to Yoshinori Gento from comment #18)
> I saw this once.

This occurred in our product.
Yoshinori: Mesa i965 team is seeking a way to reproduce this bug, so we can analyze and fix it. How often does this occur in your product? If it is reproducible, then perhaps we can use an apitrace to investigate the root cause.
(In reply to Mark Janes from comment #20)
> Yoshinori: Mesa i965 team is seeking a way to reproduce this bug, so we can
> analyze and fix it.
>
> How often does this occur in your product? If it is reproducible, then
> perhaps we can use an apitrace to investigate the root cause.

Over about one month of operation on 4 machines, I saw this problem only once, so I don't know how to reproduce it. When it happened, I was running a 'cp' command in xterm to copy some files (I think that does not matter). I will keep the machines running to find out how often it occurs.
Hmm...`cp` in xterm is a pretty clear indicator that this issue is random and not triggered by a specific workload.

Lionel suggested that it would be good to have feedback from the kernel about what didn't pass validation.

There is a kernel option to generate debug traces for that, but you have to recompile your kernel with that option. Lionel, can you provide some details?

It would be a good data point to see if a much older kernel produces this error (e.g. 4.9, 4.4). I can't deploy those kernels in Mesa i965 CI because they lack features needed to run our Vulkan test suites.
(In reply to Mark Janes from comment #22)
> Hmm...`cp` in xterm is a pretty clear indicator that this issue is random
> and not triggered by a specific workload.
>
> Lionel suggested that it would be good to have a feedback from the kernel
> about what didn't pass validation.
>
> There is a kernel option to generate debug traces for that but you have to
> recompile your kernel with that option. Lionel, can you provide some
> details?
>
> It would be a good data point to see if a much older kernel produces this
> error (eg 4.9, 4.4). I can't deploy those kernels in Mesa i965 CI because
> they lack features needed to run our Vulkan test suites.

With the kernel compiled with CONFIG_DRM_I915_DEBUG_GEM and the following command issued as root:

echo 15 > /sys/module/drm/parameters/debug

you should be able to get some traces about why the execbuffer failed. Unfortunately that generates a lot of traces...
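Because that setting produces a large volume of kernel log output, a small filter can make the captured log easier to read. The sketch below is only a convenience and makes two assumptions that are not from this thread: that the log was captured as plain text (for example from `dmesg`), and that the interesting lines mention "drm" or "i915". The helper names `read_drm_debug` and `filter_log` are hypothetical.

```python
from pathlib import Path

# sysfs path taken from the command quoted in this comment
DRM_DEBUG_PARAM = Path("/sys/module/drm/parameters/debug")


def read_drm_debug(param=DRM_DEBUG_PARAM):
    """Return the current drm debug mask as an int, or None if it
    cannot be read or parsed (e.g. the module is not loaded)."""
    try:
        # base 0 accepts both decimal ("15") and hexadecimal ("0xf") text
        return int(Path(param).read_text().strip(), 0)
    except (OSError, ValueError):
        return None


def filter_log(lines, needles=("drm", "i915")):
    """Keep only the log lines that mention any needle, ignoring case.

    The default needles are a guess at what is relevant here; adjust
    them (for instance to a PID) for the log being examined.
    """
    lowered = [needle.lower() for needle in needles]
    return [
        line.rstrip("\n")
        for line in lines
        if any(needle in line.lower() for needle in lowered)
    ]
```

Checking `read_drm_debug()` before starting a long run confirms that the mask was actually set, and `filter_log(open(path))` on the saved log trims away unrelated kernel messages.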
I haven't encountered this issue at all since moving away from modesetting and back to the intel DDX driver. So whatever extra exercises GLAMOR was doing may be triggering the bug. I'm sure that doesn't help actually fix it but it might at least help people experiencing it to have a more stable desktop.
I have seen this problem three times since yesterday. All of them occurred during a file sync over LAN with rsync. I think this problem might be related to load from disk I/O or network I/O. Unfortunately I have not recompiled the kernel with CONFIG_DRM_I915_DEBUG_GEM yet. I will try it next week.
Created attachment 143768 [details]
Debug trace.

I got debug traces. Please see the attached file. The PID to check is 2602. This process later exited with "i965: Failed to submit batchbuffer: Bad address". At that time I was repeatedly copying and deleting files with rsync.

Note: This occurred on a Core i3-6100E. Software versions are the same as above.
I found a way to reproduce this. Reading many files makes the cached memory grow until free RAM is exhausted. In this situation (caches being released and allocated repeatedly), the drawing process hits this problem. Could this contention for memory be the cause?
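The condition described here could be approximated with a script that writes a set of files and then reads them back repeatedly, so that the page cache keeps growing and being reclaimed while a GL application runs alongside. This is only a sketch of that idea, not a confirmed reproducer: the function name `churn_page_cache` and all of its parameters are hypothetical, and the total amount of data would presumably need to exceed the machine's RAM to have any effect.

```python
import os


def churn_page_cache(directory, file_count, file_size, passes, chunk=1 << 20):
    """Create file_count files of file_size bytes each in directory,
    then read them all back `passes` times.

    Returns the total number of bytes read.
    """
    block = os.urandom(min(chunk, file_size)) if file_size > 0 else b""
    paths = []
    for index in range(file_count):
        path = os.path.join(directory, f"churn_{index:05d}.bin")
        with open(path, "wb") as out:
            remaining = file_size
            while remaining > 0:
                size = min(len(block), remaining)
                out.write(block[:size])
                remaining -= size
        paths.append(path)

    total_read = 0
    for _ in range(passes):
        for path in paths:
            with open(path, "rb") as src:
                while True:
                    data = src.read(chunk)
                    if not data:
                        break
                    total_read += len(data)
    return total_read
```

The files are left in place so the caller decides when to delete them; deleting and recreating them between passes would be closer to the rsync copy-and-delete pattern reported earlier.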
Hello Yoshinori Gento

> I got how to reproduce.

Does this mean that you could provide an apitrace or some kind of reproducer? It would be really helpful.
(In reply to Denis from comment #28)
> Hello Yoshinori Gento
>
> >I got how to reproduce.
> Does this mean that you could provide an apitrace or somekind of reproducer?
> It would be really helpful.

Hello Denis

I have not produced an apitrace or a reproducer.

I updated the kernel to 4.19.57. Since then the problem occurs less often, but it still occurs.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1680.