Summary: | [Intel-gfx] As of kernel 4.3-rc1 system will not stay in S3 suspend [REGRESSION][BISTECTED] | |
---|---|---|---
Product: | DRI | Reporter: | Jairo Miramontes <jairo.daniel.miramontes.caton>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | blocker | |
Priority: | highest | CC: | dorota.czaplejewicz, dsmythies, gary.c.wang, intel-gfx-bugs, tigrangab
Version: | unspecified | Keywords: | bisected, regression
Hardware: | All | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | ALL | i915 features: | power/suspend-resume
This bug was created for tracking purposes. The issue was first reported to the intel-gfx list; refer to http://lists.freedesktop.org/archives/intel-gfx/2015-October/077592.html

As discussed, please follow up on the mailing list with a link to each regression tracking bug you create, so that the links go both ways. Thanks.

Please use links that contain the Message-ID so that it's easier to find the messages in email. Please reference the original report, like this: http://mid.gmane.org/002301d1025d$d5765090$8062f1b0$@net

John, ideas?

Additional information: After the first resume from suspend, the processor is in a bizarre state where it will not go below 2.4 GHz, even though every CPU is asking for a pstate of 16 (the minimum for my processor). This has been tested several times, on both the preceding (good) and first bad kernels, using both methods of suspend.

My processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

Example (no load):

pstate being asked for:

    # rdmsr --bitfield 15:8 -d -a 0x199
    16
    16
    16
    16
    16
    16
    16
    16

pstate that I am getting:

    # rdmsr --bitfield 15:8 -d -a 0x198
    24
    24
    24
    24
    24
    24
    24
    24

CPU freqs:

    # grep MHz /proc/cpuinfo
    cpu MHz : 2400.054
    cpu MHz : 2399.921
    cpu MHz : 2399.921
    cpu MHz : 2399.921
    cpu MHz : 2399.789
    cpu MHz : 2399.789
    cpu MHz : 2399.921
    cpu MHz : 2399.921

Did you double check the bisect by running both dc4be6071a24 and dc4be6071a24^ ?

(In reply to Jani Nikula from comment #6)
> Did you double check the bisect by running both dc4be6071a24 and
> dc4be6071a24^ ?

Yes, of course, and I said so in my initial e-mail.
Truth be known, this was my second bisection, as I must have made a mistake in my first attempt, because the double check failed.

(In reply to Doug Smythies from comment #7)
> (In reply to Jani Nikula from comment #6)
> > Did you double check the bisect by running both dc4be6071a24 and
> > dc4be6071a24^ ?
>
> Yes, of course, and I said so in my initial e-mail.
> Truth be known, this was my second bisection, as I must have made a mistake
> in my first attempt, because the double check failed.

I asked, because I suspected the bisect result might be wrong. And the symptoms in comment #5 seem odd.

Please try two things: First, run dc4be6071a24 and try suspend/resume several times, and see if it's 100% reproducible or not. Second, attach dmesg with the drm.debug=14 module parameter set (for the failing case).

Created attachment 118873 [details]
requested dmesg with drm.debug=14
Possibly relevant excerpt:
[ 399.518389] [drm] stuck on render ring
[ 399.518686] [drm] GPU HANG: ecode 6:0:0xfeffffff, reason: Ring hung, action: reset
[ 399.518686] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 399.518686] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 399.518687] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 399.518687] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 399.518687] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 399.518699] [drm:i915_reset_and_wakeup] resetting chip
[ 399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 399.518737] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[ 399.518739] dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -11
[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 399.518804] PM: Some devices failed to suspend, or early wake event detected
an edited /sys/class/drm/card0/error will be attached in a moment.
Created attachment 118874 [details]
an edited version of the file the previous attachment asked for
I edited just to remove many lines of 0's.
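
A minimal sketch of how the suspend-failure signature in the excerpt above could be picked out of a saved dmesg log automatically. The function name `find_suspend_failures`, the regular expression, and the small errno table are illustrative assumptions, not part of any kernel or i915 tooling; the sample lines are copied from the excerpt.

```python
import re

# Matches the PM core's per-device failure line, e.g.
# "[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11"
PM_FAIL_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+PM: Device (?P<dev>\S+) failed to suspend"
    r"(?: async)?: error (?P<err>-?\d+)"
)

# Linux errno names for a couple of codes (assumption: Linux numbering,
# where 11 is EAGAIN, "Resource temporarily unavailable").
LINUX_ERRNO_NAMES = {11: "EAGAIN", 16: "EBUSY"}


def find_suspend_failures(dmesg_text):
    """Return a list of (timestamp, device, errno_name) for failed suspends."""
    failures = []
    for line in dmesg_text.splitlines():
        match = PM_FAIL_RE.match(line.strip())
        if match:
            code = abs(int(match.group("err")))
            name = LINUX_ERRNO_NAMES.get(code, str(code))
            failures.append((float(match.group("ts")), match.group("dev"), name))
    return failures


# Sample lines taken from the dmesg excerpt in this report.
SAMPLE = """\
[ 399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 399.518737] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 399.518804] PM: Some devices failed to suspend, or early wake event detected
"""

if __name__ == "__main__":
    for ts, dev, name in find_suspend_failures(SAMPLE):
        print(f"{ts:.6f}s: {dev} refused to suspend ({name})")
```

Only the `PM: Device ... failed to suspend` line is matched, since it names both the device and the error code; the other lines in the excerpt carry the same -11 but in different formats.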
(In reply to Jani Nikula from comment #8)
> First, run dc4be6071a24 and try suspend/resume
> several times, and see if it's 100% reproducible or not.

Yes, it happens every time. To say 100%, I would need a sample of about 1000 attempts, and I did not do that many.

Additional information: After a fresh boot with the bad kernel, turbostat shows: Pkg%pc6 = 97.84%; PkgWatt 4.01; CorWatt 0.28; GFXWatt 0.23. Then after the first suspend/resume, turbostat shows: Pkg%pc6 = 0.00%; PkgWatt 10.08; CorWatt 3.04; GFXWatt 3.51. PkgTmp goes up by more than 10 degrees. The system is idle in both cases: 0.03% busy.

I'm having the same problem with 4.3, having skipped a few versions before that. However, I seem to have found an earlier bug that could be related; at least it sounds similar: https://bugs.freedesktop.org/show_bug.cgi?id=90253

This issue persists through kernel 4.4-rc8.

Bug is still valid in the 4.4 release.

Yes, the bug persists through kernel 4.4. Having isolated the issue down to the exact causal commit, I do not know what else I can do to move this one along.

Just an update: bug still exists in 4.5 rc5.

Bug still exists in 4.6 rc1. Is there anything I can provide to help with this issue?

This issue no longer occurs on my computer. A fresh install of Linux was done, and now the system is using systemd whereas previously it was not. I am not certain systemd is the difference; it is just my best guess.

(In reply to tigrangab from comment #18)
> Bug still exists in 4.6 rc1. Is there anything I can provide to help with
> this issue?

Did you bisect this to the same commit reported in comment #0?

Highest+Blocker as being a regression without a workaround.

Tested on drm-tip on IVB-3770, but the issue didn't appear: all suspends and resumes are fine.

tigrangab@gmail.com, can you check whether the failure still persists with the latest drm-tip? For others it seems to be resolved; see comment 19 and comment 22.
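
The P-state comparison reported earlier in this thread (requested ratio from MSR 0x199 against delivered ratio from MSR 0x198, bits 15:8 in both) can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names are hypothetical, `read_msr` assumes the Linux `msr` driver is loaded and the script runs as root, and the 100 MHz bus clock is an assumption that happens to match the 24 → 2400 MHz figures quoted in the report.

```python
import os
import struct

IA32_PERF_STATUS = 0x198  # delivered P-state (the 0x198 read in the report)
IA32_PERF_CTL = 0x199     # requested P-state (the 0x199 read in the report)


def ratio_from_msr(value):
    """Extract bits 15:8, the same field as `rdmsr --bitfield 15:8`."""
    return (value >> 8) & 0xFF


def ratio_to_mhz(ratio, bclk_mhz=100):
    """Convert a ratio to MHz; the 100 MHz bus clock is an assumption."""
    return ratio * bclk_mhz


def read_msr(cpu, msr):
    """Read one 64-bit MSR through /dev/cpu/N/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def stuck_cpus(requested, delivered):
    """Return indices of CPUs whose delivered ratio exceeds the requested one."""
    return [i for i, (req, got) in enumerate(zip(requested, delivered)) if got > req]


if __name__ == "__main__":
    # Values as reported in this bug: every CPU asks for 16 but gets 24.
    requested = [16] * 8
    delivered = [24] * 8
    for cpu in stuck_cpus(requested, delivered):
        print(f"cpu{cpu}: asked for {ratio_to_mhz(requested[cpu])} MHz, "
              f"running at {ratio_to_mhz(delivered[cpu])} MHz")
```

A delivered ratio briefly above the requested one is not necessarily a fault on its own; the symptom described in this report is that the gap persists on an idle system after resume.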
While in comment 19 above I mentioned that this issue no longer occurred on my computer, I did go back and re-install an older version of my distribution (Ubuntu) on another partition in an attempt to re-create the issue. I was unsuccessful. I tried again today, without success. My original work and kernel bisection was good and repeatable. I do not understand why I cannot re-create the failure scenario now. I can only assume it is because I did not install from the exact same iso starting point. The only hardware change was the hard disk that had failed.

In comment 13 there is a reference to a bug with similar symptoms (https://bugs.freedesktop.org/show_bug.cgi?id=90253). However, it cannot be the same root issue because, if I understand the dates correctly, the commit that this was isolated to did not exist when that bug report was entered.

I have to update my comment - probably I didn't check the correct kernel, but the issue mysteriously appeared between 4.9.0 (69973b830859bc6529a7a0468ba0d80ee5117826) and 4.10.0-rc6 from drm-tip: 2017y-02m-01d-11h-09m-17s UTC (eb9b7b42023edc1b5849d1ff3bef490b492067a3).

The system seems to wake up immediately, and there's nothing special in dmesg, even though the kernel command line includes drm.debug=0x1f

    $ cat /sys/power/state
    freeze mem disk
    $ echo 'mem' > /sys/power/state
    bash: echo: write error: Resource temporarily unavailable
    $ dmesg | tail
    [ 174.113592] systemd-journald[521]: Failed to set ACL on /var/log/journal/fe605962ccdd4f5dafb1348d1329bf81/user-1000.journal, ignoring: Operation not supported
    [ 209.294939] PM: Syncing filesystems ... done.
    [ 209.346235] PM: Preparing system for sleep (mem)

System:
Fedora 24
i7-3770 CPU @ 3.40GHz
Intel HD 4000
Kernel config used: https://intel-gfx-ci.01.org/CI/CI_DRM_2133/kernel.config.bz2

I managed to get more info about the issue I'm seeing. The symptoms have been mostly consistent with my previous post. The failure to sleep does not happen every time; I had to reboot up to 5 times for the first failure to happen. Because of that, I'm not 100% certain that "good" commits are really bug-free - I tested at most 5 reboots. "Bad" commits are definitely correct though.

Suspend failure is somewhat correlated with failures in dmesg, like:

    [ 10.287387] BUG: unable to handle kernel paging request at ffffffffa041d828

This commit came out as bad:

    commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
    Merge: a2cd64f 2cfe8f8
    Author: David S. Miller <davem@davemloft.net>
    Date: Sun Jan 8 22:01:22 2017 -0500

    Merge branch 'bcm_sf2-fixes'

Created attachment 129436 [details]
git bisect result
good commits are those which survived 5 warm reboots without failing to suspend
Created attachment 129437 [details]
first bad commit dmesg
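
The caveat above, that a commit marked "good" only survived 5 reboots, can be put into numbers. The sketch below assumes each boot fails to suspend independently with the same probability; that per-boot probability is not given anywhere in this report, so the 0.2 used in the example is purely hypothetical, as are the function names.

```python
import math


def false_good_probability(p_fail_per_boot, boots):
    """Chance that a genuinely bad kernel survives `boots` trials without failing.

    Assumes independent trials with a fixed failure probability per boot.
    """
    return (1.0 - p_fail_per_boot) ** boots


def boots_needed(p_fail_per_boot, max_false_good):
    """Smallest trial count that keeps the false-good chance at or below the target."""
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail_per_boot))


if __name__ == "__main__":
    p = 0.2  # hypothetical per-boot failure rate, not measured in this report
    print(f"false-good chance after 5 boots: {false_good_probability(p, 5):.3f}")
    print(f"boots needed for <= 5% false-good: {boots_needed(p, 0.05)}")
```

With an intermittent failure, a single wrong "good" verdict sends the rest of the bisection down the wrong half of the history, which is one way a bisect can land on an unrelated merge commit.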
(In reply to Dorota Czaplejewicz from comment #26)
> This commit came out as bad:
> commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
> Merge: a2cd64f 2cfe8f8
> Author: David S. Miller <davem@davemloft.net>
> Date: Sun Jan 8 22:01:22 2017 -0500
>
> Merge branch 'bcm_sf2-fixes'

And that one has nothing to do with Intel graphics...

Based on comment 24, comment 26 and comment 29, I would propose that this be closed. Should we pass this bug to another product+component, or even another bugzilla?

Agree, it should be closed. Thanks, Doug Smythies, for your confirmation.
> This started somewhere between Kernel 4.2 and 4.3-rc1,
> but I only noticed it a day ago.
>
> The first S3 suspend after a fresh boot works fine.
> Thereafter, suspends simply resume again immediately.
>
> I get the following errors on my console:
>
> [ 152.697247] i915 0000:00:02.0: GEM idle failed, resume might fail
> [ 152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
> [ 152.697262] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -11
> [ 152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11
> [ 152.697306] PM: Some devices failed to suspend, or early wake event detected
>
> The issue is not limited to my normal way of doing suspend, using "pm-suspend".
> It also happens using the "echo mem > /sys/power/state" method.
>
> The kernel was bisected, and the result was double checked by clean compiles
> of the first bad commit and the immediately preceding commit. Bisect results
> copied below:
>
> $ git bisect good
> dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2 is the first bad commit
> commit dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2
> Author: John Harrison <John.C.Harrison at Intel.com>
> Date: Fri May 29 17:43:39 2015 +0100
>
> drm/i915: Add explicit request management to i915_gem_init_hw()
>
> Now that a single per ring loop is being done for all the different
> intialisation steps in i915_gem_init_hw(), it is possible to add proper request
> management as well. The last remaining issue is that the context enable call
> eventually ends up within *_render_state_init() and this does its own private
> _i915_add_request() call.
>
> This patch adds explicit request creation and submission to the top level loop
> and removes the add_request() from deep within the sub-functions.
>
> v2: Updated for removal of batch_obj from add_request call in previous patch.
>
> For: VIZ-5115
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Reviewed-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
>
> :040000 040000 789c630ff3f5f07238a5df1bde79187c6c1251d0 2da3f7e20e2642d8eebd9f72528923c2ac53a8cb M drivers