
The Problem. With Nvidia drivers 285.xx and later, you may start receiving errors such as “GetDeviceRemovedReason” or “Display driver nvlddmkm stopped responding and has successfully recovered” while playing games. (Refer to sample image)

TDR Error

While I do not know what Nvidia did to start this problem, it is rampant out on the internet. There are many angry posts about the situation, and Nvidia doesn’t appear to be doing anything 5 months later. I myself have suffered this with SLI’d 9800 GTX++ cards. These errors are raised by Timeout Detection and Recovery (TDR), a feature introduced in Windows Vista and present in all later operating systems.

Before I get too far into what TDR is, as it can be a lengthy topic, I want to mention what I ended up doing and what I found in my scenario. As I mentioned, I had two 9800 GTX++ cards in an SLI configuration, and I started getting this error specifically in Battlefield 3. After much troubleshooting and frustration, I started using MSI Afterburner to monitor the GPU usage of both cards as I played the game. I found that every time the error occurred, one of my cards was locked at 100% utilization, where normally they both hovered around 97-99%. It was always the same card that showed 100%. So I removed that card and the problem went away. I concluded that the issue was caused by a hardware defect in that specific card. I then upgraded to an AMD Radeon HD 7970, which was well worth the $500 I spent at the time. It is over 2 years later and I still use the card and run all games at max settings.

What is it? Timeout Detection and Recovery

Windows Vista and later operating systems attempt to detect situations in which computers appear to be completely “frozen”. They then attempt to dynamically recover from the frozen situations so that their desktops are responsive again. This process of detection and recovery is known as timeout detection and recovery (TDR). In the TDR process, the operating system’s GPU scheduler calls the display miniport driver’s DxgkDdiResetFromTimeout function to reinitialize the driver and reset the GPU. Therefore, end users are not required to reboot the operating system, which greatly enhances their experience. The only visible artifact from the hang detection to the recovery is a screen flicker. This screen flicker results when the operating system resets some portions of the graphics stack, which causes a screen redraw. Some legacy DirectX applications (for example, those DirectX applications that conform to DirectX versions earlier than 9.0) might render to a black screen at the end of this recovery. The end user would have to restart these applications.

The following sequence briefly describes the TDR process:

  1. Timeout detection

The GPU scheduler, which is part of the DirectX graphics kernel subsystem (Dxgkrnl.sys), detects that the GPU is taking more than the permitted amount of time to execute a particular task. The GPU scheduler then tries to preempt this particular task. The preempt operation has a “wait” timeout, which is the actual TDR timeout. This step is thus the timeout detection phase of the process. The default timeout period in Windows Vista and later operating systems is 2 seconds. If the GPU cannot complete or preempt the current task within the TDR timeout period, the operating system diagnoses that the GPU is frozen.

To prevent timeout detection from occurring, hardware vendors should ensure that graphics operations (that is, DMA buffer completion) take no more than 2 seconds in end-user scenarios such as productivity and game play.

  2. Preparation for recovery:

The operating system’s GPU scheduler calls the display miniport driver’s DxgkDdiResetFromTimeout function to inform the driver that the operating system detected a timeout. The driver must then reinitialize itself and reset the GPU. In addition, the driver must stop accessing memory and should not access hardware. The operating system and the driver collect hardware and other state information that could be useful for post-mortem diagnosis.

  3. Desktop recovery:

The operating system resets the appropriate state of the graphics stack. The video memory manager, which is also part of Dxgkrnl.sys, purges all allocations from video memory. The display miniport driver resets the GPU hardware state. The graphics stack takes the final actions and restores the desktop to the responsive state. As previously mentioned, some legacy DirectX applications might render just black at the end of this recovery, which requires the end user to restart these applications. Well-written DirectX 9Ex and DirectX 10 and later applications that handle Device Remove technology continue to work correctly. An application must release and then recreate its Direct3D device and all of the device’s objects. For more information about how DirectX applications recover, see the Windows SDK.

(Information provided by jimbonbon at http://forums.nvidia.com/index.php?showtopic=100800)

 Common issues that can cause a TDR: 

  • Incorrect memory timings or voltages
  • Insufficient/problematic PSU
  • Corrupt driver install
  • Overheating
  • Unstable overclocks (GPU or CPU)
  • Incorrect MB voltages (generally NB/SB)
  • Faulty graphics card
  • A badly written driver or piece of software, but this is an unlikely cause in most cases
  • Driver conflicts
  • Another possibility, which people tend not to like to hear, is that you are simply asking too much of your graphics card. If your settings are so high that the graphics card struggles and drops to a very low FPS, and then something graphically complex occurs, the GPU may not be able to respond in time and a TDR error may occur
  • Some users have experienced TDR errors whilst browsing the web with the 280.xx, 285.xx and 290.xx drivers. Please head to this link to clarify if this is relevant to you – this is quite a specific issue which seems to predominantly affect web browsing as opposed to gaming. Some users have found that changing the power management mode to ‘Prefer Maximum Performance’ has helped, with many others reporting that 295.73 has resolved the issue.

Things to check or consider initially in your troubleshooting:

  • Check for newer driver version or cleanly uninstall/re-install your drivers. Great description of how to do this here (full credit to DJNOOB for this).
  • If you have multiple ‘GPU tools’ like EVGA Precision and MSI Afterburner installed, keep in mind that it is advisable to run only one such tool at a time.
  • If the issue is only with a specific game, check for patches.
  • If this is a new problem for you, have you just added any new hardware or updated/installed any new drivers? Consider rolling them back.
  • Check temperatures. It’s important that you check these under load, which is generally when a TDR event will occur. Everest Ultimate Edition is a good tool for this, as is OCCT’s GPU stress test. If things are too hot, you can use tools such as EVGA Precision to increase GPU fan speeds on graphics cards. Cleaning your system of dust can help temperatures significantly. Common sense will normally tell you if something is too hot, but if you aren’t sure, the information is generally available online.
  • Check that your RAM is running at the correct settings as defined by the manufacturer.
  • Remove any overclocks on your system and test with stock clocks. This includes memory, CPU and GPU (even factory OC’d cards). Best to try each separately so you can be sure if one solves the issue.
  • Attempt a CMOS reset to return all BIOS settings to default. This is a good hardware troubleshooting step as it also resets the IRQ assignments. You can normally reset the CMOS either through a jumper on the motherboard (see manual), or by disconnecting the mains power and taking out the motherboard battery for 5 minutes. You will likely need to go into the BIOS after this reset to check that the memory timings/voltages are correct, as these will not always be set correctly automatically.

Additional steps:

  • Run memtest (memtest.org). This should complete with NO errors.
  • If you have just installed a new graphics card, check your PSU ratings. Is it providing enough power and, most importantly, enough amps on the 12V rail?
  • If you are using SLI, try each card separately to see if the fault lies with one.
  • Try graphics card/cards in another computer if you can.

Last Resort Solution:

As a last resort, you can add registry entries that modify the behavior of TDR. This is not supported by Microsoft. But if you are like me, you may feel you have no other option.

You can use the following TDR-related registry keys for testing or debugging purposes only. That is, they should not be manipulated by any applications outside targeted testing or debugging.

  • TdrLevel – Specifies the initial level of recovery. The default value is to recover on timeout (TdrLevelRecover).
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrLevel
    ValueType : REG_DWORD
    ValueData :

    TdrLevelOff (0) – Detection disabled.
    TdrLevelBugcheck (1) – Bug check on detected timeout, for example, no recovery.
    TdrLevelRecoverVGA (2) – Recover to VGA (not implemented).
    TdrLevelRecover (3) – Recover on timeout. This is the default value.

  • TdrDelay – Specifies the number of seconds that the GPU can delay the preempt request from the GPU scheduler. This is effectively the timeout threshold. The default value is 2 seconds.
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrDelay
    ValueType : REG_DWORD
    ValueData : Number of seconds to delay. 2 seconds is the default value.
  • TdrDdiDelay – Specifies the number of seconds that the operating system allows threads to leave the driver. After a specified time, the operating system bug-checks the computer with the code VIDEO_TDR_FAILURE (0x116). The default value is 5 seconds.
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrDdiDelay
    ValueType : REG_DWORD
    ValueData : Number of seconds to leave the driver. 5 seconds is the default value.
  • TdrTestMode – Reserved. Do not use.
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrTestMode
    ValueType : REG_DWORD
    ValueData : Do not use.
  • TdrDebugMode – Specifies the debugging-related behavior of the TDR process. The default value is TDR_DEBUG_MODE_RECOVER_NO_PROMPT, which indicates not to break into the debugger.
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrDebugMode
    ValueType : REG_DWORD
    ValueData :

    TDR_DEBUG_MODE_OFF (0) – Break to kernel debugger before the recovery to allow investigation of the timeout.
    TDR_DEBUG_MODE_IGNORE_TIMEOUT (1) – Ignore any timeout.
    TDR_DEBUG_MODE_RECOVER_NO_PROMPT (2) – Recover without breaking into the debugger. This is the default value.
    TDR_DEBUG_MODE_RECOVER_UNCONDITIONAL (3) – Recover even if some recovery conditions are not met (for example, recover on consecutive timeouts).

  • TdrLimitTime – Supported in Windows Server 2008 and later versions, and Windows Vista with Service Pack 1 (SP1) and later versions. Specifies the default time within which a specific number of TDRs (specified by the TdrLimitCount key) are allowed without crashing the computer. The default value is 60 seconds.
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrLimitTime
    ValueType : REG_DWORD
    ValueData : Number of seconds before crashing. 60 seconds is the default value.
  • TdrLimitCount – Supported in Windows Server 2008 and later versions, and Windows Vista with Service Pack 1 (SP1) and later versions. Specifies the default number of TDRs (0x117) that are allowed during the time specified by the TdrLimitTime key without crashing the computer. The default value is 5.
    KeyPath : HKLM\System\CurrentControlSet\Control\GraphicsDrivers
    KeyValue : TdrLimitCount
    ValueType : REG_DWORD
    ValueData : Number of TDRs before crashing. The default value is 5.

What I did:

    First I added the following five registry entries at the common KeyPath mentioned in those settings. I added them all with the default values.

    Defaults – Decimal (Hexadecimal)
    TdrLevel – 3 (0x3)

    TdrDelay – 2 (0x2)

    TdrDdiDelay – 5 (0x5)

    TdrLimitTime – 60 (0x3c)

    TdrLimitCount – 5 (0x5)
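
    If you would rather import these values than type them in by hand, here is a minimal Python sketch that renders them as text in the standard .reg file format. The names render_reg and TDR_DEFAULTS are made up for illustration; the key path and value names are the ones from the settings list. The script only builds a string. Nothing touches the registry unless you save the output to a .reg file and import it yourself.

```python
# Hypothetical helper: builds .reg file text for the TDR values listed above.
# It does not read or write the registry itself.
KEY_PATH = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers"

# Default values as given in this post (decimal).
TDR_DEFAULTS = {
    "TdrLevel": 3,
    "TdrDelay": 2,
    "TdrDdiDelay": 5,
    "TdrLimitTime": 60,
    "TdrLimitCount": 5,
}


def render_reg(values):
    """Return .reg file text that sets each name to a REG_DWORD value."""
    lines = ["Windows Registry Editor Version 5.00", "", f"[{KEY_PATH}]"]
    for name, number in values.items():
        if not 0 <= number <= 0xFFFFFFFF:
            raise ValueError(f"{name} does not fit in a DWORD: {number}")
        # .reg files store DWORDs as 8 hexadecimal digits, e.g. 60 -> 0000003c
        lines.append(f'"{name}"=dword:{number:08x}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_reg(TDR_DEFAULTS))
```

    Letting the script do the decimal-to-hexadecimal conversion also avoids mismatched pairs when you change a value later.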

     Test #1 –
    TdrLevel – 3 (0x3)

    TdrDelay – 2 (0x2)

    TdrDdiDelay – 5 (0x5)

    TdrLimitTime – 60 (0x3c)

    TdrLimitCount – 6 (0x6)

    I was getting a crash about every hour. After putting in the defaults, I was able to reproduce the issue while playing Warcraft III (Dota). That game is not stressful for my cards, and I monitored heat and fan speed very carefully. Then I changed TdrLimitCount from 5 to 6. As I understand it, this means Windows will tolerate 6 TDRs within 60 seconds, instead of 5, before it stops recovering and crashes the computer. After modifying that entry I haven’t had a crash since. So that one minor change made all the difference for me. It is important that you only do this if necessary and after you have eliminated all possible causes as documented above. Use at your own risk. Modify only 1 thing at a time and by a very small amount. Even a small change can make the difference.
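
    To make the interaction between TdrLimitCount and TdrLimitTime more concrete, here is a toy Python model. It assumes that Windows looks at a sliding window of recent TDRs and crashes once more than TdrLimitCount of them fall inside TdrLimitTime seconds. That is only my reading of the two descriptions, not a documented algorithm, and would_bugcheck is a made-up name.

```python
from collections import deque


def would_bugcheck(tdr_times, limit_count=5, limit_time=60):
    """Toy model (an assumption, not Windows' actual logic).

    tdr_times: timestamps in seconds at which a TDR occurred.
    Returns the timestamp of the TDR that exceeds the limit, or None.
    """
    window = deque()
    for t in sorted(tdr_times):
        window.append(t)
        # Forget TDRs that are limit_time seconds old or older.
        while t - window[0] >= limit_time:
            window.popleft()
        # More TDRs in the window than allowed: assume a crash here.
        if len(window) > limit_count:
            return t
    return None


if __name__ == "__main__":
    events = [0, 10, 20, 30, 40, 50]  # six TDRs within 50 seconds
    print(would_bugcheck(events, limit_count=5))  # crashes on the sixth TDR
    print(would_bugcheck(events, limit_count=6))  # stays under the limit
```

    Under this model, raising TdrLimitCount by one only helps when the TDRs arrive in bursts that just barely exceed the old limit, which would explain why such a small change can matter.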

    A week later I got a few TDRs within 30 minutes.

    Test #2 –

    TdrLevel – 3 (0x3)

    TdrDelay – 2 (0x2)

    TdrDdiDelay – 5 (0x5)

    TdrLimitTime – 60 (0x3c)

    TdrLimitCount – 7 (0x7)

    Received several crashes within half an hour again.

    Test #3 –
    TdrLevel – 3 (0x3)

    TdrDelay – 2 (0x2)

    TdrDdiDelay – 5 (0x5)

    TdrLimitTime – 30 (0x1e)

    TdrLimitCount – 7 (0x7)

    Again, several crashes with this result.

     

    After deliberating on the issue, I have decided to discard my SLI’d EVGA 9800 GTX++ Superclocked cards. I have instead ordered a single XFX HD 7970 to replace them. Nvidia doesn’t appear to care about the TDR issues that many folks are seeing with 280.xx and later drivers. So after 12 years of being an Nvidia/EVGA fanboy, I am moving over to AMD.