The GPU (graphics processing unit) is a specialized processor that offloads graphics rendering from the CPU. It is especially important for 3D rendering, but can also do 2D acceleration, video decoding, etc.

Ringbuffers, batchbuffers, and debug registers

The graphics drivers run on the CPU and are responsible for feeding instructions to the GPU. They do this by placing the instructions in a so-called ringbuffer. A ringbuffer is a piece of memory that the CPU can write to and the GPU can read from. The CPU writes instructions for the GPU starting at the beginning of the ringbuffer and maintains a TAIL register, which contains the next memory address that the CPU will write to. In other words, the CPU has finished writing everything up to the TAIL address, but not the TAIL address itself. The GPU follows behind and executes the instructions that the CPU has written up to the TAIL register. It maintains a HEAD register that contains the address of the next instruction that the GPU will read. The last instruction read is the one just before the HEAD. When the CPU reaches the end of the ringbuffer, it wraps around and starts writing from the beginning again (which is why it is called a ringbuffer). It just has to watch the HEAD register to make sure it doesn't overwrite any instructions that the GPU hasn't yet read.
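The HEAD/TAIL bookkeeping can be sketched as a toy model in Python. This is only an illustration, not driver code: the RingBuffer class, its method names and the slot-based addressing are made up here, and keeping one slot empty so that HEAD == TAIL always means "empty" is an assumption of the sketch rather than something taken from the hardware documentation.

```python
class RingBuffer:
    """Toy model of a command ring shared by a CPU writer and a GPU reader."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0  # next slot the GPU will read
        self.tail = 0  # next slot the CPU will write

    def free_slots(self):
        # Assumption of this sketch: one slot is kept empty so that
        # head == tail always means "empty", never "full".
        return (self.head - self.tail - 1) % len(self.slots)

    def pending(self):
        # Instructions written by the CPU but not yet read by the GPU.
        return (self.tail - self.head) % len(self.slots)

    def cpu_write(self, instruction):
        if self.free_slots() == 0:
            raise BufferError("ring full: would overwrite unread instructions")
        self.slots[self.tail] = instruction
        self.tail = (self.tail + 1) % len(self.slots)  # wrap at the end

    def gpu_read(self):
        if self.head == self.tail:
            return None  # the GPU has caught up with the CPU
        instruction = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return instruction


ring = RingBuffer(4)
for name in ("A", "B", "C"):
    ring.cpu_write(name)
executed = [ring.gpu_read(), ring.gpu_read()]
ring.cpu_write("D")  # lands in the last slot
ring.cpu_write("E")  # wraps around into slot 0
```

After the two reads, the CPU is free to reuse the slots behind HEAD, so "E" overwrites the already executed "A" in slot 0. A further write would raise BufferError until the GPU reads more.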

Often it is not practical for the CPU to write all instructions to the ringbuffer. Instead it writes them to another piece of memory, called a batchbuffer because it contains a batch of instructions, and then places an instruction in the ringbuffer telling the GPU to read from the batchbuffer at a given memory location. At the end of the batchbuffer there is either an instruction marking the end of the batchbuffer, in which case the GPU continues from where it left the ringbuffer, or an instruction to read from another batchbuffer (this is called a chain).
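The control flow between the ringbuffer and batchbuffers can be sketched like this. The instruction names, the tuple encoding and the addresses are invented for the illustration, and the run function is hypothetical. The real instructions are the MI_BATCH_BUFFER_START and MI_BATCH_BUFFER_END entries that appear in the dump further down.

```python
def run(ring, memory):
    """Return the instructions in the order a toy GPU would execute them.

    ring   -- list of instructions placed in the ringbuffer
    memory -- dict mapping a made-up address to a batchbuffer (a list)
    """
    executed = []
    for instruction in ring:
        if instruction[0] != "BATCH_START":
            executed.append(instruction[0])
            continue
        address = instruction[1]
        while address is not None:
            next_address = None
            for batch_instruction in memory[address]:
                if batch_instruction[0] == "BATCH_END":
                    break  # return to the ringbuffer
                if batch_instruction[0] == "BATCH_START":
                    next_address = batch_instruction[1]  # chain
                    break
                executed.append(batch_instruction[0])
            address = next_address
    return executed


memory = {
    0x1000: [("STATE",), ("BATCH_START", 0x2000)],  # chains to 0x2000
    0x2000: [("PRIMITIVE",), ("BATCH_END",)],       # ends the chain
}
ring = [("FLUSH",), ("BATCH_START", 0x1000), ("USER_INTERRUPT",)]
order = run(ring, memory)
```

Here order comes out as FLUSH, STATE, PRIMITIVE, USER_INTERRUPT: the GPU leaves the ringbuffer once, follows the chain through both batchbuffers, and resumes in the ringbuffer after the end instruction.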

In addition to the HEAD and TAIL registers, there are many other registers, and intel_gpu_dump prints a few that are useful for debugging. By comparing this information with the data in the ringbuffer and batchbuffers, one can often get an idea of what has gone wrong when the GPU has hung.

Interpreting an actual IntelGpuDump.txt

At TypicalIntelGpuDump.txt there is an actual output of intel_gpu_dump from when the system was in a healthy state. The system was exercised a little with glxgears in order to produce this dump with some 3D instructions and a HEAD which is not on top of TAIL. Let's look at the different parts:

First come the debug registers.

ACTHD: 0x0f71a038
EIR: 0x00000000
EMR: 0xffffffcd
ESR: 0x00000001
PGTBL_ER: 0x00000000
IPEHR: 0x02000000
IPEIR: 0x00000000
INSTDONE: 0xffe5fafd
INSTDONE1: 0x000fffff
  busy: Projection and LOD
  busy: Bypass FIFO
  busy: Color calculator
  busy: Command Processor
ACTHD: ACTive HeaD pointer register
This register contains the memory address of the HEAD of the currently active ringbuffer or batchbuffer.
ESR: Error Status Register
The bits here are set when the hardware detects different errors. They can be cleared again by software.
EMR: Error Mask Register
This is a bit mask that decides which of the error bits in the ESR get propagated to the EIR. A bit from the ESR is propagated to the EIR if the corresponding bit in the EMR is 0.
EIR: Error Identity Register
Error bits propagated from the ESR if the corresponding EMR bit is 0. Any bit set in the EIR will cause a Master Error bit to be set, so that the driver can know that an error has occurred.
PGTBL_ER: PaGe TaBLe Error Register
This register holds more details about the error if it is a page table error/GTT error, i.e. an error related to the table translating GPU memory addresses into physical (CPU) addresses.
IPEHR: Instruction Parser Error Header Register
This register is loaded with the header of each instruction that is executed. If the GPU locks up due to an invalid instruction, this register will hold the instruction that triggered the lockup.
IPEIR: Instruction Parser Error Identification Register
Identifies whether an Invalid Instruction Error happened in the ringbuffer or in a batchbuffer: 0x00000000 if the error is in the ringbuffer and 0x00000010 if it is in a batchbuffer.
INSTDONE: INstruction STream interface DONE Register
This register consists of 32 single bits, each of which is cleared while a subsystem of the GPU is busy. When the GPU is idle, all bits are set, but since some bits are reserved and have the value 0, the default value is 0xffe7fffe. When the GPU hangs, this register can be used to tell which functions failed to complete.
INSTDONE1: Additional INstruction STream interface DONE
Like INSTDONE, but for other tasks. Not very well documented, but only the lower 20 bits are used.
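The bit arithmetic behind the EMR/EIR and INSTDONE descriptions can be tried out with the values from the dump above. This is a sketch based only on the descriptions on this page: the function names are made up, and the bit positions are not mapped to unit names, which would need the hardware documentation.

```python
INSTDONE_IDLE = 0xFFE7FFFE  # idle value quoted above; zero bits are reserved


def propagated_errors(esr, emr):
    """ESR bits that reach the EIR, i.e. those whose EMR bit is 0."""
    return esr & ~emr & 0xFFFFFFFF


def busy_bits(instdone, idle=INSTDONE_IDLE):
    """Bit positions that are set when idle but cleared in instdone."""
    cleared = idle & ~instdone
    return [bit for bit in range(32) if cleared & (1 << bit)]


eir = propagated_errors(esr=0x00000001, emr=0xFFFFFFCD)
busy = busy_bits(0xFFE5FAFD)
print(f"EIR: 0x{eir:08x}")   # EIR: 0x00000000
print("busy bits:", busy)    # busy bits: [1, 8, 10, 17]
```

The ESR bit 0 is masked by the EMR, so the computed EIR is 0, which agrees with the EIR line in the dump. Four INSTDONE bits are cleared, which matches the number of "busy:" lines that intel_gpu_dump printed.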

Then comes a batchbuffer:

batchbuffer at 0x0a689000:
0x0a689000:      0x61040000: 3DSTATE_PIPELINE_SELECT
0x0a689004:      0x79090000: 3DSTATE_GLOBAL_DEPTH_OFFSET_CLAMP
0x0a689008:      0x00000000:    dword 1
0x0a68900c:      0x61020000: STATE_SIP
0x0a689010:      0x00000000:    dword 1
0x0a689014:      0x780b0000: 3DSTATE_VF_STATISTICS
0x0a689018:      0x61010004: STATE_BASE_ADDRESS
0x0a68901c:      0x00000001:    General state at 0x00000000
0x0a689020:      0x00000001:    Surface state at 0x00000000
0x0a689024:      0x00000001:    Indirect state at 0x00000000
0x0a689028:      0x00000001:    General state upper bound 0x00000000
0x0a68902c:      0x00000001:    Indirect state upper bound 0x00000000
0x0a689930:      0x60020100: CONSTANT_BUFFER: valid
0x0a689934:      0x0a649002:    offset: 0x00299240, length: 0x00000002
0x0a689938:      0x7b001404: 3DPRIMITIVE: tri strip sequential
0x0a68993c:      0x00000016:    vertex count
0x0a689940:      0x00000000:    start vertex
0x0a689944:      0x00000001:    instance count
0x0a689948:      0x00000000:    start instance
0x0a68994c:      0x00000000:    index bias
0x0a689950:      0x00000000: MI_NOOP
0x0a689954:      0x05000000: MI_BATCH_BUFFER_END
0x0a689958:      0x00000000:
0x0a68995c:      0x00000000:
0x0a68cff8:      0x00000000:
0x0a68cffc:      0x00000000:

This particular batchbuffer is not currently executing and therefore doesn't have its own HEAD. This is consistent with the ACTHD register value 0x0f71a038, which is not in the range 0x0a689000-0x0a689954.
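The same check can be written down as a small helper, which is handy when a dump contains several batchbuffers. The function name is made up for this sketch, and the end address is taken to be the MI_BATCH_BUFFER_END line of the listing above.

```python
def executing_in(acthd, start, end):
    """True if ACTHD points inside the buffer spanning start..end (inclusive)."""
    return start <= acthd <= end


ACTHD = 0x0F71A038
in_dumped_batch = executing_in(ACTHD, 0x0A689000, 0x0A689954)  # False
```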

And finally, the ringbuffer:

Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:
0x00000000:      0x10800001: MI_STORE_DATA_INDEX
0x00000004:      0x00000080:    dword 1
0x00000008:      0x004cf867:    dword 2
0x0000000c:      0x01000000: MI_USER_INTERRUPT
0x00000010:      0x02000004: MI_FLUSH
0x0000003c:      0x00000000: MI_NOOP
0x00000040:      0x18800180: MI_BATCH_BUFFER_START
0x00000044:      0x0a689000:    dword 1
0x00000048:      0x02000004: MI_FLUSH
0x0001f488:      0x18800180: MI_BATCH_BUFFER_START
0x0001f48c:      0x0f71a000:    dword 1
0x0001f490: HEAD 0x02000004: MI_FLUSH
0x0001f494:      0x00000000: MI_NOOP
0x0001f498:      0x10800001: MI_STORE_DATA_INDEX
0x0001f49c:      0x00000080:    dword 1
0x0001f4a0:      0x004cf81a:    dword 2
0x0001f4a4:      0x01000000: MI_USER_INTERRUPT
0x0001f528:      0x10800001: MI_STORE_DATA_INDEX
0x0001f52c:      0x00000080:    dword 1
0x0001f530:      0x004cf81e:    dword 2
0x0001f534:      0x01000000: MI_USER_INTERRUPT
0x0001f538: TAIL 0x02000006: MI_FLUSH
0x0001f53c:      0x00000000: MI_NOOP
0x0001f540:      0x18800180: MI_BATCH_BUFFER_START
0x0001f544:      0x0f6ea000:    dword 1
0x0001ffdc:      0x01000000: MI_USER_INTERRUPT
0x0001ffe0:      0x02000004: MI_FLUSH
0x0001ffe4:      0x00000000: MI_NOOP
0x0001ffe8:      0x18800180: MI_BATCH_BUFFER_START
0x0001ffec:      0x0f6fe000:    dword 1
0x0001fff0:      0x02000004: MI_FLUSH
0x0001fff4:      0x00000000: MI_NOOP
0x0001fff8:      0x00000000: MI_NOOP
0x0001fffc:      0x00000000: MI_NOOP

The last instruction in the ringbuffer that has been read by the GPU is

0x0001f488:      0x18800180: MI_BATCH_BUFFER_START
0x0001f48c:      0x0f71a000:    dword 1

This says that there is a batchbuffer ready at memory address 0x0f71a000. The GPU is apparently executing this batchbuffer at the moment, since the ACTHD register is 0x0f71a038. For some reason this batchbuffer is not captured by the dump. The last instruction written by the CPU is

0x0001f534:      0x01000000: MI_USER_INTERRUPT

and the next instruction

0x0001f538: TAIL 0x02000006: MI_FLUSH

is about to get overwritten.
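The reading above can be reproduced mechanically. The following sketch pulls the HEAD and TAIL markers out of a few lines copied from the ringbuffer listing and computes how far the GPU is behind the CPU. The regular expression, the function name and the ring size of 0x20000 bytes are assumptions made for this example (the size is inferred from the last address in the listing, 0x0001fffc).

```python
import re

RING_LINES = """\
0x0001f488:      0x18800180: MI_BATCH_BUFFER_START
0x0001f48c:      0x0f71a000:    dword 1
0x0001f490: HEAD 0x02000004: MI_FLUSH
0x0001f534:      0x01000000: MI_USER_INTERRUPT
0x0001f538: TAIL 0x02000006: MI_FLUSH
"""
RING_SIZE = 0x20000  # assumption: the ring spans 0x00000000-0x0001ffff

# address, optional HEAD/TAIL marker, instruction dword
LINE = re.compile(r"^(0x[0-9a-f]{8}):\s+(HEAD|TAIL)?\s*(0x[0-9a-f]{8}):")


def find_pointers(text):
    """Map 'HEAD' and 'TAIL' to the ring offsets they are printed next to."""
    pointers = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match and match.group(2):
            pointers[match.group(2)] = int(match.group(1), 16)
    return pointers


pointers = find_pointers(RING_LINES)
# Bytes written by the CPU that the GPU has not read yet (modulo wrap-around).
pending_bytes = (pointers["TAIL"] - pointers["HEAD"]) % RING_SIZE
# How far ACTHD is into the batchbuffer started at 0x0f71a000.
batch_offset = 0x0F71A038 - 0x0F71A000
```

For this dump, HEAD is 0x0001f490 and TAIL is 0x0001f538, so 168 bytes (42 dwords) are still waiting in the ringbuffer, and the GPU is 56 bytes (14 dwords) into the batchbuffer at 0x0f71a000.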

Code Meanings

Sarvatt notices that:

  • IPEHR: 0x7xxxxxxx has typically indicated Mesa problems
  • IPEHR: 0x018xxxxx typically indicates hangs during DPMS cycles
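These two rules of thumb can be applied to an IPEHR value with a pair of shifts. The function below is a hypothetical helper for illustration. It only encodes the two patterns listed above and should be treated as a hint, not a diagnosis.

```python
def ipehr_hint(ipehr):
    """Return a rough hint for an IPEHR value, or None if no pattern matches."""
    if ipehr >> 28 == 0x7:      # 0x7xxxxxxx
        return "typically a Mesa problem"
    if ipehr >> 20 == 0x018:    # 0x018xxxxx
        return "typically a hang during a DPMS cycle"
    return None


hint = ipehr_hint(0x02000000)  # IPEHR from the healthy dump above: None
```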

X/InterpretingIntelGpuDump (last edited 2011-08-23 00:01:30 by bryce)