breakpad/docs/symbol_files.md
Andy Bonventre daed3a4383 [Docs] add markdown docs (converted from Wiki)
BUG=none
R=mark
CC=google-breakpad-dev@googlegroups.com

Review URL: https://codereview.chromium.org/1357773004 .

Patch from Andy Bonventre <andybons@chromium.org>.
2015-09-22 17:29:52 -04:00

24 KiB

Introduction

Given a minidump file, the Breakpad processor produces stack traces that include function names and source locations. However, minidump files contain only the byte-by-byte contents of threads' registers and stacks, without function names or machine-code-to-source mapping data. The processor consults Breakpad symbol files for the information it needs to produce human-readable stack traces from the binary-only minidump file.

The platform-specific symbol dumping tools parse the debugging information the compiler provides (whether as DWARF or STABS sections in an ELF file or as stand-alone PDB files), and write that information back out in the Breakpad symbol file format. This format is much simpler and less detailed than compiler debugging information, and values legibility over compactness.

Overview

Breakpad symbol files are ASCII text files, with lines delimited as appropriate for the host platform. Each line is a record, divided into fields by single spaces; in some cases, the last field of the record can contain spaces. The first field is a string indicating what sort of record the line represents (except for line records; these are very common, making them the default saves space). Some fields hold decimal or hexadecimal numbers; hexadecimal numbers have no "0x" prefix, and use lower-case letters.

Breakpad symbol files contain the following record types. With some restrictions, these may appear in any order.

  • A MODULE record describes the executable file or shared library from which this data was derived, for use by symbol suppliers. A `MODULE' record should be the first record in the file.

  • A FILE record gives a source file name, and assigns it a number by which other records can refer to it.

  • A FUNC record describes a function present in the source code.

  • A line record indicates to which source file and line a given range of machine code should be attributed. The line is attributed to the function defined by the most recent FUNC record.

  • A PUBLIC record gives the address of a linker symbol.

  • A STACK record provides information necessary to produce stack traces.

MODULE records

A MODULE record provides meta-information about the module the symbol file describes. It has the form:

MODULE operatingsystem architecture id name

For example: MODULE Linux x86 D3096ED481217FD4C16B29CD9BC208BA0 firefox-bin These records provide meta-information about the executable or shared library from which this symbol file was generated. A symbol supplier might use this information to find the correct symbol files to use to interpret a given minidump, or to perform other sorts of validation. If present, a MODULE record should be the first line in the file.

The fields are separated by spaces, and cannot contain spaces themselves, except for name.

  • The operatingsystem field names the operating system on which the executable or shared library was intended to run. This field should have one of the following values: | Value | Meaning | |:----------|:--------------------| | Linux | Linux | | mac | Macintosh OSX | | windows | Microsoft Windows |

  • The architecture field indicates what processor architecture the executable or shared library contains machine code for. This field should have one of the following values: | Value | Instruction Set Architecture | |:----------|:---------------------------------| | x86 | Intel IA-32 | | x86_64 | AMD64/Intel 64 | | ppc | 32-bit PowerPC | | ppc64 | 64-bit PowerPC | | unknown | unknown |

  • The id field is a sequence of hexadecimal digits that identifies the exact executable or library whose contents the symbol file describes. The way in which it is computed varies from platform to platform.

  • The name field contains the base name (the final component of the directory path) of the executable or library. It may contain spaces, and extends to the end of the line.

FILE records

A FILE record holds a source file name for other records to refer to. It has the form:

FILE number name

For example: FILE 2 /home/jimb/mc/in/browser/app/nsBrowserApp.cpp

A FILE record provides the name of a source file, and assigns it a number which other records (line records, in particular) can use to refer to that file name. The number field is a decimal number. The name field is the name of the file; it may contain spaces.

FUNC records

A FUNC record describes a source-language function. It has the form:

FUNC address size parameter_size name

For example: FUNC c184 30 0 nsQueryInterfaceWithError::operator()(nsID const&, void**) const

The address and size fields are hexadecimal numbers indicating the start address and length in bytes of the machine code instructions the function occupies. (Breakpad symbol files cannot accurately describe functions whose code is not contiguous.) The start address is relative to the module's load address.

The parameter_size field is a hexadecimal number indicating the size, in bytes, of the arguments pushed on the stack for this function. Some calling conventions, like the Microsoft Windows stdcall convention, require the called function to pop parameters passed to it on the stack from its caller before returning. The stack walker uses this value, along with data from STACK records, to step from the called function's frame to the caller's frame.

The name field is the name of the function. In languages that use linker symbol name mangling like C++, this should be the source language name (the "unmangled" form). This field may contain spaces.

Line records

A line record describes the source file and line number to which a given range of machine code should be attributed. It has the form:

address size line filenum

For example: c184 7 59 4

Because they are so common, line records do not begin with a string indicating the record type. All other record types' names use upper-case letters; hexadecimal numbers, like a line record's address, use lower-case letters.

The address and size fields are hexadecimal numbers indicating the start address and length in bytes of the machine code. The address is relative to the module's load address.

The line field is the line number to which the machine code should be attributed, in decimal; the first line of the source file is line number 1. The filenum field is a decimal number appearing in a prior FILE record; the name given in that record is the source file name for the machine code.

The line is assumed to belong to the function described by the last preceding FUNC record. Line records may not appear before the first `FUNC' record.

No two line records in a symbol file cover the same range of addresses. However, there may be many line records with identical line and file numbers, as a given source line may contribute many non-contiguous blocks of machine code.

PUBLIC records

A PUBLIC record describes a publicly visible linker symbol, such as that used to identify an assembly language entry point or region of memory. It has the form:

PUBLIC address parameter_size name

For example: PUBLIC 2160 0 Public2_1

The Breakpad processor essentially treats a PUBLIC record as defining a function with no line number data and an indeterminate size: the code extends to the next address mentioned. If a given address is covered by both a PUBLIC record and a FUNC record, the processor uses the FUNC data.

The address field is a hexadecimal number indicating the symbol's address, relative to the module's load address.

The parameter_size field is a hexadecimal number indicating the size of the parameters passed to the code whose entry point the symbol marks, if known. This field has the same meaning as the parameter_size field of a FUNC record; see that description for more details.

The name field is the name of the symbol. In languages that use linker symbol name mangling like C++, this should be the source language name (the "unmangled" form). This field may contain spaces.

STACK WIN records

Given a stack frame, a STACK WIN record indicates how to find the frame that called it. It has the form:

STACK WIN type rva code_size prologue_size epilogue_size parameter_size saved_register_size local_size max_stack_size has_program_string program_string_OR_allocates_base_pointer

For example: STACK WIN 4 2170 14 1 0 0 0 0 0 1 $eip 4 + ^ = $esp $ebp 8 + = $ebp $ebp ^ =

All fields of a STACK WIN record, except for the last, are hexadecimal numbers.

The type field indicates what sort of stack frame data this record holds. Its value should be one of the values of the [StackFrameTypeEnum] (http://msdn.microsoft.com/en-us/library/bc5207xw%28VS.100%29.aspx) type in Microsoft's [Debug Interface Access (DIA)] (http://msdn.microsoft.com/en-us/library/x93ctkx8%28VS.100%29.aspx) API. Breakpad uses only records of type 4 (FrameTypeFrameData) and 0 (FrameTypeFPO); it ignores others. These types differ only in whether the last field is an allocates_base_pointer flag (FrameTypeFPO) or a program string (FrameTypeFrameData). If more than one record covers a given address, Breakpad prefers FrameTypeFrameData records over FrameTypeFPO records.

The rva and code_size fields give the starting address and length in bytes of the machine code covered by this record. The starting address is relative to the module's load address.

The prologue_size and epilogue_size fields give the length, in bytes, of the prologue and epilogue machine code within the record's range. Breakpad does not use these values.

The parameter_size field gives the number of argument bytes this function expects to have been passed. This field has the same meaning as the parameter_size field of a FUNC record; see that description for more details.

The saved_register_size field gives the number of bytes in the stack frame dedicated to preserving the values of any callee-saves registers used by this function.

The local_size field gives the number of bytes in the stack frame dedicated to holding the function's local variables and temporary values.

The max_stack_size field gives the maximum number of bytes pushed on the stack in the frame. Breakpad does not use this value.

If the has_program_string field is zero, then the STACK WIN record's final field is an allocates_base_pointer flag, as a hexadecimal number; this is expected for records whose type is 0. Otherwise, the final field is a program string.

Interpreting a STACK WIN record

Given the register values for a frame F, we can find the calling frame as follows:

  • If the has_program_string field of a STACK WIN record is zero, then the final field is allocates_base_pointer, a flag indicating whether the frame uses the frame pointer register, %ebp, as a general-purpose register.
    • If allocates_base_pointer is true, then %ebp does not point to the frame's base address. Instead,
      • Let next_parameter_size be the parameter size of the function frame F called (not this record's parameter_size field), or zero if F is the youngest frame on the stack. You must find this value in F's callee's FUNC, STACK WIN, or PUBLIC records.
      • Let frame_size be the sum of the local_size field, the saved_register_size field, and next_parameter_size. > > With those definitions in place, we can recover the calling frame as follows:
      • F's return address is at %esp +frame_size,
      • the caller's value of %ebp is saved at %esp +next_parameter_size+saved_register_size- 8, and
      • the caller's value of %esp just before the call instruction was %esp +frame_size+ 4. > > (Why do we include next_parameter_size in the sum when computing frame_size and the address of the saved %ebp? When a function A has called a function B, the arguments that A pushed for B are considered part of A's stack frame: A's value for %esp points at the last argument pushed for B. Thus, we must include the size of those arguments (given by the debugging info for B) along with the size of A's register save area and local variable area (given by the debugging info for A) when computing the overall size of A's frame.)
    • If allocates_base_pointer is false, then F's function doesn't use %ebp at all. You may recover the calling frame as above, except that the caller's value of %ebp is the same as F's value for %ebp, so no steps are necessary to recover it.
  • If the has_program_string field of a STACK WIN record is not zero, then the record's final field is a string containing a program to be interpreted to recover the caller's frame. The comments in the [postfix_evaluator.h] (http://code.google.com/p/google-breakpad/source/browse/trunk/src/processor/postfix_evaluator.h#40) header file explain the language in which the program is written. You should place the following variables in the dictionary before interpreting the program:
    • $ebp and $esp should be the values of the %ebp and %esp registers in F.
    • .cbParams, .cbSavedRegs, and .cbLocals, should be the values of the STACK WIN record's parameter_size, saved_register_size, and local_size fields.
    • .raSearchStart should be set to the address on the stack to begin scanning for a return address, if necessary. The Breakpad processor sets this to the value of %esp in F, plus the frame_size value mentioned above.

If the program stores values for $eip, $esp, $ebp, $ebx, $esi, or $edi, then those are the values of the given registers in the caller. If the value of $eip is zero, that indicates that the end of the stack has been reached.

The Breakpad processor checks that the value yielded by the above for the calling frame's instruction address refers to known code; if the address seems to be bogus, then it uses a heuristic search to find F's return address and stack base.

STACK CFI records

STACK CFI ("Call Frame Information") records describe how to walk the stack when execution is at a given machine instruction. These records take one of two forms:

STACK CFI INIT address size register1: expression1 register2: expression2 ...

STACK CFI address register1: expression1 register2: expression2 ...

For example:

STACK CFI INIT 804c4b0 40 .cfa: $esp 4 + $eip: .cfa 4 - ^
STACK CFI 804c4b1 .cfa: $esp 8 + $ebp: .cfa 8 - ^

The address and size fields are hexadecimal numbers. Each registeri is the name of a register or pseudoregister. Each expression is a Breakpad postfix expression, which may contain spaces, but never ends with a colon. (The appropriate register names for a given architecture are determined when STACK CFI records are first enabled for that architecture, and should be documented in the appropriate stackwalker_architecture.cc source file.)

STACK CFI records describe, at each machine instruction in a given function, how to recover the values the machine registers had in the function's caller. Naturally, some registers' values are simply lost, but there are three cases in which they can be recovered:

  • You can always recover the program counter, because that's the function's return address. If the function is ever going to return, the PC must be saved somewhere.

  • You can always recover the stack pointer. The function is responsible for popping its stack frame before it returns to the caller, so it must be able to restore this, as well.

  • You should be able to recover the values of callee-saves registers. These are registers whose values the callee must preserve, either by saving them in its own stack frame before using them and re-loading them before returning, or by not using them at all.

(As an exception, note that functions which never return may not save any of this data. It may not be possible to walk the stack past such functions' stack frames.)

Given rules for recovering the values of a function's caller's registers, we can walk up the stack. Starting with the current set of registers --- the PC of the instruction we're currently executing, the current stack pointer, etc. --- we use CFI to recover the values those registers had in the caller of the current frame. This gives us a PC in the caller whose CFI we can look up; we apply the process again to find that function's caller; and so on.

Concretely, CFI records represent a table with a row for each machine instruction address and a column for each register. The table entry for a given address and register contains a rule describing how, when the PC is at that address, to restore the value that register had in the caller.

There are some special columns:

  • A column named .cfa, for "Canonical Frame Address", tells how to compute the base address of the frame; other entries can refer to the CFA in their rules.

  • A column named .ra represents the return address.

For example, suppose we have a machine with 32-bit registers, one-byte instructions, a stack that grows downwards, and an assembly language that resembles C. Suppose further that we have a function whose machine code looks like this:

func:                                ; entry point; return address at sp
func+0:      sp -= 16                ; allocate space for stack frame
func+1:      sp[12] = r0             ; save 4-byte r0 at sp+12
             ...                     ; stuff that doesn't affect stack
func+10:     sp -= 4; *sp = x        ; push some 4-byte x on the stack
             ...                     ; stuff that doesn't affect stack
func+20:     r0 = sp[16]             ; restore saved r0
func+21:     sp += 20                ; pop whole stack frame
func+22:     pc = *sp; sp += 4       ; pop return address and jump to it

The following table would describe the function above:

code address .cfa r0 (on Google Code) r1 (on Google Code) ... .ra
func+0 sp cfa[0]
func+1 sp+16 cfa[0]
func+2 sp+16 cfa[-4] cfa[0]
func+11 sp+20 cfa[-4] cfa[0]
func+21 sp+20 cfa[0]
func+22 sp cfa[0]

Some things to note here:

  • Each row describes the state of affairs before executing the instruction at the given address. Thus, the row for func+0 describes the state before we execute the first instruction, which allocates the stack frame. In the next row, the formula for computing the CFA has changed, reflecting the allocation.

  • The other entries are written in terms of the CFA; this allows them to remain unchanged as the stack pointer gets bumped around. For example, to find the caller's value for r0 (on Google Code) at func+2, we would first compute the CFA by adding 16 to the sp, and then subtract four from that to find the address at which r0 (on Google Code) was saved.

  • Although the example doesn't show this, most calling conventions designate "callee-saves" and "caller-saves" registers. The callee must restore the values of "callee-saves" registers before returning (if it uses them at all), whereas the callee is free to use "caller-saves" registers without restoring their values. A function that uses caller-saves registers typically does not save their original values at all; in this case, the CFI marks such registers' values as "unrecoverable".

  • Exactly where the CFA points in the frame --- at the return address? below it? At some fixed point within the frame? --- is a question of definition that depends on the architecture and ABI in use. But by definition, the CFA remains constant throughout the lifetime of the frame. It's up to architecture- specific code to know what significance to assign the CFA, if any.

To save space, the most common type of CFI record only mentions the table entries at which changes take place. So for the above, the CFI data would only actually mention the non-blank entries here:

insn cfa r0 (on Google Code) r1 (on Google Code) ... ra
func+0 sp cfa[0]
func+1 sp+16
func+2 cfa[-4]
func+11 sp+20
func+21 r0 (on Google Code)
func+22 sp

A STACK CFI INIT record indicates that, at the machine instruction at address, belonging to some function, the value that registern had in that function's caller can be recovered by evaluating expressionn. The values of any callee-saves registers not mentioned are assumed to be unchanged. (STACK CFI records never mention caller-saves registers.) These rules apply starting at address and continue up to, but not including, the address given in the next STACK CFI record. The size field is the total number of bytes of machine code covered by this record and any subsequent STACK CFI records (until the next STACK CFI INIT record). The address field is relative to the module's load address.

A STACK CFI record (no INIT) is the same, except that it mentions only those registers whose recovery rules have changed from the previous CFI record. There must be a prior STACK CFI INIT or STACK CFI record in the symbol file. The address field of this record must be greater than that of the previous record, and it must not be at or beyond the end of the range given by the most recent STACK CFI INIT record. The address is relative to the module's load address.

Each expression is a breakpad-style postfix expression. Expressions may contain spaces, but their tokens may not end with colons. When an expression mentions a register, it refers to the value of that register in the callee, even if a prior name/expression pair gives that register's value in the caller. The exception is .cfa, which refers to the canonical frame address computed by the .cfa rule in force at the current instruction.

The special expression .undef indicates that the given register's value cannot be recovered.

The register names preceding the expressions are always followed by colons. The expressions themselves never contain tokens ending with colons.

There are two special register names:

  • .cfa ("Canonical Frame Address") is the base address of the stack frame. Other registers' rules may refer to this. If no rule is provided for the stack pointer, the value of .cfa is the caller's stack pointer.

  • .ra is the return address. This is the value of the restored program counter. We use .ra instead of the architecture-specific name for the program counter.

The Breakpad stack walker requires that there be rules in force for .cfa and .ra at every code address from which it unwinds. If those rules are not present, the stack walker will ignore the STACK CFI data, and try to use a different strategy.

So the CFI for the example function above would be as follows, if func were at address 0x1000 (relative to the module's load address):

STACK CFI INIT 1000 .cfa: $sp .ra: .cfa ^
STACK CFI      1001 .cfa: $sp 16 +
STACK CFI      1002 $r0: .cfa 4 - ^
STACK CFI      100b .cfa: $sp 20 +
STACK CFI      1015 $r0: $r0
STACK CFI      1016 .cfa: $sp