Instruction-Based Sampling

CodeXL User Guide

Instruction-Based Sampling (IBS) identifies and diagnoses performance issues in program hot-spots. It collects data on how instructions behave on the processor and in the memory subsystem, and provides a range of measured data for each sample. IBS has the following characteristics:

·         Hardware events are linked with the instructions that caused them.

·         A wealth of event data is produced in a single test run.

·         Latency is measured for key performance factors such as data cache miss latency.

IBS provides the most common types of information needed for program performance analysis. It uses a hardware sampling technique to generate event information similar to that produced by event-based profiling. Event-based profiling, however, offers a wider range of events that can be monitored, such as those related to HyperTransport™ links.

Processor pipeline stages can be categorized into two main phases: instruction fetch and execution. Each instruction fetch operation produces a block of instruction data that is passed to the decode stages in the pipeline. The decoder identifies AMD64 instructions in the fetch block. These AMD64 instructions are translated to one or more macro-operations, called "macro-ops" or "ops," that are executed in the execution phase.

Note: For more information about instruction-based sampling, see the following documents which are available at AMD’s Developer Guides & Manuals page:

-          Software Optimization Guide for AMD Family 16h Processors

-          Preliminary BIOS and Kernel Developer’s Guide (BKDG) for AMD Family 16h Models 00h-0Fh (Kabini) Processors

-          BIOS and Kernel Developer Guide (BKDG) for AMD Family 15h Models 00h-0Fh Processors

-          Software Optimization Guide for AMD Family 15h Processors 

-          BIOS and Kernel Developer Guide (BKDG) for AMD Family 14h Models 00h-0Fh Processors 

-          BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 12h Processors 

How IBS Works

IBS provides separate means to sample fetch operations and macro-ops. IBS fetch sampling and IBS op sampling can be enabled and collected separately or together.

IBS Fetch Sampling

This is a statistical sampling method. IBS fetch sampling counts the completed fetch operations. When the number of completed fetch operations reaches the maximum fetch count (the sampling period), IBS tags the fetch operation and monitors that fetch operation until it either completes or aborts.

When a tagged fetch completes or aborts, a sampling interrupt is generated, and an IBS fetch sample is taken. An IBS fetch sample contains a timestamp, the identifier of the interrupted process, the virtual fetch address, and several event flags and values that describe what happened during the fetch operation. As with time-based profiling and event-based profiling, CodeXL uses the IBS sample data and information from the executable images, debug information, and source to build an IBS profile for software components executed on the system. IBS is also available in system-wide profiling.

The event data reported in an IBS sample includes the following:

·         Whether the fetch completed or aborted.

·         Whether the address translation initially missed in the level one (L1) or level two (L2) instruction translation lookaside buffer (ITLB).

·         The page size of the L1 ITLB address translation (4K, 2M).

·         Whether the fetch initially missed in the instruction cache (IC).

·         Fetch latency (number of processor cycles from when the fetch was initiated to when the fetch completed or aborted).

Event-based profiling requires several counters to collect as much information as IBS. The fetch address precisely identifies the fetch operation associated with the hardware events. The IBS fetch address may be the address of a fetch block, the target of a branch, or the address of an instruction that is the fall-through of a conditional branch. A fetch block does not always start with a complete, valid AMD64 instruction; this occurs when an AMD64 instruction straddles two fetch blocks. In this case, CodeXL associates the IBS fetch sample with the AMD64 instruction in the preceding fetch block.

A fetch can be abandoned before it delivers data to the decoder because of a control flow redirection, which can happen at any time during the fetch process. A fetch abandoned before initial access to the ITLB (before address translation) is not regarded as useful for analysis. These early abandoned fetches are called killed fetches.

CodeXL identifies killed fetches. The fetch operations remaining after killed fetches are removed from consideration are called attempted fetches: these fetches represent valid attempts to obtain instruction bytes.

A completed fetch is an attempted fetch that successfully delivered instruction data to the decoder. An aborted fetch is an attempted fetch that did not complete.

Note: Instruction fetch is an aggressive, speculative activity, and even instruction data produced by a completed fetch may not be used.
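The fetch categories described above can be summarized in a short sketch. The record layout and the field names (`reached_itlb`, `completed`) are hypothetical and chosen only for illustration; they are not the names used by the IBS hardware or by CodeXL.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FetchSample:
    # Hypothetical fields, for illustration only.
    reached_itlb: bool  # False: abandoned before address translation
    completed: bool     # True: instruction data was delivered to the decoder


def classify_fetch(sample: FetchSample) -> str:
    """Return 'killed', 'completed', or 'aborted' for one fetch sample."""
    if not sample.reached_itlb:
        return "killed"  # not counted as an attempted fetch
    return "completed" if sample.completed else "aborted"


def summarize(samples):
    """Count fetch samples per category and add the attempted total."""
    counts = Counter(classify_fetch(s) for s in samples)
    # Attempted fetches are all samples that were not killed.
    counts["attempted"] = counts["completed"] + counts["aborted"]
    return counts
```

In this model a killed fetch never contributes to the attempted count, which is why ratios such as instruction cache misses per fetch are taken against attempted fetches rather than all fetch samples.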

IBS Op Sampling

IBS op sampling operates like fetch sampling. It provides two methods for op selection:

·         Cycles mode: IBS hardware counts processor cycles. When reaching the maximum cycle count (the sampling period), IBS tags an available valid op.

·         Dispatched op mode: IBS hardware counts ops as they are issued into the pipeline. When the number of dispatched ops reaches the maximum op count (the sampling period), IBS tags the op. Dispatched op mode is preferred because Cycles mode selection is susceptible to delay-induced sampling bias.

Note: Some processors do not support dispatched op mode. For more details, see the BKDG for the AMD processor for your platform.

The execution stages of the pipeline monitor the tagged macro-op. When the tagged macro-op retires, a sampling interrupt is generated, and an IBS op sample is taken. An IBS op sample contains:

·         a timestamp,

·         the identifier of the interrupted process,

·         the virtual address of the AMD64 instruction from which the op was issued, and

·         several event flags and values that describe what happened when the macro-op executed.

CodeXL uses this and other information to build an IBS profile.

Cycle-based op sampling can be susceptible to timing bias: it can cause ops from some instructions to be selected more often than ops from other instructions. Dispatched op-based sampling is the preferred IBS operating mode because it is not biased by timing.

IBS op samples are taken only for ops that retire. Thus, IBS op event information does not measure speculative execution activity. The cycles-based tagging scheme can introduce statistical bias due to stalls at the decoding stage of the pipeline. If a macro-op is not available for tagging when the maximum cycle count is reached, the hardware cannot tag a macro-op at that point and starts counting again from a small, pseudo-random initial count.
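The timing bias of cycles mode can be illustrated with a toy model. This is only a sketch under invented assumptions: a four-op loop in which op "A" waits through a long stall before it dispatches, and a selection rule that tags the first op dispatched once the cycle counter has expired. It does not model the real pipeline, the pseudo-random restart, or any particular processor.

```python
from collections import Counter
from itertools import cycle, islice

# Toy loop body: (op name, cycles until the op dispatches, stalls included).
# The values are invented; "A" stands for an op that follows a long stall.
LOOP = [("A", 9), ("B", 1), ("C", 1), ("D", 1)]


def dispatched_op_mode(n_ops, period):
    """Tag every `period`-th dispatched op, regardless of timing."""
    tags = Counter()
    for i, (name, _) in enumerate(islice(cycle(LOOP), n_ops), start=1):
        if i % period == 0:
            tags[name] += 1
    return tags


def cycles_mode(n_ops, period):
    """Tag the first op dispatched after each `period`-cycle interval.

    Assumes `period` is larger than any single op's cost, so at most one
    interval expires between two consecutive dispatches.
    """
    tags = Counter()
    now = 0
    next_tag = period
    for name, cost in islice(cycle(LOOP), n_ops):
        now += cost              # this op dispatches at cycle `now`
        if now >= next_tag:      # counter expired while waiting for it
            tags[name] += 1
            next_tag += period
    return tags


if __name__ == "__main__":
    print("dispatched op mode:", dict(dispatched_op_mode(4400, 11)))
    print("cycles mode:       ", dict(cycles_mode(4400, 11)))
```

In this model, dispatched op mode tags the four ops equally often, while cycles mode tags "A" far more often than the others because the counter usually expires while the pipeline is waiting for "A".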

IBS op sampling reports the following values for all ops:

·         Virtual address of the parent AMD64 instruction from which the tagged op was issued.

·         Tag-to-retire time (the number of processor cycles from when the op was tagged to when the op retired).

·         Completion-to-retire time (the number of processor cycles from when the op completed to when the op was retired).

Attribution of event information is precise because the IBS hardware reports the address of the AMD64 instruction causing the events. For example, branch mispredictions are attributed to the mispredicted branch, and cache misses are attributed to the AMD64 instruction that caused the cache miss. IBS makes it easier to identify the performance-degrading instructions.

Some ops implement branch semantics. Branches include unconditional and conditional branches, subroutine calls, and subroutine returns.

Event information reported for branch ops includes whether the branch was mispredicted and whether it was taken. IBS also indicates whether a branch operation was a subroutine return, and whether the return was mispredicted.

Some ops can perform a load (memory read), store (memory write), or a load and a store to the same memory address, as in the case of a read-op-write sequence.

When an op performs a load and/or store, event information includes the following:

·         Whether a load was performed.

·         Whether a store was performed.

·         Whether address translation initially missed in the L1 and/or L2 data translation lookaside buffer (DTLB).

·         Whether the load or store initially missed in the data cache (DC).

·         Virtual data address for the memory operation.

·         Latency when a load misses the DC.

Requests made through the Northbridge produce additional event information:

·         Whether the access was local or remote.

·         Data source that fulfilled the request.

A full list of IBS op event information appears in the section on IBS-Derived Events below. For hardware-level details, see the BIOS and Kernel Developer's Guide (BKDG) for the AMD processor for your platform.

IBS-Derived Events

CodeXL translates the IBS information produced by the hardware into derived event sample counts that resemble EBP sample counts. All IBS-derived events have "IBS" in the event name and abbreviation. Although IBS-derived events and sample counts look similar to EBP events and sample counts, the source and sampling basis for the IBS event information are different.

Arithmetic should never be performed between IBS-derived event sample counts and EBP event sample counts. It is not meaningful to directly compare the number of samples taken for events that represent the same hardware condition. For example, fewer IBS DC miss samples are not necessarily better than a larger quantity of EBP DC miss samples.
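The translation from raw op samples into derived event sample counts can be pictured as a simple fold over the samples. The sketch below is illustrative only: the `OpSample` fields are hypothetical, and the function is not CodeXL's implementation. It reuses a few of the derived event names listed in the table that follows.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class OpSample:
    # Hypothetical per-sample flags, loosely following the text above.
    is_load: bool = False
    is_store: bool = False
    dc_miss: bool = False
    dc_miss_latency: int = 0  # cycles; meaningful only for loads that miss
    is_branch: bool = False
    mispredicted: bool = False


def derive_events(samples):
    """Fold raw op samples into derived event sample counts."""
    ev = Counter()
    for s in samples:
        ev["All IBS op samples"] += 1
        if s.is_branch:
            ev["IBS branch op"] += 1
            if s.mispredicted:
                ev["IBS mispredicted branch op"] += 1
        if s.is_load or s.is_store:
            ev["IBS all load store ops"] += 1
            if s.is_load:
                ev["IBS load ops"] += 1
            if s.is_store:
                ev["IBS store ops"] += 1
            if s.dc_miss:
                ev["IBS data cache misses"] += 1
                if s.is_load:
                    # Latency events accumulate cycles, not sample counts.
                    ev["IBS data cache miss load latency"] += s.dc_miss_latency
            else:
                ev["IBS data cache hits"] += 1
    return ev
```

Every count produced this way has the IBS sampling basis, so ratios should be formed only between values from the same fold, never against EBP counts.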

Event

Description

All IBS fetch samples

The number of all IBS fetch samples that were collected, including IBS killed fetch samples.

IBS fetch killed

The number of IBS sampled fetches that were killed fetches. A fetch operation is killed if the fetch did not reach ITLB or IC access. Killed fetch samples are not generally useful for analysis and are filtered out in other derived IBS fetch events (except Event Select 0xF000, which counts all IBS fetch samples including IBS killed fetch samples).

IBS fetch attempted

The number of IBS sampled fetches that were not killed fetch attempts. This derived event measures the number of useful fetch attempts and does not include the number of IBS killed fetch samples. This event should be used to compute ratios such as the ratio of IBS fetch IC misses to attempted fetches. The number of attempted fetches should equal the sum of the number of completed fetches and the number of aborted fetches.

IBS fetch completed

The number of IBS sampled fetches that completed. A fetch is completed if the attempted fetch delivers instruction data to the instruction decoder. Although the instruction data was delivered, it may still not be used (e.g., the instruction data may have been on the "wrong path" of an incorrectly predicted branch).

IBS fetch aborted

The number of IBS sampled fetches that aborted. An attempted fetch is aborted if it did not complete and deliver instruction data to the decoder. An attempted fetch may abort at any point in the process of fetching instruction data. An abort may be due to a branch redirection as the result of a mispredicted branch. The number of IBS aborted fetch samples is a lower bound on the amount of unsuccessful, speculative fetch activity. It is a lower bound since the instruction data delivered by completed fetches may not be used.

IBS ITLB hit

The number of IBS attempted fetch samples where the fetch operation initially hit in the L1 ITLB (Instruction Translation Lookaside Buffer).

IBS L1 ITLB misses (and L2 ITLB hits)

The number of IBS attempted fetch samples where the fetch operation initially missed in the L1 ITLB and hit in the L2 ITLB.

IBS L1 L2 ITLB miss

The number of IBS attempted fetch samples where the fetch operation initially missed in both the L1 ITLB and the L2 ITLB.

IBS instruction cache misses

The number of IBS attempted fetch samples where the fetch operation initially missed in the IC (instruction cache).

IBS instruction cache hit

The number of IBS attempted fetch samples where the fetch operation initially hit in the IC.

IBS 4K page translation

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (i.e., address translation completed successfully) and used a 4-KByte page entry in the L1 ITLB.

IBS 2M page translation

The number of IBS attempted fetch samples where the fetch operation produced a valid physical address (i.e., address translation completed successfully) and used a 2-MByte page entry in the L1 ITLB.

IBS fetch latency

The total latency of all IBS attempted fetch samples. Divide the total IBS fetch latency by the number of IBS attempted fetch samples to obtain the average latency of the attempted fetches that were sampled.

IBS fetch L2 cache miss

The number of IBS fetch samples where the instruction fetch missed in the L2 cache.

IBS ITLB refill latency

The number of cycles when the fetch engine is stalled for an ITLB reload for the sampled fetch. If there is no reload, the latency will be 0.

All IBS op samples

The number of all IBS op samples that were collected. These op samples may be branch ops, resync ops, ops that perform load/store operations, or undifferentiated ops (e.g., those ops that perform arithmetic operations, logical operations, etc.). IBS collects data for retired ops. No data is collected for ops that are aborted due to pipeline flushes, etc. Thus, all sampled ops are architecturally significant and contribute to the successful forward progress of executing programs.

IBS tag-to-retire cycles

The total number of tag-to-retire cycles across all IBS op samples. The tag-to-retire time of an op is the number of cycles from when the op was tagged (selected for sampling) to when the op retired.

IBS completion-to-retire cycles

The total number of completion-to-retire cycles across all IBS op samples. The completion-to-retire time of an op is the number of cycles from when the op completed to when the op retired.

IBS branch op

The number of IBS retired branch op samples. A branch operation is a change in program control flow and includes unconditional and conditional branches, subroutine calls and subroutine returns. Branch ops are used to implement AMD64 branch semantics.

IBS mispredicted branch op

The number of IBS samples for retired branch operations that were mispredicted. This event should be used to compute the ratio of mispredicted branch operations to all branch operations.

IBS taken branch op

The number of IBS samples for retired branch operations that were taken branches.

IBS mispredicted taken branch op

The number of IBS samples for retired branch operations that were mispredicted taken branches.

IBS return op

The number of IBS retired branch op samples where the operation was a subroutine return. These samples are a subset of all IBS retired branch op samples.

IBS mispredicted return op

The number of IBS retired branch op samples where the operation was a mispredicted subroutine return. This event should be used to compute the ratio of mispredicted returns to all subroutine returns.

IBS resync op

The number of IBS resync op samples. A resync op is only found in certain microcoded AMD64 instructions and causes a complete pipeline flush.

IBS all load store ops

The number of IBS op samples for ops that perform either a load and/or store operation. An AMD64 instruction may be translated into one ("single fastpath"), two ("double fastpath"), or several ("vector path") ops. Each op may perform a load operation, a store operation or both a load and store operation (each to the same address). Some op samples attributed to an AMD64 instruction may perform a load/store operation while other op samples attributed to the same instruction may not. Further, some branch instructions perform load/store operations. Thus, a mix of op sample types may be attributed to a single AMD64 instruction depending upon the ops that are issued from the AMD64 instruction and the op types.

IBS load ops

The number of IBS op samples for ops that perform a load operation.

IBS store ops

The number of IBS op samples for ops that perform a store operation.

IBS L1 DTLB hit

The number of IBS op samples where either a load or store operation initially hit in the L1 DTLB (data translation lookaside buffer).

IBS L1 DTLB misses L2 hits

The number of IBS op samples where either a load or store operation initially missed in the L1 DTLB and hit in the L2 DTLB.

IBS L1 and L2 DTLB misses

The number of IBS op samples where either a load or store operation initially missed in both the L1 DTLB and the L2 DTLB.

IBS data cache misses

The number of IBS op samples where either a load or store operation initially missed in the data cache (DC).

IBS data cache hits

The number of IBS op samples where either a load or store operation initially hit in the data cache (DC).

IBS misaligned data access

The number of IBS op samples where either a load or store operation caused a misaligned access (i.e., the load or store operation crossed a 128-bit boundary).

IBS bank conflict on load op

The number of IBS op samples where either a load or store operation caused a bank conflict with a load operation.

IBS bank conflict on store op

The number of IBS op samples where either a load or store operation caused a bank conflict with a store operation.

IBS store-to-load forwarded

The number of IBS op samples where data for a load operation was forwarded from a store operation.

IBS store-to-load cancelled

The number of IBS op samples where data forwarding to a load operation from a store was cancelled.

IBS UC memory access

The number of IBS op samples where a load or store operation accessed uncacheable (UC) memory.

IBS WC memory access

The number of IBS op samples where a load or store operation accessed write combining (WC) memory.

IBS locked operation

The number of IBS op samples where a load or store operation was a locked operation.

IBS MAB hit

The number of IBS op samples where a load or store operation hit an already allocated entry in the Miss Address Buffer (MAB).

IBS L1 DTLB 4K page

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 4-KByte page entry in the L1 DTLB was used for address translation.

IBS L1 DTLB 2M page

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 2-MByte page entry in the L1 DTLB was used for address translation.

IBS L1 DTLB 1G page

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address and a 1-GByte page entry in the L1 DTLB was used for address translation.

IBS L2 DTLB 4K page

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 4-KByte page entry for address translation.

IBS L2 DTLB 2M page

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 2-MByte page entry for address translation.

IBS L2 DTLB 1G page

The number of IBS op samples where a load or store operation produced a valid linear (virtual) address, hit the L2 DTLB, and used a 1-GByte page entry for address translation.

IBS data cache miss load latency

The total DC miss load latency (in processor cycles) across all IBS op samples that performed a load operation and missed in the data cache. The miss latency is the number of clock cycles from when the data cache miss was detected to when data was delivered to the core. Divide the total DC miss load latency by the number of data cache misses to obtain the average DC miss load latency.

IBS load resync

Load Resync.

IBS Northbridge local

The number of IBS op samples where a load operation was serviced from the local processor. Northbridge IBS data is only valid for load operations that miss in both the L1 data cache and the L2 data cache. If a load operation crosses a cache line boundary, then the IBS data reflects the access to the lower cache line.

IBS Northbridge remote

The number of IBS op samples where a load operation was serviced from a remote processor.

IBS Northbridge local L3

The number of IBS op samples where a load operation was serviced by the local L3 cache.

IBS Northbridge local core L1 or L2 cache

The number of IBS op samples where a load operation was serviced by a cache (L1 data cache or L2 cache) belonging to a local core which is a sibling of the core making the memory request.

IBS Northbridge remote core L1, L2, L3 cache

The number of IBS op samples where a load operation was serviced by a remote L1 data cache, L2 cache or L3 cache after traversing one or more coherent HyperTransport links.

IBS Northbridge local DRAM

The number of IBS op samples where a load operation was serviced by local system memory (local DRAM via the memory controller).

IBS Northbridge remote DRAM

The number of IBS op samples where a load operation was serviced by remote system memory (after traversing one or more coherent HyperTransport links and through a remote memory controller).

IBS Northbridge local APIC MMIO Config PCI

The number of IBS op samples where a load operation was serviced from local MMIO, configuration or PCI space, or from the local APIC.

IBS Northbridge remote APIC MMIO Config PCI

The number of IBS op samples where a load operation was serviced from remote MMIO, configuration or PCI space.

IBS Northbridge cache modified state

The number of IBS op samples where a load operation was serviced from local or remote cache, and the cache hit state was the Modified (M) state.

IBS Northbridge cache owned state

The number of IBS op samples where a load operation was serviced from local or remote cache, and the cache hit state was the Owned (O) state.

IBS Northbridge local cache latency

The total data cache miss latency (in processor cycles) for load operations that were serviced by the local processor.

IBS Northbridge remote cache latency

The total data cache miss latency (in processor cycles) for load operations that were serviced by a remote processor.
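Several descriptions in the table suggest ratios and averages that can be computed from the derived counts. The following is a minimal sketch, assuming the counts are available in a dictionary keyed by the event names used above; the dictionary layout, the function names, and the metric names are assumptions made for illustration and are not part of CodeXL.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when there are no samples."""
    return numerator / denominator if denominator else None


def derived_metrics(ev):
    """Compute a few IBS-only ratios and averages from derived counts."""
    def get(name):
        return ev.get(name, 0)

    return {
        # IC misses relative to attempted (not all) fetches.
        "ic_miss_ratio": ratio(get("IBS instruction cache misses"),
                               get("IBS fetch attempted")),
        # Total fetch latency divided by attempted fetch samples.
        "avg_fetch_latency": ratio(get("IBS fetch latency"),
                                   get("IBS fetch attempted")),
        "branch_mispredict_ratio": ratio(get("IBS mispredicted branch op"),
                                         get("IBS branch op")),
        "return_mispredict_ratio": ratio(get("IBS mispredicted return op"),
                                         get("IBS return op")),
        # Total DC miss load latency divided by data cache misses.
        "avg_dc_miss_load_latency": ratio(
            get("IBS data cache miss load latency"),
            get("IBS data cache misses")),
    }


def fetch_counts_consistent(ev):
    """Check that attempted fetches equal completed plus aborted fetches."""
    return ev.get("IBS fetch attempted", 0) == (
        ev.get("IBS fetch completed", 0) + ev.get("IBS fetch aborted", 0))
```

All inputs are IBS-derived values, so the results stay within a single sampling basis. A result of None indicates that no samples of the denominator event were collected.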