GPU Profile: Performance Counters page

CodeXL

previous page next page

CodeXL User Guide

Help > Using CodeXL > GPU Profiler > Using the GPU Profiler > GPU Profiling Project Settings > GPU Profile: Performance Counters page

GPU Profile: Performance Counters page

This page lets you configure the behavior of the Profiler when it collects performance counters.

Settings

· Measure kernel execution time when checked, requires an additional pass during collection. (only applicable for OpenCL)

· Generate occupancy information for each OpenCL™ or HSA kernel profiled When checked, the Profiler generates kernel occupancy data for each OpenCL™ kernel dispatched to a GPU device.

· Profile specific kernels Profile only kernels that their names are specified.

· Counter selection TreeView This treeview displays the available GPU performance counters that can be enabled for a profile session. The performance counters are grouped by counter type. The counters shown depend on the type of GPU installed on the system. If the system has multiple GPU devices from multiple hardware families, the tree contains a top-level node for each available hardware family. For instance, if a system has both an AMD Radeon™ HD 7000 series GPU device (one based on Graphics Core Next Architecture) and an AMD Radeon™ HD 5000 series device, then the counter selection treeview includes counters supported by each device (see screenshot below).

· Some counter selection combinations require multi-pass collection. When profiling using multiple passes, any OpenCL kernels that use shared virtual memory or pipes as arguments will not be profiled.

· When more than one pass is required, the number of required passes will be displayed next to the device name.

· To load and save the counter selections to a file, click on the Load Selection and Save Selection buttons.

Below is a list and brief description of available counters. You also can use the cursor to hover over the counter names in the treeview to view the descriptions.

· The full set of counters for AMD Radeon™ HD 7000 series GPU devices or newer (based on Graphics Core Next Architecture) are described in the following table.

Name	Description
Wavefronts	Total wavefronts.
VALUInsts	The average number of vector ALU instructions executed per work-item (affected by flow control).
SALUInsts	The average number of scalar ALU instructions executed per work-item (affected by flow control).
VFetchInsts	The average number of vector fetch instructions from the video memory executed per work-item (affected by flow control).
SFetchInsts	The average number of scalar fetch instructions from the video memory executed per work-item (affected by flow control).
VWriteInsts	The average number of vector write instructions to the video memory executed per work-item (affected by flow control).
FlatVMemInsts	The average number of FLAT instructions that read from or write to the video memory executed per work item (affected by flow control). Includes FLAT instructions that read from or write to scratch.
GDSInsts	The average number of GDS read or GDS write instructions executed per work item (affected by flow control).
VALUUtilization	The percentage of active vector ALU threads in a wave. A lower number can mean either more thread divergence in a wave or that the work-group size is not a multiple of 64. Value range: 0% (bad), 100% (ideal - no thread divergence).
VALUBusy	The percentage of GPUTime vector ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
SALUBusy	The percentage of GPUTime scalar ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).
LDSInsts	The average number of LDS read or LDS write instructions executed per work-item (affected by flow control).
FlatLDSInsts	The average number of FLAT instructions that read or write to LDS executed per work item (affected by flow control).
LDSBankConflict	The percentage of GPUTime LDS is stalled by bank conflicts. Value range: 0% (optimal) to 100% (bad).
FetchSize	The total kilobytes fetched from the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
WriteSize	The total kilobytes written to the video memory. This is measured with all extra fetches and any cache or memory effects taken into account.
CacheHit	The percentage of fetch, write, atomic, and other instructions that hit the data cache. Value range: 0% (no hit) to 100% (optimal).
MemUnitBusy	The percentage of GPUTime the memory unit is active. The result includes the stall time (MemUnitStalled). This is measured with all extra fetches and writes and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).
MemUnitStalled	The percentage of GPUTime the memory unit is stalled. Try reducing the number or size of fetches and writes if possible. Value range: 0% (optimal) to 100% (bad).
WriteUnitStalled	The percentage of GPUTime the Write unit is stalled. Value range: 0% to 100% (bad).

previous page start next page