Application Timeline Trace Summary Pages

CodeXL

PreviousNext
CodeXL User Guide
Help > Using CodeXL > GPU Profiler > Using the GPU Profiler > Application Timeline Trace Summary Pages
Application Timeline Trace Summary Pages

The GPU Profiler Summary Pages show the statistics for your OpenCL™ or HSA application. They can provide you with a general idea of the location of the application's bottlenecks. They also provide information such as the number of buffers and number of images created on each context (for OpenCL™), the most expensive kernel call, etc. One Summary Page, the "Warning(s)/Error(s)" page, shows the result of a rule-based analysis of the API trace and timeline data. You can sort each column in a summary page by clicking the table header. You also can rearrange the columns by dragging them to a new location. By default, the Summary Pages are generated when performing an Application Trace profile from CodeXL. You can control whether the Summary Pages are generated by changing the settings on the Application Timeline Trace page. To generate summary pages from the command line, see Using the Command Line Interface. You can find summary pages under the same directory as the .atp file. You can view each summary page in your default web browser because all summary pages are in html format.

API Summary Page

The API Summary page shows statistics for all OpenCL™ or HSA API calls made by the application. This page can help to identify any API hotspots.

API Name

# of Calls

Total Time(ms)

Ave Time(ms)

Max Time(ms)

Min Time(ms)

clSetKernelArg

60884

106.42239

0.00175

0.07284

0.00147

clGetKernelInfo

4332

8.29252

0.00191

0.02347

0.00147

clEnqueueNDRangeKernel

4332

178.09443

0.04111

0.19556

0.00831

clReleaseMemObject

4265

537.25810

0.12597

1.07458

0.00147

clCreateBuffer

4265

13.41322

0.00314

0.05622

0.00195

clEnqueueReadBuffer

938

2403.70320

2.56258

18.78409

0.60280

clEnqueueWriteBuffer

231

318.01591

1.37669

10.39231

0.78809

clEnqueueWriteBufferRect

63

1286.34994

20.41825

63.08036

13.78178

clEnqueueReadBufferRect

63

4288.33441

68.06880

134.18926

45.09560

clReleaseKernel

13

0.03764

0.00290

0.00440

0.00244

clCreateKernel

13

0.06600

0.00508

0.00880

0.00293

clGetDeviceInfo

11

0.03422

0.00311

0.00391

0.00244

clGetPlatformInfo

5

0.02396

0.00479

0.00635

0.00293

clGetContextInfo

4

0.02396

0.00599

0.01613

0.00196

clReleaseContext

2

0.00831

0.00415

0.00587

0.00244

clCreateContextFromType

2

0.06844

0.03422

0.04547

0.02298

clReleaseProgram

1

2.08413

2.08413

2.08413

2.08413

clReleaseCommandQueue

1

97.29672

97.29672

97.29672

97.29672

clFinish

1

0.06307

0.06307

0.06307

0.06307

clCreateProgramWithBinary

1

0.58276

0.58276

0.58276

0.58276

clCreateCommandQueue

1

227.11189

227.11189

227.11189

227.11189

clBuildProgram

1

1492.02943

1492.02943

1492.02943

1492.02943

OpenCL™ Context Summary Page

The Context summary page shows the statistics for all the OpenCL™ kernel dispatch and data transfer operations for each context. It also shows the number of buffers and images created for each context.

 

Context ID

# of Buffers

# of Images

# of Kernel Dispatch - CPU_Device

Total Kernel Time(ms) - CPU_Device

# of Kernel Dispatch - Juniper

Total Kernel Time(ms) - Juniper

# of Memory Transfer

Total Memory Time(ms)

# of Read

Total Read Time(ms)

Size of Read

# of Write

Total Write Time(ms)

Size of Write

# of Map

Total Map Time(ms)

Size of Map

# of Copy

Total Copy Time(ms)

Size of Copy

0

2

0

1

69.07170

1

4.10271

2

1.08503

0

0

0 Byte

2

1.08503

512.00 KB

0

0

0 Byte

0

0

0 Byte

1

2

0

1

35.53000

NA

NA

1

0.21951

0

0

0 Byte

1

0.21951

256.00 KB

0

0

0 Byte

0

0

0 Byte

2

2

0

NA

NA

1

3.41856

1

0.68449

0

0

0 Byte

1

0.68449

256.00 KB

0

0

0 Byte

0

0

0 Byte

3

2

0

1

35.73143

NA

NA

1

0.17307

0

0

0 Byte

1

0.17307

256.00 KB

0

0

0 Byte

0

0

0 Byte

4

2

0

NA

NA

1

1.38896

1

1.45957

0

0

0 Byte

1

1.45957

256.00 KB

0

0

0 Byte

0

0

0 Byte

Total

10

0

3

140.33313

3

8.91023

6

3.62166

0

0

0 Byte

6

3.62166

1.50 MB

0

0

0 Byte

0

0

0 Byte

Kernel Summary Page

The Kernel summary page shows statistics for all the kernels that are dispatched by the application.

 

Kernel Name

Device Name

# of Calls

Total Time(ms)

Avg Time(ms)

Max Time(ms)

Min Time(ms)

multiDeviceKernel

CPU_Device

3

140.33313

46.77771

69.07170

35.53000

multiDeviceKernel

Juniper

3

8.91023

2.97008

4.10271

1.38896

OpenCL™Top 10 Data Transfer Summary Page

The Top 10 Data transfer summary page shows a sorted list of the ten most time-consuming OpenCL™ data transfers operations. Clicking on a hyperlink takes you to the corresponding item in the Timeline view.
Since data transfer operations can have a great impact on application performance, ensuring that kernel execution operations and data transfer operations overlap can lead to better overall performance.

 

Command Type

Context ID

Command Queue ID

Duration(ms)

Transfer Size

Transfer Rate(MB/s)

Thread ID

Sequence ID

WRITE_BUFFER

4

5

1.45957

256.00 KB

171.284

3496

108

WRITE_BUFFER

0

1

0.94618

256.00 KB

264.220

3496

16

WRITE_BUFFER

2

3

0.68449

256.00 KB

365.237

3496

56

WRITE_BUFFER

1

2

0.21951

256.00 KB

1138.895

3496

55

WRITE_BUFFER

3

4

0.17307

256.00 KB

1444.527

3496

107

WRITE_BUFFER

0

0

0.13885

256.00 KB

1800.569

3496

15

Top 10 Kernel Summary Page

The Top 10 kernel summary page shows a sorted list of the 10 most time-consuming kernel execution operations. Clicking on a hyperlink takes you to the corresponding item in Timeline view.

 

Kernel Name

Context ID

Command Queue ID

Device Name

Duration(ms)

Global Work Size

Work-Group Size

Thread ID

Sequence ID

multiDeviceKernel

0

0

CPU_Device

69.07170

{65536}

{64}

3496

21

multiDeviceKernel

3

4

CPU_Device

35.73143

{65536}

{64}

1968

0

multiDeviceKernel

1

2

CPU_Device

35.53000

{65536}

{64}

3496

61

multiDeviceKernel

0

1

Juniper

4.10271

{65536}

{64}

3496

23

multiDeviceKernel

2

3

Juniper

3.41856

{65536}

{64}

3496

63

multiDeviceKernel

4

5

Juniper

1.38896

{65536}

{64}

2892

0

Warning(s)/Error(s) Page

The Warning(s)/Error(s) Page shows potential problems in your OpenCL™ or HSA application. It can detect unreleased resources,  API failures, and it can provide suggestions for better performance. Clicking on a hyperlink takes you to the corresponding  API.

 

Index

Call Index

Thread ID

Type

Message

0

542

2268

Warning

Memory leak detected [Ref = 1, Handle = 0x0B1730B0]: Object created by clEnqueueNDRangeKernel

216

208

2268

Best Practices

clEnqueueNDRangeKernel: Work-group size is too small - [1,1,1]. Recommended work-group size is a multiple of 64.

270

319

2268

Best Practices

clEnqueueNDRangeKernel: Global work size is too small - [111], resulting in low GPU utilization.

144

482

1932

Error

clEnqueueNDRangeKernel returns CL_INVALID_KERNEL_ARGS.

 

From these summary pages, it is possible to determine whether an OpenCL™ application is bound by kernel execution or data transfer (Context Summary page). If the application is bound by kernel execution, you can determine which device is the bottleneck. From the Kernel Summary page, you can find the name of the kernel with the highest total execution time. From the Top 10 Kernel Summary page, you can find the individual kernel instance with the highest execution time. If the kernel execution on a GPU device is the bottleneck, the GPU performance counters can then be used to investigate the bottleneck inside the kernel.

If the application is bound by the data transfers, it is possible to determine the most expensive data transfer type (read, write, copy or map) in the application from the Context Summary page. You can then investigate whether you can minimize this type of data transfer by modifying the algorithm if necessary. With help from the Timeline View, you can investigate whether data transfers have been executed in the most efficient way (concurrently with a kernel execution).