Application Timeline Trace Summary Pages

CodeXL

previous page next page

CodeXL User Guide

Help > Using CodeXL > GPU Profiler > Using the GPU Profiler > Application Timeline Trace Summary Pages

Application Timeline Trace Summary Pages

The GPU Profiler Summary Pages show the statistics for your OpenCL™ or HSA application. They can provide you with a general idea of the location of the application's bottlenecks. They also provide information such as the number of buffers and number of images created on each context (for OpenCL™), the most expensive kernel call, etc. One Summary Page, the "Warning(s)/Error(s)" page, shows the result of a rule-based analysis of the API trace and timeline data. You can sort each column in a summary page by clicking the table header. You also can rearrange the columns by dragging them to a new location. By default, the Summary Pages are generated when performing an Application Trace profile from CodeXL. You can control whether the Summary Pages are generated by changing the settings on the Application Timeline Trace page. To generate summary pages from the command line, see Using the Command Line Interface. You can find summary pages under the same directory as the .atp file. You can view each summary page in your default web browser because all summary pages are in html format.

API Summary Page

The API Summary page shows statistics for all OpenCL™ or HSA API calls made by the application. This page can help to identify any API hotspots.

API Name	# of Calls	Total Time(ms)	Ave Time(ms)	Max Time(ms)	Min Time(ms)
clSetKernelArg	60884	106.42239	0.00175	0.07284	0.00147
clGetKernelInfo	4332	8.29252	0.00191	0.02347	0.00147
clEnqueueNDRangeKernel	4332	178.09443	0.04111	0.19556	0.00831
clReleaseMemObject	4265	537.25810	0.12597	1.07458	0.00147
clCreateBuffer	4265	13.41322	0.00314	0.05622	0.00195
clEnqueueReadBuffer	938	2403.70320	2.56258	18.78409	0.60280
clEnqueueWriteBuffer	231	318.01591	1.37669	10.39231	0.78809
clEnqueueWriteBufferRect	63	1286.34994	20.41825	63.08036	13.78178
clEnqueueReadBufferRect	63	4288.33441	68.06880	134.18926	45.09560
clReleaseKernel	13	0.03764	0.00290	0.00440	0.00244
clCreateKernel	13	0.06600	0.00508	0.00880	0.00293
clGetDeviceInfo	11	0.03422	0.00311	0.00391	0.00244
clGetPlatformInfo	5	0.02396	0.00479	0.00635	0.00293
clGetContextInfo	4	0.02396	0.00599	0.01613	0.00196
clReleaseContext	2	0.00831	0.00415	0.00587	0.00244
clCreateContextFromType	2	0.06844	0.03422	0.04547	0.02298
clReleaseProgram	1	2.08413	2.08413	2.08413	2.08413
clReleaseCommandQueue	1	97.29672	97.29672	97.29672	97.29672
clFinish	1	0.06307	0.06307	0.06307	0.06307
clCreateProgramWithBinary	1	0.58276	0.58276	0.58276	0.58276
clCreateCommandQueue	1	227.11189	227.11189	227.11189	227.11189
clBuildProgram	1	1492.02943	1492.02943	1492.02943	1492.02943

OpenCL™ Context Summary Page

The Context summary page shows the statistics for all the OpenCL™ kernel dispatch and data transfer operations for each context. It also shows the number of buffers and images created for each context.

Context ID	# of Buffers	# of Images	# of Kernel Dispatch - CPU_Device	Total Kernel Time(ms) - CPU_Device	# of Kernel Dispatch - Juniper	Total Kernel Time(ms) - Juniper	# of Memory Transfer	Total Memory Time(ms)	# of Read	Total Read Time(ms)	Size of Read	# of Write	Total Write Time(ms)	Size of Write	# of Map	Total Map Time(ms)	Size of Map	# of Copy	Total Copy Time(ms)	Size of Copy
0	2	0	1	69.07170	1	4.10271	2	1.08503	0	0	0 Byte	2	1.08503	512.00 KB	0	0	0 Byte	0	0	0 Byte
1	2	0	1	35.53000	NA	NA	1	0.21951	0	0	0 Byte	1	0.21951	256.00 KB	0	0	0 Byte	0	0	0 Byte
2	2	0	NA	NA	1	3.41856	1	0.68449	0	0	0 Byte	1	0.68449	256.00 KB	0	0	0 Byte	0	0	0 Byte
3	2	0	1	35.73143	NA	NA	1	0.17307	0	0	0 Byte	1	0.17307	256.00 KB	0	0	0 Byte	0	0	0 Byte
4	2	0	NA	NA	1	1.38896	1	1.45957	0	0	0 Byte	1	1.45957	256.00 KB	0	0	0 Byte	0	0	0 Byte
Total	10	0	3	140.33313	3	8.91023	6	3.62166	0	0	0 Byte	6	3.62166	1.50 MB	0	0	0 Byte	0	0	0 Byte

Kernel Summary Page

The Kernel summary page shows statistics for all the kernels that are dispatched by the application.

Kernel Name	Device Name	# of Calls	Total Time(ms)	Avg Time(ms)	Max Time(ms)	Min Time(ms)
multiDeviceKernel	CPU_Device	3	140.33313	46.77771	69.07170	35.53000
multiDeviceKernel	Juniper	3	8.91023	2.97008	4.10271	1.38896

OpenCL™Top 10 Data Transfer Summary Page

The Top 10 Data transfer summary page shows a sorted list of the ten most time-consuming OpenCL™ data transfers operations. Clicking on a hyperlink takes you to the corresponding item in the Timeline view.
Since data transfer operations can have a great impact on application performance, ensuring that kernel execution operations and data transfer operations overlap can lead to better overall performance.

Command Type	Context ID	Command Queue ID	Duration(ms)	Transfer Size	Transfer Rate(MB/s)	Thread ID	Sequence ID
WRITE_BUFFER	4	5	1.45957	256.00 KB	171.284	3496	108
WRITE_BUFFER	0	1	0.94618	256.00 KB	264.220	3496	16
WRITE_BUFFER	2	3	0.68449	256.00 KB	365.237	3496	56
WRITE_BUFFER	1	2	0.21951	256.00 KB	1138.895	3496	55
WRITE_BUFFER	3	4	0.17307	256.00 KB	1444.527	3496	107
WRITE_BUFFER	0	0	0.13885	256.00 KB	1800.569	3496	15

Top 10 Kernel Summary Page

The Top 10 kernel summary page shows a sorted list of the 10 most time-consuming kernel execution operations. Clicking on a hyperlink takes you to the corresponding item in Timeline view.

Kernel Name	Context ID	Command Queue ID	Device Name	Duration(ms)	Global Work Size	Work-Group Size	Thread ID	Sequence ID
multiDeviceKernel	0	0	CPU_Device	69.07170	{65536}	{64}	3496	21
multiDeviceKernel	3	4	CPU_Device	35.73143	{65536}	{64}	1968	0
multiDeviceKernel	1	2	CPU_Device	35.53000	{65536}	{64}	3496	61
multiDeviceKernel	0	1	Juniper	4.10271	{65536}	{64}	3496	23
multiDeviceKernel	2	3	Juniper	3.41856	{65536}	{64}	3496	63
multiDeviceKernel	4	5	Juniper	1.38896	{65536}	{64}	2892	0

Warning(s)/Error(s) Page

The Warning(s)/Error(s) Page shows potential problems in your OpenCL™ or HSA application. It can detect unreleased resources, API failures, and it can provide suggestions for better performance. Clicking on a hyperlink takes you to the corresponding API.

Index	Call Index	Thread ID	Type	Message
0	542	2268	Warning	Memory leak detected [Ref = 1, Handle = 0x0B1730B0]: Object created by clEnqueueNDRangeKernel
216	208	2268	Best Practices	clEnqueueNDRangeKernel: Work-group size is too small - [1,1,1]. Recommended work-group size is a multiple of 64.
270	319	2268	Best Practices	clEnqueueNDRangeKernel: Global work size is too small - [111], resulting in low GPU utilization.
144	482	1932	Error	clEnqueueNDRangeKernel returns CL_INVALID_KERNEL_ARGS.

From these summary pages, it is possible to determine whether an OpenCL™ application is bound by kernel execution or data transfer (Context Summary page). If the application is bound by kernel execution, you can determine which device is the bottleneck. From the Kernel Summary page, you can find the name of the kernel with the highest total execution time. From the Top 10 Kernel Summary page, you can find the individual kernel instance with the highest execution time. If the kernel execution on a GPU device is the bottleneck, the GPU performance counters can then be used to investigate the bottleneck inside the kernel.

If the application is bound by the data transfers, it is possible to determine the most expensive data transfer type (read, write, copy or map) in the application from the Context Summary page. You can then investigate whether you can minimize this type of data transfer by modifying the algorithm if necessary. With help from the Timeline View, you can investigate whether data transfers have been executed in the most efficient way (concurrently with a kernel execution).

previous page start next page