GPU Resource Management

PxTask provides both a CUDA heap manager and an execution scheduler. To use the execution scheduler (GpuDispatcher), you need to break your CUDA pipeline up into GpuTasks. Each GpuTask has one of three flavors: Host-to-Device copies, CUDA kernels, or Device-to-Host copies. The execution scheduler submits your tasks to the device in an order that optimizes device throughput.

The following classes comprise the PxTask GPU resource management.

GpuTasks

GpuTasks derive from Task (not BaseTask), so they must be submitted to a TaskManager for scheduling. When they are ready to run, GpuTasks are dispatched to their TaskManager's assigned GpuDispatcher, and they have a specialized launchInstance() method rather than run().

GpuTasks come in three flavors: HostToDevice, Kernel, and DeviceToHost. Each GpuTask must declare its flavor by returning a valid value from getTaskHint(). The GpuDispatcher must know the type of operations each task will perform in order to combine the work of multiple tasks optimally; for example, the dispatcher should run all available HostToDevice tasks before any Kernel tasks, and all Kernel tasks before any DeviceToHost tasks. This provides maximal kernel overlap and the fewest CUDA flushes.
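
As a rough illustration, a Kernel flavored GpuTask might look like the following sketch. getTaskHint() and launchInstance() are described in this section; the header name, namespace and enum spelling are assumptions, so consult the PxTask headers for the exact signatures.

    #include "PxGpuTask.h"   // header name is an assumption

    class MyKernelTask : public physx::pxtask::GpuTask
    {
    public:
        virtual const char* getName() const { return "MyKernelTask"; }

        // Declare the flavor so the GpuDispatcher can bin this task correctly.
        virtual physx::pxtask::GpuTaskHint::Enum getTaskHint() const
        {
            return physx::pxtask::GpuTaskHint::Kernel;
        }

        // Launch CUDA work into the provided stream; returning false tells the
        // dispatcher this task has no more work to launch.
        virtual bool launchInstance(CUstream stream, int kernelIndex)
        {
            launchMyKernel(stream);   // hypothetical helper that launches one kernel
            return false;             // a single launch in this sketch
        }
    };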

CudaContextManager

The application should allocate one CudaContextManager for each CUDA context it wishes PhysX/APEX to use. In most typical scenarios only a single CUDA context is required: the user usually has one GPU, and even when two CUDA-capable GPUs are present, only one is normally used for game physics.

If the application is using CUDA itself and wants to share its CUDA context with PhysX, it can provide its context to the CudaContextManager constructor. Otherwise the CudaContextManager will allocate its own context.
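
A minimal creation sketch, assuming a descriptor-based factory; the descriptor and factory names below are assumptions, only the behaviour they illustrate is taken from this section:

    physx::pxtask::CudaContextManagerDesc desc;   // descriptor name is an assumption
    desc.ctx = myExistingCudaContext;             // set to NULL to let PxTask allocate its own context

    physx::pxtask::CudaContextManager* ctxMgr =
        physx::pxtask::createCudaContextManager(desc);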

Note

If the application allocates its own CUDA context and gives it to a CudaContextManager, the context must be detached from any threads when the CudaContextManager takes ownership.

Each CudaContextManager provides a heap for managing memory allocated within the CUDA context (both device and host memory), and a GpuDispatcher instance for scheduling CUDA work on the device managed by the context. The GpuDispatcher pointer held by a scene's TaskManager determines the CUDA context used by that scene.
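
For example, wiring a TaskManager to a particular CUDA context might look like this sketch (the accessor names are assumptions based on the relationship described above):

    // The TaskManager (and therefore the scene that uses it) runs its GpuTasks
    // on whichever context owns the GpuDispatcher it points at.
    physx::pxtask::GpuDispatcher* gpuDispatcher = ctxMgr->getGpuDispatcher();
    taskManager->setGpuDispatcher(*gpuDispatcher);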

The CudaContextManager has ownership of the CUDA context until it is released. Any code which wishes to use CUDA APIs must acquire the context from the CudaContextManager for the duration of the CUDA API calls.
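
A hedged sketch of that acquire/release pattern (the acquireContext()/releaseContext() names are assumptions about the CudaContextManager interface):

    ctxMgr->acquireContext();     // take ownership of the CUDA context on this thread

    // ... raw CUDA API calls: cuMemAlloc(), cuMemcpyHtoDAsync(), kernel launches ...

    ctxMgr->releaseContext();     // detach again so other threads (and PhysX) can use it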

The CudaContextManager also provides functions which perform Graphics/CUDA Interop mappings in a version safe manner. The intent is that application code can call these functions without needing to include CUDA headers.

Note

After allocating a CudaContextManager, you must check the return value of contextIsValid() and discard the instance if it returns false; the CUDA context cannot be used at all if this check fails. At runtime, the application should also poll the return value of GpuDispatcher::failureDetected(). If it returns true, the CUDA context has encountered a non-recoverable error and no further CUDA work is possible.
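
A sketch of both checks, using the methods named in this note (the release() cleanup call is an assumption):

    // At startup: discard the manager if the CUDA context could not be created.
    if (!ctxMgr->contextIsValid())
    {
        ctxMgr->release();    // release() name is an assumption
        ctxMgr = NULL;        // fall back to CPU-only physics
    }

    // At runtime: stop issuing GPU work after a non-recoverable CUDA error.
    if (ctxMgr && ctxMgr->getGpuDispatcher()->failureDetected())
    {
        // disable further GPU physics for this context
    }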

GpuDispatcher

Every CudaContextManager instance creates and owns a GpuDispatcher instance. The GpuDispatcher is responsible for scheduling GpuTasks for the given context as efficiently as possible.

On SM architectures 1.0 through 1.3, this mostly means overlapping the kernels in one task with copies (HostToDevice or DeviceToHost) in other tasks. On SM 2.0 and higher architectures, the GpuDispatcher will try to maximize the kernel overlap (out-of-order completion).

These overlap optimizations are only possible when more than one GpuTask is ready to run at the same time, so if parts of your CUDA pipeline can run in parallel with each other, you should break the work into multiple GpuTasks and set dependencies accordingly.

Dependencies between GpuTasks determine the order in which they are dispatched through the GpuDispatcher, which in turn enforces the order in which those tasks are submitted to CUDA. Note however that independent tasks may be rearranged for maximum efficiency (see the GpuTasks section for details).

If you have a task which cannot start until actual CUDA results are available (this almost always implies a DeviceToHost copy), then one of your GpuTasks must call GpuDispatcher::addCompletionPrereq( BaseTask& ) to register a CUDA completion dependency on the given task. The GpuDispatcher will immediately increment the reference count of the given task and then decrement it again once all of the work submitted to CUDA at the time addCompletionPrereq() was called has completed.

Note

Since addCompletionPrereq() takes a BaseTask reference, CUDA completion tasks may be GpuTasks, LightCpuTasks or normal Tasks.

Warning

CUDA completion dependencies are volatile in the sense that they are not created until your GpuTask pipeline has essentially finished running. To prevent your completion task from starting before the CUDA completion dependency can be added to its reference count, it is recommended practice to add a normal startAfter dependency on the GpuTask that will set the CUDA completion dependency.
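
Putting the two rules together, the final GpuTask of a pipeline might register the completion dependency as in this sketch; addCompletionPrereq() is the call described above, while getTaskManager(), getGpuDispatcher() and the helper names are assumptions:

    virtual bool launchInstance(CUstream stream, int kernelIndex)
    {
        issueResultReadback(stream);   // hypothetical DeviceToHost copy of the results

        // mProcessResultsTask's reference count is bumped now and released once
        // all CUDA work submitted so far has actually completed on the device.
        getTaskManager()->getGpuDispatcher()->addCompletionPrereq(*mProcessResultsTask);
        return false;
    }

    // When submitting tasks, also give mProcessResultsTask a normal startAfter
    // dependency on this GpuTask, per the warning above, so it cannot start
    // before the completion dependency has been registered.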

The GpuTask "run" function, launchInstance() takes two parameters, a CUDA stream and a kernel index. launchInstance() will always be called with the same CUDA stream while the kernel index will increment from 0 until the GpuTask is finished. Returning false from launchInstance tells the GpuDispatcher that your GpuTask is done launching CUDA work and should be released.

The reason for the iterative kernel launching is that optimal kernel scheduling on SM 2.0 GPUs requires kernel launches in different streams to be interleaved, in order to minimize the blocking that occurs when a kernel has to wait for its own stream to become idle.

Note

HostToDevice and DeviceToHost GpuTasks can ignore the kernelIndex and simply perform all of their copies, then return false. There is no benefit to interleaving copies in streams.
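
A Kernel flavored GpuTask that spreads its launches across launchInstance() calls might look like this sketch (mKernelCount and launchStage() are hypothetical):

    virtual bool launchInstance(CUstream stream, int kernelIndex)
    {
        launchStage(kernelIndex, stream);          // launch the kernel for this stage
        return (kernelIndex + 1) < mKernelCount;   // false once every stage has been issued
    }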

GpuTasks have three flavors in order to avoid serious performance pitfalls, especially when multiple pipelines share a single CUDA context. When GpuTasks are dispatched, the GpuDispatcher bins them by flavor, and the dispatcher itself runs a three-phase state machine: HtoD -> Kernel -> DtoH -> HtoD, and so on. At each phase, the GpuDispatcher interleaves launchInstance() calls to all running GpuTasks of that phase until no tasks remain.

Note

All of the work submitted to CUDA in one round trip of the state machine is referred to as a GpuTaskBatch. It should roughly correlate to a single submission of work to the GPU, unless you submitted more work than could fit and caused an implicit flush.

Note

The GpuDispatcher has a utility kernel to perform memory copies as part of Kernel GpuTasks. See GpuDispatcher::launchCopyKernel().

When a GpuTask's launchInstance() function returns false, the GpuDispatcher immediately releases the task and allows its dependency resolution to schedule new GpuTasks of the same flavor, so those can be dispatched in the same pass through the state machine (if a newly dispatched GpuTask does not match the current GpuDispatcher phase, it is binned). In short, three Kernel GpuTasks that each launch a single kernel and have linear dependencies will be executed in exactly the same way as one Kernel GpuTask which launches three kernels (one per launchInstance() call).

Note

Now that the GpuDispatcher state machine has been introduced, it is worth noting that CUDA completions are queued within the GpuDispatcher. The completion tasks' reference counts are incremented immediately, but the CUDA work is not flushed until the GpuDispatcher transitions from the DtoH to the HtoD phase. This is an optimization tailored for multiple-pipeline situations, where we want to minimize the number of CUDA flushes (that is, maximize the amortization).

To make GpuTask scheduling flexible, the GpuDispatcher manages the assignment of CUDA streams to each task. It does this by managing a pool of free CUDA streams, and a few extra flags and fields on each GpuTask. The logic works like this:

  1. If a GpuTask has no GpuTask predicate dependencies (essentially meaning it is the first GpuTask in the pipeline), it is assigned a new CUDA stream from the free pool.
  2. When a GpuTask completes launching CUDA work and is released, it passes its CUDA stream to its first dependent GpuTask.
  3. If a GpuTask has more than one dependent GpuTask, the remaining dependents are given new CUDA streams (from the free pool) and have a flag set on them that tells the GpuDispatcher it must perform a WFI (wait-for-idle) on the GPU before that task's CUDA work is allowed to run.
  4. If a GpuTask has more than one GpuTask predicate dependency (a joining of two CUDA streams), it takes the CUDA stream of the first predicate and is marked as requiring a WFI.

Note

The GpuDispatcher implements the WFI by issuing a non-blocking cuEventRecord in stream 0. For all architectures up to SM 2.0, this causes all previously launched CUDA work to complete before any later-submitted work starts - a sync point.

The GpuDispatcher does not release CUDA streams back to the free pool until it knows that all of the TaskManagers using the GpuDispatcher have finished submitting GpuTasks. This is the primary reason the TaskManager calls the GpuDispatcher's startSimulation() and stopSimulation() methods (the other is to initialize profiling), and also why the TaskManager itself must be notified of simulation start and stop, even if LightCpuTasks are used for all non-GPU work.
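
In practice this just means bracketing each simulation step, even a LightCpuTask-only one, with the TaskManager notifications. A sketch; the method names are taken from this section, the submission helper is hypothetical:

    taskManager->startSimulation();   // forwarded to GpuDispatcher::startSimulation()

    submitFrameTasks(*taskManager);   // hypothetical: queue this step's GpuTasks and CPU tasks
    // ... run the frame ...

    taskManager->stopSimulation();    // lets the GpuDispatcher recycle its CUDA streams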

BlockingWait

CUDA completion notifications are handled by a special BlockingWait thread of the GpuDispatcher. The main GpuDispatcher thread queues the completion task references and the CUDA EventRecord they are waiting for, and the BlockingWait thread waits for each of them in turn and then decrements the reference count of the provided task. The BlockingWait thread is either in a blocking wait for completion tasks or in a blocking wait for a CUDA EventRecord.

Note

Profiling events for the blocking wait thread are prefixed with GDB, while profiling events for the GpuDispatcher dispatch thread are prefixed with GD.

Copy Engine Kernel

The GpuDispatcher has a utility kernel that can execute an arbitrary number of memory copies or memory sets in parallel on the GPU itself. This is the recommended method for performing all host-to-device, device-to-host, and device-to-device copies.

  1. By using the utility kernel from a Kernel flavor GpuTask, you avoid the need for creating DeviceToHost or HostToDevice flavor GpuTasks.
  2. By batching together many copies into a single kernel launch, you save driver overhead and save space inside your push buffer.
  3. By only using Kernel flavored GpuTasks, there is a much greater opportunity to overlap both copies and kernels on SM2.0 architectures and greater.
  4. By using a copy kernel, the copies will show up in your profile.

See GpuDispatcher::launchCopyKernel() for the details of using the utility kernel. The primary point to remember is that the copy descriptors are read from host memory when you perform more than one copy at a time, so the descriptors must be allocated from page-locked memory.
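
A hedged sketch of batching two copies from inside a Kernel flavored GpuTask; the GpuCopyDesc type and its fields are assumptions about the launchCopyKernel() interface, so check the GpuDispatcher header for the real layout:

    // mPinnedDescs must be allocated from page-locked host memory because the
    // utility kernel reads the descriptors from host memory when count > 1.
    physx::pxtask::GpuCopyDesc* descs = mPinnedDescs;   // descriptor type/fields are assumptions

    descs[0].dest   = deviceInputBuffer;
    descs[0].source = hostInputBuffer;
    descs[0].bytes  = inputBytes;
    descs[0].type   = physx::pxtask::GpuCopyDesc::HostToDevice;

    descs[1].dest   = hostResultBuffer;
    descs[1].source = deviceResultBuffer;
    descs[1].bytes  = resultBytes;
    descs[1].type   = physx::pxtask::GpuCopyDesc::DeviceToHost;

    getTaskManager()->getGpuDispatcher()->launchCopyKernel(descs, 2, stream);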

CUDA Profiling

PxTask, more specifically the GpuDispatcher, supports profiling at the CTA level. This requires a number of preparation steps:

  1. Your application must register kernel names with the GpuDispatcher using the registerKernelNames() method. It returns an index to the start of an ID range; each kernel should use this index as an offset when constructing IDs for use with fillKernelEvent().
  2. Each kernel must call KERNEL_START_EVENT(profileBuffer, kernelID) and KERNEL_STOP_EVENT(profileBuffer, kernelID) at the appropriate times (the start and end of the kernel). This requires the inclusion of PsPAGpuEventSrc.h (see the sketch after this list).
  3. At run-time, the kernel ID and current profiling buffer must be passed to each kernel (see GpuDispatcher::getCurrentProfileBuffer()). Note that the profile buffer will be NULL unless you are actively collecting events.
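
A hedged sketch of an instrumented kernel; the macros and header come from the steps above, while the profile-buffer parameter type, the kernel body, and the way kernelID is computed are assumptions:

    #include "PsPAGpuEventSrc.h"   // provides KERNEL_START_EVENT / KERNEL_STOP_EVENT

    // profileBuffer type and the data parameters are assumptions for this sketch.
    __global__ void myKernel(float* data, unsigned int n,
                             physx::PxU32* profileBuffer, physx::PxU16 kernelID)
    {
        KERNEL_START_EVENT(profileBuffer, kernelID);

        unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;

        KERNEL_STOP_EVENT(profileBuffer, kernelID);
    }

    // Host side, inside a Kernel flavored GpuTask (kernelID = base index from
    // registerKernelNames() plus a per-kernel offset; the buffer may be NULL
    // when profiling is inactive):
    //   void* buf = getTaskManager()->getGpuDispatcher()->getCurrentProfileBuffer();
    //   myKernel<<<grid, block, 0, stream>>>(d_data, n, (physx::PxU32*)buf, kernelID);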

The application needs to tune a few parameters to its needs, all of them in PsPAGPUEventSrc.h.

ENABLE_CTA_PROFILING - this flag toggles CTA level event collection globally. Turning it off removes the code (and probably register) overhead in each kernel, and disables the allocation of profile buffers and their copying overhead.

NUM_CTAS_PER_PROFILE_BUFFER - needs to be larger than the count of CTAs launched in your largest batch of kernel launches. The BlockingWait thread will issue a warning if it finds that CTA events were lost because the buffer was too small.

NUM_CTA_PROFILE_BUFFERS - the number of profile buffers allocated. This should be the number of GpuTaskBatches you intend to launch each simulation step. If it is too small, later GpuTaskBatches will not have CTA level profiling.

Warning

After changing ENABLE_CTA_PROFILING or NUM_CTAS_PER_PROFILE_BUFFER, you must rebuild all of your CUDA source files.

CTA profiling works on a GpuTaskBatch basis, and is only active when the GpuDispatcher notices that an event collector is attached and active. At the start of each batch the GpuDispatcher allocates an idle profile buffer, issues a small memset kernel to clear the 8-word header of the buffer, then issues a saturation kernel in order to record the clock on each SM. These clocks are used by the event viewer to line up the SM events, though this only seems to work well on SM 2.0. The GpuDispatcher then executes GpuTasks as normal until the state machine finishes all three phases (HtoD, Kernel, DtoH). At the end, the GpuDispatcher issues another saturation kernel to delineate the end of all of the DtoH copy times, then issues a DtoH copy of the profile buffer itself, and finally issues a blocking cuEventRecord to flush everything to the GPU for execution.

The profile buffer ID is sent to the BlockingWait thread along with the blocking EventRecord, and once the event has completed the BlockingWait thread is responsible for parsing the CTA events and emitting profiling data to the collector.

Note that much effort was devoted to minimizing the performance impact of collecting all this data, but some penalties are unavoidable. Also, the copy back to the host can be substantial, depending on the value you choose for NUM_CTAS_PER_PROFILE_BUFFER.

Note

CTA-based profiling requires global memory atomics, and thus SM architecture 1.1 or above. SM 1.0 cards will see the GpuTaskBatch bar, which is measured via non-blocking cuEventRecords, but they will not see any CTAs.

Note

By default, the BlockingWait thread will parse the CTA events in the profile buffer and attempt to piece together the run-times of each kernel. This is done with a simplistic heuristic and does not work very well with short kernels and SM clock skew; it seems to work best on SM 2.0 devices. By changing EMIT_CTA_EVENTS to 1, near the top of BlockingWait.cpp, the BlockingWait thread will instead emit one profile bar for each CTA, giving you a more truthful picture of how the SMs were used by each kernel. You will still need to account for SM clock skew in your analysis.