pub struct CudaIterator2<'a, R: Copy + DeviceCopy, S: Copy + DeviceCopy> { /* private fields */ }
Expand description

CUDA iterator for two mutable inputs.

Transfers data from main-memory to device memory on a chunk-sized granularity.

Preconditions

All inputs are required to have the same length.

Thread safety

Transfers involve multiple stages (currently 4 stages). These stages are performed in a parallel pipeline using threads. For this reason, thread-safety is implemented through mutable references and the Send marker trait.

See the fold() documentation for details.

Notes for future reference

Currently, CudaIterator2 does not copy back data from device memory to main-memory, but this functionality could be implemented in future.

CudaIterator2 could be used as a template for iterators over less or more inputs, e.g. a CudaIterator1 or CudaIterator3.

Implementations

Apply a GPU function that produces a single, final value.

fold() takes two arguments: a data value, and a CUDA stream. In the case of CudaIterator2, the data value is specified as a two-tuple of launchable slices. The slices are guaranteed to have the same length.

The function passed to fold() is meant to launch a CUDA kernel function on the given CUDA stream.

In contrast to Rust’s standard library fold() iterator, the state in this iterator is implicit in GPU memory.

Thread safety

As the transfer is performed as a parallel pipeline, i.e., transfer and execution overlap. Therefore, the function may be called by multiple threads at the same time, and must be thread-safe. Thread-safety is specified through the Send marker trait.

Furthermore, the CUDA kernel is executed on two or more CUDA streams, and must therefore be thread-safe. However, Rust cannot guarantee thread-safety of CUDA kernels. Thus, the user must ensure that the CUDA kernels are safe to execute on multiple CUDA streams, e.g. by using atomic operations when accessing device memory.

Internals

The current implementation calls fold_par() if the LazyPinnedCopy strategy is selected. For all other strategies, fold_async() is called. The reason is that LazyPinnedCopy performs blocking calls, therefore transfer-compute-overlapping requires multi-threading. In constrast, the other strategies can be executed completely asynchronously by CUDA.

Example
let chunk_len = 1024_usize;
let cpu_memcpy_threads = 2_usize;
let cpu_affinity = CpuAffinity::default();

let mut data_0 = vec![1.0_f32; data_len];
let mut data_1 = vec![1.0_f32; data_len];
let mut result = DeviceBox::new(&0.0f32).unwrap();
let result_ptr = result.as_device_ptr();

(data_0.as_mut_slice(), data_1.as_mut_slice())
    .into_cuda_iter_with_strategy(
        CudaTransferStrategy::PageableCopy,
        chunk_len,
        cpu_memcpy_threads,
        &cpu_affinity,
    )
    .unwrap()
    .fold(|(x, y), stream| {
        unsafe {
            launch!(cuda_dot<<<1, 1, 0, stream>>>(
            x.len(),
            x.as_launchable_ptr(),
            y.as_launchable_ptr(),
            result_ptr
            ))?;
        }

        Ok(())
    }).unwrap();

let mut result_host = 0.0f32;
result.copy_to(&mut result_host).unwrap();

A parallel implementation of fold().

Transfer-compute-overlapping is performed by multi-threading the pipeline stages parallelism. As we configure at least as many threads as there are pipeline stages, each thread may execute the complete pipeline synchronously.

An asynchronous implementation of fold().

Transfer-compute-overlapping is achieved by calling asynchronous CUDA functions. If not all functions are asynchronous, the pipeline is executed synchronously.

Correctness

If a blocking function is scheduled as part of a strategy, then that function must enforce synchronous execution, e.g. by calling stream.synchronize().

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self. Read more

Immutably borrows from an owned value. Read more

Mutably borrows from an owned value. Read more

Returns the argument unchanged.

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

The alignment of pointer.

The type for initializers.

Initializes a with the given initializer. Read more

Dereferences the given pointer. Read more

Mutably dereferences the given pointer. Read more

Drops the object pointed to by the given pointer. Read more

The type returned in the event of a conversion error.

Performs the conversion.

The type returned in the event of a conversion error.

Performs the conversion.