CUDA iterator for two mutable inputs.
Transfers data from main memory to device memory at chunk-sized granularity.
Preconditions
All inputs are required to have the same length.
Thread safety
Transfers involve multiple stages (currently four). These stages are performed in a parallel pipeline using threads. For this reason, thread safety is enforced through mutable references and the Send marker trait. See the fold() documentation for details.
Notes for future reference
Currently, CudaIterator2 does not copy data back from device memory to main memory, but this functionality could be implemented in the future. CudaIterator2 could serve as a template for iterators over fewer or more inputs, e.g. a CudaIterator1 or CudaIterator3.
Implementations
impl<'a, R: Copy + DeviceCopy + Send, S: Copy + DeviceCopy + Send> CudaIterator2<'a, R, S>
pub fn fold<F>(&mut self, f: F) -> Result<CudaTransferStrategyMeasurement> where
    F: Fn((LaunchableSlice<'_, R>, LaunchableSlice<'_, S>), &Stream) -> Result<()> + Send + Sync,
Apply a GPU function that produces a single, final value.

The function passed to fold() takes two arguments: a data value and a CUDA stream. In the case of CudaIterator2, the data value is specified as a two-tuple of launchable slices; the slices are guaranteed to have the same length. The function is meant to launch a CUDA kernel on the given CUDA stream.

In contrast to the fold() iterator of Rust's standard library, the state of this iterator is implicit in GPU memory.
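For comparison, the accumulator state of the standard library's fold() is explicit and lives on the host:

// std::iter::Iterator::fold: the accumulator `acc` is explicit host state.
let sum = [1.0_f32, 2.0, 3.0].iter().fold(0.0, |acc, x| acc + x);
assert_eq!(sum, 6.0);
// With CudaIterator2, the analogous state (e.g. `result` in the example
// below) lives in device memory and is updated by the launched kernel.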
Thread safety
The transfer is performed as a parallel pipeline, i.e., transfer and execution overlap. Therefore, the function may be called by multiple threads at the same time and must be thread-safe. Thread safety is specified through the Send marker trait.

Furthermore, the CUDA kernel is executed on two or more CUDA streams, and must therefore be thread-safe as well. However, Rust cannot guarantee the thread safety of CUDA kernels. Thus, the user must ensure that the kernels are safe to execute on multiple CUDA streams, e.g. by using atomic operations when accessing device memory.
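The requirement is analogous to sharing memory between CPU threads. The following host-side Rust sketch is an analogy, not a CUDA kernel: the two threads stand in for two CUDA streams, and fetch_add plays the role that atomicAdd would play in device code.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Shared location, analogous to a shared device-memory address.
static RESULT: AtomicU64 = AtomicU64::new(0);

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // Analogous to atomicAdd() in a CUDA kernel.
                    RESULT.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Non-atomic writes from concurrent kernels could lose updates.
    assert_eq!(RESULT.load(Ordering::Relaxed), 2000);
}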
Internals
The current implementation calls fold_par() if the LazyPinnedCopy strategy is selected, and fold_async() for all other strategies. The reason is that LazyPinnedCopy performs blocking calls, so transfer-compute overlap requires multi-threading. In contrast, the other strategies can be executed completely asynchronously by CUDA.
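A simplified, hypothetical model of this dispatch (the real strategy enum has more variants, and the selection happens inside fold()):

enum Strategy {
    PageableCopy,
    LazyPinnedCopy,
}

fn select_fold(strategy: &Strategy) -> &'static str {
    match strategy {
        // Blocking copies: overlapping requires multi-threading.
        Strategy::LazyPinnedCopy => "fold_par",
        // All other strategies run fully asynchronously on CUDA streams.
        _ => "fold_async",
    }
}

fn main() {
    assert_eq!(select_fold(&Strategy::LazyPinnedCopy), "fold_par");
    assert_eq!(select_fold(&Strategy::PageableCopy), "fold_async");
}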
Example
let chunk_len = 1024_usize;
let data_len = 4096_usize; // example data length (must match both inputs)
let cpu_memcpy_threads = 2_usize;
let cpu_affinity = CpuAffinity::default();
let mut data_0 = vec![1.0_f32; data_len];
let mut data_1 = vec![1.0_f32; data_len];
let mut result = DeviceBox::new(&0.0f32).unwrap();
let result_ptr = result.as_device_ptr();

(data_0.as_mut_slice(), data_1.as_mut_slice())
    .into_cuda_iter_with_strategy(
        CudaTransferStrategy::PageableCopy,
        chunk_len,
        cpu_memcpy_threads,
        &cpu_affinity,
    )
    .unwrap()
    .fold(|(x, y), stream| {
        // Launch the dot-product kernel on the given stream for this chunk.
        unsafe {
            launch!(cuda_dot<<<1, 1, 0, stream>>>(
                x.len(),
                x.as_launchable_ptr(),
                y.as_launchable_ptr(),
                result_ptr
            ))?;
        }
        Ok(())
    })
    .unwrap();

// Copy the accumulated result back to the host.
let mut result_host = 0.0f32;
result.copy_to(&mut result_host).unwrap();
pub fn fold_par<F>(&mut self, f: F) -> Result<CudaTransferStrategyMeasurement> where
    F: Fn((LaunchableSlice<'_, R>, LaunchableSlice<'_, S>), &Stream) -> Result<()> + Send + Sync,
A parallel implementation of fold().

Transfer-compute overlap is achieved by executing the pipeline stages in parallel on multiple threads. As at least as many threads are configured as there are pipeline stages, each thread may execute the complete pipeline synchronously.
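A minimal, self-contained sketch of that threading model, assuming hypothetically one thread per in-flight chunk (the real stage boundaries and scheduling differ):

use std::thread;

// Each thread runs the complete pipeline for its chunk synchronously;
// overlap across chunks comes from running the threads in parallel.
fn run_pipeline(chunk_id: usize) {
    // Stand-ins for the real stages, e.g.:
    // copy to pinned buffer -> transfer to device -> launch kernel -> sync.
    println!("chunk {}: copy -> transfer -> launch -> sync", chunk_id);
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|chunk_id| thread::spawn(move || run_pipeline(chunk_id)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}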
impl<'a, R: Copy + DeviceCopy, S: Copy + DeviceCopy> CudaIterator2<'a, R, S>
pub fn fold_async<F>(&mut self, f: F) -> Result<CudaTransferStrategyMeasurement> where
    F: FnMut((LaunchableSlice<'_, R>, LaunchableSlice<'_, S>), &Stream) -> Result<()>,
An asynchronous implementation of fold().

Transfer-compute overlap is achieved by calling asynchronous CUDA functions. If not all functions are asynchronous, the pipeline executes synchronously.
Correctness
If a blocking function is scheduled as part of a strategy, then that function must enforce synchronous execution, e.g. by calling stream.synchronize().
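A hedged sketch of such a blocking stage, assuming a rustacuda Stream; do_blocking_work is a hypothetical placeholder for the strategy's blocking call:

use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Hypothetical blocking stage within a transfer strategy.
fn blocking_stage(stream: &Stream) -> CudaResult<()> {
    // Enforce synchronous execution: wait for all work queued on the
    // stream before the blocking call that depends on it.
    stream.synchronize()?;
    // do_blocking_work()?; // hypothetical placeholder
    Ok(())
}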
Auto Trait Implementations
impl<'a, R, S> !RefUnwindSafe for CudaIterator2<'a, R, S>
impl<'a, R, S> Send for CudaIterator2<'a, R, S> where R: Send, S: Send
impl<'a, R, S> !Sync for CudaIterator2<'a, R, S>
impl<'a, R, S> Unpin for CudaIterator2<'a, R, S>
impl<'a, R, S> !UnwindSafe for CudaIterator2<'a, R, S>
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized

pub fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.