CUDA iterator for two mutable inputs.
Transfers data from main memory to device memory at chunk-sized granularity.
Preconditions
All inputs are required to have the same length.
Thread safety
Transfers involve multiple stages (currently four). These stages are performed in a parallel pipeline using threads. For this reason, thread safety is enforced through mutable references and the Send marker trait. See the fold() documentation for details.
Notes for future reference
Currently, CudaIterator2 does not copy data back from device memory to main memory, but this functionality could be implemented in the future. CudaIterator2 could serve as a template for iterators over fewer or more inputs, e.g. a CudaIterator1 or CudaIterator3.
Implementations
impl<'a, R: Copy + DeviceCopy + Send, S: Copy + DeviceCopy + Send> CudaIterator2<'a, R, S>
pub fn fold<F>(&mut self, f: F) -> Result<CudaTransferStrategyMeasurement> where
    F: Fn((LaunchableSlice<'_, R>, LaunchableSlice<'_, S>), &Stream) -> Result<()> + Send + Sync,
Apply a GPU function that produces a single, final value.

The function passed to fold() takes two arguments: a data value and a CUDA stream. In the case of CudaIterator2, the data value is specified as a two-tuple of launchable slices; the slices are guaranteed to have the same length. The function is meant to launch a CUDA kernel on the given CUDA stream.

In contrast to the fold() iterator of Rust's standard library, the state of this iterator is implicit in GPU memory.
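For comparison, the accumulator state of the standard library's fold() is explicit and lives on the host:

// std::iter::Iterator::fold: the accumulator `acc` is explicit host state.
let sum = [1.0_f32, 2.0, 3.0].iter().fold(0.0, |acc, x| acc + x);
assert_eq!(sum, 6.0);
// With CudaIterator2, the analogous state (e.g. `result` in the example
// below) lives in device memory and is updated by the launched kernel.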
Thread safety
The transfer is performed as a parallel pipeline, i.e., transfer and execution overlap. Therefore, the function may be called by multiple threads at the same time and must be thread-safe. Thread safety is specified through the Send marker trait.

Furthermore, the CUDA kernel is executed on two or more CUDA streams, and must therefore be thread-safe as well. However, Rust cannot guarantee the thread safety of CUDA kernels. Thus, the user must ensure that the kernels are safe to execute on multiple CUDA streams, e.g. by using atomic operations when accessing device memory.
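The requirement is analogous to sharing memory between CPU threads. The following host-side Rust sketch is an analogy, not a CUDA kernel: the two threads stand in for two CUDA streams, and fetch_add plays the role that atomicAdd would play in device code.

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Shared location, analogous to a shared device-memory address.
static RESULT: AtomicU64 = AtomicU64::new(0);

fn main() {
    let handles: Vec<_> = (0..2)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // Analogous to atomicAdd() in a CUDA kernel.
                    RESULT.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Non-atomic writes from concurrent kernels could lose updates.
    assert_eq!(RESULT.load(Ordering::Relaxed), 2000);
}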
Internals
The current implementation calls fold_par() if the LazyPinnedCopy strategy is selected, and fold_async() for all other strategies. The reason is that LazyPinnedCopy performs blocking calls, so transfer-compute overlap requires multi-threading. In contrast, the other strategies can be executed completely asynchronously by CUDA.
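A simplified, hypothetical model of this dispatch (the real strategy enum has more variants, and the selection happens inside fold()):

enum Strategy {
    PageableCopy,
    LazyPinnedCopy,
}

fn select_fold(strategy: &Strategy) -> &'static str {
    match strategy {
        // Blocking copies: overlapping requires multi-threading.
        Strategy::LazyPinnedCopy => "fold_par",
        // All other strategies run fully asynchronously on CUDA streams.
        _ => "fold_async",
    }
}

fn main() {
    assert_eq!(select_fold(&Strategy::LazyPinnedCopy), "fold_par");
    assert_eq!(select_fold(&Strategy::PageableCopy), "fold_async");
}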
Example
let chunk_len = 1024_usize;
let data_len = 4096_usize; // example data length (must match both inputs)
let cpu_memcpy_threads = 2_usize;
let cpu_affinity = CpuAffinity::default();
let mut data_0 = vec![1.0_f32; data_len];
let mut data_1 = vec![1.0_f32; data_len];
let mut result = DeviceBox::new(&0.0f32).unwrap();
let result_ptr = result.as_device_ptr();

(data_0.as_mut_slice(), data_1.as_mut_slice())
    .into_cuda_iter_with_strategy(
        CudaTransferStrategy::PageableCopy,
        chunk_len,
        cpu_memcpy_threads,
        &cpu_affinity,
    )
    .unwrap()
    .fold(|(x, y), stream| {
        // Launch the dot-product kernel on the given stream for this chunk.
        unsafe {
            launch!(cuda_dot<<<1, 1, 0, stream>>>(
                x.len(),
                x.as_launchable_ptr(),
                y.as_launchable_ptr(),
                result_ptr
            ))?;
        }
        Ok(())
    })
    .unwrap();

// Copy the accumulated result back to the host.
let mut result_host = 0.0f32;
result.copy_to(&mut result_host).unwrap();
pub fn fold_par<F>(&mut self, f: F) -> Result<CudaTransferStrategyMeasurement> where
    F: Fn((LaunchableSlice<'_, R>, LaunchableSlice<'_, S>), &Stream) -> Result<()> + Send + Sync,
A parallel implementation of fold().

Transfer-compute overlap is achieved by executing the pipeline stages in parallel on multiple threads. As at least as many threads are configured as there are pipeline stages, each thread may execute the complete pipeline synchronously.
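A minimal, self-contained sketch of that threading model, assuming hypothetically one thread per in-flight chunk (the real stage boundaries and scheduling differ):

use std::thread;

// Each thread runs the complete pipeline for its chunk synchronously;
// overlap across chunks comes from running the threads in parallel.
fn run_pipeline(chunk_id: usize) {
    // Stand-ins for the real stages, e.g.:
    // copy to pinned buffer -> transfer to device -> launch kernel -> sync.
    println!("chunk {}: copy -> transfer -> launch -> sync", chunk_id);
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|chunk_id| thread::spawn(move || run_pipeline(chunk_id)))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}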
impl<'a, R: Copy + DeviceCopy, S: Copy + DeviceCopy> CudaIterator2<'a, R, S>
pub fn fold_async<F>(&mut self, f: F) -> Result<CudaTransferStrategyMeasurement> where
    F: FnMut((LaunchableSlice<'_, R>, LaunchableSlice<'_, S>), &Stream) -> Result<()>,
An asynchronous implementation of fold().

Transfer-compute overlap is achieved by calling asynchronous CUDA functions. If not all functions are asynchronous, the pipeline executes synchronously.
Correctness
If a blocking function is scheduled as part of a strategy, then that function must enforce synchronous execution, e.g. by calling stream.synchronize().
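A hedged sketch of such a blocking stage, assuming a rustacuda Stream; do_blocking_work is a hypothetical placeholder for the strategy's blocking call:

use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Hypothetical blocking stage within a transfer strategy.
fn blocking_stage(stream: &Stream) -> CudaResult<()> {
    // Enforce synchronous execution: wait for all work queued on the
    // stream before the blocking call that depends on it.
    stream.synchronize()?;
    // do_blocking_work()?; // hypothetical placeholder
    Ok(())
}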
Auto Trait Implementations
impl<'a, R, S> !RefUnwindSafe for CudaIterator2<'a, R, S>
impl<'a, R, S> Send for CudaIterator2<'a, R, S> where R: Send, S: Send
impl<'a, R, S> !Sync for CudaIterator2<'a, R, S>
impl<'a, R, S> Unpin for CudaIterator2<'a, R, S>
impl<'a, R, S> !UnwindSafe for CudaIterator2<'a, R, S>
Blanket Implementations
impl<T> BorrowMut<T> for T where T: ?Sized

pub fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.