pub enum GpuRadixPartitionAlgorithm {
NC,
LASWWC,
SSWWCv2,
SSWWCv2G,
HSSWWCv4,
}
Expand description
Specifies the radix partition algorithm.
Variants
NC
Non-caching radix partition.
This is a standard, parallel radix partition algorithm.
LASWWC
Radix partitioning with look-ahead software write combining.
This algorithm reorders tuples in shared memory before writing them out to device memory. The purpose is to coalesce as many writes as possible, which can lead to higher throughput.
This algorithm was first described by Stehle and Jacobsen in “A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs”. It is also implemented by Sioulas et al. for “Hardware-conscious Hash-Joins on GPUs”, although the authors do not mention or cite it in the paper.
SSWWCv2
Radix partitioning with shared software write combining, version 2.
In version 1, a warp can block all other warps by holding locks on more than one buffer (i.e., leader candidates).
Version 2 tries to avoid blocking other warps by releasing all locks except one (i.e., the leader’s buffer lock).
SSWWCv2G
Radix partitioning with shared software write combining, version 2G.
Version 2G is exactly the same algorithm as version 2, but stores the SWWC buffers in device memory instead of in shared memory. The fanout can thus be scaled higher.
Ideally, the buffers are retained in the GPU L2 cache. Storing buffers in the L1 cache is infeasible for the targeted use-case of high fanouts, as the buffers are too large. To avoid polluting the L2 cache, reading input and writing output uses streaming load and store instructions.
HSSWWCv4
Radix partitioning with hierarchical shared software write combining, version 4.
Version 4 performs the buffer flush from dmem to memory asynchronously with double-buffering.
Double-buffering means that there are fanout + #warps
dmem buffers. Thus each warp owns a
spare buffer. When the dmem buffer of a partition is
full, the warp that holds the lock exchanges the full dmem buffer for its empty spare
buffer, and releases the lock. Only then does the warp flush the dmem buffer to CPU memory.
Double-buffering enables all warps to always make progress during the dmem flush, because there is always a (partially-) empty dmem buffer available.
Trait Implementations
sourceimpl Clone for GpuRadixPartitionAlgorithm
impl Clone for GpuRadixPartitionAlgorithm
sourcefn clone(&self) -> GpuRadixPartitionAlgorithm
fn clone(&self) -> GpuRadixPartitionAlgorithm
Returns a copy of the value. Read more
1.0.0 · sourcefn clone_from(&mut self, source: &Self)
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source
. Read more
sourceimpl Debug for GpuRadixPartitionAlgorithm
impl Debug for GpuRadixPartitionAlgorithm
impl Copy for GpuRadixPartitionAlgorithm
Auto Trait Implementations
impl RefUnwindSafe for GpuRadixPartitionAlgorithm
impl Send for GpuRadixPartitionAlgorithm
impl Sync for GpuRadixPartitionAlgorithm
impl Unpin for GpuRadixPartitionAlgorithm
impl UnwindSafe for GpuRadixPartitionAlgorithm
Blanket Implementations
sourceimpl<T> BorrowMut<T> for T where
T: ?Sized,
impl<T> BorrowMut<T> for T where
T: ?Sized,
const: unstable · sourcepub fn borrow_mut(&mut self) -> &mut T
pub fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value. Read more
impl<T> Pointable for T
impl<T> Pointable for T
sourceimpl<T> ToOwned for T where
T: Clone,
impl<T> ToOwned for T where
T: Clone,
type Owned = T
type Owned = T
The resulting type after obtaining ownership.
sourcepub fn to_owned(&self) -> T
pub fn to_owned(&self) -> T
Creates owned data from borrowed data, usually by cloning. Read more
sourcepub fn clone_into(&self, target: &mut T)
pub fn clone_into(&self, target: &mut T)
toowned_clone_into
)Uses borrowed data to replace owned data, usually by cloning. Read more