pub enum GpuRadixPartitionAlgorithm {
    NC,
    LASWWC,
    SSWWCv2,
    SSWWCv2G,
    HSSWWCv4,
}
Expand description

Specifies the radix partition algorithm.

Variants

NC

Non-caching radix partition.

This is a standard, parallel radix partition algorithm.

LASWWC

Radix partitioning with look-ahead software write combining.

This algorithm reorders tuples in shared memory before writing them out to device memory. The purpose is to coalesce as many writes as possible, which can lead to higher throughput.

This algorithm was first described by Stehle and Jacobsen in “A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs”. It is also implemented by Sioulas et al. for “Hardware-conscious Hash-Joins on GPUs”, although the authors do not mention or cite it in the paper.

SSWWCv2

Radix partitioning with shared software write combining, version 2.

In version 1, a warp can block all other warps by holding locks on more than one buffer (i.e., leader candidates).

Version 2 tries to avoid blocking other warps by releasing all locks except one (i.e., the leader’s buffer lock).

SSWWCv2G

Radix partitioning with shared software write combining, version 2G.

Version 2G is exactly the same algorithm as version 2, but stores the SWWC buffers in device memory instead of in shared memory. The fanout can thus be scaled higher.

Ideally, the buffers are retained in the GPU L2 cache. Storing buffers in the L1 cache is infeasible for the targeted use-case of high fanouts, as the buffers are too large. To avoid polluting the L2 cache, reading input and writing output uses streaming load and store instructions.

HSSWWCv4

Radix partitioning with hierarchical shared software write combining, version 4.

Version 4 performs the buffer flush from dmem to memory asynchronously with double-buffering.

Double-buffering means that there are fanout + #warps dmem buffers. Thus each warp owns a spare buffer. When the dmem buffer of a partition is full, the warp that holds the lock exchanges the full dmem buffer for its empty spare buffer, and releases the lock. Only then does the warp flush the dmem buffer to CPU memory.

Double-buffering enables all warps to always make progress during the dmem flush, because there is always a (partially-) empty dmem buffer available.

Trait Implementations

Returns a copy of the value. Read more

Performs copy-assignment from source. Read more

Formats the value using the given formatter. Read more

Auto Trait Implementations

Blanket Implementations

Gets the TypeId of self. Read more

Immutably borrows from an owned value. Read more

Mutably borrows from an owned value. Read more

Returns the argument unchanged.

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

The alignment of pointer.

The type for initializers.

Initializes a with the given initializer. Read more

Dereferences the given pointer. Read more

Mutably dereferences the given pointer. Read more

Drops the object pointed to by the given pointer. Read more

The resulting type after obtaining ownership.

Creates owned data from borrowed data, usually by cloning. Read more

🔬 This is a nightly-only experimental API. (toowned_clone_into)

Uses borrowed data to replace owned data, usually by cloning. Read more

The type returned in the event of a conversion error.

Performs the conversion.

The type returned in the event of a conversion error.

Performs the conversion.