
Radix partition operators for CPU and GPU.

Overview

CPU and GPU partitioning operators are compatible. The devices can cooperate to partition a relation in parallel. Note that this only holds for equivalent operator algorithms, i.e., chunked_radix_partition_swwc cannot be combined with chunked_radix_partition.

One radix partitioning algorithm is provided: chunked radix partitioning with software write-combine buffering (SWWC). The algorithm is described by Schuh et al. in Section 6 of “An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory”.
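The core idea of SWWC buffering can be illustrated with a minimal sketch. The buffer size, key-only tuples, and function name below are illustrative assumptions; the real operator flushes whole cache lines with non-temporal stores:

```rust
// Sketch of SWWC buffering (assumed sizes): tuples are first collected in
// small cache-resident buffers, one per partition; a full buffer is flushed
// to the output in one step, modeling a cache-line write-back.
const SLOTS: usize = 8; // assumed: tuples per cache-line-sized buffer

fn swwc_partition(keys: &[u64], fanout: usize, out: &mut [Vec<u64>]) {
    let mut buffers = vec![Vec::with_capacity(SLOTS); fanout];
    for &key in keys {
        let p = (key as usize) & (fanout - 1); // radix bits = log2(fanout)
        buffers[p].push(key);
        if buffers[p].len() == SLOTS {
            // Flush a full buffer; in the real operator this is a single
            // aligned, non-temporal cache-line store.
            out[p].extend_from_slice(&buffers[p]);
            buffers[p].clear();
        }
    }
    // Final flush of partially filled buffers.
    for (p, buf) in buffers.iter().enumerate() {
        out[p].extend_from_slice(buf);
    }
}
```

The point of the buffers is that random writes go to cache-resident memory, while the large output array is only touched with full-line sequential writes.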

Thread-safety

The radix partitioning operators are designed to be thread-safe. Although the input data can be shared between threads, threads should work on disjoint input partitions for correct results. In contrast, each thread must have exclusive ownership of its output and intermediate state buffers.
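This ownership discipline can be sketched with standard library threads. The function name and the key-only tuples are hypothetical; the point is that each thread reads a disjoint input chunk and writes into a buffer it exclusively owns:

```rust
use std::thread;

// Hypothetical sketch: each thread partitions a disjoint input chunk
// into its own, exclusively owned set of output buckets.
fn parallel_partition(data: &[u64], threads: usize, fanout: usize) -> Vec<Vec<Vec<u64>>> {
    let chunk_len = (data.len() + threads - 1) / threads; // assumes non-empty input
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len) // disjoint input partitions, shared read-only
            .map(|chunk| {
                s.spawn(move || {
                    // Exclusive per-thread output: one bucket list per partition.
                    let mut buckets = vec![Vec::new(); fanout];
                    for &key in chunk {
                        let p = (key as usize) & (fanout - 1);
                        buckets[p].push(key);
                    }
                    buckets
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

Because no output buffer is shared, no synchronization is needed inside the partitioning loop.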

Padding

Note that partitions are padded to, at minimum, the cache-line size. This is necessary for SWWC buffering, because cache lines are written back to memory as a whole. However, partition offsets are not naturally aligned, because partitions can have any size. Therefore, each partition is padded in front by, at minimum, the length of one cache line. The cache-line alignment is also necessary for non-temporal SIMD writes, which must be aligned to their SIMD vector length.
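A small sketch shows how such padded offsets could be derived from a histogram. The cache-line and tuple sizes below are assumptions for illustration, not the crate's actual constants:

```rust
// Assumed constants for illustration.
const CACHE_LINE_BYTES: usize = 128; // e.g. POWER9; 64 on x86_64
const TUPLE_BYTES: usize = 16;       // assumed 8-byte key + 8-byte payload
const PAD_TUPLES: usize = CACHE_LINE_BYTES / TUPLE_BYTES;

// Compute write offsets (in tuples) such that every partition starts on a
// cache-line boundary and is preceded by at least one cache line of padding,
// so that SWWC flushes of one partition cannot clobber its neighbor.
fn padded_offsets(histogram: &[usize]) -> Vec<usize> {
    let mut offsets = Vec::with_capacity(histogram.len());
    let mut next = 0usize;
    for &len in histogram {
        // Round up to a cache-line boundary, then skip one line of padding.
        next = (next + PAD_TUPLES - 1) / PAD_TUPLES * PAD_TUPLES + PAD_TUPLES;
        offsets.push(next);
        next += len;
    }
    offsets
}
```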

Combining CPU histogram with a GPU partitioner

CPU and GPU histograms and radix partitioners can be mixed and matched. For example, CpuHistogramAlgorithm::Chunked computes the exact same result as GpuHistogramAlgorithm::Chunked.

Combining a CPU histogram with a GPU partitioner saves us from transferring the keys twice over the interconnect: once for the histogram and once for partitioning. Computing a histogram is light-weight and should run close to memory bandwidth. See Polychroniou et al., “A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort”.

The reasoning is that the required histograms are cheap to compute. The target size is at most 2^12 buckets, which amounts to only 32 KiB and should fit into the CPU’s L1 cache. This upper bound on the number of buckets is given by the GPU hardware and the partitioning algorithm.
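The size bound is easy to verify, assuming 64-bit bucket counters (which is what makes 2^12 buckets come out to 32 KiB):

```rust
// Histogram size in bytes for a given radix-bit count,
// assuming one 64-bit counter per bucket.
fn histogram_bytes(radix_bits: u32) -> usize {
    (1usize << radix_bits) * std::mem::size_of::<u64>()
}
```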

Optimizations and tuning

Hardware prefetching

On most CPU architectures, the behavior of the hardware prefetcher can be tuned using model-specific registers (MSRs). On POWER9, disabling N-stride prefetching and setting an aggressive prefetch depth increases the throughput of the prefix sum and partitioning.

Loop unrolling

Manual loop unrolling is a well-known optimization. For the POWER9 architecture, loading 128-byte cache lines with a sequence of SIMD instructions improves throughput. However, as the VSX instruction set doesn’t support gather-scatter memory operations, this is only a load optimization and not a “pure” SIMD implementation.

SIMD vectorization

Keys are hashed using VSX instructions on POWER9.

Data hazard avoidance

Out-of-order execution stalls if there is a read-after-write hazard in the CPU pipeline. This occurs in the prefix sum due to the memory dependency on the previous value of a histogram bucket. As only small histograms are affected, the histogram is replicated within the L1 cache for low fanouts, while a single copy per thread is used for high fanouts.
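The replication trick can be sketched as follows. The replication factor and function name are assumptions; the idea is that consecutive keys hitting the same bucket update different copies, so their increments are independent in the pipeline:

```rust
// Sketch: replicate the histogram REPLICAS ways to break the
// read-after-write dependency chain on hot buckets.
const REPLICAS: usize = 4; // assumed replication factor

fn histogram_replicated(keys: &[u64], fanout: usize) -> Vec<usize> {
    let mut copies = vec![0usize; REPLICAS * fanout];
    for (i, &key) in keys.iter().enumerate() {
        let bucket = (key as usize) & (fanout - 1);
        // Rotate through the replicas; back-to-back updates to the same
        // bucket now (usually) touch different counters.
        copies[(i % REPLICAS) * fanout + bucket] += 1;
    }
    // Reduce the replicas into the final histogram.
    let mut hist = vec![0usize; fanout];
    for r in 0..REPLICAS {
        for b in 0..fanout {
            hist[b] += copies[r * fanout + b];
        }
    }
    hist
}
```

The trade-off is the extra L1 footprint (REPLICAS copies of the histogram), which is why replication only pays off for low fanouts.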

The C/C++ CPU code is based on code kindly published by Cagri Balkesen and Claude Barthels. As such, we adhere to their copyright and license (MIT) in derived code. Modifications are licensed under Apache License 2.0.

Summary of changes

We have made several changes to the original code. These include:

  • Fix race condition in SWWC partitioning variant. Each thread writes its result into a separate chunk with padding between chunks and (within each chunk) partitions. The race condition occurs because the flushes are aligned to a cacheline. Thus, the first flush of a thread might overwrite the last flush of its neighboring thread. A (potential) drawback is that resulting partitions are non-contiguous.

  • Support different tuple types at run-time instead of at compile-time. Implemented by instantiating C++ templates with multiple types instead of C preprocessor.

  • Split prefix sum and partitioning into separate functions. This facilitates CPU microarchitecture optimization and allows recombination with equivalent GPU kernels.

  • Add SIMD prefix sum variant for IBM POWER9.

  • Add SIMD partitioning variant for IBM POWER9.

  • Avoid data hazards in the prefix sum.

  • Tune the hardware prefetcher for IBM POWER9.

  • Add SWWC flush variants for POWERPC64 VSX and x86_64 AVX-512.

Structs

A CPU radix partitioner that provides partitioning functions.

Enums

Specifies the histogram algorithm that computes the partition offsets.

Specifies the radix partition algorithm.

Traits

Specifies that the implementing type can be used as partitioning key in CpuRadixPartitioner.