1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142
// Copyright 2019-2022 Clemens Lutz
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
//! # The SQL Operator Library
//!
//! `sql-ops` is a collection of SQL operators and building blocks for CPUs and
//! GPUs. Currently it includes the operators:
//!
//! - Hash join (no-partitioning and radix-partitioned)
//! - Radix partition
//! - Prefix scan (exclusive)
//!
//! # Tuning parameters
//!
//! Several tuning parameters are defined as constant values. These affect
//! the performance and should be adjusted if necessary.
//!
//! The tuning parameters are set in the `build.rs` file, which exports them to Rust, C++, and
//! CUDA.
//!
//! ## CPU cacheline size
//!
//! `CPU_CACHE_LINE_SIZE` defines the number of bytes used for padding to prevent false sharing,
//! and SWWC radix partitioning buffers. The size is specific to the CPU architecture and set to
//! different values depending on the ISA:
//!
//! - aarch64: 64 bytes
//! - x86_64: 64 bytes
//! - powerpc64: 128 bytes
//!
//! ## GPU cacheline size
//!
//! `GPU_CACHE_LINE_SIZE` serves the same purpose as the CPU cacheline size, but is used in GPU
//! code paths. The size is set to 128 bytes, which is the size used by many Nvidia GPUs (e.g.,
//! Pascal, Volta, Ampere).
//!
//! ## Align bytes
//!
//! `ALIGN_BYTES` defines the alignment of partitions in bytes. This parameter is intended to
//! prevent cache conflict misses. It should be set to a multiple of the cacheline size.
//!
//! Furthermore, cacheline alignment is necessary for:
//!
//! - non-temporal store instructions
//! - vector load and store instructions
//! - perfectly aligned coalesced loads and stores on GPUs
//!
//! ## Padding bytes
//!
//! `PADDING_BYTES` defines the padding size between partitions. Padding is necessary for
//! partitioning algorithms to align writes. Aligned writes have fixed length and may overwrite the
//! padding space in front of their partition. For this reason, also the first partition includes
//! padding in front.
//!
//! If no padding is used, aligned writes incur a race condition between threads. Given two
//! partitions, a thread writing to the end of the first partition must write after a different
//! thread writing to the beginning of the second partition, because the written locations may
//! overlap due to aligning the second thread to `PADDING_BYTES`.
//!
//! ## Number of banks
//!
//! `LOG2_NUM_BANKS` defines the number of shared memory banks on GPUs. This parameter is used to
//! avoid bank conflicts.
//!
//! ## LA-SWWC tuples per thread
//!
//! `LASWWC_TUPLES_PER_THREAD` defines the number of tuples processed at a time per thread. More
//! tuples require more shared memory and more registers. Thus, the parameter should be tuned for
//! each GPU architecture.
//!
//! The Stehle and Jacobsen set the value to `3` for a Tesla P100 GPU in their work: [*A Memory
//! Bandwidth-Efficient Hybrid Radix Sort on GPUs*](http://doi.acm.org/10.1145/3035918.3064043). We
//! set the value to `5` for a Tesla V100 GPU.
//!
//! ## Bucket chaining entries
//!
//! `RADIX_JOIN_BUCKET_CHAINING_ENTRIES` defines the number of hash table entries used by the
//! bucket chaining scheme of the radix join.
//!
//! The value must be set to a power of two, and at least 1. No further constraints.
//!
//! # Library initialization
//!
//! GPU operators are compiled as a [CUDA `fatbinary` module][fatbin]. The
//! module must be loaded into the current context before using the
//! [`cuModuleLoad` driver function][cuModuleLoad] before the operator can start
//! executing. Module loading can take up to several hundred milliseconds.
//!
//! To avoid load the module each time an operator is executed, the `sql-ops`
//! library globally loads the module exactly once. The load is lazy and is
//! performed when a GPU operator is executed for the first time. Thus, later
//! executions of any GPU operator use the already-loaded module.
//!
//! **Important:** The CUDA context must be initialized before calling the
//! a GPU operator. *Destroying this context will also destroy the module!*
//!
//! This is usually not a problem in applications that initialize the context
//! once at the start of the program. However, in unit tests, a common pattern
//! is to initialize a context for each test case. Instead, tests should create
//! a singleton instance of the context that is only initialized once. See
//! `sql-ops/tests/test_gpu_radix_partition.rs` as an example.
//!
//! [fatbin]: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#fatbinaries
//! [cuModuleLoad]: https://docs.nvidia.com/cuda/archive/10.2/cuda-driver-api/group__CUDA__MODULE.html#group__CUDA__MODULE_1g366093bd269dafd0af21f1c7d18115d3
pub mod error;
pub mod join;
pub mod partition;
pub mod prefix_scan;
use once_cell::sync::Lazy;
use rustacuda::module::Module;
use std::ffi::CString;
#[allow(dead_code)]
pub(crate) mod constants {
include!(concat!(env!("OUT_DIR"), "/constants.rs"));
}
// Export cache line constants
pub use constants::CACHE_LINE_SIZE as CPU_CACHE_LINE_SIZE;
pub use constants::GPU_CACHE_LINE_SIZE;
static mut MODULE_OWNER: Option<Module> = None;
static MODULE: Lazy<&'static Module> = Lazy::new(|| {
let module_path = CString::new(env!("CUDAUTILS_PATH"))
.expect("Failed to load CUDA module, check your CUDAUTILS_PATH");
let module = Module::load_from_file(&module_path).expect("Failed to load CUDA module");
unsafe { MODULE_OWNER.get_or_insert(module) }
});