A Lightning-Fast OpenMP Implementation

BOLT is a recursive acronym that stands for “BOLT is OpenMP over Lightweight Threads”.

BOLT aims to be a high-performance OpenMP implementation, specialized in particular for fine-grained parallelism. Unlike other OpenMP implementations, BOLT builds on a lightweight threading model as its underlying threading mechanism. It currently adopts Argobots, a holistic, low-level threading and tasking runtime, to overcome the shortcomings of conventional OS-level threads. The current BOLT implementation is based on the OpenMP runtime in LLVM, so it can be used with LLVM/Clang, the Intel OpenMP compiler, and GCC.
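
Since BOLT is based on the LLVM OpenMP runtime, an ordinary OpenMP program should work unchanged. The sketch below is such a program; the compile-and-run commands in its header comment are one plausible workflow, with the install prefix /opt/bolt assumed purely for illustration.

```c
/* hello_omp.c -- a plain OpenMP program; nothing BOLT-specific is needed
 * in the source, since BOLT follows the LLVM OpenMP runtime interface.
 *
 * One plausible workflow (paths are assumptions for illustration):
 *   clang -fopenmp hello_omp.c -o hello_omp
 *   LD_LIBRARY_PATH=/opt/bolt/lib ./hello_omp   # pick up BOLT's runtime
 */
#include <omp.h>
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
```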

Motivation
OpenMP is a directive-based parallel programming model for shared-memory computers. Thanks to its simple, incremental parallelization approach, OpenMP is widely used in many applications. While current OpenMP implementations based on OS-level threads (e.g., Pthreads) perform well on compute-bound codes that can be divided evenly among threads, they face several challenges arising from recent HPC trends:

  1. OpenMP applications need to express more parallelism (e.g., nested parallelism) to fully utilize the growing number of CPU cores (see the sketch after this list).
  2. Irregular or nontraditional applications use OpenMP task constructs to express fine-grained parallelism rather than traditional work-sharing constructs.
  3. Hybrid programming that mixes OpenMP and MPI requires better interoperability between the two programming models, which is usually achieved through a common threading model.
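
As a concrete illustration of the first challenge, the following minimal sketch (with arbitrary loop bounds and a stand-in kernel) nests one parallel loop inside another; the comments describe the behavior one would expect under the two threading models.

```c
#include <omp.h>

/* Stand-in for a fine-grained per-element kernel. */
static void work(int i, int j) { (void)i; (void)j; }

int main(void) {
    omp_set_max_active_levels(2);   /* enable a second level of parallelism */
    #pragma omp parallel for
    for (int i = 0; i < 64; i++) {
        /* Each outer thread that reaches this inner region creates a team
         * of 8. A Pthreads-based runtime thus runs roughly (number of
         * cores) x 8 OS-level threads, oversubscribing the machine; a
         * ULT-based runtime instead creates cheap work units scheduled
         * over a fixed set of execution streams. */
        #pragma omp parallel for num_threads(8)
        for (int j = 0; j < 8; j++)
            work(i, j);
    }
    return 0;
}
```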

These challenges are difficult or inefficient to handle in current OpenMP implementations because of their underlying heavyweight threading model.

Approaches
BOLT implements OpenMP on top of Argobots to better address the above challenges and to achieve higher performance than existing solutions. BOLT’s approaches are:

  1. BOLT creates work units (i.e., user-level threads (ULTs) or tasklets) instead of OS-level threads to implement parallel constructs at any nesting level.
    Argobots exposes an N:M mapping between work units and execution streams (ESs, i.e., OS-level threads), and BOLT exploits this mapping while keeping the number of ESs within the number of cores or hardware threads in the system. Creating many ULTs and tasklets adds little overhead. By default, BOLT generates a ULT for each parallel construct, but if the compiler or the user guarantees that a parallel region contains only computation code without any blocking calls, BOLT uses tasklets instead to further reduce the overhead of work-unit management and scheduling.
  2. BOLT handles OpenMP task constructs in a similar way to other parallel constructs.
    BOLT creates ULTs only when the context of an OpenMP task needs to be saved; otherwise, OpenMP tasks are mapped to tasklets. This approach efficiently handles blocking calls or the taskyield pragma inside OpenMP task code (see the first sketch after this list).
  3. BOLT interoperates with an Argobots-aware MPI implementation through ULTs.
    If a parallel region includes blocking calls (e.g., MPI communication calls), ULTs can deliver better performance because one blocking call in an iteration does not block the entire ES (i.e., core). A ULT that invokes a blocking call can be context-switched to another ULT by the Argobots-aware MPI runtime. This approach reduces wasted core time by overlapping communication with computation, and thus improves the interoperability of hybrid programming with OpenMP and MPI (see the second sketch below).
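
The two hedged sketches below illustrate approaches 2 and 3. They are ordinary OpenMP (and MPI) code rather than anything BOLT-specific, since BOLT's behavior lives in the runtime; the helper functions poll_ready and process, the loop bounds, and the message shapes are assumptions made for illustration. First, a task that blocks until its input is available:

```c
#include <omp.h>

/* poll_ready and process are hypothetical stand-ins for application code. */
static int  poll_ready(int i) { return i >= 0; }   /* pretend input is ready */
static void process(int i)    { (void)i; }         /* consume input i */

void run_tasks(int n) {
    #pragma omp parallel
    #pragma omp single
    for (int i = 0; i < n; i++) {
        #pragma omp task firstprivate(i)
        {
            /* Under BOLT, a task whose context must be saved (as here,
             * across taskyield) runs as a ULT; otherwise it can run as a
             * cheaper tasklet. */
            while (!poll_ready(i)) {
                #pragma omp taskyield   /* let another work unit use the core */
            }
            process(i);
        }
    }
}
```

And for approach 3, assuming an Argobots-aware MPI library is linked in:

```c
#include <mpi.h>

/* Each iteration performs a blocking receive from a peer rank. With an
 * Argobots-aware MPI, a ULT blocked in MPI_Recv is context-switched out,
 * so other iterations' ULTs keep the core busy and communication overlaps
 * computation. The peer rank and per-message payload are illustrative. */
void exchange(double *buf, int n, int peer) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        MPI_Recv(&buf[i], 1, MPI_DOUBLE, peer, /* tag = */ i,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        buf[i] *= 2.0;   /* compute on the received value */
    }
}
```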