Analysing complex scientific data often relies on unstructured meshes, which pose significant challenges for visualisation and processing due to their irregular structure and complex connectivity. Guoxi Liu, Thomas Randall, Rong Ge, and Federico Iuricich present a new approach to overcoming these limitations by harnessing the power of modern heterogeneous computer systems. Their research introduces GALE, a novel data structure that intelligently distributes the workload between the central processing unit and the graphics processing unit, allowing each to focus on its strengths. This technique, the first open-source CUDA-based solution of its kind, substantially accelerates data analysis, achieving speedups of up to 2.7 times over existing methods, and represents a significant step forward in efficiently visualising and interpreting complex scientific datasets.

GPU Parameter Optimisation for Mesh Processing

This research details a parameter study designed to tune three aspects of the system's execution: the number of GPU threads assigned to each mesh segment, the GPU block configuration (number of blocks and threads per block), and the behaviour of multiple CPU consumers, analysed through their waiting time distributions. Datasets such as Fish and Stent were used to evaluate performance, testing algorithms including Critical Points and Discrete Gradient, with performance measured in terms of execution time, memory usage, and workload balance. Understanding these parameters is crucial as modern mesh datasets often contain millions or even billions of elements, demanding highly parallel processing capabilities.
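To make the search space concrete, the sketch below enumerates the swept parameters and recorded metrics as plain C++ structs. All names here are illustrative assumptions, not the paper's actual benchmarking harness.

```cpp
// Hypothetical sketch of the parameter space swept in the study.
struct SweepConfig {
    int threadsPerSegment;   // 1 upwards (32, one warp, performed best)
    int threadsPerBlock;     // 32 .. 1024 (512 performed best)
    int numBlocks;           // 1 .. 16
    int numConsumers;        // 8 .. 40 CPU consumer threads
};

// Metrics recorded per run, per the study's evaluation criteria.
struct RunMetrics {
    double executionTimeMs;   // end-to-end processing time
    size_t memoryUsageBytes;  // device memory, e.g. for relation arrays
    double waitTimeVariance;  // spread of consumer waiting times
};
```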

The study first examined the number of GPU threads allocated to compute the relations of a single segment, varying this value upwards from a single thread. Results demonstrate that increasing the thread count improves performance up to a point, with 32 threads achieving approximately a threefold speedup over single-threaded execution. This improvement stems from the inherent parallelism of the GPU architecture, which allows multiple threads to execute concurrently. Beyond 32 threads, however, the benefits diminish due to increased communication and synchronisation overhead; because the GPU schedules threads in warps of 32, keeping each segment's work within a single warp is advantageous. A warp is the smallest execution unit on a GPU, and its threads execute in lockstep, so divergence within a warp can significantly reduce performance. The VT relation, computed directly from the tetrahedron list, benefited less from additional threads because it requires atomic operations on vertices. Atomic operations ensure data consistency but introduce serialisation bottlenecks, limiting the potential parallel speedup. This highlights the importance of selecting algorithms that fit the GPU's parallel processing model.
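The serialisation effect of atomics can be seen in how a vertex-tetrahedron (VT) relation is typically built from a tetrahedron list on the GPU. The following CUDA sketch uses a common two-pass counting scheme; the kernel and array names (countVT, fillVT, vtCount, vtOffset, vtCursor, vtList) are illustrative assumptions rather than GALE's actual API.

```cuda
// Pass 1: count how many tetrahedra touch each vertex.
__global__ void countVT(const int4* tets, int numTets, int* vtCount) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTets) return;
    int4 tet = tets[t];
    // Each vertex gains one incident tetrahedron; the atomics serialise
    // updates to shared vertices, which limits parallel speedup.
    atomicAdd(&vtCount[tet.x], 1);
    atomicAdd(&vtCount[tet.y], 1);
    atomicAdd(&vtCount[tet.z], 1);
    atomicAdd(&vtCount[tet.w], 1);
}

// Pass 2: scatter tetrahedron indices into each vertex's adjacency list.
__global__ void fillVT(const int4* tets, int numTets,
                       const int* vtOffset, int* vtCursor, int* vtList) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numTets) return;
    int4 tet = tets[t];
    int v[4] = {tet.x, tet.y, tet.z, tet.w};
    for (int i = 0; i < 4; ++i) {
        // Atomically reserve a slot in vertex v[i]'s list.
        int slot = vtOffset[v[i]] + atomicAdd(&vtCursor[v[i]], 1);
        vtList[slot] = t;
    }
}
```

In practice, vtOffset would be an exclusive prefix sum over vtCount, and vtCursor would be zero-initialised before the fill pass; the atomics in both passes are the serialisation point the study observed.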

Next, the study varied the number of threads per block from 32 to 1024, while keeping the number of threads working on each segment fixed at 32. Findings indicate that using more threads per block generally improves performance because each block precomputes more segments concurrently, with 512 threads per block performing best and delivering a speedup of more than four times the baseline. Larger blocks also allow more efficient use of shared memory, a fast on-chip memory accessible to all threads within a block; shared memory reduces accesses to slower global memory, significantly improving performance. However, exceeding this value can saturate memory bandwidth and increase synchronisation overhead. Memory bandwidth is the rate at which data can be transferred between the GPU's processors and its memory; exceeding this limit creates a bottleneck. The increase in memory usage due to the relation arrays was found to be relatively small, suggesting that memory capacity is not a primary constraint in this scenario. The study then investigated the impact of varying the number of launched blocks from one to sixteen, while keeping the number of threads per block constant at 512, to determine the optimal level of parallelism for the overall computation.
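The sketch below illustrates this configuration under the assumption of one warp (32 threads) per segment, so that a 512-thread block works on 16 segments at once and stages per-segment data in shared memory; processSegments and its scratch layout are hypothetical, not taken from GALE.

```cuda
constexpr int THREADS_PER_SEGMENT = 32;   // one warp per segment
constexpr int THREADS_PER_BLOCK   = 512;  // best-performing block size in the study
constexpr int SEGMENTS_PER_BLOCK  = THREADS_PER_BLOCK / THREADS_PER_SEGMENT; // 16

__global__ void processSegments(const int* segmentIds, int numSegments) {
    // Per-block scratch in fast on-chip shared memory: one slice per segment,
    // reducing round-trips to slower global memory.
    __shared__ int scratch[SEGMENTS_PER_BLOCK][THREADS_PER_SEGMENT];

    int lane    = threadIdx.x % THREADS_PER_SEGMENT; // position within the warp
    int segSlot = threadIdx.x / THREADS_PER_SEGMENT; // which warp/segment in this block
    int seg     = blockIdx.x * SEGMENTS_PER_BLOCK + segSlot;
    if (seg >= numSegments) return;

    scratch[segSlot][lane] = segmentIds[seg]; // ...cooperative per-segment work here
    __syncwarp(); // warp-level sync is cheaper than a full-block __syncthreads()
}

// Host-side launch; the study varied the block count from 1 to 16:
// processSegments<<<numBlocks, THREADS_PER_BLOCK>>>(d_segmentIds, numSegments);
```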

Initial results show that execution time decreases as the number of blocks increases, since launching more blocks precomputes more segments concurrently and enables the GPU to process more data simultaneously. Beyond a certain point, however, execution time rises again due to kernel launch overhead and limits on how many kernels can run concurrently. Kernel launch overhead is the fixed cost of dispatching each kernel to the GPU before it can begin execution, and it becomes significant when too many kernels are launched. Memory usage also grows with the number of blocks, though the effect was not significant on larger datasets, suggesting that each block's memory footprint is relatively small and the GPU has sufficient capacity for a moderate number of blocks. Finally, the research analysed the waiting time distributions of multiple consumers running the Critical Points and Discrete Gradient algorithms, with the number of consumer threads varying from eight to forty. The limited variance observed across consumer threads indicates effective workload balancing, demonstrating that work is distributed evenly among the consumers. Effective workload balancing is crucial for maximising GPU utilisation and minimising execution time.
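A waiting time distribution of this kind could be collected with a simple instrumented consumer loop, as in the hedged C++ sketch below; the queue, sentinel, and runCriticalPoints placeholder are illustrative assumptions about the producer-consumer setup, not GALE's actual scheduler.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

std::mutex mtx;
std::condition_variable cv;
std::queue<int> readySegments;    // filled by the GPU-side producer
std::vector<double> waitTimesMs;  // one sample per dequeue, across consumers

void consumer() {
    for (;;) {
        auto start = std::chrono::steady_clock::now();
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return !readySegments.empty(); });
        int seg = readySegments.front();
        readySegments.pop();
        // Record how long this consumer waited for work to arrive.
        auto waited = std::chrono::steady_clock::now() - start;
        waitTimesMs.push_back(
            std::chrono::duration<double, std::milli>(waited).count());
        lock.unlock();
        if (seg < 0) break;         // sentinel (one per consumer): no more work
        // runCriticalPoints(seg);  // placeholder: consume the segment's relations
    }
}
```

Low variance in waitTimesMs across consumers is exactly the balanced-workload signature the study reports.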

Based on the study, 32 threads per segment and 512 threads per block appear to be optimal. The best number of blocks depends on the specific dataset and algorithm, but the research suggests balancing the benefit of precomputing more segments against kernel launch overhead. The study was conducted on a GPU with a maximum block size of 1024, and the warp size of 32 is a key factor in thread allocation. Datasets included Fish, Stent, and others, representing a range of mesh complexities and sizes, and performance was measured using execution time, memory usage, and waiting time distributions, providing a comprehensive evaluation. Overall, this research offers valuable insights into optimising GPU-based mesh processing: by carefully tuning thread allocation and block configuration, it is possible to significantly improve performance and achieve efficient workload balancing, providing a solid foundation for further optimisation efforts. Future work could explore more advanced scheduling algorithms and data structures to further improve performance and scalability, particularly for extremely large datasets.
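As one possible direction for such tuning, the block count could be derived from device occupancy rather than fixed in advance. The sketch below uses the standard CUDA occupancy API; pickNumBlocks is a hypothetical helper, and the cap of 16 reflects the study's observed diminishing returns rather than a general rule.

```cpp
#include <cuda_runtime.h>

// Pick a block count from measured occupancy, capped at the study's sweet spot.
int pickNumBlocks(const void* kernel, int threadsPerBlock) {
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // How many blocks of this kernel fit per SM at the chosen block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, kernel, threadsPerBlock, /*dynamicSMemSize=*/0);
    int blocks = numSMs * blocksPerSM;
    return blocks > 16 ? 16 : blocks;  // cap per the study's 1..16 sweep
}

// Usage, e.g. with the hypothetical kernel sketched earlier:
// int numBlocks = pickNumBlocks((const void*)processSegments, 512);
```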