Keywords:
Benchmark testing
Pattern matching
Heating systems
Clustering algorithms
Software
Topology
Surges
Graphs and networks
neighborhood communication
MPI
network communication
network contention
distributed memories
benchmarking
Abstract:
Distributed-memory graph algorithms are fundamental enablers in scientific computing and analytics workflows. A majority of graph algorithms rely on the graph neighborhood communication pattern, i.e., repeated asynchronous communication between a vertex and its neighbors in the graph. The pattern is adversarial for communication software and hardware due to high message injection rates and input-dependent, many-to-one traffic with variable destinations and volumes. We present benchmarks and performance analysis of graph neighborhood communication on modern large-scale network interconnects from four supercomputers: ALCF Theta, NERSC Cori, OLCF Summit, and R-CCS Fugaku. Our benchmarks characterize communication from the perspectives of latency and throughput. Benchmark parameters make it possible to mimic the behaviors of complex applications on real-world or synthetic graphs by varying work distribution, remote edges, message volume, and per-vertex work. We find that minor changes in the input graph can substantially increase latencies, and that contention can develop in memory caches and network stacks before it develops in the network itself. Further, latencies and contention vary significantly across graph neighborhoods, motivating a more detailed exploration of asynchronous algorithms. When per-vertex work is added, load imbalance on real-world graphs can be pronounced: 99th-percentile latencies were 8-128x the corresponding average latencies. Our results help analysts and developers understand the performance implications of this important pattern, especially for impending exascale platforms.
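To make the input-dependent, many-to-one character of the neighborhood pattern concrete, the following is a minimal sketch and not the benchmark described above. It assumes a simple block distribution of vertices over ranks and counts one message per remote edge in each direction. The helper names (`owner`, `neighborhood_traffic`, `incoming_per_rank`) and the star-graph input are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def owner(vertex, num_vertices, num_ranks):
    """Rank owning `vertex` under an assumed block distribution."""
    block = -(-num_vertices // num_ranks)  # ceiling division
    return vertex // block


def neighborhood_traffic(edges, num_vertices, num_ranks):
    """Count messages per (source rank, destination rank) pair.

    Each edge whose endpoints live on different ranks contributes one
    message in each direction; local edges generate no traffic.
    """
    traffic = defaultdict(int)
    for u, v in edges:
        ru = owner(u, num_vertices, num_ranks)
        rv = owner(v, num_vertices, num_ranks)
        if ru != rv:
            traffic[(ru, rv)] += 1
            traffic[(rv, ru)] += 1
    return dict(traffic)


def incoming_per_rank(traffic, num_ranks):
    """Total number of messages arriving at each rank."""
    incoming = [0] * num_ranks
    for (_, dst), count in traffic.items():
        incoming[dst] += count
    return incoming


# Hypothetical input: a star graph with hub vertex 0, 8 vertices, 4 ranks.
edges = [(0, v) for v in range(1, 8)]
traffic = neighborhood_traffic(edges, num_vertices=8, num_ranks=4)
incoming = incoming_per_rank(traffic, num_ranks=4)
print(incoming)  # the rank owning the hub receives most of the messages
```

In this toy input, the rank that owns the hub vertex receives three times as many messages as any other rank, which illustrates how a single high-degree vertex can concentrate traffic on one destination.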