Generic Topology Mapping Strategies for Large-scale Parallel Architectures
Torsten Hoefler, University of Illinois at Urbana-Champaign, Urbana, IL, USA, htor@illinois.edu
Marc Snir, University of Illinois at Urbana-Champaign, Urbana, IL, USA, snir@illinois.edu
Abstract
The steadily increasing number of nodes in high-performance computing systems and the technology and power constraints lead to sparse network topologies. Efficient mapping of application communication patterns to the network topology gains importance as systems grow to petascale and beyond. Such mapping is supported in parallel programming frameworks such as MPI, but is often not well implemented. We show that the topology mapping problem is NP-complete and analyze and compare different practical topology mapping heuristics. We demonstrate an efficient and fast new heuristic which is based on graph similarity and show its utility with application communication patterns on real topologies. Our mapping strategies support heterogeneous networks and show significant reduction of congestion on torus, fat-tree, and the PERCS network topologies, for irregular communication patterns. We also demonstrate that the benefit of topology mapping grows with the network size and show how our algorithms can be used in a practical setting to optimize communication performance. Our efficient topology mapping strategies are shown to reduce network congestion by up to 80%, reduce average dilation by up to 50%, and improve benchmarked communication performance by 18%.
1. INTRODUCTION

The number of nodes in the largest computing systems, and, hence, the size of their interconnection networks, is increasing rapidly: the Jaguar system at ORNL has over 18,000 nodes, and larger systems are expected in the near future. These networks are built by interconnecting nodes (switches and processors) with links. Pin count, power, and gate count constraints restrict the number of links per switch; typical sizes are 24 (InfiniBand), 36 (Myrinet, InfiniBand), or 6 (SeaStar or BlueGene/P). Different topologies are used to construct large-scale networks from crossbars, e.g., k-ary n-cubes (hypercubes, tori), k-ary n-trees (fat-trees), or folded Clos networks. Networks also differ in their routing protocols.
As the number of nodes grows larger, the diameter of the network (i.e., the maximum distance between two processors) increases; for many topologies, the bisection bandwidth (i.e., the minimum total bandwidth of links that need to be cut in order to divide the processors into two equal sets) decreases relative to the number of nodes.
This effect is well understood and it is generally accepted that dense communication patterns (such as an all-to-all communication, where each node communicates with every other node) are hard to scale beyond petascale systems. Luckily, the communication patterns of many applications are relatively sparse (each node communicates with a few others), and dense communications can be replaced by repeated sparse communications (e.g., the all-to-all communication used for the transpose in a parallel Fast Fourier Transform can be replaced by two phases of group transposes, each involving only $\Theta(\sqrt{P})$ processors [17]). Furthermore, the communication pattern often has significant locality, e.g., when most communication occurs between adjacent cells in a 3D domain. However, an inappropriate mapping of processes to the nodes of the interconnection network can map a logical communication pattern that is sparse and local into traffic that has no locality.
Finding an allocation of processes to nodes such that the sparse application communication topology efficiently utilizes the physical links in the network is called topology mapping. The problem has been much studied for regular communication graphs and regular interconnection network topologies. In practice, both graphs are likely to be irregular: the communication pattern may be data-dependent (e.g., for finite-element computations on irregular meshes), or it may consist of a superposition of multiple regular graphs (e.g., for computations that combine nearest-neighbor communications with global communication). The interconnection network may have a complex topology, with different links having different bandwidths (e.g., copper vs. optics), and with some links being disabled. The general problem has been much less studied.
Our previous argument suggests that mapping regular and irregular applications to the network topology is becoming more and more important at large scale. MPI offers support for topology mapping. A user can specify the (regular or irregular) communication topology of the application and request the library to provide a good mapping to the physical topology [16, 7]. An MPI implementation then re-numbers the processes in the communicator so as to improve the mapping. The scalability and usability of the topology interface was recently improved in MPI-2.2 [12] to allow a scalable specification and edge weights that represent communication characteristics. Finding a good mapping is nontrivial, and MPI implementations tend to use the trivial identity mapping.
Our work supports the optimization of arbitrary process topologies for arbitrary network topologies and thus an efficient implementation of the MPI process topology interface. This enables transparent and portable topology mapping for all network topologies. Our work also addresses heterogeneous networks such as PERCS, where different physical links may have different bandwidths.
Our implementation is intended for renumbering processes as suggested by MPI; however, the developed techniques and our open-source library can also be applied to other parallel programming frameworks such as UPC or CAF.
1.1 Related Work
The mapping of regular Cartesian structures to different target architectures is well understood. Yu, Chung, and Moreira present different topology mapping strategies of torus process topologies into the torus network of BlueGene/L [23]. Bhatelé, Kalé and Kumar discuss topology-aware load-balancing strategies for molecular dynamics CHARM++ applications [2]. Their analysis enables mapping from mesh and torus process topologies to other mesh and torus network topologies and provides performance gains of up to 10%.
Several researchers investigated techniques to optimize process mappings with arbitrary topologies on parallel computers. Bokhari [3] reduces the mapping problem to graph isomorphism. However, his strategy ignores edges that are not mapped. It was shown later that such edges can have a detrimental effect on the congestion and dilation of the mapping. Lee and Aggarwal [15] improve those results and define a more accurate model which includes all edges of the communication graph and propose a two-stage optimization function consisting of initial greedy assignment and later pairwise swaps. Bollinger and Midkiff [4] use a similar model and simulated annealing to optimize process mappings.
Träff proposes an implementation strategy for strictly hierarchical networks such as clusters of SMPs [22]. He defines different optimization criteria and shows the potential of MPI topology mapping for several artificial graphs.
2. TOPOLOGY MAPPING
2.1 Terms and Conventions
We use a notation that extends that used for graph embeddings [19]. The formulation is similar to the fluid flow approximation used to study Internet traffic [13]. We represent the (logical) communication pattern using a weighted, directed graph $\mathcal{G}=(V_{\mathcal{G}}, \omega_{\mathcal{G}})$. $V_{\mathcal{G}}$ is the set of processes; the weight $\omega_{\mathcal{G}}(uv)$ of the edge connecting $u \in V_{\mathcal{G}}$ to $v \in V_{\mathcal{G}}$ represents the volume of communication from process $u$ to process $v$; the weight is zero if no such communication occurs. The graph $\mathcal{G}$ might be disconnected, and isolated vertices can exist, representing the concurrent execution of multiple unrelated jobs.
Likewise, the (physical) interconnection network is represented by a weighted, directed graph $\mathcal{H}=(V_{\mathcal{H}}, C_{\mathcal{H}}, c_{\mathcal{H}}, \mathcal{R}_{\mathcal{H}})$. $V_{\mathcal{H}}$ is the set of physical nodes (processors and switches). If $u \in V_{\mathcal{H}}$ then $C_{\mathcal{H}}(u)$ is the number of processes that can be hosted at $u$ (this represents multicore processors); $C_{\mathcal{H}}(u)=0$ if $u$ contains no processors (e.g., is a switch). $c_{\mathcal{H}}(uv)$ is the capacity (bandwidth) of the link connecting $u$ to $v$ (zero if there is no such link).
The function $\mathcal{R}_{\mathcal{H}}$ represents the routing algorithm. Let $\mathcal{P}(uv)$ be the set of simple paths (paths where each edge occurs at most once) connecting node $u \in V_{\mathcal{H}}$ to node $v \in V_{\mathcal{H}}$. For each pair of nodes $uv$, $\mathcal{R}_{\mathcal{H}}(uv)$ is a probability distribution on $\mathcal{P}(uv)$. Thus, if $p \in \mathcal{P}(uv)$ then $\mathcal{R}_{\mathcal{H}}(uv)(p)$ is the fraction of traffic from $u$ to $v$ that is routed through path $p$. In practice, routing algorithms tend to use a small fraction of the possible paths (e.g., only shortest paths), and the traffic is often distributed evenly across all used paths.
The topology mapping is specified by a function $\Gamma: V_{\mathcal{G}} \rightarrow V_{\mathcal{H}}$ which maps the vertices of $\mathcal{G}$ (processes) to vertices in $V_{\mathcal{H}}$ (nodes) such that no more than $C(v)$ vertices in $\mathcal{G}$ are mapped to each vertex $v \in V_{\mathcal{H}}$. We use the terms mapping and embedding interchangeably.
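The capacity constraint on $\Gamma$ is easy to check directly. The sketch below (dictionary-based representations of $\Gamma$ and $C$ are illustrative assumptions, not part of the paper's library) verifies that no host node receives more processes than its capacity:

```python
def is_valid_mapping(gamma, capacity):
    """Check that a mapping Gamma : V_G -> V_H respects node capacities,
    i.e., no more than C(v) processes are placed on any host node v."""
    load = {}
    for node in gamma.values():
        load[node] = load.get(node, 0) + 1
    return all(load[v] <= capacity.get(v, 0) for v in load)
```

For example, mapping three processes onto a node of capacity two and a node of capacity one is valid, while squeezing two processes onto a capacity-one node is not.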
We now define two quality measures for a mapping: worst-case congestion (for short, congestion) and average dilation (for short, dilation). Let $|p|$ denote the length of path $p$. Then, the expected dilation of an edge $uv$ of the communication graph is defined as

$$\operatorname{Dilation}(uv)=\sum_{p \in \mathcal{P}(\Gamma(u) \Gamma(v))} \mathcal{R}_{\mathcal{H}}(\Gamma(u) \Gamma(v))(p) \cdot |p|$$

Dilation$(uv)$ is the average length of the path taken by a message sent from process $u$ to process $v$. The average dilation is computed by weighting each inter-process communication by its frequency:
$$\operatorname{Dilation}(\Gamma)=\sum_{u, v \in V_{\mathcal{G}}} \omega_{\mathcal{G}}(u v) \cdot \operatorname{Dilation}(u v)$$
Dilation is the average number of edges traversed by packets and is hence a measure of the total "communication work" performed by the interconnection network; it is indicative of the total energy consumption of the interconnection network.
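Under shortest-path routing (the restricted setting of Section 2.5), Dilation$(uv)$ reduces to the distance between $\Gamma(u)$ and $\Gamma(v)$, and the weighted sum above can be sketched as follows. The adjacency-dictionary host-graph representation and unit link lengths are simplifying assumptions for illustration:

```python
from collections import deque

def shortest_dists(adj, src):
    """BFS distances from src in an unweighted host graph
    (adj: dict mapping each node to its list of neighbors)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def avg_dilation(comm, gamma, adj):
    """Sum of omega(uv) * d(Gamma(u), Gamma(v)) over all communicating pairs.
    comm: dict mapping (u, v) process pairs to traffic weight omega(uv);
    gamma: dict mapping processes to host nodes."""
    total = 0.0
    for (u, v), w in comm.items():
        d = shortest_dists(adj, gamma[u])
        total += w * d[gamma[v]]
    return total
```

On a four-node path 0-1-2-3 with two processes mapped to its endpoints and traffic weight 2, the dilation is 2 · 3 = 6.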
The congestion of a link $uv$ of the interconnection network is the ratio between the amount of traffic on that link and the capacity of the link. The total traffic crossing an edge $e \in E_{\mathcal{H}}$ is

$$\operatorname{Traffic}(e)=\sum_{u, v \in V_{\mathcal{G}}} \omega_{\mathcal{G}}(u v)\left(\sum_{p \in \mathcal{P}(\Gamma(u) \Gamma(v)),\, e \in p} \mathcal{R}_{\mathcal{H}}(\Gamma(u) \Gamma(v))(p)\right)$$
The congestion of edge $e$ is defined as

$$\operatorname{Congestion}(e)=\frac{\operatorname{Traffic}(e)}{c_{\mathcal{H}}(e)}$$

and the worst-case congestion of a mapping is

$$\operatorname{Congestion}(\Gamma)=\max_{e \in E_{\mathcal{H}}} \operatorname{Congestion}(e)$$

Congestion$(\Gamma)$ is a lower bound on the time needed for communication. Both congestion and dilation can be computed in polynomial time.
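A minimal sketch of the congestion computation, under the simplifying assumption that $\mathcal{R}_{\mathcal{H}}$ routes all traffic of a pair along a single BFS shortest path (in general the traffic is distributed over all shortest paths):

```python
from collections import deque

def one_shortest_path(adj, s, t):
    """Return one shortest path s -> t by BFS; a simplification of R_H."""
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                q.append(v)
    path, v = [], t
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]

def max_congestion(comm, gamma, adj, cap):
    """Worst-case congestion: max over used edges of Traffic(e) / c(e).
    comm: (u, v) process pairs -> weight; gamma: process -> host node;
    cap: directed host edge (a, b) -> capacity."""
    traffic = {}
    for (u, v), w in comm.items():
        p = one_shortest_path(adj, gamma[u], gamma[v])
        for e in zip(p, p[1:]):          # accumulate traffic per edge
            traffic[e] = traffic.get(e, 0.0) + w
    return max(t / cap[e] for e, t in traffic.items())
```

With two unit-weight flows sharing the last link of a three-node path of unit capacities, the most congested edge carries twice its capacity.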
The chosen representation embodies certain assumptions that are satisfied in many cases:
We assume that bandwidth between processes hosted at the same processor node is practically unbounded. To do so formally, we add to each processor node a self-loop with infinite capacity.
We assume that switches are not a performance bottleneck: the traffic flowing through a switch is constrained by the bandwidth of the incoming and outgoing links, but not by the internal switch structure. If this is not the case, the internal switch structure needs to be represented, too.
We assume oblivious routing: the distribution of traffic between two nodes does not depend on other ongoing traffic.
We assume that the routing algorithm is fixed, and does not depend on the embedded communication graph: the knowledge of the application communication pattern is used to map processes to processors, but is not used to change the routing algorithm. The formalism can be adjusted to handle routing protocols that are application dependent, in which case the routing function becomes part of the mapping.
The mapping problem is often expressed as a permutation of processes, following an initial assignment of processes to processors. The remapping is defined by a permutation $\pi$ on the set $0 \ldots P-1$ of processes. For example, in MPI, the user can specify a logical communication graph for a communicator, and request that the processes in the communicator be physically mapped so as to improve the performance of this communication pattern; the permutation $\pi$ is returned as a new rank order for the processes in the communicator if the reorder argument is set to true.
2.2 Practical Issues
The mapping framework presented in this paper assumes that a communication pattern is defined once for all processes running on the system, and the processes are mapped to processors once. In practice, it may be advantageous to periodically readjust the mapping; and remapping of the processes of a job may be restricted to the set of nodes allocated to the job. We discuss below how our framework can be extended to handle these concerns; a detailed performance analysis of these enhancements is beyond the scope of this paper.
The jobs running on the system can be periodically reconfigured, based on the observed communication pattern and the overall network topology. A remapping might be beneficial whenever the communication pattern of a job changes, or when jobs start or terminate. When we do so, we need to balance the overhead of remapping against the benefit of improved communication performance. We do not consider in this paper the problem of selecting an optimal remapping schedule, and focus only on the choice of an optimal mapping when (re)mapping is performed.
If each job is mapped independently, then the host graph for a job is taken to be the partition used for this job (we assume space partitioning): the processors allocated to the job and the switches that can be used to route between these nodes. If there is little interference between the traffic of different parallel jobs, then the capacity of each link in this host graph equals its physical capacity; if the interference is significant, then the traffic of other jobs can be represented as a reduction in the capacity of edges in the host graph seen by the job being mapped. A significant change in the background traffic may necessitate a remapping of processes to nodes.
The mapping problem can be defined as finding a mapping $\Gamma$ that minimizes some measure of the congestion or dilation. In this work, we focus on minimizing the maximum congestion (the algorithm runtime) and average dilation (the needed power to move the data). Our work is equally applicable to other optimization metrics.
We define the Topology Mapping Problem TMP as the problem of deciding if there exists a mapping $\Gamma$ (or permutation $\pi$) that has congestion less than or equal to $x$.
THEOREM 1. TMP is NP-complete.
Proof. The congestion of a mapping can be computed in polynomial time, using Equations (3), (4) and (5). It follows that TMP is in NP: the NP algorithm guesses a mapping $\Gamma$, computes its congestion, and returns TRUE if the congestion is at most $x$.
We now show a reduction from the NP-complete problem MINIMUM CUT INTO BOUNDED SETS [10, ND17] to conclude our proof. The (restricted) min-cut problem takes as input an undirected graph $G=\langle V, E\rangle$, two specified vertices $s, t \in V$, and an integer $L$; it decides whether there exists a partition of the vertices into two disjoint sets $V_1, V_2$ such that $s \in V_1$, $t \in V_2$, $|V_1|=|V_2|$, and the number of edges between $V_1$ and $V_2$ is no more than $L$.
Let $G=\langle V, E\rangle, s, t, L$ be an instance of the min-cut problem. Let $P=|V|$. We construct a "dumbbell" host graph $\mathcal{H}=(V_{\mathcal{H}}, E_{\mathcal{H}}, C, c)$ that consists of two fully connected graphs $L_1=L_2=K_{P/2}$ and a single bidirectional edge $\bar{e}$ between arbitrary vertices $u_1 \in L_1$ and $u_2 \in L_2$. We set $C(v)=1$ for all vertices and $c(e)=1$ for all edges, except edge $u_1u_2$: $c(u_1u_2)=c(u_2u_1)=P$. The construction is shown for $P=8$ in Figure 2. The routing function $\mathcal{R}_{\mathcal{H}}$ is defined to
Figure 2: Dumbbell graph
route all traffic between two nodes in $\mathcal{H}$ through the unique shortest path connecting them.
We define the communication graph $\mathcal{G}$ to be the graph $G$ with weight $\omega(e)=1$ for each edge $e \in E$; the edge $st$ is given weight $P^4$ if $st \notin E$, and weight $P^4+1$ if $st \in E$.
Figure 1: A simple example for topology-aware mappings.
Any mapping of $\mathcal{G}$ to $\mathcal{H}$ defines a partition of $V_{\mathcal{G}}$ into two equal-size sets $V_1=\Gamma^{-1}(L_1)$ and $V_2=\Gamma^{-1}(L_2)$. A mapping that minimizes congestion must map $\{s, t\}$ to $\{u_1, u_2\}$: this results in a congestion of $\leq P^3+P$, whereas any other mapping will result in a congestion of $\geq P^4$ (since traffic from $s$ to $t$ will flow through an edge of capacity 1). In such a mapping, the most congested edge will be the edge $u_1u_2$; its congestion will be
$$P^{3}+\frac{1}{P} \cdot\left|\left\{(v_{1} v_{2}) \in E \text{ s.t. } v_{1} \in V_{1} \text{ and } v_{2} \in V_{2}\right\}\right|$$
Thus, any solution to the TMP instance, with $x=P^3+L/P$, can be used to build a solution to the partition problem in polynomial time. It is easy to see that the converse is also true.
2.5 Restricted Mapping Problem
We shall focus, from now on, on the simpler problem where routing only uses shortest paths and all shortest paths are used with equal probability.
The dumbbell graph $\mathcal{H}$ used in the proof of Theorem 1 has a unique shortest path between any two nodes; the routing function routes all traffic between two nodes on that shortest path. Thus, the problem of finding an optimal mapping is still NP-hard if routing is restricted as above.
For the restricted routing problem, we do not need to specify the routing function explicitly; the host graph is defined as $(V_{\mathcal{H}}, C_{\mathcal{H}}, c_{\mathcal{H}})$, and the routing function is determined implicitly. The number of shortest paths between two nodes can be exponential in the number of vertices, so the concise representation of the host graph can be exponentially smaller than an explicit one. Therefore, it is not obvious that computing the congestion or dilation of a mapping takes polynomial time with this input representation. We show below that this is the case.
Determining the dilation of an edge $(u, v) \in \mathcal{G}$ is straightforward by computing the length of the shortest path from $\Gamma(u)$ to $\Gamma(v)$ in $\mathcal{H}$. This can be implemented with a single-source shortest path (SSSP) computation from each vertex in time $\mathcal{O}(|V_{\mathcal{G}}| \cdot (|E_{\mathcal{H}}|+|V_{\mathcal{H}}| \cdot \log |V_{\mathcal{H}}|))$.
The congestion of an edge can be computed in polynomial time using an algorithm similar to the one used for computing betweenness centrality [5]. We present below a simple (nonoptimal) polynomial-time algorithm.
Let $\sigma_i(s, t)$ be the number of paths of length $i$ from $s$ to $t$. Then
$$\sigma_{0}(s, t)= \begin{cases}1 & \text{if } s=t \\ 0 & \text{otherwise}\end{cases}$$
and
$$\sigma_{i}(s, t)=\sum_{u \text{ adjacent to } t} \sigma_{i-1}(s, u)$$
We compute $\sigma_i(s, t)$ for all pairs of nodes $s, t$ and all $i \leq P$ in time $O(|V_{\mathcal{H}}|^2 |E_{\mathcal{H}}|)$. The distance (the shortest path length) between any two nodes is equal to

$$d(s, t)=\min \left\{i: \sigma_{i}(s, t)>0\right\}$$
Let $\tau(s, t, e)$ be the number of shortest paths from $s$ to $t$ going through edge $e$. Then, if $e=uv$,

$$\tau(s, t, uv)= \begin{cases}\sigma_{d(s, u)}(s, u) \cdot \sigma_{d(v, t)}(v, t) & \text{if } d(s, u)+1+d(v, t)=d(s, t) \\ 0 & \text{otherwise}\end{cases}$$
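The recurrence for $\sigma_i$ can be evaluated by straightforward dynamic programming. Note that $\sigma_i$ counts walks of length $i$; these coincide with shortest paths exactly when $i = d(s,t)$, which is what the algorithm exploits. A dense-table sketch (illustrative, not the tightest bookkeeping):

```python
def count_paths(adj, n, max_len):
    """sigma[i][s][t] = number of length-i walks from s to t, computed by
    the recurrence sigma_i(s,t) = sum over u adjacent to t of sigma_{i-1}(s,u).
    adj: dict node -> list of neighbors; nodes are 0..n-1."""
    sigma = [[[0] * n for _ in range(n)] for _ in range(max_len + 1)]
    for s in range(n):
        sigma[0][s][s] = 1          # base case: sigma_0(s,s) = 1
    for i in range(1, max_len + 1):
        for s in range(n):
            for t in range(n):
                sigma[i][s][t] = sum(sigma[i - 1][s][u] for u in adj[t])
    return sigma

def distance(sigma, s, t):
    """d(s,t) = smallest i with sigma_i(s,t) > 0."""
    for i, level in enumerate(sigma):
        if level[s][t] > 0:
            return i
    return None
```

On a 4-cycle, opposite nodes are at distance 2, and there are exactly two shortest paths between them (one around each side).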
Previous work discussed different options for topology mapping. We start with an extension to a simple greedy algorithm which supports heterogeneous networks, discuss recursive bisection mapping, and then discuss a new mapping strategy based on graph similarity. We also show how to support multicore nodes with established graph partitioning techniques.
3.1 Greedy Heuristic
Similar greedy algorithms have been proposed in previous work. Our greedy strategy, however, considers edge weights and thus enables mapping to heterogeneous network architectures.
Let the weight of a vertex $v \in V_{\mathcal{G}}$ be the sum of the weights of all edges $e=(v, u)$. The greedy mapping strategy starts at some vertex in $\mathcal{H}$, chooses the heaviest vertex in $\mathcal{G}$, and greedily maps its heaviest neighboring vertices in $\mathcal{G}$ to the neighboring vertices in $\mathcal{H}$ with the heaviest connections. The process is continued recursively. The detailed algorithm is presented in Algorithm 1. The greedy heuristic would find an optimal solution for the example in Figure 1 if it is started at vertex 100. This greedy approach is the most generic
Algorithm 1: Greedy Graph Embedding
Input: Graphs $\mathcal{H}$ and $\mathcal{G}$, $C(v)$ for all $v \in V_{\mathcal{H}}$.
Output: Mapping $\Gamma: V_{\mathcal{G}} \rightarrow V_{\mathcal{H}}$, congestion $\rho(e)$ for all $e \in E_{\mathcal{H}}$.
1   $S \leftarrow V_{\mathcal{G}}$
2   $Q \leftarrow$ empty priority queue
3   $\hat{\omega} \leftarrow \max_{e \in E_{\mathcal{G}}}\{\omega(e)\} \cdot |V_{\mathcal{H}}|^2$
4   initialize all $\rho(e)$ with $\hat{\omega}$   // forces minimal edge count
5   pick start vertex $s \in V_{\mathcal{H}}$
6   while $S \neq \emptyset$ do
7       find vertex $m$ with heaviest out-edges in $S$
8       if $C(s)=0$ then pick new $s \in V_{\mathcal{H}}$ such that $C(s) \geq 1$
9       $\Gamma(m) \leftarrow s$   // map $m$ to $s$
10      $S \leftarrow S \setminus m$   // remove $m$ from $S$
11      $C(s) \leftarrow C(s)-1$
12      foreach $u$ with $(m,u) \in E_{\mathcal{G}}$ and $u \in S$ do
13          $Q \leftarrow (m,u)$   // add all neighbors of $m$ that are still in $S$ to $Q$
14      while $Q \neq \emptyset$ do
15          $(u,m) \leftarrow Q$   // heaviest edge in $Q$
16          if $C(s)=0$ then
17              find closest vertex $t \in V_{\mathcal{H}}$ to $s$ with $C(t) \geq 1$
18              using an SSSP (e.g., Dijkstra's) algorithm
19              $s \leftarrow t$
20          $\Gamma(m) \leftarrow s$   // map $m$ to $s$
21          $S \leftarrow S \setminus m$   // remove $m$ from $S$
22          $C(s) \leftarrow C(s)-1$
23          add $\omega((m,u))/c(f)$ to each $\rho(f)$ for all edges $f$ on the shortest path $\Gamma(u) \rightsquigarrow \Gamma(m)$
24          foreach $u$ with $(m,u) \in E_{\mathcal{G}}$ and $u \in S$ do
25              $Q \leftarrow (m,u)$   // add all neighbors of $m$ that are still in $S$ to $Q$
26  subtract $\hat{\omega}$ from all $\rho(e)$   // correction from line 4
approach and works with all graphs and arbitrary values for $C(v)$.
THEOREM 2. The runtime of the greedy mapping algorithm is $\mathcal{O}\left(\left|V_{\mathcal{G}}\right| \cdot\left(\left|E_{\mathcal{H}}\right|+\left|V_{\mathcal{H}}\right| \log \left|V_{\mathcal{H}}\right|+\left|V_{\mathcal{G}}\right| \log \left|V_{\mathcal{G}}\right|\right)\right)$.
PROOF. Each vertex in $\mathcal{G}$ is removed exactly once from $S$. Picking a new vertex (lines 7/8) takes $\mathcal{O}\left(\left|V_{\mathcal{G}}\right|\right)$ with a linear scan. Checking whether each of the neighbors of $m$ should be added to $Q$ (lines 13, 24) can be done in $\mathcal{O}\left(\left|V_{\mathcal{G}}\right| \log \left|V_{\mathcal{G}}\right|\right)$. Lines 16-19 issue an SSSP run in $\mathcal{H}$ (e.g., Dijkstra's algorithm using a Fibonacci heap) for each $v \in V_{\mathcal{G}}$. Thus, the asymptotic runtime is $\mathcal{O}\left(\left|V_{\mathcal{G}}\right| \cdot\left(\left|E_{\mathcal{H}}\right|+\left|V_{\mathcal{H}}\right| \log \left|V_{\mathcal{H}}\right|+\left|V_{\mathcal{G}}\right| \log \left|V_{\mathcal{G}}\right|\right)\right)$.
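As a concrete illustration, the greedy strategy of Algorithm 1 can be sketched in Python. This is a simplified rendition under stated assumptions, not the paper's implementation: the function name `greedy_map` is ours, one process per host vertex is assumed ($C(v)=1$), the weighted SSSP is replaced by an unweighted BFS hop count, and the congestion bookkeeping on $\rho$ is omitted.

```python
import heapq

def greedy_map(G, H, start):
    """Greedily embed guest graph G into host graph H (Algorithm 1, simplified).

    G, H: adjacency dicts {vertex: {neighbor: weight}}; each host vertex
    holds exactly one process (C(v) = 1). Returns a dict Gamma: V_G -> V_H.
    """
    free = set(H)   # host vertices with remaining capacity
    S = set(G)      # unmapped guest vertices
    Gamma = {}

    def hops(a, b):  # unweighted BFS distance in H (stand-in for Dijkstra)
        dist, frontier = {a: 0}, [a]
        while frontier:
            nxt = []
            for x in frontier:
                for y in H[x]:
                    if y not in dist:
                        dist[y] = dist[x] + 1
                        nxt.append(y)
            frontier = nxt
        return dist.get(b, float('inf'))

    while S:
        # heaviest unmapped guest vertex (sum of incident edge weights)
        m = max(S, key=lambda v: sum(G[v].values()))
        s = start if start in free else next(iter(free))
        Gamma[m] = s
        S.discard(m)
        free.discard(s)
        Q = [(-w, m, u) for u, w in G[m].items() if u in S]
        heapq.heapify(Q)
        while Q:
            _, v, u = heapq.heappop(Q)   # heaviest pending edge (v, u)
            if u not in S:
                continue
            # free host vertex closest to the image of v
            t = min(free, key=lambda h: hops(Gamma[v], h))
            Gamma[u] = t
            S.discard(u)
            free.discard(t)
            for w2, wt in G[u].items():  # enqueue u's unmapped neighbors
                if w2 in S:
                    heapq.heappush(Q, (-wt, u, w2))
    return Gamma
```

On a 3-vertex guest path mapped onto a 3-vertex host path, the heaviest guest vertex lands on the start vertex and its neighbors spread to the nearest free hosts.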
3.2 Recursive Bisection Mapping
A second method to find a good topology mapping is recursive bisection. In this method, the weighted graphs $\mathcal{H}$ and $\mathcal{G}$ are recursively split into equal halves with a minimum weighted edge-cut to determine the mapping. This technique proved successful for determining "static mappings" in the software package SCOTCH [18].
The minimal edge cut in the bisections maps "heavy" clusters in $\mathcal{G}$ to "strong" clusters in $\mathcal{H}$. Thus, this mechanism
Algorithm 2: Function map_recursive().
Input: Graphs \(\mathcal{H}\) and \(\mathcal{G}\), \(C(v)\) for all \(v \in V_{\mathcal{H}}\).
Output: Mapping \(\Gamma: V_{\mathcal{G}} \rightarrow V_{\mathcal{H}}\).
// pre-condition: \(\sum_{v \in V_{\mathcal{H}}} C(v)==\left|V_{\mathcal{G}}\right|\)
if more than one vertex \(v \in V_{\mathcal{H}}\) with \(C(v) \neq 0\) then
    \(\left(C_{1}, C_{2}\right)=\operatorname{bisect}(\mathcal{H}, C)\);
    \(\left(\mathcal{G}_{1}, \mathcal{G}_{2}\right)=\operatorname{bisect}(\mathcal{G})\);
    if \(\sum_{c \in C_{1}} c==\left|V_{\mathcal{G}_{1}}\right|\) then
        map_recursive\(\left(\mathcal{H}, \mathcal{G}_{1}, C_{1}\right)\);
        map_recursive\(\left(\mathcal{H}, \mathcal{G}_{2}, C_{2}\right)\);
    else
        map_recursive\(\left(\mathcal{H}, \mathcal{G}_{1}, C_{2}\right)\);
        map_recursive\(\left(\mathcal{H}, \mathcal{G}_{2}, C_{1}\right)\);
else
    // map all \(n\) vertices in \(\mathcal{G}\) to the vertex with load \(n\) in \(\mathcal{H}\)
is expected to compute relatively good mappings. However, Simon and Teng show that in some cases the recursive bisection approach might result in bad $p$-way partitions [21].
THEOREM 3. The runtime of the recursive mapping algorithm is $\mathcal{O}\left(\left|E_{\mathcal{G}}\right| \log \left(\left|V_{\mathcal{G}}\right|\right)+\left|E_{\mathcal{H}}\right| \cdot\left|V_{\mathcal{G}}\right|\right)$.
PROOF. The runtime of the multilevel $k$-way partitioning approach to bisect a graph $G=(V, E)$ is $\mathcal{O}(|E|)$ [20]. The depth of recursive calls to bisect $\mathcal{G}$ is $\left\lceil\log_{2}\left(\left|V_{\mathcal{G}}\right|\right)\right\rceil$ and the size of the graph is halved in each step. Thus, the total runtime is $\sum_{k=0}^{\left\lceil\log_{2}\left(\left|V_{\mathcal{G}}\right|\right)\right\rceil-1} 2^{k}\, \mathcal{O}\left(\left|E_{\mathcal{G}}\right|\right)/2^{k}=\log_{2}\left(\left|V_{\mathcal{G}}\right|\right)\left|E_{\mathcal{G}}\right|=\mathcal{O}\left(\left|E_{\mathcal{G}}\right| \log \left(\left|V_{\mathcal{G}}\right|\right)\right)$. The depth of recursive calls to bisect $\mathcal{H}$ is the same as for $\mathcal{G}$ because the number of processors in $\mathcal{H}$ ($\sum_{v \in V_{\mathcal{H}}} C(v)=\left|V_{\mathcal{G}}\right|$) is equal to $\left|V_{\mathcal{G}}\right|$. However, in $\mathcal{H}$, all vertices are considered at each recursion level of the bisection (only edges cut in previous recursions are removed). If we assume that no edges are cut (removed), then the runtime is $\sum_{k=0}^{\left\lceil\log_{2}\left(\left|V_{\mathcal{G}}\right|\right)\right\rceil-1} 2^{k} \cdot \mathcal{O}\left(\left|E_{\mathcal{H}}\right|\right)=\mathcal{O}\left(\left|E_{\mathcal{H}}\right|\right) \cdot\left(\left|V_{\mathcal{G}}\right|-1\right)=\mathcal{O}\left(\left|E_{\mathcal{H}}\right| \cdot\left|V_{\mathcal{G}}\right|\right)$.
We used the METIS library [20] to compute a $(2,1+\epsilon)$-balanced bisection. In some rare cases the bisection had to be rebalanced. Our library does this by moving the vertex with the lowest cumulative edge weight from the bigger to the smaller partition.
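The co-bisection structure of Algorithm 2 can be sketched as follows. This is a skeleton under stated assumptions: the function names `bisect_half` and `map_recursive` are ours, one process per host vertex is assumed, and the splitter is a trivial placeholder where the paper calls a minimum-edge-cut partitioner such as METIS.

```python
def bisect_half(vertices):
    """Placeholder bisection: split a vertex list into two equal halves.
    A real implementation would call a graph partitioner (e.g., METIS)
    to minimize the weighted edge cut of the split."""
    mid = len(vertices) // 2
    return vertices[:mid], vertices[mid:]

def map_recursive(H_vertices, G_vertices, Gamma):
    """Skeleton of Algorithm 2: recursively co-bisect host allocation and
    guest graph; H_vertices and G_vertices are lists of equal length
    (pre-condition: total host capacity equals |V_G|)."""
    if len(G_vertices) <= 1:
        for g in G_vertices:          # base case: map the remaining vertex
            Gamma[g] = H_vertices[0]
        return
    H1, H2 = bisect_half(H_vertices)  # bisect host allocation
    G1, G2 = bisect_half(G_vertices)  # bisect guest graph
    # the half sizes match by construction; a capacity-aware splitter
    # would swap (G1, G2) when they do not (the else-branch of Algorithm 2)
    map_recursive(H1, G1, Gamma)
    map_recursive(H2, G2, Gamma)

Gamma = {}
map_recursive(list('abcdefgh'), list(range(8)), Gamma)
```

With the trivial splitter the recursion simply reproduces the consecutive mapping; the quality of the real algorithm comes entirely from the edge-cut-minimizing bisections.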
In the following, we discuss a new algorithm based on graph similarity. This algorithm has significantly lower time complexity and improves dilation and congestion.
3.3 Mapping based on Graph Similarity
It is well known that there is a duality between graphs and sparse matrices, and techniques from sparse linear algebra have been applied to solve graph problems [11]. The basic idea is that a graph's adjacency matrix can be modeled as a sparse matrix, which enables the application of established techniques from sparse linear algebra.
A well-studied NP-hard problem is the reduction of the bandwidth of a sparse matrix, which aims to eliminate nonzero elements far from the diagonal by renumbering the columns of the matrix. This can be used to bring the adjacency matrices of two graphs $\mathcal{G}$ and $\mathcal{H}$ into a similar shape. This technique effectively transforms both graphs into a shape where edges are localized. The Reverse Cuthill-McKee (RCM) algorithm [6] is a successful heuristic for the bandwidth reduction problem.
RCM mapping applies the RCM algorithm to $\mathcal{G}$ and $\mathcal{H}$ to compute $\pi_{\mathcal{G}}$ and $\pi_{\mathcal{H}}$ and then computes the final process permutation $\pi$ with $\pi\left(\pi_{\mathcal{G}}\right)=\pi_{\mathcal{H}}$, that is, $\pi=\pi_{\mathcal{H}} \circ \pi_{\mathcal{G}}^{-1}$. To handle mappings with $|\mathcal{G}|<|\mathcal{H}|$ correctly, all vertices $v$ with $C(v)=0$ are removed from $\mathcal{H}$. Despite potential disconnectivity of the sub-graph, RCM handles the proximity condition well and produces mappings with low dilation and congestion.
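The permutation composition above can be sketched directly. The following is a minimal sketch, assuming one process per host vertex; the function names `rcm_order` and `rcm_map` are ours, and the RCM ordering here is a basic degree-sorted BFS rather than a production implementation (production codes would pick pseudo-peripheral start vertices).

```python
from collections import deque

def rcm_order(adj):
    """Reverse Cuthill-McKee ordering of a graph given as {v: set(neighbors)}.
    Simplified: starts each component from a minimum-degree vertex and
    visits neighbors in order of increasing degree."""
    visited, order = set(), []
    for start in sorted(adj, key=lambda v: len(adj[v])):
        if start in visited:
            continue
        queue = deque([start])
        visited.add(start)
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in sorted(adj[v] - visited, key=lambda x: len(adj[x])):
                visited.add(u)
                queue.append(u)
    return order[::-1]          # the "reverse" in RCM

def rcm_map(G_adj, H_adj):
    """Map guest to host by aligning the two RCM orderings, i.e.,
    pi = pi_H o pi_G^{-1}, assuming one process per host vertex."""
    pi_G, pi_H = rcm_order(G_adj), rcm_order(H_adj)
    return dict(zip(pi_G, pi_H))
```

Because both orderings localize edges near the diagonal, vertices adjacent in $\mathcal{G}$ tend to land on vertices adjacent in $\mathcal{H}$.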
Figures 3(a) and 3(b) show the adjacency matrices for the problem graph $\mathcal{G}$ and the network graph $\mathcal{H}$, respectively.
(a) $\mathcal{G}$: adjacency map of the F1 matrix on 512 processes.
(b) $\mathcal{H}$: adjacency map for an $8 \times 8 \times 8$ torus.
Figure 3: Example of RCM topology mapping of the F1 matrix to a torus network.
Figure 3(a) shows the adjacency matrix of the communication topology for a sparse matrix-vector product of the F1 matrix on 512 processes. This represents one of our application use-cases and is described in detail in Section 5.2. Figure 3(b) shows the physical topology of an $8 \times 8 \times 8$ 3-d torus network with 512 processes.
Both figures show the original permutation on the left and the RCM permutation on the right. RCM mapping is now based on the similarity between both RCM graphs. This effectively minimizes dilation and congestion.
THEOREM 4. Let $m=\max \{\operatorname{degree}(v) \mid v \in V\}$. RCM topology mapping computes a mapping in time $\mathcal{O}\left(m_{\mathcal{H}} \log \left(m_{\mathcal{H}}\right)\left|V_{\mathcal{H}}\right|+m_{\mathcal{G}} \log \left(m_{\mathcal{G}}\right)\left|V_{\mathcal{G}}\right|\right)$.
PROOF. The complexity of RCM is $\mathcal{O}(m \log (m)|V|)$ [6], where $m=\max \{\operatorname{degree}(v) \mid v \in V\}$. The algorithm applies RCM to $\mathcal{H}$ and $\mathcal{G}$, and the mapping can be computed from the results in $\mathcal{O}\left(\left|V_{\mathcal{G}}\right|\right)$.
The discussion in the introduction suggests that $m_{\mathcal{H}}=\mathcal{O}(1)$, and scalable parallel algorithms often have $m_{\mathcal{G}}=\mathcal{O}\left(\log \left(\left|V_{\mathcal{G}}\right|\right)\right)$. RCM is thus significantly faster than the greedy and the recursive mapping approaches and is a good candidate for large-scale systems.
3.4 Supporting Multicore Nodes
If compute nodes (vertices $v \in V_{\mathcal{H}}$) execute more than one process, then a graph partitioner can be used to divide $\mathcal{G}$ before other mapping strategies are applied. Graph partitioners support the common case where each allocated node executes the same number of processes, $C(v)=p\ \forall v \in \Gamma\left(V_{\mathcal{G}}\right)$, and the topology graph $\mathcal{G}$ needs to be partitioned into $P/p$ equal pieces.
This technique benefits from the long experience in serial and parallel graph partitioning. Multiple heuristics for $(k, 1+\epsilon)$-balanced partitioning using geometric, combinatorial, spectral, and multilevel schemes exist [8, §18].
Libraries such as METIS [20] or SCOTCH [18] and their parallel versions offer optimized partitioning heuristics. However, most graph partitioners cannot guarantee perfectly $(k, 1)$-balanced partitions, only $(k, 1+\epsilon)$-balanced ones (for small $\epsilon$). Thus, the partition might need to be corrected to be $(k,1)$-balanced. We use the ParMETIS partitioner (multilevel $k$-way partitioning in $\mathcal{O}\left(\left|E_{\mathcal{G}}\right|\right)$ [20]) to compute $(k, 1+\epsilon)$-balanced partitions and balance the partitions if necessary.
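The balancing correction can be sketched as follows; this is a hypothetical helper (`rebalance` is our name, not a library call) illustrating the rule of moving the vertex with the lowest cumulative edge weight from an oversized to an undersized partition.

```python
def rebalance(partitions, weights, target):
    """Correct an almost-balanced partition to exactly `target` vertices
    per part. `weights[v]` is the cumulative edge weight of vertex v;
    the lightest vertex of an oversized part is moved to an undersized
    part, repeating until all parts have `target` vertices."""
    parts = [list(p) for p in partitions]
    while True:
        big = next((p for p in parts if len(p) > target), None)
        if big is None:
            break                                   # already (k,1)-balanced
        small = next(p for p in parts if len(p) < target)
        v = min(big, key=lambda x: weights[x])      # lightest vertex moves
        big.remove(v)
        small.append(v)
    return parts
```

Moving the lightest-connected vertex keeps the increase in edge cut small, since that vertex contributes the least communication volume across the new boundary.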
3.5 Improving the Initial Solution
We now describe a heuristic that might further improve the solution found, as used in several previous works. Threshold Accepting (TA) [9] is an improved variant of simulated annealing or hill climbing which takes an initial solution and tries to optimize it further by searching for a local minimum. We use 20 iterations in the inner optimization loop and a time limit to determine the number of outer optimization iterations. Candidate solutions are modified by swapping two random positions in the mapping $\pi$. We introduce a fast algorithm to estimate the congestion in Section 5.1, which is also used as the objective function to be minimized in our TA implementation. The asymptotic running time of each iteration of TA is equal to the running time of Algorithm 3 (cf. Theorem 5).
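The swap-based refinement can be sketched as below. This is a minimal sketch, assuming a generic `cost` callback in place of the congestion estimate; the function name `threshold_accepting`, the fixed outer iteration count (the paper uses a time limit), and the threshold decay schedule are our illustrative choices.

```python
import random

def threshold_accepting(pi, cost, inner_iters=20, outer_iters=50,
                        threshold=1.0, decay=0.9, seed=0):
    """Refine a mapping permutation `pi` (a list) by random pairwise swaps,
    accepting a candidate if it is at most `threshold` worse than the
    incumbent (Threshold Accepting [9]). `cost` plays the role of the
    congestion estimate of Algorithm 3."""
    rng = random.Random(seed)
    best, best_cost = list(pi), cost(pi)
    cur, cur_cost = list(best), best_cost
    for _ in range(outer_iters):            # paper: bounded by a time limit
        for _ in range(inner_iters):        # paper: 20 inner iterations
            i, j = rng.sample(range(len(cur)), 2)
            cur[i], cur[j] = cur[j], cur[i]
            c = cost(cur)
            if c <= cur_cost + threshold:   # accept slightly worse solutions
                cur_cost = c
                if c < best_cost:
                    best, best_cost = list(cur), c
            else:                           # reject: undo the swap
                cur[i], cur[j] = cur[j], cur[i]
        threshold *= decay                  # tighten acceptance over time
    return best, best_cost
```

Unlike simulated annealing, acceptance is deterministic given the threshold, which avoids evaluating random acceptance probabilities in the inner loop.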
In the next section, we describe how to effectively compose all strategies into a topology mapping framework and apply them to real-world network architectures.
4. A TOPOLOGY MAPPING LIBRARY
Several problems need to be solved in addition to the mapping problem in order to use topology mapping in practice. We show a mechanism that supports most interconnection networks and implement it in a portable library to perform automated topology mapping for parallel applications.
4.1 Determining the Network Topology
The first practical problem is to determine the network topology graph $\mathcal{H}$. This task can be handled manually based on the physical connections between compute nodes. However, many interconnection networks offer automated tools to query the topology and connectivity. Table 1 lists the tools that can be used to query the topology of different networks. All listed networks are supported by our implementation. The result of running those tools, the graph $\mathcal{H}$, is stored as an adjacency list in a configuration file on disk. When a parallel job starts up, each process loads $\mathcal{H}$ and identifies the vertex that it runs on. The identification is done with the hostname of the machine, that is, each vertex in the $\mathcal{H}$ file (representing a compute node) has its hostname
Figure 4: Optimization process flow. The mapping with the lowest congestion is chosen at the end.
Algorithm 3: Fast congestion assessment.
Input: Graphs \(\mathcal{H}\) and \(\mathcal{G}\), mapping \(\Gamma: V_{\mathcal{G}} \rightarrow V_{\mathcal{H}}\).
Output: Congestion \(\rho(e)\) for all \(e \in E_{\mathcal{H}}\).
\(\hat{\omega}=\max_{e \in E_{\mathcal{G}}}\{\omega(e)\} \cdot \left|V_{\mathcal{H}}\right|^{2}\);
initialize all \(\rho(e)\) with \(\hat{\omega}\); // enforce paths with minimal number of edges in SSSP
foreach \(e=(u, v) \in E_{\mathcal{G}}\) do
    find shortest path \(\Gamma(u) \leadsto \Gamma(v)\) in \(\mathcal{H}\); // implicitly minimizing congestion
    increase edge weight \(\rho(f)\) of each edge \(f \in E_{\mathcal{H}}\) along path \(\Gamma(u) \leadsto \Gamma(v)\) by \(\omega(e)/c(f)\);
subtract \(\hat{\omega}\) from all \(\rho(e)\); // correct edge weights
as an attribute attached to it. Each process $p$ now has access to the initial mapping $\Gamma(p)$, which is often not under the user's control (e.g., determined by the batch system). BlueGene/P is an exception: there, $\mathcal{H}$ is created on the fly after querying the Deep Computing Messaging Framework (DCMF) for all topology information.
4.2 Composing a Mapping Strategy
We now seek to permute processes so as to reduce congestion and dilation. We assume that interference with other jobs is negligible, so that congestion can be computed from the network topology, the location of the allocated processes, and the communication graph.
If all $C(v)=p$ are equal, then an optional graph partitioning phase as described in Section 3.4 is used to divide $\mathcal{G}$ into $P/p$ partitions. The topomapper library uses ParMETIS [20] to perform the partitioning and corrects the resulting $(k, 1+\epsilon)$-balanced partitioning by moving vertices from partitions with more than $p$ vertices to partitions with fewer than $p$ vertices. The correction step moves the vertices with the least cumulative edge weight. After this optional partitioning step, a new graph $\mathcal{G}^{\prime}$ is created that contains the partitions as vertices, with $\left|V_{\mathcal{G}^{\prime}}\right|=P/p$. Only inter-partition edges from $\mathcal{G}$ remain in $\mathcal{G}^{\prime}$, and vertices are numbered from 0 to $\frac{P}{p}-1$.
A second step applies Greedy, Recursive, or RCM mapping as described in Sections 3.1, 3.2, and 3.3, respectively. The mappings can additionally be optimized by applying the threshold accepting algorithm discussed in Section 3.5.
The complete control flow of the optimization process is shown in Figure 4. All processes apply the optimization process; subsets of processes can perform different optimizations, for example, each process can choose a different starting vertex for the Greedy mapping. The permutation with the lowest congestion is chosen at the end of the optimization process and returned.
5. EXPERIMENTAL ANALYSIS
We analyze the efficiency and performance of mappings of irregular process topologies onto different multicore network topologies.
5.1 A Fast Algorithm to Assess Congestion
Assessing the congestion with the technique described in Section 2.5 is impractical at large scale due to its high time complexity ($\mathcal{O}\left(\left|V_{\mathcal{H}}\right|^{2}\left|E_{\mathcal{H}}\right|\right)=\mathcal{O}\left(\left|V_{\mathcal{H}}\right|^{4}\right)$). Thus, we propose a portable and fast heuristic for determining the approximate congestion of all edges in $\mathcal{H}$ in Algorithm 3. The congestion is computed by repeated shortest-path calculations. To find the minimal congestion, the edge weights along used (shortest) paths are updated after each search to reflect the current load. This leads to an automatic balancing of load along all paths. However, with this scheme, paths with more edges and less congestion on those edges might have shorter weighted distances. This is avoided by initializing the edges to a high weight $\hat{\omega}=\max_{e \in E_{\mathcal{G}}}\{\omega(e)\} \cdot\left|V_{\mathcal{H}}\right|^{2}$ so that a path with fewer edges always has a shorter weighted distance regardless of the congestion. Among all paths with the minimal number of edges, those with minimal congestion are then preferred.
THEOREM 5. The runtime of Algorithm 3 is $\mathcal{O}\left(\left|E_{\mathcal{G}}\right| \cdot\left(\left|E_{\mathcal{H}}\right|+\left|V_{\mathcal{H}}\right| \cdot \log \left|V_{\mathcal{H}}\right|\right)\right)$.
PROOF. Exactly one SSSP run on $\mathcal{H}$ (e.g., Dijkstra's algorithm using a Fibonacci heap) is started for each edge in $\mathcal{G}$ (lines 3-4). Thus, the asymptotic runtime of Algorithm 3 is $\mathcal{O}\left(\left|E_{\mathcal{G}}\right| \cdot\left(\left|E_{\mathcal{H}}\right|+\left|V_{\mathcal{H}}\right| \cdot \log \left|V_{\mathcal{H}}\right|\right)\right)$.
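Algorithm 3 can be sketched as follows. This is a simplified illustration (the function name `estimate_congestion` is ours, and a binary-heap Dijkstra replaces the Fibonacci-heap variant), but the $\hat{\omega}$ offset and the per-edge load accumulation follow the algorithm directly.

```python
import heapq

def estimate_congestion(H, cap, G_edges, Gamma):
    """Approximate congestion per Algorithm 3: route each guest edge along
    a currently cheapest shortest path in H and accumulate its load.
    H: {v: [neighbors]} (directed adjacency), cap: {(u, v): capacity c(f)},
    G_edges: [(u, v, weight)], Gamma: guest-to-host mapping."""
    omega_hat = max(w for _, _, w in G_edges) * len(H) ** 2
    rho = {e: omega_hat for e in cap}   # offset makes hop count dominate

    def shortest_path(src, dst):        # Dijkstra on current rho weights
        dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
        while pq:
            d, v = heapq.heappop(pq)
            if v == dst:
                break
            if d > dist[v]:
                continue
            for u in H[v]:
                nd = d + rho[(v, u)]
                if nd < dist.get(u, float('inf')):
                    dist[u], prev[u] = nd, v
                    heapq.heappush(pq, (nd, u))
        path, v = [], dst               # recover the edge list
        while v != src:
            path.append((prev[v], v))
            v = prev[v]
        return path

    for u, v, w in G_edges:
        for f in shortest_path(Gamma[u], Gamma[v]):
            rho[f] += w / cap[f]        # load every edge on the path
    return {e: r - omega_hat for e, r in rho.items()}  # undo the offset
```

Because every edge starts at weight $\hat{\omega}$, which exceeds any accumulated load, a path with fewer hops always wins; ties in hop count are broken by the load already routed.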
5.2 Real-world Irregular Process Topologies
Sparse matrix-vector multiplication is one of the most important kernels in large-scale scientific applications and can be used to solve a large class of scientific computing problems [8]. In order to capture the characteristics of real irregular applications, we use parallel sparse matrix-vector products with real-world input matrices from the University of Florida Sparse Matrix Collection [7]: F1, nlpkkt240, and audikw_1. All three matrices represent unstructured matrices/grids. F1 and audikw_1 are symmetric stiffness matrices (approximating elasticities in structural mechanics) modeling automotive crankshafts. The nlpkkt240 matrix is the largest matrix in the collection and represents a nonlinear programming problem for a 3d PDE-constrained optimization. Table 2 lists the dimensions and number of nonzero (nnz) entries for each matrix.

Figure 5: Topology mapping results for different topologies.
Table 2: Properties of the test matrices.
The vector of a sparse matrix-vector product is initially distributed block-wise. Each element $i$ of the vector requires all elements $j$ where the matrix element $A_{i, j}$ is nonzero. Most matrix elements are zero, and the pattern of nonzero elements depends on the structure of the input system. Thus, in order to minimize the communication, scientific codes usually partition the matrix with a graph partitioner and redistribute matrix and vector elements accordingly. We use ParMETIS to find a decomposition of the matrix in order to minimize communication.${ }^{1}$ The domain-optimized decomposition is then used to derive the number of vector elements that need to be communicated from and to each process. We build a weighted MPI-2.2 process topology that reflects the communication requirements of the decomposition. The resulting distributed topology communicator [12] is used by our topology mapping library to optimize the process-to-node mapping.
All experiments presented below use the same input matrices, which means that they simulate a strong-scaling problem. We ran all experiments with up to 1,792 processes (or the maximum supported by the physical topology). All presented results used TA to refine the mapping unless otherwise noted. Results without TA are omitted for brevity. TA improved the congestion by between 2% and 9%.
5.3 Petascale Network Topologies
We investigate topologies that are used to build current and future petascale-class systems: a three-dimensional torus is used in the Cray XT-5 and IBM Blue Gene architectures, and the IBM PERCS network [1] uses a heterogeneous, hierarchical, fully-connected topology to construct a 10-petaflop computer.
Due to space limitations, we present only one representative matrix for each network topology. We also analyze only one process per node in Sections 5.3 and 5.4 because we assume that hybrid programming schemes will be used to exploit the full potential of those machines.
5.3.1 Three-Dimensional Torus
A $k$-dimensional torus of size $x_{1} \times \cdots \times x_{k}$ has vertices $\langle m_{1} \ldots m_{k}\rangle$, where $0 \leq m_{i}<x_{i}$, and edges connecting $\langle m_{1} \ldots m_{k}\rangle$ to $\langle m_{1} \ldots m_{i} \pm 1\ (\bmod\ x_{i}) \ldots m_{k}\rangle$, for $i=1, \ldots, k$.
We investigate 3-dimensional toruses with cube topologies ($x_{1}=x_{2}=x_{3}$), which maximize bisection bandwidth. Processes are mapped in lexicographical order in the initial allocation, i.e., $\langle 0,0,0\rangle,\langle 0,0,1\rangle, \ldots,\langle x_{1}-1, x_{2}-1, x_{3}-2\rangle,\langle x_{1}-1, x_{2}-1, x_{3}-1\rangle$.
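The torus definition above translates directly into an adjacency-list constructor; the following sketch (the function name `torus` is ours) builds the host graph $\mathcal{H}$ for arbitrary dimensions.

```python
import itertools

def torus(*dims):
    """Adjacency list of a k-dimensional torus of size x1 x ... x xk:
    vertex <m1..mk> connects to <m1 .. mi +/- 1 (mod xi) .. mk> in each
    dimension i."""
    adj = {}
    for v in itertools.product(*(range(x) for x in dims)):
        nbrs = set()
        for i, x in enumerate(dims):
            for step in (-1, 1):
                u = list(v)
                u[i] = (u[i] + step) % x
                nbrs.add(tuple(u))
        nbrs.discard(v)     # x_i = 1 collapses both neighbors onto v
        adj[v] = sorted(nbrs)
    return adj
```

Note that for $x_i = 2$ the $+1$ and $-1$ neighbors coincide, so a $2 \times 2$ torus has degree 2 instead of 4; for the cube topologies with $x_i \geq 3$ studied here, every vertex has degree $2k$.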
Figure 5(a) shows the maximum congestion of mapping the communication topology that results from a domain decomposition of the nlpkkt240 matrix to different 3d-torus networks. The relative gain over the initial consecutive mapping increases with the network size. Greedy mapping reduces the maximum congestion by 27% for a $3^{3}$ and by up to 32% for a $12^{3}$ torus network. RCM is slightly worse than Greedy in all configurations; however, it reduces the dilation significantly. The recursive mapping algorithm delivers the best results at large scale, where it outperforms Greedy with a relative gain of 44% for a $12^{3}$ network. The average dilation on a $12^{3}$ torus was 9.00, 9.03, 7.02, and 4.50 for the initial, Greedy, RCM, and Recursive mappings, respectively. Recursive reduces the average dilation by 50% and might thus result in the lowest power consumption.
The memory overhead to store the physical topology was between 0.63 kiB for $3^3$ and 31.20 kiB for $12^3$. The timings show that RCM takes essentially constant time (never more than 0.01 s) and Greedy and Recursive take up to 1 s, while TA can be infeasibly expensive at nearly 10 minutes.
5.3.2 PERCS Network
The PERCS topology [1] was designed by IBM to construct a multi-petaflop machine. The network consists of three different link types, LL, LR, and D, with different speeds. Each endpoint connects to 7 neighbors via LL links at a rate of 24 GiB/s, 24 neighbors via LR links at a rate of 5 GiB/s, and up to 16 neighbors via D links at 10 GiB/s. Each stage (link type) forms a fully-connected network. A set of nodes that is fully connected with LL links is called a drawer, and a set of nodes fully connected with LL+LR links is called a supernode; supernodes are fully connected by D links. Each drawer consists of 8 nodes and each supernode consists of 4 drawers. The size of the network is determined by the number of D links. The maximum distance between any two nodes is three. A detailed description of the network and the topology can be found in [1]. We assume 9 D links per node, which results in 9,248 nodes total, and we connect all D links randomly. The total topology occupies 1,445 kiB in main memory.
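With 8 nodes per drawer and 4 drawers (32 nodes) per supernode, the link type connecting two nodes in a consecutive numbering follows directly from integer division; a small illustrative helper under these assumptions (ours, not IBM's software):

```python
NODES_PER_DRAWER = 8
DRAWERS_PER_SUPERNODE = 4
NODES_PER_SUPERNODE = NODES_PER_DRAWER * DRAWERS_PER_SUPERNODE  # 32

def percs_link_type(a, b):
    """Link type between nodes a and b under consecutive numbering."""
    if a // NODES_PER_SUPERNODE != b // NODES_PER_SUPERNODE:
        return "D"   # different supernodes: 10 GiB/s D link
    if a // NODES_PER_DRAWER != b // NODES_PER_DRAWER:
        return "LR"  # same supernode, different drawers: 5 GiB/s
    return "LL"      # same drawer: 24 GiB/s
```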
For the first simple example, we assume that processes are allocated and mapped consecutively to nodes in drawers and then drawers in supernodes. Figure 5(b) shows the result of topology mapping for this heterogeneous network architecture. Topology mapping can reduce the maximum congestion by up to 80% ($P = 1{,}792$). The large improvement comes from the effective exploitation of the different link speeds by the greedy strategy. RCM performs consistently slightly worse than Greedy because it does not take the link capacities into account. With 1,728 nodes, Recursive achieves a lower average congestion (1.82) than Greedy (2.89). The benefits grow with the size of the allocation.
Again, RCM mapping consistently takes less than 0.01 s, while Greedy grows from 0.8 s to 22 s and Recursive from 4.51 s to 7.51 s. TA took 41 minutes at $P = 512$ and was thus disabled for $P > 512$.
We now investigate our topology mapping strategies on large-scale InfiniBand installations. We used the tools described in Section 4 to query the network topology of two large-scale systems, Juropa at the Jülich Supercomputing Center and Ranger at the Texas Advanced Computing Center. Both systems use InfiniBand topologies that are similar to fat-trees. The number of nodes in the systems was 3,292 for Juropa and 4,081 for Ranger. For our initial allocations, we use the order of hostnames, as a batch system does by default. As before, we assume one process per core in our analyses to investigate the quality of topology mapping separately from multicore mapping.
RCM is again fastest, with less than 0.01 s. Greedy takes between 0.16 s and 2.6 s and Recursive between 0.63 s and 1.21 s, while TA, taking up to 9 minutes, is only feasible at small scales. Juropa's complete topology occupied 87 kiB of memory.
5.4.2 Ranger
Figure 6(a) shows the results of topology mapping on the Ranger cluster. The maximum congestion was improved by up to 50%, depending on the allocation size. Figure 6(b) shows the mapping times for the Ranger system. Again, Greedy performs significantly better than RCM at a much higher cost. RCM finished all mapping problems in less than 0.01 s, while Greedy used between 0.26 s and 3.85 s and Recursive between 0.76 s and 1.5 s. TA took up to 14 minutes for the largest problem and only improved it modestly. Ranger's complete topology occupied 134 kiB of memory.
5.5 Benchmark Results
In our theoretical analysis and simulations, we made several assumptions on the (ideal) routing scheme and network behavior. The improvements reported by our mapping strategies are thus lower bounds and are hard to achieve in practice.
We now show benchmark results on Surveyor, an IBM BlueGene/P system at the Argonne National Lab, to demonstrate the utility of our topology mapping library and algorithms in practice.
As in the simulation, each process loads a part of the matrix, decomposes it with a graph partitioner, constructs an MPI-2.2 graph topology, and calls the topomapper library to optimize the mapping. The library exercises all options as described in Section 4 and returns an optimized mapping.
We measured the time to perform 100 communication phases in isolation and report the maximum time across all ranks before and after applying the mapping. We also compute a predicted time from the improvement in maximum congestion, which is a lower bound to the actual improvement.
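In a bandwidth-bound model where the communication phase is dominated by the most congested link, this prediction reduces to scaling the measured time by the congestion ratio; a minimal sketch under that simplifying assumption (the helper is ours):

```python
def predicted_time(measured_time, congestion_before, congestion_after):
    """Predict the phase time after remapping, assuming the phase is
    bandwidth-bound and limited by the most congested link: the time
    shrinks by the same factor as the maximum congestion."""
    return measured_time * congestion_after / congestion_before

# e.g. a 1.0 s phase with maximum congestion reduced from 10 to 8
# is predicted to take 0.8 s.
```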
These experiments show that topology mapping leads to significant improvements in practical settings.
6. CONCLUSIONS AND FUTURE WORK
In this work, we defined the topology mapping problem and presented a proof that finding an optimal solution to the problem is NP-hard. This opens the door to investigating the efficiency of different heuristics for topology mapping.
We proposed different topology mapping algorithms that support arbitrary heterogeneous network and application topologies and showed their effective use in the context of sparse linear algebra computations. The proposed topology mapping algorithms have been implemented to support reordering in the intuitive distributed graph topology interface of MPI-2.2.
We showed improvements of the maximum congestion of up to 80%, and our results indicate that the benefits of topology mapping grow with the system size. We analyzed the scalability of the different mapping approaches. Our theoretical and practical analysis shows that Greedy and Recursive are slower than RCM and that additional optimization with threshold accepting (TA) might be prohibitively expensive. Greedy scales approximately linearly with the system size for all our investigated application and network topologies, which means that it might not be suitable for large mappings. Recursive mapping is faster but might result in worse congestion. However, RCM is fastest in theory and never took longer than 0.01 s in our experiments. We also found that Greedy performs well for minimizing congestion, and Recursive and RCM for minimizing dilation. This creates interesting opportunities for further investigation. We conclude that TA can improve most mappings further but is not scalable to large systems.
Figure 6: Topology mapping results for different networks.
Our proposed optimization framework utilizes the available parallelism in the system. It starts the Greedy algorithm at different source vertices on each node, simultaneously applies RCM and Recursive on one node each, and selects the best solution found. We demonstrated speedups of up to 18% for the communication phase of a sparse matrix-vector multiplication on 512 BlueGene/P nodes.
We plan to investigate optimized strategies for initial process-to-node mappings on different architectures. The PERCS network topology presents multiple interesting challenges in this area.
Our implementation can immediately be used to optimize communication on petascale systems. However, the proposed mapping algorithms can scale to the size of exascale systems. The two metrics, maximum congestion and average dilation, can be used to optimize and trade off application runtime and power consumption on such systems. Exascale systems will need to exhibit substantially improved communication locality in order to achieve acceptable energy consumption [14]. The use of high-quality mapping procedures will be essential to achieving this goal.
Acknowledgments. We thank Peter Gottschling and Andrew Lumsdaine for many helpful discussions and comments. Thanks to Bernd Mohr for providing the Juropa topology and Len Wisniewski and the TACC for providing the Ranger topology. This work is supported by the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (award number OCI 07-25070) and the state of Illinois.
7. REFERENCES
[1] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, and R. Rajamony. The PERCS High-Performance Interconnect. In Proc. of the 18th Symposium on High-Performance Interconnects (HotI'10), Aug. 2010.
[2] A. Bhatelé, L. V. Kalé, and S. Kumar. Dynamic topology aware load balancing algorithms for molecular dynamics applications. In ICS '09, pages 110-116, New York, NY, USA, 2009. ACM.
[3] S. H. Bokhari. On the mapping problem. IEEE Trans. Comput., 30(3):207-214, 1981.
[4] S. W. Bollinger and S. F. Midkiff. Heuristic technique for processor and link assignment in multicomputers. IEEE Trans. Comput., 40(3):325-333, 1991.
[5] U. Brandes. A faster algorithm for betweenness centrality. The Journal of Math. Sociology, 25(2):163-177, 2001.
[6] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, ACM '69, pages 157-172, New York, NY, USA, 1969. ACM.
[7] T. A. Davis. University of Florida Sparse Matrix Collection. NA Digest, 92, 1994.
[8] J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White, editors. Sourcebook of Parallel Computing. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[9] G. Dueck and T. Scheuer. Threshold accepting: a general purpose optimization algorithm appearing superior to simulated annealing. J. Comput. Phys., 90(1):161-175, 1990.
[10] M. Garey and D. Johnson. Computers and Intractability: A Guide to NP-Completeness. W. H. Freeman and Company, New York, 1979.
[11] J. R. Gilbert, S. Reinhardt, and V. B. Shah. High-performance graph algorithms from parallel sparse matrices. In PARA'06: Proceedings of the 8th International Conference on Applied Parallel Computing, pages 260-269, 2007.
[12] T. Hoefler, R. Rabenseifner, H. Ritzdorf, B. R. de Supinski, R. Thakur, and J. L. Traeff. The Scalable Process Topology Interface of MPI 2.2. Concurrency and Computation: Practice and Experience, 23(4):293-310, Aug. 2010.
[13] R. Johari and D. Tan. End-to-end congestion control for the internet: delays and stability. IEEE/ACM Transactions on Networking, 9(6):818-832, Dec. 2001.
[14] P. Kogge et al. Exascale computing study: Technology challenges in achieving exascale systems. DARPA Information Processing Techniques Office, Washington, DC, 2008.
[15] S.-Y. Lee and J. K. Aggarwal. A mapping strategy for parallel processing. IEEE Trans. Comput., 36(4):433-442, 1987.
[16] MPI Forum. MPI: A Message-Passing Interface Standard. Version 2.2, June 23rd 2009. www.mpi-forum.org.
[17] D. Pekurovsky. P3DFFT - Highly scalable parallel 3D Fast Fourier Transforms library. Technical report, 2010.
[18] F. Pellegrini and J. Roman. Scotch: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs. In HPCN Europe'96, pages 493-498, 1996.
[19] A. L. Rosenberg. Issues in the study of graph embeddings. In WG'80, pages 150-176, London, UK, 1981.
[20] K. Schloegel, G. Karypis, and V. Kumar. Parallel static and dynamic multi-constraint graph partitioning. Concurrency and Computation: Practice and Experience, 14(3):219-240, 2002.
[21] H. D. Simon and S.-H. Teng. How good is recursive bisection? SIAM J. Sci. Comput., 18:1436-1445, September 1997.
[22] J. L. Träff. Implementing the MPI process topology mechanism. In Supercomputing '02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1-14, 2002.
[23] H. Yu, I.-H. Chung, and J. Moreira. Topology mapping for Blue Gene/L supercomputer. In SC'06, page 116, New York, NY, USA, 2006. ACM.
ICS'11, May 31-June 4, 2011, Tucson, Arizona, USA.
Copyright 2011 ACM 978-1-4503-0102-2/11/05...$10.00.
${}^{1}$ This step should not be confused with graph partitioning for multicore topology mapping even though it uses the same tools!