
Communication channel failover in a high performance computing (HPC) network

Active Publication Date: 2014-06-19
IBM CORP
Cites: 4. Cited by: 30.

AI Technical Summary

Benefits of technology

The invention provides a solution to failover issues in a cluster fabric by allowing for the transfer of a communication channel's state between windows in a hardware device. This is achieved by updating mappings between memory resources and hardware resources in the fabric interface device without modifying the memory resources. The technical effect is minimizing or eliminating the impact on clients that utilize the communication channel during failover.

Problems solved by technology

Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability.
For example, many SMP architectures suffer to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases.
Processing units in the nodes enjoy relatively low access latencies for data contained in the local system memory of the processing units' respective nodes, but suffer significantly higher access latencies for data contained in the system memories in remote nodes.
Thus, access latencies to system memory are non-uniform.
Communication loss between coordinating processes on different computation nodes (e.g., user jobs or OS instances) has been found to lead to delay or loss of job progress, lengthy recovery, and/or jitter in the system, effectively wasting computing resources and power and delaying the eventual result.
However, doing so doubles CPU/memory resources and bandwidth usage, and requires merging or discarding results coming back from multiple sources.
Doing so, however, requires additional channel resources to be assigned per end-client (compute job), additional resources to manage multiple channels, and additional overhead in user jobs or OS libraries to manage merging communications streams.
Moreover, any operations queued to failed hardware will often be lost, as failure of one channel often may only be detected by a long-interval software timer.
However, such solutions require additional channel resources to be assigned per end-client (compute job), most of which are never used.
Additional resources are also typically required to manage multiple channels, and any operations queued to the failed hardware will typically be lost.
In addition, failure of one channel typically may only be detected with a long-interval software timer.



Embodiment Construction

[0025] Now turning to the drawings, wherein like numbers denote like parts throughout the several views, FIG. 1 illustrates a high-level block diagram depicting a first view of an example data processing system 100 configured with two nodes connected via respective host fabric interfaces, according to one illustrative embodiment of the invention, and within which many of the functional features of the invention may be implemented. As shown, data processing system 100 includes multiple processing nodes 102A, 102B (collectively 102) for processing data and instructions. Processing nodes 102 are coupled via host fabric interface (HFI) 120 to an interconnect fabric 110 that supports data communication between processing nodes 102 in accordance with one or more interconnect and/or network protocols. Interconnect fabric 110 may be implemented, for example, utilizing one or more buses, switches and/or networks. Any one of multiple mechanisms may be utilized by the HFI 120 to communicate acr...


Abstract

A method, apparatus and program product implement a failover of a communication channel in a cluster fabric that transfers a state of the communication channel between windows resident in a hardware fabric interface device. The failover is desirably implemented by updating a plurality of mappings between memory resources in a host memory and hardware resources in the fabric interface device, and typically without modifying the memory resources such that involvement of a client that utilizes the communication channel in the failover is minimized or eliminated.
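The idea of failing over by re-pointing mappings rather than touching host memory can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Window` and `Channel` classes, the `open_channel` and `fail_over` helpers, and the use of Python object identities as stand-ins for host memory addresses are hypothetical and are not taken from the patent or from any real fabric interface API.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    """Hypothetical model of one window in the fabric interface device."""
    window_id: int
    # Hardware-side mappings: resource name -> host memory "address".
    mappings: dict = field(default_factory=dict)
    healthy: bool = True


@dataclass
class Channel:
    """Hypothetical channel: host memory resources plus the window serving them."""
    host_queues: dict  # resource name -> buffer in host memory
    window: Window


def open_channel(host_queues, window):
    # Point the window at the client's host memory resources.
    # id(buf) stands in for a real host memory address (an assumption).
    window.mappings = {name: id(buf) for name, buf in host_queues.items()}
    return Channel(host_queues=host_queues, window=window)


def fail_over(channel, spare):
    """Transfer channel state to a spare window by updating mappings only."""
    failed = channel.window
    # Re-point the spare window at the same host memory resources.
    spare.mappings = dict(failed.mappings)
    # Detach the failed window so it no longer references host memory.
    failed.mappings = {}
    failed.healthy = False
    # The client keeps the same Channel object and the same buffers.
    channel.window = spare


# Usage: queued work in host memory is untouched by the failover.
queues = {"send": bytearray(b"pending-op"), "recv": bytearray(8)}
primary, spare = Window(0), Window(1)
channel = open_channel(queues, primary)
fail_over(channel, spare)
```

In this sketch the client's buffers are never copied or rewritten; only the window-side mappings change, which is the property the abstract credits with minimizing client involvement.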

Description

FIELD OF THE INVENTION

[0001] The invention is generally related to data processing systems, and in particular to handling communication failures in distributed data processing systems.

BACKGROUND OF THE INVENTION

[0002] It is well-accepted in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processing units. Multi-processor (MP) computer systems may implement a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One common MP computer architecture is a symmetric multi-processor (SMP) architecture in which multiple processing units, each supported by a multi-level cache hierarchy, share a common pool of resources, such as a system memory and input/output (I/O) subsystem, which are often coupled to a shared system interconnect.

[0003] Although SMP computer systems permi...


Application Information

IPC(8): G06F11/14
CPC: G06F11/1412, G06F11/1438, G06F11/1484, G06F11/20, G06F11/0709, G06F11/2007
Inventors: ARROYO, JESSE P.; BAUMAN, ELLEN M.; SCHIMKE, TIMOTHY J.
Owner: IBM CORP