Internet-Draft                Fast Fault Tolerance Architecture                October 2024
Li, et al.                              Expires 24 April 2025
This document introduces a fast rerouting architecture that enhances network resilience through rapid failure detection and swift traffic rerouting within the programmable data plane, leveraging in-band network telemetry and source routing. Unlike traditional methods that rely on the control plane and therefore incur significant rerouting delays, the proposed architecture uses white-box modeling of the data plane to distinguish and analyze packet losses accurately, enabling immediate identification of link failures (including black-hole and gray failures). By combining real-time telemetry with SR-based rerouting, the proposed solution reduces rerouting times to a few milliseconds, a substantial improvement over existing practice and a significant advance in the fault tolerance of datacenter networks.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 24 April 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In the rapidly evolving landscape of network technologies, ensuring the resilience and reliability of data transmission has become paramount. Traditional approaches to network failure detection and rerouting, heavily reliant on the control plane, often suffer from significant delays due to the inherent latency in failure notification, route learning, and route table updates. These delays can severely impact the performance of time-sensitive applications, making it crucial to explore more efficient methods for failure detection and traffic rerouting. The fast fault tolerance (FFT) architecture leverages the capabilities of the programmable data plane to significantly reduce the time required to detect link failures and reroute traffic, thereby enhancing the overall robustness of datacenter networks.¶
The FFT architecture integrates in-band network telemetry (INT) [RFC9232] with source routing (SR) [RFC8402] to facilitate rapid path switching directly within the data plane. Unlike traditional schemes that treat the data plane as a "black box" and struggle to distinguish between different types of packet losses, this approach adopts a "white box" model of the data plane's packet processing logic. This allows for a precise analysis of packet loss types and the implementation of targeted statistical methods for failure detection. By deploying packet counters at both ends of a link and comparing them periodically, FFT can identify fault-induced packet losses quickly and accurately.¶
Furthermore, by pre-maintaining a path information table and utilizing SR (e.g., SRv6 [RFC8986] and SR-MPLS [RFC8660]), the FFT architecture enables the sender to quickly switch traffic to alternative paths without control plane intervention. This not only circumvents the delays associated with traditional control plane rerouting but also overcomes the limitation of data plane rerouting schemes, which cannot prepare alternative routes for every failure scenario in advance. The integration of INT allows for real-time failure notification, keeping traffic recovery times within a few milliseconds, significantly faster than conventional methods. This document details the principles, architecture, and operational mechanisms of FFT, aiming to contribute to the development of more resilient and efficient datacenter networks.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Meter: The counter or data structure used to count the number of packets that pass through in a given time interval.¶
Path Information Table: The table maintained by the sender that contains information about the available paths and their associated metrics.¶
Upstream Meter (UM): The meter used to measure the number of packets passing through the upstream egress port of a link.¶
Downstream Meter (DM): The meter used to measure the number of packets passing through the downstream ingress port of a link.¶
FDM-U: The FDM agent deployed on the upstream switch; it generates request (probe) packets to collect UM and DM data.¶
FDM-D: The FDM agent deployed on the downstream switch; it generates response packets to feed back UM and DM data.¶
Traditional network failure detection methods generate probe packets through the control plane (such as BFD [RFC5880]), treating the network data plane as a "black box". If there is no response to a probe, it is assumed that a link failure has occurred, without the ability to distinguish between fault-induced packet loss and non-fault packet loss (such as congestion loss, policy loss, etc.). FFT models the packet processing logic in the data plane as a white box, analyzing all types of packet loss and designing corresponding statistical methods. As shown in Figure 1, FFT deploys packet counters at both ends of a link, which tally the total number of packets passing through as well as the number of non-fault packet losses, periodically comparing the two sets of counters to precisely measure fault-induced packet loss. This method operates entirely in the data plane, with probe packets directly generated by programmable network chips, thus allowing for a higher frequency of probes and the ability to detect link failures within a millisecond.¶
After detecting a link failure, FFT enables fast path switching for traffic in the data plane by combining INT with source routing. As shown in Figure 1, after a switch detects a link failure, it promptly notifies the sender of the failure information using INT technology; the sender then quickly switches the traffic to another available path using source routing, based on a path information table maintained in advance. All processes of this method are completed in the data plane, allowing traffic recovery time to be controlled within a few RTTs (on the order of milliseconds).¶
The fast fault tolerance architecture involves accurately detecting link failures within the network, distinguishing between packet losses caused by failures and normal packet losses, and then having switches convey failure information back to the end hosts via INT [RFC9232]. The end hosts, in turn, utilize SR (e.g., SRv6 [RFC8986] and SR-MPLS [RFC8660]) to reroute traffic. Therefore, the fast fault tolerance architecture comprises three processes: failure detection, failure notification, and traffic rerouting.¶
This document designs a failure detection mechanism (FDM) based on packet counters, leveraging the programmable data plane. As shown in Figure 2, this mechanism employs counters at both ends of a link to tally packet losses. Adjacent switches can thus collaborate to detect failures of any type (including gray failures), and the mechanism is capable of accurately distinguishing non-failure packet losses, thus avoiding false positives.¶
FDM places a pair of counter arrays on two directly connected programmable switches to achieve rapid and accurate failure detection. Figure 2 illustrates the deployment locations of these counters, which include two types of meter arrays: (1) the Upstream Meter (UM) is positioned at the beginning of the egress pipeline of the upstream switch; (2) the Downstream Meter (DM) is located at the end of the ingress pipeline of the downstream switch. Each meter records the number of packets passing through. With this arrangement, the difference between UM and DM represents the number of packets lost on the link. It is important to note that packets dropped due to congestion in the switch buffers are not counted, as the counters do not cover the buffer areas.¶
Furthermore, to exclude packet losses caused by non-failure reasons, each meter array also includes counters that tally the number of non-failure packet losses (e.g., TTL expiry). FDM is therefore capable of accurately measuring the total number of failure-induced packet losses occurring between UM and DM, including losses due to physical device failures (e.g., cable dust or link jitter) and control plane oscillations (e.g., route lookup misses).¶
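The following non-normative sketch illustrates one possible shape of such a meter array. The names (MeterArray, packets_seen, non_fault_drops) are illustrative assumptions only; in a real deployment these would be realized as counters or registers in the programmable switch pipeline rather than in Python.¶
   # Non-normative sketch of a UM/DM meter array, assuming illustrative names.
   from dataclasses import dataclass, field

   @dataclass
   class MeterArray:
       """Counters maintained at one end of a monitored link."""
       packets_seen: int = 0                                 # total packets passing the meter
       non_fault_drops: dict = field(default_factory=dict)   # drop reason -> count

       def count_packet(self) -> None:
           self.packets_seen += 1

       def count_non_fault_drop(self, reason: str) -> None:
           # e.g., TTL expiry, or an ACL drop in the downstream ingress pipeline
           self.non_fault_drops[reason] = self.non_fault_drops.get(reason, 0) + 1

       def snapshot_and_reset(self) -> tuple:
           """Return (packets, non-fault drops) for the current interval and start a new one."""
           snap = (self.packets_seen, sum(self.non_fault_drops.values()))
           self.packets_seen = 0
           self.non_fault_drops.clear()
           return snap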
Figure 3 illustrates the deployment method of FDM across the entire datacenter network. Similar to the BFD mechanism, FDM needs to cover every link in the network. Therefore, each link in the network requires the deployment of a pair of UM and DM. It is important to note that although only the unidirectional deployment from Switch#1 to Switch#2 is depicted in Figure 3, Switch#2 also sends traffic to Switch#1. To monitor the link from Switch#2 to Switch#1, FDM deploys a UM on the egress port of Switch#2 and a DM on the ingress port of Switch#1. Consequently, FDM utilizes two pairs of UM and DM to monitor a bidirectional link.¶
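The bidirectional deployment described above can be summarized by the following minimal sketch; the function and field names are hypothetical and serve only to show that two UM/DM pairs are required per bidirectional link.¶
   # Non-normative sketch: meter placement for one bidirectional link.
   def meters_for_bidirectional_link(sw_a, sw_b):
       """Two UM/DM pairs are needed to monitor one bidirectional link."""
       return [
           {"UM": (sw_a, "egress"), "DM": (sw_b, "ingress")},   # direction A -> B
           {"UM": (sw_b, "egress"), "DM": (sw_a, "ingress")},   # direction B -> A
       ]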
As shown in Figure 2, the FDM agent in the upstream switch (FDM-U) periodically sends request packets to the opposite end of the link. These request packets record the relevant UM and DM data along the path through the INT mechanism. Upon receiving a request packet, the FDM agent in the downstream switch (FDM-D) immediately turns it into a response packet and bounces it back, so that the packet carrying the UM and DM data returns to FDM-U. FDM-U then processes the response packet and calculates the packet loss rate of the link over the past interval. If FDM-U repeatedly fails to receive a response packet, indicating that either the request or the response packets are being lost, FDM-U considers the packet loss rate of that link to be 100%; this is how black-hole failures on the link are detected. Otherwise, if the packet loss rate exceeds a threshold (e.g., 5%) for an extended period, FDM-U marks that outgoing link as failed.¶
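A minimal sketch of the FDM-U decision logic described above is shown below. The threshold values, the miss limit, and the helper names (send_request, wait_for_response, mark_failed) are assumptions made for illustration; they are not normative parameters of this mechanism.¶
   # Non-normative sketch of the FDM-U control loop; parameters are illustrative.
   LOSS_THRESHOLD = 0.05   # e.g., 5% loss rate threshold
   MISS_LIMIT = 3          # consecutive missing responses => treat as black-hole failure
   FAIL_ROUNDS = 3         # consecutive over-threshold rounds => mark link as failed

   def evaluate_interval(um_packets, um_drops, dm_packets, dm_drops):
       """Loss rate for one interval, excluding non-fault drops recorded by UM and DM."""
       if um_packets == 0:
           return 0.0
       fault_losses = um_packets - um_drops - dm_packets - dm_drops
       return max(fault_losses, 0) / um_packets

   def fdm_upstream_loop(link):
       misses = 0
       bad_rounds = 0
       while True:
           link.send_request()                  # request packet carries UM/DM data via INT
           response = link.wait_for_response()  # hypothetical helper; returns None on timeout
           if response is None:
               misses += 1
               if misses >= MISS_LIMIT:
                   link.mark_failed()           # treated as 100% loss: black-hole failure
                   return                       # recovery probing takes over from here
               continue
           misses = 0
           loss = evaluate_interval(*response)  # (um_packets, um_drops, dm_packets, dm_drops)
           bad_rounds = bad_rounds + 1 if loss > LOSS_THRESHOLD else 0
           if bad_rounds >= FAIL_ROUNDS:
               link.mark_failed()
               return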
To ensure the correctness of the packet loss rate statistics, FDM must ensure that the packets recorded by UM and DM belong to the same batch. On closer analysis, request packets naturally provide batch synchronization: FDM only needs to reset the counters upon receiving a request packet and then start counting the new batch. Specifically, since packets between two directly connected ports do not get reordered, the sequence of packets passing through UM and DM is consistent. As shown in Figure 4, the request packets delimit successive intervals, and each meter records the number of packets in the corresponding interval. When such a request packet reaches the downstream switch, the DM records the number of packets for the same interval. Thus, UM and DM count the same batch of packets. However, the loss of a request packet would disrupt FDM's batch synchronization. To avoid this, FDM configures active queue management so that request packets are not dropped during buffer congestion; if a request packet is still lost, it must be due to a fault.¶
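The interval bookkeeping described above can be sketched as follows, reusing the MeterArray sketch from earlier. The handler and field names are assumptions for illustration; in practice this logic would run in the switch pipeline, not in host software.¶
   # Non-normative sketch of request-packet-driven batch synchronization.
   # Because packets between two directly connected ports stay in order, a
   # request packet delimits the same batch at both the UM and the DM.

   def on_packet_upstream(pkt, um, request_queue):
       if pkt.is_request:
           # Close the current interval at the UM, carry its snapshot in the
           # request packet (via INT), and start counting the next batch.
           pkt.um_snapshot = um.snapshot_and_reset()
           request_queue.send_protected(pkt)   # AQM configured not to drop request packets
       else:
           um.count_packet()

   def on_packet_downstream(pkt, dm, fdm_d):
       if pkt.is_request:
           # The same request closes the same interval at the DM, so the UM and
           # DM snapshots describe the same batch of packets.
           pkt.dm_snapshot = dm.snapshot_and_reset()
           fdm_d.bounce_as_response(pkt)
       else:
           dm.count_packet()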
To ensure stable network operation after failure recovery, FDM also periodically monitors the recovery status of links. This requires the FDM-U to send a batch of test packets, triggering UM and DM to count. Then, the FDM-U sends request packets to collect data from UM and DM. If the link's packet loss rate remains below the threshold for an extended period, FDM-U will mark the link as healthy. To reduce the bandwidth overhead of FDM, considering that the detection of failure recovery is not as urgent as failure detection, FDM can use a lower recovery detection frequency, such as once every second.¶
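A corresponding sketch of the lower-frequency recovery probing is given below; the one-second period, the number of clean rounds required, and the helper names are illustrative assumptions rather than normative values.¶
   # Non-normative sketch of recovery monitoring at a reduced frequency.
   import time

   RECOVERY_PERIOD_S = 1.0   # recovery detection is less urgent, e.g., once per second
   CLEAN_ROUNDS = 5          # consecutive below-threshold rounds before marking healthy
   LOSS_THRESHOLD = 0.05

   def fdm_recovery_loop(link):
       clean = 0
       while link.is_marked_failed():
           link.send_test_packets()           # trigger UM and DM counting on the failed link
           link.send_request()                # collect UM/DM data as in normal detection
           loss = link.last_measured_loss()   # hypothetical helper
           clean = clean + 1 if loss is not None and loss <= LOSS_THRESHOLD else 0
           if clean >= CLEAN_ROUNDS:
               link.mark_healthy()
           time.sleep(RECOVERY_PERIOD_S)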
This section presents an example of how FDM calculates the packet loss rate of a link. Assume that 100 packets pass through the upstream switch UM, which records [100,0], with 0 representing no non-fault-related packet loss. Suppose 8 packets are dropped on the physical link and 2 packets are dropped at the ingress pipeline of the downstream switch due to ACL rules. Then, the DM records [90,2], where 90 represents the number of packets that passed through DM, and 2 represents the number of packets dropped due to non-fault reasons. Finally, by comparing the UM with DM, FDM calculates the packet loss rate of the link as 8% ((100-90-2)/100), rather than 10%.¶
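The worked example above can be reproduced with the following short, non-normative calculation.¶
   # Reproducing the worked example above.
   um_pass, um_non_fault = 100, 0   # Upstream Meter records [100, 0]
   dm_pass, dm_non_fault = 90, 2    # Downstream Meter records [90, 2]

   fault_losses = um_pass - um_non_fault - dm_pass - dm_non_fault   # 100 - 0 - 90 - 2 = 8
   loss_rate = fault_losses / um_pass                               # 0.08, i.e., 8% rather than 10%
   print(loss_rate)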
Traditional control plane rerouting schemes require several steps after detecting a failure, including failure notification, route learning, and routing table updates, which can take several seconds to modify traffic paths. Data plane rerouting schemes, on the other hand, cannot prepare alternative routes for all possible failure scenarios in advance. To achieve fast rerouting in the data plane, FFT combines INT with source routing to quickly reroute traffic.¶
Assume that the sender periodically sends INT probe packets along the path of the traffic to collect fine-grained network information, such as port rates and queue lengths. After a switch detects a link failure, it promptly notifies the sender of the failure within the INT probe. Specifically, when a probe emitted by an end host is about to be forwarded onto an egress link that has failed, FFT immediately bounces the probe back within the data plane and marks the failure status in the probe. The probe carrying the failure status then returns to the sender.¶
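The switch-side behavior described above can be sketched as follows. The port and header field names are hypothetical, and in practice this logic would be expressed in the switch's data plane program rather than in Python.¶
   # Non-normative sketch: bounce an INT probe whose egress link has failed.
   def forward_probe(probe, egress_port, link_status):
       if link_status.get(egress_port) == "failed":
           # Mark the failure in the probe and send it straight back toward the
           # sender within the data plane, instead of forwarding it onto the
           # failed link.
           probe.failure_flag = True
           probe.failed_hop = egress_port
           probe.src, probe.dst = probe.dst, probe.src   # reverse the probe
           return "bounce"
       return "forward"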
To enable sender-driven fast rerouting, the sender needs to maintain a path information table in advance so that it can quickly switch to another available path upon detecting a network failure. Specifically, within the transport layer protocol stack of the sender, this document designs a Path Management Mechanism (PMM), which periodically probes all available paths to other destinations. This information can also be obtained through other means, such as from an SDN controller. For a new flow, the sender selects an optimal available path from the path information table (randomly, if several candidates are equally good) and uses source routing (e.g., SRv6 [RFC8986] and SR-MPLS [RFC8660]) to control the path of this flow. Similarly, the sender also controls the path of the INT probes using source routing, allowing them to probe the path taken by the traffic flow. The fine-grained network information brought back by these probes can be used for congestion control, such as HPCC [hpcc].¶
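A minimal sketch of the sender-side path information table and path selection under PMM is shown below. The table layout, metric, and selection policy are illustrative assumptions; in practice the segment lists would be encoded as SRv6 or SR-MPLS headers.¶
   # Non-normative sketch of a sender-side path information table.
   import random
   from dataclasses import dataclass

   @dataclass
   class PathEntry:
       segments: list          # SR segment list (e.g., SRv6 SIDs or SR-MPLS labels)
       healthy: bool = True
       metric: float = 0.0     # e.g., latency or load reported by INT probes

   class PathInfoTable:
       def __init__(self):
           self.paths = {}     # destination -> list of PathEntry

       def select(self, dst):
           """Pick one of the best available paths for a new flow."""
           candidates = [p for p in self.paths.get(dst, []) if p.healthy]
           if not candidates:
               return None
           best = min(p.metric for p in candidates)
           return random.choice([p for p in candidates if p.metric == best])
Keeping the table on the sender, rather than querying a controller at failure time, is what allows the switch-over to stay entirely in the data path of the end host.¶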
When the above mechanism takes effect and the INT information makes the sender aware of a failure on the path, the sender immediately marks this path as faulty in the path information table and chooses another available path, modifying the source routing headers of both the data packets and the INT probes accordingly. To stay aware of the availability of other paths, PMM periodically probes them and updates the path information table, including entries for paths entering and recovering from failure.¶
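Building on the sketch above, the sender's reaction to an INT-reported failure could look like the following; the handler and the header update are illustrative only.¶
   # Non-normative sketch of the sender's failover reaction.
   def on_failure_notification(table, dst, failed_path, flows):
       """Mark the reported path as faulty and move affected flows to another path."""
       failed_path.healthy = False
       new_path = table.select(dst)
       if new_path is None:
           return                              # no alternative path currently known
       for flow in flows:
           if flow.path is failed_path:
               flow.path = new_path
               # Rewrite the source routing header (and the INT probes' header)
               # so subsequent packets follow the new segment list.
               flow.sr_header.segments = new_path.segments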
TBD.¶
This document makes no request of IANA. Note to RFC Editor: this section may be removed on publication as an RFC.¶
TBD.¶