| Internet-Draft | IBCS Datapath Processing | March 2026 |
| Tian, et al. | Expires 2 September 2026 | [Page] |
In-band congestion signaling protocols, such as Congestion Signaling (CSIG) and High Precision Congestion Control (HPCC++), require intermediate Network Elements (NEs) to actively parse scalar congestion metrics from packet headers, evaluate them against local link states, and conditionally rewrite these fields before transmission. To ensure end-to-end algorithmic consistency and avoid unintended interactions with routing topologies (e.g., packet reordering), the datapath of these NEs must adhere to a standardized logical processing model.¶
This document defines the normative datapath processing architecture for Network Elements participating in In-Band Congestion Signaling (IBCS). By establishing abstract topological roles (Edge vs. Transit NEs) and standardizing the "Compare-and-Replace" operational paradigm, this specification abstracts the signal update logic from hardware-specific pipelines. It guarantees strict orthogonality between congestion signaling and Equal-Cost Multi-Path (ECMP) routing invariants, supporting diverse congestion metrics across multi-vendor deployments.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 2 September 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Modern high-speed data center networks increasingly rely on fine-grained, In-Band Congestion Signaling (IBCS) to achieve ultra-low latency and high throughput. Protocols being discussed in the IETF, such as CSIG [I-D.ravi-ippm-csig] and HPCC++ [I-D.miao-ccwg-hpcc], utilize packet headers to convey link-level congestion telemetry directly to end-hosts. A fundamental paradigm of these proposals is the "Compare-and-Replace" operation: as a packet traverses the network, each transit Network Element (NE) compares the congestion signal carried in the packet against its own local congestion metric. If the local NE represents a more severe bottleneck, it overwrites the signal field with its local metric.¶
Unlike traditional stacking-based telemetry (such as IOAM [RFC9197]) where metadata is appended hop-by-hop, the Compare-and-Replace paradigm maintains a constant header size, avoiding Maximum Transmission Unit (MTU) exhaustion. However, updating a packet header on-the-fly introduces significant architectural challenges for datapath pipelines. If the processing behavior is not rigorously defined, modifying packet fields can inadvertently alter hash-based load balancing (ECMP), leading to micro-burst flow reordering. Furthermore, inconsistent state handling at domain boundaries can result in spoofed or corrupted signals reaching the congestion control algorithm.¶
This document specifies the normative datapath behavior and abstract processing model required to support IBCS safely and efficiently. It introduces a role-based architecture (differentiating edge initialization from transit evaluation) and specifies a protocol-agnostic extremum evaluation model (e.g., evaluating minimum available bandwidth or maximum queue delay). By establishing this unified architectural framework, this document aims to ensure operational interoperability and robust signal delivery across heterogeneous network infrastructures.¶
A general mechanism where congestion state metrics are embedded within the data packet header and dynamically updated by Network Elements along the forwarding path.¶
The congestion signal value currently carried within the packet header. It represents the most severe bottleneck encountered so far on the packet's path.¶
The locally computed congestion metric at the transit NE's egress port (e.g., residual bandwidth, queue utilization, or link delay).¶
The abstract logical entity within a Network Element's datapath responsible for evaluating P_Metric against L_Metric and executing the conditional header rewrite.¶
The mathematical comparison operator (MIN or MAX) dictated by the specific signaling protocol's semantics to determine the tightest bottleneck.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
This document defines abstract datapath behaviors applicable to single administrative domains (e.g., autonomous data center fabrics) deploying end-to-end congestion control loops based on fixed-length, mutable in-band signals.¶
This specification explicitly differentiates itself from In-situ Operations, Administration, and Maintenance (IOAM) [RFC9197] and INT [P4-INT]. While IOAM focuses on comprehensive visibility through metadata stacking (Trace Option), the behavior described herein strictly addresses fixed-length "Compare-and-Replace" updates designed specifically for fast-path congestion control algorithms, where per-hop state history is discarded in favor of the path's bottleneck state.¶
To guarantee the integrity of the IBCS loop, Network Elements MUST apply different processing rules depending on their topological placement relative to the signaling domain. This document defines three distinct abstract NE roles:¶
The Ingress Edge NE operates at the boundary where traffic enters the trusted IBCS domain (e.g., a Top-of-Rack switch receiving traffic from a bare-metal server or an untrusted tenant VM).¶
The Ingress Edge NE MUST inspect arriving packets for existing IBCS fields.
To prevent signal spoofing attacks, it MUST act as a signal scrubber: any recognized
IBCS field arriving from an untrusted interface MUST be reset to its protocol-defined
UNINITIALIZED state before further processing. Only after initialization
may the packet be passed to the SUF for its first local evaluation.¶
Transit NEs operate entirely within the trusted boundaries of the IBCS domain
(e.g., Spine or Core switches). Transit NEs implicitly trust the P_Metric
carried in the packet header.¶
A Transit NE MUST NOT unconditionally reset or scrub the P_Metric.
Its sole responsibility regarding the IBCS field is to execute the strict Compare-and-Replace
logic defined in Section 6, ensuring that the metric is only
overwritten if the local datapath represents a tighter bottleneck.¶
The Egress Edge NE operates at the boundary where traffic exits the trusted IBCS domain. If the destination is outside the administrative domain and no explicit IBCS peering agreement exists, the Egress Edge NE SHOULD strip or zero-out the IBCS field to prevent internal telemetry leakage to external observers.¶
Regardless of the physical hardware pipeline architecture (e.g., run-to-completion, multi-stage ASIC, or programmable switch), the externally observable behavior of any IBCS-enabled Network Element MUST conform to the following abstract sequence. This model ensures that routing invariants are preserved.¶
The datapath parses the designated IBCS header field to extract the current
P_Metric. If the NE does not recognize the protocol or the IBCS field
is absent, the packet MUST bypass all subsequent IBCS update logic and be forwarded
opaquely.¶
The packet undergoes routing, access control list (ACL) application, and Equal-Cost Multi-Path (ECMP) or Link Aggregation Group (LAG) path selection.¶
CRITICAL REQUIREMENT: The IBCS signal update process MUST be strictly orthogonal
to path selection. The datapath MUST NOT mutate the P_Metric or any related
congestion header fields prior to or during the hash computation phase. Altering
header values before ECMP hashing violates fundamental flow invariance, causing
packets within the same microflow to traverse asymmetric paths, resulting in
TCP/transport reordering degradation.¶
Once the deterministic egress port is resolved, the Signal Update Function (SUF)
retrieves the real-time L_Metric specifically associated with that port.
The SUF evaluates P_Metric against L_Metric and conditionally
commits the update to the packet header. This mutation MUST be treated as an atomic
transaction applied immediately prior to serialization on the wire.¶
+-------------+ +--------------------+ +---------------+ | | | Routing, QoS, & | | | | Header | --> | ECMP Path Selection| --> | Signal Update |--> Tx | Resolution | | (Hash Computation) | | Function (SUF)| | | | | | | +-------------+ +--------------------+ +---------------+ Extract Strictly Orthogonal Compare & Replace; P_Metric (No Header Mutation) Checksum Update
Different IBCS protocols characterize congestion semantics differently. For instance,
CSIG signals Minimum Available Bandwidth (requiring a MIN operator), whereas HPCC++
may signal Maximum Queue Depth (requiring a MAX operator). The SUF MUST implement a
configurable Extremum_Operator to accommodate the semantics of the deployed protocol.¶
The normative state-machine logic executed by the SUF is defined as follows:¶
Function SUF_Evaluate(P_Metric, L_Metric, Extremum_Operator):
// Rule 1: Initialization Handling
IF P_Metric == UNINITIALIZED:
Rewrite packet header: P_Metric = L_Metric
RETURN
// Rule 2: Protocol-Specific Bottleneck Evaluation
IF Extremum_Operator == MIN:
IF L_Metric < P_Metric:
Rewrite packet header: P_Metric = L_Metric
ELSE IF Extremum_Operator == MAX:
IF L_Metric > P_Metric:
Rewrite packet header: P_Metric = L_Metric
// Rule 3: Preservation
// If local state is NOT the tighter bottleneck,
// the header MUST NOT be modified.
¶
Atomicity: The rewrite operation MUST be robust. Partial byte updates or malformed header emissions MUST NOT occur, even under extreme internal buffer exhaustion or exception path processing.¶
If the IBCS field is encapsulated within an IPv4 or UDP header, the SUF MUST update the corresponding Layer 3 / Layer 4 checksums. To achieve line-rate processing without introducing significant latency jitter, incremental checksum calculation [RFC1141] is highly RECOMMENDED.¶
If the IBCS field is embedded in a Layer 2 extension or a custom tag (as commonly deployed in closed data center fabrics), IP/UDP checksum modifications are bypassed, substantially reducing the silicon processing overhead.¶
The stability of the congestion control loop is inherently tied to how L_Metric
is generated. While the specific hardware counter implementation is out of scope for this
document, the NE MUST guarantee that L_Metric is relatively stable and decoupled
from instantaneous micro-burst noise.¶
Network Elements SHOULD provide a configurable moving average or sampling window for L_Metric.
The optimal sampling interval typically corresponds to the baseline Round-Trip Time (RTT)
of the network domain (e.g., 5 to 50 microseconds). Sampling too frequently causes
signal oscillation; sampling too slowly creates stale telemetry that dampens transport responsiveness.¶
If a Network Element experiences an internal architectural fault where the real-time
L_Metric from the egress port becomes temporarily unavailable to the SUF,
the NE MUST NOT drop the packet. Instead, it MUST execute a fail-open behavior,
forwarding the packet with the existing P_Metric completely unmodified.
This guarantees that transient local datapath faults do not sever the end-to-end signaling loop.¶
This document has no IANA actions.¶