Network Working Group                                      N. Davis, Ed.
Internet-Draft                                                     Ciena
Intended status: Informational                            A. Farrel, Ed.
Expires: 30 May 2025                                  Old Dog Consulting
                                                                 T. Graf
                                                                Swisscom
                                                                   Q. Wu
                                                                  Huawei
                                                                   C. Yu
                                                     Huawei Technologies
                                                        26 November 2024


        Some Key Terms for Network Fault and Problem Management
                     draft-ietf-nmop-terminology-09

Abstract

   This document sets out some terms that are fundamental to a common
   understanding of network fault and problem management within the
   IETF.

   The purpose of this document is to bring clarity to discussions and
   other work related to network fault and problem management, in
   particular to YANG models and management protocols that report, make
   visible, or manage network faults and problems.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 30 May 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors.  All rights reserved.


Davis, et al.              Expires 30 May 2025                  [Page 1]

Internet-Draft          Network Fault Terminology          November 2024


   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents (https://trustee.ietf.org/
   license-info) in effect on the date of publication of this document.
   Please review these documents carefully, as they describe your rights
   and restrictions with respect to this document.  Code Components
   extracted from this document must include Revised BSD License text as
   described in Section 4.e of the Trust Legal Provisions and are
   provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   2
   2.  Usage of Terms  . . . . . . . . . . . . . . . . . . . . . . .   3
   3.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . .   3
     3.1.  Context Terminology . . . . . . . . . . . . . . . . . . .   4
     3.2.  Core Terms  . . . . . . . . . . . . . . . . . . . . . . .   5
     3.3.  Other Terms . . . . . . . . . . . . . . . . . . . . . . .   8
   4.  Workflow Explanations . . . . . . . . . . . . . . . . . . . .   8
   5.  Security Considerations . . . . . . . . . . . . . . . . . . .  13
   6.  Privacy Considerations  . . . . . . . . . . . . . . . . . . .  13
   7.  IANA Considerations . . . . . . . . . . . . . . . . . . . . .  13
   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . .  13
   Informative References  . . . . . . . . . . . . . . . . . . . . .  14
   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  15

1.  Introduction

   Successful operation of large or busy networks depends on effective
   network management.  Network management comprises a virtuous circle
   of network control, network observability, network analytics, network
   assurance, and back to network control.  Network fault and problem
   management [RFC6632] is an important aspect of network management and
   control solutions.  It deals with the detection, reporting,
   inspection, isolation, correlation, and management of events within
   the network.  The intention is to focus on those events that have a
   negative effect on the network's ability to forward traffic according
   to expected behavior.  Fault and problem management extends to
   include actions taken to determine the causes of problems and to work
   toward recovery of expected network behavior.

   A number of work efforts within the IETF seek to provide components
   of a fault management system, such as YANG models or management
   protocols.  It is important that a common terminology is used so that
   there is a clear understanding of how the elements of the management
   and control solutions fit together, and how faults and problems will
   be handled.


   This document sets out some terms that are fundamental to a common
   understanding of network fault and problem management.  While
   "faults" and "problems" are concepts that apply at all levels of
   technology in the Internet, the scope of this document is restricted
   to the network layer and below, hence this document is specifically
   about "network fault and problem management."  The concept of
   "incidents" is also touched on in this document, where an incident
   results from one or more problems and is the disruption of a network
   service.

   Note that some useful terms are defined in [RFC3877] and [RFC8632].
   The definitions in this document are informed by those documents, but
   they are not dependent on that prior work.

2.  Usage of Terms

   The terms defined in this document are principally intended for
   consistent use within the IETF.  Where similar concepts are described
   in other bodies, an attempt has been made to harmonize with those
   other descriptions, but care is needed where terms are not used
   consistently between bodies or where terms are applied outside the
   network layer.  If other bodies find the terminology defined in this
   document useful, they are free to use it.

   Other documents may make use of the terms as defined in this
   document.  Such documents should capitalize the terms as in this
   document to help distinguish them from colloquial uses, and should
   include an early section listing the terms inherited from this
   document with a citation.

3.  Terminology

   This section contains key terms.  It is split into three subsections.

   *  Section 3.1 contains terms that help to set the context for the
      incident and fault management systems.

   *  Section 3.2 includes specific and detailed core terms that will be
      used in other documents that describe elements of the incident and
      fault management systems.

   *  Section 3.3 provides two further terms that may be helpful.


3.1.  Context Terminology

   This section includes some terminology that helps describe the
   context for the rest of this work.  The terms may be viewed as a
   cascaded hierarchy with each subsequent term building on the
   previous.  The definitions are deliberately kept relatively terse.
   Further documents may expand on these terms without loss of
   specificity.  Such contextualization (if any) should be highlighted
   clearly in those documents.

   Network Telemetry:  This is defined in [RFC9232] and describes the
      process of collecting operational network data categorized into
      network planes.  Data collected through the Network Telemetry
      process does not contain network or device configuration
      information.  Nor does it contain any data related to service
      definitions (i.e., "intent" per Section 3.1 of [RFC9315]).

   Network Monitoring:  This is the process of keeping a continuous
      record of a resource, function, or connectivity service.  The term
      'monitoring' focuses on one single dimension and measurement in
      dimensional data modeling [DimensionalModeling].  This could be,
      for example, a measurement of the service state, a network
      function measurement, or the state of a network function of a
      resource.

   Network Analytics:  Network Analytics is the process of deriving
      analytical insights into or from operational network data.  A
      process could be a piece of software, a system, or a human that
      analyzes operational data and outputs new analytical data, ideally
      metadata (a symptom, for example), which is related to the
      operational data.

   Network Observability:  This is the enablement of network behavioral
      assessment through analysis of observed operational network data
      (logs, alarms, traces, etc.) with the aim of detecting symptoms of
      network behavior and identifying anomalies and their causes.
      Network Observability begins with information gathered using
      Network Monitoring tools, which may be further enriched with
      other operational data (e.g., change records).  The expected
      outcome of the observability processes is identification and
      analysis of deviations in observed state versus the expected state
      of a network.

   Thus, there is a cascaded sequence where:

   *  Network Telemetry: the process of collecting operational data from
      a network.


   *  Network Monitoring: the process of creating/keeping a record of
      data gathered in Network Telemetry.

   *  Network Analytics: the process of deriving insight through the
      data recorded in Network Monitoring.

   *  Network Observability: the process of enabling behavioral
      assessment of a network through Network Analytics.
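
   As a rough illustration of this cascade (a minimal sketch, not part
   of the terminology), each stage can be viewed as a function that
   consumes the output of the previous one.  The names Sample, monitor,
   analyze, and observe, the use of a mean as the derived insight, and
   the tolerance check are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    """One item of operational data collected by Network Telemetry."""
    resource: str
    characteristic: str
    timestamp: float
    value: float


def monitor(record, sample):
    """Network Monitoring: keep a continuous record of collected data."""
    record.append(sample)


def analyze(record):
    """Network Analytics: derive insight (here, a mean) from the record."""
    grouped = {}
    for s in record:
        grouped.setdefault((s.resource, s.characteristic), []).append(s.value)
    return {key: mean(values) for key, values in grouped.items()}


def observe(insight, expected, tolerance):
    """Network Observability: report deviations from the expected state."""
    return [
        key
        for key, value in insight.items()
        if key in expected and abs(value - expected[key]) > tolerance
    ]


# Hypothetical latency samples standing in for Network Telemetry output.
record = []
for t, v in enumerate([10.0, 12.0, 50.0]):
    monitor(record, Sample("if-1", "latency-ms", float(t), v))

insight = analyze(record)
deviations = observe(insight, {("if-1", "latency-ms"): 10.0}, tolerance=5.0)
```

   Here the mean latency (24.0) differs from the expected 10.0 by more
   than the tolerance, so the interface is reported as deviating.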

3.2.  Core Terms

   The terms are presented below in an order that is intended to flow
   such that it is possible to gain understanding reading top to bottom.
   The figures and explanations in Section 4 may aid understanding of
   the terms set out here.

   System:  An assembly of components that exhibits some behavior.

   Resource:  A component of a System.

      Resource is a recursive concept so that a Resource may be a
      collection of other Resources (for example, a network node
      comprises a collection of interfaces).

   Characteristic:  Observable or measurable aspect or behavior
      associated with a Resource.

      *  A Characteristic may be considered with respect to the concept
         of dimensional modeling that is built on facts (see 'Value',
         below) and dimensions (the contexts and descriptors that
         identify and give meaning to the facts).

      *  The term "Metric" is another word for "Characteristic".

   Value:  A Value is the measurement of a Characteristic associated
      with a Resource.  It may be in the form of a categorization (e.g.,
      high or low), an integer (e.g., a count), a continuous variable
      (e.g., an analog measurement), etc.

   Condition:  A Condition is an interpretation of the Values of a set
      of Characteristics of a Resource (with respect to working order or
      some other aspect relevant to the Resource purpose/application).

   Change:  In the context of Network Monitoring, a Change is the
      variation in the Value of a Characteristic associated with a
      Resource.


      *  Not all Changes are noteworthy (i.e., they do not have
         Relevance).

      *  Perception of Change depends upon Detection, the sampling
         rate/accuracy/detail, and perspective.

   Detect:  To notice the presence of something (State, Change,
      activity, form, etc.).

      *  Hence also to notice a Change (from the perspective of an
         observer such as a monitoring system).

   Event:  The variation in Value of a Characteristic of a Resource at a
      measured instant in time (i.e., the period is negligible).

      *  Compared with a Change, which may be over a period of time, an
         Event happens at a distinct moment in time.

   State:  A particular Condition that something (e.g., a Resource) is
      in (at a specific time).

      *  While a State may be observed at a specific moment in time, it
         is actually achieved by summarizing the measurement over time
         in a process sometimes called State compression.

   Relevance:  Consideration of an Event, State, or Value (through the
      application of policy, relative to a specific perspective, intent,
      and in relation to other Events, States, and Values) to determine
      whether it is of note to the system that controls or manages the
      network.

   Occurrence:  An Event with Relevance.

      A particular Change with Relevance.

      *  An Occurrence may be an aggregation or abstraction of finer-
         grain Occurrences.

      *  Applies to all scales and scopes, i.e., is essentially fractal
         (can recurse indefinitely).

      *  Note that Occurrence is used here with respect to the temporal
         dimension.

   Fault:  An Occurrence that is not desired/required (as it may be
      indicative of a current or future undesired State).  A Fault can
      generally be associated with a known cause.  See [RFC8632] for a
      more detailed discussion of network faults.


   Problem:  A State that is regarded as undesirable and that may
      require remedial action.  A Problem cannot necessarily be
      associated with a cause.  The resolution of a Problem does not
      necessarily act on the thing that has the Problem.

      *  Note that there is a historic aspect to the concept of a
         Problem.  The current State may be operational, but there could
         have been a failure that is unexplained, and the fact of that
         unexplained recent failure is a Problem.

      *  Note that whilst a Problem is unresolved it may continue to
         require attention.  A record of resolved Problems may be
         maintained in a log.

      *  Note that there may be a State which is considered to be a
         Problem from several perspectives.  For example, a loss of
         light State may cause multiple services to fail.  In this
         example, a State Change (so that the light recovers) may cause
         the Problem to be resolved from one perspective (the services
         are operational once more), but may leave the Problem
         unresolved from another perspective (because the loss of light
         has not been explained).  Further, in this example, there could
         be another development (the reason for the temporary loss of
         light is traced to a microbend in the fiber that is repaired)
         resulting in that unresolved Problem now being resolved.  But,
         in this example, this still leaves a further Problem unresolved
         (why did the microbend occur in the first place?).

   Incident:  A Network Incident is an undesired Occurrence such as an
      unexpected interruption of a network service, degradation of the
      quality of a network service, or the below-target performance of a
      network service.  An Incident results from one or more Problems,
      and a Problem may give rise to or contribute to one or more
      Incidents.  Greater discussion of Network Incidents, including
      Incident management, can be found in
      [I-D.ietf-nmop-network-incident-yang].

   Anomaly:  A (network) Anomaly is an unusual or unexpected Event or
      pattern in network data in the forwarding plane, control plane, or
      management plane that deviates from the normal, expected behavior.
      See [I-D.ietf-nmop-network-anomaly-architecture] for more details.

   Symptom:  An observable Characteristic, State, or Condition
      considered as an indication of a Problem or potential Problem.

   Cause:  The Events (Detected or otherwise) that gave rise to a Fault/
      Problem.


   Consolidation:  The process of considering multiple Faults, Problems,
      Symptoms, and their Causes to determine the underlying Causes.

   Alert:  An indication of a Fault.

   Alarm:  Per [RFC8632], an Alarm signifies an undesirable State in a
      Resource that requires corrective action.  From a management point
      of view, an Alarm can be seen as a State in its own right and the
      transition to this State is a Fault and may result in an Alert
      being issued.  The receipt of this Alert may give rise to a
      continuous indication (to a human operator) highlighting the
      potential or actual presence of a Problem.

3.3.  Other Terms

   Two other terms may be helpful:

   Transient:  A State, considered as a Problem, that persists for a
      limited amount of time before becoming resolved without direct
      action by an operator or by a system that controls or manages the
      network.

   Intermittent:  A State that is not continuous, but keeps occurring in
      some time frame.
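
   The distinction between these two terms can be sketched in code.
   The following is only an illustration under stated assumptions: the
   Episode record, the classify_episodes function, the "other" label,
   and the parameters time_frame and max_transient are hypothetical and
   are not defined by this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Episode:
    """One period during which an undesirable State was present."""
    start: float
    end: Optional[float]           # None while the State is still present
    operator_action: bool = False  # True if cleared by direct action


def classify_episodes(episodes, now, time_frame, max_transient):
    """Label a State history as intermittent, transient, or other."""
    recent = [e for e in episodes if e.start >= now - time_frame]
    if len(recent) >= 2:
        # The State is not continuous but keeps occurring.
        return "intermittent"
    if len(recent) == 1:
        e = recent[0]
        resolved_unaided = e.end is not None and not e.operator_action
        if resolved_unaided and e.end - e.start <= max_transient:
            # Persisted for a limited time and resolved without action.
            return "transient"
    return "other"


once = [Episode(100.0, 102.0)]
repeated = [Episode(70.0, 71.0), Episode(90.0, 91.0), Episode(105.0, None)]
ongoing = [Episode(100.0, None)]
fixed_by_operator = [Episode(100.0, 102.0, operator_action=True)]
```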

4.  Workflow Explanations

   The relationship between System, Resources, and Characteristics is
   shown in Figure 1.  A System comprises Resources, and Resources have
   Characteristics.


                                      Characteristics
                                             ^
                                             |
                                          Resources
                                             ^
                                             |
                                           System


            Figure 1: Relationship Between Elements of a System
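
   One possible way to represent the relationship in Figure 1 as data
   structures is sketched below.  The class and attribute names, and
   the example node and interface names, are assumptions chosen for
   illustration.  The recursive nature of a Resource is modeled with a
   list of child Resources.

```python
from dataclasses import dataclass, field


@dataclass
class Characteristic:
    """Observable or measurable aspect of a Resource."""
    name: str
    value: object = None  # the most recently measured Value, if any


@dataclass
class Resource:
    """A component of a System; may contain other Resources."""
    name: str
    characteristics: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def walk(self):
        """Yield this Resource and all nested Resources, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class System:
    """An assembly of Resources."""
    name: str
    resources: list = field(default_factory=list)

    def all_characteristics(self):
        """Return (resource name, characteristic name) pairs."""
        return [
            (res.name, ch.name)
            for top in self.resources
            for res in top.walk()
            for ch in res.characteristics
        ]


# A network node that comprises a collection of interfaces.
node = Resource(
    "node-1",
    [Characteristic("cpu-load", 0.4)],
    [
        Resource("eth0", [Characteristic("oper-status", "up")]),
        Resource("eth1", [Characteristic("oper-status", "down")]),
    ],
)
system = System("lab-network", [node])
pairs = system.all_characteristics()
```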

   The Value of a Characteristic of a Resource may change over time.
   Specific Changes in Value may be noticed at a specific time (as
   digital Changes), Detected, and treated as Events.  This is shown on
   the left of Figure 2.


   The center of Figure 2 shows how the Value of a Characteristic may
   change over time.  The Value may be Detected at specific times or
   periodically and give rise to States (and consequently State
   Changes).

   In practice, the Characteristic may vary in an analog manner over
   time as shown on the right-hand side of Figure 2.  The Value can be
   read or reported (i.e., Detected) periodically leading to Analog
   Values that may be deemed Values with Relevance, or may be evaluated
   over time as shown in Figure 6.


              Event                State                  Value

                ^                    ^                      ^
         Detect :             Detect :               Detect :
                :                    :                      :

           ^        ^          ^     ^     ^                   /\
           :        :          :     :     :                  /  \
           :        :          :     :     :             /\  /    \
            __    __               _____                /  \/
           |        |             |     |            /\/
         __|        |__       ____|     |____       /

        Change at a time     Change over time      Change over time


                   Figure 2: Characteristics and Changes
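
   The three views in Figure 2 can be approximated in code.  This is a
   sketch under assumptions: samples are (time, Value) pairs read
   periodically, the function names are hypothetical, and the mapping
   of a Value to a "high" or "normal" State at a mark of 0.8 is an
   arbitrary choice for the example.

```python
def detect_events(samples):
    """Return (time, old, new) for each Detected variation in Value."""
    events = []
    for (_, prev), (t, cur) in zip(samples, samples[1:]):
        if cur != prev:
            events.append((t, prev, cur))
    return events


def to_state(value, high_mark=0.8):
    """Interpret an analog Value as a State (a categorization)."""
    return "high" if value >= high_mark else "normal"


def state_changes(samples, high_mark=0.8):
    """Return (time, new State) whenever the derived State changes."""
    states = [(t, to_state(v, high_mark)) for t, v in samples]
    return [
        (t, cur)
        for (_, prev), (t, cur) in zip(states, states[1:])
        if cur != prev
    ]


# Hypothetical utilization readings taken once per time unit.
samples = [(0, 0.2), (1, 0.2), (2, 0.9), (3, 0.95), (4, 0.3)]
events = detect_events(samples)
changes = state_changes(samples)
```

   In this example three Events are Detected, but only two of them
   correspond to a Change of State, which reflects the point that not
   every variation in Value needs to be treated in the same way.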

   Figure 3 shows the workflow progress for Events.  As noted above, an
   Event is a Change in the Value of a Characteristic at a time.  The
   Event may be evaluated (considering policy, relative to a specific
   perspective, with a view to intent, and in relation to other Events,
   States, and Values) to determine if it is an Occurrence and possibly
   to indicate a Change of State.  An Occurrence may be undesirable (a
   Fault).  A Fault can cause an Alert to be generated, may be evidence
   of a Problem, and could directly indicate a Cause.  In some cases, an
   Alert may give rise to an Alarm highlighting the potential or actual
   presence of a Problem.


                              Alert- - - - > Alarm
                                ^
                                |
                                |     -----> Cause
                                |    |
                                |----------> Problem
                                |
                                |
                              Fault
                                ^
                                |
                                |
                                |
                            Occurrence
                                ^
                                |
                                |----------> State
                                |
                                |
                              Event


                    Figure 3: Event and Dependent Terms
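
   The Event workflow in Figure 3 might be sketched as follows.  This
   is an illustration only: the Event record, the representation of
   policy as lists of predicate functions, and the Alert text format
   are assumptions, and real systems would apply richer policy (for
   example, relating an Event to other Events, States, and Values).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A variation in the Value of a Characteristic at an instant."""
    resource: str
    characteristic: str
    old: object
    new: object
    time: float


def matches(event, policies):
    """True if any policy predicate accepts the Event."""
    return any(policy(event) for policy in policies)


def classify_event(event, relevance, undesired):
    """Return 'event', 'occurrence', or 'fault' for an Event."""
    if not matches(event, relevance):
        return "event"       # no Relevance: remains a plain Event
    if matches(event, undesired):
        return "fault"       # an Occurrence that is not desired
    return "occurrence"      # an Event with Relevance


def alert_for(event, relevance, undesired):
    """Return an Alert string for a Fault, otherwise None."""
    if classify_event(event, relevance, undesired) != "fault":
        return None
    return (f"ALERT {event.resource}: {event.characteristic} "
            f"{event.old} -> {event.new}")


# Hypothetical policy: status changes are relevant; going down is bad.
relevance = [lambda e: e.characteristic == "oper-status"]
undesired = [lambda e: e.new == "down"]

link_down = Event("eth0", "oper-status", "up", "down", 10.0)
link_up = Event("eth0", "oper-status", "down", "up", 20.0)
counter_tick = Event("eth0", "in-octets", 100, 101, 30.0)
```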

   Parallel to the workflow for Events, Figure 4 shows the workflow
   progress for States.  As shown in Figure 2, Change noted at a
   particular time gives rise to State.  The State may be deemed to have
   Relevance considering policy, relative to a specific perspective,
   with a view to intent, and in relation to other Events, States, and
   Values.  A State with Relevance may be deemed a Problem, or may
   indicate a Problem or potential Problem.

   Problems may be considered as Symptoms and may map directly or
   indirectly to Causes.  An Incident results from one or more Problems.
   An Alarm may be raised as the result of a Problem.


                             Alarm
                               ^
                               |     ------> Incident
                               |    |
                               |    |   ---> Cause
                               |    |  |
                           Problem---------> Symptom
                               ^
                               |
                               | Relevance
                               |
                               |
                             State


                    Figure 4: State and Dependent Terms
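
   The State workflow in Figure 4 can be sketched in a similar way.
   The Problem and Incident records, the set of undesirable States,
   and the mapping from Resources to the services that depend on them
   are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """A State regarded as undesirable."""
    resource: str
    state: str
    resolved: bool = False


@dataclass
class Incident:
    """Disruption of a network service resulting from Problems."""
    service: str
    problems: list = field(default_factory=list)


def problems_from_states(states, undesirable):
    """Apply Relevance: keep only States deemed undesirable."""
    return [
        Problem(resource, state)
        for resource, state in sorted(states.items())
        if state in undesirable
    ]


def incidents_from_problems(problems, service_map):
    """Group Problems by the services that depend on each Resource."""
    incidents = {}
    for problem in problems:
        for service in service_map.get(problem.resource, []):
            incident = incidents.setdefault(service, Incident(service))
            incident.problems.append(problem)
    return list(incidents.values())


# Hypothetical observed States and service dependencies.
states = {"fiber-7": "loss-of-light", "node-1": "ok", "node-2": "overheated"}
undesirable = {"loss-of-light", "overheated"}
service_map = {"fiber-7": ["vpn-a", "vpn-b"], "node-2": ["vpn-a"]}

problems = problems_from_states(states, undesirable)
incidents = incidents_from_problems(problems, service_map)
```

   In this example, the loss of light contributes to two Incidents,
   and the Incident for "vpn-a" results from two Problems.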

   Figure 5 shows how Faults and Problems may be Consolidated to
   determine the Causes.  The arrows show how one item may give rise to
   another.

   A Cause can be indicated by or determined from Faults, Problems, and
   Symptoms.  One Cause may point to another, and a Cause can also be
   considered as a Symptom.  The determination of Causes can consider
   multiple inputs.  An Incident results from one or more Problems.


                                          ---------
                           ------------- |         |
                          |  ----------> | Symptom |
                          | |            |         |
                          | |             ---------
                          v |                 ^
                       ---------              |
              ------->|  Cause  |<---------   |
             |         ---------           |  |
             |           ^   |             |  |
             |           |   |             |  |
             |            ---              |  |
             |                             |  |
         ---------                      ---------          ----------
        |  Fault  |------------------->| Problem |------->| Incident |
         ---------                      ---------          ----------


               Figure 5: Consolidation of Symptoms and Causes
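
   Consolidation could be approached as a graph traversal, as in the
   sketch below.  This is one possible technique and not a method
   prescribed by this document.  The "indicates" mapping, the item
   naming convention, and the treatment of a Cause with no further
   Cause as "underlying" are assumptions for the example.

```python
def underlying_causes(indicates, observed):
    """Follow links from observed items toward underlying Causes.

    indicates maps an item (Fault, Problem, Symptom, or Cause) to the
    Causes it points to.  Returns (underlying, unexplained), where
    unexplained holds observed items with no known Cause.
    """
    underlying, unexplained = set(), set()
    observed_set = set(observed)
    seen, stack = set(), list(observed)
    while stack:
        item = stack.pop()
        if item in seen:
            continue  # guards against Cause-to-Cause loops
        seen.add(item)
        next_causes = indicates.get(item, [])
        if next_causes:
            stack.extend(next_causes)
        elif item in observed_set:
            unexplained.add(item)
        else:
            underlying.add(item)
    return underlying, unexplained


# Hypothetical inputs gathered from Faults, Problems, and Symptoms.
indicates = {
    "fault:los-port-3": ["cause:fiber-microbend"],
    "problem:vpn-a-down": ["cause:link-down"],
    "cause:link-down": ["cause:fiber-microbend"],
    "symptom:high-ber": ["cause:fiber-microbend"],
}
observed = [
    "fault:los-port-3",
    "problem:vpn-a-down",
    "symptom:high-ber",
    "problem:cpu-spike",
]
underlying, unexplained = underlying_causes(indicates, observed)
```

   Here three of the observed items consolidate to a single underlying
   Cause, while one Problem remains without an associated Cause.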


   Figure 6 shows how thresholds are important in the consideration of
   Analog Values and Events.  The arrows in the figure show how one item
   may give rise to or utilize another.  The use of threshold-driven
   Events and States (and the Alerts that they might give rise to) must
   be treated with caution to dampen any "flapping" (so that consistent
   States may be observed) and to avoid overwhelming management
   processes or systems.  Analog Values may be read or notified from the
   Resource and could cross a threshold, be deemed Values with
   Relevance, or be evaluated over time.  Events may be counted, and the
   Count may cross a threshold or reach a Value with Relevance.

   The Threshold Process may be implementation-specific and subject to
   policies.  When a threshold is crossed and any other conditions are
   matched, an Event may be determined, and treated like any other
   Event.


    Occurrence
         ^
         |
         |---------------------> State
         |
         |        -------                  Relevance
         |------>| Count |-----------------------------> Value
         |        -------          |                       ^
         |           |             |                       |
         |           |             |                       | Relevance
         |           |             v                       |
         |           |        -----------           ----------------
       Event         |       | Evaluated |         |                |
         ^           |       | over time |<--------|  Analog Value  |
         |           v        -----------          |                |
         |      -----------        |               |                |
         |     | Threshold |       |               |                |
         |<----|  Process  |<------                |                |
         |     |           |<----------------------|                |
         |      -----------                         ----------------
         |                                                 ^
         |                                                 |
         | Detect                                   Detect |
         |                                                 |
    Change at a Time                                Change over Time


                  Figure 6: Counts, Thresholds, and Values
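
   As noted above, the Threshold Process may be implementation-
   specific.  The sketch below shows one common way to dampen
   flapping, namely hysteresis with separate raise and clear levels,
   together with a simple count of Events in a sliding window.  Both
   techniques, and all function and parameter names, are assumptions
   chosen for illustration.

```python
from collections import deque


def threshold_events(samples, raise_at, clear_at):
    """Turn analog Values into 'raised'/'cleared' Events.

    Using clear_at below raise_at (hysteresis) avoids emitting an
    Event for every small swing around a single level.
    """
    if clear_at > raise_at:
        raise ValueError("clear_at must not exceed raise_at")
    events, raised = [], False
    for t, value in samples:
        if not raised and value >= raise_at:
            raised = True
            events.append((t, "raised"))
        elif raised and value <= clear_at:
            raised = False
            events.append((t, "cleared"))
    return events


def count_threshold_crossed(event_times, window, limit):
    """Return the time at which 'limit' Events fall within 'window'.

    event_times must be in ascending order.  Returns None if the
    Count never reaches the limit.
    """
    recent = deque()
    for t in event_times:
        recent.append(t)
        while recent[0] <= t - window:
            recent.popleft()
        if len(recent) >= limit:
            return t
    return None


# Hypothetical utilization readings (percent) that hover around 80.
samples = [(0, 70), (1, 81), (2, 79), (3, 82), (4, 78), (5, 59), (6, 85)]
damped = threshold_events(samples, raise_at=80, clear_at=60)
naive = threshold_events(samples, raise_at=80, clear_at=80)
crossed_at = count_threshold_crossed([1, 5, 6, 7], window=3, limit=3)
```

   With these inputs the single-level threshold produces five Events,
   while the damped version produces three, giving a more consistent
   view of State.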


5.  Security Considerations

   This document specifies terminology and has no direct effect on the
   security of implementations or deployments.  However, protocol
   solutions and management models need to be aware of several aspects:

   *  The exposure of information pertaining to Faults may make
      available knowledge of the internal workings of a network (in
      particular its vulnerabilities) that may be of use to an attacker.

   *  Systems that generate management information (messages,
      notifications, etc.) when Faults occur may be attacked by causing
      them to generate so much information that the system that manages
      the network is swamped and unable to properly manage the network.

   *  Reporting false information about Faults (or masking reports of
      Faults) may cause the system that manages the network to function
      incorrectly.

6.  Privacy Considerations

   In general, Fault Management should not expose information about end-
   user activities or user data.  The main privacy concern is for
   network operators to keep control of all information about Faults in
   order to protect their own privacy and the details of how they
   operate their networks.

7.  IANA Considerations

   This document makes no requests for IANA action.

Acknowledgments

   The authors would like to thank Med Boucadair, Wanting Du, Joe
   Clarke, Javier Antich, Benoit Claise, Christopher Janz, Sherif
   Mostafa, Kristian Larsson, Dirk Hugo, Carsten Bormann, Hilarie Orman,
   Stewart Bryant, Paul Kyzivat, and Jouni Korhonen for their helpful
   comments.

   Special thanks to the team that met at a side meeting at IETF-120 to
   discuss some of the thorny issues:

   *  Benoit Claise

   *  Watson Ladd

   *  Brad Peters


   *  Bo Wu

   *  Georgios Karagiannis

   *  Olga Havel

   *  Vincenzo Riccobene

   *  Yi Lin

   *  Jie Dong

   *  Aihua Guo

   *  Thomas Graf

   *  Qin Wu

   *  Chaode Yu

   *  Adrian Farrel

Informative References

   [DimensionalModeling]
              Wikipedia, "Dimensional Modeling", 18 November 2024,
              <https://en.wikipedia.org/w/
              index.php?title=Dimensional_modeling>.

   [I-D.ietf-nmop-network-anomaly-architecture]
              Graf, T., Du, W., and P. Francois, "An Architecture for a
              Network Anomaly Detection Framework", Work in Progress,
              Internet-Draft, draft-ietf-nmop-network-anomaly-
              architecture-01, 20 October 2024,
              <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-
              network-anomaly-architecture-01>.

   [I-D.ietf-nmop-network-incident-yang]
              Hu, T., Contreras, L. M., Wu, Q., Davis, N., and C. Feng,
              "A YANG Data Model for Network Incident Management", Work
              in Progress, Internet-Draft, draft-ietf-nmop-network-
              incident-yang-02, 10 October 2024,
              <https://datatracker.ietf.org/doc/html/draft-ietf-nmop-
              network-incident-yang-02>.

   [RFC3877]  Chisholm, S. and D. Romascanu, "Alarm Management
              Information Base (MIB)", RFC 3877, DOI 10.17487/RFC3877,
              September 2004, <https://www.rfc-editor.org/info/rfc3877>.


   [RFC6632]  Ersue, M., Ed. and B. Claise, "An Overview of the IETF
              Network Management Standards", RFC 6632,
              DOI 10.17487/RFC6632, June 2012,
              <https://www.rfc-editor.org/info/rfc6632>.

   [RFC8632]  Vallin, S. and M. Bjorklund, "A YANG Data Model for Alarm
              Management", RFC 8632, DOI 10.17487/RFC8632, September
              2019, <https://www.rfc-editor.org/info/rfc8632>.

   [RFC9232]  Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and
              A. Wang, "Network Telemetry Framework", RFC 9232,
              DOI 10.17487/RFC9232, May 2022,
              <https://www.rfc-editor.org/info/rfc9232>.

   [RFC9315]  Clemm, A., Ciavaglia, L., Granville, L. Z., and J.
              Tantsura, "Intent-Based Networking - Concepts and
              Definitions", RFC 9315, DOI 10.17487/RFC9315, October
              2022, <https://www.rfc-editor.org/info/rfc9315>.

Authors' Addresses

   Nigel Davis (editor)
   Ciena
   United Kingdom
   Email: ndavis@ciena.com


   Adrian Farrel (editor)
   Old Dog Consulting
   United Kingdom
   Email: adrian@olddog.co.uk


   Thomas Graf
   Swisscom
   Binzring 17
   CH-8045 Zurich
   Switzerland
   Email: thomas.graf@swisscom.com


   Qin Wu
   Huawei
   101 Software Avenue, Yuhua District
   Nanjing
   Jiangsu, 210012
   China
   Email: bill.wu@huawei.com


   Chaode Yu
   Huawei Technologies
   Email: yuchaode@huawei.com

