In today’s post, we will take a deep dive into Part 5 of ISO 26262, which covers product development at the hardware level. In particular, we will take a close look at the Fault Metrics defined by the standard. This part is where ISO 26262 differs the most from IEC 61508. As we will see below, some of these differences are just terminology, while others are more fundamental.
Part 5 of the standard is dedicated to the development of the hardware required to achieve safety goals (software is covered in the next part). In this part, the technical safety requirements developed in Part 4 are allocated to specific hardware and software designs. This could be thought of as equivalent to detailed engineering in a typical IEC 61511 project.
Without going too deep into the details, the ISO standard requires that the design consider several factors, including:
- Response to failures, including transient faults
- Diagnostic capabilities
- Consideration of fault detection times
- Expected failure rates of components
- Design verification requirements
- Consistency with the higher-level safety specifications
The detailed hardware design is captured in three main deliverables:
- Hardware Safety Requirements Specification
- Hardware-software Interface Specification
- Hardware Safety Requirements Verification Report
The ISO standard does not go into great detail on the hardware design process (neither does IEC 61508), so I will not either.
The later sections of Part 5 discuss the quantitative verification of the hardware via various metrics, which is where the rest of this article will focus.
Clause 8 covers the evaluation of the hardware architectural metrics. Specifically, these metrics are intended to evaluate the effectiveness of the hardware architecture in dealing with random failures. This is distinct from the evaluation of random hardware failures (i.e. ASIL) covered in Clause 9.
Diagnostic coverage is defined much the same way as it is in the IEC standards. It must be estimated based on failure rates from recognized industry sources, statistics based on field returns or tests, or expert judgement.
The two metrics defined in this part are:
- Single-point Fault Metric – measures the robustness of the design to single-point and residual faults. Higher is better.
- Latent-fault Metric – measures the robustness of the design to latent faults. Higher is better.
As with many of the ISO requirements, these metrics only apply to higher-ASIL functions (i.e. B, C, or D).
Before discussing the metrics, it is useful to remember the taxonomy of faults/failures (from Part 1) used by ISO 26262, which is different from IEC. The total failure rate λ can be broken down into:
λ = λSPF + λRF + λMPF + λS
λSPF: Single Point Faults (i.e. a DU fault where there are no diagnostics)
λRF: Residual Faults (i.e. a DU fault not covered by diagnostics)
λMPF: Multiple Point Faults (i.e. a combination of independent SPFs)
λS: Safe Faults
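To make the breakdown concrete, here is a minimal Python sketch of how the four categories could be tracked per hardware element and rolled up to the system level. The class name `FaultRates`, the helper `sum_rates`, and all failure-rate values are my own illustrative assumptions, not something taken from the standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRates:
    """Failure-rate breakdown for one hardware element, in FIT (hypothetical helper)."""
    spf: float   # lambda_SPF: single-point faults
    rf: float    # lambda_RF: residual faults
    mpf: float   # lambda_MPF: multiple-point faults
    safe: float  # lambda_S: safe faults

    @property
    def total(self) -> float:
        # lambda = lambda_SPF + lambda_RF + lambda_MPF + lambda_S
        return self.spf + self.rf + self.mpf + self.safe


def sum_rates(elements):
    """Add up the per-category rates over a list of elements."""
    return FaultRates(
        spf=sum(e.spf for e in elements),
        rf=sum(e.rf for e in elements),
        mpf=sum(e.mpf for e in elements),
        safe=sum(e.safe for e in elements),
    )


# Made-up numbers for illustration only; real values come from an FMEDA.
sensor = FaultRates(spf=0.0, rf=2.0, mpf=30.0, safe=18.0)
micro = FaultRates(spf=0.0, rf=5.0, mpf=60.0, safe=35.0)
system = sum_rates([sensor, micro])
print(system.total)  # 150.0
```

Both metrics discussed in this post are ratios of sums over all elements, so keeping the categories separate until the final roll-up is the natural way to organize the data.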
Single-point Fault Metric
The single-point fault metric is defined as the sum of the multiple-point faults and the safe faults divided by the total failure rate, i.e. the following ratio: Σ(λMPF + λS) / Σ(λ)
Note: The name “single-point fault metric” may initially be confusing, since the single-point fault rate (λSPF) does not appear in the formula! However, the formula can equivalently be written as: 1 − Σ(λSPF + λRF) / Σ(λ)
This ratio looks suspiciously like the IEC 61508 concept of Safe Failure Fraction (SFF), with the notable exception that multiple-point faults are also considered “safe”. The inclusion of multiple-point faults is a somewhat unusual approach, but it is probably why the latent-fault metric is also calculated.
A quantitative target for the single-point fault metric is set by the standard based on the ASIL target: at least 90% for ASIL B, 97% for ASIL C, and 99% for ASIL D.
By combining safe faults and multiple-point faults into the same metric, this metric has a similar impact to the SFF-based hardware fault tolerance requirements in IEC 61508. In other words, if the single-point fault metric is too low, additional fault tolerance will convert those faults to multiple-point faults and improve the metric. Sound familiar?
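As a rough illustration of that effect, the sketch below computes the single-point fault metric for a hypothetical design before and after redundancy is added. The function name and the FIT values are assumptions of mine, and for simplicity the total failure rate is held constant even though a real redundant channel would add failure rate of its own.

```python
def single_point_fault_metric(spf, rf, mpf, safe):
    """Return 1 - sum(SPF + RF) / sum(total); all inputs are failure rates in FIT."""
    total = spf + rf + mpf + safe
    if total <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 - (spf + rf) / total


# Hypothetical single-channel design: 20 FIT of faults have no safety mechanism at all.
before = single_point_fault_metric(spf=20.0, rf=5.0, mpf=25.0, safe=50.0)

# Same design with a redundant channel: the 20 FIT of single-point faults
# are now multiple-point faults (simplification: total rate unchanged).
after = single_point_fault_metric(spf=0.0, rf=5.0, mpf=45.0, safe=50.0)

print(f"{before:.0%} -> {after:.0%}")  # 75% -> 95%
```

Nothing about the underlying components changed in this example. The faults were only reclassified, which is exactly why the latent-fault metric is needed as a second check.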
Latent-fault Metric
The latent-fault metric is defined as the sum of the multiple-point faults that are perceived by the driver or detected by diagnostics plus the safe faults, divided by the total multiple-point and safe faults, i.e. the following ratio: Σ(λMPF(Per/Det) + λS) / Σ(λMPF + λS)
Again, this concept looks very similar to the concept of IEC 61508 diagnostic coverage, except that it also includes the possibility of driver “perception” of and response to faults. This is generally not considered in IEC 61508 (or IEC 61511) since most systems are dormant and activate on demand.
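Here is a small sketch of the latent-fault metric calculation, assuming the multiple-point faults have already been split into a perceived/detected portion and a latent portion. The function and parameter names are hypothetical, and the numbers are invented for illustration.

```python
def latent_fault_metric(mpf_perceived_or_detected, mpf_latent, safe):
    """Return sum(MPF perceived/detected + S) / sum(MPF + S); inputs in FIT."""
    mpf_total = mpf_perceived_or_detected + mpf_latent
    denominator = mpf_total + safe
    if denominator <= 0:
        raise ValueError("multiple-point plus safe failure rate must be positive")
    return (mpf_perceived_or_detected + safe) / denominator


# Hypothetical design: 45 FIT of multiple-point faults, of which diagnostics
# (or the driver) catch 36 FIT and 9 FIT stay latent; 50 FIT are safe faults.
lfm = latent_fault_metric(mpf_perceived_or_detected=36.0, mpf_latent=9.0, safe=50.0)
print(f"{lfm:.1%}")  # 90.5%
```

Note that single-point and residual faults do not appear anywhere in this calculation. They are already accounted for by the single-point fault metric.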
A quantitative target value for the latent-fault metric is set by the standard based on the ASIL target: at least 60% for ASIL B, 80% for ASIL C, and 90% for ASIL D.
My interpretation of the two required fault metrics is that they are a sort of roll-up of several IEC concepts, including diagnostic coverage, safe failure fraction, and hardware fault tolerance. The intended result is that reasonable amounts of diagnostics and fault tolerance are built into the system before the ASIL is even calculated. Of course, the SFF concept and the HFT requirements in the IEC standards are somewhat notorious for the “numbers games” they have inspired (e.g. see here and here). It will be interesting to see if the automotive industry is more successful in this regard.
Key takeaways from this part of the article include:
- The hardware design process is not specified in detail in the ISO 26262 standard, but it includes familiar concepts such as diagnostic coverage, failure rates, and verification.
- The standard includes two fault metrics: (i) the single-point fault metric and (ii) the latent-fault metric, which are similar in concept to the safe failure fraction (SFF) and diagnostic coverage (DC) metrics from IEC 61508.
Part 5 of the standard is also where the Automotive Safety Integrity Level (ASIL) is introduced. We will cover ASIL in detail in a future post. Thanks for reading!