## **PCIe Advanced Error Reporting Plugin**

## Metrics List & Descriptions:

| Technology/Category | Metric/Feature/Input | Name | Data Type | Format Example | Collectd Release | Collectd Plugin | Description | Dependencies | Limitations | Comments |
|---|---|---|---|---|---|---|---|---|---|---|
| PCIe AER | | PCIe AER Plugin | - | - | - | - | Plugin to provide PCIe AER metrics, errors, notifications & device information | Depends on sysfs and proc file systems | To be used on little endian systems. | |
| PCIe AER | Feature | Device Domain | Hex | 10 | Master | pcie_errors | The PCI address domain consisting of three distinct address spaces: configuration, memory, and I/O space. | None | | |
| PCIe AER | Feature | Device Bus | Hex | 10 | Master | pcie_errors | PCIe Bus number | None | | |
| PCIe AER | Feature | Device ID | Hex | 3597 | Master | pcie_errors | PCIe Device ID of the device | None | | |
| PCIe AER | Feature | Device Function | Hex | 10 | Master | pcie_errors | Bus:Device.Function notation used to succinctly describe PCI and PCIe devices | None | | |
| PCIe AER | Feature | Instance Type | Text | correctable/uncorrectable | Master | pcie_errors | PCIe instance type | None | | |
| PCIe AER | Feature | Severity | Text | Fatal/Non-fatal | Master | pcie_errors | Severity flag indicating whether an uncorrectable error is fatal or non-fatal | None | | |
| PCIe AER | Feature | Persistent Notification | Text | True/False | Master | pcie_errors | If an uncorrectable error has already been reported once, the persistent flag is set in the plugin and the error is not reported again | None | | |
| PCIe AER | Metric | Uncorrectable Error | Text | uncorrectable | Master | pcie_errors | Errors that do not affect the integrity of the PCI Express fabric, but in which data/information is lost. Non-fatal errors are corrupted transactions that cannot be corrected by PCIe hardware. However, the PCI Express fabric continues to function correctly and other transactions are unaffected; only the particular transaction is affected. Recovery from a non-fatal error may or may not be possible, depending on the device-specific software associated with the requester that initiated the transaction. | None | | |
| PCIe AER | Metric | Correctable Error | Text | correctable | Master | pcie_errors | Errors that may have an impact on performance (such as latency or bandwidth), but in which no data/information is lost and the PCIe fabric remains reliable. Such errors are corrected by hardware and no software intervention is required. | None | | |
| PCIe AER | Metric | Severity Non-Fatal Error | Text | non_fatal | Master | pcie_errors | Error severity indicating no reboot necessary | None | | |
| PCIe AER | Metric | Severity Fatal Error | Text | fatal | Master | pcie_errors | Error severity indicating reboot necessary | None | | |
| PCIe AER | Metric | Unsupported Request | Text | unsupported | Master | pcie_errors | This error occurs when an endpoint or a root port receives any of a set of transactions defined by the PCIe specification in [1]. In all cases the TLP is deleted in the Hard IP block and not presented to the Application Layer. If the TLP is a non-posted request, the Hard IP block generates a completion with Unsupported Request status. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Data Link Protocol Uncorrected Error | Text | Data Link Protocol | Master | pcie_errors | This error occurs when a sequence number specified by the Ack/Nak block in the Data Link Layer (AckNak_Seq_Num) does not correspond to an unacknowledged TLP. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Surprise Down Uncorrected Error | Text | Surprise Down | Master | pcie_errors | Occurs when the PCIe device goes down without notice | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Poisoned TLP Uncorrected Error | Text | Poisoned TLP | Master | pcie_errors | Any time a poisoned TLP is destined for a PCIe device, the IIO module drops the poisoned data packet, contains the error in the domain in which it was detected, brings down the link, and signals a fatal error to SW/FW. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Flow Control Protocol Uncorrected Error | Text | Flow Control Protocol | Master | pcie_errors | An uncorrected error in the flow control protocol, found in the transaction layer, that prevents flow control credit transactions from being sent. This error occurs when a component does not receive update flow control credits within the 200 μs limit. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Completion Timeout Uncorrected Error | Text | Completion Timeout | Master | pcie_errors | This error occurs when a request originating from the Application Layer does not generate a corresponding completion TLP within the established time. It is the responsibility of the Application Layer logic to provide the completion timeout mechanism. The completion timeout should be reported from the Transaction Layer using the cpl_err[0] signal. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Completer Abort Uncorrected Error | Text | Completer Abort | Master | pcie_errors | The Application Layer reports this error using the cpl_err[2] signal when it aborts receipt of a TLP. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Unexpected Completion Uncorrected Error | Text | Unexpected Completion | Master | pcie_errors | This error is caused by an unexpected completion transaction as listed in [1]. The TLP is not presented to the Application Layer; the Hard IP block deletes it. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Receiver Overflow Uncorrected Error | Text | Receiver Overflow | Master | pcie_errors | This error occurs when a component receives a TLP that violates the FC credits allocated for this type of TLP. In all cases the Hard IP block deletes the TLP and it is not presented to the Application Layer. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Malformed TLP Uncorrected Error | Text | Malformed TLP | Master | pcie_errors | This error is caused by an unexpected completion transaction as listed in [1]. The TLP is not presented to the Application Layer; the Hard IP block deletes it. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | ECRC Uncorrected Error Status | Text | ECRC | Master | pcie_errors | ECRC ensures end-to-end data integrity for systems that require high reliability. When the ECRC generation option is turned on, errors are detected when receiving TLPs with a bad ECRC. More details in [2] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Unsupported Uncorrected Error Request | Text | Unsupported | Master | pcie_errors | This error is caused by an unexpected completion transaction as listed in [1]. The TLP is not presented to the Application Layer; the Hard IP block deletes it. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | ACS Violation Uncorrected Error | Text | ACS Violation | Master | pcie_errors | Violation in Access Control Services. More details in [3] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Internal Uncorrected Error | Text | Internal Uncorrected | Master | pcie_errors | An error associated with a PCI Express interface that occurs within a component and which may not be attributable to a packet or event on the PCI Express interface itself or on behalf of transactions initiated on PCI Express. More details in [4] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | MC Blocked TLP Uncorrected Error | Text | MC Blocked TLP | Master | pcie_errors | An error with Multicast TLP processing. More details in [5] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Atomic Egress Blocked Uncorrected Error | Text | Atomic Egress Blocked | Master | pcie_errors | Error with setting the AtomicOp Egress Blocking bit. More details in [6] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | TLP Prefix Blocked Uncorrected Error | Text | TLP Prefix Blocked | Master | pcie_errors | The TLP Prefix mechanism extends the header size by adding DWORDs to the front of headers that carry additional information. The uncorrected error reflects a failure in this process. More details in [7] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Receiver Error Status Corrected Error | Text | Receiver Error Status | Master | pcie_errors | Receiver error at the PCIe physical layer | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Bad TLP Status Corrected Error | Text | Bad TLP Status | Master | pcie_errors | This error occurs when an LCRC verification fails or when a sequence number error occurs. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Bad DLLP Status Corrected Error | Text | Bad DLLP Status | Master | pcie_errors | This error occurs when a CRC verification fails. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Replay NUM Rollover Corrected Error | Text | Replay NUM Rollover | Master | pcie_errors | This error occurs when the replay number rolls over. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Replay Timer Timeout Corrected Error | Text | Replay Timer Timeout | Master | pcie_errors | This error occurs when the replay timer times out. | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Advisory Non-Fatal Corrected Error | Text | Advisory Non-Fatal | Master | pcie_errors | The errors are reported and signaled as ERR_COR, ERR_NONFATAL, ERR_FATAL, or not signaled at all, depending upon the role of the agent that detects the error and whether the agent implements AER in an advisory capacity to the application. More details in [8] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Corrected Internal Corrected Error | Text | Corrected Internal | Master | pcie_errors | An error associated with a PCI Express interface that occurs within a component and which may not be attributable to a packet or event on the PCI Express interface itself or on behalf of transactions initiated on PCI Express. More details in [4] | Depends on what's exposed in sysfs and proc file systems | | |
| PCIe AER | Metric | Header Log Overflow Corrected Error | Text | Header Log Overflow | Master | pcie_errors | When a header is logged, the header is that of the first TLP that was lost or corrupted by the Uncorrectable Internal Error. More details in [9] | Depends on what's exposed in sysfs and proc file systems | | |

## Sub-sections:

- PCIe Errors High Level Design
- PCIe RAS Executed Tests
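As a rough illustration of how the per-error metrics in the table above relate to the AER status registers, the sketch below decodes a raw 32-bit Uncorrectable or Correctable Error Status value into the short names used in the Format Example column. The bit positions follow the AER capability layout in the PCIe specification (the same values as the `PCI_ERR_UNC_*` and `PCI_ERR_COR_*` masks in the Linux kernel headers). The function and dictionary names are hypothetical, and this is not the plugin's own code. It assumes the register arrives as four little-endian bytes, as it would when read from PCI configuration space.

```python
# Hypothetical helper: maps AER status register bits to the metric names
# listed in the table. Not taken from the pcie_errors plugin source.

# Uncorrectable Error Status register: bit position -> metric name
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC",
    20: "Unsupported",
    21: "ACS Violation",
    22: "Internal Uncorrected",
    23: "MC Blocked TLP",
    24: "Atomic Egress Blocked",
    25: "TLP Prefix Blocked",
}

# Correctable Error Status register: bit position -> metric name
CORRECTABLE_BITS = {
    0: "Receiver Error Status",
    6: "Bad TLP Status",
    7: "Bad DLLP Status",
    8: "Replay NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
    14: "Corrected Internal",
    15: "Header Log Overflow",
}


def decode_status(raw: bytes, bit_names: dict) -> list:
    """Return the names of all error bits set in a 4-byte status register.

    `raw` is assumed to be the register exactly as stored in configuration
    space, i.e. little-endian.
    """
    if len(raw) != 4:
        raise ValueError("AER status registers are 32 bits wide")
    value = int.from_bytes(raw, "little")
    return [name for bit, name in sorted(bit_names.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # Bits 12 and 20 set in the uncorrectable status register.
    sample = (0x00101000).to_bytes(4, "little")
    print(decode_status(sample, UNCORRECTABLE_BITS))
    # prints ['Poisoned TLP', 'Unsupported']
```

Passing the byte order explicitly to `int.from_bytes` keeps the decoding independent of the host's endianness.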