- Proposed name for the project:
Data Collection for Failure Prediction
- Proposed name for the repository:
- Project Categories:
A failure prediction system can be deployed to help an NFV system avoid unexpected failures before they occur. The failure prediction topic has been studied by the ETSI NFV ISG, which has developed some general requirements. These requirements should be the initial input for this topic in OPNFV. For example, the following requirements are listed in NFV GS Draft NFV-REL 001 (v1.0.0, 2014-11):
- The real-time resource usage such as the disk usage, CPU Load, memory usage, network IO and virtual IO usage and their loss rate, and available vCPUs, virtual memory, etc. shall be provided to VIM at configurable intervals by entities of infrastructure. It should also be possible to configure the thresholds or watermarks for the event notification instead of configuring the reporting interval.
- Each entity of infrastructure shall provide the open interfaces for communicating the performance and the consumption resource and for allowing polling of its working state and resource usage.
- The failure prediction framework should include the functionality of the false alarm filtering to avoid triggering unnecessary prevention procedure for anomaly alarming.
- The failure prediction framework should include trend identification, period or seasonal variations, and randomness analysis of the collected data on resource usage (e.g., memory, file descriptors, sockets, database connections) to predict the progression of the operated NFV system to an unhealthy state, e.g., resource exhausting.
- The failure prediction framework should be able to diagnose or verify which entity is suspected to be progressing towards a failure and which VNFs might be affected due to the predicted anomaly.
- The entities of VNF and its supported infrastructure should have their own self-diagnostic functionality in order to provide their health information to the VIM.
- The log report associated with a NFVI resource failure or an error detected by a hardware component, software module, hypervisor, VM, or the network should include the error severity.
- The log report should include an indication of the failure cause.
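As an illustration of the trend identification requirement above, the following is a minimal sketch, not the method prescribed by ETSI or by this project. It fits a least-squares line to resource usage samples and estimates how long it will take until the resource is exhausted. The function name `time_to_exhaustion` and its parameters are hypothetical, and a real predictor would also need to handle seasonal variation and randomness.

```python
from typing import Optional, Sequence


def time_to_exhaustion(
    timestamps: Sequence[float],
    usage: Sequence[float],
    capacity: float,
) -> Optional[float]:
    """Estimate seconds from the last sample until usage reaches capacity.

    Hypothetical helper: fits a straight line to the samples by ordinary
    least squares. Returns None when usage is flat or shrinking.
    """
    n = len(timestamps)
    if n < 2 or n != len(usage):
        raise ValueError("need at least two samples of equal length")

    mean_t = sum(timestamps) / n
    mean_u = sum(usage) / n
    var_t = sum((t - mean_t) ** 2 for t in timestamps)
    if var_t == 0:
        raise ValueError("timestamps must not all be identical")

    # Slope and intercept of the least-squares line usage = slope * t + intercept.
    slope = sum(
        (t - mean_t) * (u - mean_u) for t, u in zip(timestamps, usage)
    ) / var_t
    if slope <= 0:
        return None  # no upward trend, so no predicted exhaustion
    intercept = mean_u - slope * mean_t

    # Time at which the fitted line crosses the capacity limit.
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - timestamps[-1])


if __name__ == "__main__":
    # Example: open file descriptors sampled every 10 seconds.
    remaining = time_to_exhaustion([0, 10, 20, 30], [100, 110, 120, 130], 200)
    print("estimated seconds until exhaustion:", remaining)
```

A linear fit is chosen here only because it is the simplest trend model; it says nothing about which model the failure prediction framework should finally use.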
Failure prediction will also be one of the important topics in ETSI NFV Phase 2, so these requirements may be updated or elaborated during the lifecycle of this project. Any updates will be synchronized and merged into this project.
Collecting data is the first and most significant step of a failure prediction system, which requires different kinds of data (e.g., log files, real-time parameters of hardware and software, environment parameters, events, etc.) from various sources (e.g., NFVI, VIM, etc.). By analyzing the collected data, a failure predictor can give notice of a failure in advance. Some upstream projects for data collection already exist (e.g., the Monasca and Ceilometer projects in OpenStack for system resource monitoring). However, they do not cover all the specific requirements of the OPNFV environment. Therefore, our first task is to investigate the gaps between those upstream projects, other OpenStack components and the OPNFV requirements. Meanwhile, we will identify which kinds of data need to be collected. After that, we plan to deliver documents on the VIM northbound API, the implementation architecture and the implementation plan. Finally, we will implement the failure prediction framework in detail.
Ceilometer and Monasca can collect some metrics about physical and virtual resources, but they do not cover some metrics about applications and the guest OS. The following non-exhaustive list gives some examples of such metrics:
- vCPU interrupt time, vCPU idle time, vCPU system time, vCPU user time
- File system information
- Memory utilization (allocated and used memory, block size and number of blocks in use, page swaps)
- Hypervisor priority level for VNF
- Acceleration technology in use (DPDK in and out frames/sec, SR-IOV, etc.)
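To make the vCPU time metrics above concrete, the following is a minimal sketch of how they could be derived inside a guest. It assumes a Linux guest OS, where per-CPU time counters are exposed as lines in /proc/stat; the helper names `parse_cpu_line` and `cpu_time_shares` are hypothetical and are not part of Ceilometer or Monasca.

```python
from typing import Dict

# First eight counters of a "cpu" line in Linux /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> Dict[str, int]:
    """Parse one 'cpu' or 'cpuN' line from /proc/stat into named counters."""
    parts = line.split()
    if not parts or not parts[0].startswith("cpu"):
        raise ValueError("not a cpu line: %r" % line)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    if len(values) != len(FIELDS):
        raise ValueError("too few counters in line: %r" % line)
    return dict(zip(FIELDS, values))


def cpu_time_shares(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, float]:
    """Return the fraction of elapsed CPU time spent in each state."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: delta / total for name, delta in deltas.items()}


if __name__ == "__main__":
    # Two samples of the same vCPU, taken some interval apart (made-up values).
    first = parse_cpu_line("cpu0 100 0 50 800 10 5 5 0")
    second = parse_cpu_line("cpu0 150 0 70 960 10 10 10 0")
    for name, share in cpu_time_shares(first, second).items():
        print("%-8s %.3f" % (name, share))
```

The counters are cumulative, so a collector has to sample twice and work with the difference, which is why the sketch takes a before and an after reading.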
The whole failure prediction system consists of a data collector, a failure predictor and a failure management module, as shown in the following figure.
The data collector consists of Ceilometer and Monasca, and can be extended with plug-ins for other open source data collectors, e.g. Zabbix, Nagios, Cacti. Using real-time analytics and machine learning techniques, the failure predictor analyzes the data gathered by the data collector to automatically determine whether a failure will happen. If a failure is predicted, the failure predictor sends failure notifications to the failure management module (e.g. the Doctor module), which handles these notifications.
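The flow between the three modules can be sketched as follows. This is only an illustration under stated assumptions: every class and function name here (`Sample`, `Collector`, `StaticCollector`, `WatermarkPredictor`, `run_once`) is hypothetical, no Ceilometer, Monasca or Doctor API is called, and a plain watermark check stands in for the analytics and machine learning techniques that a real failure predictor would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class Sample:
    source: str   # entity that produced the value, e.g. a VM or host
    metric: str   # metric name, e.g. "disk_usage_percent"
    value: float


@dataclass(frozen=True)
class FailureNotification:
    source: str
    metric: str
    value: float
    watermark: float


class Collector(Protocol):
    """Plug-in interface: anything with a collect() method can be a collector."""

    def collect(self) -> Iterable[Sample]: ...


class StaticCollector:
    """Stand-in collector that returns fixed samples instead of polling a monitor."""

    def __init__(self, samples: Iterable[Sample]) -> None:
        self._samples = list(samples)

    def collect(self) -> List[Sample]:
        return list(self._samples)


class WatermarkPredictor:
    """Placeholder predictor: flags any sample at or above its metric's watermark."""

    def __init__(self, watermarks: Dict[str, float]) -> None:
        self._watermarks = dict(watermarks)

    def predict(self, samples: Iterable[Sample]) -> List[FailureNotification]:
        notifications = []
        for s in samples:
            limit = self._watermarks.get(s.metric)
            if limit is not None and s.value >= limit:
                notifications.append(
                    FailureNotification(s.source, s.metric, s.value, limit)
                )
        return notifications


def run_once(
    collectors: Iterable[Collector],
    predictor: WatermarkPredictor,
    notify: Callable[[FailureNotification], None],
) -> List[FailureNotification]:
    """Collect from every plug-in, predict, and hand results to failure management."""
    samples = [s for c in collectors for s in c.collect()]
    notifications = predictor.predict(samples)
    for n in notifications:
        notify(n)
    return notifications


if __name__ == "__main__":
    collector = StaticCollector([
        Sample("vm-1", "disk_usage_percent", 95.0),
        Sample("vm-2", "disk_usage_percent", 40.0),
    ])
    predictor = WatermarkPredictor({"disk_usage_percent": 90.0})
    run_once([collector], predictor, print)
```

Keeping the collector behind a small interface is what would let additional monitors be plugged in without changing the predictor or the failure management side.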
In OPNFV release 2, we limit the scope of this project to the data collector.
Describe the problem being solved by project:
As a requirements-category project, it plans to address the following problem:
- The Ceilometer and Monasca projects in the OpenStack Juno release do not collect all the data that a failure prediction system needs in the OPNFV environment. This project addresses that shortcoming by analyzing the gaps between what those projects provide and the OPNFV requirements.
Specify any interface/API specification proposed:
Additional interface specifications:
- Other interfaces that may be introduced during the project
Identify a list of features and functionality to be developed:
- Additional features of Ceilometer and Monasca to support OPNFV failure prediction.
Identify what is in or out of scope, so that discussion is reduced during the development phase:
In scope
- Monasca and Ceilometer as the data collector
- Failures that could be predicted by the failure predictor
- Data category for failure prediction
- VIM northbound interfaces
Out of scope
- An engine for analyzing collected data and predicting failure.
Describe how the project is extensible in future:
The results of this project will be used as input for the next stages, e.g. Integration & Testing and Collaborative Development.
(optional, Project Categories: Integration & Testing)
Specify testing and integration like interoperability, scalability, high availability
(optional, Project Categories: Documentation)
Identify similar projects that are underway or being proposed in OPNFV or upstream projects
- The “Doctor (Fault Management and Maintenance)” project.
Identify any open source upstream projects and release timeline.
- The Monasca project (https://wiki.openstack.org/wiki/Monasca) is an upstream project of this project. This project will align with the OpenStack release schedule.
- The Ceilometer project (https://wiki.openstack.org/wiki/Ceilometer) is an upstream project of this project. This project will align with the OpenStack release schedule.
- The TSDR project (https://wiki.opendaylight.org/view/Project_Proposals:Time_Series_Data_Repository) is an upstream project of this project. This project will align with the OpenDaylight release schedule.
- OpenStack Juno Release
- OpenDaylight Helium Release
Identify any specific development to be staged with respect to the upstream projects and releases.
Are there any external fora or standards development organization dependencies? If possible, list informative and normative reference specifications.
- ETSI NFV draft REL GS (v1.0.0, 2014-11)
- ETSI GS NFV REL004
Key Project Facts
Project Creation Date:
Lifecycle State: Incubation
Primary Contact: email@example.com
Project Lead: firstname.lastname@example.org
Jira Project Name: Data Collection for Failure Prediction
Jira Project Prefix: PREDICTION
Mailing list tag: [prediction]
Committers and Contributors:
Names and affiliations of the committers:
- Hai Liu, email@example.com
- Yijun Yu, firstname.lastname@example.org
- Jun Li, email@example.com
- Yifei Xue, firstname.lastname@example.org
- Linghui Zeng, email@example.com
- Lanchao Zheng, firstname.lastname@example.org
- Qiao Fu, email@example.com
Any other contributors: TBD
Describe the project release package as OPNFV or open source upstream projects.
- Potential failures that need to be handled
- Data category required for failure prediction
- System capability requirements (e.g. disk space needed) and performance requirements (e.g. how much data needs to be collected per second)
- The mechanism for hooking in various kinds of open source monitors (e.g. Cacti, Nagios, Zabbix and Ganglia)
If project deliverables have multiple dependencies across other project categories, describe the linkage of the deliverables.
Proposed Release Schedule:
When is the first release planned?
- May, 2015.
Will this align with the current release cadence?