Reliability and Resilience


Revision as of 07:18, 24 January 2021


AIoT Framework: Reliability & Resilience

Ensuring a high level of robustness for AIoT-based systems is usually a key requirement. Robustness is a result of two key concepts: reliability and resilience ("R&R"). Reliability is about designing, running and maintaining systems to provide consistent and stable services. Resilience, on the other hand, refers to a system's ability to resist adverse events and conditions.

Ensuring reliability and resilience is a broad topic, which ranges from basics such as proper error handling on the code level up to geo-replication and disaster recovery. There are also some overlaps with Security, as well as with Verification and Validation. This section discusses reliability and resilience in the context of AIoT DevOps first, before looking at the AI and IoT specifics in more detail.

R&R for AIoT DevOps

Traditional IT systems have been using reliability and resilience engineering methods for decades. The emergence of hyper-scaling cloud infrastructures has taken this to new levels. Some of the best practices in this space are well documented, for example Google's Site Reliability Engineering approach for production systems [1]. These types of systems need to address challenges like implementing recovery mechanisms for individual IT services or entire regions, dealing with data backups, replication, clustering, network load-balancing and fail-over, geo-redundancy, etc.

IoT adds to these challenges, because in IoT parts of the system are implemented not in the data center, but as hardware and software components which are deployed in the field. These field deployments can be based on sophisticated EDGE platforms, or on very rudimentary embedded controllers. Either way, IT components deployed in the field often play by different rules, if only because it is much harder (or even technically or economically impossible) to access them for any kind of unplanned physical repairs or upgrades.

Finally, AI adds further challenges in terms of model robustness and model performance. As will be discussed later, some of these challenges are related to the algorithmic complexity of the AI models, while many more arise from the complexity of handling the AI development cycle in production environments, with the specifics of IoT added on top.

R&R DevOps for AIoT

Ensuring R&R for AIoT-enabled systems is usually not something that can be established in one go, so it is natural to integrate the R&R perspective into the AIoT DevOps cycle, and into each of its four quadrants. Agile development must address R&R not only on the application code level, but also on the AI/model level and the infrastructure level. Continuous Integration must ensure that all R&R-specific aspects are integrated properly. This can go as far as preparing the system for Chaos Engineering experiments [2]. Continuous Testing must ensure that all R&R concepts are continuously validated, including basic system-level R&R as well as AI- and IoT-specific R&R aspects. Finally, Continuous Delivery / Operations must bring R&R to production. Some companies even go as far as conducting continuous R&R tests as part of their production systems (one big proponent of this approach is Netflix, where the whole Chaos Engineering approach originated).
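To make the idea of a Chaos Engineering experiment more concrete, the following is a minimal sketch of fault injection in a test setting. It is not taken from the AIoT Framework or from any chaos tooling; the names FlakyService, call_with_retry and run_experiment are hypothetical, and the "service" is a stand-in function. The experiment injects connection failures at a configurable rate and measures how many calls still succeed with and without a retry mechanism.

```python
import random


class FlakyService:
    """Wraps a callable and injects failures with a given probability (hypothetical helper)."""

    def __init__(self, func, failure_rate, rng):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng

    def __call__(self, *args):
        # Inject a fault before the real call, as a stand-in for a network outage.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args)


def call_with_retry(service, retries, *args):
    """Calls the service, retrying on ConnectionError up to `retries` extra times."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return service(*args)
        except ConnectionError as err:
            last_error = err
    raise last_error


def run_experiment(failure_rate, retries, calls=1000, seed=42):
    """Returns the fraction of calls that succeed under injected faults."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    service = FlakyService(lambda x: x * 2, failure_rate, rng)
    succeeded = 0
    for i in range(calls):
        try:
            if call_with_retry(service, retries, i) == i * 2:
                succeeded += 1
        except ConnectionError:
            pass  # the call failed even after all retries
    return succeeded / calls


if __name__ == "__main__":
    print("success rate without retries:", run_experiment(0.3, retries=0))
    print("success rate with 3 retries: ", run_experiment(0.3, retries=3))
```

In a real pipeline, an experiment like this would typically run as part of Continuous Testing, with the measured success rate compared against an agreed R&R target.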

Analyze Rate Act

While it is important that R&R is treated as a normal part of the AIoT DevOps cycle, it usually makes sense to have a dedicated R&R planning mechanism, which looks at R&R specifically. Please note that a similar approach has also been suggested for Security, as well as Verification & Validation. It is important that none of these three areas is viewed in isolation, and that redundancies are avoided.

The AIoT Framework proposes a dedicated Analyze/Rate/Act planning process for R&R, embedded into the AIoT DevOps cycle - as shown by the figure above.

The "Analyze" phase of this process must take two key elements into consideration:

  • R&R metrics / KPIs: A performance analysis and evaluation of the actual live system. This must be updated and used as input for each iteration of the planning process. In the early phases, the focus will be more on how to actually define the R&R KPIs and acquire related data, while in the later phases this information will become an integral part of the R&R planning process.
  • Component / Dependency Analysis (C/DA): Utilizing existing system documentation such as architecture diagrams and flow charts, the R&R team should perform a thorough analysis of all the components in the system, and their potential dependencies. Out of this process, a list of potential R&R Risk Areas should be compiled ("RA list").

The RA list can contain risks on different levels of granularity, ranging from risks related to the availability of individual micro-services up to risks related to the availability of entire regions. The RA list must also be compared to the results of the Threat Modeling which comes out of the DevSecOps planning process. In some cases, it can even make sense to join these two perspectives into a single list or risk repository.
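As an illustration of the Component / Dependency Analysis, the sketch below derives candidate entries for the RA list from a dependency map. The component names, the DEPENDS_ON structure and the min_dependents threshold are all assumptions made for this example; the framework itself does not prescribe a data model. The idea is simply that components with many direct or indirect dependents are natural candidates for closer R&R inspection.

```python
from collections import defaultdict

# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "dashboard": ["asset-api"],
    "asset-api": ["device-registry", "timeseries-db"],
    "ota-service": ["device-registry"],
    "edge-gateway": ["ota-service", "asset-api"],
    "device-registry": [],
    "timeseries-db": [],
}


def transitive_dependents(depends_on):
    """Maps each component to the set of components that directly or indirectly depend on it."""
    reverse = defaultdict(set)
    for component, dependencies in depends_on.items():
        for dependency in dependencies:
            reverse[dependency].add(component)

    result = {}
    for component in depends_on:
        seen, stack = set(), [component]
        while stack:
            for parent in reverse[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        result[component] = seen
    return result


def risk_area_candidates(depends_on, min_dependents=2):
    """Returns (component, dependent count) pairs, most depended-upon first."""
    dependents = transitive_dependents(depends_on)
    ranked = sorted(dependents.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(name, len(deps)) for name, deps in ranked if len(deps) >= min_dependents]


if __name__ == "__main__":
    for name, count in risk_area_candidates(DEPENDS_ON):
        print(f"{name}: {count} dependent component(s)")
```

The resulting list is only a starting point: it captures structural exposure, not impact or likelihood, which are assessed in the "Rate" phase.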

The "Rate" phase must look at each item from the RA list in detail, including the potential impact of the risk, the likelihood that it occurs, ways of detecting issues related to the risk, and ways of resolving them. Finally, a brief action plan should describe how detection and resolution of issues related to the risk can be automated - including a rough effort estimate. Based on all of the above, a rating for each item in the RA list should be provided.

The "Act" phase starts with prioritizing and scheduling the most pressing issues, based on the individual ratings. Highly rated issues must then be transferred to the general development backlog. This will likely include additional analysis of dependencies on backlog items that are more closely related to the application development side of things.
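The "Rate" and "Act" phases can be sketched in a few lines of code. Note that the framework does not prescribe a rating formula; the multiplicative score below is an assumption, loosely modeled on the risk priority number used in FMEA (severity x occurrence x detection). The RiskArea fields, the 1-5 scales, the example entries and the threshold are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RiskArea:
    name: str
    impact: int          # 1 (minor) .. 5 (critical)
    likelihood: int      # 1 (rare) .. 5 (frequent)
    detectability: int   # 1 (detected automatically) .. 5 (hard to detect)
    effort_days: float   # rough estimate for automating detection and resolution

    @property
    def rating(self) -> int:
        # Assumed scoring scheme: higher means more pressing.
        return self.impact * self.likelihood * self.detectability


def select_for_backlog(ra_list, threshold):
    """Returns items rated at or above the threshold, highest rating first.

    Ties are broken in favor of the lower effort estimate.
    """
    selected = [ra for ra in ra_list if ra.rating >= threshold]
    return sorted(selected, key=lambda ra: (-ra.rating, ra.effort_days))


# Hypothetical RA list entries for illustration only.
RA_LIST = [
    RiskArea("region outage", impact=5, likelihood=1, detectability=1, effort_days=20),
    RiskArea("model drift on edge devices", impact=4, likelihood=3, detectability=4, effort_days=10),
    RiskArea("telemetry service crash", impact=3, likelihood=3, detectability=1, effort_days=3),
    RiskArea("asset loses connectivity", impact=3, likelihood=5, detectability=2, effort_days=5),
]

if __name__ == "__main__":
    for ra in select_for_backlog(RA_LIST, threshold=20):
        print(f"{ra.rating:3d}  {ra.name}  (~{ra.effort_days} days)")
```

Keeping the effort estimate next to the rating makes it easier to relate the selected items to other backlog items during sprint planning.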

Robust, AI-based components in AIoT

The AI community is still in the early stages of addressing reliability, resilience and related topics such as robustness and explainability of AI-based systems. Hong-Linh Truong provides the following definitions [3] from the ML perspective:

  • Robustness: Dealing with imbalanced data, learning in open-world (out-of-distribution) situations
  • Reliability: Reliable learning and reliable inference in terms of accuracy and reproducibility of ML models; uncertainty/confidence in inferences; reliable ML service serving
  • Resilience: Bias in data, adversarial attacks in ML, resilience learning, computational Byzantine failures

In the widely cited paper on Hidden Technical Debt in Machine Learning Systems[4], the authors emphasize that only a small fraction of real-world ML systems is composed of the ML code, while the required surrounding infrastructure is vast and complex - including configuration, data collection, feature extraction, data verification, machine resource management, analysis tools, process management tools, serving infrastructure and monitoring.

Robust AI Components for AIoT

The AIoT Framework suggests differentiating between the online and offline perspectives of the AI-based components in the AIoT system. The offline perspective must cover data sanitation, robust model design, and model verification. The online perspective must include runtime checks (e.g. feature values out of range or invalid outputs), an approach for graceful model degradation, and runtime monitoring. Between the online and offline perspectives, a high level of automation must be achieved, covering everything from training to testing and deployments.
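The online perspective (runtime checks plus graceful degradation) can be illustrated with a small wrapper around a model call. This is a sketch under stated assumptions: the feature names, the FEATURE_RANGES values, the fallback rule and the function names are all hypothetical, and the model is any callable that returns a probability.

```python
import math

# Hypothetical valid ranges per feature, e.g. derived from the training data.
FEATURE_RANGES = {"temperature_c": (-40.0, 125.0), "vibration_rms": (0.0, 50.0)}


def is_number(value):
    """True for finite-or-infinite real numbers, excluding booleans and NaN."""
    return isinstance(value, (int, float)) and not isinstance(value, bool) and not math.isnan(value)


def features_valid(features):
    """Runtime check: every expected feature is present and within its range."""
    for name, (low, high) in FEATURE_RANGES.items():
        value = features.get(name)
        if not is_number(value) or not low <= value <= high:
            return False
    return True


def fallback_estimate(features):
    """Conservative rule-based estimate used when the model cannot be trusted."""
    temperature = features.get("temperature_c")
    if is_number(temperature) and temperature > 90.0:
        return 0.9
    return 0.5


def predict_failure_probability(model, features):
    """Returns (probability, source), degrading to the fallback rule on any problem."""
    if not features_valid(features):
        return fallback_estimate(features), "fallback:invalid_input"
    try:
        probability = model(features)
    except Exception:
        return fallback_estimate(features), "fallback:model_error"
    if not is_number(probability) or not 0.0 <= probability <= 1.0:
        return fallback_estimate(features), "fallback:invalid_output"
    return float(probability), "model"


if __name__ == "__main__":
    reading = {"temperature_c": 60.0, "vibration_rms": 2.0}
    print(predict_failure_probability(lambda f: 0.2, reading))
    print(predict_failure_probability(lambda f: 1.7, reading))
```

Returning the source alongside the prediction gives runtime monitoring a simple signal: a rising share of fallback results indicates that the deployed model or its inputs need attention.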

Architecture for robust, AI-enabled AIoT components

Mapping all of the above R&R elements to an actual AIoT system architecture is not an easy feat. Acquiring high-quality test data from assets in the field can be difficult. Managing the offline AI development and experimentation cycle can rely on standard AI engineering and automation tools. However, model deployments to assets in the field rely on non-standard mechanisms, e.g. OTA (over-the-air) updates from the IoT toolbox. Dealing with multiple instances of models deployed onto multiple assets (or EDGE instances) in the field goes beyond standard AI processing in the cloud. And finally, gathering - and making sense of - monitoring data from multiple instances / assets is beyond today's well-established AI engineering principles.
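One way to make sense of monitoring data from many model instances is to aggregate it per deployed model version. The sketch below assumes that each asset periodically reports how many inferences it ran and how many of them had to fall back to a degraded mode; the report format, field names and the 5% threshold are hypothetical and not part of the framework.

```python
from collections import defaultdict


def summarize_by_model_version(reports):
    """Aggregates per-asset reports into one summary per model version.

    Each report is a dict with the keys asset_id, model_version,
    inferences and fallbacks (an assumed format).
    """
    totals = defaultdict(lambda: {"assets": set(), "inferences": 0, "fallbacks": 0})
    for report in reports:
        entry = totals[report["model_version"]]
        entry["assets"].add(report["asset_id"])
        entry["inferences"] += report["inferences"]
        entry["fallbacks"] += report["fallbacks"]

    summary = {}
    for version, entry in totals.items():
        rate = entry["fallbacks"] / entry["inferences"] if entry["inferences"] else 0.0
        summary[version] = {"assets": len(entry["assets"]), "fallback_rate": rate}
    return summary


def versions_to_investigate(summary, max_fallback_rate=0.05):
    """Returns model versions whose fleet-wide fallback rate exceeds the threshold."""
    return sorted(v for v, s in summary.items() if s["fallback_rate"] > max_fallback_rate)


# Hypothetical reports from four assets running two model versions.
REPORTS = [
    {"asset_id": "a1", "model_version": "v1", "inferences": 1000, "fallbacks": 10},
    {"asset_id": "a2", "model_version": "v1", "inferences": 1000, "fallbacks": 20},
    {"asset_id": "a3", "model_version": "v2", "inferences": 500, "fallbacks": 100},
    {"asset_id": "a4", "model_version": "v2", "inferences": 500, "fallbacks": 50},
]

if __name__ == "__main__":
    summary = summarize_by_model_version(REPORTS)
    print(summary)
    print("investigate:", versions_to_investigate(summary))
```

A version flagged this way could, for example, be held back from further OTA rollout until the cause is understood.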

Reliability & Resilience for IoT

Finally, we need to address the IoT specifics of Reliability & Resilience. For the backend (cloud or enterprise), most of the standard R&R aspects of Internet/cloud/enterprise systems apply. Since IoT adds new categories of clients (i.e. assets) which access the backends, this has to be taken into consideration from an R&R perspective. For example, an IoT backend must be able to cope with malfunctioning or potentially malicious behaviour of EDGE or embedded components.
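As a minimal sketch of how a backend might protect itself against malfunctioning or malicious field components, the following guard rejects malformed telemetry and throttles devices that send too many messages within a time window. The DeviceGuard class, the message format and the limits are assumptions for illustration; a production system would typically combine this with authentication and further plausibility checks.

```python
from collections import defaultdict, deque


class DeviceGuard:
    """Rejects malformed telemetry and throttles devices that send too fast."""

    def __init__(self, max_messages, window_seconds):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # device_id -> timestamps of accepted messages

    def accept(self, message, now):
        """Returns (accepted, reason) for a telemetry message received at time `now`."""
        device_id = message.get("device_id")
        value = message.get("value")
        if not isinstance(device_id, str) or not device_id:
            return False, "malformed"
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False, "malformed"

        recent = self._recent[device_id]
        # Forget accepted messages that have left the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_messages:
            return False, "rate_limited"
        recent.append(now)
        return True, "ok"


if __name__ == "__main__":
    guard = DeviceGuard(max_messages=2, window_seconds=10.0)
    for t in (0.0, 1.0, 2.0, 10.0):
        print(t, guard.accept({"device_id": "d1", "value": 21.5}, now=t))
```

Because limits are tracked per device, one misbehaving asset is throttled without affecting the rest of the fleet.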

For the IoT components deployed in the field, environmental factors can play a significant role. These often require extra ruggedness of the hardware components, which can be key from the R&R perspective. Also, depending on the complexity of the EDGE/embedded functions, many of the typical R&R features found in modern cloud environments will have to be re-invented to ensure R&R for components deployed in the field.

Finally, for many IoT systems - especially where assets can physically move - there will be a much higher chance of losing connectivity between the asset and the backend. This typically requires that both backend and field-based components implement a certain degree of autonomy. For example, an autonomous vehicle must be able to function in the field without access to additional data (e.g. map data) from the cloud. Equally, a backend asset monitoring solution must be able to function even if the asset is currently not connected. For example, the asset status information must be augmented with a timestamp which indicates when this information was last updated.
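Both sides of this autonomy requirement can be sketched briefly. On the backend, the asset status carries a last-updated timestamp so that stale information can be flagged; on the asset, readings are buffered while offline and forwarded once connectivity returns. The class names, fields, the 300-second staleness limit and the buffer capacity are hypothetical choices for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class AssetStatus:
    asset_id: str
    state: str
    last_updated: float  # seconds since epoch, set when the asset last reported


def describe_status(status, now, stale_after_seconds=300.0):
    """Backend side: returns the status together with its age and a staleness flag."""
    age = now - status.last_updated
    return {
        "asset_id": status.asset_id,
        "state": status.state,
        "age_seconds": age,
        "stale": age > stale_after_seconds,
    }


class StoreAndForwardBuffer:
    """Asset side: keeps the most recent readings while the backend is unreachable."""

    def __init__(self, capacity):
        self._queue = deque(maxlen=capacity)  # oldest readings are dropped when full

    def record(self, reading):
        self._queue.append(reading)

    def flush(self, send):
        """Sends buffered readings in order; stops at the first failure and keeps the rest."""
        sent = 0
        while self._queue:
            if not send(self._queue[0]):
                break
            self._queue.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._queue)


if __name__ == "__main__":
    status = AssetStatus("pump-7", "running", last_updated=1000.0)
    print(describe_status(status, now=2000.0))

    buffer = StoreAndForwardBuffer(capacity=3)
    for reading in (1, 2, 3, 4):
        buffer.record(reading)
    print("sent:", buffer.flush(lambda r: True))
```

The bounded buffer is a deliberate trade-off: on a constrained device it is usually preferable to lose the oldest readings rather than run out of memory during a long outage.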

Building robust IoT solutions

References

  1. Site Reliability Engineering: How Google Runs Production Systems, N. Murphy et al., 2016
  2. Chaos Engineering, Wikipedia
  3. R3E – An Approach to Robustness, Reliability, Resilience and Elasticity Engineering for End-to-End Machine Learning, Hong-Linh Truong, 2020
  4. Hidden Technical Debt in Machine Learning Systems, D. Sculley et al., 2015

Authors and Contributors

DIRK SLAMA
(Editor-in-Chief)

AUTHOR
Dirk Slama is VP and Chief Alliance Officer at Bosch Software Innovations (SI). Bosch SI is spearheading the Internet of Things (IoT) activities of Bosch, the global manufacturing and services group. Dirk has over 20 years of experience in very large-scale distributed application projects and system integration, including SOA, BPM, M2M and most recently IoT. He represents Bosch at the Industrial Internet Consortium and is active in the Industry 4.0 community. He holds an MBA from IMD Lausanne as well as a Diploma Degree in Computer Science from TU Berlin.