Reliability and Resilience: Difference between revisions


Revision as of 23:43, 28 December 2020


AIoT: Reliability & Resilience

Ensuring a high level of robustness for AIoT-based systems is usually a key requirement. Robustness is a result of two key concepts: reliability and resilience ("R&R"). Reliability is about designing, running and maintaining systems to provide consistent and stable services. Resilience, on the other hand, refers to a system's ability to resist adverse events and conditions.

Ensuring reliability and resilience is a broad topic, ranging from basics such as proper error handling at the code level up to geo-replication and disaster recovery. There is also some overlap with Security, as well as Verification and Validation. This section first discusses reliability and resilience in the context of AIoT DevOps, before looking at the AI and IoT specifics in more detail.

R&R for AIoT DevOps

Traditional IT systems have been using reliability and resilience engineering methods for decades. The emergence of hyper-scaling cloud infrastructures has taken this to new levels. Some of the best practices in this space are well documented, for example Google's Site Reliability Engineering approach for production systems [1]. These types of systems need to address challenges such as implementing recovery mechanisms for individual IT services or entire regions, and dealing with data backups, replication, clustering, network load balancing and fail-over, and geo-redundancy.
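To make the fail-over idea above more concrete, the following is a minimal Python sketch, not a production implementation. It assumes that each replica of a service can be represented as a plain callable; the function name `call_with_failover` and its parameters are hypothetical and chosen for illustration only. Each replica is retried a few times with exponential backoff before the caller fails over to the next one.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(
    replicas: Sequence[Callable[[], T]],
    retries_per_replica: int = 2,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each replica in order (hypothetical helper, illustration only).

    A replica is retried with exponential backoff before the caller
    fails over to the next replica in the sequence.
    """
    last_error: Optional[Exception] = None
    for replica in replicas:
        for attempt in range(retries_per_replica):
            try:
                return replica()
            except Exception as exc:  # deliberately broad in this sketch
                last_error = exc
                # Back off: base_delay_s, then 2x, 4x, ... per attempt.
                sleep(base_delay_s * (2 ** attempt))
    # Every replica was exhausted; surface the last underlying error.
    raise RuntimeError("all replicas failed") from last_error
```

In a real system the replicas would typically be clients for service instances in different zones or regions, and the broad `except` would be narrowed to the errors that are actually safe to retry.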

IoT adds to these challenges, because in IoT parts of the system are implemented not in the data center, but as hardware and software components deployed in the field. These field deployments can be based on sophisticated edge platforms, or on very rudimentary embedded controllers. Either way, IT components deployed in the field often play by different rules, if only because it is much harder (or even technically or economically impossible) to access them for any kind of unplanned physical repair or upgrade.

Finally, AI adds further challenges in terms of model robustness and model performance. As will be discussed later, some of these challenges are related to the algorithmic complexity of the AI models, while many more arise from the complexity of handling the AI development cycle in production environments, with the specifics of IoT added on top.

R&R DevOps for AIoT

Ensuring R&R for AIoT-enabled systems is usually not something that can be established in one go, so it makes sense to integrate the R&R perspective into the AIoT DevOps cycle, and specifically into each of the four AIoT DevOps quadrants. From the R&R perspective, agile development must address not only the application code level, but also the AI/model level and the infrastructure level. Continuous Integration must ensure that all R&R-specific aspects are integrated properly. This can go as far as preparing the system for Chaos Engineering experiments.
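As an illustration of what preparing a system for Chaos Engineering experiments might look like at the code level, the following is a minimal Python sketch under stated assumptions. The names `with_fault_injection` and `InjectedFault` are hypothetical and do not refer to any specific chaos engineering tool. The wrapper makes a chosen fraction of calls fail on purpose, so that the surrounding error handling can be exercised in a controlled experiment.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised deliberately by the experiment, never by the wrapped code."""


def with_fault_injection(
    func: Callable[..., T],
    failure_rate: float,
    rng: random.Random,
    enabled: bool = True,
) -> Callable[..., T]:
    """Wrap func so that roughly failure_rate of the calls fail on purpose.

    Hypothetical helper for illustration. The `enabled` flag acts as a
    kill switch, so the experiment can be turned off without redeploying
    the wrapped code.
    """

    def wrapper(*args, **kwargs):
        # rng.random() is uniform in [0.0, 1.0), so a rate of 0.0 never
        # injects a fault and a rate of 1.0 always does.
        if enabled and rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper
```

Passing the random number generator in explicitly keeps experiments reproducible: a seeded `random.Random` yields the same fault pattern on every run, which helps when comparing system behavior before and after a fix.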


Figure: Analyze, Rate, Act

Robust, AI-based components in AIoT

[2]

Robust AI Components for AIoT
Architecture for robust, AI-enabled AIoT components

Reliability & Resilience for IoT

Building robust IoT solutions

References

  1. Site Reliability Engineering: How Google Runs Production Systems, N. Murphy et al., 2016
  2. Hidden Technical Debt in Machine Learning Systems, D. Sculley et al., 2015

Authors and Contributors

DIRK SLAMA
(Editor-in-Chief)

AUTHOR
Dirk Slama is VP and Chief Alliance Officer at Bosch Software Innovations (SI). Bosch SI is spearheading the Internet of Things (IoT) activities of Bosch, the global manufacturing and services group. Dirk has over 20 years of experience in very large-scale distributed application projects and system integration, including SOA, BPM, M2M and most recently IoT. He represents Bosch at the Industrial Internet Consortium and is active in the Industry 4.0 community. He holds an MBA from IMD Lausanne as well as a Diploma Degree in Computer Science from TU Berlin.