Reliability and Resilience: Difference between revisions

Revision as of 23:20, 28 December 2020


AIoT: Reliability & Resilience

Ensuring a high level of robustness is usually a key requirement for AIoT-based systems. Robustness results from two key concepts: reliability and resilience ("R&R"). Reliability is about designing, running and maintaining systems so that they provide consistent and stable services. Resilience, on the other hand, refers to a system's ability to resist adverse events and conditions.

Ensuring reliability and resilience is a broad topic, ranging from basics such as proper error handling at the code level up to geo-replication and disaster recovery. There is also some overlap with Security, as well as with Verification and Validation. This section first discusses reliability and resilience in the context of AIoT DevOps, before looking at the AI and IoT specifics in more detail.
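As an illustration of the code-level end of this range, the following is a minimal sketch of error handling with retries and exponential backoff. It is not taken from the original text: the names `TransientError` and `call_with_retry`, as well as the default values for attempts and delays, are assumptions chosen for illustration only.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    The sleep function is injectable so that the behavior can be tested
    without actually waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can handle the failure.
                raise
            # Delay doubles on every attempt (0.5 s, 1 s, 2 s, ...),
            # plus up to 10% random jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay * 0.1))
```

A caller would wrap a remote call, for example a device-to-cloud upload, in a function that raises `TransientError` on timeouts and pass it to `call_with_retry`. Errors that are not transient are deliberately not caught, so they surface immediately.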

R&R for AIoT DevOps

Traditional IT systems have been using reliability and resilience engineering methods for decades. The emergence of hyper-scaling cloud infrastructures has taken this to new levels. Some of the best practices in this space are well documented, for example Google's Site Reliability Engineering approach for production systems [1].


AIoT-enabled systems face R&R challenges that go beyond those of traditional IT systems.

Figure: R&R Challenges. Layers shown: AI (model robustness, model performance); Cloud/Enterprise (recovery of services and regions, data backups, replication, clustering, geo-redundancy); Network (network load balancing, failover, recovery from network outage); IoT devices/assets (R&R for edge components). Across all layers, R&R DevOps for AIoT: analyze, rate, act.

Robust, AI-based components in AIoT

[2]

Robust AI Components for AIoT
Architecture for robust, AI-enabled AIoT components

Reliability & Resilience for IoT

Building robust IoT solutions

References

  1. Site Reliability Engineering: How Google Runs Production Systems, B. Beyer, C. Jones, J. Petoff, N. R. Murphy (eds.), 2016
  2. Hidden Technical Debt in Machine Learning Systems, D. Sculley et al., 2015

Authors and Contributors

DIRK SLAMA
(Editor-in-Chief)

AUTHOR
Dirk Slama is VP and Chief Alliance Officer at Bosch Software Innovations (SI). Bosch SI is spearheading the Internet of Things (IoT) activities of Bosch, the global manufacturing and services group. Dirk has over 20 years of experience in very large-scale distributed application projects and system integration, including SOA, BPM, M2M and, most recently, IoT. He represents Bosch at the Industrial Internet Consortium and is active in the Industry 4.0 community. He holds an MBA from IMD Lausanne as well as a Diploma Degree in Computer Science from TU Berlin.