
AIoT: Reliability & Resilience

Ensuring a high level of robustness is usually a key requirement for AIoT-based systems. Robustness results from two key concepts: reliability and resilience ("R&R"). Reliability is about designing, running and maintaining systems so that they provide consistent and stable services. Resilience, on the other hand, refers to a system's ability to resist adverse events and conditions.

Ensuring reliability and resilience is a broad topic, ranging from basics such as proper error handling on the code level up to geo-replication and disaster recovery. There are also some overlaps with Security, as well as with Verification and Validation. This section first discusses reliability and resilience in the context of AIoT DevOps, before looking at the AI and IoT specifics in more detail.

R&R for AIoT DevOps

Traditional IT systems have been using reliability and resilience engineering methods for decades. The emergence of hyper-scaling cloud infrastructures has taken this to new levels. Some of the best practices in this space are well documented, for example Google's Site Reliability Engineering approach for production systems [1]. These types of systems need to address challenges such as implementing recovery mechanisms for individual IT services or entire regions, and dealing with data backups, replication, clustering, network load-balancing and fail-over, geo-redundancy, etc.

IoT adds to these challenges, because in IoT parts of the system are implemented not in the data center, but as hardware and software components that are deployed in the field. These field deployments can be based on sophisticated edge platforms, or on very rudimentary embedded controllers. Either way, IT components deployed in the field often play by different rules, if only because it is much harder (or even technically or economically impossible) to access them for any kind of unplanned physical repairs or upgrades.

Finally, AI is adding further challenges in terms of model robustness and model performance. As will be discussed later, some of these challenges are related to the algorithmic complexity of the AI models, while many more arise from complexities of handling the AI development cycle in production environments - and finally adding the specifics of IoT on top of it all.

R&R DevOps for AIoT

Ensuring R&R for AIoT-enabled systems is usually not something that can be established in one go, so it is natural to integrate the R&R perspective into the AIoT DevOps cycle, and specifically into each of its four quadrants. From the R&R perspective, agile development must address not only the application code level, but also the AI/model level and the infrastructure level. Continuous Integration must ensure that all R&R-specific aspects are integrated properly. This can go as far as preparing the system for Chaos Engineering experiments [2]. Continuous Testing must ensure that all R&R concepts are continuously validated. This must include basic system-level R&R, as well as AI- and IoT-specific R&R aspects. Finally, Continuous Delivery / Operations must bring R&R to production. Some companies even go as far as running continuous R&R tests in their production systems (one prominent proponent of this approach is Netflix, where the Chaos Engineering approach originated).
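To make the idea of continuously validating R&R concepts more tangible, the following is a minimal sketch of a fault-injection experiment in the spirit of Chaos Engineering. It is not taken from the AIoT Framework or from any Netflix tooling: the names `FlakyService`, `call_with_retry` and `run_experiment` are hypothetical, and a simple retry loop stands in for whatever resilience mechanism the real system uses.

```python
import random


class FlakyService:
    """Wraps a callable and injects failures with a given probability.

    Hypothetical helper for illustration only; real chaos tooling would
    inject faults at the network or infrastructure level.
    """

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so experiments are repeatable
        self.injected = 0                # number of faults injected so far

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def call_with_retry(func, attempts=5):
    """Resilience mechanism under test: retry on ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def run_experiment(trials=200, failure_rate=0.3, attempts=5, seed=42):
    """Returns (observed success rate, number of injected faults)."""
    service = FlakyService(lambda: "ok", failure_rate, seed)
    successes = 0
    for _ in range(trials):
        try:
            if call_with_retry(service, attempts) == "ok":
                successes += 1
        except ConnectionError:
            pass  # the retry budget was exhausted for this trial
    return successes / trials, service.injected
```

A Continuous Testing stage could run such an experiment on every build and fail the pipeline if the observed success rate drops below an agreed threshold.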

Analyze Rate Act

While it is important that R&R is treated as a normal part of the AIoT DevOps cycle, it usually makes sense to have a dedicated R&R planning mechanism, which looks at R&R specifically. Please note that a similar approach has also been suggested for Security, as well as Verification & Validation. It is important that none of these three areas is viewed in isolation, and that redundancies are avoided.

The AIoT Framework proposes a dedicated Analyze/Rate/Act planning process for R&R, embedded into the AIoT DevOps cycle, as shown in the figure above.

The "Analyze" phase of this process must take two key elements into consideration:

  • R&R metrics / KPIs: A performance analysis and evaluation of the actual live system. This must be updated and used as input for each iteration of the planning process. In the early phases, the focus will be more on how to actually define the R&R KPIs and acquire related data, while in the later phases this information will become an integral part of the R&R planning process.
  • Component / Dependency Analysis (C/DA): Utilizing existing system documentation such as architecture diagrams and flow charts, the R&R team should perform a thorough analysis of all the components in the system, and their potential dependencies. Out of this process, a list of potential R&R Risk Areas should be compiled ("RA list").
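As an illustration of the R&R metrics / KPIs item above, the sketch below derives three commonly used figures from a list of recorded outages: availability, mean time to repair (MTTR) and mean time between failures (MTBF). The choice of these three KPIs, the `Outage` record and the function name `rr_kpis` are assumptions made for this example; the framework itself does not prescribe specific KPIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """One recorded outage, in hours since the start of the observation period."""
    start_h: float
    end_h: float


def rr_kpis(outages, period_h):
    """Compute availability, MTTR and MTBF for one observation period.

    Assumes outages do not overlap and lie fully inside the period.
    """
    if period_h <= 0:
        raise ValueError("period_h must be positive")
    downtime = sum(o.end_h - o.start_h for o in outages)
    uptime = period_h - downtime
    count = len(outages)
    return {
        "availability": uptime / period_h,                    # share of time in service
        "mttr_h": downtime / count if count else 0.0,         # mean time to repair
        "mtbf_h": uptime / count if count else float("inf"),  # mean time between failures
    }


# Example: two outages (2 h and 6 h) in a 30-day period of 720 h.
kpis = rr_kpis([Outage(10, 12), Outage(100, 106)], period_h=720)
```

Tracking such values per planning iteration gives the "Analyze" phase a comparable baseline from one iteration to the next.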

The RA list can contain risks on different levels of granularity, ranging from risks related to the availability of individual micro-services up to risks related to the availability of entire regions. The RA list must also be compared to the results of the Threat Modeling that comes out of the DevSecOps planning process. In some cases, it can even make sense to join these two perspectives into a single list or risk repository.
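One way to support the Component / Dependency Analysis is to capture the dependencies as a simple graph and compute, for each component, which other components would be affected if it failed. The sketch below does this with a plain dictionary. The component names, the function names and the `min_affected` threshold are hypothetical; a real analysis would work from the actual architecture documentation.

```python
def blast_radius(depends_on):
    """Map each component to the set of components that depend on it,
    directly or transitively.

    depends_on: dict of component -> set of components it depends on.
    """
    # Invert the graph: who depends directly on whom?
    dependents = {component: set() for component in depends_on}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    result = {}
    for component in dependents:
        seen, stack = set(), [component]
        while stack:  # depth-first walk over direct dependents
            current = stack.pop()
            for nxt in dependents.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        seen.discard(component)  # ignore self-references caused by cycles
        result[component] = seen
    return result


def risk_area_candidates(depends_on, min_affected=2):
    """Components whose failure affects at least min_affected others,
    largest blast radius first (ties broken by name)."""
    radius = blast_radius(depends_on)
    hits = [(name, len(affected)) for name, affected in radius.items()
            if len(affected) >= min_affected]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Hypothetical AIoT system used only for illustration.
example = {
    "dashboard": {"api"},
    "api": {"model-service", "device-registry"},
    "model-service": {"message-broker"},
    "edge-gateway": {"message-broker"},
    "device-registry": set(),
    "message-broker": set(),
}
candidates = risk_area_candidates(example)
```

In this example the message broker has the largest blast radius, which would make it an obvious candidate for the RA list.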

The "Rate" phase must look at each item from the RA list in detail, including the potential impact of the risk, the likelihood that it occurs, ways of detecting issues related to the risk, and ways of resolving them. Finally, a brief action plan should describe how detection and resolution of issues related to the risk can be automated, including a rough effort estimate. Based on all of the above, a rating for each item in the RA list should be provided.
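The framework does not define a rating formula, so the following sketch borrows the risk priority number from FMEA (impact × likelihood × detectability, each scored from 1 to 10) purely as an illustration. The `RiskArea` record, its field names and the scoring scales are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskArea:
    """One entry of the RA list (hypothetical structure)."""
    name: str
    impact: int         # 1 (negligible) .. 10 (severe)
    likelihood: int     # 1 (rare) .. 10 (almost certain)
    detectability: int  # 1 (detected automatically) .. 10 (practically undetectable)
    effort_days: float  # rough effort for automating detection and resolution

    def rating(self) -> int:
        """FMEA-style risk priority number; higher means more pressing."""
        for score in (self.impact, self.likelihood, self.detectability):
            if not 1 <= score <= 10:
                raise ValueError("scores must be in the range 1..10")
        return self.impact * self.likelihood * self.detectability


def rate(ra_list):
    """Return the RA list ordered from highest to lowest rating."""
    return sorted(ra_list, key=lambda ra: ra.rating(), reverse=True)


# Illustrative entries only; the scores are made up.
ra_list = [
    RiskArea("region outage", impact=10, likelihood=2, detectability=3, effort_days=20),
    RiskArea("undetected model drift", impact=7, likelihood=6, detectability=8, effort_days=10),
]
ranked = rate(ra_list)
```

Any other scheme would work equally well, as long as it is applied consistently so that ratings remain comparable across planning iterations.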

The "Act" phase starts with prioritizing and scheduling the most pressing issues, based on the individual ratings. Highly rated issues must then be transferred to the general development backlog. This will likely include additional analysis of dependencies on backlog items that are more related to the application development side of things.
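The prioritization step of the "Act" phase can be sketched as a simple selection: take the rated items in descending order of rating and move them to the backlog as long as they exceed a minimum rating and fit into the capacity available for R&R work. The function name `plan_iteration`, the threshold and the capacity model are assumptions for illustration, not part of the framework.

```python
def plan_iteration(rated_items, min_rating, capacity_days):
    """Split rated RA items into backlog candidates and deferred items.

    rated_items: iterable of (name, rating, effort_days) tuples.
    Items are considered from highest to lowest rating; an item is moved
    to the backlog if its rating reaches min_rating and its effort still
    fits into the remaining capacity.
    """
    backlog, deferred = [], []
    remaining = capacity_days
    ordered = sorted(rated_items, key=lambda item: item[1], reverse=True)
    for name, rating, effort_days in ordered:
        if rating >= min_rating and effort_days <= remaining:
            backlog.append(name)
            remaining -= effort_days
        else:
            deferred.append(name)
    return backlog, deferred


# Illustrative input only; names, ratings and efforts are made up.
items = [
    ("message broker single point of failure", 336, 10),
    ("region outage", 60, 20),
    ("sensor clock skew", 120, 5),
    ("gateway update rollback", 200, 8),
]
backlog, deferred = plan_iteration(items, min_rating=100, capacity_days=20)
```

Deferred items stay on the RA list and are reconsidered in the next planning iteration, when ratings and capacity may have changed.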

Robust, AI-based components in AIoT

The AI community is still in the early stages of addressing reliability, resilience and related topics such as robustness and explainability of AI-based systems.

In the widely cited paper on [3]

Robust AI Components for AIoT
Architecture for robust, AI-enabled AIoT components

Reliability & Resilience for IoT

Building robust IoT solutions

References

  1. Site Reliability Engineering: How Google Runs Production Systems, N. Murphy et al., 2016
  2. Chaos Engineering, Wikipedia
  3. Hidden Technical Debt in Machine Learning Systems, D. Sculley et al., 2015


Authors and Contributors

DIRK SLAMA
(Editor-in-Chief)

AUTHOR
Dirk Slama is VP and Chief Alliance Officer at Bosch Software Innovations (SI). Bosch SI is spearheading the Internet of Things (IoT) activities of Bosch, the global manufacturing and services group. Dirk has over 20 years of experience in very large-scale distributed application projects and system integration, including SOA, BPM, M2M and most recently IoT. He represents Bosch at the Industrial Internet Consortium and is active in the Industry 4.0 community. He holds an MBA from IMD Lausanne as well as a Diploma Degree in Computer Science from TU Berlin.