Reliability and Resilience
Ensuring a high level of robustness for AIoT-based systems is usually a key requirement. Robustness is a result of two key concepts: reliability and resilience ("R&R"). Reliability concerns designing, running and maintaining systems to provide consistent and stable services. Resilience refers to a system's ability to resist adverse events and conditions.
Ensuring reliability and resilience is a broad topic that ranges from basics such as proper error handling on the code level up to georeplication and disaster recovery. In addition, there are some overlaps with Security, as well as Verification and Validation. This section first discusses reliability and resilience in the context of AIoT DevOps and then looks at the AI and IoT specifics in more detail.
R&R for AIoT DevOps
Traditional IT systems have been using reliability and resilience engineering methods for decades. The emergence of hyperscaling cloud infrastructures has taken this to new levels. Some of the best practices in this space are well documented, for example, Google's Site Reliability Engineering approach for production systems. These types of systems need to address challenges such as implementing recovery mechanisms for individual IT services or entire regions, dealing with data backups, replication, clustering, network load-balancing and failover, georedundancy, etc.
The IoT adds to these challenges because parts of the system are implemented not in the data center but rather as hardware and software components that are deployed in the field. These field deployments can be based on sophisticated EDGE platforms or on some very rudimentary embedded controllers. In either case, IT components deployed in the field often play by different rules, if only because it is much harder (or even technically or economically impossible) to access them for any kind of unplanned physical repairs or upgrades.
Finally, AI is adding further challenges in terms of model robustness and model performance. As will be discussed later, some of these challenges are related to the algorithmic complexity of the AI models, while many more arise from complexities of handling the AI development cycle in production environments, and finally adding the specifics of the IoT on top of it all.
Ensuring R&R for AIoT-enabled systems is usually not something that can be established in one step, so it makes sense to integrate the R&R perspective into the AIoT DevOps cycle, and specifically into each of the four AIoT DevOps quadrants. From the R&R perspective, agile development must address not only the application code level but also the AI/model level, as well as the infrastructure level. Continuous Integration must ensure that all R&R-specific aspects are integrated properly. This can go as far as preparing the system for Chaos Engineering experiments. Continuous Testing must ensure that all R&R concepts are continuously validated. This must include basic system-level R&R, as well as AI and IoT-specific R&R aspects. Finally, Continuous Delivery/Operations must bring R&R to production. Some companies go as far as conducting continuous R&R tests in their production systems (one prominent proponent of this approach is Netflix, where the Chaos Engineering approach originated).
R&R Planning: Analyze, Rate, Act
While it is important that R&R is treated as a normal part of the AIoT DevOps cycle, it usually makes sense to have a dedicated R&R planning mechanism, which looks at R&R specifically. Note that a similar approach has also been suggested for Security, as well as Verification & Validation. It is important that none of these three areas is viewed in isolation, and that redundancies are avoided.
The AIoT Framework proposes a dedicated Analyze/Rate/Act planning process for R&R, embedded into the AIoT DevOps cycle, as shown in the preceding figure.
The "Analyze" phase of this process must take two key elements into consideration:
- R&R metrics/KPIs: A performance analysis and evaluation of the actual live system. This must be updated and used as input for each iteration of the planning process. In the early phases, the focus will be more on how to actually define the R&R KPIs and acquire related data, while in the later phases this information will become an integral part of the R&R planning process.
- Component/Dependency Analysis (C/DA): Utilizing existing system documentation such as architecture diagrams and flowcharts, the R&R team should perform a thorough analysis of all the components in the system, and their potential dependencies. From this process, a list of potential R&R Risk Areas should be compiled ("RA list").
The RA list can contain risks at different levels of granularity, ranging from risks related to the availability of individual microservices up to risks related to the availability of entire regions. The RA list must also be compared to the results of the Threat Modeling that comes out of the DevSecOps planning process. In some cases, it can even make sense to join these two perspectives into a single list or risk repository.
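As an illustration of the Component/Dependency Analysis, the following minimal sketch walks a dependency graph to find every component that would be affected if one component failed. The component names and the `DEPENDS_ON` structure are hypothetical; in a real project they would be derived from the architecture documentation.

```python
# Hypothetical component graph: each component maps to the components it depends on.
DEPENDS_ON = {
    "dashboard": ["asset-api"],
    "asset-api": ["timeseries-db", "auth"],
    "ota-service": ["auth", "device-registry"],
    "timeseries-db": [],
    "auth": [],
    "device-registry": [],
}


def affected_by(failed, depends_on):
    """Return all components that directly or transitively depend on `failed`."""
    affected = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for component, deps in depends_on.items():
            if current in deps and component not in affected:
                affected.add(component)
                frontier.append(component)
    return affected


# Components with many dependents are natural candidates for the RA list.
for component in DEPENDS_ON:
    print(component, "->", sorted(affected_by(component, DEPENDS_ON)))
```

In this toy graph, a failure of `auth` reaches three other components, which suggests it deserves an entry in the RA list even though it is a single service.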
The "Rate" phase must look at each item from the RA list in detail, including the potential impact of the risk, the likelihood that it occurs, ways of detecting issues related to the risk, and ways for resolving them. Finally, a brief action plan should describe a plan for automating the detection and resolution of issues related to the risk, including a rough effort estimate. Based on all of the above, a rating for each item in the RA list should be provided.
The "Act" phase starts with prioritizing and scheduling the most pressing issues based on the individual ratings. Highly rated issues must then be transferred to the general development backlog. This will likely include additional analysis of dependencies on backlog items that are more closely related to application development.
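The Rate and Act phases can be sketched in a few lines of code. The example below is only one possible approach: the 1-to-5 scales, the field names, the multiplicative scoring rule (borrowed from the risk priority number used in FMEA), and the threshold are all assumptions, not something the AIoT Framework prescribes.

```python
from dataclasses import dataclass


@dataclass
class RiskArea:
    """One entry of the RA list (all field names are illustrative assumptions)."""
    name: str
    impact: int          # 1 (negligible) .. 5 (severe), assumed scale
    likelihood: int      # 1 (rare) .. 5 (frequent), assumed scale
    detectability: int   # 1 (detected automatically) .. 5 (hard to detect), assumed scale
    effort_days: float   # rough effort estimate for automating detection and resolution

    def rating(self) -> int:
        # Assumed scoring rule: product of the three factors (FMEA-style).
        return self.impact * self.likelihood * self.detectability


def prioritize(ra_list, threshold=40):
    """Act phase: return items rated at or above the threshold, highest first."""
    selected = [ra for ra in ra_list if ra.rating() >= threshold]
    return sorted(selected, key=lambda ra: ra.rating(), reverse=True)


ra_list = [
    RiskArea("Region outage", impact=5, likelihood=1, detectability=1, effort_days=20),
    RiskArea("OTA update fails on asset", impact=4, likelihood=3, detectability=4, effort_days=8),
    RiskArea("Model input drift", impact=3, likelihood=4, detectability=5, effort_days=5),
]

backlog_candidates = prioritize(ra_list)
for ra in backlog_candidates:
    print(f"{ra.name}: rating {ra.rating()}, effort ~{ra.effort_days} days")
```

The items returned by `prioritize` are the ones that would move to the general development backlog; the effort estimate is carried along so that scheduling can weigh rating against cost.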
Minimum Viable R&R
Similar to the discussion on Minimum Viable Security, project management must carefully strike a balance between investments in R&R and other investments. A system that does not support basic R&R will quickly frustrate users or, even worse, result in lost business or actual physical harm. However, especially in the early stages of market exploration, the focus must be on features and usability. Determining when to invest in R&R as the system matures is a key challenge for the team.
Robust, AI-based components in AIoT
The AI community is still in the early stages of addressing reliability, resilience and related topics such as robustness and explainability of AI-based systems. H. Truong provides the following definitions from the ML perspective:
- Robustness: Dealing with imbalanced data and learning in open-world (out-of-distribution) situations
- Reliability: Reliable learning and reliable inference in terms of accuracy and reproducibility of ML models; uncertainties/confidence in inferences; reliable ML service serving
- Resilience: Bias in data, adversary attacks in ML, resilience learning, computational Byzantine failures
In the widely cited paper on Hidden Technical Debt in Machine Learning Systems, the authors emphasize that only a small fraction of a real-world ML system is composed of ML code, while the required surrounding infrastructure is vast and complex, including configuration, data collection, feature extraction, data verification, machine resource management, analysis tools, process management tools, serving infrastructure, and monitoring.
The AIoT Framework suggests differentiating between the online and offline perspectives of the AI-based components in the AIoT system. The offline perspective must cover data sanitation, robust model design, and model verification. The online perspective must include runtime checks (e.g., feature values out of range or invalid outputs), an approach for graceful model degradation, and runtime monitoring. Between the online and offline perspectives, a high level of automation must be achieved, covering everything from training to testing and deployments.
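The online perspective can be illustrated with a small wrapper around model inference. This is a minimal sketch under stated assumptions: the feature names, their valid ranges, the probability-style output, and the rule-based fallback are all hypothetical, and the model is assumed to be any callable that takes a feature dictionary.

```python
import math

# Assumed valid ranges per feature; in practice these would come from the training data.
FEATURE_RANGES = {"temperature_c": (-40.0, 125.0), "vibration_mm_s": (0.0, 50.0)}


def features_in_range(features):
    """Runtime input check: every expected feature is present, finite, and within range."""
    for name, (low, high) in FEATURE_RANGES.items():
        value = features.get(name)
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            return False
        if not low <= value <= high:
            return False
    return True


def output_is_valid(score):
    """Runtime output check: the model is assumed to return a probability in [0, 1]."""
    return isinstance(score, (int, float)) and math.isfinite(score) and 0.0 <= score <= 1.0


def fallback_rule(features):
    """Simple rule used when the model cannot be trusted (hypothetical threshold)."""
    temp = features.get("temperature_c")
    if isinstance(temp, (int, float)) and math.isfinite(temp):
        return 1.0 if temp > 100.0 else 0.0
    return 1.0  # no usable reading: assume the conservative answer


def predict_with_degradation(model, features):
    """Return (score, source), where source records which path produced the score."""
    if not features_in_range(features):
        return fallback_rule(features), "fallback:invalid_input"
    try:
        score = model(features)
    except Exception:
        return fallback_rule(features), "fallback:model_error"
    if not output_is_valid(score):
        return fallback_rule(features), "fallback:invalid_output"
    return score, "model"


def toy_model(features):
    return 0.2  # stands in for a real inference call


print(predict_with_degradation(toy_model, {"temperature_c": 60.0, "vibration_mm_s": 3.0}))
print(predict_with_degradation(toy_model, {"temperature_c": 300.0, "vibration_mm_s": 3.0}))
```

The `source` label returned alongside each score is what runtime monitoring would count: a rising share of `fallback:*` results is a signal that the model or its inputs need attention.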
Mapping all of the above R&R elements to an actual AIoT system architecture is not an easy feat. Acquiring high-quality test data from assets in the field can be difficult. Managing the offline AI development and experimentation cycle can rely on standard AI engineering and automation tools. However, model deployments to assets in the field rely on nonstandard mechanisms, e.g., OTA (over-the-air) updates from the IoT toolbox. Dealing with multiple instances of models deployed onto multiple assets (or EDGE instances) in the field is something that goes beyond standard AI processing in the cloud. Finally, gathering -- and making sense of -- monitoring data from multiple instances/assets is beyond today's well-established AI engineering principles.
Reliability & Resilience for IoT
Finally, we need to address the IoT specifics of Reliability & Resilience. For the backend (cloud or enterprise), of course, most of the standard R&R aspects of Internet/cloud/enterprise systems apply. Since the IoT adds new categories of clients (i.e., assets) to access the backends, this has to be taken into consideration from an R&R perspective. For example, the IoT backend must be able to cope with malfunctioning or potentially malicious behaviour of EDGE or embedded components.
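One way a backend can protect itself against misbehaving field components is to validate and throttle incoming messages per device. The sketch below assumes a hypothetical payload schema (a dictionary with a numeric `value` field) and a simple sliding-window limit; the class name and default limits are illustrative only.

```python
from collections import defaultdict, deque


class DeviceGuard:
    """Rejects telemetry from devices that send malformed payloads or too many messages."""

    def __init__(self, max_messages=5, window_s=10.0):
        self.max_messages = max_messages
        self.window_s = window_s
        self._recent = defaultdict(deque)  # device_id -> timestamps of accepted messages

    def accept(self, device_id, payload, now):
        # Payload check: assumed schema is a dict with a numeric "value" field.
        if not isinstance(payload, dict) or not isinstance(payload.get("value"), (int, float)):
            return False
        recent = self._recent[device_id]
        # Drop timestamps that have left the sliding window.
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.max_messages:
            return False
        recent.append(now)
        return True


guard = DeviceGuard(max_messages=2, window_s=10.0)
for t in (0.0, 1.0, 2.0, 10.0):
    print(t, guard.accept("edge-17", {"value": 21.5}, now=t))
```

Passing the current time in as `now` keeps the guard deterministic and easy to test; a production version would also need to record rejections so that operators can identify the faulty or malicious device.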
For the IoT components deployed in the field, environmental factors can play a significant role. Hardware components often require extra ruggedness, which can be key from the R&R perspective. Additionally, depending on the complexity of the EDGE/embedded functions, many of the typical R&R features found in modern cloud environments will have to be reinvented to ensure R&R for components deployed in the field.
Finally, for many IoT systems -- especially where assets can physically move -- there is a much higher chance of losing connectivity between the asset and the backend. This typically requires that both backend and field-based components implement a certain degree of autonomy. For example, an autonomous vehicle must be able to function in the field without access to additional data (e.g., map data) from the cloud. Equally, a backend asset monitoring solution must be able to function even if the asset is currently not connected. For example, asset status information must be augmented with a timestamp that indicates when this information was last updated.
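The timestamp idea can be sketched as follows. The `AssetStatus` fields and the five-minute freshness limit are assumptions chosen for illustration; the point is only that the backend keeps serving the last known status and makes its age explicit.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AssetStatus:
    asset_id: str
    state: str
    last_updated: datetime  # time the backend last received this status from the asset


def is_stale(status, now, max_age=timedelta(minutes=5)):
    """True if the last update is older than the (assumed) freshness limit."""
    return now - status.last_updated > max_age


def describe(status, now):
    """Render the last known status together with its age, even if the asset is offline."""
    freshness = "stale" if is_stale(status, now) else "current"
    return f"{status.asset_id}: {status.state} ({freshness}, last updated {status.last_updated.isoformat()})"


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
connected = AssetStatus("truck-1", "moving", now - timedelta(minutes=2))
offline = AssetStatus("truck-2", "parked", now - timedelta(hours=1))
print(describe(connected, now))
print(describe(offline, now))
```

A monitoring dashboard built this way does not fail when an asset drops off the network; it simply shows the last known state and flags it as stale.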
- Site Reliability Engineering: How Google Runs Production Systems, N. Murphy et al., 2016
- Chaos Engineering, Wikipedia
- R3E – An Approach to Robustness, Reliability, Resilience and Elasticity Engineering for End-to-End Machine Learning, Hong-Linh Truong, 2020
- Hidden Technical Debt in Machine Learning Systems, D. Sculley et al., 2015