This case study highlights how Bürkert Fluid Control Systems, a successful I4.0 SME, masters the AIoT long tail by applying the BaseABC method. The case study was authored by Dr. Nikolai Hlubek, who works as a Senior Data Scientist at Bürkert Fluid Control Systems and develops new data-driven products. The author has a PhD in physics and has been using data science for more than 15 years to tackle various topics. As an example, before joining Bürkert Fluid Control Systems, he developed a real-time ionospheric monitoring service for the German Aerospace Center.

Introduction to Bürkert

Bürkert Fluid Control Systems develops and manufactures modules and systems for the analysis and control of fluids and gases. Examples of typical products include large process valves for the food and beverage industry, small electrodynamic valves for pharmaceutical applications, mass flow controllers, sensors for contactless flow measurement based on surface acoustic waves, and sensors to measure the water quality of drinking water. Bürkert is a 100% family-owned company that employs approximately 3000 people, has a consolidated turnover of ~560 M€, is headquartered in Ingelfingen (Germany), and has locations in 36 countries worldwide.

Bürkert products have a moderate level of complexity, which means they can be developed in small project teams of usually less than 10 people over the course of one year. However, Bürkert has a very large portfolio of such products. This portfolio structure places Bürkert at the long tail of AIoT, where it has to manage product variants in a very efficient way. Therefore, product development at Bürkert is truly a good testbed for any AIoT long-tail development process, as the entire process is repeated many times in a relatively short period of time, due to the parallel development of many small products and the relatively short development time. The approach is now a well-documented best practice at Bürkert, as will be explained in the following.

BaseABC

BaseABC method for I4.0 data science projects

The figure above illustrates the BaseABC method that Bürkert uses as a workflow for its data science projects. The workflow does not distinguish between information visualization, algorithm development or machine learning, as the fundamental steps are the same in all cases. The workflow is highly iterative: it is restarted whenever a new insight from one step has implications for a previous step. For a new idea, the initial completion of the workflow will deliver a technological demonstrator. Successive iterations will build on this and deliver a prototype, a minimal viable product and finally a saleable product. Due to its iterative nature, the workflow can be stopped at any time with a minimum of time and money invested, should it become clear that the data science project cannot be transitioned into a sustainable business model.

The workflow starts with a business question. Such a business question could be a pain point for a customer that can be mitigated by a new product that uses additional data. It could also be an idea for improving an existing product by using data science tools, or an enhancement of an existing data-driven product in the field in the form of continuous learning. The next step is to define a data collection strategy and to implement that strategy, i.e., acquire data. We check whether the data are already available, whether we need to generate them, what quality we require and what tools are necessary to collect the data. We then collect the data using the defined strategy. Next, storage and access to the data must be considered. We store the acquired data in such a way that access in the following steps is easy, computationally efficient and future-proof. An exploratory analysis of the acquired data follows. A domain expert checks the data for consistency and completeness. He or she investigates isolated issues with some of the datasets, if any, and decides whether these datasets need to be acquired again.

If all these prerequisites are in place, a data scientist begins the work in the advanced analytics step. He or she tries to find a solution to the business question with the help of data science tools. Once the data scientist has found a solution, we build a product as quickly as possible. The first iteration is, of course, a technology demonstrator. A prototype, a minimal viable product and the actual saleable product follow in consecutive iterations. When the product is ready, we communicate the result. In an early iteration, this is an internal review of the technology demonstrator. In later iterations, a pilot customer obtains a prototype for testing and feedback. Finally, this step will mark the introduction of a new data-driven product.


Bürkert valve type 6724 in a typical dosing application.


An example of a product that we developed at Bürkert using this workflow is a diagnostic application for a solenoid valve (Bürkert type 6724). These valves are small electrodynamic valves for different media, such as air and water, and can be used in dosing applications as shown in the picture above. In a typical variant, they are held closed by a spring and opened for as long as a current is applied. The behavior of the current during the opening of the valve - the so-called inrush current - can be used for diagnostics. The counter-electromotive force is a part of the inrush current and is proportional to the actuator movement. Therefore, it is possible to assess the dynamics of the actuator by a cheap current measurement using the actuator of the valve as a sensor. In particular, it is possible to check if the valve truly opened or if some blockage occurred without the need for any external sensors.

Raw current measurement for the first few milliseconds when switching an electromagnetic valve (Bürkert type 6724). The blue solid and gray dashed-dotted lines show the switching current for a working valve. The orange dashed line shows the switching current for a blocked valve.

The graph shows an example of such current curves. It shows two curves where the valve fully opens (100% stroke) and a curve where the valve only partially opens (50% stroke), i.e., where the fluidic channel is blocked. We can see that all curves are quite different from each other. In particular, the two good-state curves (100% stroke) differ in shape. The reason for this is that the counter-electromotive force is only a part of the inrush current, and its exact shape depends on many internal parameters (diaphragm type, coil type, …) and external parameters (temperature, pressure, …). One way to estimate the movement of the actuator is a curve-shape analysis of the inrush current, based on a data set that contains measurements with suitable combinations of all these parameters. This is a standard machine learning task. We will use this example in the following to explain the workflow in detail.

Pipeline

Bürkert envisions the workflow as a pipeline. The complete workflow is started over every time a step yields new insights that require an adjustment of previous steps. We then ask for each step whether it needs to be adjusted based on the new insights. For this to be feasible, the implementation of the workflow must be automated as much as possible. Any manual task required during the workflow represents a pain point that makes it unlikely that the workflow will actually work as a pipeline.

In other words, the time, effort and cost of going through the workflow repeatedly must be as low as possible. Otherwise, people will find excuses and either skip necessary adjustments or build workarounds that introduce a long-term maintenance burden. Of course, not every step can be automated, e.g., rechecking the business question. However, most of the time this is not necessary, and it is sufficient if the most time-consuming steps of the final solution (acquire data, storage and access, and advanced analytics) are largely automated. In the case of the business question, it is usually sufficient to check whether the new insight affects the business question, which is most often not the case. In the following, we illustrate this idea of a pipeline with examples that highlight how a step can lead to new insights that require an adjustment of the previous steps:

After coming up with a business question, we need to consider how to acquire the necessary data. It may turn out that it is not possible to obtain such data. For example, a client might not be willing to share the required data because it contains their trade secrets. Without an update to the business question, which must now include a way to protect the client’s secrets, the workflow cannot continue.

A domain expert that explores the acquired data might notice that the data have certain issues. The measured data could show drift due to changes in ambient temperature. In this case, the data must be measured again under controlled environmental conditions, as any follow-up analysis of bad data is scientifically unsound and a waste of time for the data scientist.

After we built a prototype and provided it to a pilot customer, they may request an additional feature that requires additional data and an update to the solution implemented in the advanced analytics step. For example, the customer may want to operate the device in a different temperature range. We then need to acquire additional data in this temperature range and adapt our solution accordingly, hence rerunning the pipeline.

All these examples show that the data science workflow is repeated many times. This justifies the requirement that the workflow must be automated as much as possible, so that simple adjustments to the product stay simple in the implementation. We want to react quickly to customer requests without additional effort.

Up until a minimal viable product is ready, each step and iteration is a stop criterion for the whole project. Therefore, we strive to keep the number of complex tools at a minimum in order to keep the initial costs and the maintenance burden small.

For our solenoid valve diagnostics project at Bürkert, we automated the data acquisition by developing an automated measurement setup using LabVIEW. We made the exploratory analysis easy for the domain expert by providing a tool for quick visualization of current curves. We automated the advanced analytics step so that any newly stored data would automatically be detected and fed into the classifier training. The classifier could then be deployed automatically to the technological demonstrator and later to the prototype.
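As an illustration of what such automation can look like, the following is a minimal sketch of a "retrain on whatever has been stored" step in Python. The directory layout, file format, feature extraction and model choice are assumptions for the example, not Bürkert's actual implementation.

```python
# Minimal sketch of an automated "retrain on new data" step (illustrative only).
# Assumes each measurement file is an .npz archive with a current curve and a label;
# the directory layout, file format and model choice are hypothetical.
from pathlib import Path

import numpy as np
from joblib import dump
from sklearn.ensemble import RandomForestClassifier

DATA_DIR = Path("/data/valve_6724/measurements")   # hypothetical file server mount
MODEL_OUT = Path("classifier.joblib")

def load_dataset(data_dir: Path):
    """Collect features and labels from every stored measurement file."""
    features, labels = [], []
    for f in sorted(data_dir.glob("*.npz")):
        with np.load(f) as m:
            curve = m["current"]                # raw inrush current samples
            labels.append(int(m["label"]))      # 1 = valve opened, 0 = blocked
            # toy feature vector; a real pipeline would use a curve-shape analysis
            features.append([curve.max(), curve.mean(), curve.argmax()])
    return np.array(features), np.array(labels)

if __name__ == "__main__":
    X, y = load_dataset(DATA_DIR)               # any newly stored file is picked up automatically
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    dump(clf, MODEL_OUT)                        # artifact handed on to the demonstrator/prototype
```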

Details for Individual Steps

In the following, we will explain the individual steps of the data science pipeline in more detail.

Business Question

The business question is the starting point for the data science pipeline. A good business question has a well-defined idea and measurable objectives. It includes a basic idea for the acquisition of useful data.

Example Project

Task of the data science project that we use as an example to illustrate BaseABC. The input to and required output of the classifier are shown. The parameters are not input to the classifier, but complications that make the classification more complex.

The business question for our example project was as follows: Is it possible to build a classifier that reports successful valve opening if more than 90% flow is achieved and reports an error for any smaller flow by monitoring the inrush current only? The classifier should work for all conditions allowed by the valve datasheet. Data acquisition should be done by laboratory measurements. This question is visualized in the figure above. It shows the inputs and outputs of the classifier that is to be developed. The parameters are not input to the classifier but complications that make the classification more complex.
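The measurable objective of this business question can be written down directly. The following sketch only shows how the 90% flow threshold turns an independent flow measurement into the ground-truth label that the classifier has to reproduce from the inrush current alone; the function name and data format are illustrative.

```python
import numpy as np

FLOW_THRESHOLD = 0.90   # from the business question: "successful" means more than 90% flow

def label_from_flow(relative_flow: np.ndarray) -> np.ndarray:
    """Derive the ground-truth label from an independent flow measurement.

    1 = valve opened successfully (flow above 90%), 0 = error (any smaller flow).
    The classifier itself must reproduce these labels from the inrush current alone.
    """
    return (relative_flow > FLOW_THRESHOLD).astype(int)

# Example: three reference measurements at 100%, 95% and 50% of nominal flow.
print(label_from_flow(np.array([1.00, 0.95, 0.50])))   # -> [1 1 0]
```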

Acquire Data

We define a strategy for the acquisition of the data and metadata. This means we check whether the data are already available, whether we need to generate them, what quality we require and what tools are necessary to collect the data. Then, we acquire the data and ensure accurate tracking of provenance.

Design of Experiment

If we acquire the data through measurements, each experiment costs time and money. To minimize the number of required experiments while maximizing the variance in the dependent variables, design of experiments methods can be used.
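As a minimal illustration, a full-factorial design simply enumerates all combinations of the chosen parameter levels; more advanced designs (e.g., fractional factorial or Latin hypercube) reduce the number of experiments further. The parameter names and levels below are illustrative and not the ones used in the example project.

```python
# A minimal sketch of a full-factorial experimental design (parameter levels are illustrative).
from itertools import product

levels = {
    "temperature_C": [5, 25, 60],
    "pressure_bar": [0.5, 1.0, 2.0],
    "stroke_percent": [50, 100],
}

# Every combination of levels becomes one planned measurement.
design = [dict(zip(levels, combo)) for combo in product(*levels.values())]
print(len(design), "planned experiments")   # 3 * 3 * 2 = 18
```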

Data Provenance

Metadata will always be necessary to understand our data and to track how we obtained our data. When we acquire the data, we need to document our setup and the process of acquisition. Both documents will serve as an explanation for our data. For the setup, we need to document which devices and tools we used, how we connected the devices and so on. For the process, we need to document how we acquired the data, in which order we acquired the data, when we acquired the data, which environmental conditions we used and so on. In short, we need to use good laboratory practice.

We save the resulting data sets and corresponding metadata in a structured form. Data sets and metadata must be automatically readable and parsable.
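A minimal sketch of such structured storage is shown below: the raw data and a machine-readable metadata record (including a label assigned at acquisition time) are written side by side. The file format and metadata fields are assumptions for the example; the example project described later stores data and metadata in a single measurement file instead.

```python
# Sketch: store each measurement together with machine-readable metadata (fields are illustrative).
import json
from datetime import datetime, timezone
from pathlib import Path

import numpy as np

def save_measurement(curve: np.ndarray, meta: dict, out_dir: Path, name: str) -> None:
    """Write the raw data and a JSON metadata sidecar next to each other."""
    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / f"{name}.npy", curve)
    meta = dict(meta, acquired_at=datetime.now(timezone.utc).isoformat())
    (out_dir / f"{name}.json").write_text(json.dumps(meta, indent=2))

save_measurement(
    curve=np.zeros(1000),                                  # placeholder current curve
    meta={"valve_type": "6724", "temperature_C": 25, "pressure_bar": 1.0,
          "setup": "lab rig A", "label": "opened"},        # label recorded at acquisition time
    out_dir=Path("measurements"),
    name="2022-06-27_run_0001",
)
```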

Labeling

We label data during the data acquisition step if labeling is necessary and if it is possible. Labeling at a later stage always introduces the danger of hindsight bias.

Example Project

Valve experts at Bürkert defined which ranges of parameters and which fault states are relevant. We modified a valve so that we could simulate fault states. We automated the acquisition of the current curves. In total, we acquired approximately 50000 current curves for 12 parameters. Our automated setup stores the data (current curves) and metadata (internal and external parameters of the measurement) in the same measurement file. We generate one file of approximately 3 MB per measurement, which resulted in approximately 150 GB of data for the project overall.

Storage/Access

The best way to handle storage and access is to follow the FAIR data principles[1]. They state that data should be findable, accessible, interoperable and reusable. Any storage solution that fulfills these requirements is sufficient.

If a data archive system is not used and the data are stored on a fileserver, the storage should be set to read-only after some initial time that is reserved for quick fixes. This guarantees that any following data analysis will be reproducible because it uses the same data.

The data should be stored in a form that is easy to handle. For many projects, this means that using a database is unnecessary. In our experience, for everything up to a few hundred gigabytes, storage on a file server combined with lazy loading was computationally efficient, easy to handle and did not carry the maintenance burden that a database would have introduced.

Example Project

In our example project, we stored the files of our measurements on a file server. We compiled the metadata in tabular form with links to the data files. Using these links, we employed a strategy of lazy loading, i.e., loading the measurement data of the relevant curve into main memory only when required. Since we used a server-based approach for the data analysis, with the server in the same data center as the data, access to the data was fast. We profiled access times using our lazy loading strategy against using a NoSQL database. The results showed that the database was not faster.
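A minimal sketch of this lazy-loading pattern is shown below, assuming a tabular metadata index with one row per measurement and a column that links to the raw data file. Column names and file formats are illustrative.

```python
# Sketch of the lazy-loading approach described above (column and file names are illustrative).
import numpy as np
import pandas as pd

# One row per measurement: metadata columns plus a link to the raw data file.
index = pd.read_csv("measurement_index.csv")    # e.g. columns: file, temperature_C, pressure_bar, stroke_percent

# Select the relevant subset using metadata only; no raw curves are touched yet.
subset = index[(index["temperature_C"] == 25) & (index["stroke_percent"] == 100)]

# Load the actual current curves into memory only for the selected rows.
curves = [np.load(path) for path in subset["file"]]
```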

Exploratory Analysis

Before any in-depth analysis of the data starts, we perform simple visualizations of the acquired data to obtain an overview of it. A domain expert uses these simple visualizations to check the data for consistency and investigates some isolated issues if such issues are present in the data.
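Such an overview can be as simple as plotting each curve over time, as in the following sketch; the file names and the assumed sampling interval are illustrative.

```python
# Sketch: quick current-vs-time plots so a domain expert can eyeball the data (paths are illustrative).
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
for path in ["run_0001.npy", "run_0002.npy", "run_0003.npy"]:
    curve = np.load(path)
    t = np.arange(len(curve)) * 1e-5          # assumed 10 µs sampling interval
    ax.plot(t * 1e3, curve, label=path)
ax.set_xlabel("time (ms)")
ax.set_ylabel("current (A)")
ax.legend()
plt.show()
```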

Example Project

We examined some of the current curves using simple plots of current versus time. For some data sets, a domain expert observed that the current curves had an inflection point at an unexpected position. Further analysis with the help of a data scientist revealed that all data sets with this feature belonged to a particular valve. Analysis of that valve revealed that it performs its function like the other valves but has much higher friction due to manufacturing tolerances. This insight led to an update of the data acquisition step with an additional parameter.

Advanced Analytics

Once the data and metadata are available and of good quality, a data scientist uses her or his tools of the trade during the advanced analytics step. A measurable objective exists in the form of the business question.

Data Visualization

A first step of the analysis is usually to create a meaningful visualization of the data. During the exploratory analysis step, the domain expert looked at individual datasets in isolation. In this step, the data scientist should design a visualization that encompasses all of the data. For datasets that consist of sets of measurement curves, this can be achieved, for example, with dimensionality-reduction techniques such as principal component analysis.
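A minimal sketch of such an all-data visualization, assuming the curves are already available as a numerical array, is shown below; the placeholder data stand in for the real current curves and metadata.

```python
# Sketch: a PCA overview of all curves, colored by a parameter (e.g. temperature); illustrative only.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

curves = np.random.rand(500, 2000)            # placeholder: 500 current curves, 2000 samples each
temperature = np.random.uniform(5, 60, 500)   # placeholder metadata

components = PCA(n_components=2).fit_transform(curves)
sc = plt.scatter(components[:, 0], components[:, 1], c=temperature, cmap="viridis", s=10)
plt.colorbar(sc, label="temperature (°C)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```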

Find Numerical Solution

Once the data scientist has gained an overview of the data, he or she searches for a suitable solution to the objective set by the business question. The solution can be in the form of an algorithm or a machine-learning model. It also does not matter whether the machine-learning model is an instance-based (lazy) model such as kNN, a shallow learning model such as a support vector machine, a random forest or gradient boosting, or a deep learning model. The data scientist has all the information at hand to find the best solution to the business question. Best in this sense means the simplest solution that can solve the business question. If performance metrics are involved, the speed of evaluation, model accuracy, reliability and required computational resources usually need to be balanced for an optimal solution.
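One common way to organize this search is to compare candidate models of increasing complexity under the same cross-validation scheme and then pick the simplest one that meets the objective. The following sketch uses placeholder data and scikit-learn defaults purely for illustration.

```python
# Sketch: compare candidate models of increasing complexity with cross-validation (data is placeholder).
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X = np.random.rand(500, 10)               # placeholder feature vectors from the curve-shape analysis
y = np.random.randint(0, 2, 500)          # placeholder labels: opened / blocked

candidates = {
    "kNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:18s} accuracy {scores.mean():.3f} +/- {scores.std():.3f}")
```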

Baseline Solution

The data scientist should always evaluate her or his solution against a simple baseline solution. This is required to prove that a more complex solution is necessary at all. By comparing the final solution to the baseline solution, the gain in efficiency can easily be shown.
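A trivial baseline, such as a majority-class predictor, is often enough for this comparison; the sketch below again uses placeholder data.

```python
# Sketch: a trivial baseline (majority-class predictor) to justify any more complex model.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(500, 10)               # placeholder features, as in the previous sketch
y = np.random.randint(0, 2, 500)          # placeholder labels
baseline = DummyClassifier(strategy="most_frequent")
print("baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
```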

Document Results and Failures

When data scientists try to find a solution, they will naturally encounter dead ends. Some methods might not work at all or might not work as expected. Machine learning methods are likely to yield several solutions of varying quality. We document all these results and failures. All results are documented so that the best result can be selected. As stated in the section Find Numerical Solution, the best result is not necessarily the one with the highest accuracy, as other considerations such as computational efficiency or reliability can have higher priority. All failures and dead ends should also be documented. There are several reasons for this. First, the failure might be due to a mistake on the part of the data scientist, and a future evaluation could correct this mistake. Second, by documenting that a method did not work, it does not need to be tested again in similar future analysis tasks; this prevents someone from making the same mistakes again. Third, if conditions change, the method could suddenly work; if it has been documented under which assumptions the method failed, it can be assessed whether it is worth trying the method again.

Archive Data Analysis and Tools

The result of the advanced analytics step will be a solution to the business question. Either this solution will be incorporated into a product or a business decision will be based on it. To maintain the product or justify the business decision, it must be possible to reproduce the exact solution at a later date. This is only possible if the data, the analytical work of the data scientist and the tools that the data scientist used are all preserved. Data preservation is already a requirement of the Storage/Access step. The analytical work of the data scientist must be stored in a version control system such as Subversion or Git. The tools must be stored in exactly the version that the data scientist used. For programming languages, the libraries used must also be taken into account.
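One lightweight way to capture the tool versions is to write the exact installed library versions next to the analysis results, as in the sketch below; the example project described later goes further and uses fixed virtual environments. The output file name is illustrative.

```python
# Sketch: record the exact library versions used for an analysis next to its results,
# so the solution can be reproduced later (file name is illustrative).
from importlib.metadata import distributions
from pathlib import Path

versions = sorted(
    f"{d.metadata['Name']}=={d.version}"
    for d in distributions()
    if d.metadata["Name"] is not None
)
Path("analysis_environment.txt").write_text("\n".join(versions) + "\n")
```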

Analytics Expert Review

The data analysis should be reviewed by another analytics expert. This can be done either in a tool-assisted fashion using a code review tool such as Upsource or Phabricator on each increment of work or by a walkthrough of the analysis after a part of the analysis has been completed.

Example Project

Principal component analysis of the current curves. Shown are the first two principal components, with the temperature indicated by color.

To obtain an initial visualization of the overall data we used principal component analysis and colored the data according to the parameters we had defined. The figure above shows such a plot for the first two principal components colored by the parameter temperature. The complete visualization would encompass the first four principal components and individual plots for all parameters.

Then, we developed our classifier using Jupyter notebooks[2]. These notebooks have the advantage that they can combine code for data analysis with explanatory text, include interactive figures and contain the results of the data analysis. They are a powerful tool for handling all the advanced analytics substeps, from data visualization through the analytics expert review, in one view. The notebooks run on and are stored on a server. The source code in the notebooks is automatically exported whenever a notebook is changed. We archive the notebooks and the source code. The archive system is connected to a code review tool, and we review each increment of work.

To work with reproducible tools, we use a virtual environment with fixed versions of the Python programming language and its libraries. This virtual environment is registered to the Jupyter notebook server and can be selected by a notebook for an analysis task. When the use of a new or updated library is required, we create a new virtual environment with the required version of the Python programming language and libraries, fix the versions and link it to the Jupyter notebook server. As a result, the existing notebooks use the old virtual environment and keep functioning. For new notebooks, the new or old environment can be selected.

Classification of valve stroke into good (circles, above the dashed line) and faulty states (crosses below the dashed line) based on the shape of the current curve. Coded by color is the actual stroke, which was independently acquired by a laser distance measurement.

The figure shows our resulting classifier tested against an independent ground truth. Each point is a measurement with different parameters. For some measurements, a valve blockage was simulated. The classifier divides the measurements into good (circles, above the dashed line) and faulty (crosses, below the dashed line) states. Coded by color is the actual stroke, which was independently obtained by a laser distance measurement. This ground truth shows that the classifier classified all of the tested cases correctly. This figure and the underlying classifier are the work products of the data scientist from the advanced analytics step. This is the result that we hand over to the next step.

Build Product

When the data scientist finds a solution to the business question, he or she should isolate the method required to solve the question. The method can be an algorithm or a machine-learning model.
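Depending on the target product, isolating the method can mean serializing the trained model as-is or exporting only its learned parameters for re-implementation on constrained hardware such as a microcontroller. The following sketch shows both options with a placeholder model; it is illustrative and not the classifier actually used in the example project.

```python
# Sketch: isolate the finished solution so it can be embedded in a product (illustrative).
import numpy as np
from joblib import dump
from sklearn.linear_model import LogisticRegression

X = np.random.rand(500, 3)                # placeholder curve-shape features
y = np.random.randint(0, 2, 500)          # placeholder labels: opened / blocked

model = LogisticRegression().fit(X, y)

# Option 1: serialize the full model for a PC- or cloud-based product.
dump(model, "valve_classifier.joblib")

# Option 2: export only the learned parameters, e.g. for re-implementation on a microcontroller.
np.savetxt("classifier_weights.csv", np.hstack([model.coef_.ravel(), model.intercept_]))
```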

Deployment

To ensure that the workflow functions as a pipeline, the data scientist's final solution should be built in such a way that it updates itself when new data are available. Additionally, it should be possible to integrate the solution into the product that is built in this step in an automated fashion.

This product will usually be a technology demonstrator in the first iteration of this workflow. In later iterations of this workflow, a prototype and minimal viable product might follow.

Domain Expert Review

The technological demonstrator, prototype or product should always be checked by a domain expert. This review should be a black box review of the data scientist’s solution. The domain expert should only evaluate the effectiveness of the demonstrator with regard to the business question.

Example Project

The figure "Classification of valve stroke into good and faulty states" above was the direct result of the advanced analytics step and was used as the initial technology demonstrator. In its first iterations, it contained a few measurements that were not classified correctly. The domain experts determined that these measurements were for valve states that are not allowed by the datasheet of the valve. Thus, after removing these faulty measurements from the dataset and redoing the analysis, we obtained the ideal classifier shown in the figure.

Since this is a new technology, we decided to build a technology demonstrator in hardware. This demonstrator consists of a valve where the stroke can be reduced by an added screw to simulate blocking. The valve is connected to compressed air, and a flow sensor measures the resulting flow to obtain a reference measurement to be used as ground truth. The current is measured by a microcontroller, which also performs the classification. It shows the result by a simple LED. This technological demonstrator is important because it shows the effect of the technology without the mathematical details that are only accessible to an analytics expert.

Communicate Result

Once a technological demonstrator, prototype or minimal viable product is ready, it should be presented to a larger audience. This serves multiple purposes. Inside the company, it is necessary to demonstrate the effect of the new technology to an audience that is not familiar with the details and to gather feedback. It might also inspire people to use the technology for different products. Outside of the company, it is necessary to communicate the technology in order to find pilot customers for field validation.

Field Validation

Technological demonstrators will usually be shown around internally or at exhibitions to gather initial feedback on the technology. If a prototype is available, field validation is the next logical step. This will expose the prototype to real-world usage and will give valuable insights that more often than not lead to necessary adjustments of earlier steps of BaseABC.

Business Case Review

All the collected feedback should lead to an answer to two questions: whether the developed technology is able to solve the business question, and whether the business question correctly addresses a customer issue. One might argue that the detailed identification of the customer issue should happen earlier; however, real-world examples have proven time and time again that most customers can define their needs best when they are given a prototype that deals with their issue. This is the main reason this workflow aims at automation and at iterating toward an adoptable prototype as early as possible.

Example Project

We patented the technology for diagnostics based on the inrush current. The initial business question spawned a number of follow-up projects with slightly different business questions, all of which try to capitalize on the technology in one way or another.

Pilot customers evaluated prototype circuit boards. Such a board has a microcontroller that measures the current and classifies valve switching. Customer feedback showed that these prototypes address customer needs.

The technology is ready for integration into systems designed for specific customer applications, and a project to develop a standard product, carried out as another iteration of BaseABC, is under way.

FAQs

Why is there no step for continuous training?

Training a machine learning model on historical data and deploying it assumes that the new data will be similar to the historical data. Sometimes this is not the case, and the machine-learning model needs to be adapted to retain its accuracy and reliability. This concept is called continuous training.

Within the BaseABC workflow, this does not require an extension of the workflow. In some cases, mostly when the data-driven product is cloud based, it is just a rerun of the existing pipeline. If the data-driven product involves an edge device and access to new data is difficult, we handle continuous training as a separate business question with its own pipeline. This is justified because, for example, the data acquisition and storage in such a case usually differ drastically from the initial model training. Typically, the initial training is performed on a large data set using powerful infrastructure, whereas later data collection and computationally less expensive model refinement occur on the edge device.

Why is there no step for monitoring an existing data-driven product?

Similar to continuous training, the BaseABC workflow treats the monitoring of a data-driven product as an individual business question. It requires the acquisition of appropriate monitoring data (usually containing a ground truth), a storage location (edge device or cloud) and a suitable performance metric for monitoring the prediction quality, which the data scientist has to find.

Authors and Contributors

DR. NIKOLAI HLUBEK, BÜRKERT FLUID CONTROL SYSTEMS
AUTHOR
Nikolai works as a Senior Data Scientist at Bürkert Fluid Control Systems and develops new data-driven products. He has a PhD in physics and has been using data science for more than 15 years to tackle various topics. As an example, before joining Bürkert Fluid Control Systems, he developed a real-time ionospheric monitoring service for the German Aerospace Center.

References

  1. The FAIR Guiding Principles for scientific data management and stewardship, Scientific Data 3, 160018 (2016), GO FAIR, https://www.go-fair.org/fair-principle
  2. Why Jupyter is data scientists’ computational notebook of choice, Nature 563, 145-146 (2018), https://doi.org/10.1038/d41586-018-07196-1