
How Apache Kafka Optimizes SAP AI Architectures

All leading companies strive to utilize artificial intelligence (AI) or machine learning (ML) to optimize their business. Such projects usually start with a proof of concept (PoC) that must then somehow be brought into production. Designing a proper architecture not only increases the chances of a successful PoC, it is also a prerequisite for later production use.

For better understanding, let’s use an example: A company wants to learn which sales orders have a high probability of being returned. If the return rate can be lowered significantly, that is a real cost advantage.

What are the requirements of such a project? That depends on the phases.

Learning phase

In this phase, the data are read, fed into the ML model to train it, and the results are validated. Then the same happens again, with a slightly enhanced model. This cycle is repeated until a useful ML model is found.

If the desired data reside in the ERP system (in our example the sales data), all tables will be read many times in their entirety. Each of these full data downloads takes a long time and puts a high strain on the ERP system. It makes more sense to read the data just once, then store them in inexpensive storage and use that as the starting point for the ML model training.
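The one-time extraction can be sketched as follows. This is a minimal illustration, not SAP tooling: an in-memory SQLite table stands in for the ERP system, and a CSV file stands in for the inexpensive cloud storage.

```python
import csv
import sqlite3

# Stand-in for the ERP database; in practice this would be the SAP system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_orders (order_id INTEGER, product TEXT, returned INTEGER)")
conn.executemany(
    "INSERT INTO sales_orders VALUES (?, ?, ?)",
    [(1, "shirt", 0), (2, "jacket", 1), (3, "shoes", 0)],
)

def extract_table(conn, table, path):
    """Read a table once, in full, and persist it to cheap flat storage."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor)

extract_table(conn, "sales_orders", "sales_orders.csv")
```

From this point on, every training iteration reads the flat file instead of straining the ERP system again.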

At the same time, the data must be analyzed as well. This helps with finding ideas for the model structure, with validating the results, and with combining data from different tables. In our example, it is very likely that order returns cannot be predicted from the order data alone. The buyer makes a difference (person A returns often, person B never does), and so does the product master data (clothes from vendor A are fine, but clothes from vendor B are often returned with the comment “too small”). The same goes for many more datasets. A fast database will help speed along this process.

The final factor is compute resources. The training process of a model consumes a lot of processing power, which will likely not be available in-house.

Everything I have laid out above leads to these requirements:

  • The ML training system must be in the cloud. All hyperscalers provide the necessary infrastructure, either as a product of their own, a programming environment suited for ML, or an open-source product like TensorFlow.
  • The data should be in a cloud system which supports fast queries including joins and mass data read/load. It should also be as cheap as possible.
  • The cloud provider must provide spot instances suited for ML training (all hyperscalers do).

With this system at hand, data scientists will combine data, e.g., order value, ordered product, vendor, product attributes, customer data, delivery times, and returns. They will prepare additional data as well, like clothing sizes or shopping patterns. The model learns from these data. Hopefully, it will figure out which input parameter and parameter combinations cause a return.
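The combination step can be sketched as a simple join of the miniature datasets below. All names and values here are hypothetical; the real data would come from the cloud database.

```python
# Hypothetical miniature datasets standing in for the extracted ERP tables.
orders = [
    {"order_id": 1, "customer": "A", "product": "shirt-1", "value": 40.0},
    {"order_id": 2, "customer": "B", "product": "shirt-1", "value": 40.0},
]
customers = {"A": {"past_return_rate": 0.6}, "B": {"past_return_rate": 0.0}}
products = {"shirt-1": {"vendor": "acme", "size": "L"}}

def build_features(order):
    """Combine order, customer, and product data into one training row."""
    row = dict(order)
    row.update(customers[order["customer"]])
    row.update(products[order["product"]])
    return row

training_rows = [build_features(o) for o in orders]
```

Each resulting row carries the order value together with the customer's return history and the product attributes, which is the shape of input the model learns from.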

Such model design is more an art than a mathematical approach. The naïve approach of loading in all available data would make training take forever. However, missing out on important information could mean that no available input parameter combination has statistical relevance. If, for example, one vendor tends to use a size L for what everyone else considers a size XL and the model is not made aware of this discrepancy in the input data, it does not know about the root cause and will not find solutions.
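The vendor sizing discrepancy is a classic feature engineering task. A sketch, assuming a per-vendor size offset has been derived from return comments (the vendor names and offsets are made up):

```python
# Assumed per-vendor size offsets, e.g. derived from "too small" return comments.
SIZE_SCALE = ["S", "M", "L", "XL", "XXL"]
VENDOR_OFFSET = {"vendor_b": 1}  # vendor B's "L" fits like everyone else's "XL"

def normalized_size(vendor, size):
    """Map a vendor-specific size label onto a common scale."""
    idx = SIZE_SCALE.index(size) + VENDOR_OFFSET.get(vendor, 0)
    return SIZE_SCALE[min(idx, len(SIZE_SCALE) - 1)]
```

With the normalized size as an input parameter, the model can see the root cause instead of guessing around it.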

Data preparation and model training will always be an iterative process with lots of repeats. The easier it is for the data scientist, the faster companies will have a well-tuned model.

After a few weeks, we will have a model that was trained on most of the sales data. Some existing orders were intentionally left out. The model has never seen these orders; however, when we feed them into the model, it tells us the probability of a return, and that number will ideally match what really happened. If we have gotten this far, we can be confident in loading every new order into the ML model to calculate the likelihood of a return. If a pre-defined threshold is exceeded, a pre-defined workflow will be triggered.
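The validation-and-threshold logic looks roughly like this. The "model" here is deliberately trivial, a per-product return rate learned from a toy history, just to show the hold-out check and the workflow trigger:

```python
from collections import defaultdict

# Toy order history: (product, returned). A real model would use far more features.
history = [("shirt", 1), ("shirt", 1), ("shirt", 0), ("shoes", 0), ("shoes", 0)]
holdout = [("shirt", 1), ("shoes", 0)]  # orders the model has never seen

counts = defaultdict(lambda: [0, 0])  # product -> [returns, total orders]
for product, returned in history:
    counts[product][0] += returned
    counts[product][1] += 1

def return_probability(product):
    returns, total = counts[product]
    return returns / total if total else 0.0

THRESHOLD = 0.5
for product, _ in holdout:
    if return_probability(product) > THRESHOLD:
        print(f"trigger workflow for {product}")  # the pre-defined reaction
```

The same `return_probability` call is later used for every new incoming order.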

Usage phase

When should a sales order be validated by the model? From a business point of view, the answer is immediately, which is understandable. It does not make sense to take the order, ship it, and then be told it will likely be returned. The cost savings happen when something is done before the fact, not after. For example, the trained model has found that product A is returned more often by male customers who bought size L most of the time. A short e-mail suggesting that size XL might be a better choice can be used to clarify the situation with a link to change the sales order.

Another aspect is that these models will become outdated eventually as new products, which the trained model has never seen, are added to the catalog. Other factors change, too. If we retrain the model once a month, having to manually copy all the data again each time is a valid concern. It would be much better if the training environment were kept up to date all the time. This requires some method to update the cloud database used for training whenever changes occur in the ERP system. That would mean two systems are now consuming ERP data: the business process validating new sales orders within seconds, and the ML training environment.

If our ML project is successful, other services will be added soon. One analyzes for returns and triggers a workflow, another checks for cancellation reasons, one for payment terms, the next for fraud, etc. Suddenly, there are many services that need to be informed about a new or changed sales order. If all these systems constantly query the ERP system for new data, sooner or later, the ERP users will be impacted.

It makes much more sense to put a mediator in between – Apache Kafka is the industry standard. Then the ERP system has just a single consumer, Apache Kafka, and Kafka streams the changes to all the services. The cloud database reads the changes from Kafka once a day; the ML services register themselves as real-time consumers of Kafka and receive new changes within milliseconds.
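The mediator pattern can be illustrated without a running Kafka broker. The sketch below is an in-memory stand-in for a Kafka topic, showing the two consumption styles: a pull-based daily batch read (the cloud database) and push-based real-time listeners (the ML services).

```python
class Mediator:
    """In-memory stand-in for a Kafka topic: the ERP system produces each
    change once, and any number of consumers read at their own pace."""

    def __init__(self):
        self.log = []          # append-only change log, like a Kafka partition
        self.offsets = {}      # consumer name -> next offset to read
        self.listeners = []    # push-style, millisecond-latency consumers

    def produce(self, event):
        self.log.append(event)
        for listener in self.listeners:
            listener(event)    # real-time push

    def poll(self, consumer):
        """Pull-style read, e.g. the cloud database polling once a day."""
        start = self.offsets.get(consumer, 0)
        self.offsets[consumer] = len(self.log)
        return self.log[start:]

mediator = Mediator()
realtime_events = []
mediator.listeners.append(realtime_events.append)  # an ML service subscribes

mediator.produce({"order_id": 1, "status": "new"})
mediator.produce({"order_id": 1, "status": "changed"})

daily_batch = mediator.poll("cloud_db")  # the cloud database catches up
```

Real Kafka works the same way conceptually: producers append to a log, and each consumer tracks its own offset, so slow and fast consumers coexist without burdening the source system.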

This is the huge advantage of Kafka: It lets consumers get data across the complete range of latencies – from pulling the data once a week, once a day, or every hour, down to having the data pushed proactively into the consumer with millisecond latency.

If these services are further split into two parts, e.g., a service API and a Kafka consumer, the service can be used both synchronously and asynchronously. The RESTful service API is an interface that takes a sales order and returns the return probability. The Kafka consumer calls this API and triggers the workflow to warn the user if a probability threshold is exceeded.
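A sketch of this split, with a plain function standing in for the RESTful endpoint and a hard-coded scoring rule standing in for the trained model (both are assumptions for illustration):

```python
def return_probability_api(sales_order):
    """RESTful-style service endpoint, sketched as a plain function:
    takes a sales order, returns the predicted return probability."""
    # Hypothetical scoring rule standing in for the trained model.
    return 0.8 if sales_order.get("size") == "L" else 0.1

def on_order_event(event, threshold=0.5, notify=print):
    """Asynchronous path: the Kafka consumer invokes the same API for every
    order change event and triggers the warning workflow when needed."""
    p = return_probability_api(event)
    if p > threshold:
        notify(f"order {event['order_id']}: likely return (p={p})")
    return p

# Synchronous path: the online shop calls the API directly for a cart preview.
p = return_probability_api({"order_id": 42, "size": "L"})
```

Keeping the scoring logic behind one API means both invocation styles always use the identical model.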

The email example is an asynchronous invocation, because it happens after the order was created in the ERP system. A synchronous invocation would be the online shop calling the service whenever the shopping cart is viewed. It sends the current data as a simulated sales order and can visualize the warnings even before the sales order is created.


An ML project requires freedom and creativity. The less cumbersome the data preparation is, the more ideas will be tested. Maximum flexibility is provided by a system that covers batch and real time, pull and push, and shields the ERP system from having to cater to too many consumers. Sending all ERP change data into Apache Kafka provides exactly that.

About the author

Werner Daehn, former SAP employee, is an expert on big data and smart data integration. He is also the CEO of
