SAP Data Intelligence: A Technical Analysis

When SAP Data Intelligence (SAP DI) - formerly known as SAP Data Hub - started, I was part of the group that developed it. However, since then, I’ve tried my best to forget that time. After six years of tiptoeing around it, I decided to put my knowledge to good use instead.

I’ve said it before: A product must target at least one business problem. According to the SAP product page, SAP DI targets three:

to “integrate and orchestrate massive data volumes and streams at scale”,
to “streamline, operationalize, and govern innovation driven by machine learning”,
and to “optimize governance and minimize compliance risk with comprehensive metadata management rules”.

SAP DI for data integration

To integrate data, SAP DI must obviously have connectivity to the various systems in an IT landscape. For example, a customer might want to extract SAP ERP data and load it into an Oracle database. From a technical point of view, loading any database has different levels of support.

Prepared statements. The insert statement is created once and then called multiple times with the different row values. Without that, the database would have to parse, validate, and create an execution plan for every single row, which takes approximately 0.01 seconds per row. That means that instead of loading 10,000 rows per second, only one hundred rows per second could be loaded.
Array processing. When executing a prepared statement, the database driver provides the option to pass in multiple rows at once. This allows for better network utilization, the database has to allocate free space just once, and many other internal optimizations become possible. This results in another performance gain by a factor of ten.
No data conversion. If a date value is provided as string and put into a datetime column, the database must convert it. This does not take long but when processing huge volumes at scale, it becomes an issue. Hence passing the data using the correct datatype is key. It would be odd to read a datetime, convert it to string, and then provide the database with that string for it to convert it back, wouldn’t it?
Support of database-specific bulk loaders. When loading large amounts of data, all databases provide a vendor-specific method to get the data in quickly. After the table segment is locked, the data is written by the database directly into the database file. Everything else is bypassed. In Oracle, this is called the Direct Path API. The problem for any data integration tool is that there is no standard. It must be implemented individually for each database vendor.
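The first two levels can be sketched in a few lines. This is a minimal illustration only, assuming Python's built-in sqlite3 module stands in for the target database; the table names, the sample rows, and the function names (make_rows, load_unprepared, load_prepared, load_array, run_demo) are hypothetical and not part of any SAP product. The native-datatype and bulk-loader levels are not shown because they depend on the specific database and driver.

```python
import sqlite3
import time


def make_rows(n):
    """Build n sample (id, name) rows; the data is made up for the demo."""
    return [(i, f"customer-{i}") for i in range(n)]


def load_unprepared(conn, table, rows):
    """Baseline: a new SQL text per row, so every row is parsed and planned again."""
    for row_id, name in rows:
        conn.execute(f"INSERT INTO {table} VALUES ({row_id}, '{name}')")


def load_prepared(conn, table, rows):
    """Level 1: one parameterized statement, executed once per row."""
    sql = f"INSERT INTO {table} VALUES (?, ?)"
    for row in rows:
        conn.execute(sql, row)


def load_array(conn, table, rows, batch_size=1000):
    """Level 2: the same statement, but whole batches go to the driver per call."""
    sql = f"INSERT INTO {table} VALUES (?, ?)"
    for start in range(0, len(rows), batch_size):
        conn.executemany(sql, rows[start:start + batch_size])


def run_demo(n=5000):
    """Load n rows with each strategy; return row counts and elapsed seconds."""
    conn = sqlite3.connect(":memory:")
    rows = make_rows(n)
    loaders = {
        "unprepared": load_unprepared,
        "prepared": load_prepared,
        "array": load_array,
    }
    counts, timings = {}, {}
    for name, loader in loaders.items():
        table = f"target_{name}"
        conn.execute(f"CREATE TABLE {table} (id INTEGER, name TEXT)")
        started = time.perf_counter()
        loader(conn, table, rows)
        conn.commit()
        timings[name] = time.perf_counter() - started
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    conn.close()
    return counts, timings


if __name__ == "__main__":
    counts, timings = run_demo()
    for name in counts:
        print(f"{name:>10}: {counts[name]} rows in {timings[name]:.4f} s")
```

An in-memory SQLite database has no network round trip and a very cheap parser, so the measured differences will be much smaller than against a networked database such as Oracle. The sketch only shows the shape of each loading strategy, not representative numbers.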

Every data integration tool in the world supports all four levels outlined above – except for SAP DI, which does not even have an Oracle table loader, not to mention any other non-SAP database loading option. It allows users to execute single SQL statements (Flowagent SQL Executor), but in most cases users write their own loaders, for example in Python, supporting the second level, array processing, at most. In contrast, SAP Data Services supports all four, as does any other commonly used tool. Funnily enough, SAP Data Services was integrated into SAP DI as an engine – for reading data!

With this knowledge in mind, stating that SAP DI “integrates [..] massive data volumes [..] at scale” is definitely bold.

Machine learning models

A few years back, the SAP team responsible for machine learning provided its prebuilt models to SAP Data Hub. This also caused the name change to SAP Data Intelligence. ML projects suffered in two areas back then, both solved by SAP DI. Firstly, ready-to-use models for typical problems were not available. They had to be created manually. And secondly, the deployment of models, especially from test to production, was an issue. Today, SAP DI has models for image classification, OCR and more (complete list here), and each SAP DI graph, including its ML models, can be moved to production. The popular way to build ML models without SAP DI was Apache Spark, which supported none of the moving-to-prod options.

Today, TensorFlow is the most popular method for creating ML logic. Because SAP DI supports Python as a programming language, using TensorFlow is the most popular method here, too (here’s an example). Any Python runtime can be used for that, and moving to production is solved nicely in TensorFlow itself. During development, a model object is configured with all the different layers of ML methods and then trained. This model can be exported and contains all the information needed to execute it anywhere. The model even adjusts to the environment it is executed on, e.g., training might happen on high-throughput GPUs, but the model is deployed on a smartphone and thus executes its logic on the hardware that device supplies.

From that point of view, SAP DI today is just a complicated, expensive way to run ML models.

Metadata management

Any data-driven project has lots of sources, targets, and transformations in between. Managing the entire graph is the goal of metadata management. This is such a basic requirement that every data integration tool had already incorporated it 15 years ago. SAP DI lacks functionality that is common in the industry.
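To make "managing the entire graph" concrete, here is a minimal sketch of data lineage as a directed graph, with an impact analysis that lists everything downstream of a given object. The class LineageGraph, its methods, and the object names (ERP.ORDERS, DWH.REVENUE, and so on) are hypothetical illustrations, not the metadata model of SAP DI or any other tool.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph: source -> transformation -> target."""

    def __init__(self):
        # Maps each node to the set of nodes that directly consume it.
        self._downstream = defaultdict(set)

    def add_flow(self, source, transformation, target):
        """Record that `transformation` reads `source` and writes `target`."""
        self._downstream[source].add(transformation)
        self._downstream[transformation].add(target)

    def impacted_by(self, node):
        """Return every node reachable from `node`, sorted by name."""
        seen = set()
        queue = deque([node])
        while queue:
            current = queue.popleft()
            # .get avoids creating empty entries for leaf nodes.
            for nxt in self._downstream.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return sorted(seen)


graph = LineageGraph()
graph.add_flow("ERP.ORDERS", "load_orders", "DWH.ORDERS")
graph.add_flow("DWH.ORDERS", "aggregate_revenue", "DWH.REVENUE")
graph.add_flow("ERP.CUSTOMERS", "load_customers", "DWH.CUSTOMERS")

# Which objects are affected if the structure of ERP.ORDERS changes?
print(graph.impacted_by("ERP.ORDERS"))
```

A real metadata repository would also track columns, versions, and ownership, but the question it answers is the same: if this source changes, which transformations and targets are affected?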

Customers

Given the issues outlined above, it seems only reasonable to ask: Are customers even using the product? Finding reference customers is very hard. Fun fact, out of the three reference customers mentioned on SAP’s own product page, the first says, “Machine learning using SAP Data Intelligence Cloud will be deployed [..]”. The second says, “Based on a proof-of-concept project [..] plans to deploy the SAP Data Intelligence solution”. The third reference customer mentions SAP DI once without a practical context.

There are customers using SAP DI. However, the more interesting question is how they are using it and whether there are better and cheaper options, considering licensing costs as well as hardware and admin costs.

My biggest concern is that the situation as it is right now is not sustainable. The development costs for SAP are too high for the revenue it generates. But then again, this would not be the first product SAP deprecates.

About the author

Werner Daehn, rtdi.io