The Hana Smart Data Integration feature is one example where my team and I achieved that while I was working for SAP. At the time, customers believed ETL tools were good enough, but today SDI is the go-to option for any Hana data integration project combining batch and real time.
My next goal was to combine data and process integration into a single tool in a (r)evolutionary way. The technologies to enable my vision were finally available, allowing both use cases to be covered with a single product. However, the idea collided with SAP’s organizational structure and was therefore never picked up. Now, the topic of integration is in such high demand that it gets the full attention of SAP’s top management. The results so far have not been very convincing, though. Worse, I work with multiple SAP customers that have solved this problem better themselves, using existing components in a clever way combined with open source technology that my company provides.
In the following article, I would like to take you on the journey these pioneering customers took, explain the reasons behind their decisions, and show how we improved and further developed their initial ideas.
System vs. data integration
Looking at the SAP product lineup, there is a clear distinction between process integration and data integration. The former connects applications; the latter transfers table contents from one system to another. The language used is different, too: applications talk about entities, such as the Business Partner object, whereas in data integration, everything revolves around the underlying data model storing these entities.
In a big data world, this separation just doesn’t exist. All products can deal with complex nested objects with high throughput. From this point of view, a table is only a particularly trivial object. This leads us to our first conclusion on the journey towards SAP integration: Forget about database tools and instead consider big data offerings.
Batch vs. real time
The historical reason for the distinction between process and data integration is a set of technological shortcomings, primarily concerning performance. Batch tools are fast; real-time tools have low latency. From a logical point of view, real-time tools are a superset of batch tools. Why? Because the only way to get lower latency with batch tools is to poll the source system for changes more frequently. This puts a huge strain on the source system and is therefore only possible within limits.
A real-time tool does not have this problem, meaning batch-like processing is possible as well. To such a tool, a batch load simply looks as if the source system had not registered any changes for a long time and then produced a large number of changes within a few minutes. We do need a real-time tool capable of handling mass data, though, which puts us into the big data world again.
Number of producers/consumers
When looking at the overall solution with the question, “Which systems are connected with each other?” in mind, it was more of a 1:1 relationship in the past. The SAP ERP data got loaded into an SAP BW system. Data from the time tracking tool got loaded into SAP HCM. And that is exactly how SAP tools are built. The assumption of a 1:1 connection was already questionable in the past; today, it is simply no longer valid. The SAP ERP data is needed in SAP BW, in a data lake, in Ariba, Salesforce, multiple Intelligent Enterprise apps and services, and many more apps and solutions.
In this situation, traditional data orchestration makes less and less sense. It would be better if the various consumers were in control, taking the data they require at a speed of their own choosing – a data choreography, so to speak. In other words, instead of a central instance – a conductor – controlling the orchestration, there are different channels for the data. Each source system publishes its change data into them, and the consumers simply take the required data from there.
Using the Business Partner as an example, the ERP system publishes every new customer master version into such a data topic, and the BW system reads all new entries once a day. A tightly coupled online shop picks up the same changes with a latency of less than a second to stay up to date.
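To make this choreography more tangible, here is a minimal sketch using the confluent-kafka Python client. The topic name, server address, and payload are invented for illustration only; they are not part of any SAP or rtdi.io product.

```python
# Minimal sketch of a data choreography with Apache Kafka (confluent-kafka Python client).
# Topic name, server address and payload layout are illustrative assumptions.
import json
from confluent_kafka import Producer, Consumer

TOPIC = "sap.erp.businesspartner"   # hypothetical change-data topic

# Producer side: the ERP connector publishes every Business Partner change exactly once.
producer = Producer({"bootstrap.servers": "localhost:9092"})
change = {"PartnerId": "4711", "FirstName": "Ada", "LastName": "Lovelace"}
producer.produce(TOPIC, key=change["PartnerId"], value=json.dumps(change))
producer.flush()

# Consumer side: each consumer group keeps its own read position in the topic.
# A BW-style batch consumer may run once a day, the online shop polls continuously;
# neither of them puts additional load on the ERP source system.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "online-shop",      # a different group.id per consuming application
    "auto.offset.reset": "earliest",
})
consumer.subscribe([TOPIC])
msg = consumer.poll(timeout=1.0)
if msg is not None and msg.error() is None:
    print("received change:", json.loads(msg.value()))
consumer.close()
```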
Apache Kafka
Combining all of these requirements, we naturally arrive at Apache Kafka. This, if you think about it, shouldn’t come as a surprise – Kafka is used by most major companies today and has become the de facto standard for streaming data. And since Kafka is built for a big data world, it can easily handle operational SAP data volumes.
Apache Kafka organizes data into ‘Topics’, each representing one data channel. By partitioning a topic, parallel processing – as required for dealing with massive amounts of data – becomes possible. Each record is backed by a ‘Schema’ describing the structure of the message. The SAP Business Partner would therefore be a schema with fields like first name and last name plus a nested array of the various addresses. In data integration, we would deal with the KNA1 customer table and the ADRC address table individually; in process integration, it is one nested schema, modeled e.g. after the corresponding IDoc, OData, CDS, or BAPI structures. Kafka can deal with both.
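To illustrate the difference, a nested Business Partner schema could look roughly like the following Avro definition, written here as a Python dictionary. All field and record names are invented for this sketch and do not claim to match the actual KNA1/ADRC columns or any delivered schema.

```python
# Sketch of a nested Avro schema for a Business Partner message.
# Names are illustrative; they do not mirror the real KNA1/ADRC column names.
business_partner_schema = {
    "type": "record",
    "name": "BusinessPartner",
    "fields": [
        {"name": "PartnerId", "type": "string"},
        {"name": "FirstName", "type": ["null", "string"], "default": None},
        {"name": "LastName",  "type": ["null", "string"], "default": None},
        {
            # In data integration this would be a separate ADRC extract;
            # in the nested message it is simply an array of address records.
            "name": "Addresses",
            "type": {
                "type": "array",
                "items": {
                    "type": "record",
                    "name": "Address",
                    "fields": [
                        {"name": "Street",  "type": ["null", "string"], "default": None},
                        {"name": "City",    "type": ["null", "string"], "default": None},
                        {"name": "Country", "type": ["null", "string"], "default": None},
                    ],
                },
            },
            "default": [],
        },
    ],
}
```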
While creating a nested schema requires more work on the producer side, this is overall cheaper and more efficient in a world where one producer serves many consumers.
Schema evolution/extensibility
The simplest solution would be to hand over e.g. IDocs into Kafka unchanged and not care about the rest. However, since we are in the big data realm, companies should at least consider embracing further concepts. One of these relates to changes in the schema – something that breaks all current data and process integration products. It makes no sense to keep producers and consumers in sync, nor to force them to deal with multiple versions of similar schemas. Here the concept of schema evolution comes in handy: a way to modify schemas without breaking anything.
A simple and obvious example: there are two source systems where Business Partner data can be entered and ten consuming applications. The SAP ERP system just got an additional Z-field. The SAP producer extends the schema with this field, assigns a default value of <null> to it, and from then on fills it with the field’s contents. The other producer does not know about the change and consequently keeps producing data with the previous schema for the next 20 minutes, until its periodic sync of the schema metadata. For this producer, switching to the new schema is no problem either: its system does not have such a Z-field, hence it never writes a value into the new structure element, and the value remains at the default. Everything continues running without any intervention.
The first time a consumer gets a message with the new schema version, it will use this schema for reading all data. Messages with the old schema can still be read; they simply do not provide a value for the additional field, so it retains the default specified in the schema, the <null> value. Consequently, there are no negative side effects.
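The mechanics of this default-value behavior can be shown with plain Avro schema resolution. The sketch below uses the fastavro library as one possible implementation; the Z-field name is made up, and any Avro-compatible serializer would behave the same way.

```python
# Sketch: Avro schema evolution with a default value, using fastavro.
# The Z-field name is invented for this example.
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

schema_v1 = parse_schema({
    "type": "record", "name": "BusinessPartner",
    "fields": [
        {"name": "PartnerId", "type": "string"},
        {"name": "LastName",  "type": ["null", "string"], "default": None},
    ],
})

schema_v2 = parse_schema({
    "type": "record", "name": "BusinessPartner",
    "fields": [
        {"name": "PartnerId", "type": "string"},
        {"name": "LastName",  "type": ["null", "string"], "default": None},
        # The new Z-field, added with a default of null so older messages stay readable.
        {"name": "ZCustomerSegment", "type": ["null", "string"], "default": None},
    ],
})

# A producer still on v1 writes a message without the new field ...
buf = io.BytesIO()
schemaless_writer(buf, schema_v1, {"PartnerId": "4711", "LastName": "Lovelace"})

# ... and a consumer already on v2 reads it: the missing field gets its default value.
buf.seek(0)
record = schemaless_reader(buf, schema_v1, schema_v2)
print(record)  # {'PartnerId': '4711', 'LastName': 'Lovelace', 'ZCustomerSegment': None}
```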
How consumers deal with new fields is up to the consumer developer. An application consumer invoking an OData API will only use the fields it needs, as it does not even have a parameter in the API to map the additional value to. A data lake consumer, on the other hand, will automatically modify the target structure, extend it by the new field, and load all data.
In short, schema evolution is a technique to modify the structure over time using various means such as adding fields, providing aliases, and changing data types. It is backed by validation logic that only allows compatible changes.
Then there are cases where data is rather source-system specific and should not be part of the official schema. To still be able to transport such data, every schema gets an extension area, used e.g. for auditing. Example: the allowed gender codes in the schema are ‘F’, ‘M’, and ‘?’, but the source system encodes this information as a number. It might make sense to provide that number as well by adding it to the extension area of the schema. Thus, we can later see that the gender was set to ‘?’ because the source gender code was 999.
In addition, the schema contains metadata for later use: Which system produced the message? Which transformations did the record go through, and what were their results? What is the data quality score of the record?
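Such an extension and metadata area could look roughly like the message sketched below. The field names (__extension, __metadata, and so on) are purely illustrative assumptions and not an actual product schema.

```python
# Sketch: a message carrying source-specific extension values and technical metadata
# next to the official fields. All field names here are illustrative assumptions.
business_partner_message = {
    # Official, harmonized fields of the schema
    "PartnerId": "4711",
    "Gender": "?",                          # allowed codes: 'F', 'M', '?'
    # Extension area: free-form key/value pairs for source-specific raw values
    "__extension": {
        "source_gender_code": "999",        # explains why Gender ended up as '?'
    },
    # Metadata for later use: origin, applied transformations, quality score
    "__metadata": {
        "produced_by": "ERP_PROD_100",
        "transformations": ["gender_code_mapping: 999 -> ?"],
        "data_quality_score": 0.92,
    },
}
```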
Message queue vs. Kafka transaction log
The above case, where the consumer received an additional field but had no way of using it yet, reveals another problem: how do you get the already loaded data again after changing the consumer mapping?
With message queues, the only way to process data again is to trigger it at the source system. The same data is then sent once more, and all consumers – even the ones that have no need to reprocess it – must process it again. And if the next consumer is adjusted, the whole cycle repeats for everybody. Pure horror. This simple fact is one of the main reasons why message queues were never used as widely as initially anticipated.
However, we said consumers should be in control of the read. In this case, the consumer must be able to re-read existing data. No problem for Kafka: as a big data tool, it has no issues retaining the data for seven days (the default) – or even forever, if requested. Thanks to the new tiered-storage feature, this is actually very cheap. All the consumer has to do is move its read pointer to an earlier point in time and start reading from there.
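With the confluent-kafka Python client, moving the read pointer back could look roughly like this; the topic name, the number of partitions, and the two-day window are assumptions for illustration.

```python
# Sketch: re-reading already consumed data by moving the read pointer back in time
# (confluent-kafka Python client; topic name, partition count and time window are illustrative).
from datetime import datetime, timedelta, timezone
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "datalake-loader",
    "enable.auto.commit": False,
})

TOPIC = "sap.erp.businesspartner"
start = datetime.now(timezone.utc) - timedelta(days=2)        # re-read the last two days
start_ms = int(start.timestamp() * 1000)

# Ask the broker which offset corresponds to that point in time, per partition.
partitions = [TopicPartition(TOPIC, p, start_ms) for p in range(3)]  # assuming 3 partitions
offsets = consumer.offsets_for_times(partitions, timeout=10)

# Assign the partitions at those offsets and start reading from there.
consumer.assign([tp for tp in offsets if tp.offset >= 0])
while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break                       # no more messages within the timeout
    if msg.error() is None:
        record = msg.value()        # hand the record to the (re-)built mapping
consumer.close()
```

Because each consumer group maintains its own read position, rewinding one consumer this way does not affect any of the other consumers.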
This capability has many benefits; for example, developers can run a test multiple times, will get the same data every time, and will get data right away instead of having to wait for the first record to be produced.
Future
If you are interested in an open, future-proof, flexible integration solution for SAP data and beyond, feel free to browse the page www.rtdi.io for ideas.