
How SAP Data Services Achieved Such Stellar Performance

I'm not sure if I'm alone in this, but I feel that we are going backwards in the data integration market.

Many new tools have gained traction recently, but they are worse than SAP Data Services (aka BODS) in almost every aspect. Or maybe BODS was simply ahead of its time? To answer that question, I would like to look primarily at three key factors:

  • Throughput. A data integration tool’s most important property is speed. It will deal with massive amounts of data and every ounce of potential optimization must be utilized. Or would you readily use a tool that takes more than 24 hours to load the daily delta?
  • Developer efficiency. Of course, everything can be programmed or scripted manually. However, this takes time, and changes are very expensive. A data integration tool should make life easier, not harder.
  • Enterprise readiness. This covers documentation of the created logic, error handling, automation, multi-user development, moving to production, things like that.

Source and target connectivity

The amount of development effort that went into Data Services was immense. Regarding performance, the first question is how a tool interacts with connected systems. For a database it might seem simple – a JDBC/ODBC driver is used to execute SQL commands. That has to be good enough, right?

When reading data, two systems are involved, and both should be kept busy during the read. Hence the bare minimum is using array fetches with every supported database. But even that might not be enough: all data is still streamed through a single connection, so on a large 128-CPU server only a single CPU core would be utilized.

Next: For large data volumes, reading the data partitioned and in parallel is a must. If the table is large, it will very likely be partitioned already. If so, the DI tool should read each partition via its own connection (see e.g. Hana select with partition restriction on how to do that when reading Hana tables). Now the entire system can be utilized until network bandwidth becomes the limiting factor.
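To make the idea concrete, here is a minimal sketch of partitioned, parallel reading with array fetches. It assumes a generic Python DB-API driver (pyodbc) and a table partitioned by year – both illustrative assumptions, not how BODS is implemented internally.

```python
# A minimal sketch of partitioned, parallel reading with array fetches,
# assuming a generic Python DB-API driver (pyodbc). DSN, table name and the
# partitioning by year are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import pyodbc

CONN_STR = "DSN=SOURCE_DB"                 # hypothetical data source
PARTITION_KEYS = [2021, 2022, 2023, 2024]  # hypothetical partition values
ARRAY_SIZE = 10_000                        # rows per fetch round trip

def read_partition(year):
    """Read one partition over its own dedicated connection, in array batches."""
    conn = pyodbc.connect(CONN_STR)
    try:
        cur = conn.cursor()
        cur.arraysize = ARRAY_SIZE         # array fetch instead of row-by-row
        cur.execute("SELECT * FROM sales WHERE year = ?", year)
        rows_read = 0
        while batch := cur.fetchmany(ARRAY_SIZE):
            rows_read += len(batch)
            # hand the batch over to the downstream transformation here
        return year, rows_read
    finally:
        conn.close()

# One connection per partition, all partitions read in parallel.
with ThreadPoolExecutor(max_workers=len(PARTITION_KEYS)) as pool:
    for year, count in pool.map(read_partition, PARTITION_KEYS):
        print(f"partition {year}: {count} rows")
```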

For loading, the same concepts are available – partitioning, array loads, multiple connections – but databases often offer a fast import method as well. It locks the entire table partition and writes the data directly into the database file. Loading this way has limitations, but if data must only be inserted, it is much faster than SQL commands. ODBC/JDBC does not provide any APIs for this as it is vendor specific. BODS does not rely on ODBC/JDBC alone; it implements the low-level APIs of every single database vendor. For Oracle, this is the OCI library and the direct path API.
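The loading side, again only as a hedged sketch: what a generic driver can show is the array insert; the vendor-specific bulk-load APIs such as Oracle's direct path sit below ODBC/JDBC and are not exposed here.

```python
# A minimal sketch of array loading with a generic DB-API driver. DSN, table
# and columns are hypothetical; the vendor-specific fast-import path is not
# reachable through ODBC/JDBC and therefore not shown.
import pyodbc

BATCH_SIZE = 10_000

def array_load(rows):
    """Insert rows in large batches instead of one INSERT per row."""
    conn = pyodbc.connect("DSN=TARGET_DB")
    try:
        cur = conn.cursor()
        cur.fast_executemany = True        # pyodbc: send the whole array at once
        for i in range(0, len(rows), BATCH_SIZE):
            cur.executemany(
                "INSERT INTO sales_target (id, amount) VALUES (?, ?)",
                rows[i:i + BATCH_SIZE],
            )
        conn.commit()
    finally:
        conn.close()
```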

RFI questions: Do you support…

  • array fetches?
  • reading data partitioned in parallel automatically?
  • array insert/update/delete?
  • vendor-specific, fast import APIs like Oracle’s direct path API?
  • partitioned parallel loading (with faster import methods)?

If the answer is no to any of these questions, the performance will be significantly impacted. In other words, what takes one hour with BODS takes ten hours with the other tool.

ETL/ELT

The first tools in the DI space were file based. A batch program extracts the data from the source into a file. The file is then read and transformed into the desired target structure, often via multiple steps, and finally loaded. Written as a processing pattern – where reading counts as extraction – it looks like: EL-ETL-ETL-…-ETL-EL.

The data is read from the source and written to a file. That file is read, transformed, and a new file is written. This happens multiple times. Finally, the last file is read and loaded into the target.

That the performance of such an approach would be unbearable is obvious. The solution is simple as well: no intermediate files; instead, the data is streamed through the transformation steps. This streaming has another positive side effect: the data is read, transformed, and loaded simultaneously. Consequently, if the read takes one hour, the transformation one hour, and the loading one hour as well, the overall execution time is closer to one hour – not the three hours it would take if one step were performed after the other.

This would be a pattern like: E-TT…TT-L, or in short: ETL.
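A toy version of that streaming pattern: extract, transform, and load run concurrently, connected by in-memory queues instead of files. The stage functions are placeholders, of course, not real source and target systems.

```python
# Conceptual sketch of the E-TT...TT-L streaming pattern: all stages run at
# the same time and hand rows over through bounded in-memory queues.
import queue
import threading

SENTINEL = object()

def extract(out_q):
    for i in range(1_000_000):               # pretend this is the source read
        out_q.put({"id": i, "value": i * 2})
    out_q.put(SENTINEL)

def transform(in_q, out_q):
    while (row := in_q.get()) is not SENTINEL:
        row["value"] += 1                     # pretend transformation
        out_q.put(row)
    out_q.put(SENTINEL)

def load(in_q):
    loaded = 0
    while in_q.get() is not SENTINEL:         # pretend this is the target write
        loaded += 1
    print(f"loaded {loaded} rows")

q1, q2 = queue.Queue(maxsize=10_000), queue.Queue(maxsize=10_000)
threads = [
    threading.Thread(target=extract, args=(q1,)),
    threading.Thread(target=transform, args=(q1, q2)),
    threading.Thread(target=load, args=(q2,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```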

Database vendors and simpler data integration tools labeled that old-school technology and coined the term ELT instead. Data is extracted from the source, loaded into a database, and transformed there using SQL statements. The main argument for it is that a database is much more powerful in terms of hardware and SQL optimization than any ETL tool. But looking at the read/write steps, the term ELT is misleading. It should rather be: EL-ETL.

Data is read from the source and loaded into a target database table, then read back via a SQL statement that does all transformations and loads the result into the final table. The second part would be an “insert..select..” statement. However, we all know the limitations of SQL. Even for moderately complex transformations, multiple select statements must be executed or nested, each creating a temporary table inside the database. Leaving marketing aside, a typical transformation inside the database would almost always be: EL-ETL-ETL-…-ETL. This is much like the first, file-based version, where processing steps happen one after the other with lots of read/write operations.
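For illustration, this is roughly what such a staged in-database transformation looks like once it no longer fits into a single statement. All table names and SQL are hypothetical; the point is that every intermediate step is a full read plus a full write inside the database.

```python
# Illustrative only: a staged "ELT" transformation driven from Python, with
# each intermediate step materializing a temporary table in the database.
ELT_STEPS = [
    # ETL step 1: first transformation into a temp table
    # (stage_orders is assumed to have been filled by the EL step before)
    """CREATE TABLE tmp_orders_1 AS
       SELECT order_id, customer_id, amount * 1.19 AS gross
       FROM stage_orders""",
    # ETL step 2: aggregate into another temp table
    """CREATE TABLE tmp_orders_2 AS
       SELECT customer_id, SUM(gross) AS total
       FROM tmp_orders_1 GROUP BY customer_id""",
    # final step: insert..select into the target table
    "INSERT INTO customer_sales SELECT customer_id, total FROM tmp_orders_2",
]

def run_elt(conn):
    """Execute the staged transformation entirely inside the database."""
    cur = conn.cursor()
    for stmt in ELT_STEPS:
        cur.execute(stmt)
    conn.commit()
```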

Which one is better?

So, which one is better from a practical point of view? One can come up with examples where one approach shines and the other sucks, and vice versa. That is breeding ground for fierce, long-lasting wars. To me, the objective answer is clear: Both approaches have their sweet spot.

If the data is in the target database already and the transformation can be done with a single, non-nested SQL statement, then executing this single SQL statement ELT-style would certainly be faster. In comparison, an ETL tool would read the data out of the database and apply a trivial transformation just to load it back into the same database. From a performance point of view, that makes no sense.

With complex transformations, particularly ones that require multiple SQL statements to achieve the transformation, the ETL approach will be faster most of the time. And in the normal scenario, a mixture of both will be the fastest. Perform as many actions as possible in the source and target database and do everything the database is not well suited for in the ETL tool.

Example: Two tables should be joined, and the result compared with the current target table in another database. Only the few changed rows should be applied. It would be ideal to push the join down into the source database – that is what the database is very good at, and it has all the supporting structures, like indexes, for it. The table comparison, on the other hand, would require two to three SQL statements, whereas an ETL tool does it in a single transform.

What does BODS do? Exactly that. It has a built-in optimizer that tries to push down as many (sub-)operations as possible to the source or target database and does the rest in the engine. For simple cases, it generates an “insert..select..” statement and behaves like an ELT tool.
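A conceptual sketch of what that mixed approach amounts to in practice – not BODS internals, just the idea with hypothetical connections, tables, and columns: the join runs as SQL in the source, the comparison runs in the engine, and only the delta is applied to the target.

```python
# Pushdown + engine comparison + delta apply, sketched with generic DB-API
# connections. All names are hypothetical.
def incremental_sync(source_conn, target_conn):
    # 1. Pushdown: let the source database execute the join.
    src = source_conn.cursor()
    src.execute(
        "SELECT o.order_id, o.amount, c.customer_name "
        "FROM orders o JOIN customers c ON o.customer_id = c.customer_id"
    )
    source_rows = {r[0]: tuple(r) for r in src.fetchall()}

    # 2. Engine: compare with the current state of the target table.
    tgt = target_conn.cursor()
    tgt.execute("SELECT order_id, amount, customer_name FROM orders_reporting")
    target_rows = {r[0]: tuple(r) for r in tgt.fetchall()}

    inserts = [r for key, r in source_rows.items() if key not in target_rows]
    updates = [r for key, r in source_rows.items()
               if key in target_rows and r != target_rows[key]]

    # 3. Apply only the few changed rows.
    tgt.executemany(
        "INSERT INTO orders_reporting (order_id, amount, customer_name) "
        "VALUES (?, ?, ?)", inserts)
    tgt.executemany(
        "UPDATE orders_reporting SET amount = ?, customer_name = ? "
        "WHERE order_id = ?",
        [(amount, name, order_id) for order_id, amount, name in updates])
    target_conn.commit()
```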

In other words, BODS is both an ELT and an ETL tool. It incorporates both concepts, plus a decision-making optimizer and the capability to break a dataflow apart into an ELT part and an ETL part. The BODS team invested heavily to bring the best of both worlds to its users.

In-engine throughput

For the in-engine performance, the goal is to get into millions of rows per second. Obviously, every overhead required for getting the rows from one transformation to the next will impact performance significantly. Whenever the rows cross system boundaries, there must be a conversion of some kind.

The first conversion happens when reading the data. It arrives in one format, e.g. as a SQL result set, and must be converted into the engine format: a SQL decimal into a C++ data structure, a SQL date into a C++ date. This costs CPU cycles for the actual conversion and memory access time for the copy operation. The process takes nanoseconds, but at millions of rows per second, this trivial step alone accounts for 0.1 percent of the execution time. If such conversions happen between every single transformation and there are 100 transformation steps, the engine's data copy operations are already responsible for 10 percent of the execution time – before any of the actual transformation CPU time is spent.

When implementing a streaming engine, the naïve approach is to take the row from the input, allocate memory for the output row, and then apply all mappings. The first field is a 1:1 copy, so its content is copied to the output. The second column has a mapping of substring(col2,1,1) to extract the first character, and that result is copied to the output. Consequently, at one million rows per second, 100 transformations, and a row size of 1 kB, the process consumes 100 GB/s of memory bandwidth. That is, to put it conservatively, a lot.
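The back-of-the-envelope calculation behind that number: 1,000,000 rows per second × 100 transformation steps × 1,000 bytes copied per row and step = 100,000,000,000 bytes per second, i.e. 100 GB/s – just for moving data around, before any transformation logic runs.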

The better approach is to pass only the row pointer to the output. The input row is at address X, and the output is the same row. For 1:1 mappings, input and output are identical, so nothing must be done. Column #2 is simply overwritten with the new value. With this approach, no memory copy is needed at all!

This approach works if the input and output have the same columns, and here the optimizer comes into play again. Let's assume the output has an additional column. What would happen if this additional column were moved to the very beginning, the point where the data is read from the source? It does not exist in the source and therefore its value will be zero, but the row layout itself is not affected. The same argument applies if the output has fewer columns: the output row at address X carries an extra column, but it is simply not used when loading into the target database. With this trick, the condition “input row and output row must have the same columns” can be kept true for many more transformation types. This becomes visible when looking at the optimized execution plan that BODS generates.
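As a conceptual illustration – Python instead of the C++ engine, and certainly not BODS code – the difference between copying rows between steps and passing the same row object through while overwriting only the mapped column looks like this:

```python
# Copy-per-step versus pointer-style in-place modification of a row.
def transform_with_copy(row):
    # naive approach: allocate a new output row and copy every column
    out = dict(row)                    # full copy of the row
    out["col2"] = row["col2"][:1]      # substring(col2, 1, 1)
    return out

def transform_in_place(row):
    # pointer-style approach: input and output are the same row object,
    # only the mapped column is overwritten, 1:1 columns stay untouched
    row["col2"] = row["col2"][:1]
    return row

row = {"col1": 42, "col2": "abcdef", "col3": 3.14}
assert transform_in_place(row) is row          # same object, no copy
assert transform_with_copy(row) is not row     # new object, full copy
```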

The next thing the optimizer does is combine and split transformations. A trivial example: three transformation steps that each only modify columns can be collapsed into a single transformation step. This is what gives BODS its extreme in-engine performance. The CPU power and memory bandwidth spent on data movement is effectively zero, leaving all resources for the actual transformation.
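Conceptually – again just a sketch with made-up column mappings – fusing a chain of column transformations into a single pass looks like this:

```python
# Collapse a chain of column-mapping steps into one function, so every row is
# handed over exactly once instead of once per step.
def add_tax(row):
    row["amount"] *= 1.19
    return row

def trim_name(row):
    row["name"] = row["name"].strip()
    return row

def fuse(*steps):
    """Return a single fused transformation applying all steps in one pass."""
    def fused(row):
        for step in steps:
            row = step(row)
        return row
    return fused

pipeline = fuse(add_tax, trim_name)
print(pipeline({"amount": 100.0, "name": " Smith "}))
```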

The worst a tool can do is use different engines with different row formats and/or transfer the data across the network – for example, a tool that has a structured engine, a Python operator, a JavaScript operator, and so on. In such a case, the structured data must be serialized into a JSON string just so the JavaScript engine can deserialize it into its internal format. The JavaScript logic then applies its transformation, and the resulting structure is serialized into a JSON string again and handed over to the Python operator, which deserializes it once more.
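A tiny experiment makes that serialization tax visible; the operator hand-overs are simulated here with plain json.dumps/loads round trips, nothing more.

```python
# Measure the cost of serializing rows to JSON and parsing them again at every
# engine boundary -- work a single-engine design avoids entirely.
import json
import time

rows = [{"id": i, "value": i * 2.5, "name": f"row{i}"} for i in range(100_000)]

start = time.perf_counter()
for row in rows:
    handover_1 = json.dumps(row)      # structured engine -> "JavaScript operator"
    row = json.loads(handover_1)      # that operator parses the row again
    handover_2 = json.dumps(row)      # "JavaScript operator" -> "Python operator"
    row = json.loads(handover_2)      # ...and it parses the row once more
elapsed = time.perf_counter() - start
print(f"serialization overhead for 100k rows: {elapsed:.2f}s")
```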

If there is a network involved between two operators, things go from bad to worse. Do you happen to know a data integration tool that is built that way? If yes, compare it with BODS! Funnily enough, even Data Services has the option to split a dataflow into multiple processes. It is a very effective way of slowing a dataflow down.

These performance numbers are not easy to get from tool vendors, and they would be hard to compare anyway. It is better to evaluate them in a proof of concept in which a more complex scenario is implemented – preferably with SAP Data Services as one of the contenders.

Developer efficiency

Another development I'm seeing is the return to scripting. Transformations are implemented as JavaScript code, entire dataflows are written in Scala or Python and executed as Spark jobs. On the surface, these are equivalent to BODS dataflows – after all, a BODS dataflow is nothing else than a sequence of CPU instructions as well. The difference is the usability or, as I called it in the beginning, the developer efficiency.

As a first example, let's use partitioned reading. Suppose the source table has 20 partitions, and each should be read via its own connection. Building a corresponding dataflow is doable – it would contain the same source table, plus the partition filter, 20 times. Not nice, but alright. The trouble starts when the number of partitions changes. The database might even add partitions automatically, for example if the table is partitioned by year. The dataflow must still work, must read all data including the new partition, and should preferably utilize all 21 partitions now. The first requirement – reading all data – is still doable if the developer pays attention. The second would require manually adding the 21st partition reader and moving the code through test and production again.

In BODS, the user adds the source table just once and ticks the flag to read the table partitioned. The dataflow optimizer then creates the modified plan with n readers, each reading the data from one partition. Here, BODS could do a little better in aligning the partitions with the degree of parallelism, but this simple change never made it into the product, for whatever reason.

This kind of feature – the user just sets a flag and the optimizer turns the dataflow into something logically identical but optimized for performance – pays off in real projects when more complex requirements have to be implemented. Think of a lookup into another table that should return not all matching rows but only the most recent one, or of loading tables designed for slowly changing dimensions: a single click in BODS, compared to a lot of code the user has to write and maintain in other tools.

My point here is that having the option to implement something fancy is one thing. But if the costs are too high, it will not be done, simply because of the amount of time it would take. It is crucial to check how easily such advanced functionality can be used.

Enterprise readiness

The final point is all about the environment of a data integration project – error handling must be implemented, for example. In my experience, the dataflows and workflows in a BODS project are designed once and then they simply run. The product is very stable by itself, and if something fails, the next execution of the job recovers from the previous error automatically and, for example, processes yesterday's and today's data together. Other features like AutoDocumentation allow users to create hundreds of pages of project documentation from the information the dataflows contain.

Is everything perfect, then? Of course not. Some pain points I have already listed above. The biggest issue I have is the development speed. All work seems to go into SAP Data Intelligence, and the “What's new in BODS” document of each release lists just basic maintenance. Even SAP's plan to integrate BODS into SAP Data Intelligence does not help, as it will only allow using the BODS reader and a query transform. Almost none of the abovementioned advantages of BODS can be utilized then.

So, I probably have my answer to the question I posed at the outset: “Or maybe BODS was just ahead of its time?” Yes, it was ahead of its time, and it still benefits from its concepts.

About the author

Werner Daehn, rtdi.io

Werner Daehn, former SAP employee, is an expert on big data and smart data integration. He is also the CEO of rtdi.io.
