When SAP Hana started to gain traction around 2013, our problem was how to be successful with SAP's ETL tool, the Data Services product, and the acquired Sybase Replication Server. Obviously, both tools have proper interfaces to Hana, but adding new connectivity is just business as usual.
One issue was that the tools did not harmonize with the product philosophy of Hana. Hana is about virtual data models, real time and simplicity.
Data integration styles
From a technical point of view, the problem is that every type of data integration has pros and cons. None is superior. Batch ETL processing favors throughput over transactional consistency: high throughput requires multiple, parallel and independent streams of data, hence the data cannot be transactionally synchronized.
Real-time replication is all about low-latency copies of data from the source to Hana. With the remote data in Hana always up to date, the user can execute queries at Hana speed and does not bother the usually busy source system. The approach of data federation is to avoid data copying altogether. Why copy an entire table and keep it in sync just because a record is needed once in a while? Hana should read the requested record from the source directly.
Each customer use case has a preferred approach. A one-time data migration project usually prefers high speed batch data processing. Keeping the master data constantly in sync between a cloud system and S/4 Hana is a perfect fit for data replication.
Where to perform transformations
One point often ignored by SAP marketing is transformations. Regardless of which type of integration is used, the transformation of the source data into the target data model needs to happen somewhere. In an ETL tool, this is done in a graphical modeling environment providing many different types of specialized transforms. In data federation, the only option is to use calculation views for the transformations. And to transform a stream of real-time data, yet another tool is needed.
But if all approaches have pros and cons, it would make sense to design the transformation once and decide which execution method to use only at activation time. This was one of the unique selling points of Hana Smart Data Integration (SDI).
When I designed Hana SDI, I based it on Smart Data Access (Hana SDA), the data federation option of Hana. A remote source is created by the user, representing the physical connection between Hana and the source system. The tables of the source system are imported as virtual tables: Hana objects with the same structure as the source table but using Hana data types. They do not contain data as such; they simply point to the external data.
If such a virtual table is queried, the Hana optimizer checks which parts of the statement can be pushed down to the source database, executes them there, and returns the resulting data set.
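As a sketch, the setup described above boils down to a few SQL statements. All object names, the DSN and the credentials are invented for illustration, and the exact remote source syntax varies by Hana version and adapter:

```sql
-- Create a remote source via the generic ODBC adapter (SDA);
-- adapter name and CONFIGURATION string depend on the source system
CREATE REMOTE SOURCE "ORA_ERP" ADAPTER "odbc"
  CONFIGURATION 'DSN=ora_erp'
  WITH CREDENTIAL TYPE 'PASSWORD' USING 'user=ERP_USER;password=secret';

-- Import a source table as a virtual table; it stores no data,
-- it merely points at ERP.ORDERS in the remote database
CREATE VIRTUAL TABLE "V_ORDERS" AT "ORA_ERP"."<NULL>"."ERP"."ORDERS";

-- Queries are federated: the optimizer pushes down what it can,
-- here the filter and ideally the aggregation as well
SELECT COUNT(*) FROM "V_ORDERS" WHERE "STATUS" = 'OPEN';
```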
Separation of concerns
With SDI, the very same things happen. However, the adapter code is no longer part of the database. It runs inside a separate process, the Data Provisioning Agent, which can be installed anywhere, usually next to the source system.
This has multiple advantages. First, the less program code runs inside Hana, the more stable the database is; otherwise, the source database's ODBC driver would have to be installed in and executed by the database itself and might compromise its stability. Second, the company network can remain locked down: the Hana (cloud) database talks to the Data Provisioning Agent via https, and the agent uses, for example, the Oracle driver to talk to the Oracle database inside the corporate network.
Because of this strong separation, SDI could also provide an adapter SDK for everybody to build new connectivity. There is a file adapter that allows querying CSV files via SQL commands, a Twitter adapter to search tweets via Hana select statements, and open source adapters available on GitHub – just to name a few.
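To illustrate the file adapter idea (the remote source name, file name and columns are invented, and the exact identifier scheme for file-based virtual tables differs between adapter versions): once a remote source exists on top of the file adapter, a CSV file behaves like any other virtual table.

```sql
-- Assuming a remote source "FILES" was created on the SDI file adapter,
-- expose one CSV file as a virtual table
CREATE VIRTUAL TABLE "V_SALES_CSV" AT "FILES"."<NULL>"."<NULL>"."sales.csv";

-- The flat file can now be filtered and aggregated with plain SQL
SELECT "REGION", SUM("AMOUNT")
  FROM "V_SALES_CSV"
 GROUP BY "REGION";
```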
Using Hana SDI
Real-time replication is an SDI extension of SDA. With the command “create remote subscription on <virtual_table> target table <hana_table>”, the system notifies the adapter to capture changes in the source table and stream them to Hana. In Hana, an apply process constantly merges these changes into the target table. Whatever source transaction modified the source data, the same changes become visible within the same transactional boundaries in Hana.
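A minimal replication setup, assuming the virtual table from the earlier example, might look as follows. The queue/distribute steps bracket the initial load so that no change occurring during the load is lost; object names are invented:

```sql
-- Target table with the same structure as the virtual table
CREATE TABLE "T_ORDERS" LIKE "V_ORDERS";

-- Subscription: tells the adapter which source table to watch
CREATE REMOTE SUBSCRIPTION "SUB_ORDERS"
  ON "V_ORDERS" TARGET TABLE "T_ORDERS";

-- Start capturing source changes into a queue
ALTER REMOTE SUBSCRIPTION "SUB_ORDERS" QUEUE;

-- Initial load through the virtual table while changes queue up
INSERT INTO "T_ORDERS" SELECT * FROM "V_ORDERS";

-- Apply the queued changes and keep applying new ones in real time
ALTER REMOTE SUBSCRIPTION "SUB_ORDERS" DISTRIBUTE;
```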
Batch dataflows are as simple as executing large select statements against a virtual table, reading the entire contents and inserting the data into the target table. To support mass data movement, some more features were added: for performance reasons, the select can be parallelized – one statement per source partition – and powerful transforms modeled after the Data Services transforms are also available.
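In its simplest form, such a batch load is just an INSERT…SELECT over the virtual table; partition-wise parallelism means running one such independent statement per source partition. The key ranges below are invented for illustration:

```sql
-- Full load in a single statement
INSERT INTO "T_ORDERS" SELECT * FROM "V_ORDERS";

-- Or parallelized: one independent statement per source partition/range,
-- executed concurrently by the engine
INSERT INTO "T_ORDERS"
  SELECT * FROM "V_ORDERS" WHERE "ORDER_ID" BETWEEN 1 AND 1000000;
INSERT INTO "T_ORDERS"
  SELECT * FROM "V_ORDERS" WHERE "ORDER_ID" BETWEEN 1000001 AND 2000000;
```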
Using this functionality is easy. The user creates a FlowGraph (hdbflowgraph), designs the overall transformation in a graphical UI and sets the real-time/batch execution flag; everything else is handled by Hana. One or multiple calculation scenarios (the internal runtime object created when activating a calculation view) are generated, and for real-time execution the subscriptions are created as well.
All good with SDI?
At this point, the first implementation issues start to become visible, though. The user can choose between batch and real time, but where is the third option, the virtual transformation via a calculation view? A Calculation Scenario is created, but there is no Calculation View through which to use it.
Also, there is not one single FlowGraph UI; it was built multiple times. First in XSC for Hana 1.0. Then this was ported 1:1 to XSA. Then the UI was developed again from scratch to match the XSA peculiarities. None of these UIs is complete – in the latest version, transforms are missing – and none of them is on par with Data Services when it comes to end user effectiveness.
Even in the adapters, these issues can be seen. SDI is an extension of SDA, so everything SDA provides, SDI can do. Then why did the SDA team develop the same adapters SDI already provided? (Answer: for political reasons.) As a result, there are two Oracle adapters, two Apache Hive adapters – two of nearly everything.
From my point of view, SDI is still the right product for Hana, and customer demand proves it. SAP seems to acknowledge that, too: the Hana Data Fabric uses the same philosophy but simplifies the usage even further, and for SAP Data Warehouse Cloud, the SDI adapters and FlowGraphs play a key role.
Meanwhile, I started combining SDI and Apache Kafka to enable all use cases. The key advantage of SDI is that it is Hana-driven. This can also be a downside, however, as it means Hana reaches into the source systems at will. Customers often want to protect their systems, and the easiest way is for the customer to provide the data proactively instead of Hana pulling it out. With the combination of SDI and Kafka, all data integration styles are possible: virtual data models, batch and real-time streaming.