All run SAP’s Enterprise Resource Planning (ERP) applications. One look at the long list of SAP customers that the University of Michigan maintains shows that a multitude of well-known, well-respected, and successful organizations across all industries rely on SAP applications for their primary business processes.
Analytics with SAP data
Because SAP plays a central role in organizations’ primary business processes, it is safe to say that key analytical environments, including data lakes, data warehouses, and streaming data applications, use data residing in SAP. Off the shelf, SAP provides Business Warehouse (BW) for analyzing SAP data.
However, organizations often have requirements to include non-SAP data in their analytical environments, and they generally don’t extend BW with large amounts of external data.
With large volumes of data in the data warehouse and the need for scalable data processing resources, the cost of running the entire data warehouse in BW, nowadays often on the Hana in-memory database, is an important consideration.
Thankfully there are cost-effective, flexible data lake, data warehouse, and streaming data solutions readily available in the cloud, including:
- File-based solutions like Amazon S3, Azure Data Lake Store (ADLS), and Google Cloud Storage (GCS)
- Database technologies like Snowflake, Amazon Redshift, Azure Synapse Analytics (formerly Azure SQL Data Warehouse), and Google BigQuery
- Streaming data solutions like Amazon MSK (a hosted version of open-source Apache Kafka), Amazon Kinesis, Azure Event Hubs, and Google Dataflow
With SAP applications running on top of a relational database, there are many tools and techniques to integrate data from relational databases into data lakes, data warehouses, and streaming data applications.
Getting data out of SAP
How do you extract data from SAP applications so that you can build your cloud-based data lake, data warehouse, or streaming data application? It isn’t necessarily easy. SAP recommends either writing code in its proprietary Advanced Business Application Programming (Abap) language or calling Business Application Programming Interfaces (BAPIs), predefined function modules that can be invoked as Remote Function Calls (RFCs).
Abap programs and BAPIs both run through the SAP application servers, the second and typically most heavily loaded tier in SAP’s three-tier architecture. Abap extracts use batch interactions, and out of the box BAPIs return limited information and exclude any additional columns that may have been added to your SAP environment as customizations.
For real-time replication from non-Hana sources, SAP provides (Sybase) Replication Server, but it does not support pool tables.
Also, SAP Replication Server has had very limited enhancements since the Sybase acquisition in 2010. It predominantly supports transaction processing databases like Oracle, DB2, SQL Server, and Sybase ASE, plus Hana and Sybase IQ as targets. Commonly used modern analytical platforms like cloud-based file systems (S3, ADLS, or GCS), analytical databases (Snowflake, Redshift), and streaming data solutions (Kafka) are not supported.
For replication out of SAP Hana as a source (as well as for non-Hana sources), there is the SAP Landscape Transformation (SLT) Replication Server. It uses a trigger-based approach to capture changes, so the capture work runs as part of each transaction on the source and adds overhead to it. Log-based Change Data Capture (CDC) is generally considered a superior approach for capturing changes.
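To make the overhead of trigger-based capture concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the source, and the table names `orders` and `orders_changes` are hypothetical. It is not how SLT is implemented; it only illustrates the general technique, in which every application write causes an additional write inside the same transaction.

```python
import sqlite3

# In-memory SQLite stands in for the source database (an assumption for
# illustration); "orders" and "orders_changes" are hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE orders_changes (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, id INTEGER, status TEXT
    );
    -- Each trigger adds one extra write to every application statement.
    CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
        INSERT INTO orders_changes (op, id, status)
        VALUES ('I', NEW.id, NEW.status);
    END;
    CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
        INSERT INTO orders_changes (op, id, status)
        VALUES ('U', NEW.id, NEW.status);
    END;
    CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
        INSERT INTO orders_changes (op, id, status)
        VALUES ('D', OLD.id, OLD.status);
    END;
""")

# One application transaction: three writes become six.
with conn:
    conn.execute("INSERT INTO orders VALUES (1, 'NEW')")
    conn.execute("UPDATE orders SET status = 'SHIPPED' WHERE id = 1")
    conn.execute("DELETE FROM orders WHERE id = 1")

changes = conn.execute(
    "SELECT op, id, status FROM orders_changes ORDER BY seq"
).fetchall()

# A failed application transaction rolls back its captured changes too,
# which shows that the capture runs inside the source transaction.
try:
    with conn:
        conn.execute("INSERT INTO orders VALUES (2, 'NEW')")
        raise RuntimeError("simulated application failure")
except RuntimeError:
    pass

rows_after_rollback = conn.execute(
    "SELECT COUNT(*) FROM orders_changes"
).fetchone()[0]

print(changes)
print(rows_after_rollback)
```

The change table ends up with one row per application write, and the rolled-back insert leaves no trace in it. Both follow from the trigger executing within the application's own transaction, which is where the source-side overhead comes from.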
Data replication from SAP
Log-based CDC, available for many commonly used transaction processing databases including SAP Hana, provides a strong alternative for data replication from SAP applications. It captures changes from the source database transaction logs in real time as they are written, with minimal impact on the SAP application.
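The idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the log is a plain list of dictionaries with hypothetical `tx`, `kind`, `key`, and `row` fields, whereas real transaction logs are binary and database-specific. The `LogReader` class and `apply` function are illustrative names, not part of any product.

```python
from dataclasses import dataclass, field


@dataclass
class LogReader:
    """Reads an append-only transaction log from a saved position."""
    position: int = 0
    pending: dict = field(default_factory=dict)  # tx id -> buffered changes

    def poll(self, log):
        """Return the changes of transactions committed since the last poll."""
        committed = []
        for record in log[self.position:]:
            tx, kind = record["tx"], record["kind"]
            if kind == "commit":
                committed.extend(self.pending.pop(tx, []))
            elif kind == "rollback":
                self.pending.pop(tx, None)  # discard uncommitted work
            else:
                self.pending.setdefault(tx, []).append(record)
        self.position = len(log)
        return committed


def apply(target, changes):
    """Apply captured changes to a target keyed by primary key."""
    for change in changes:
        if change["kind"] == "delete":
            target.pop(change["key"], None)
        else:  # insert or update
            target[change["key"]] = change["row"]


# The application only appends to the log; the reader works from the log alone.
log = [
    {"tx": 1, "kind": "insert", "key": 100, "row": {"status": "NEW"}},
    {"tx": 2, "kind": "insert", "key": 200, "row": {"status": "NEW"}},
    {"tx": 1, "kind": "commit"},
    {"tx": 2, "kind": "rollback"},
]
reader = LogReader()
target = {}
apply(target, reader.poll(log))

log += [
    {"tx": 3, "kind": "update", "key": 100, "row": {"status": "SHIPPED"}},
    {"tx": 3, "kind": "commit"},
]
apply(target, reader.poll(log))
print(target)  # {100: {'status': 'SHIPPED'}}
```

The reader never queries the application tables and adds nothing to the application's transactions. It only needs read access to the log and a saved position to resume from.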
The technology integrates with the SAP dictionaries to retrieve up-to-date definitions for pool and cluster tables, including any custom Z-columns that may have been added to the tables. Cluster and pool tables are subsequently decoded downstream in the replication flow, away from SAP applications, without relying on Abap or BAPIs.
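As a rough illustration of dictionary-driven decoding, the sketch below splits a packed payload into named fields using field definitions looked up by logical table name. Everything in it is an assumption made for illustration: the `DICTIONARY` contents, the `ZCOLORS` table, the `TABNAME` and `VARDATA` column names, and the fixed-width text layout. Actual pool and cluster storage is binary and considerably more involved.

```python
# Hypothetical dictionary: logical table -> ordered (field name, width) pairs.
DICTIONARY = {
    "ZCOLORS": [("CODE", 3), ("NAME", 10), ("ZHEX", 6)],  # ZHEX: custom column
}


def decode_row(tabname, vardata, dictionary=DICTIONARY):
    """Split a packed fixed-width payload into named fields."""
    fields, offset = {}, 0
    for name, width in dictionary[tabname]:
        fields[name] = vardata[offset:offset + width].rstrip()
        offset += width
    return fields


# One physical row as it might arrive from the capture step (simplified).
physical_row = {
    "TABNAME": "ZCOLORS",
    "VARDATA": "RD".ljust(3) + "Red".ljust(10) + "FF0000",
}

decoded = decode_row(physical_row["TABNAME"], physical_row["VARDATA"])
print(decoded)  # {'CODE': 'RD', 'NAME': 'Red', 'ZHEX': 'FF0000'}
```

Because the field list comes from the dictionary rather than from hard-coded logic, a newly added custom column only requires refreshed definitions, and the decoding itself runs outside the SAP application servers.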
Log-based CDC can provide many benefits for SAP customers, including:
- Near-zero overhead load on the source SAP transactional database
- Lower latency between source and target
- Real-time data availability
- Improved data quality: log-based CDC loses no changes, including deletes and transient updates
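The last benefit is easiest to see by comparing what a periodic query of the source table returns with what the log contains. The sketch below is a toy model: the `(operation, key, status)` record format and the status values are hypothetical.

```python
# Every change the application made between two polls, in log order.
log = [
    ("insert", 1, "NEW"),
    ("update", 1, "BLOCKED"),   # transient state, overwritten before the poll
    ("update", 1, "RELEASED"),
    ("insert", 2, "NEW"),
    ("delete", 2, None),        # row is gone again before the poll
]


def replay(records):
    """Apply log records to an empty table and return its final state."""
    table = {}
    for op, key, status in records:
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = status
    return table


# Query-based extraction only sees the rows that exist at poll time.
polled = replay(log)

# Log-based CDC delivers each record, so the full history is available.
statuses_seen_by_log = [status for op, key, status in log if key == 1]
deletes_seen_by_log = [key for op, key, _ in log if op == "delete"]

print(polled)                # {1: 'RELEASED'}
print(statuses_seen_by_log)  # ['NEW', 'BLOCKED', 'RELEASED']
print(deletes_seen_by_log)   # [2]
```

The poll shows no sign that row 1 was ever blocked or that row 2 existed at all, while the log retains both facts.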
As businesses adopt cloud-based solutions, the need for real-time data to power business operations continues to grow. To succeed in the modern business environment, leaders need actionable insights based on the freshest, most accurate data. Log-based CDC can help mitigate the drain on resources that can occur when extracting SAP data for analytics.
With log-based CDC, essential SAP data is available when and how it is most needed.