As the volume of financial data grows, organizations struggle to simplify data management and clear the way to better reporting and analysis.
In the SAP Press book, SAP Simple Finance: An Introduction, Head of LoB Finance and SAP Innovation Center Jens Krüger, Ph.D., gives readers a look at the inner workings of Simple Finance's HANA-driven applications, which are designed to streamline financial operations.
This excerpt, from Chapter 3: Removal of Redundancy, provides an in-depth explanation of one of the central concepts behind SAP Simple Finance.
3: Removal of Redundancy
Redundantly kept data -- that is, data derived from other data available elsewhere in the database -- is one of the big challenges of software systems.
You may wonder why you would store, for example, the sum of two invoice amounts separately if the same value can easily be derived by calculating it on the fly from the two invoices. In the past, such data redundancy was only introduced to increase performance, because traditional databases could not keep up with user expectations in light of billions of data entries. This came at the cost of significant effort to keep the redundant data consistent, increased database storage and more complex systems.
Now that SAP HANA improves performance radically, as outlined in the previous chapter, the need for redundancy vanishes. Based on a single source of truth, derived data can be calculated on the fly instead of being physically stored in the database. Hence, in the spirit of simplification, getting rid of redundancy is a key paradigm of SAP Simple Finance. We begin our exploration of the paradigms underlying the SAP Simple Finance model by looking at the removal of redundancy in this chapter.
This chapter first introduces the conceptual benefits of a redundancy-free system by contrasting it with the disadvantages of redundant data (Section 3.1). Section 3.2 demonstrates how SAP HANA and in-memory technology enable us to overcome redundancy and avoid its pitfalls. Then, in Section 3.3 we explore how SAP Simple Finance simplifies the core financials data model by nondisruptively replacing materialized redundant data with on-the-fly calculation. Finally, Section 3.4 highlights the immediate benefits of this simplification for companies switching to SAP Simple Finance.
3.1 Benefits of a redundancy-free system
Data redundancy occurs if data is repeated in more than one location -- for example, in two separate database tables -- or can be computed from other data already kept in the database. In general, all data that you can also derive from other data sources of a system is redundant. Redundant data is distinguished by the fact that you need to maintain its consistency. If you modify the data at one location, then you need to apply corresponding modifications at all the other locations in the database in order to keep the relationship intact and avoid anomalies. Redundant data is high-maintenance data, and it won't take care of itself.
Software engineers spotted the fundamental problems of data redundancy early on. In 1970, Edgar Codd introduced the fundamental relational model that underlies many of today's database systems. In order to reduce redundancy, normalization techniques such as normal forms are an integral part of the relational model.
But data redundancy has remained. We can distinguish four different kinds of redundancy within a database system, depending on the relationship of redundant data to the original data:
Materialized view
A materialized view stores a subset of data from one or several tables as duplicate copies in a second new, typically smaller table. It materializes the result of a database view, which is a stored database query, in order to provide direct access to the result. As such, the materialized view can be compared to an index into the base table. The query may also join several tables.
Materialized aggregate
If the query of a materialized view aggregates tuples from the base table(s), then we speak of materialized aggregates as a special case. A materialized aggregate does not contain one-to-one duplicates, but nevertheless, the data is redundant, because it can be derived from the original data at any time by applying the same calculation on the fly.
Duplicated data due to overlap
Data may be duplicated if related information is stored in separate locations. Part of the data attributes overlap, whereas others are present only at one location. The overlap is then a source of redundancy. The different locations are often due to a separation of concerns, whereas the overlap again stems from performance reasons -- for example, to avoid joins that are costly in traditional database systems.
Materialized result set
Long-running programs that arrive at a certain output after several steps of calculation often store the result of their work as a materialized result set. Due to the complexity of the program logic, the result is not described as a query but is created by running the program, which then stores its output for fast future reference.
None of these instances of redundancy is necessary from a functional point of view. They have been introduced in the past into enterprise systems to improve query response times in view of slow, disk-based database systems, which made an on-the-fly calculation prohibitively expensive. In-memory technology now breaks down the performance barrier so that SAP Simple Finance removes all kinds of redundancy. This chapter covers the removal of redundancy with regard to the first two categories: materialized views and materialized aggregates. (We'll see these categories come up again later in the book. As described in Chapter 6, eliminating the need for long-running batch jobs makes materialized result sets obsolete in a real-time finance system. The Universal Journal covered in Chapter 9 describes a case in which SAP Simple Finance gets rid of duplicated data due to overlap between financial components.)
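To make the difference between a materialized aggregate and on-the-fly calculation concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (line_items, totals, and so on) are hypothetical and chosen only for illustration; they are not SAP tables, and SQLite merely stands in for a database here.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE line_items (customer TEXT, month TEXT, amount REAL);
    -- Materialized aggregate: one stored row per customer and month.
    CREATE TABLE totals (customer TEXT, month TEXT, total REAL,
                         PRIMARY KEY (customer, month));
""")

def post_with_materialization(customer, month, amount):
    """Record the line item AND keep the redundant total in sync."""
    conn.execute("INSERT INTO line_items VALUES (?, ?, ?)",
                 (customer, month, amount))
    cur = conn.execute(
        "UPDATE totals SET total = total + ? WHERE customer = ? AND month = ?",
        (amount, customer, month))
    if cur.rowcount == 0:  # no total stored yet for this customer and month
        conn.execute("INSERT INTO totals VALUES (?, ?, ?)",
                     (customer, month, amount))

def total_materialized(customer, month):
    """Read the prebuilt figure from the redundant table."""
    row = conn.execute(
        "SELECT total FROM totals WHERE customer = ? AND month = ?",
        (customer, month)).fetchone()
    return row[0] if row else 0.0

def total_on_the_fly(customer, month):
    """Derive the same figure from the line items at query time."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM line_items "
        "WHERE customer = ? AND month = ?", (customer, month)).fetchone()
    return row[0]

post_with_materialization("C1", "2015-01", 100.0)
post_with_materialization("C1", "2015-01", 250.0)
print(total_materialized("C1", "2015-01"), total_on_the_fly("C1", "2015-01"))
```

Both functions return the same total, but only the materialized variant needs the extra write on every posting. Dropping the totals table leaves the line items as the only place where the figure lives.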
If you look beyond a single database, redundancy also occurs when duplicating the same or derived data in several database systems. In contrast to redundancy within the same database, this cross-system duplication may in some cases be wanted -- for example, for data security purposes. In other cases, it is truly redundant -- for example, in the case of data warehouses, due to missing analytical capabilities of traditional database systems. As outlined in Chapter 2, this kind of data redundancy on a system level is inherently superfluous, because SAP HANA combines OLTP [online transaction processing] and OLAP [online analytical processing] capabilities with superior performance.
Coming back to redundancy within a database, what are the benefits of having a redundancy-free system? Why is having a single source of "truth" in SAP Simple Finance preferable compared to the traditional world with materialized views that serve to improve the performance of read operations?
The fundamental problem with redundant data is the effort necessary to keep it consistent when the original data changes. If you materialize the total amount of all orders of a certain customer, then you need to update that figure every time a new order by that customer comes in. Materializing a figure that needs to be updated frequently -- for example, the balance of an often-used account -- may even lead to contention, because the balance entry needs to be locked on each debit or credit on the account.
Similarly, a materialized view containing a duplicated subset of accounting data -- for example, all open invoices -- needs to be reconciled whenever the original data changes. If an open invoice is cleared by a customer's payment, then it also has to be removed from the corresponding materialized view.
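The reconciliation effort for such a materialized view can be sketched in a few lines of Python. The names below (Invoice, open_items, the reconcile flag) are hypothetical helpers invented for this illustration; the flag exists only to simulate a transaction that forgets the extra step.

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    number: str
    amount: float
    cleared: bool = False

invoices = {}    # original data: every invoice, keyed by number
open_items = {}  # materialized view: redundant copy of the open invoices

def post_invoice(number, amount):
    invoices[number] = Invoice(number, amount)
    open_items[number] = amount  # extra write to keep the copy in sync

def clear_invoice(number, reconcile=True):
    invoices[number].cleared = True
    if reconcile:
        del open_items[number]  # extra write to keep the copy in sync

def open_items_on_the_fly():
    """Derive the open items from the original data at query time."""
    return {n: inv.amount for n, inv in invoices.items() if not inv.cleared}

post_invoice("INV-1", 100.0)
post_invoice("INV-2", 250.0)
clear_invoice("INV-1")
consistent = open_items == open_items_on_the_fly()

# Simulate a transaction that skips the reconciliation step.
clear_invoice("INV-2", reconcile=False)
drifted = open_items != open_items_on_the_fly()
print(consistent, drifted)
```

As long as every transaction performs the extra write, the copy and the original agree. A single missed reconciliation leaves the copy claiming that a cleared invoice is still open, whereas the on-the-fly variant cannot drift because it holds no copy.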
Regardless of whether this reconciliation needs to be done manually by each business transaction or is handled automatically by the database system, maintaining consistency leads to additional database operations on top of the actual modifications. Otherwise, your database would hold more than one truth -- which means that it would be inconsistent. Getting rid of such reconciliation activities is the source of several benefits:
Overall, the architecture and data model of a redundancy-free system are significantly simpler. There are fewer dependencies and constraints that transactions need to take into account when modifying or analyzing data. As a consequence, transactions are also simpler. The overall simplification also manifests in a few other listed benefits.
Consistency maintained by design
Business transactions that modify data do not need extra work in order to maintain consistency. This makes for simpler programs based on a less complex data model with fewer dependencies. Overall, the architecture of a redundancy-free system is more natural: transactions record in the database what is happening by adding corresponding database records. Everything else is then provided as read-only algorithms on top of the data.
Because it is no longer necessary to modify redundant data in parallel, the number of database operations per business transaction diminishes. Only the essential operations to record the transaction occur, and no additional modifying operations are necessary. Fewer database operations and thus shorter duration of each transaction increase the throughput of the whole system so that it can handle more business transactions. Furthermore, materialized aggregates no longer have to be locked for updating, thus removing any contention issues.
Smaller database footprint
Redundant data takes up memory and hard disk space that is no longer needed in a redundancy-free system, which shrinks the overall database footprint. A smaller footprint reduces not only the requirements on the main database server but also in turn the disk space needed for backups. A smaller database footprint and higher throughput make a system more cost-efficient. As a consequence, a redundancy-free system has a lower total cost of ownership (TCO).
Flexibility in real time
A redundancy-free system also implies that it is no longer necessary to prebuild aggregates for performance reasons. Fast responses to analytics questions no longer depend on the availability of a materialized answer. Instead of being restricted to questions that can be answered based on what was foreseen when designing the data model of a system, users are given the flexibility to ask any question that can be answered in real time based on the original data. An in-memory redundancy-free system opens up this new level of flexibility with the performance expected from today's fast Web applications.
The obvious questions that arise from this list of impressive benefits are: Why haven't all systems always been free of redundancy? And, why wasn't SAP ERP Financials historically an exception either?
In the past, in a traditional disk-based database system the answer was performance reasons. Now, in an SAP HANA world, the answer is that there is no excuse left for redundancy, thanks to the speed of in-memory technology. In the past, query response times (especially involving aggregation) were too slow for productive use without materialization. The costs of a redundancy-free system were too high, even compared to the benefits. These days, the superior performance of in-memory database systems means that the fear of slow response times is a thing of the past, as demonstrated in the following subchapter. Figure 3.1 illustrates that, in contrast to the traditional database world, the benefits of a redundancy-free system clearly outweigh the costs, as we'll see applied to the case of SAP Simple Finance in Section 3.3.
3.2 In-memory technology removes redundancy
For decades, data redundancy has been reluctantly accepted in order to reach sufficient performance. The slow response times of disk-based database systems made a redundancy-free data model impossible. Because latencies and bandwidth of disk access are orders of magnitude slower than in-memory access, calculating totals on the fly was prohibitively expensive.
But a new era is upon us. As outlined in Chapter 2, in-memory technology fundamentally challenges long-standing assumptions. In this section, we demonstrate that SAP HANA indeed enables us to get rid of redundancy without compromising performance.
These days, on-the-fly aggregation is feasible. In order to get rid of redundancy without disrupting business processes, on-the-fly calculations replacing materialized aggregates in an in-memory system have to perform at least as fast as users are used to based on materialization in a disk-based database system. The comparison is thus between disk-based with materialization on the one hand and in-memory with on-the-fly calculation on the other.
Let's take a closer look. For our demonstration, we'll examine a disk-based system with a typical disk latency of 10 ms (that is, accessing a random location on disk takes 10 milliseconds or 10 million nanoseconds). In comparison, in-memory latency is only 100 ns (0.1 microseconds). In other words, the database can make 100,000 random accesses in the same time as one disk access. The memory bandwidth in case of sequential reads is 4 MB per millisecond, per core. For the following example, we consider a simple server with a single CPU with eight cores. Modern server systems often connect several CPUs, each with even more than 10 cores, further increasing the processing power. For our example, we assume a company with 100 million accounting document line items in its database and 50,000 customers.
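The assumptions of this demonstration can be collected in a few lines of Python, which also makes the derived ratios easy to check. The constant names are ours; the values are the ones stated in the setup.

```python
# Assumptions of the demonstration; constant names are illustrative.
DISK_LATENCY_NS = 10 * 1_000_000   # 10 ms per random disk access
MEMORY_LATENCY_NS = 100            # 100 ns per random in-memory access
BANDWIDTH_MB_PER_MS_PER_CORE = 4   # sequential read bandwidth per core
CORES = 8                          # one CPU with eight cores
LINE_ITEMS = 100_000_000           # accounting document line items
CUSTOMERS = 50_000

# Random in-memory accesses that fit into the time of one disk access.
accesses_per_disk_access = DISK_LATENCY_NS // MEMORY_LATENCY_NS

# Sequential scan bandwidth when all cores work in parallel.
scan_bandwidth_mb_per_ms = CORES * BANDWIDTH_MB_PER_MS_PER_CORE

# Average number of line items per customer.
items_per_customer = LINE_ITEMS // CUSTOMERS

print(accesses_per_disk_access, scan_bandwidth_mb_per_ms, items_per_customer)
```

The three derived values are 100,000 accesses per disk access, 32 MB per millisecond of scan bandwidth and 2,000 line items per customer on average.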
"Total sales in the current month to a particular customer" is a typical aggregate value of interest in financial operations. In the disk-based scenario with materialization, a materialized aggregate for this kind of analysis will contain one tuple per customer, per month that gives direct access to the answer to such an analysis query. Hence, a user can expect to receive an answer from a disk-based database with materialization after 10 ms, assuming the database system needs a single disk access to retrieve the value and ignoring any further processing for comparability.
In an in-memory system without materialization, the aggregate value is not precomputed as part of every business transaction. As a consequence, calculating the answer to the query takes more steps. Nevertheless, response times are faster, as we'll show. Since the calculation happens transparently for applications accessing the database, the calculation steps will not be noticed by users. To keep the explanations simple, the following sequence of steps assumes a sequential execution of the query. Another advantage of an in-memory system is its built-in parallelization, which further speeds up the execution of more complex queries. The steps are as follows:
1. Select all accounting document line items for the particular customer.
In a column table, the column representing the customer associated with each line item is stored in one continuous block of memory. To identify all items of a particular customer, the database system scans the whole column and keeps track of the positions where the customer number in question appears. Thanks to the columnar layout, one such full column scan operates with the full bandwidth of 4 MB per millisecond and core. SAP HANA's built-in compression means that only the compressed integer representations of customer numbers need to be compared. Because there are 50,000 different numbers, the compressed representation of each customer number only needs two bytes (2^16 = 65,536 different values). Hence, the customer column for 100 million line items takes up 200 million bytes, or roughly 190 MB. With eight cores, scanning this attribute vector takes 6 ms (190 MB divided by 8 times 4 MB per millisecond).
2. Apply further selections.
For each of the items from Step 1, the database next applies further selection conditions, such as the month. For each additional condition, this mandates a lookup in the corresponding attribute vector at every position identified so far. With on average 2,000 line items per customer, this leads to 2,000 random accesses in a worst-case scenario with entirely noncontiguous positions. Even in that case, this takes only 0.2 ms on a single core (2,000 accesses times 100 ns per access = 200,000 ns).
3. Add up the sales amount of all line items.
After all relevant items have been identified in the previous steps, the database retrieves the sales amount for each item and adds it to the result. Again, the database makes one random access per position in the attribute vector for the sales amount and resolves the dictionary-encoded value in one additional access to the dictionary. Even if Step 2 didn't exclude further positions, this requires 4,000 accesses and 0.4 ms on a single core.
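The three steps can be sketched on toy data in plain Python. This is a simplified illustration of dictionary-encoded columns and a sequential scan, not a description of how SAP HANA is implemented; all names (dictionary_encode, total_sales, the sample values) are assumptions made for this example.

```python
from array import array

# Toy line items, stored column by column (one list per attribute).
customers = ["C2", "C1", "C2", "C3", "C1", "C2"]
months = ["2015-01", "2015-01", "2015-02", "2015-01", "2015-02", "2015-01"]
amounts = [100.0, 40.0, 70.0, 10.0, 5.0, 100.0]

def dictionary_encode(values):
    """Return (dictionary, attribute vector) for one column."""
    dictionary = sorted(set(values))
    value_ids = {value: i for i, value in enumerate(dictionary)}
    # Typecode "H" is an unsigned short (typically two bytes per entry),
    # which covers up to 65,536 distinct values.
    vector = array("H", (value_ids[v] for v in values))
    return dictionary, vector

cust_dict, cust_vec = dictionary_encode(customers)
month_dict, month_vec = dictionary_encode(months)
amount_dict, amount_vec = dictionary_encode(amounts)

def total_sales(customer, month):
    cust_id = cust_dict.index(customer)
    month_id = month_dict.index(month)
    # Step 1: scan the whole customer column and remember matching positions.
    positions = [i for i, v in enumerate(cust_vec) if v == cust_id]
    # Step 2: one lookup in the month column per remaining position.
    positions = [i for i in positions if month_vec[i] == month_id]
    # Step 3: per position, read the value ID and resolve it in the dictionary.
    return sum(amount_dict[amount_vec[i]] for i in positions)

print(total_sales("C2", "2015-01"))
```

Only Step 1 touches every entry of a column. Steps 2 and 3 work on the positions that survived the previous step, which is why their cost depends on the number of line items per customer rather than on the size of the table.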
For the total response time, we can ignore Steps 2 and 3 because they don't contribute significantly to the overall time. In total, on-the-fly calculation of sales to a particular customer in the current month takes approximately 6 ms in an in-memory database system -- less than the 10 ms for accessing the materialized aggregate in a disk-based database. Replacing materialization with on-the-fly calculation is thus entirely feasible in this example.
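The estimate can be reproduced as a back-of-the-envelope calculation. The sketch below is a rough model rather than a benchmark: the function names are ours, and the latency per random access is a parameter whose default is the in-memory latency of 100 ns from the setup of the demonstration.

```python
LINE_ITEMS = 100_000_000
BYTES_PER_VALUE = 2                # compressed customer number
CORES = 8
BANDWIDTH_MB_PER_MS_PER_CORE = 4
ITEMS_PER_CUSTOMER = 2_000         # average: 100 million items, 50,000 customers
DISK_LATENCY_MS = 10.0             # one access to the materialized aggregate

def scan_ms():
    """Step 1: full scan of the customer column across all cores."""
    column_mb = LINE_ITEMS * BYTES_PER_VALUE / (1024 * 1024)  # roughly 190 MB
    return column_mb / (CORES * BANDWIDTH_MB_PER_MS_PER_CORE)

def random_access_ms(accesses, latency_ns):
    """Cost of a number of random in-memory accesses on a single core."""
    return accesses * latency_ns / 1_000_000

def on_the_fly_ms(latency_ns=100):
    step1 = scan_ms()
    step2 = random_access_ms(ITEMS_PER_CUSTOMER, latency_ns)      # one filter
    step3 = random_access_ms(2 * ITEMS_PER_CUSTOMER, latency_ns)  # vector + dictionary
    return step1 + step2 + step3

print(f"scan: {scan_ms():.2f} ms, total: {on_the_fly_ms():.2f} ms, "
      f"disk lookup: {DISK_LATENCY_MS:.0f} ms")
```

Under these assumptions the column scan dominates the total, and the sum of all three steps stays below the 10 ms of a single disk access.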
In the case of more complex queries, an in-memory system with on-the-fly calculation may even increase its advantage. For example, a query similar to the one just considered, but for all months of the current year instead of only the current month, would require up to 12 disk accesses to the materialized values stored per month (unless an additional materialized aggregate is introduced). Whereas the response time of the disk-based system increases correspondingly to as much as 120 ms, the on-the-fly calculation only needs to adapt the further selection criteria, keeping the response time of the in-memory system almost the same, at roughly 6 ms.
We've already demonstrated the desirability of a redundancy-free system by highlighting the benefits. In addition, these calculations demonstrate its feasibility thanks to the faster performance in-memory. What is possible with fast and stable access times now would have been prohibitively expensive in a traditional disk-based system. Depending on the caching strategy, a disk-based database has to retrieve answers from disk, suffering long latency and slow bandwidth. Compared to that, an in-memory database offers not only faster but also more stable response times, because latency and bandwidth do not vary as they do in a disk-based system that frequently needs to go back to disk to retrieve data.
How does SAP Simple Finance optimize the new possibilities to create a redundancy-free system?
3.3 Simplifying the core data model
As outlined above, enterprise applications in the past needed to store data redundantly in order to meet performance expectations of their users in view of limited database performance. Applications that remain bound to traditional disk-based databases still experience these limitations. SAP Simple Finance is based on SAP HANA, so it makes use of the dramatically improved performance of an in-memory database. At the same time, its data model is a nondisruptive evolution of the SAP ERP Financials data model, removing any redundancy that has historically been necessary for performance reasons.
Looking at the data model of SAP ERP Financials, several instances of data redundancy quickly become apparent. The fundamental separation of different components (such as financial accounting, controlling, profitability and others) into separate table structures is a case of duplicate data. Chapter 9 describes how the integration of these previously separate components and data models into a Universal Journal enables radically new approaches to finance, in addition to the usual benefits of a redundancy-free system.
The data model of each of these components in turn contained data redundancy in terms of materialized views and materialized aggregates. To explain the changes on this foundational level (essentially the first steps toward a redundancy-free system, completed by the Universal Journal), we now take a closer look into the data model of Financial Accounting's General Ledger (G/L). The explanations similarly apply to the other components; they, too, have been simplified in the same spirit.
Let's take a closer look at Financial Accounting. While doing so, we reference the old tables from Financial Accounting for reasons of comparison; with the Universal Journal (see Chapter 9), a next step merges the data structures of hitherto separate components. The fundamental and essential data tuples of every financial accounting system -- besides master data -- are the accounting documents and their line items. The system records each transaction as an accounting document with at least two line items (one each for debit and credit entry) but potentially more.
Faced with millions or billions of accounting documents (headers, primarily stored in table BKPF) and their line items (table BSEG) and slow disk-based performance, SAP ERP Financials needed materialized views and materialized aggregates in order to provide sufficiently fast access to line items with specific properties or to aggregate values. For this reason, the core data model of Financial Accounting (as illustrated in Figure 3.2) contained, among others, six materialized views (three for open line items separated by accounts receivable, accounts payable, and G/L accounts, and three for cleared line items separated in the same manner) and three materialized aggregates for corresponding totals.
For example, in an SAP ERP Financials system the materialized view of all open accounts receivable line items (table BSID) contains a copy of each line item (with a subset of attributes) from the original table of accounting document line items that fulfills the following condition: the line item is open (that is, has not been cleared) and is part of the accounts receivable sub-ledger. Needless to say, when a transaction clears the item, it also has to delete the corresponding tuple from the materialized view of open accounts receivable items and add it to the corresponding materialized view of cleared accounts receivable items.
In contrast, in SAP Simple Finance, all of these materializations have been removed in order to take the first step toward eliminating redundancy. Instead, the corresponding tables have been replaced with compatibility views. From the accounting documents and line items as the single source of truth, any derived data can be calculated on the fly, most times with higher performance than in a traditional disk-based system. This includes, but of course is not limited to, the views and aggregates that have previously existed in materialized versions.
Figure 3.3 illustrates the resulting redundancy-free data model of Financial Accounting (before the Universal Journal) that is functionally equivalent to the previous data model. In the spirit of simplification, it consists only of the essential tables for accounting documents and for accounting document line items that record the business transactions. In addition, the compatibility views transparently provide access to the same information that was redundantly stored in materialized views and materialized aggregates before. These redundant tables are in turn obsolete, since SAP HANA calculates the same information on the fly. Appendix A lists the tables that have been replaced with compatibility views.
The compatibility views bear the same name as their historical predecessors to ensure that the changes are nondisruptive and do not require SAP customers to modify their custom programs. Any program -- be it part of the SAP standard or a customer modification -- that in the past accessed the materialized view of open accounts receivable line items is now seamlessly routed to the corresponding compatibility view. The compatibility view calculates the result for each query on demand -- without compromising performance, thanks to SAP HANA (as explained in Section 3.2). In this case, the view selects the open items belonging to the accounts receivable subledger directly from the original table of accounting document line items. Any additional selection conditions -- for example, a specific customer -- are immediately passed through to the query optimizer and integrated into the query execution plan.
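The idea of a compatibility view can be illustrated with an ordinary database view. The sketch below uses Python's built-in sqlite3 module as a stand-in database. The line_items table and its columns are hypothetical and heavily simplified; only the view name follows the table name mentioned above for open accounts receivable items. It is not the actual SAP definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, heavily simplified stand-in for the line item table.
    CREATE TABLE line_items (
        doc_no TEXT, item_no INTEGER, ledger TEXT,
        customer TEXT, amount REAL, cleared INTEGER
    );
    -- A view carrying the name of the former materialized table.
    CREATE VIEW bsid AS
        SELECT doc_no, item_no, customer, amount
        FROM line_items
        WHERE ledger = 'AR' AND cleared = 0;
""")
conn.executemany("INSERT INTO line_items VALUES (?, ?, ?, ?, ?, ?)", [
    ("D1", 1, "AR", "C1", 100.0, 0),
    ("D2", 1, "AR", "C1", 250.0, 1),
    ("D3", 1, "AP", "V9", 80.0, 0),
    ("D4", 1, "AR", "C2", 40.0, 0),
])

def open_ar_items(customer):
    """Legacy-style query: reads bsid exactly as if it were still a table."""
    return conn.execute(
        "SELECT doc_no, amount FROM bsid WHERE customer = ? ORDER BY doc_no",
        (customer,)).fetchall()

before = open_ar_items("C1")
# Clearing the item touches the line item table only; there is no copy to update.
conn.execute("UPDATE line_items SET cleared = 1 WHERE doc_no = 'D1'")
after = open_ar_items("C1")
print(before, after)
```

The querying code does not change when the table is replaced by a view of the same name, and the additional condition on the customer is evaluated together with the view's own conditions against the line items.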
As mentioned at the beginning of this section, the same approach applies to other components as well. For example, materialized aggregates on top of controlling documents are no longer necessary either, opening up the possibility to combine different accounting components in the Universal Journal. As outlined in Chapter 12, Section 12.4, cash management is another area with similar changes that remove data redundancy.
In summary, the SAP Simple Finance data model is now entirely based on line items, without any prebuilt materialized aggregates or other data redundancy. Not only is the data model simpler, but the program architecture is also simpler: the system "simply" records all business transactions as they happen. Everything else is being calculated on the fly by algorithms on top of the data. Without any negative effect on your existing investments in SAP systems, you immediately benefit from switching to SAP Simple Finance. Let's look at how.
3.4 Immediate benefits of the new data model
Removing materialized views and materialized aggregates from the financial accounting data model has an immediate positive impact on the transactional throughput of the system. In the case of SAP Simple Finance, posting an accounting document requires neither inserting redundant duplicates into materialized views nor updating redundant aggregate values. The corresponding effort and database operations to maintain consistency are no longer necessary.
As a consequence, the number of tuples inserted or modified during database operations was indeed cut by half according to experimental measurements. These experimental measurements are based on real-world data of a large SAP customer. Five hundred accounting documents were posted, each with six line items. Instead of 26,000 tuples affected by UPDATE, INSERT, and DELETE operations in a traditional SAP ERP Financials system already running on SAP HANA, SAP Simple Finance only needed to insert 11,000 database tuples into the tables of the financials component for the entire test -- a savings of more than a factor of two.
Of course, fewer tuples translate directly to less end-to-end transaction time spent posting a document. Instead of over 200 ms per document in SAP ERP Financials on SAP HANA, posting in SAP Simple Finance only needed 100 ms from end to end, down by a factor of two. As a consequence, the throughput of a system running SAP Simple Finance doubled in this scenario.
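The reported measurements translate into a few simple ratios. The sketch below only restates the published figures; the per-document values and the sequential-posting view of throughput are our own derived illustrations, and 200 ms is used as a lower bound for the "over 200 ms" measured for SAP ERP Financials.

```python
DOCUMENTS = 500                # accounting documents posted in the test
TUPLES_ERP = 26_000            # tuples affected in SAP ERP Financials on SAP HANA
TUPLES_SIMPLE_FINANCE = 11_000 # tuples inserted in SAP Simple Finance
POSTING_MS_ERP = 200           # lower bound; the text reports "over 200 ms"
POSTING_MS_SIMPLE_FINANCE = 100

tuple_reduction = TUPLES_ERP / TUPLES_SIMPLE_FINANCE
tuples_per_doc_erp = TUPLES_ERP / DOCUMENTS
tuples_per_doc_simple_finance = TUPLES_SIMPLE_FINANCE / DOCUMENTS

# Documents per second if postings ran strictly one after another.
docs_per_second_erp = 1000 / POSTING_MS_ERP
docs_per_second_simple_finance = 1000 / POSTING_MS_SIMPLE_FINANCE
throughput_gain = docs_per_second_simple_finance / docs_per_second_erp

print(round(tuple_reduction, 2), tuples_per_doc_erp,
      tuples_per_doc_simple_finance, throughput_gain)
```

Per document, this corresponds to 52 affected tuples versus 22 inserted tuples, a reduction by a factor of about 2.4, alongside a throughput gain of at least a factor of two.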
The experimental measurements by design did not even include the effect of contention that often appears in systems with data redundancy. Enterprise systems usually handle a lot of transactions in parallel. In the case of materialization, the concurrent aggregate updates in particular can lead to contention, because materialized aggregates have to be locked for updating. In the case of, for example, heavily used G/L accounts, the database system has to handle the otherwise parallel postings sequentially if they access the same G/L account in order to consistently update the totals for this account. This unfortunate situation can no longer occur in SAP Simple Finance, because all transactions simply insert tuples into the database, which does not require locks.
When considering the overall system architecture, the common claim that column-oriented in-memory databases handle modifying operations more slowly than read access no longer holds true. The measurements for SAP Simple Finance show that even the speed of modifying transactions is on par with, or better than, that of SAP ERP Financials running on a traditional database with row-based storage.
The database footprint is another area of improvement. Again focusing on the Financial component, the removal of redundancy in SAP Simple Finance alone has the potential to drastically reduce the database footprint. Memory previously occupied by redundant data, such as materialized views and materialized aggregates, is no longer needed. The database footprint of SAP SE's own internal system for Financials (the main productive SAP system within SAP) has been reduced by a factor of almost three; additional savings are possible by applying concepts such as data aging (see Chapter 7) so that, in total, a reduction by a factor of 14 is feasible. Calculations based on SAP customer data show equally impressive numbers -- for example, a reduction by a factor of 6.5.
Such a reduction in database footprint immediately translates to hardware savings and a lower TCO. In addition, SAP HANA also enables you to remove your separate data warehouse thanks to integrated OLAP capabilities. By doing so, you gain an additional factor of two across the whole footprint. The numbers shown in both of these cases are for a system already running on SAP HANA as the database. Factoring in additional savings in the database footprint enabled by SAP HANA's storage architecture and compression, the potential savings are even more impressive.
In summary, SAP Simple Finance demonstrates that in-memory technology makes it entirely feasible to remove redundancy from the Financials data model. Redundancy is no longer necessary for reporting or analysis purposes; these tasks can be run at high speed as algorithms directly on the line items, instead of relying on materialized aggregates prebuilt into the data model. As a consequence, users are no longer restricted to only those analytics questions that have been hardwired into the system. Instead, they are free to analyze data flexibly as needed with the performance they expect.
In this chapter, we highlighted the benefits you gain from the redundancy-free data model of SAP Simple Finance. These benefits include both TCO reduction (increased throughput and lower database footprint) and entirely new levels of flexibility. Overall, the removal of redundancy in SAP Simple Finance yields a significant simplification of system and processes. We demonstrated the feasibility of these fundamental changes in terms of performance and highlighted some of the key improvements as seen in real-world environments.
We'll see these topics elsewhere in this book. The next chapter outlines the nondisruptive nature of the innovations in SAP Simple Finance, which allow you to switch seamlessly and benefit immediately from the advantages outlined here and throughout the book. Chapter 5 explores the ensuing possibilities in terms of flexibility based on the purely line item-based data model. In this chapter, we focused on the removal of redundancy in the form of materialized views and materialized aggregates. The removal of materialized result sets is explained in Chapter 6 as one of the benefits of a real-time finance system that doesn't require batch jobs. Chapter 9 outlines how further duplicate data stores are unified with the Universal Journal, which is the next big step toward a redundancy-free system that eliminates the need for reconciliation.
© 2015 by Rheinwerk Publishing. SAP Simple Finance: An Introduction / Jens Krüger. ISBN: 978-1-4932-1215-6
About the author:
Jens Krüger, Ph.D., heads the LoB Finance and SAP Innovation Center unit in the board area of Products & Innovation, where he reports to Bernd Leukert, a member of the SAP Executive Board.