
Today's data transformation tools only scratching the surface

Modern tools let users interact with and visualize data in new ways, but it's only the beginning of what's to come, writes Ethan Jewett.

Data transformation and preparation, data visualization, and business intelligence (BI) software are going through something of a sea change at the moment, even if it sometimes seems nothing much has changed in the last 15 years.

The change overtaking the industry appears to be in its early days, but it is driven by persistent problems with IT agility, data quality and a lack of transparency in the systems that manage and display data. We are clearly moving toward faster and more visual interaction with data, but at the moment we are only scratching the surface of understanding and interacting with data, of the implications for business, and of the resulting changes in how we structure our information systems. I'll outline what has been changing, what we can expect to mature in the next couple of years, and where we might be going next.

The old standbys

In current standard software products, data transformation operations like combining, filtering and fixing data are strictly separate from data visualization and analysis functions. Transforming or changing data is a task usually reserved for technical people and accomplished with process-oriented tools like SAP's Data Services and Business Warehouse (BW) or with standard programming languages like Java or Python.
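To make the process-oriented style concrete, here is a minimal sketch in plain Python of the kind of fixed pipeline such tools express: a lookup join, a filter and a data fix, all defined ahead of time by a technical user. The table names and records are hypothetical, chosen only for illustration.

```python
# Hypothetical records standing in for two source tables.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 250.0},
    {"order_id": 2, "customer_id": 11, "amount": -1.0},   # bad value to fix
    {"order_id": 3, "customer_id": 99, "amount": 90.0},   # no matching customer
]
customers = {10: "Acme Corp", 11: "Globex"}

def transform(orders, customers):
    """Join, fix and filter: one fixed pipeline, defined before any data is seen."""
    result = []
    for row in orders:
        name = customers.get(row["customer_id"])  # lookup step
        if name is None:
            continue                              # filter: drop unmatched rows
        amount = max(row["amount"], 0.0)          # fix: clamp invalid amounts
        result.append({"order": row["order_id"],
                       "customer": name,
                       "amount": amount})
    return result
```

Note what Figure 1 points out: while this pipeline runs, the author of the flow sees the steps, not the data passing through them.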

Data transformation
Figure 1: A data transformation process as displayed visually in most process-based tools. The user can't see either the data or the result of the join and lookup transformation on the data; only the flow of the process is visible.

The output of transformation tools (usually fairly static database tables) is the input for separate data analysis and visualization. Most tools, like SAP's Crystal Reports, allow users to run prepared queries to illustrate a single aggregated slice of the database. More advanced data analysis tools allow the user to navigate with some flexibility within the bounds of the pre-existing data set. Usually these more flexible tools appear as analytic tools (SAP's Analysis for Office or Design Studio dashboards), though there is no reason why these types of flexible but constrained analyses might not be useful within business process contexts.
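A "prepared query" of the kind described above can be sketched as a fixed aggregation over a static table: the user can run it, but cannot reshape the data behind it. The sales figures here are hypothetical.

```python
# A static output table from the transformation stage (hypothetical data).
sales = [
    {"region": "EMEA", "year": 2014, "revenue": 120.0},
    {"region": "EMEA", "year": 2014, "revenue": 80.0},
    {"region": "APJ",  "year": 2014, "revenue": 60.0},
]

def revenue_by_region(rows):
    """A prepared query: one fixed, aggregated slice of the data set."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["revenue"]
    return totals
```

More flexible analytic tools let the user swap the grouping dimension or add filters, but the underlying rows stay read-only either way.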

Flexible data analysis
Figure 2: A flexible data analysis allows some control over the view of the data displayed in a visualization, but that control stops at the visualization. The user has no control over the data itself.

Some existing tools, usually billed as "self-service BI" or "data exploration," incorporate basic data preparation capabilities, usually using a process- or programming-based view of the data preparation stage. Tableau Software and QlikView were two pioneers of this approach, providing fairly advanced data visualization capabilities on a platform where the user was responsible for all data loading and preparation tasks. SAP's Lumira product follows in these footsteps, providing users a way to load new data, connect to existing data sets or join some combination of data sets, then visualize the data.

However, the strict separation of visualization and analysis from data transformation is a nagging weakness of all these tools. When do people realize there is a problem with data that needs to be resolved? When they are visualizing it or running analytic functions on it. So why not allow the user to fix the problem then and there?

What's next?

A different approach to data transformation is emerging as a popular alternative. It dispenses with the process-oriented view in favor of one more closely aligned with the internal structure of the data being processed: display even very large data sets as spreadsheets, and give the user data transformation options mapped onto the spreadsheet paradigm. This is not a new approach, but the cohort of tools developed around 2010 to 2012 (OpenRefine, Data Wrangler, IBM's BigSheets) was the first of its type to gain widespread adoption.

The idea is that the spreadsheet or table is a pretty direct visual representation of the raw structure of many standard data formats. Showing a database table in a tabular format makes its structure and a small amount of the data in the table explicit. Given the proper tools, that structure and data can be manipulated in a way that is immediately visible in the spreadsheet view, and which can be mapped back on to the original data set.
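One way to picture the spreadsheet paradigm is as a list of edits recorded against named columns rather than as steps in an opaque flow: each edit is immediately visible in the rows on screen and can be replayed against the full underlying data set. This sketch uses hypothetical columns and cleanup rules purely for illustration.

```python
# A small table shown to the user as a spreadsheet (hypothetical data).
table = [
    ["name", "price"],
    ["  Widget", "9.99"],
    ["Gadget  ", "N/A"],
]

# Edits are recorded per column, so their effect is visible in the
# spreadsheet view and can be mapped back onto the original data set.
edits = [
    ("name",  lambda v: v.strip()),                       # trim whitespace
    ("price", lambda v: float(v) if v != "N/A" else None) # parse or blank out
]

def apply_edits(table, edits):
    """Apply recorded column edits to a header-plus-rows table in place."""
    header, *rows = table
    for column, fn in edits:
        i = header.index(column)
        for row in rows:
            row[i] = fn(row[i])
    return [header] + rows
```

Because the edits are data about the transformation rather than a hard-coded process, the same list can be previewed on a sample and then applied to the complete data set.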

Flexible, user-driven, visual transformation of data
Figure 4: The ideal tool will allow flexible, user-driven visualization of data and simultaneously allow changes to the underlying data through manipulation of the visual elements.

Spreadsheet-driven data transformation appears to have legs, with good uptake in the form of OpenRefine and significant attention around upcoming products like Trifacta and Spark Cloud, the latter of which uses related concepts around tabular representations of data. This approach begins to address the severe lack of analytic and visualization tools integrated into the data transformation process, giving the people processing data the means to assess and understand data as they change it. However, deep analytics and specialized visualization tools remain separate.

The future

The current trend is to make data transformation a more visual experience, making the results of data transformations on the data set itself more explicit and immediate. But, the job of extracting meaning from data is still left to more specialized interfaces, usually operating on aggregated slices of the full data set, and often featuring visual abstractions like charts and graphs.

However, there is a tension implicit in this arrangement: as stated already, understanding data and extracting meaning from it is an integral part of transforming it. One can't really know how to transform a data set without understanding it, and it is usually in the process of extracting meaning from data that we find problems that need to be fixed, or realize that the data is incomplete for our purposes and must be augmented with another data set. In other words, visualization is exactly the point at which we want to be able to change the underlying data, but our tools prohibit us from doing so.

I expect that over the next five to 10 years, we will begin to see this tension addressed in earnest, with more products allowing editing or augmentation of data through the visualization interface.

Data editing in a spreadsheet
Figure 3: Editing data within a spreadsheet, where changes to the data and its structure are directly visible in the data transformation interface. Note that editing and analyzing data are still strictly separate.

This is currently an active area of research, including in the Palladio research project, on which -- full disclosure -- I work as lead developer.

In some sense, products based on the spreadsheet paradigm are among the first mass-market implementations of this approach. Most likely, these products and others like them will continue to improve their visualization capabilities while maintaining the ability to change data through those visualizations. If visualization-focused vendors are paying attention, they will also start to incorporate data manipulation capabilities into their visualization tools. It will be interesting to see who manages to address this gap most quickly and comprehensively.

Next Steps

Read more about SAP Data Services and other data management tools

Learn how SAP helps companies manage big data

Choose between SAP HANA or Business Warehouse 


Join the conversation




Sperry's MAPPER (now called BIS) did this in around 1980.
Yes, and didn't the implementation of Sarbanes-Oxley internal control documentation and testing end up deterring the usage of these tools?