Data Preparation is the process of gathering, combining, structuring and organizing data so it can be analysed as part of business intelligence (BI) and business analytics (BA) programs.
The components of data preparation include data discovery, profiling, cleansing, validation and transformation; it often also involves pulling together data from different internal systems and external sources.
Source: http://searchbusinessanalytics.techtarget.com/definition/data-preparation
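To make the components in that definition concrete, here is a minimal sketch in plain Python. The two sources (`crm_rows`, `sales_rows`), their columns, and every function name are hypothetical and purely illustrative. A real Data Preparation tool would do these steps through a visual interface rather than code.

```python
# Hypothetical raw extracts from two sources; all names and values are illustrative.
crm_rows = [
    {"customer_id": "1", "name": " Alice ", "country": "au"},
    {"customer_id": "2", "name": "Bob", "country": "NZ"},
    {"customer_id": "2", "name": "Bob", "country": "NZ"},      # duplicate record
    {"customer_id": "", "name": "Unknown", "country": "AU"},   # missing key
]
sales_rows = [
    {"customer_id": "1", "amount": "120.50"},
    {"customer_id": "1", "amount": "30"},
    {"customer_id": "2", "amount": "not available"},           # invalid amount
]


def profile(rows, column):
    """Profiling: count total, missing and distinct values in one column."""
    values = [row[column].strip() for row in rows]
    return {
        "rows": len(values),
        "missing": values.count(""),
        "distinct": len(set(values) - {""}),
    }


def cleanse_customers(rows):
    """Cleansing: trim text, normalise case, drop duplicates and rows without a key."""
    seen, clean = set(), []
    for row in rows:
        key = row["customer_id"].strip()
        if not key or key in seen:
            continue
        seen.add(key)
        clean.append({
            "customer_id": key,
            "name": row["name"].strip(),
            "country": row["country"].strip().upper(),
        })
    return clean


def validate_sales(rows):
    """Validation: keep rows whose amount parses as a number, set the rest aside."""
    valid, rejected = [], []
    for row in rows:
        try:
            valid.append({"customer_id": row["customer_id"],
                          "amount": float(row["amount"])})
        except ValueError:
            rejected.append(row)
    return valid, rejected


def combine(customers, sales):
    """Transformation: total the sales per customer and attach them to each customer."""
    totals = {}
    for sale in sales:
        totals[sale["customer_id"]] = totals.get(sale["customer_id"], 0.0) + sale["amount"]
    return [dict(c, total_sales=totals.get(c["customer_id"], 0.0)) for c in customers]


customer_profile = profile(crm_rows, "customer_id")
customers = cleanse_customers(crm_rows)
valid_sales, rejected_sales = validate_sales(sales_rows)
prepared = combine(customers, valid_sales)
```

Here `prepared` is the ready-to-use dataset, while `rejected_sales` keeps the rows that failed validation so they can be reviewed instead of being silently dropped.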
But isn’t that ETL? Yes and no – let me explain.
I believe the difference between Data Preparation and ETL lies in who does it and in the tools used in the process.
Data Preparation is like ETL built by business analysts, data analysts or business users, and the tools used to implement it usually do not require any coding skills.
Data Preparation is usually the initial step in analytics, when analysts do not yet know what they are looking for but do know where to start looking. It exists to minimise the time required to bring data together into a meaningful, ready-to-use dataset. Can you imagine the time needed to build or rebuild ETL when user requirements for data change so rapidly?
Data Preparation has been around for many years as a complement to the self-service data visualisation technologies that are now widespread in most organisations.
Personally, I became aware of Data Preparation when I noticed that most of these tools provide a toolkit for business users to access data, cleanse and transform it, and put it together into a data model that feeds the visualisations. At the same time, I identified what I see as a flaw in these self-service BI tools: the data model containing clean and well-shaped data remains within the boundaries of the specific toolset and is not available to any other system.
Why Should You Care?
The next time you design a solution in a self-service data visualisation tool, or if you want to implement a Data Preparation strategy in your organisation, you should consider the following:
- Datasets (or Data Models) should be reusable: Minimise the cost and effort required to transform raw data into ready-to-use datasets. Avoid duplication of work, and allow users and systems to share previously prepared datasets.
- A central repository of datasets: To implement reusability, the prepared datasets or data models should exist in a single central place. This way, they become easier to manage, and everyone knows where to find the data they require for their analytics.
- Use only approved datasets: Data Preparation tools put the power of ETL in users’ hands, but without governance, every user can create their own version of the truth. Make sure only approved datasets are made available to other data consumers in the organisation.
- From Data Preparation to ETL: Data Preparation tools can have limitations when handling large numbers of records or when scheduling their automatic execution. When problems arise, use the Data Preparation steps as technical specifications for the implementation of a more robust ETL process.
- Secure your datasets: Be mindful of the users accessing the datasets and the activities they perform with them, internally and externally.
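As a rough illustration of how the repository, approval and security points fit together, here is a minimal in-memory sketch in Python. The `Dataset` and `DatasetCatalog` classes, their fields, and the example names are all hypothetical. They do not represent any particular product's API, and a real implementation would sit on top of a shared data platform.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A prepared dataset plus the governance metadata kept alongside it."""
    name: str
    owner: str
    rows: list
    approved: bool = False
    allowed_readers: set = field(default_factory=set)


class DatasetCatalog:
    """Hypothetical central repository: one place to publish, approve and read datasets."""

    def __init__(self):
        self._datasets = {}

    def publish(self, dataset):
        # Publishing registers the dataset but does not make it available yet.
        self._datasets[dataset.name] = dataset

    def approve(self, name):
        # In practice this step would be restricted to a data steward.
        self._datasets[name].approved = True

    def list_available(self):
        # Only approved datasets are visible to other data consumers.
        return sorted(n for n, d in self._datasets.items() if d.approved)

    def get(self, name, user):
        dataset = self._datasets[name]
        if not dataset.approved:
            raise PermissionError(f"{name} has not been approved for reuse")
        if user not in dataset.allowed_readers:
            raise PermissionError(f"{user} may not read {name}")
        return dataset.rows


# Illustrative usage with made-up names and values.
catalog = DatasetCatalog()
catalog.publish(Dataset("sales_by_customer", owner="ana",
                        rows=[{"customer_id": "1", "total_sales": 150.5}],
                        allowed_readers={"ana", "ben"}))
catalog.publish(Dataset("draft_forecast", owner="ben", rows=[]))
catalog.approve("sales_by_customer")

available = catalog.list_available()
shared_rows = catalog.get("sales_by_customer", user="ben")
```

Keeping the approval flag and the reader list next to the data means that reuse, governance and security are all enforced at the same single point of access.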
Data Preparation helps individuals and organisations add the necessary agility to the process of transforming data into information. It also allows IT and business users to collaborate in the process: IT by providing a reliable data platform where data is always available when needed, and business users by making sure the prepared datasets implement the business logic required to make the right decisions.