This article discusses the process of extracting, transporting, transforming, and loading data in a data-warehousing environment.
ETL is one of the main processes in data warehouse management. ETL stands for Extract-Transform-Load and describes how data is loaded from source systems into the data warehouse.
For a data warehouse to support business analysis, it needs to be loaded regularly. To do this, data from one or more sources is extracted and copied into the data warehouse. One of the main challenges in data warehousing is bringing together large volumes of data to provide a new, unified information base for business intelligence. It is important to note that ETL refers to a broad process rather than three clear-cut steps.
What happens during the ETL process?
The following tasks are the main actions in the process.
Extraction of Data
During this process, the required data is identified and extracted from different sources. Quite often, it is difficult or even impossible to define the specific subset of interest in advance, so more data than required has to be extracted. The relevant data is identified later.
Another important part of the extraction process is data validation, which confirms that the extracted data has correct values within a given domain. If validation fails, the data is rejected entirely or partially.
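The validation step described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the field names, the allowed country codes, and the `validate` helper are all hypothetical. The sketch rejects a whole row when any field is outside its domain; rejecting only the offending field would be the "partial" alternative.

```python
# Hypothetical domain rules: each field maps to a predicate its value must satisfy.
RULES = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "country": lambda v: v in {"US", "DE", "FR"},
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}


def validate(rows):
    """Split rows into accepted rows and (row, failed_fields) rejections."""
    accepted, rejected = [], []
    for row in rows:
        # A field fails if it is missing or its value is outside the domain.
        failed = [name for name, ok in RULES.items()
                  if name not in row or not ok(row[name])]
        if failed:
            rejected.append((row, failed))
        else:
            accepted.append(row)
    return accepted, rejected


rows = [
    {"customer_id": 1, "country": "US", "amount": 19.99},
    {"customer_id": -5, "country": "US", "amount": 5.0},
    {"customer_id": 2, "country": "XX", "amount": -1},
]
accepted, rejected = validate(rows)
```

Keeping the list of failed fields alongside each rejected row makes it possible to report why the data was rejected or to repair it later.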
Depending on the source type, various transformations may occur during the extraction process. The size of the extracted data may vary from a few kilobytes to many gigabytes, depending on the business needs. In general, the goal of the extraction phase is to convert the data into a single format appropriate for transformation processing.
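As a rough sketch of converting extracted data into a single format, the example below reads one CSV source and one JSON source and normalizes both into the same record shape. The sample data, the field names, and the `normalize` helper are assumptions made for illustration only.

```python
import csv
import io
import json

# Hypothetical extracts from two different source systems.
CSV_SOURCE = "id,name,amount\n1,Alice,10.50\n2,Bob,3\n"
JSON_SOURCE = '[{"id": 3, "name": "Carol", "amount": 7.25}]'


def normalize(record):
    """Coerce a raw record into the common staging format."""
    return {
        "id": int(record["id"]),
        "name": str(record["name"]),
        "amount": float(record["amount"]),
    }


def extract_csv(text):
    # csv.DictReader yields every value as a string, so types are coerced here.
    return [normalize(r) for r in csv.DictReader(io.StringIO(text))]


def extract_json(text):
    return [normalize(r) for r in json.loads(text)]


# Both sources end up as a list of dicts with identical keys and types.
staged = extract_csv(CSV_SOURCE) + extract_json(JSON_SOURCE)
```

Once every source is reduced to the same record shape, the transformation phase can be written once instead of once per source.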
Transformation of Data
After the extraction process, data needs to be physically transported to the target system for further processing and analysis. Depending on the method of transport, some transformations may take place as well. The transformation phase applies a number of rules to convert the data from the source format to the target format. Data that does not require transformation may be transported directly to the target.
Cleansing of data is an essential part of the transformation process. It ensures that only valid data passes to the target.
The following transformation types may be required:
· Selecting only certain columns to load
· Converting coded values
· Encoding values
· Deriving a new calculated value
· Joining data from multiple sources
· Splitting a column into multiple columns
· Applying validation rules
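Several of the transformation types listed above can be combined in one small function. The following is only a sketch under stated assumptions: the `customers` and `orders` records, the `GENDER_CODES` mapping, and the `transform` function are hypothetical and stand in for whatever the real sources provide.

```python
# Hypothetical lookup table for converting coded values.
GENDER_CODES = {"M": "Male", "F": "Female"}

customers = [
    {"customer_id": 1, "full_name": "Ada Lovelace", "gender": "F",
     "internal_note": "not needed in the warehouse"},
]
orders = [
    {"customer_id": 1, "quantity": 3, "unit_price": 2.5},
    {"customer_id": 99, "quantity": 1, "unit_price": 4.0},  # unknown customer
]


def transform(customers, orders):
    """Join orders to customers and shape the rows for the target."""
    by_id = {c["customer_id"]: c for c in customers}
    result = []
    for order in orders:
        customer = by_id.get(order["customer_id"])  # joining two sources
        if customer is None:
            continue  # validation rule: drop orders without a known customer
        # Splitting one column into two.
        first, _, last = customer["full_name"].partition(" ")
        result.append({
            # Only the columns below are selected; internal_note is left out.
            "customer_id": customer["customer_id"],
            "first_name": first,
            "last_name": last,
            # Converting a coded value into a readable one.
            "gender": GENDER_CODES.get(customer["gender"], "Unknown"),
            # Deriving a new calculated value.
            "total": order["quantity"] * order["unit_price"],
        })
    return result


transformed = transform(customers, orders)
```

In practice these rules usually live in an ETL tool or in SQL rather than in hand-written loops, but the operations are the same.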
Loading of Data
At this phase, data is loaded into the target. This process may vary depending on the business requirements. Some data may be overwritten with new data sets, and the load may be scheduled on a daily, weekly, or monthly basis. Other load scenarios assume that data is added in historical form at regular intervals, for instance hourly.
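The two load scenarios, overwriting and appending in historical form, can be contrasted with a short sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for a real warehouse; the table names, column names, and the `load_overwrite` and `load_append` helpers are hypothetical.

```python
import sqlite3


def load_overwrite(conn, rows):
    """Full refresh: replace the table contents with the new data set."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM sales_current")
        conn.executemany(
            "INSERT INTO sales_current (product, amount) VALUES (?, ?)", rows)


def load_append(conn, rows, loaded_at):
    """Historical load: keep earlier rows and add the new ones with a timestamp."""
    with conn:
        conn.executemany(
            "INSERT INTO sales_history (product, amount, loaded_at) VALUES (?, ?, ?)",
            [(product, amount, loaded_at) for product, amount in rows])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_current (product TEXT, amount REAL)")
conn.execute("CREATE TABLE sales_history (product TEXT, amount REAL, loaded_at TEXT)")

# Two consecutive loads of the same product with a changed amount.
load_overwrite(conn, [("widget", 10.0)])
load_overwrite(conn, [("widget", 12.0)])
load_append(conn, [("widget", 10.0)], "2024-01-01T00:00")
load_append(conn, [("widget", 12.0)], "2024-01-01T01:00")

current_count = conn.execute("SELECT COUNT(*) FROM sales_current").fetchone()[0]
history_count = conn.execute("SELECT COUNT(*) FROM sales_history").fetchone()[0]
```

After both loads, the overwritten table holds only the latest row, while the history table keeps one row per load.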
What Is the Difference between ETL and Data Integration?
As described above, ETL is a data processing activity that manipulates data, usually on the fly, while extracting it from sources, transforming it, and storing it in targets. Data integration is a general term for any effort to combine data from multiple sources into a unified view. It usually involves several operations, such as ETL, automation, monitoring, and change management.
ETL is a form of data integration in which data is transformed while it is transported between sources and targets.
Why use a cloud-based service for data integration?