For more than 10 years, ETL software has been used to automatically extract data from heterogeneous sources, transform data and load data into a data warehouse. The return on investment for these tools was based mainly on development productivity gains and the dramatic reduction of maintenance costs. However, companies are realizing that many of today’s data integration projects often require more functionality than traditional ETL products can provide – such as the movement and transport of data (in batch mode or in real time) from heterogeneous sources to multiple targets, the update of a meta data repository, the quality control of data and the capability to integrate different applications in read/write mode (EAI).
To meet these challenges, many data integration vendors are extending or introducing products that address a much broader set of needs than traditional ETL software. However, despite the evolution of data integration software, a significant portion of projects is still completed either without using any type of software (in cases where the scope of the project remains small) or using only low cost data movement tools embedded within an RDBMS.
There are two main reasons why data integration tools are not more widely used: Data integration tools based on a traditional architecture are very expensive, and their proprietary languages present a steep learning curve. For these two reasons, many of the newer data integration players tend to offer products based on a SQL code generation approach. Why and how is code generation valuable?
The Engine-Based Approach
In the past, most data integration/ETL tools were proprietary-engine-based software. Indeed, the SQL languages supported by the RDBMS releases of the 90s were not rich enough to handle the complex transformation rules required by customers needing to perform data integration work or build a data warehouse. The only viable technical solution at that time was to build costly and complex proprietary-engine-based software located between heterogeneous sources and the target servers.
This approach, however, introduced the risk of creating a bottleneck between sources and targets. In these architectures, information is extracted from different sources, transformed in the engine using the proprietary language and then moved or loaded into the target database. This proprietary-engine-based approach typically encounters the following obstacles:
|