There are two approaches for loading and\or refreshing data from a source system to a data warehouse. The first approach is to load the full set of data every time, which is not efficient, especially for massive volumes of data. The second approach is to load only the changed data set as incremental data load. In order to do this, you need to have a mechanism for identifying the changed data set in the source table after the last data extraction so that only those changed data sets will be considered when extracting from the source table to load into the data warehouse. This method is known as change data capture (CDC).
CDC is a set of software design patterns used to determine and track data that has changed so that action can be taken using the changed data. It is a key data integration approach based on the identification, capture and delivery of the changes made to enterprise data sources. CDC solutions are a mainstay of data warehouse environments since capturing and preserving the state of data over time is a core data warehouse function. CDC can also be utilized in databases and other data repository systems.
Historically, data warehouses were built to be updated on a yearly, monthly, or weekly basis. In today’s business environment, enterprises can no longer wait for data to make its way to back-end data stores, they need to capture it directly from core transactional systems as it is being gathered. The data must be transmitted and processed at the time of collection, and integrated with historical information within business intelligence tools and applications. Consequently, the data warehouse has evolved from being a departmental reporting solution to a central repository of information that is key to active decision making as well as improved partner relationships. Integral to this change is the evolution of the data integration technology that incorporates both batch and real-time modes.
CDC reduces the time and resource costs of data warehousing programs. CDC ensures data synchronicity and facilitates real-time data integration across the entire enterprise. By detecting changed records in data sources and propagating those changes to the data warehouse, change data capture reduces the need for bulk load updating of the warehouse. CDC also enables real-time enterprise data integration processes, such that when a new entry is made to the database it is immediately propagated to related applications such as shipping, invoicing, etc.
CDC records INSERTs, UPDATEs, and DELETEs applied to SQL Server tables, and makes a record available of what changed, where, and when, in simple relational tables. These change tables contain columns that reflect the column structure of the source table, along with the metadata needed to understand the changes that have been made. When CDC features are applied on a database table, a mirror of the tracked table is created with the same column structure of the original table, but with additional columns that include the metadata used to summarize the nature of the change in the database table row. The SQL Server database can then easily monitor the activity for the logged table using these new audit tables.
Extract, transform, and load (ETL) processes are used to incrementally load changed data from the SQL Server source tables to the data warehouse or data mart. Because CDC captures changes made at a data source and applies them throughout the enterprise, it minimizes the resources required for ETL processes because it only deals with data changes.
Business today demands up-to-the-minute data. Latency requirements are shrinking at the same time as data volumes are increasing. Moving entire data sets via a batch-oriented integration method is no longer viable—your business has more data than ever and you need to be able to share that data with vendors, partners, and other internal teams and applications faster and more efficiently. Today’s successful businesses require a smarter approach to data integration.
Real-time change data capture is becoming the integration method of choice for businesses where multiple databases and applications must be kept in sync. CDC increases operational efficiencies, saving time and resources by eliminating redundant data transfer and reducing network bandwidth.
Centerprise’s CDC technology is a smarter way to ensure peak performance of your mission critical applications and database, so your information is as up to the minute as you need it to be.
Centerprise Data Integrator supports a variety of change data capture strategies, including both batch and real-time, enabling you to select the appropriate strategy for your environment and requirements.
Astera brings powerful data management and application integration solutions within reach of any organization. Astera's open source solutions for developing and deploying data management services like ETL, data profiling, data governance, and MDM are affordable, easy to use, and proven in demanding production environments around the world. For organizations looking to jump-start a big data initiative, Astera provides applications that accelerate data loading and other aspects of Hadoop setup by enabling developers and analysts to leverage powerful Hadoop technologies like Hadoop Hive, Pig, and Sqoop without having to write Hadoop code. Astera's ESB and data services infrastructure solutions extend proven Microsoft technologies like WCF and MSMQ to deliver affordable, flexible service enablement of distributed applications. To help enterprises improve operational performance, Astera also offers packaged solutions that support business process modeling and simulation as well as rapid development, testing, and deployment of process-oriented applications..NET, SQL Server and all Microsoft-related trademarks are the property of the Microsoft, and are used with permission.