Home | Articles
Use of Operational Meta Data in the Data Warehouse

by Mike Jennings

White Paper : Data Warehouse Appliances
A data warehouse appliance uses commodity processors and hardware optimized for BI and combines them with software built specifically for data warehouse-based analytics in a single box. This paper examines the value of Data Warehouse Appliances. Download

Most individuals with some level of involvement with data warehouses, from a technical or business perspective know of the term “meta data”. When asked to define “what is meta data” most of these individuals can reiterate the common definition “Data about data”. Still some have an even deeper understanding on this subject and the various categories of meta data. These categories often break into two areas, technical and business. Technical meta data is most often used by the technical analysts for development and maintenance of the data warehouse. Examples such as:

  • Physical table and column names
  • Data mapping and transformation logic
  • Source systems
  • Foreign keys and indexes
  • Security
  • ETL process names
  • etc.

Followed by business meta data, typically used by the end users of the data warehouse for access and navigation. Examples such as:

  • Business terms and definitions for tables and columns
  • Subject area names
  • Query and report definitions
  • Report mappings
  • Data Stewards
  • etc.

Many data warehouse projects employ the use of a meta data repository to capture, store, and provide access to this information for both business and technical user requirements. The repository serves as the information card catalog for the warehouse environment providing a table of contents, or index, to the real data stored in data warehouse. While meta data repository products exist in the market today, many of these tools do not interface easily with other products of the warehouse architecture (e.g., ETL, OLAP). Standardization efforts currently in progress, such as Common Warehouse Metamodel (CWM), should bring a greater level of integration bring all products.

Operational Meta Data

Even with a standardized meta data repository, a data warehouse architect can still add a further level of detail into their warehouse data model and ETL process designs through the incorporation of operational meta data. This technique can be used as an extension of the design and architecture of the data warehouse that provides processing optimizations in data acquisition design, maintenance activities and end user reconciliation and auditing of information.

Operational meta data provides an additional bridge between the meta data repository and the data warehouse through the addition of physical columns into the design for ease of use by both business and technical users.

  • Operational meta data, unlike information stored in the meta data repository, is referenced at a row level of granularity in the data warehouse.
  • Operational meta data provides a detailed row level explanation of actual information content in the data warehouse.

This direct association to each row of information in the warehouse is operational meta data’s key design distinction.

Incorporation of Operational Meta Data

There are varieties of operational meta data columns that can utilized in the design of the data warehouse to enhance automated support. Depending on the business requirements of the warehouse project, inclusion of certain columns may make more or less sense to use. Incorporation of these fields would normally occur during the transformation of the business logical data model into the dimensional or data warehouse data model. Some of the more prevalent columns used are listed below.
  • Load Cycle Identifier
  • Current Flag Indicator
  • Load Date
  • Update Date
  • Operational System(s) Identifier
  • Active in Operational System Flag
  • Confidence Level Indicator

Load Cycle Identifier

One of the operational meta data columns a data warehouse architect can incorporate is the load cycle identifier. This column is a sequential identifier assigned during each load cycle to the data warehouse regardless of the refresh frequency. As a sequential identifier, it can be used to remove data from a particular load cycle run if data corruption or other data quality issues are discovered. The load cycle identifier is typically used in conjunction with a lookup or meta data repository table that’s describes other operational statistics about the load cycle.


Current Flag Indicator

The current flag indicator column is used to identify the latest version of a row in a table. It facilitates quick identification of the latest version of a row as compared to performing date comparisons. This flag is especially useful for managing the processing of slowly changing dimensions (SCD), type 2, where history of a production record is maintained.


Load Date

The most commonly used and understood operational meta data field in data warehouse designs is the load date column. It denotes when, date and/or time, a row of information was loaded into the data warehouse or in some business cases when the data was extracted from the operational source system. This snapshot date is used to maintain temporal integrity of the data in the warehouse as new information is added during each refresh cycle. The column can be referenced by warehouse administrators to identify candidate rows for archival or purge processing. End users can also use this column to reconcile and audit information in the data warehouse with the operational source systems.

Update Date

Another common operational meta data column is the update date. This column indicates when a row was lasted updated in the warehouse during a refresh cycle. This column, like load date, is used to maintain the temporal integrity of information in the data warehouse. It is routinely used in dimension table designs that implement SCD type 1, 2 or 3 processing methods to identify when the row was refreshed. The column, like load date, can be utilized in administration activities such as archival/purge processing or reconciliation/audit by end users.

Operational System(s) Identifier

One of the most useful operational meta data fields for both the warehouse administrator and the end user is the operational system(s) identifier. This column is used to track the origination source or sources of a data row in the data warehouse.

Example, in cases where a row of data was integrated from more than one operational source system, a column value denoting the combination of these systems can be assigned (see table example below). It can be used by end users that are questioning the quality and/or validity of data in the warehouse to trace back information to the operational source system that furnished the information.

In certain cases, this column can be used by administrators to identify and remove corrupt data loaded from a particular operational source system(s).

Active Operational System Flag

This flag is used to indicate whether the production keys in a dimension table are still active in the originating operational system. The active operational system flag provides intriguing analysis alternatives to queries posed to the data warehouse. This column can be used effectively in a variety of analysis activities to identify dormant data or data that should be constrained in reporting (e.g., out of bounds, products no longer supported, previous customers).

Confidence Level Indicator

One of the more controversial operational meta data fields is the confidence level indicator. This column is used to indicate how business rules or assumptions were applied during the ETL processes for a particular row of data. This field provides a means of measure to the end user as to the credibility level of a data row based on the transformation processing performed.

Often used to identify potential problems with data quality from operational source systems and facilitate in correcting these issues, each warehouse organization and/or project will vary in how the business requires ranking of this particular operational meta data field. The following table illustrates one approach to ranking.

Operational Meta Data Column Assignment

The importance of operational meta data, now being realized, the architect has gone forward incorporating these columns into the design of the data warehouse data model and the ETL processes. The value assigned to these columns is accomplished through various means. Some examples of the methods used for assignment of the various operational meta data columns include:

Load Cycle Identifier: The value for this column is assigned through insertion of a new row in a meta data table or lookup table added to the data model to capture operational statistics. The primary key on the table consists of a serial or sequence data type. The value is incremented during each refresh cycle to the data warehouse regardless of the completion status.

Current Flag Indicator: The value of this column is assigned through a comparison of data currently in the data warehouse versus data in a new load cycle. The two data files are sorted and then consecutively read to determine the current record. The current record has a “Y” value assigned to it while historical records are assigned an “N”.

Load Date: The value of this column is determined at the beginning of the load cycle. If a Load Cycle table is available in the meta data repository or has been added to the model, the value for this column can be extracted for this source. In some cases the value assigned may be the extraction date from the operational system versus the actual date the information was loaded into the data warehouse.

Update Date: The value of this column is determined, like load date, at the beginning of the load cycle but is applied only to historical rows.

Operational System Identifier: The value of this column is assigned based on which operational system provided the information to the data warehouse. If two or more sources are integrated to construct a single row in a data warehouse table, the value assigned should represent the multiple operational systems.

Active Operational System Flag: The value of this column is determined by comparing the natural or production keys of the operational system to data rows in the data warehouse. A value of “Y” is assigned to a record if it currently exists in the operational system.

Confidence Level Indicator: The value of this column is assigned during ETL processing based on conformance to the business rules associated with the data warehouse target table. If business rule validation indicates that transformation occurred correctly with no unknown or missing information the confidence level can be set high. If derived or estimated data is inserted during ETL processing, the confidence level may need to be set lower.

Operational Use Strategies

There is a variety of processing optimizations in data acquisition design, maintenance activities and end user reconciliation and auditing of information that can now be performed against the data warehouse through the incorporation of operational meta data into the architecture. Some examples of the tasks that can now be performed include:
  •  Extraction of current data
  •  Identification of rows for archive/purge processing
  •  Load cycle rollback
  •  Processing of slowly changing dimensions type 2 records

These are not the only tasks that can take advantage of the incorporation of operational meta data columns but demonstrates an initial starting set.

Extraction of Current Data

The typical data warehouse data model requires the use of the load date column in order to maintain a unique perspective of history for a row of data in a table. While referential integrity is maintained through use of this column, it causes a high degree of processing overhead when a user is attempting to determine which row in a dimension table is the most current. The RDBMS must determine, by sequentially reading each row of data in the table, which row has the most current load date.

An alternative to this brute force method is to utilize the current flag indicator column. Through ETL processing and comparison of natural or production keys, this column facilitates quick identification, for the business users, as to which row of information from the operational systems is the most current. The last row added to a data warehouse table for a particular production key is given an assignment of “Y” for the current indicator flag, while historical records are set to “N” or null. The business user can use this column in their queries as a constraint to retrieve only the most up to date information. For certain reporting requirements, RDBMS views can be established that are constructed constraining on the current indicator column, value of “Y”, automatically to avoid errors in queries by users.



Load Cycle Rollback

A monthly load cycle occurs against the data warehouse. During ETL processing of the load cycle, an error is detected in the database or, data from an operational source system is suspect of being corrupt or, some other data quality measurement issue is discovered in the data. Before the incorporation of operational meta data columns, the data warehouse administrator would have only limited methods of isolating and removing this corrupt or suspect information. Now the administrator can be more selective in their methods of removing the erroneous data from the database.

The first rollback method is to utilize the load cycle indicator to completely remove the last refresh cycle to the data warehouse. This can simply be accomplishing by constraining on the value of the load cycle in question and removing the affected rows in each table from the data warehouse. This same method can be further tailored, in certain circumstances, to remove rows from a particular load cycle for a specific operational system by further constraining on the operational system indicator in question.

For either method used, the rollback process will need to accommodate the re-assignment of the current indicator, once the suspect rows are removed from the data warehouse, for rows that were flagged with a “Y” prior to the last load cycle.

Archiving and Purging

The data warehouse has now been up and running with monthly load cycles being processed into successfully for some period of time. Through your business users or database monitoring tool you become aware of data that has become dormant over time or inactive for some other reason. Since this data is not currently being utilized on any type of regular basis, a decision is made to either archive it off to a near-line storage device or to purge it from the data warehouse for size and performance reasons.

The incorporation of operational meta data columns again provides the data warehouse administrator with a variety of options on how to isolate and flag candidate rows for archiving or purging. First, the load date can be interrogated to isolate rows of data from a particular period. This method again requires additional overhead of the RDBMS to analysis the load date. The second option is to constraint on rows of data from a particular set of load cycles. This method provides a more efficient means of flagging candidate rows and avoids tokenizing of a date column to identify a particular period.

Slowing Changing Dimensions (SCD) Type 2

Data warehouses or marts that make use of some variant of a star schema data model design can utilize the current flag indicator to aid in SCD type 2 processing for dimension table loads.

The established method for processing of SCD type 2 records is to sort both sources of information, operational system versus data warehouse, based on the production or natural key of the dimension table. The current flag indicator can be used to constraint which records from the dimension table will be compared to the current load cycle data being processed to the warehouse. Only dimension records containing a “Y” in the current indicator column are required to be compared to the load cycle data since they contain the most updated information for the specific production key(s).

A consecutive comparison is made between the two data files based on these key column(s). Current load cycle records with production key values not found in the dimension table are loaded as new rows. These rows receive new surrogate keys for the dimension and have their current indicator flag set to “Y”. The current flag setting is due to the fact that no other occurrence of the production key(s) combination exists in the dimension table to date.

Production key(s) found in both the load cycle data and the dimension table are further interrogated based on columns deemed relevant to track changes on by the business users. The two sources are sorted on by the key and relevant columns. A consecutive comparison is made against these files. Current load cycle records that have relevant columns that do not match their corresponding dimension table rows are flagged as new rows. These rows are inserted with new surrogate keys and have the current flag indicator set to “Y” due to an update made to a relevant column since the last load cycle.

Previously loaded dimension rows that have a matching production key(s) when compared to the current load cycle file but have differences in relevant columns are assigned a current indicator of “N”. These rows receive this setting due to the fact that new information on relevant columns of the dimension have been received in the most recent load cycle.

This same process of constraining on the current flag indicator and performing comparison on production keys between the dimension table and load cycle is repeated during each update process to the data warehouse.

Unique Questions Answered by Operational Meta Data

The incorporation of operational meta data columns into the data warehouse architecture allows administrative and data quality measurement questions to be easily answered. Some examples of the questions that can now be answered are:

  • What operational system(s) provided this particular row of information to the data warehouse? Purpose: Data quality measurement and publishing of ETL load cycle results.
  • When was this particular row of information last refreshed by the operational system? Purpose: Data quality measurement.
  • How many rows of information will be affected by an archive or purge of data? Purpose: Administrative and maintenance.
  • What is the growth rate for a table per load cycle? Purpose: Administrative, maintenance and publishing of ETL load cycle results.
  • What effective growth rate is being experienced from a particular operational source system on the data warehouse? Purpose: Administrative, maintenance and publishing of ETL results.
  • What is the relative confidence level of the data currently loaded into the data warehouse? Purpose: Data quality measurement.
  • What percentage effect on confidence level would be measured if dirty data from an operational system were corrected at the source? Purpose: Data quality measurement.

Summary

Operational meta data allows for detailed rows level explanation of actual information content in the data warehouse. The specific addition of these types of data columns into the dimensional data model and ETL processes can help business users with reconciliation and auditing of data quality issues. Additionally, data warehouse administrator’s benefit through the incorporation of operational meta data by gaining increased options and flexibility in accomplishing administrative tasks.

About the Author

Resource Center
Business Intelligence
Oracle #1 in Business Analytics According to IDC Research
BI Search
The Business Intelligence Search Engine has all the answers.
Business Intelligence Search Engine
Find all you need on The Business Intelligence Search Engine.
Add a Link Add a Link

  




Designing and Implementing Business Intelligence Solutions Using Microsoft SQL Server. Click for details.
Business Intelligence Solution Finder

What do you need?

Location of solution provider

What type of solution are you interested in?

Are you interested in a specific solution?                      


All product names are trademarks of their respective companies.
Copyright © ITNetwork365 - All Rights Reserved