Most individuals with some level of involvement with data
warehouses, from a technical or business perspective know
of the term “meta data”. When asked to define
“what is meta data” most of these individuals
can reiterate the common definition “Data about data”.
Still some have an even deeper understanding on this subject
and the various categories of meta data. These categories
often break into two areas, technical and business. Technical
meta data is most often used by the technical analysts for
development and maintenance of the data warehouse. Examples
such as:
- Physical table and column names
- Data mapping and transformation logic
- Source systems
- Foreign keys and indexes
- Security
- ETL process names
- etc.
Followed by business meta data, typically used by the end
users of the data warehouse for access and navigation. Examples
such as:
- Business terms and definitions for tables and columns
- Subject area names
- Query and report definitions
- Report mappings
- Data Stewards
- etc.
Many data warehouse projects employ the use of a meta data
repository to capture, store, and provide access to this information
for both business and technical user requirements. The repository
serves as the information card catalog for the warehouse environment
providing a table of contents, or index, to the real data
stored in data warehouse. While meta data repository products
exist in the market today, many of these tools do not interface
easily with other products of the warehouse architecture (e.g.,
ETL, OLAP). Standardization efforts currently in progress,
such as Common Warehouse Metamodel (CWM), should bring a greater
level of integration bring all products.
Operational Meta Data
Even with a standardized meta data repository, a data warehouse
architect can still add a further level of detail into their
warehouse data model and ETL process designs through the incorporation
of operational meta data. This technique can be used as an
extension of the design and architecture of the data warehouse
that provides processing optimizations in data acquisition
design, maintenance activities and end user reconciliation
and auditing of information.
Operational meta data provides an additional bridge between
the meta data repository and the data warehouse through the
addition of physical columns into the design for ease of use
by both business and technical users.
- Operational meta data, unlike information stored in the
meta data repository, is referenced at a row level of granularity
in the data warehouse.
- Operational meta data provides a detailed row level explanation
of actual information content in the data warehouse.
This direct association to each row of information in the
warehouse is operational meta data’s key design distinction.
Incorporation of Operational Meta Data
There are varieties of operational meta data columns that
can utilized in the design of the data warehouse to enhance
automated support. Depending on the business requirements
of the warehouse project, inclusion of certain columns may
make more or less sense to use. Incorporation of these fields
would normally occur during the transformation of the business
logical data model into the dimensional or data warehouse
data model. Some of the more prevalent columns used are listed
below.
- Load Cycle Identifier
- Current Flag Indicator
- Load Date
- Update Date
- Operational System(s) Identifier
- Active in Operational System Flag
- Confidence Level Indicator
Load Cycle Identifier
One of the operational meta data columns a data warehouse
architect can incorporate is the load cycle identifier. This
column is a sequential identifier assigned during each load
cycle to the data warehouse regardless of the refresh frequency.
As a sequential identifier, it can be used to remove data
from a particular load cycle run if data corruption or other
data quality issues are discovered. The load cycle identifier
is typically used in conjunction with a lookup or meta data
repository table that’s describes other operational
statistics about the load cycle.
Current Flag Indicator
The current flag indicator column is used to identify the
latest version of a row in a table. It facilitates quick identification
of the latest version of a row as compared to performing date
comparisons. This flag is especially useful for managing the
processing of slowly changing dimensions (SCD), type 2, where
history of a production record is maintained.
Load Date
The most commonly used and understood operational meta data
field in data warehouse designs is the load date column. It
denotes when, date and/or time, a row of information was loaded
into the data warehouse or in some business cases when the
data was extracted from the operational source system. This
snapshot date is used to maintain temporal integrity of the
data in the warehouse as new information is added during each
refresh cycle. The column can be referenced by warehouse administrators
to identify candidate rows for archival or purge processing.
End users can also use this column to reconcile and audit
information in the data warehouse with the operational source
systems.
Update Date
Another common operational meta data column is the update
date. This column indicates when a row was lasted updated
in the warehouse during a refresh cycle. This column, like
load date, is used to maintain the temporal integrity of information
in the data warehouse. It is routinely used in dimension table
designs that implement SCD type 1, 2 or 3 processing methods
to identify when the row was refreshed. The column, like load
date, can be utilized in administration activities such as
archival/purge processing or reconciliation/audit by end users.
Operational System(s) Identifier
One of the most useful operational meta data fields for both
the warehouse administrator and the end user is the operational
system(s) identifier. This column is used to track the origination
source or sources of a data row in the data warehouse.
Example, in cases where a row of data was integrated from
more than one operational source system, a column value denoting
the combination of these systems can be assigned (see table
example below). It can be used by end users that are questioning
the quality and/or validity of data in the warehouse to trace
back information to the operational source system that furnished
the information.
In certain cases, this column can be used by administrators
to identify and remove corrupt data loaded from a particular
operational source system(s).

Active Operational System Flag
This flag is used to indicate whether the production keys
in a dimension table are still active in the originating operational
system. The active operational system flag provides intriguing
analysis alternatives to queries posed to the data warehouse.
This column can be used effectively in a variety of analysis
activities to identify dormant data or data that should be
constrained in reporting (e.g., out of bounds, products no
longer supported, previous customers).
Confidence Level Indicator
One of the more controversial operational meta data fields
is the confidence level indicator. This column is used to
indicate how business rules or assumptions were applied during
the ETL processes for a particular row of data. This field
provides a means of measure to the end user as to the credibility
level of a data row based on the transformation processing
performed.
Often used to identify potential problems with data quality
from operational source systems and facilitate in correcting
these issues, each warehouse organization and/or project will
vary in how the business requires ranking of this particular
operational meta data field. The following table illustrates
one approach to ranking.

Operational
Meta Data Column Assignment
The importance of operational meta data, now being realized,
the architect has gone forward incorporating these columns
into the design of the data warehouse data model and the ETL
processes. The value assigned to these columns is accomplished
through various means. Some examples of the methods used for
assignment of the various operational meta data columns include:
Load Cycle Identifier:
The value for this column is assigned through insertion of
a new row in a meta data table or lookup table added to the
data model to capture operational statistics. The primary
key on the table consists of a serial or sequence data type.
The value is incremented during each refresh cycle
to the data warehouse regardless of the completion status.
Current Flag Indicator: The
value of this column is assigned through a comparison of data
currently in the data warehouse versus data in a new load
cycle. The two data files are sorted and then consecutively
read to determine the current record. The current record has
a “Y” value assigned to it while historical records
are assigned an “N”.
Load Date: The value of this
column is determined at the beginning of the load cycle. If
a Load Cycle table is available in the meta data repository
or has been added to the model, the value for this column
can be extracted for this source. In some cases the value
assigned may be the extraction date from the operational system
versus the actual date the information was loaded into the
data warehouse.
Update Date: The value of
this column is determined, like load date, at the beginning
of the load cycle but is applied only to historical rows.
Operational System Identifier:
The value of this column is assigned based on which operational
system provided the information to the data warehouse. If
two or more sources are integrated to construct a single row
in a data warehouse table, the value assigned should represent
the multiple operational systems.
Active Operational System Flag:
The value of this column is determined by comparing the natural
or production keys of the operational system to data rows
in the data warehouse. A value of “Y” is assigned
to a record if it currently exists in the operational system.
Confidence Level Indicator:
The value of this column is assigned during ETL processing
based on conformance to the business rules associated with
the data warehouse target table. If business rule validation
indicates that transformation occurred correctly with no unknown
or missing information the confidence level can be set high.
If derived or estimated data is inserted during ETL processing,
the confidence level may need to be set lower.
Operational Use Strategies
There is a variety of processing optimizations in data acquisition
design, maintenance activities and end user reconciliation
and auditing of information that can now be performed against
the data warehouse through the incorporation of operational
meta data into the architecture. Some examples of the tasks
that can now be performed include:
- Extraction of current data
- Identification of rows for archive/purge processing
- Load cycle rollback
- Processing of slowly changing dimensions type 2
records
These are not the only tasks that can take advantage of the
incorporation of operational meta data columns but demonstrates
an initial starting set.
Extraction of Current Data
The typical data warehouse data model requires the use of
the load date column in order to maintain a unique perspective
of history for a row of data in a table. While referential
integrity is maintained through use of this column, it causes
a high degree of processing overhead when a user is attempting
to determine which row in a dimension table is the most current.
The RDBMS must determine, by sequentially reading each row
of data in the table, which row has the most current load
date.
An alternative to this brute force method is to utilize the
current flag indicator column. Through ETL processing and
comparison of natural or production keys, this column facilitates
quick identification, for the business users, as to which
row of information from the operational systems is the most
current. The last row added to a data warehouse table for
a particular production key is given an assignment of “Y”
for the current indicator flag, while historical records are
set to “N” or null. The business user can use
this column in their queries as a constraint to retrieve only
the most up to date information. For certain reporting requirements,
RDBMS views can be established that are constructed constraining
on the current indicator column, value of “Y”,
automatically to avoid errors in queries by users.

Load Cycle Rollback
A monthly load cycle occurs against the data warehouse. During
ETL processing of the load cycle, an error is detected in
the database or, data from an operational source system is
suspect of being corrupt or, some other data quality measurement
issue is discovered in the data. Before the incorporation
of operational meta data columns, the data warehouse administrator
would have only limited methods of isolating and removing
this corrupt or suspect information. Now the administrator
can be more selective in their methods of removing the erroneous
data from the database.
The first rollback method is to utilize the load cycle indicator
to completely remove the last refresh cycle to the data warehouse.
This can simply be accomplishing by constraining on the value
of the load cycle in question and removing the affected rows
in each table from the data warehouse. This same method can
be further tailored, in certain circumstances, to remove rows
from a particular load cycle for a specific operational system
by further constraining on the operational system indicator
in question.
For either method used, the rollback process will need to
accommodate the re-assignment of the current indicator, once
the suspect rows are removed from the data warehouse, for
rows that were flagged with a “Y” prior to the
last load cycle.

Archiving and Purging
The data warehouse has now been up and running with monthly
load cycles being processed into successfully for some period
of time. Through your business users or database monitoring
tool you become aware of data that has become dormant over
time or inactive for some other reason. Since this data is
not currently being utilized on any type of regular basis,
a decision is made to either archive it off to a near-line
storage device or to purge it from the data warehouse for
size and performance reasons.
The incorporation of operational meta data columns again
provides the data warehouse administrator with a variety of
options on how to isolate and flag candidate rows for archiving
or purging. First, the load date can be interrogated to isolate
rows of data from a particular period. This method again requires
additional overhead of the RDBMS to analysis the load date.
The second option is to constraint on rows of data from a
particular set of load cycles. This method provides a more
efficient means of flagging candidate rows and avoids tokenizing
of a date column to identify a particular period.

Slowing Changing Dimensions (SCD) Type 2
Data warehouses or marts that make use of some variant of
a star schema data model design can utilize the current flag
indicator to aid in SCD type 2 processing for dimension table
loads.
The established method for processing of SCD type 2 records
is to sort both sources of information, operational system
versus data warehouse, based on the production or natural
key of the dimension table. The current flag indicator can
be used to constraint which records from the dimension table
will be compared to the current load cycle data being processed
to the warehouse. Only dimension records containing a “Y”
in the current indicator column are required to be compared
to the load cycle data since they contain the most updated
information for the specific production key(s).
A consecutive comparison is made between the two data files
based on these key column(s). Current load cycle records with
production key values not found in the dimension table are
loaded as new rows. These rows receive new surrogate keys
for the dimension and have their current indicator flag set
to “Y”. The current flag setting is due to the
fact that no other occurrence of the production key(s) combination
exists in the dimension table to date.

Production key(s) found in both the load cycle
data and the dimension table are further interrogated based
on columns deemed relevant to track changes on by the business
users. The two sources are sorted on by the key and relevant
columns. A consecutive comparison is made against these files.
Current load cycle records that have relevant columns that
do not match their corresponding dimension table rows are
flagged as new rows. These rows are inserted with new surrogate
keys and have the current flag indicator set to “Y”
due to an update made to a relevant column since the last
load cycle.
Previously loaded dimension rows that have a matching production
key(s) when compared to the current load cycle file but have
differences in relevant columns are assigned a current indicator
of “N”. These rows receive this setting due to
the fact that new information on relevant columns of the dimension
have been received in the most recent load cycle.
This same process of constraining on the current flag indicator
and performing comparison on production keys between the dimension
table and load cycle is repeated during each update process
to the data warehouse.
Unique Questions Answered by Operational Meta Data
The incorporation of operational meta data columns into the
data warehouse architecture allows administrative and data
quality measurement questions to be easily answered. Some
examples of the questions that can now be answered are:
- What operational system(s) provided this particular row
of information to the data warehouse? Purpose: Data quality
measurement and publishing of ETL load cycle results.
- When was this particular row of information last refreshed
by the operational system? Purpose: Data quality measurement.
- How many rows of information will be affected by an archive
or purge of data? Purpose: Administrative and maintenance.
- What is the growth rate for a table per load cycle? Purpose:
Administrative, maintenance and publishing of ETL load cycle
results.
- What effective growth rate is being experienced from
a particular operational source system on the data warehouse?
Purpose: Administrative, maintenance and publishing of ETL
results.
- What is the relative confidence level of the data currently
loaded into the data warehouse? Purpose: Data quality measurement.
- What percentage effect on confidence level would be measured
if dirty data from an operational system were corrected
at the source? Purpose: Data quality measurement.
Summary
Operational meta data allows for detailed rows level explanation
of actual information content in the data warehouse. The specific
addition of these types of data columns into the dimensional
data model and ETL processes can help business users with
reconciliation and auditing of data quality issues. Additionally,
data warehouse administrator’s benefit through the incorporation
of operational meta data by gaining increased options and
flexibility in accomplishing administrative tasks.
About the Author
|