Growth of Textual Information—The Bad News
The bad news about this wealth of textual information can be divided into two parts. First, there is no good way to find all the information and only the information we are looking for when we deal with large document collections, such as the World Wide Web. Second, even if we can find a fraction of what appears relevant, it is often too much to read, or even skim. There is just too much information.
Finding Information: It’s Not as Easy as It Used to Be
The growth of the Web has been accompanied by the emergence of search engines. Search technology has been around as long as we have had documents stored in information retrieval (IR) systems, but Web-based search engines differ from their IR counterparts.
First of all, traditional IR systems are often used to store relatively homogeneous sets of documents, such as news stories, legislation, case law records, and hazardous material safety data sheets. Web search engines confront documents of all kinds, including multimedia documents that span the breadth of human discourse.
Second, IR systems use centrally managed text databases, while the Web is the archetypal decentralized distributed system. There is no single place to go to find all the documents on the Web, as shown in Figure 1.2.

Figure 1.2 The distributed, decentralized nature of the Internet makes it impossible to work from a centralized starting point for indexing the Web.
Search engines work around this by moving from one Web page to the next using hyperlinks to find other documents, but this does not guarantee that every page is found. IR systems, on the other hand, have complete control over the contents of their text database and, therefore, can index the entire set of documents.
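The link-following behavior described above can be sketched as a breadth-first traversal of a link graph. This is only an illustration under stated assumptions: the page names and the toy_web graph are hypothetical, and a real crawler would fetch and parse pages rather than read links from a dictionary. The sketch shows why a page that no crawled page links to is never found.

```python
from collections import deque

def crawl(link_graph, seeds):
    """Breadth-first traversal of a link graph starting from seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Follow every outgoing hyperlink that has not been visited yet.
        for target in link_graph.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy web: keys are pages, values are their outgoing links.
toy_web = {
    "a.example/home": ["a.example/news", "b.example/index"],
    "a.example/news": ["a.example/home"],
    "b.example/index": [],
    "c.example/orphan": ["a.example/home"],  # links out, but nothing links in
}

found = crawl(toy_web, ["a.example/home"])
missed = set(toy_web) - found
print(sorted(missed))  # ['c.example/orphan']
```

An IR system that owns its text database would simply iterate over every stored document, so no such gap could arise.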
Another advantage that IR search engines have is that they can make assumptions about the terms that a user will use when searching for information. For example, if an IR database containing documents on transportation is presented with a query including the term fly, it can assume that the user is interested in air transportation. The query optimizer could expand the query written by the user to include related terms, such as flight, to improve the chances that all relevant documents will be found. A World Wide Web search engine does not have the luxury of working with a controlled vocabulary like that. On the Web, fly could refer to airline flights, insects, or zippers.
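A minimal sketch of this kind of query expansion follows. The vocabulary table and the expand_query function are hypothetical names introduced for illustration; only the fly to flight relationship comes from the example above, and the other entries are assumptions.

```python
# Hypothetical controlled vocabulary for a transportation collection.
# Only "fly" -> "flight" is taken from the text; the rest is illustrative.
TRANSPORT_VOCABULARY = {
    "fly": ["flight", "airline"],
    "drive": ["car", "highway"],
}

def expand_query(terms, vocabulary):
    """Return the query terms plus any related terms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + vocabulary.get(term.lower(), []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query(["fly", "Boston"], TRANSPORT_VOCABULARY)
print(expanded)  # ['fly', 'flight', 'airline', 'Boston']
```

The expansion is safe only because the collection is restricted to one domain. Applied to Web-wide text, the same table would pull in flight-related terms for queries about insects or zippers.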
The breadth of the Web makes it more challenging to deal with than traditional IR systems, and to make things worse, we still have to live with problems that have plagued IR systems since their inception: poor precision and recall. Both measures describe how effectively a search engine answers a user’s query. Precision measures how well the documents returned in response to a query actually address the query: it is the fraction of returned documents that are relevant. If the IR system returns a large number of documents but most are irrelevant, then precision is low, but if most of the documents are relevant, then precision is high. Anyone who has run a query through a Web search engine only to be left scratching his or her head about how some of those hits could possibly have anything to do with the search has experienced poor precision firsthand. Recall is a related measure. Rather than looking at the quality of what was returned, recall looks at what should have been returned: it is the fraction of all relevant documents that the search actually found. Since even the largest search engines cannot index every Web page, there will be missed pages for some queries, resulting in poor recall. Better indexing will increase recall, but it will also lead to more text that the end users will have to read through to find the information they are looking for. Pick your poison.
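The two measures can be computed with a few lines of code once we know which documents were returned and which are relevant. The sketch below uses the standard set-based definitions; the document identifiers are made up for illustration, and in practice the full set of relevant documents is rarely known and must be estimated.

```python
def precision_recall(returned, relevant):
    """Compute precision and recall for one query from two sets of document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query result: four documents returned, five relevant overall.
returned = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d5", "d6", "d7"]

p, r = precision_recall(returned, relevant)
print(p, r)  # 0.5 0.4
```

Here half of what was returned is relevant (precision 0.5), but only two of the five relevant documents were found (recall 0.4).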
Beware What You Wish for: Finding Too Much Information
The problem we confront today is that we have too much text at our disposal. Imagine if we had databases filled with transactional data but no way to distill the information down into key pieces of information. Even before data warehousing and dimensional modeling were fully developed practices, decision support systems were created that could give a manager or executive an overview of even large transaction sets. Today, most managers cannot get an overview of the contents of a large document collection without enlisting the help of a researcher, analyst, or staff member to review a group of documents, identify the relevant ones, read the text, and summarize the findings. Would any manager today have an employee sit down with a printout of transactions from an accounting system and a calculator and manually add the numbers to find out the state of accounts receivable? Of course not, but we go through a similar process when we need to find key information in text. We do not need to keep working with text in this way, and document warehousing and text mining are the keys to the solution.
The Document Warehousing Approach to the Information Glut
Document warehousing is one approach to dealing with a glut of textual information and is analogous to data warehousing as a method for dealing with large volumes of numeric data. Document warehouses distinguish themselves from data warehouses by the types of questions they are designed to answer. Data warehouses are excellent tools for answering who, what, when, where, and how much questions. They do not do so well with why questions, and these are document warehousing’s forte.
Data warehouses, in practice, are often internally focused. We use them to better analyze the operational information of our organizations and rarely include external sources of information. (Demographic data is one common exception.) Document warehouses, though, can gather and process text from any source, internal or external, and this is the key to the document warehouse’s ability to support strategic management that looks beyond the internal operation to the external factors that influence an organization. Of course, we could use external sources for data warehousing as well, but the work involved in finding and acquiring relevant data in appropriate formats often far outweighs the marginal benefit of having the additional information.
There is a price to pay for this branching out, though. One has less control over external sources than internal ones. Running an extraction, transformation, and load (ETL) process for a data warehouse will generate well-defined results. Dimensions will be updated, fact tables will grow, and errors will be logged. We may not know the details of the data that is loaded, but we will know its general form. Searching the Internet for content for a document warehouse, by contrast, can lead to unexpected sources of text. Some downloaded documents may be irrelevant or, worse, inaccurate and misleading. As we shall see in the chapters ahead, one of the most important steps in document warehousing is controlling the document collection process.
Supporting Business Intelligence with Text
Document warehouses are repositories of textual information designed to support business intelligence and decision support operations. The textual information maintained in a document warehouse can include:
- Complete documents
- Automatically generated summaries of documents
- Translations of documents in several languages
- Metadata about documents, such as authors’ names, publication dates, and subject keywords
- Automatically extracted key features
- Clustering information about similar documents
- Thematic or topical indexes
Document warehouses, unlike document management systems, include extensive semantic information about documents, document groupings, cross-document feature relations, and other attributes designed to provide high-precision, high-recall access to business intelligence information.
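To make the list of contents concrete, one entry in a document warehouse might be modeled roughly as follows. This is a sketch only: the class name, field names, and sample values are hypothetical and are not taken from any particular product or from a schema defined in this text.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WarehouseDocument:
    """One hypothetical document warehouse entry; all field names are illustrative."""
    doc_id: str
    full_text: str                                     # complete document
    summary: str = ""                                  # automatically generated summary
    translations: dict = field(default_factory=dict)   # language code -> translated text
    metadata: dict = field(default_factory=dict)       # author, publication date, keywords
    key_features: list = field(default_factory=list)   # automatically extracted features
    cluster_id: Optional[int] = None                   # cluster of similar documents
    topics: list = field(default_factory=list)         # thematic or topical index entries

# Made-up sample values, for illustration only.
doc = WarehouseDocument(
    doc_id="doc-001",
    full_text="Sample text of a market analysis report.",
    metadata={"author": "A. Writer", "published": "2000-01-15"},
    topics=["market analysis"],
)
print(doc.metadata["author"], doc.cluster_id)
```

A document management system would typically stop at the first two fields plus basic metadata. The remaining fields hold the semantic information that distinguishes a document warehouse.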
Data warehouses, especially those built on dimensional models, provide aggregated views of large numbers of transactions. For example, we can easily find the number of laptops sold in the eastern region during the third quarter by a particular salesperson or the total gross revenue for a product line in one store during the Christmas season. Rather than starting with small units of information, such as a transaction, and aggregating along a predefined set of dimensions, document warehouses start with richly complex units (that is, documents) and extract information by applying linguistic-processing and text mining techniques.
Figure 1.3 shows how the data in a large number of transactions can be reduced to a few rows in a fact table and thus provide information, not just data, to end users.

Figure 1.3 Transactional data needs to be structured to provide true decision support information.
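The reduction from many transactions to a few fact rows can be sketched in code. The transactions, the choice of dimensions (region, quarter, product), and the build_fact_rows function are all assumptions made for illustration; a real data warehouse would do this in its ETL process and store the result in a fact table.

```python
from collections import defaultdict

# Hypothetical transactions: (region, quarter, product, amount)
transactions = [
    ("East", "Q3", "laptop", 1200.0),
    ("East", "Q3", "laptop", 1150.0),
    ("East", "Q3", "monitor", 300.0),
    ("West", "Q3", "laptop", 1250.0),
]

def build_fact_rows(rows):
    """Aggregate transactions into one row per (region, quarter, product)."""
    facts = defaultdict(lambda: {"units": 0, "revenue": 0.0})
    for region, quarter, product, amount in rows:
        cell = facts[(region, quarter, product)]
        cell["units"] += 1
        cell["revenue"] += amount
    return dict(facts)

facts = build_fact_rows(transactions)
print(facts[("East", "Q3", "laptop")])  # {'units': 2, 'revenue': 2350.0}
```

Four transactions collapse into three fact rows here. With millions of transactions the number of fact rows is bounded by the number of dimension combinations, which is what makes the aggregated view manageable.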
Document warehouses have almost the opposite problem. Documents usually have very high information content. Unlike transactions in an online transaction processing (OLTP) system, documents do not follow a relational structure. For example, a simple sales transaction might use a structure like that shown in Figure 1.4.

Figure 1.4 Normalized relational models are commonly used to structure transaction processing systems.
Document contents are often referred to as free-form text because we cannot fit the contents into a fixed relational structure, at least not a useful one. We can put an entire document into a binary large object (BLOB) and claim victory. This is fine if we are developing document management systems (which are essentially transaction processing systems anyway). It does not help in a decision support environment, though. We need to understand what is inside those documents; we need to take them apart, dissect them linguistically, and then make explicit the essential information contained in the text.
We have set the stage for the need for document warehousing, and at this point it is worth developing a formal definition for document warehousing and discussing some of the implications of this definition.