Document Warehousing and Text Mining - Chapter 1

by Dan Sullivan

For most of us, thinking about business intelligence brings to mind data warehouses, multidimensional models, and ad hoc reports. While these techniques and resources have served us well, they do not completely address the full scope of business intelligence (BI). BI provides decision makers with the information they need to understand, manage, and direct organizations. Unfortunately, we have only touched the tip of the information iceberg. To date, numeric and short character string information has been the sole grist for the BI mill. This so-called structured data excludes the most prevalent medium for expressing information and knowledge: text. Within text we find project status information, marketing reports, details of industry regulations, competitors’ advertising campaigns, and descriptions of new technologies in patent applications. We simply cannot get this type of detail from our traditional business intelligence systems.

The Need to Deal with Text

We need to expand the scope of business intelligence to include textual information. Now is the right time to do this, for a number of reasons. First, we now have tools at our disposal for analyzing text and extracting key information to create a document warehouse with distilled, useful business intelligence information. Steady advances in computational linguistics since the 1960s have left us with a wide range of tools for extracting key features, categorizing documents, indexing by topic as well as by keywords, automatically summarizing texts, and grouping similar documents. These tools are the keys to successful integration of text into the business intelligence infrastructure.

Second, the Internet and the World Wide Web are making vast amounts of information easily accessible. With the right tools we can find information about the financial, marketing, and technology plans of competitors. We can track changes in the legislative and regulatory environment that affect our industry and monitor the political and economic conditions of markets around the world. The range of topics that can be researched on the Internet is almost without limit.

A third reason for expanding the scope of business intelligence is that organizations have, since the dawn of commerce and centralized governments, depended upon writing systems of some form to record information. To date, we have accumulated 1,000 petabytes of data stored online in mainframes, servers, and client PCs, and that does not even include the Internet (Lycyk, 2000). A significant portion of this information is text-based, and organizations are beginning to realize the need to deal with this text from a decision support perspective. According to research by Survey.com, 81 percent of respondents expect to be supporting free-form text in the data warehouse by 2002 (Application Development Trends, February 2000). Our current means of dealing with, or ignoring, text are no longer sufficient to meet the needs of decision makers.

A fourth point to keep in mind is that successful organizations are not just driven by managing core operations such as selling products, tracking changes in quality control measures, or analyzing trends in cash flow. More and more, intangible aspects of organizations, such as knowledge about process management, patented technologies, and methodologies, are fundamental factors influencing the course of a business. “Increasingly, intellectual resources and not physical assets constitute the seeds of marketplace success” (Quinn, 1994). Managers and executives need to understand the competitive advantage created by their intellectual property as well as how the market responds to innovations by competitors. This kind of information is not available by looking at data extracts from transaction processing systems. It is, however, available to those who know where to look and how to extract the key information.

Finally, decision makers think strategically. This means that they need information about what is going on outside the organization as well as inside, as depicted in Figure 1.1.

Figure 1.1 Organizations do not exist in a vacuum.

Decision makers need to understand industry structure and dynamics, which shed light on the competitive environment in which companies operate. Macroenvironmental analysis, another aspect of strategic management, examines the economic, political, social, and technological events that influence an industry. Monitoring the macroenvironment is no small task. As two researchers have noted, “[t]he problem for corporations is that monitoring the macroenvironment is a lot like studying geology—the subject is huge, usually ponderous, but sometimes precipitous; and much of what researchers would like to know is buried under something heavy” (Narayanan and Fahey, 1994). With the aid of document warehousing and text mining techniques, this type of analysis can at least become less ponderous, and the information far more accessible.

The tools are here, the raw information is available, and organizations have recognized the need for dealing with text. The only question now is, how do we do it? Document warehousing and text mining are the answer. This book will describe the tools and provide the techniques needed to begin mining the rich deposits of information and knowledge available from both internal proprietary document collections and external publicly available sources. This chapter will lay the groundwork for the rest of the book, beginning with a brief description of the wide range of text sources we have at our disposal. It will then examine the limits of the current common approaches to working with large document sources such as the World Wide Web and identify some key benefits of improving this process. Document warehousing and text mining are then introduced as alternate solutions to the problem of managing and using text-based information efficiently.

The term document is used throughout this book to refer to a logical unit of text. This could apply to a Web page, a status memo, an invoice, a Supreme Court opinion, or War and Peace. Documents can be as complex and long as a new drug application to the Food and Drug Administration or as simple as a short e-mail. Of course, documents are often more than text and can include graphics and multimedia content. For our purposes, we are only concerned with the textual elements of documents.

Growth of Textual Information—The Good News

It is difficult to overstate the impact of the World Wide Web on the dissemination of information. Individuals, businesses, nonprofit organizations, governments, and even terrorist groups have set up shop on the Web so they can effectively share information. Conservative estimates of the size of the Web are over one billion pages at the time of publication of this book. Estimates that include dynamically generated pages from Web databases run as high as 500 times that size.

Not only is the sheer volume of text growing, but the breadth and depth of information available make ignoring the business intelligence value of text a dangerous proposition. Consider the fact that the Internet started as a government-funded research project testing networking between computer science research centers.

After computer researchers, other scientists and engineers began using the Internet to share information. The introduction of the World Wide Web as a means for creating hyperlinks between scientific papers opened the door for others outside the strictly technical disciplines to take advantage of the Internet. Now virtually all publicly available information is accessible via the Internet.

The range of topics covered by sources on the Web can meet the demand of just about any business. If you want to find the annual and quarterly reports of a publicly traded company, just go to the Securities and Exchange Commission Web site at www.sec.gov. If you are wondering what is going on at the Human Genome Project, visit www.ornl.gov/hgmis/publicat/publications.html. If you need to assess the political and economic risks involved in expanding your business into Southeast Asia, the Federal Reserve Bank of San Francisco’s Center for Pacific Basin Studies at www.frbsf.org/econrsrch/pbc/index.html is a good place to start. If you are cleaning up a methylbenzene spill and need recovery procedures, try hazard.com/msds. Just about any topic of interest to a business, organization, or government has a place on the World Wide Web.

The Web is not the only growing source of text information. Corporate repositories and document management systems are growing as well. Documents that have never been printed are as important as, or sometimes more important than, their tangible counterparts. Few businesses could or would want to operate without e-mail. Legal departments use case management systems to track depositions, briefs, research, and other textual material. Engineers need access to specifications and project plans. Sales and marketing executives research market conditions and sales prospects and develop their own strategies that are then documented. There is no lack of the written word in the business world today.



  
  




  


All product names are trademarks of their respective companies.
Copyright © ITNetwork365 - All Rights Reserved