What the Accountant Must Know About Data Warehousing

By Marc Levine and Joel Siegel

In Brief

A Resource for Analysis and Decision Making

A data warehouse stores information designed to assist in management decision making. Its major purpose is to make all of the database information scattered throughout an organization available and relevant to decision making. Because the data warehouse must integrate data from many sources, programs must reformat that data so it can be transferred among different database management systems. The authors identify the major characteristics of data warehouses and discuss their uses, advantages, and implementation issues.

Informed decision making requires an understanding of all business components (products, customers, vendors, marketing strategies, manufacturing, and administrative alternatives) through quantifiable and comparable data. Technological advancements have made the adoption and implementation of data warehouses financially feasible. The use of these systems assists managers in deriving trends, preparing forecasts, and focusing strategic decisions by putting vast amounts of information at their fingertips.

What Is a Data Warehouse?

A data warehouse consists of a group of technologies that integrate the operational data of all departments into a single database. The information in the data warehouse is arranged by subject for easy use. It is a database for large volumes of data used primarily for high-level analysis and decision making rather than for operations. For example, a data warehouse could aggregate information about customer buying patterns to create incentives for additional purchases. Likewise, an analysis of purchasing schedules could determine whether batching material acquisitions to take advantage of quantity and timing discounts would reduce costs.

Data entering the warehouse require restructuring rather than simple consolidation. This occurs by denormalizing tables, adding data fields, and scrubbing dirty data to create uniform entries. In summary, a data warehouse extracts important historical data from operational databases and transforms it into formats compatible with analytical requirements and decision-making needs.
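
To make the idea concrete, the following minimal sketch (in Python, with hypothetical field names, state codes, and date layouts invented for illustration) shows the kind of scrubbing that turns inconsistent operational records into uniform warehouse entries:

    # A minimal scrubbing sketch; the source fields and their inconsistencies
    # are hypothetical and exist only for illustration.
    from datetime import datetime

    STATE_CODES = {"new york": "NY", "n.y.": "NY", "ny": "NY",
                   "california": "CA", "calif.": "CA"}

    def scrub_record(raw):
        """Return a uniform warehouse entry built from a raw operational record."""
        state = STATE_CODES.get(raw["state"].strip().lower(),
                                raw["state"].strip().upper())
        order_date = None
        # Normalize several possible date layouts into a single ISO format.
        for fmt in ("%m/%d/%Y", "%Y-%m-%d", "%d-%b-%y"):
            try:
                order_date = datetime.strptime(raw["order_date"], fmt).date().isoformat()
                break
            except ValueError:
                continue
        return {"customer_id": int(raw["customer_id"]),
                "state": state,
                "order_date": order_date,
                "amount": round(float(raw["amount"]), 2)}

    raw_rows = [
        {"customer_id": "1001", "state": "N.Y.", "order_date": "03/15/2008", "amount": "125.5"},
        {"customer_id": "1002", "state": "calif.", "order_date": "2008-03-16", "amount": "88.00"},
    ]
    print([scrub_record(r) for r in raw_rows])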

The teams of professionals that create data warehouses include—

  • data warehouse designer (creates database architecture)
  • database administrator (creates physical database structure, system performance requirements)
  • systems analyst/programmer (creates programs for data extraction, transformation, integration)
  • end-user analyst (responsible for ensuring that end-user needs and requirements are being met)
  • technical and administrative support staff.

    To achieve the primary goal of supporting strategic decision making, W.H. Inmon et al., in Corporate Information Factory, identified the following elements that distinguish data warehouses from other computerized decision aids: They are subject oriented, integrated, time variant, and nonvolatile, and they contain a collection of both detailed and summary data.

    Subject orientation means that data in the warehouse are organized by subject (customers, products, vendors, orders, and transactions) rather than by function, as is commonly done in operating departments, such as payroll, general ledger accounting, or marketing. By virtue of its subject orientation, the data can be manipulated and analyzed over a specified period without affecting records in the operating departments’ systems.

    Information in a data warehouse is drawn from different applications and sources in various formats. In order to achieve integration, the data must be fundamentally altered to share commonality in one or more of the following areas:

  • Definitions
  • Layouts
  • Encoding and decoding structures
  • Key structures
  • Relationships.

    The fundamental operation of the data warehouse is to process historical information. The data in the warehouse is time-variant: it is accurate and reliable only for the period during which it is held in the warehouse. This period can be relatively long (up to 10 years for detail in archives) or as short as required.

    Nonvolatility of data means that real-time updating of records does not actually occur. Changes are processed in the form of a time-variant snapshot and each new snapshot is part of a series of changes resulting in an historical record.
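
    A minimal sketch of this idea, using a hypothetical customer-balance record: rather than updating the record in place, each change is appended as a dated snapshot, and the series of snapshots becomes the historical record.

        from datetime import date

        # Snapshot history for one hypothetical customer account; nothing is
        # updated in place -- each change is appended as a new time-variant snapshot.
        snapshots = []

        def record_snapshot(as_of, balance, credit_limit):
            snapshots.append({"as_of": as_of, "balance": balance,
                              "credit_limit": credit_limit})

        record_snapshot(date(2008, 1, 31), 1200.00, 5000)
        record_snapshot(date(2008, 2, 29), 950.00, 5000)
        record_snapshot(date(2008, 3, 31), 1430.00, 7500)  # limit raised; older snapshots untouched

        def state_as_of(target):
            """Reconstruct the account's state at any historical date from the series."""
            prior = [s for s in snapshots if s["as_of"] <= target]
            return prior[-1] if prior else None

        print(state_as_of(date(2008, 3, 1)))  # returns the February snapshot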

    Data warehouses contain detailed as well as summary data. Detailed data reflect transaction occurrences that affect records such as customer accounts, vendor accounts, inventory, and sales. Summary data can consist of either profile records or public summary data. A profile record contains an assemblage, history, or summary of the customer’s activity, including transactional detail accumulated during each operational period. Public summary data, on the other hand, is circulated throughout the business entity for purposes of analysis and comparison (public scrutiny). Public summary data is derived by aggregating the departmental data about a particular subject. For example, the warehouse would aggregate data from various organizational units to prepare monthly, quarterly, or annual financial statements.
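
    A short sketch (with transaction detail invented for illustration) shows the distinction: a profile record accumulates a single customer's activity, while public summary data aggregates departmental detail for entity-wide reporting.

        from collections import defaultdict

        # Hypothetical transaction detail drawn from operational systems.
        transactions = [
            {"customer": "C-101", "unit": "East", "quarter": "2008Q1", "amount": 500.0},
            {"customer": "C-101", "unit": "East", "quarter": "2008Q2", "amount": 750.0},
            {"customer": "C-202", "unit": "West", "quarter": "2008Q1", "amount": 300.0},
        ]

        # Profile record: the accumulated history of one customer's activity.
        profile = defaultdict(float)
        for t in transactions:
            if t["customer"] == "C-101":
                profile[t["quarter"]] += t["amount"]
        print("C-101 profile:", dict(profile))

        # Public summary data: departmental detail aggregated for entity-wide review.
        summary = defaultdict(float)
        for t in transactions:
            summary[(t["unit"], t["quarter"])] += t["amount"]
        print("Summary by unit and quarter:", dict(summary))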

    Functionality

    A data warehouse enables decision-makers to construct strategies informed by the most realistic empirical data available: the past, cumulative, integrated experience of the organization. Data warehousing systems, driven by analytical requirements, permit investigative questions of detailed, historical data. For example, a review of purchasing patterns in past years could identify suppliers that have not offered sufficient discounts, providing the insight and analytical basis for renegotiating future prices.

    Without a data warehouse, management must first collect information from departments before addressing important decisions. Collecting information from separate functional areas runs the risk of combining data that may not integrate well for specific decision purposes. Data fully integrated and unified from all operational sources and designed for analytical purposes will lead to better business decisions. When data is accessed from a data warehouse, users can rely on the system to generate consistent results, even when the underlying data originated in incompatible source systems.

    A data warehouse consolidates the technology needed to generate and maintain reports because of the information system restructuring that takes place during its implementation. A data warehouse fosters competitive advantage by allowing users to manipulate data in ways that were previously impossible, uncovering relationships that lead to projections and informed actions to improve competitive positions.

    The implementation of a data warehouse should significantly improve the productivity of IT staff by reducing time spent on programming to satisfy the special use requests of decision-makers. In addition, the use of a data warehouse enables managers to directly access current information, saving money and empowering employees.

    Comparing Data Warehousing to Related Technologies

    Data warehousing systems were designed as an alternative to the online transaction processing (OLTP) systems that are used in most business operating applications. OLTP systems generally are not end-user friendly and often require a significant amount of additional programming to make them satisfy the specific needs of a decision-maker (See the Exhibit).

    OLTP systems are devised to serve online, day-to-day operational processing requirements such as immediately updating a perpetual inventory system. The data in an OLTP system changes constantly throughout the business day. In addition, OLTP systems process small amounts of data, such as a sale, and update system records for the order, customer’s account, shipment manifests, and inventory.

    Data warehouse systems, on the other hand, update information periodically from all data sources to provide integrity for comparative analysis. Using data that is immutable, historic, and summarized at various levels, a strategist is able to query a data warehouse for very specific investigative purposes.

    For example, a company that has made significant changes to its product mix over the last 12 months may want to know how these changes have affected sales volume and customer base. The analyst would study information in the warehouse regarding the sales revenue for both old and new products over the past 12 months as well as comparative data from the previous 12 months. In addition, the system can generate a survey sample of customers that switched to the replacement products so that they can be reached for specific feedback. This information would be helpful in making modifications to advertising as well as in product planning and development.
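
    A simple sketch of this kind of comparison, with sales figures invented for illustration, totals revenue by product group for the current and prior 12-month periods and reports the change:

        from collections import defaultdict

        # Hypothetical warehouse detail: (product group, period, revenue).
        sales = [
            ("old products", "prior 12 months", 520000.0),
            ("old products", "current 12 months", 310000.0),
            ("new products", "prior 12 months", 60000.0),
            ("new products", "current 12 months", 480000.0),
        ]

        totals = defaultdict(float)
        for group, period, revenue in sales:
            totals[(group, period)] += revenue

        for group in ("old products", "new products"):
            prior = totals[(group, "prior 12 months")]
            current = totals[(group, "current 12 months")]
            change = (current - prior) / prior * 100
            print(f"{group}: {prior:,.0f} -> {current:,.0f} ({change:+.1f}%)")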

    Data warehouses are updated through a periodic, planned schedule of batch processing. Although the lack of real-time updates in a data warehouse sometimes causes concern, its primary objective is to facilitate analytical, strategic decisions based on long-term trends in a myriad of business areas. Constantly changing data renders time-period analysis inconsistent.

    Uses

    The data warehouse can assist the internal audit staff in performing its watchdog function. The warehouse provides the information and analysis tools that enhance the performance of important control duties. In addition, the internal audit staff appraises the organization’s efficiency and guards against fraud. Data warehouses facilitate these duties by providing a rich data source for analysis and review and for developing fraud models. Using the data warehouse, trend and pattern analysis of specific accounts can reveal expected variations useful for monitoring. Any material aberration would prompt further investigation or identify fraud-sensitive areas for the application of alternative controls. Additionally, an analysis of standard budgets derived from a data warehouse could be compared to actual costs in order to identify inefficiencies.
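
    One simple way to operationalize such monitoring, sketched below with invented monthly totals for a single expense account, is to flag any month that deviates materially from the account's historical pattern (here, more than two standard deviations from the mean):

        from statistics import mean, stdev

        # Hypothetical monthly totals for one expense account held in the warehouse.
        monthly_totals = [10200, 9800, 10450, 9950, 10100,
                          10300, 9900, 14800, 10050, 10150]

        avg = mean(monthly_totals)
        spread = stdev(monthly_totals)

        for month, total in enumerate(monthly_totals, start=1):
            if abs(total - avg) > 2 * spread:
                print(f"Month {month}: {total} is a material aberration; investigate further.")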

    The data warehouse also enables customer or client profiling. For example, there are indications that the IRS plans to use data warehousing and data mining techniques to establish profiles of taxpayers for audit sensitivity purposes. Insurance companies may use similar techniques to classify individuals by underwriting risk. Marketers may use customer databases to determine individuals likely to respond to special sales incentives.

    Analyzing past repair and warranty recalls can determine the nature of problems associated with current products as well as how they might be improved. This would enable product development engineers to strategically refine current successful products as well as create new products that appeal to the changing tastes of their customers.

    Other applications of data warehousing include—

  • creating pricing models based on historical demand and inventory patterns
  • estimating the effects of price changes on demand in creating pricing layers, such as those in the airline and car rental industries
  • finding insurance claims fraud
  • generating credit risk scoring to, for example, assess an individual’s credit worthiness
  • creating custom-designed models (based on historical sales data, projected demand, strength of competitors, and other variables) for determining optimum pricing.

    Data Flow

    Sources of data. The sources of data for the warehouse are databases and systems in different locations throughout the organization. Each source database serves its own applications. Legacy systems can contribute mainframe operational data stored in company network databases. Source data can also reside in departmental file systems, workstations, and proprietary servers. In addition, data can come from outside sources, such as commercial databases, databases shared with large vendors or customers, and the Internet.

    Moving data into the warehouse: Extraction, transformation, integration, cleaning, and loading. Because data comes from so many different, usually incompatible, sources, effective data extraction must merge all data into standardized formats and fields. Scrubbing the data ensures that any errors and inconsistencies are eradicated. It is then appropriately summarized so that it may be easily digested by end-users. Finally, the data warehouse must be refreshed periodically so that resident information does not become too stale.

    The data warehouse architecture includes a selection of software tools to access data and contribute to a successful decision support system. Functional software applications perform operations that extract, load, modify, and transform data for consistency and preparation for data warehouse processing. Data conversion software automates the extraction from heterogeneous sources, maps source information to target data structures, and generates the program code to process it. Just before loading, fields in the source databases are identified, transformations to specified fields take place, and an overall depiction of the data extraction flow is laid out. The data is then loaded into the data warehouse.

    After loading, other functions that commonly occur include data scrubbing, data consistency, integrity verification, unmatched record identification, and creation of error listings and other exception reports. These tools, sometimes collectively referred to as a transformation engine, ensure that data integration occurs properly. The entire upload process can take place at one time or on an ongoing basis using data capture technology.

    The software tools perform the following functions (a minimal sketch of these steps follows the list):

  • Read the data from the legacy or departmental systems
  • Determine necessary changes
  • Create new load records
  • Transfer the data from the mainframe to the data warehouse servers
  • Transform and merge source data from temporary storage to data warehouse tables
  • Sort the data records so that new aggregate records may be prepared
  • Delineate and process exceptions
  • Index the newly loaded records.
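
    A compressed sketch of these steps, assuming a hypothetical legacy extract with invented field names, reads the source records, transforms them, merges them into a warehouse table, builds an aggregate, routes exceptions, and indexes the result:

        # Minimal end-to-end load sketch; the legacy extract and its layout
        # are hypothetical.
        legacy_extract = [
            "1001|NY|2008-03-15|125.50",
            "1002|CA|2008-03-16|88.00",
            "1003|TX|2008-03-17|not-a-number",   # bad record, routed to exceptions
        ]

        warehouse_table, exceptions = [], []

        for line in legacy_extract:                     # read the legacy data
            cust, state, order_date, amount = line.split("|")
            try:
                record = {"customer_id": int(cust),     # create the new load record
                          "state": state,
                          "order_date": order_date,
                          "amount": float(amount)}      # transform source data
                warehouse_table.append(record)          # merge into the warehouse table
            except ValueError:
                exceptions.append(line)                 # delineate and process exceptions

        # Sort and aggregate the newly loaded records by state.
        warehouse_table.sort(key=lambda r: r["state"])
        totals = {}
        for r in warehouse_table:
            totals[r["state"]] = totals.get(r["state"], 0.0) + r["amount"]

        # Index the new records by customer for fast lookup.
        index = {r["customer_id"]: r for r in warehouse_table}

        print(totals)
        print("Exceptions:", exceptions)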

    The sophistication of the loading, transforming, integrating, and cleaning process is a function of the compatibility of data sources coming into the data warehouse. It is important to monitor compliance breakdowns and problems that occur during the inflow process. There is software available for this purpose; however, a customized commercial package may be needed.

    Internal processing of the data in the warehouse. The processing of data in the warehouse consists of organizing it into formats that are useful to end-users. For example, decision-makers might use predefined reports such as spreadsheets, charts, graphs, or maps. Sophisticated analyses that enable the user to maintain a specified report structure while varying the primary variables may be needed. For example, a nationwide report on inventory turnover should be easily reformatted by specific products and locations.

    End-user programming languages. Currently, end-users can specify their desired information processing through structured query language (SQL) or online analytical processing (OLAP), which mitigates the limitations of SQL. OLAP and relational OLAP provide end-users with both storage capability and high-speed access to data in the warehouse. They support advanced analytical processing, extensive slice-and-dice processing, and drill-down capabilities. They go beyond SQL by integrating multidimensional database (MDD) tools that bring high-powered analytical functions into the database. MDD is a specialized technology that stores data in a cube-like array that may represent such business dimensions as product, customer, geography, location, or other variables. In addition, OLAP and other multidimensional tools allow answers to “why” or “what if” queries. For example, an executive may want to know how new products have influenced sales over the last five years (comparing year by year) and by location (comparing the entity’s seven locations). These tools are also capable of analyzing trends to assist decision-makers in making strategic decisions.
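
    The cube idea can be illustrated with a small sketch (dimensions and figures are invented): sales are stored by product, location, and year, then sliced along one dimension and drilled down along another.

        from collections import defaultdict

        # A toy three-dimensional cube: (product, location, year) -> sales.
        cube = defaultdict(float)
        facts = [
            ("widgets", "Boston", 2007, 120.0), ("widgets", "Boston", 2008, 150.0),
            ("widgets", "Dallas", 2008, 90.0), ("gadgets", "Boston", 2008, 200.0),
        ]
        for product, location, year, sales in facts:
            cube[(product, location, year)] += sales

        # Slice: hold one dimension fixed (year = 2008) and aggregate the rest.
        by_product_2008 = defaultdict(float)
        for (product, location, year), sales in cube.items():
            if year == 2008:
                by_product_2008[product] += sales
        print("2008 sales by product:", dict(by_product_2008))

        # Drill down: break one product's 2008 total out by location.
        drill = {loc: amt for (prod, loc, yr), amt in cube.items()
                 if prod == "widgets" and yr == 2008}
        print("Widgets 2008 by location:", drill)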

    Archival of information and distribution of information to end-users. After data is refreshed, archiving the displaced historical data is a significant warehouse function that helps preserve system integrity. The historical data, which is no longer as important for analysis, should be saved to a storage medium such as disk to provide an audit trail from which transactions can be reconstructed.

    Metadata Overview

    Warehouse data flow (discussed in the previous section) is controlled using metadata, providing end-users with a means of properly managing, maintaining, and building the warehouse in a comprehensible way. When metadata is utilized properly, end-users can study the contents of the data warehouse and determine the availability of data in clear, nontechnical terms.

    In short, metadata is information about data and its flow through the data warehouse. It indicates what data is available in the system, where it is located, and the historical trail that the data took into the data warehouse. In addition, metadata allows users to see and understand the logic of the warehouse’s data model. It provides knowledge of the following specific considerations (a simplified illustration follows the list):

  • The data elements resident in the data warehouse
  • Their specific contents
  • The legacy sources the data was extracted from
  • The extent of cleansing, integrating, transforming, and summarizing that was performed on the data.
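
    A simplified, hypothetical illustration of what such metadata might record for a single warehouse element:

        # Hypothetical metadata describing one data element in the warehouse.
        sales_amount_metadata = {
            "element": "monthly_sales_amount",
            "contents": "Net sales in U.S. dollars, summarized by customer and month",
            "legacy_sources": ["order entry system", "regional billing files"],
            "transformations": ["currency codes standardized to USD",
                                "duplicate invoices removed during scrubbing",
                                "daily detail summarized to monthly totals"],
            "last_refreshed": "2008-12-31",
        }

        # An end-user can inspect, in nontechnical terms, where the data came
        # from and what was done to it before relying on it for analysis.
        for key, value in sales_amount_metadata.items():
            print(f"{key}: {value}")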

    In Data Warehouse, Don Mayer and Casey Canon identify technical and business categories of metadata. Technical metadata is used by the software that runs the data warehouses as well as IT administrators. It depicts the data contents in the source legacy systems, establishes the processing design of the warehouse, and describes its functions. It describes what processing has occurred in the data warehouse including the extraction from source systems, transformations, cleansings, and integrations as well as the updating of the schedules that control these processes. Business metadata, on the other hand, allows nontechnical business users to view the data warehouse contents, the tables and charts that are its information elements, how the information elements can be used, and timeliness of resident data. Most business metadata is supplemented with descriptions of its contents.

    Using metadata to administer data flow. Metadata controls the long-term administrative functions by managing data in the warehouse over its historical period. Because data enter the warehouse from numerous operational databases, the manager must be aware of all its sources, modifications, and transformations. Some of the kinds of information monitored by the metadata include a review of the data’s extraction, attributes and essential data, indexing information, number of rows in tables, usage statistics, and patterns of data aggregations. To accomplish this, the metadata manager should specify the following:

  • Well defined business and technical data variables
  • Sources of data, including operational systems and extraction and scrubbing software
  • Regular update schedules
  • Techniques for retrieving data from input sources
  • Procedures and policies that ensure that the data being processed is properly maintained and archived
  • Proper system documentation procedures.

    Every operational data system contains metadata. Metadata integration software is available to condense and eliminate many of the steps sometimes required for huge data warehouses. Unfortunately, success has been limited due to a lack of standards. Another solution is the use of metadata repository tools, which integrate, synchronize, and copy the metadata distributed throughout the data warehouse.

    In general, the key to success in the use of metadata is that users have access to this information as well as the knowledge and confidence to use it.

    Data Marts

    In some large organizations, the creation of a global, consolidated, efficiently operating database could be a goal, but designing and creating a single warehouse to successfully serve all needs would not be realistic. A large, single data warehouse requires so much processing power and storage that the cost of solving the queries of individuals could become very expensive. As more individuals attempt to use the system, competition for information could reduce access capabilities and create delays.

    One solution is to design data marts that service the needs of specific company departments with the possibility of integration into a single data warehouse in the future. Data marts satisfy the specialized needs of departments but are smaller and easier to use.

    A data mart is a subset of a data warehouse that aggregates the information needed by a particular department or company activity. The benefit of such an arrangement is that the information resident in one department’s data mart can be customized to the needs of others.

    One approach would be to initially build independent departmental data marts that later can be connected to a central company data warehouse. Another consists of building a central data warehouse and data marts simultaneously with the aim of eventual integration.
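
    As a rough sketch (with warehouse rows invented for illustration), a departmental data mart can be built by filtering the subset of warehouse data that the department actually uses and summarizing it locally:

        # Hypothetical warehouse detail shared by the whole organization.
        warehouse_rows = [
            {"department": "sales", "region": "East", "year": 2008, "amount": 400.0},
            {"department": "sales", "region": "West", "year": 2008, "amount": 250.0},
            {"department": "payroll", "region": "East", "year": 2008, "amount": 900.0},
        ]

        def build_data_mart(rows, department):
            """Return the subset of warehouse data needed by one department."""
            return [r for r in rows if r["department"] == department]

        sales_mart = build_data_mart(warehouse_rows, "sales")

        # The department summarizes its own mart without competing for the
        # central warehouse's resources.
        by_region = {}
        for r in sales_mart:
            by_region[r["region"]] = by_region.get(r["region"], 0.0) + r["amount"]
        print(by_region)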

    Advantages of a data mart. There are several important reasons for implementing data marts:

  • They provide departmental users directly with the specific data that they need.
  • A department with its own data mart can customize the flow from the data warehouse to its specific needs.
  • Data may be accessed more quickly due to the reduced volume of transactions that are involved, improving response time.
  • Data customization can be done more expediently.
  • Per-unit processing and storage costs for a data mart server are much lower than for a data warehouse. Overall, the cost of data mart utilization is generally significantly less than that of a data warehouse.

    Types of data marts. Inmon et al. discuss three general types of data marts:
    1) those that contain subsets or summaries of the data in the data warehouse,
    2) those that utilize multidimensional online analytical processing (MOLAP), and
    3) those that utilize relational online analytical processing (ROLAP).
    The first category requires no explanation.

    The second category consists of data marts that use MOLAP. This multidimensional database uses the cube-like array structure for storing data and allows relatively easy implementation, quick processing response time, and calculation flexibility. MOLAP is ideal for financial applications in which detailed computations are required.

    In ROLAP, data is stored using a relational model. ROLAP runs better on large databases and performs a greater variety of functions than MOLAP systems. For example, ROLAP systems can process both detailed and summary data and support a myriad of information types. ROLAP data marts are compatible with hardware improvement, are software-friendly, and can be used for structured as well as creative processing.

    The choice of data mart or data warehouse technology should be predicated on the specific needs of the user. Nevertheless, many professionals believe that both do excellent jobs of analyzing the data.


    Marc Levine, PhD, CPA, and Joel Siegel, PhD, CPA, are professors at Queens College, New York.
