THE ROLE FOR WEB SEARCH ENGINES

By Ingrid E. Fisher and L. Murphy Smith

In Brief

Two Sides to the Story

Buried in that vast and rapidly growing collection of documents on the Internet lies information of interest and use to almost every accountant. The trick is finding it. This is where search engines play their critical role.

To conduct an effective search, the CPA must understand the structure of the various search engines. Search engines do not always provide the right information; often they subject the user to a deluge of disjointed, irrelevant data instead.

Search engines do not manage information, at least not in the conventional business sense. They do not, in fact, search the Internet when the search button is clicked. Their crawlers and spiders have done their work in advance, based on their own criteria for the categories of content they want to include in their databases. Conducting successful searches depends on knowing how the engines work. This knowledge also helps get a website noticed by the search engines.

Search engines provide Internet users a tool for locating and retrieving data from the World Wide Web, based upon keywords supplied by the user. All search engines share the following basic elements:

1) a spider (also referred to as a crawler or a bot) that goes onto the web and reads pages, following hypertext links to other pages and sites on the web;
2) a program that configures the pages that have been read by the spider into an index; and
3) a second program that takes user-supplied keywords and searches the index, essentially a process of comparison and matching based on the engine's criteria, returning the user a set of results.

The results are usually ranked according to how closely they match the keywords, as defined by the search engine's set of variable criteria. Exhibit 1 shows a number of popular search engines that can be accessed on the web (acct.tamu.edu/smith/search.htm).
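To make these three elements concrete, here is a minimal sketch in Python of how a spider's output, an index program, and a query program might fit together. The pages, function names, and scoring below are illustrative assumptions for the sketch, not a description of any particular engine.

```python
from collections import defaultdict

# 1) A toy "spider": in practice this would fetch pages over HTTP and
#    follow hypertext links; here the crawled pages are simply hard-coded.
crawled_pages = {
    "http://example.com/a": "revenue recognition rules for software firms",
    "http://example.com/b": "divorce settlements and forensic accounting",
    "http://example.com/c": "revenue and profit recognition under GAAP",
}

# 2) The index program: map each word to the set of pages containing it.
index = defaultdict(set)
for url, text in crawled_pages.items():
    for word in text.lower().split():
        index[word].add(url)

# 3) The query program: compare user-supplied keywords against the index
#    and rank pages by how many of the keywords they contain.
def search(keywords):
    scores = defaultdict(int)
    for word in keywords:
        for url in index.get(word.lower(), set()):
            scores[url] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search(["revenue", "recognition"]))
```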

Behind the Scenes

An important point regarding search engines is that the user's search is a search of the engine's index and not of the web itself. The web search, performed by the spider, occurs earlier. The amount of the web crawled by the spider determines the size of the engine's library of web documents. No search engine is able to cover the entire web. Pages appear and disappear hourly.

The size of the web is mammoth, and it continues to grow at an exponential rate. Including 100% of the web is not possible or even desirable, as the quality of web-based documents varies from junk to highly respected, reliable, and relevant data. Search engines differ as to what percentage of the web they cover, their techniques used in obtaining coverage, and their selectivity in eliminating junk.

How frequently the engine's spider crawls the web will also influence the user's choice of search tool. Frequent crawling ensures that current documents are included in the index, and documents no longer available (dead links) are eliminated. This is especially important in an age when old information is quickly superseded by the new.

All search engines support single-word queries. The user simply types in a keyword and presses the search button. Most engines also support multiple-word queries. However, the engines differ as to whether and to what extent they support Boolean operators (such as "and" and "or") and the level of detail supported in the query. More specific queries will enhance the relevance of the user's results.

The final step is the search, locate, and match process itself. Location and frequency of the keywords' occurrence within the document are the most common criteria used by search engines in matching and ranking. Words located in the title of a document are awarded a greater weight, as are words located in HTML meta tags. Those located in subject descriptions and those located higher up (i.e., earlier) on the page are also more highly weighted. The frequent recurrence of the keywords results in a greater weight; however, frequency is subject to certain limitations.
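By way of illustration only, a ranking program might combine these location and frequency signals roughly as follows. The specific weights and field names are assumptions made for the sketch, not the formula of any actual engine.

```python
def score_page(keywords, title, meta_keywords, body):
    """Toy relevance score: words in the title and meta tags count more,
    and early occurrences in the body count more than later ones."""
    score = 0.0
    body_words = body.lower().split()
    for word in (w.lower() for w in keywords):
        if word in title.lower():
            score += 5.0                              # title match weighted heavily
        if word in (m.lower() for m in meta_keywords):
            score += 3.0                              # meta-tag match
        for position, body_word in enumerate(body_words):
            if body_word == word:
                score += 1.0 / (1 + position / 100)   # earlier occurrences weigh more
    return score
```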

Most search engines recognize and defend against a practice called spamdexing. "Spamdexing" refers to the practice of tailoring the design and content of a web document in an effort to cause search engines to index it favorably. The actual content may not be relevant. The most common practice is to simply overload the web page with common terms in the initial part of the document, in a way that is invisible to the user but readable by the search engine spider and index program. When search engines detect spamdexing, most will either omit the site from their results or rank it at the bottom.
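A crude defense against this kind of keyword stuffing can be sketched as a density check: if any single term accounts for too large a share of a page's words, the page is flagged. The 5% threshold below is an arbitrary illustration, not a documented cutoff used by any engine.

```python
def looks_like_spamdexing(body, max_density=0.05):
    """Flag a page if any single word makes up more than max_density of the
    words on the page -- a crude keyword-stuffing test (a real engine would
    also discount common stop words such as "the" or "and")."""
    words = body.lower().split()
    if not words:
        return False
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return max(counts.values()) / len(words) > max_density
```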

Variations on the Search Engine

A search engine is not the same as a "subject directory." A subject directory does not visit the web, at least not by using the programmed, automated tools of a search engine. Websites must be submitted to a staff of trained individuals entrusted with the task of reviewing, classifying, and ranking the sites. Content has been screened for quality, and the sites have been categorized and organized so as to provide the user with the most logical access. The advantage of subject directories is that they typically yield a smaller, but more focused, set of results.

The most significant limitation of a subject directory is the time lag involved in reviewing and categorizing sites. Yahoo!, the original subject directory and the most often consulted web search tool, has been criticized for this. A subject directory is also only as good as the classification scheme used by the directory's managers.

Gaining in popularity is the combining of a subject directory with a search engine to form a portal. A portal is intended to be an Internet user's one-stop point of entry to the web. Portals often provide the user with a number of specialized search engines (e.g., focusing only on news or sports), sometimes tied in to other sites. The appeal lies in the array of customizable and personalized features they can offer. For example, portals frequently offer a periodic status report on a stock portfolio, free web-based e-mail, and a customizable home page with a menu of favorite destinations. Portals often store and track a user's personal data, favorite topics, and frequent searches to provide personalized services (e.g., horoscopes, weather reports, or breaking news). The array of services will continue to expand as portals compete to build and keep audiences that can be sold to advertisers.

Basic Internet Applications

As previously discussed, no search engine covers the entire web. There are technical obstacles such as the inability to index frames, image maps, or dynamically created websites. There are also monetary constraints that prevent companies from investing in the power necessary to crawl the web both comprehensively and quickly, while simultaneously supplying the necessary storage space.

The number of pages indexed by some of the search engines previously described is as follows: AltaVista, 140 million pages; Excite, 55 million pages; HotBot, 110 million pages; Infoseek, 30 million pages; Lycos, 30 million pages; and Northern Light, 80 million pages. There are significant differences in size across the engines. Size matters when searching for a highly specific piece of information or for every piece of information available. To be truly comprehensive, it is prudent to use more than one search engine.

Just as important as the search engine's size is its freshness, or how often the spider crawls across the web to find new sites and eliminate dead links. For example, AltaVista, Excite, and HotBot all crawl regularly, covering a substantial number of pages.

The depth of query support provided increases the likelihood of finding exactly what is of interest. For example, HotBot provides many options for filtering and parsing a query through a series of drop-down menus and buttons. AltaVista and Northern Light also offer excellent support for query structuring. On the other hand, when searching for a broad topic, a subject directory like Yahoo! may be more efficient.

Most searches are conducted based on keywords. Although a single keyword can be used, the user will usually want to include at least two, in order to increase the relevancy of the sites located. Boolean operators can help specify the topic. For example, to obtain all sites that include either the word "income" or the word "profit," the two must be combined using the operator "or." Search engines rely primarily upon string matching: a one-by-one, letter-by-letter comparison. If a word is misspelled, the results will be irrelevant. String matching is incapable of recognizing meaning or context. It continues to be used, however, because it is the most efficient and inexpensive method available. Other means of matching are in development (e.g., visual mapping and natural language), but they will be more costly and likely slower than simple string matching.
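The "income or profit" example can be expressed as a simple Boolean test built on plain string matching. The function below is a sketch only; the exact operator syntax and behavior vary from engine to engine.

```python
def matches(page_text, required_any=(), required_all=()):
    """Return True if the page satisfies a simple Boolean query:
    at least one word from required_any ("or") and every word in
    required_all ("and"). Matching is plain string comparison, so a
    misspelled keyword finds nothing."""
    words = set(page_text.lower().split())
    any_ok = not required_any or any(w.lower() in words for w in required_any)
    all_ok = all(w.lower() in words for w in required_all)
    return any_ok and all_ok

# "income" OR "profit"
print(matches("net profit rose sharply", required_any=("income", "profit")))  # True
```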

Some of the search engines are using measures of website popularity in their attempt to rank site relevance. Direct Hit, Inc., introduced a technology for measuring the number of user hits on a particular site as a reflection of its relevance. HotBot has currently partnered with Direct Hit and, after premiering it as an option, now offers its results whenever available. Global Brain, developed by a New Zealand team of programmers, uses click-through data in much the same way but takes it one step further by requesting users to register their country, age, occupation, favorite sport, and gender. Global Brain uses this information to construct profiles of the individuals and provide search results targeted to the profiles. Global Brain is not yet in commercial application.

Example of a Web Search

Three queries were made concerning accounting-related issues. Results from four major search engines are shown in Exhibit 2. The search engines used to test the queries were AltaVista, the largest; Lycos, one of the smallest; Yahoo!, the most popular; and Northern Light, a relative newcomer. For this test, only the top page of results provided by each engine was reviewed, generally 10-25 websites, because the relevancy of sites typically drops off sharply beyond the first page.

Search 1. The first search was performed on the topic "revenue recognition," a relatively broad concept with a variety of applications in accounting. The search was structured as a simple query, without attempting to use the more advanced search options that most engines offer.

AltaVista, the first engine queried, located 4,567 web pages when "revenue recognition" was entered in quotation marks, which indicate that the words should be treated as a single phrase. The first 10 found were relevant to the topic. All were dated within the past two years. Of the 10 sites returned, four focused on a particular company's practices, three focused on computer and software industry issues, and four were of a more general nature. The search was done again without the quotation marks, and the results differed. (AltaVista has recently added a phrase-searching feature that should not require quotation marks.) This time, AltaVista returned 4,381 web pages, with three new sites included in the top 10 results. None of the additional sites were of particular note; however, this demonstrates that it is wise to structure and test a query in several different ways to be sure optimal results are obtained. Finally, the query was structured as a natural language question: "What is the most recent revenue recognition ruling issued by the SEC?" AltaVista returned an astonishing 35,481,882 web pages, indicating that it was unable to link the words. This should serve as a caution against overrelying on new features that may not work the way one expects.
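The effect of the quotation marks can be illustrated by treating a quoted query as a contiguous phrase rather than as independent keywords. The sketch below shows the general idea; it is not a description of AltaVista's actual mechanism, and the sample sentence is invented.

```python
def phrase_match(page_text, phrase):
    """Quoted query: the words must appear together, in order."""
    return phrase.lower() in page_text.lower()

def keyword_match(page_text, phrase):
    """Unquoted query: each word may appear anywhere on the page."""
    words = set(page_text.lower().split())
    return all(w in words for w in phrase.lower().split())

text = "The SEC issued new guidance on recognition of deferred revenue items."
print(phrase_match(text, "revenue recognition"))   # False: words are not adjacent
print(keyword_match(text, "revenue recognition"))  # True: both words are present
```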

The second search tool consulted was Yahoo! The "revenue recognition" phrase, enclosed by quotation marks, returned 856 pages. The top ranked site identified by Yahoo! was a posting to a newsgroup from 1996 that happened to mention revenue recognition several times in the header. This high ranking was most likely attributable to the way in which the mailer was configured. The most recent site, which addressed SEC challenges to current revenue recognition principles, was ranked twentieth.

When the identical query was made to Lycos, the top 10 listings were shown without any indication of how many pages had been located (most engines begin their results page by providing this information). Only the first five sites actually related to revenue recognition. The listings were also, on average, the least current of any of the engines tested.

Finally, "revenue recognition" was input to Northern Light, which located 5,170 items. Northern Light also automatically provided a series of "custom search folders," such as current news, SEC, educational sites, and accounting, which offered a way to reclassify the findings and narrow the search. By selecting the accounting option, the results were cut to 836, with a second set of folders offered. All of the sites returned were relevant. The second search provided the most timely results of all four engines. Northern Light also included several legal cases addressing revenue recognition practices.

Northern Light presents the user with a series of folder tabs labeled simple, power search (providing advanced search options), publication search, industry search, and current news search. Northern Light provided a very user-friendly interface, with the feel and facility of a subject directory as well as the utility of a true search engine.

Search 2. The second query tested contained the words "accounting" and "divorce." The way to ensure that both words appear in the document, for all four search engines, is to enter the words with a plus sign: +accounting +divorce (using a minus sign would exclude that word). AltaVista returned 15 pages. Of the top 10 results, only the first seven were truly relevant. Lycos was unable to locate any sites at all, probably due to its significantly smaller index. Yahoo! was able to locate 3,858 sites, but of the top 10, only eight were relevant to both words. Northern Light returned eight items, all highly relevant; however, the final two items dated back to 1995.
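The plus and minus signs can be read as required and excluded terms. The small parser below is purely illustrative; it does not reproduce the exact query grammar of any of the engines tested, and it ignores quoted phrases.

```python
def parse_query(query):
    """Split a query such as '+accounting +divorce -tax' into required and
    excluded terms; bare words are treated as optional keywords."""
    required, excluded, optional = [], [], []
    for term in query.split():
        if term.startswith("+"):
            required.append(term[1:])
        elif term.startswith("-"):
            excluded.append(term[1:])
        else:
            optional.append(term)
    return required, excluded, optional

print(parse_query("+accounting +divorce"))
# (['accounting', 'divorce'], [], [])
```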

Search 3. The third and final query was the most specific, an attempt to locate an accountant with a financial planning practice in the Albany, N.Y., area. The input was structured as follows: +accountants +"financial planning" +"albany, new york." AltaVista located 7,512 web pages, thus indicating its inability to link all the words. Of the top 10 returns, not a single site was located in Albany. Lycos was unable to provide any results. Yahoo! located 1,102 sites; but of the top 10, none were located in Albany. Northern Light found 308 items, and of the top 25, 11 were websites of professionals within the Albany geographic area.

Getting the Information Out

Accounting professionals establishing a web presence do so with the expectation of being found. How an accounting firm might go about setting up a website was described in a November 1998 CPA Journal article, and the sample website remains available: http://www.richmond.edu/~walden/cpaweb/.

Because a potential client may be using a search engine to locate "us," the manner in which search engines search and index has implications for website design.

First, the site should not present barriers that would prevent the site from being searched and indexed by the engines. That means the use of frames, image maps, and passwords should be avoided. Second, the site should be designed so as to increase the likelihood it will be found by the crawler. That translates into effective use and strategic positioning of keywords in the site's title as well as early in the document's body. For example, the CPA designation, areas of specialization, and location of the practice should be positioned in the title or first few sentences. Take care not to overuse keywords, as the site may be rejected for spamdexing. Tables should be carefully placed, because generally search engines read them first, followed by text immediately surrounding the tables.
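As a rough self-check before submitting a site, a firm could verify that its chosen keywords actually appear in the places spiders weight most heavily: the title and the early body text. The sketch below uses Python's standard html.parser module and is only a loose approximation of how an engine reads a page; the sample page content is invented.

```python
from html.parser import HTMLParser

class TitleAndTextExtractor(HTMLParser):
    """Collect the <title> text and the first words of the visible body,
    roughly the material a spider weights most heavily."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.body_words = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.body_words.extend(data.split())

html = ("<html><head><title>Smith CPA - Financial Planning, Albany</title></head>"
        "<body><p>CPA firm offering financial planning in Albany, New York.</p></body></html>")
parser = TitleAndTextExtractor()
parser.feed(html)
print(parser.title)
print(parser.body_words[:12])   # the early, heavily weighted words
```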

Another important consideration is how quickly and easily the user will be able to access the site. In general, "simpler is better." Yahoo! is the most frequently visited site on the web, and there are lessons to be learned from its design. The website's background is gray, not white, and easy to read. The site uses no Java applets and does not require plug-ins, which take time to load and may not be supported by all users. Also to be considered are survey results that indicate people looking for information do not wish to be entertained.

The focus should be on attracting the right visitors: the accountant's target audience. Attracting visitors in sheer numbers only translates into an investment in more expensive server capacity to handle the volume. Phrasing should be concise and precise to attract the proper audience.

Site descriptions can be submitted to most search engines. There are also companies that handle the distribution and submission of site descriptions to all of the major search engines for a small fee (http://www.register-it.com is one example). This streamlines the listing process by requiring only one description to be submitted to one location.

The Future of Search Engines

Not all search engines are going to be successful. To date, Yahoo! is the only one to turn a profit. If a sustainable business model is not eventually found, these companies will fail. The current strategy is to fold a search engine into a larger portal site. An increasing number of personalized services (e.g., paging services, weather reports, chat rooms) are being added by the portals to increase the likelihood of the user logging on and staying put. Portals are becoming one-stop web organizers. America Online and Yahoo! currently dominate the portal race. However, Lycos's recent acquisition of HotBot (which, for the time being, will be run independently), Disney's acquisition of Infoseek (which will power a Disney portal called the Go Network, http://www.go.com), and newcomer Northern Light (with its extensive document databases and excellent search support) are examples of different solutions to the positioning question. *


Ingrid E. Fisher is a lecturer at the State University of New York at Albany.
L. Murphy Smith, DBA, CPA, is assistant department head of the department of accounting at Texas A&M University. His website is http://acct.tamu.edu/smith.


