The Welch Company

40 Davis Court #1602; San Francisco, CA 94111; 415 781 5700

Memorandum

From:   Rod Welch
Date:   April 19, 1991
To:     File

Subject:   SDS Marketing; Product Recognition
Review Byte Magazine article re "paperless office"


This is a review of an April, 1991, Byte Magazine article beginning on page 156. I began the review April 18, 1991, per SDS 115602, 00101.

THE PAPERLESS OFFICE
====================

This is a feature article consisting of seven parts that explain how the productivity of management work can be improved by implementing a "paperless office." The main thrust is that documents can be scanned into a computer and then used more efficiently than with the current practice of copying paper and passing it around physically.

Personal Computers Produce a Lot of Paper
-----------------------------------------

One of the ironies of the personal computer revolution is the amount of paper it has produced. Back in the mid-1970s, many people thought that one of the effects of the personalization of computer technology would be the obsolescence of paper.

The effect of computers, of course, has been to increase the amount of paper we put out. When it is easy to create decent-looking output, there is little reason to exercise restraint.

[Comment: as well, when language is easy to revise (i.e., "edit"), the tendency is to do so rather than accept minor errors, with the effect of EXTENDING the time required to produce documents, or at least preventing the reduction in production time that would result if prior standards of quality were accepted.]

This has created its own problem: the care, feeding, storage, and retrieval of all those paper documents. It has also created a new technology called "document image processing" (a.k.a. the paperless office) to handle the mass of paper documents we've buried ourselves in.

The central technology of the paperless office is document image processing (DIP). Paper documents are scanned into image files stored (usually) on optical disks. The documents are then available for retrieval, viewing, and printing.
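As a concrete illustration -- my sketch, not the article's; all field names are hypothetical -- the storage model just described might look like this in Python:

  # Sketch of a document-image store as the article describes it: each
  # scanned page is an image file, and a small metadata record is the only
  # searchable handle on its contents. All field names are hypothetical.
  from dataclasses import dataclass, field

  @dataclass
  class PageImage:
      image_file: str                # e.g., a TIFF file on an optical disk
      document_id: str
      page_number: int
      index_terms: list[str] = field(default_factory=list)  # supplied by an indexer

  store: list[PageImage] = []

  def retrieve(term: str) -> list[PageImage]:
      # The system can match only what an indexer recorded; the page text
      # itself is just a picture.
      return [p for p in store if term.lower() in [t.lower() for t in p.index_terms]]

Note that retrieval depends entirely on the index terms; this limitation is the subject of Locke's article, reviewed below.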

[Comment: the most common solution for avoiding paper is to use E-Mail systems like IBM's PROFS or AT&T Mail. This presents even more problems than paper, because with paper at least some effort at classification and filing can take place, while E-Mail does not offer even rudimentary means to save and retrieve according to subject.]

It is easy to get carried away by technology, and DIP is no exception. In "The Dark Side of DIP," Christopher Locke brings into focus the problem of indexing as it relates to the storage and retrieval of stored images. Locke stresses that indexing documents is not a low-level task but rather one akin to knowledge engineering. He is also leery about assigning documents to rigid classification schemes; he views an archive as a vein of information to be mined, not a static, lifeless entity.

A completely paperless office is not yet a reality and may never become one. Regardless, the myriad ways that computer technology lets us view and link documents will enable us to see information in new and different ways.


The World of Documents
======================

Article by Gerald P. Michalski, beginning on page 159; Michalski is VP of New Science Associates, a retainer market research firm based in Southport, Connecticut.

Need for SDS
------------

On page 160 ... Imagine a research laboratory that wants to move from keeping records in spiral-bound notebooks to an integrated system that catalogs the activities of the researchers and aids them in creating innovative products. Laboratory work has changed a lot from the early days when all notes and calculations could be kept in notebooks. Now, researchers have computer models on disk.

A successful laboratory system might rely on multimedia databases to store data, tests of various kinds, electronically scanned photographs, and data streams representing the results of experiments or test tabulations. The researchers might need full-text access to each other's research papers, and the ability to comment on each test and circulate those comments to others in their field or company.

Hypothetical DIP Application: How SDS Can Be Applied
----------------------------------------------------

A hypertext information environment would help to cross-reference all the relevant works and provide a browsing mechanism to help bring previously unseen relationships to the fore. Such a system would probably require high-resolution displays, much mass storage, and high-bandwidth communications. If information to be stored on the system adhered to a common set of storage standards at the file and data-stream level, it would vastly improve the ease of interchange among tools.
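A minimal sketch of the cross-referencing idea -- mine, not the article's; the document names and link relations are made up:

  # Documents as nodes, typed cross-references as edges, and browsing as a
  # walk over the link graph. Names and relations are hypothetical.
  from collections import defaultdict

  links: dict[str, list[tuple[str, str]]] = defaultdict(list)

  def link(src: str, relation: str, dst: str) -> None:
      links[src].append((relation, dst))

  link("notebook-17", "derived-from", "model-v3")
  link("notebook-17", "cited-in", "paper-draft-2")

  def browse(doc: str) -> None:
      # Surface relationships a reader might otherwise never notice.
      for relation, dst in links[doc]:
          print(f"{doc} --{relation}--> {dst}")

  browse("notebook-17")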

Few research labs can afford to move to such a system today. Most are working to provide some of the functionality where they sense their priorities lie.

Sometimes people are not sure what document they want. They want to bump into things, learn, and get new ideas. What they require is a rich browsing environment, and that's what a well-authored hypertext system is.


Catch the Wave of DIP
=====================

Article by David A. Harvey, a computer journalist in Houston, Texas, who specializes in the technology and implementation of optical devices.

Quantity of Paper Being Used
----------------------------

On page 173... One of the challenges along the road to the integrated desktop is dealing with the piles of information that exist in paper form. American business alone produces close to 1 trillion pages of paper a year. ...enough to blanket the surface of the earth, with some to spare. In an age when concerns about deforestation, the erosion of the rain forest, and the mounting solid-waste-disposal crisis have become reality, finding ways to conserve paper has become imperative.

Cost of Manual "Filing"
-----------------------

...about 3 percent of all documents are incorrectly filed or lost, and the average cost to recover a document is around $120. Finally, the average executive spends a grand total of about four weeks per year waiting for documents to be located. You know how it goes -- you spend 20 minutes pawing through the file cabinet and a half hour looking through the various unsorted piles in the storeroom.

In the end, the goal of a thoroughly paperless office may remain unattainable. People will forever continue to scribble random notes on paper, generate ideas on index cards, and so on. But given the state of the art of DIP, people can now move closer and closer to that goal.

[Comment: the information cited here as a continuing source of paper is actually the most amenable to going paperless -- random notes and ideas on index cards can be managed as effectively as documents in SDS.]


The Dark Side of DIP
====================

Article by Christopher Locke, Director of Industrial Relations at the Robotics Institute, Carnegie Mellon University.


Need for Subject Indexing
-------------------------

Page 193 ... when an imaging system contains thousands or millions of pages, it rarely reveals the whole truth. The issue here is not accuracy but the retrieval of only those pages that are relevant to your immediate needs. If the system holds back crucial information, then, in effect, it lies.

Information systems don't lie intentionally, but all information-retrieval systems tend to lie by omission: They simply have little way of knowing what they contain relative to your queries. In document-imaging systems this deceit by omission is an inherent attribute of the technology that can cost organizations millions of dollars after all the glamour wears off.

This fatal flaw involves the definition of the contents of document images so the relevant information can be recalled on demand. The technical term for this challenge is indexing, and it applies to any form of computerized information retrieval. This seemingly straightforward concept is also employed in the back pages of any respectable reference book.

While it may seem tangential to note that works of fiction don't have indexes, this fact is very much to the point. Rightly or wrongly, publishers assume that novels will be read linearly; that is, users (readers) will become familiar with the contents of a novel by processing (reading) its pages from front to back.

This assumption doesn't hold for non-fiction; potential readers may simply want to check a single fact in a 600-page book or scan several relevant pages in search of some highly specific information. Few people have the time or the dedication to wade through an entire text for so little return. A table of contents helps, but usually not enough. Thus, the index was conceived.

Good Indexing Is Difficult
--------------------------

Creating an index is no laughing matter. Given a 600-page book on artificial intelligence, 10 different indexers will likely produce substantially different indexes. Because the process requires selectivity, and because of differences in background knowledge, different indexers notice and emphasize different keywords, phrases, concepts, and relationships.

Indexers generally do not standardize on the same indexing terms. For example, artificial intelligence, expert systems, intelligent machines, knowledge-based programming, and automated reasoning may or may not refer to the very same thing. Without a careful consideration of context and a good deal of specific knowledge of the field, it would be hard to say. Although there are only five terms in this example, there could easily be a dozen that were fundamentally synonymous. The 10 indexers aren't likely to settle on a single term; some would use one term, some another.

An individual indexer might decide not to use just one term -- grouping all related references under, say, artificial intelligence -- but rather to list each term in a separate location in the book's index, with differing page references for each.

Understanding that such an approach would needlessly confuse readers, an experienced indexer would choose a single term under which to supply page references to all related concepts -- say, again, artificial intelligence -- but would then list the other synonyms in their alphabetical index locations along with a pointer to this primary term -- for instance, "Expert Systems." Indexers must accommodate large, heterogeneous readerships whose members may be inclined to look for any one of many possible phrases when using an index to locate material about a specific concept.

Two fundamental ideas underlie the Library of Congress Subject Headings. The first is to guarantee (or at least encourage) consistency in the selection of indexing terms. The problem here is often called vocabulary control, and logically, the solution is called controlled vocabularies. Such agreed-upon lists of valid terms guide subject indexers in classifying a book's subject matter consistently, so multiple indexers will not apply multiple synonyms at random.
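Using the article's own five synonyms, a controlled vocabulary can be sketched as a simple mapping -- a minimal illustration, not an actual controlled vocabulary:

  # Vocabulary control: every synonym an indexer might reach for maps to
  # one agreed-upon preferred term, so ten indexers converge on a single
  # heading instead of five.
  PREFERRED = {
      "artificial intelligence":     "artificial intelligence",
      "expert systems":              "artificial intelligence",
      "intelligent machines":        "artificial intelligence",
      "knowledge-based programming": "artificial intelligence",
      "automated reasoning":         "artificial intelligence",
  }

  def normalize(term: str) -> str:
      # Unknown terms pass through unchanged for a human to review.
      return PREFERRED.get(term.lower(), term.lower())

  assert normalize("Expert Systems") == "artificial intelligence"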

Subject Definitions & Structure
-------------------------------

The second fundamental idea behind the Subject Headings is to establish not just terms but categories of terms, and relationships among these categories. "See also" references point to what are technically called related terms (RT). In addition, the Subject Headings classification includes pointers to broader terms (BT), which indicate more inclusive subject categories, as well as to narrower terms (NT), which name more specific subcategories.

What begins to emerge here is a quasi-hierarchical representation of a field of knowledge. Although it differs radically from a synonym list, this sort of extended conceptual taxonomy is often called a thesaurus.
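The BT/NT/RT structure can be sketched as a record type -- my sketch; the sample entry is illustrative, not an actual Library of Congress heading:

  # Each thesaurus entry carries broader (BT), narrower (NT), and related
  # (RT) terms, giving the quasi-hierarchical taxonomy described above.
  from dataclasses import dataclass, field

  @dataclass
  class SubjectHeading:
      term: str
      broader: list[str] = field(default_factory=list)   # BT
      narrower: list[str] = field(default_factory=list)  # NT
      related: list[str] = field(default_factory=list)   # RT

  thesaurus = {
      "artificial intelligence": SubjectHeading(
          term="artificial intelligence",
          broader=["computer science"],                        # BT
          narrower=["expert systems", "automated reasoning"],  # NT
          related=["cognitive science"],                       # RT
      ),
  }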

While "See also" pointers and the approved terms they indicate may be the product of arbitrary decisions, "See also" pointers entail some knowledge of the domain under consideration. In fact, the subject analysis that librarians perform to create these categories and relationships is strongly akin to what the AI literature calls knowledge engineering.

These and related library techniques have evolved over untold thousands of worker-years of deep experience in document-collection management. Chances are good, however, that you would really rather not wade into all this. Here is where document-imaging systems come in, right on cue. High technology promises to solve all your problems while at the same time enabling you to bypass such complexity.

These techniques exist because they are crucial to effectively organizing and categorizing information for retrieval.

Most product literature, and even trade-press reporting, makes little mention of indexing as it relates to document imaging, except to acknowledge that the process requires "manual" effort. But this is not the kind of manual effort required to type 80 words per minute. Rather than a task that can be handed off to low-cost temporary help, indexing of this sort requires knowledge engineering skills.

Part of the problem -- but only part -- is that after a document has been captured in an imaging system you don't actually have the real document anymore; what you have is pictures of pages.

[Comment: this is significant because documents themselves can be scanned by leafing through the pages to accidentally bump into headings and keywords that are extremely meaningful when viewed. This cannot occur when information is in a computer. The user must think of a term and ask for it.

The SDS Subject Index is helpful in this regard, because it prompts the user with keywords and ideas; plus, users can quickly try various information structures (i.e. hierarchies) to find their topic.]

Standard Indexing Systems Inadequate
------------------------------------

To index records in on-line bibliographic systems, some type of consistent information is placed into structured database fields: author, title, journal, and so on. For books, such information might include the Dewey decimal classification. Providing only this type of information puts a tremendous burden on searchers. It assumes that you know exactly what you're looking for, which is seldom the case. More often, you want to know about something and are not even sure what that something should be called. It might have 50 possible expressions, or perhaps only one that is so new or technical you've never heard the term used.

Indexing Methods
----------------

Subject indexing implies the existence and skilled use of a structured thesaurus of related concept categories into which the elements of a controlled vocabulary have been appropriately placed. It is crucial that the indexer selecting these subject headings have sufficient understanding of the document itself, the field of knowledge that forms its context, and the methodology for applying subject headings.

Document abstracts are another important method of identifying documents so that information can be retrieved fully and quickly. Abstracts summarize a document in a succinct and cogent paragraph. The full text of the abstracts is indexed so nearly every word is a retrieval hook.
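The mechanism is an inverted index over the abstract text -- a minimal sketch with made-up abstracts:

  # Full-text indexing of abstracts: every word becomes a retrieval hook
  # pointing back to the documents that contain it.
  from collections import defaultdict

  abstracts = {
      "doc-1": "Survey of expert systems for fault diagnosis",
      "doc-2": "Optical storage and document image retrieval",
  }

  inverted: dict[str, set[str]] = defaultdict(set)
  for doc_id, text in abstracts.items():
      for word in text.lower().split():
          inverted[word].add(doc_id)

  print(inverted["retrieval"])   # {'doc-2'}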

[Comment: SDS applies these concepts by providing a tool to quickly yet carefully create a categorical hierarchy of subjects. This enables the most knowledgeable people in the organization to invest their time in creating the Subject Index, and everyone to both contribute to and apply the system.

Letting each person maintain a personal Subject Index in tandem with a common Subject Index ensures that when individuals use the system they are drawing on their own perspective.]

Some on-line information services are moving toward full-text systems. The initial enthusiasm these systems inspire can give way to massive frustration for all the reasons given so far: no controlled vocabulary, no thesaurus, no clues. The problems that give rise to this frustration are technically termed "recall" and "precision": either too many documents are retrieved, or the documents retrieved are not about the needed subject.
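The two measures can be stated directly: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. A worked sketch with made-up document sets:

  retrieved = {"doc-1", "doc-2", "doc-3"}
  relevant = {"doc-2", "doc-4"}

  hits = retrieved & relevant
  precision = len(hits) / len(retrieved)  # 1/3: two retrieved documents are noise
  recall = len(hits) / len(relevant)      # 1/2: one relevant document was missed
  print(f"precision={precision:.2f} recall={recall:.2f}")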

Challenge of Creating Subject Index
-----------------------------------

What documents are about simply can't be captured by casually jotting a few words in a database header, or even by dumping all the discrete words the documents contain into an inverted file index. Even for full-text document databases, higher-level tools are required to perform the type of sophisticated semantic analysis prerequisite to the construction and maintenance of an adequate thesaurus. More important, such tools must be put into the hands of intelligent and knowledgeable people who are not averse to long hours of difficult intellectual work.

The American Library Association and the Special Libraries Association are good sources for locating such people.

Epistemology
------------

The core issue is epistemological: How do we know something? What constitutes knowledge? Although raising such questions has always been seen as unbusinesslike, if not downright flaky, look for this to change. Despite the glut of verbiage about the information age -- information society, information economy, information anxiety, and so forth -- something has definitely changed from the time when the focus of business was almost exclusively on product.

Pattern Recognition
-------------------

A principal objective in document management is to track potential yet unidentified problems and opportunities, and either head them off before they become critical or take advantage of them before the competition catches on. This objective requires "pattern recognition" of such a high caliber that it is often referred to as intuition and considered a mystical skill. Yet people do this sort of thing all the time, and more could do it better and more often if they were given software tools that enabled and encouraged deeper exploration of documentary evidence.

Effective pattern recognition requires flexible information classification systems. If you only have 10 keywords -- or make it 20, 50, 500 -- it will never be adequate. You will always be caught short if you can't quickly restructure or reindex information in light of more recent intelligence. Often this need emerges not from new data, but merely from a new way of perceiving and interpreting existing information.
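The kind of restructuring meant here can be sketched as swapping the classification layer while the documents stay put -- my illustration; the rules and texts are made up:

  # Reindexing in light of new intelligence: revise the category rules and
  # reapply them; the underlying documents are untouched.
  documents = {
      "memo-9": "schedule slip on the optical disk subsystem",
      "memo-12": "competitor announced a document imaging product",
  }

  def classify(text: str, rules: dict[str, list[str]]) -> list[str]:
      # Assign every category whose keywords appear in the text.
      return [cat for cat, keys in rules.items() if any(k in text for k in keys)]

  old_rules = {"operations": ["schedule", "subsystem"]}
  new_rules = {"risk": ["slip", "competitor"], "market": ["announced", "product"]}

  for doc_id, text in documents.items():
      print(doc_id, classify(text, new_rules))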

Such restructuring requires sophisticated conceptual maps that you can continually debug and incrementally extend to make sense of the words, which are themselves simply tokens of some presumably real territory beyond the document.