000306; Colloquium at Stanford

Colloquium at Stanford
The Unfinished Revolution
Memorandum
Date: Mon, 06 Mar 2000 18:54:46 -0800
From: Eric Armstrong
eric.armstrong@eng.sun.com
Reply-To: unrev-II@onelist.com
To: unrev-II@onelist.com
Subject: DKR: Collab Doc Rqmts, v0.2
Requirements for a collaborative-documents system, version 0.2. Some material has been added, but not a whole lot has changed.

Version History
---------------
0.2 Refinements
0.1 Initial Version

Overview
========
This is a lengthy document aimed at adducing the
requirements for a subset of an eventual Dynamic
Knowledge Repository (DKR). The subset described
is for a collaborative document system, which Doug
describes as an "Open HyperDocument System" (OHS).
The goal of this document is to show how such a system
fits into a DKR framework, detail its requirements,
and point to a couple of extensions that move it in
the direction of a full DKR.

 


This document has the following sections:

  * Long-Range Goals
  * Motivation
  * Starting Points
  * General Characteristics
  * Operational Requirements
  * Summary of Data Structure Requirements
  * Future: Using an Abstract Knowledge Representation

History

  * v0.2 Thoughts from UnRev-II discussions and other additions
  * v0.1 First Draft

 


Long-Range Goals
================
A fully functional DKR will need to manage many different
kinds of things:

   * documents
   * abstract knowledge representations
     (and inference engines)
   * predictive models
   * multimedia objects
   * programs of various kinds
     (search engines, simulations, applets)
   * data
     (spreadsheet files, database tables)

It is likely, too, that different kinds of problems will require
information to be organized in fundamentally different ways. For
example, a DKR devoted to the energy problem might have major headings
for the problem statement, real world data, tactical possibilities,
strategic alternatives, and predictive models. On the other hand, a DKR
devoted to building the next-generation DKR might have sections for
requirements, design, implementation, testing, bug reports, suggestions,
schedules, and future plans.

Since the general outline of a DKR seems to depend on the problem domain it
targets, it seems reasonable to focus attention on the elements that all DKRs
have in common.

This set of requirements will focus on what is perhaps the major common
feature: Documents -- in particular, Collaborative Documents.

Other important areas that will need attention include the integration of
multimedia objects (including animations, simulations, audio, video, and the
like) as well as the critical functions of abstract knowledge representation,
inference engines, model-building functions, and the integration of other
executable programs. But here, we'll focus on Collaborative Documents.

 


Motivation
==========
A wide variety of email and forum-based discussions occur
on a host of topics every day. In each of these discussions, important
information frequently surfaces, but that information is hard to capture
where you need it.

 


Document production systems, on the other hand, simplify the task of
creating complex documents but make it hard to gather and integrate
feedback.

For example, the DKR discussions have identified several possible
starting points for such a system. That kind of feedback occurs
naturally in an email system, as opposed to a document production
system, but each of the pointers was buried in a separate email. It
required a lengthy search to gather them together (below), and the list
may not even be complete!

To act as a foundation for a DKR, a Collaborative Document System (CDS?)
needs to combine the best features of:

  * Directory tree / outlining programs
  * Hypertext (links and formatting)
  * XML (inline references and other features)
  * Email systems
  * Forums and Email Archives
  * Document Database
  * Versioning Systems
  * Difference Engines
  * Search Engines

 


Starting Points
===============
In the DKR discussion, we've seen pointers to several possible starting
points for such a system. Those are contained in the References post, in
the Bootstrap section. (The many possible starting points listed in the
post desperately need short synopses and evaluations.)

 


General Characteristics
================
The lengthy list of starting points, the difficulty of creating it, and
the rapidity with which it goes out of date, combine to suggest several
obvious requirements for the system: It needs to be composed of
information nodes that are hierarchical, mailable, linkable, and
evaluable (more on those subjects in a moment).

Each of those requirements leads in turn to other requirements. The
major requirements are listed here and explained below:

General Functional Requirements

  * Hierarchical
  * Revisable
  * Versionable
  * Mailable
  * Multiple-Containment
  * Distributed
  * Administratable
  * Differencable
  * Linkable
  * Link-Typable
  * Evaluable
  * Collaborative
  * Attributive
  * Accelerative

General Systemic Requirements:

  * Open
  * Extensible
  * Secure

DKR Requirements

  * Firewalled
  * Didactic (a teaching device)

The next three sections discuss those requirements in greater detail.
Following that, there are three shorter sections:

  * Operational Requirements -- Highlights
  * Data Structure Requirements
  * Future: Using an Abstract Knowledge Representation

 


General Functional Requirements
=====================
These are the general requirements for how the system must operate, to
be effective.

Hierarchical
------------
This document, like the list of starting points mentioned earlier, is
heavily hierarchical in nature -- as are most technical documents. These
facts further underscore the need for a hierarchical system.

For example, this email message should exist in outline form. It should
be easy to add and remove entries to various sections: for example, the
list of starting points given above.

However, the hierarchy should function using XML-style "entity
references" that copy the target contents into the displayed document,
"inline". That permits multiple references to the same node. The result
is effectively a lattice of information nodes, where any one view of it
is hierarchical.
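
As a rough illustration only, here is a minimal Python sketch of that
idea. The names (Node, render) are hypothetical and not part of any
existing system. Children are held by reference rather than by copy, so
one node can appear "inline" under several parents, and each rendered
view is still a plain outline. The sketch assumes the references never
form a cycle.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical information node; children are references, not copies."""
    text: str
    children: List["Node"] = field(default_factory=list)


def render(node: Node, depth: int = 0) -> List[str]:
    """Expand a node and everything it references into indented outline lines."""
    lines = ["  " * depth + node.text]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# One shared node, referenced from two different parents.
starting_points = Node("Starting Points", [Node("XMLTreeDiff")])
requirements = Node("Requirements", [starting_points])
references = Node("References", [starting_points])

# Editing the shared node is visible in every view that includes it.
starting_points.children.append(Node("OpenGRiD"))
print("\n".join(render(requirements)))
print("\n".join(render(references)))
```

Each call to render produces a strictly hierarchical view, even though
the underlying structure is a lattice.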

Revisable
----------
Although "hard" links to objects will be needed at times, in most cases
the link to the "Requirements Document" should be a "soft" link -- that
is, an indirect link that points to the latest version. That means never
having to worry about looking at an old version of the spec.

Versionable
------------
Each node in the hierarchy needs to be versioned, so that previous
information is available. In addition, the task of displaying
differences becomes essentially trivial.
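
The difference between "soft" and "hard" links, given per-node
versioning, might be sketched like this in Python. The class names
(VersionedNode, Link) and the idea of using a list index as the version
number are assumptions made for the example, not a proposed design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VersionedNode:
    """Hypothetical node that keeps every version; index 0 is the first."""
    versions: List[str] = field(default_factory=list)

    def revise(self, text: str) -> int:
        """Store a new version and return its index."""
        self.versions.append(text)
        return len(self.versions) - 1


@dataclass
class Link:
    """'Hard' when pinned to a version, 'soft' when version is None."""
    target: VersionedNode
    version: Optional[int] = None

    def resolve(self) -> str:
        if self.version is None:
            return self.target.versions[-1]        # soft: always the latest
        return self.target.versions[self.version]  # hard: the pinned version


spec = VersionedNode()
v0 = spec.revise("Requirements, v0.1")
hard = Link(spec, version=v0)
soft = Link(spec)
spec.revise("Requirements, v0.2")

print(hard.resolve())  # Requirements, v0.1
print(soft.resolve())  # Requirements, v0.2
```

Because old versions are never overwritten, the soft link costs nothing
extra: it is simply a link with no version pinned.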

Mailable
---------
It must be possible to "publish" the whole document or sections of it by
"posting" it. It must also be possible to create replies for individual
sections, and then "post" them all at one time.

Multiple-Containment
--------------------
At a minimum, every node in the system has two hierarchies descending
from it. One is a list of content nodes that comprise the hierarchical
document. The other is a list of reviewer comments. (Some comments will
be specific to the information in that node, others will be intended as
general comments for that section of the document.)

Other sub-element lists may be found to be desirable in the future, so the
system should be "open-ended" in allowing other sublists to be added,
identified, and accessed.
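
One possible way to keep the sublists "open-ended" is to store them
under names rather than in fixed fields. The Python sketch below assumes
that approach. The class and the "evaluations" sublist are hypothetical
and only there to show a third list being added later.

```python
from collections import defaultdict
from typing import DefaultDict, List


class Node:
    """Hypothetical node whose sub-elements live in named, open-ended sublists."""

    def __init__(self, text: str) -> None:
        self.text = text
        # "content" and "comments" are the two lists every node has;
        # any other name is created the first time it is used.
        self.sublists: DefaultDict[str, List["Node"]] = defaultdict(list)

    def add(self, kind: str, child: "Node") -> None:
        """Append a child to the sublist with the given name."""
        self.sublists[kind].append(child)

    def kinds(self) -> List[str]:
        """Identify which sublists this node currently has."""
        return sorted(self.sublists)


section = Node("Distributed")
section.add("content", Node("Use local copies instead of a central repository."))
section.add("comments", Node("What about offline use?"))
section.add("evaluations", Node("4 out of 5"))  # a later, hypothetical sublist

print(section.kinds())
```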

Distributed
-----------
Rather than using a central "repository", the system should employ the
major strengths of email systems, namely: fast access on local systems
and the robust nature of the system as a result of having redundant
copies on many different systems. The system will be more space
intensive than email systems, but storage costs are dropping
precipitously, and future technologies paint an even brighter picture.

Administratable
----------------
To mitigate the short-term need for storage space, it should be possible
to set individual storage policies. For example, a user will most likely
not want to keep previous versions of any documents they are not
personally involved in authoring.

It must also be possible to add names to the authoring list. Name
removal should probably be limited to the original author. For those
cases when the original author is no longer part of the system, it
should be possible to make a copy of the document and name a new primary
author.

Differencable
--------------
When a new version of a document arrives, differences are
highlighted. Old-version information becomes accessible through links
(if saved). Differences are always against the last version that was
visited. If a section of the document was never visited, the most recent
version of that section is displayed on the first visit. If several
iterations have taken place since the last visit, the cumulative
differences are shown. (Again, node-versioning makes this user-friendly
feature fairly trivial.)
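
To make the "last visited version" rule concrete, here is a small Python
sketch using the standard difflib module. The NodeHistory class and the
reader name are hypothetical. A real system would presumably use a
structure-aware differ rather than a line-based one.

```python
import difflib
from typing import Dict, List, Optional


class NodeHistory:
    """Hypothetical: all versions of one node, plus each reader's last visit."""

    def __init__(self) -> None:
        self.versions: List[str] = []
        self.last_visited: Dict[str, int] = {}

    def publish(self, text: str) -> None:
        self.versions.append(text)

    def visit(self, reader: str) -> List[str]:
        """Return what to show this reader, then record the visit."""
        latest = len(self.versions) - 1
        seen: Optional[int] = self.last_visited.get(reader)
        self.last_visited[reader] = latest
        if seen is None:
            # Never visited: show the most recent version with no diff marks.
            return self.versions[latest].splitlines()
        # Cumulative differences between the last-visited and latest versions.
        return list(difflib.unified_diff(
            self.versions[seen].splitlines(),
            self.versions[latest].splitlines(),
            fromfile=f"v{seen}", tofile=f"v{latest}", lineterm="", n=0))


node = NodeHistory()
node.publish("Nodes are hierarchical.")
first = node.visit("alice")
node.publish("Nodes are hierarchical and mailable.")
node.publish("Nodes are hierarchical, mailable, and linkable.")
changes = node.visit("alice")
for line in changes:
    print(line)
```

The reader skipped the middle version, so the diff runs straight from
the first version to the third and the intermediate wording never
appears.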

   Starting Points
   ---------------
   XMLTreeDiff at IBM Alphaworks (Lars Martin)

Linkable
---------
Clearly, support for web links is desirable, as shown by the links to the
various possible starting points in the References post. [Note: Each of
those should be evaluated against this requirements list, and used to
modify these requirements.]

Link-Typable
------------
It should be possible to define types for links. At a minimum, that
means indicating whether link traversal should occur in the same window
or in a new one. Other uses are likely to be found for link types,
however, including XLink-style "lists of links".

For material that is included "in line" in the original document, typing
implies the ability to choose which kinds of linked-information to
include. For example, in addition to the current version, one might
choose to display previous versions and/or all commentary.
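
A minimal Python sketch of choosing which kinds of linked information to
include might look like the following. The three link types and the
function name are assumptions for illustration. The actual set of types
is an open question.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class LinkType(Enum):
    """Hypothetical link types; the real list is yet to be defined."""
    CURRENT = "current version"
    PREVIOUS = "previous version"
    COMMENT = "commentary"


@dataclass
class TypedLink:
    kind: LinkType
    text: str  # the material the link would pull in-line


def inline(links: List[TypedLink], show: Set[LinkType]) -> List[str]:
    """Return the linked material whose type the reader chose to display."""
    return [link.text for link in links if link.kind in show]


links = [
    TypedLink(LinkType.CURRENT, "Nodes must be versioned."),
    TypedLink(LinkType.PREVIOUS, "Nodes should be versioned."),
    TypedLink(LinkType.COMMENT, "Why not version whole documents?"),
]

print(inline(links, {LinkType.CURRENT}))
print(inline(links, {LinkType.CURRENT, LinkType.COMMENT}))
```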

Evaluable
----------
The many possible starting points in the References list highlight the
need for evaluability. It should be possible, not only to reply with a
comment on any item in those lists, but also to add an evaluation, much
as Amazon.com keeps evaluations for books. That feature is arguably
their greatest contribution to ecommerce, and the DKR should make use of
it. It should also be possible to order list items using relative
evaluations. That lets the most promising starting point float to the
top of the list.

Not all lists should be ordered by evaluation, however. For example, the
sequence of requirements has been chosen to provide the most natural
"bridge" from one to the next. So evaluation-ordering must be an option.

Ideally, it should also be possible to "weight" an evaluation, perhaps
by adding a "yay" or "nay" to an existing evaluation.

When displaying an evaluation, where evaluators can choose a value from
1..5, it might make sense to display the average, the number of
evaluations, and the distribution. A distribution like
  10 2 1 2 10
for example, would show a highly polarized response, even though the
"average" was 3.
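
The summary described above is easy to compute. As a sketch, assuming
ratings arrive as a plain list of integers from 1 to 5 (the function
name is made up for the example):

```python
from collections import Counter
from typing import Dict, List


def summarize(ratings: List[int]) -> Dict[str, object]:
    """Summarize 1..5 ratings as average, count, and per-value distribution.

    Assumes at least one rating; an empty list would divide by zero.
    """
    counts = Counter(ratings)
    return {
        "average": sum(ratings) / len(ratings),
        "count": len(ratings),
        "distribution": [counts[value] for value in range(1, 6)],
    }


# The polarized example: 10 ones, 2 twos, 1 three, 2 fours, 10 fives.
ratings = [1] * 10 + [2] * 2 + [3] * 1 + [4] * 2 + [5] * 10
summary = summarize(ratings)
print(summary)
```

For that input the average is exactly 3.0 across 25 evaluations, which
is why the distribution needs to be shown alongside it.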

  Starting Points
  ---------------

  *  Architecture for Internet searching, categorization, and ranking
            http://www.cs.sunysb.edu/~maxim/OpenGRiD/

Collaborative
-------------
The system must increase the ability of multiple people, working
collaboratively, to generate up-to-date and accurate revisions.

For any given document, there are several classes of interaction:

  * receive
  * comment
  * suggest
  * author

The first group consists of people who receive the document and do
nothing else with it. (Just trying to be complete here.) The second
group consists of people who send back comments on different sections.
That feedback will typically be used in future versions.

The third group consists of people who suggest an alternative wording or
organization. Those "suggestions" take the form of a modified copy of
the original. One of the document authors may then agree to use that
formulation in place of the original, or may simply keep it as
commentary.

The fourth group consists of the fully-collaborative authoring group. The
original author must be able to add other individuals to the document,
or to subsections of it. (An author registered for a given node has
authoring privileges throughout the hierarchy anchored at that node.)
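
The rule that authoring privileges extend throughout the hierarchy
anchored at a node can be sketched as a walk up the parent chain. This
Python example is a simplification: it assumes each node has a single
parent, and the class, method, and user names are hypothetical.

```python
from typing import Optional, Set


class Node:
    """Hypothetical node that knows its parent and its registered authors."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.authors: Set[str] = set()

    def may_edit(self, user: str) -> bool:
        """True if the user is registered on this node or on any ancestor."""
        node: Optional["Node"] = self
        while node is not None:
            if user in node.authors:
                return True
            node = node.parent
        return False


doc = Node("Requirements")
section = Node("Operational Requirements", parent=doc)
bullet = Node("Editing", parent=section)

doc.authors.add("eric")      # original author of the whole document
section.authors.add("lee")   # hypothetical co-author of one section only

print(bullet.may_edit("lee"))   # True: inherited from the section
print(doc.may_edit("lee"))      # False: the section is below the root
```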

Attributive
----------
Every information node that is created should be automatically
attributed to its author. When a new version of a node is created, all
of the people who sent comments should be contained in a "reviewer"
list. When a suggestion is accepted, the author of the suggested node
should go into a "contributor" list in the parent node and be added to
the "author" list for the current node. It should be possible to
identify all of the reviewers, contributors, and authors for the whole
document and for each section of it.
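
Here is one way those attribution rules might be expressed, as a Python
sketch. The class, method, and user names are hypothetical, and the
sketch assumes a single parent per node to keep it short.

```python
from typing import Dict, List, Optional, Set


class Node:
    """Hypothetical node carrying the three attribution lists."""

    def __init__(self, author: str, parent: Optional["Node"] = None) -> None:
        self.parent = parent
        self.children: List["Node"] = []
        self.authors: Set[str] = {author}      # automatic attribution
        self.contributors: Set[str] = set()
        self.reviewers: Set[str] = set()
        if parent is not None:
            parent.children.append(self)

    def new_version(self, commenters: Set[str]) -> None:
        """Everyone who sent comments becomes a reviewer of the new version."""
        self.reviewers |= commenters

    def accept_suggestion(self, suggester: str) -> None:
        """Credit the suggester as author here and contributor in the parent."""
        self.authors.add(suggester)
        if self.parent is not None:
            self.parent.contributors.add(suggester)

    def credits(self) -> Dict[str, Set[str]]:
        """Collect all three lists for this node and everything below it."""
        total = {"authors": set(self.authors),
                 "contributors": set(self.contributors),
                 "reviewers": set(self.reviewers)}
        for child in self.children:
            for role, names in child.credits().items():
                total[role] |= names
        return total


doc = Node("alice")
section = Node("alice", parent=doc)
section.new_version({"bob"})
section.accept_suggestion("carol")
print(doc.credits())
```

Calling credits on the document root answers the "whole document"
question, and calling it on any section answers it for that section.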

Accelerative
-------------
When new versions of a document are created, material would be included
by pointing to it, keeping attributions intact. The system must
accelerate that process. It should be possible to start a new document
in one of two ways:

  * Copy the original document intact to create a new version
    of it. (Deletes and rearrangements then affect the new
    document, while the original version remains intact.)

  * Create a document and designate it as the "target" so that,
    as you review other documents, selecting parts of them and
    issuing the "copy" command automatically stuffs those parts
    into the target.

 


General Systemic Requirements
====================
These are requirements for the system as a whole.

Open
------
The system must be "open" in the sense that a user is not constrained to
using a particular editor, email system, or central server. The
specifications for interaction with the system should be freely
available, along with a reference implementation to use as a basis. As
much as possible, conformance with existing standards (XML, XHTML, HTTP,
email) is desirable. (The tricky decisions, of course, will be between
required features and standard protocols that don't support them.)

Extensible
----------
The server and client systems that implement the DKR must also be fully
*extensible*. In other words, the same characteristics of hierarchy,
versioning, and revisability (use of most recent version) that apply to
the documents must apply to the system itself.

That extensibility can be accomplished with a "dispatch table" that
names the class to use for each kind of object that needs to be created.
In conjunction with open sourcing, that architecture allows a user to
extend (subclass) an existing class and then use the extended version in
place of the original. In addition, upgrades can occur dynamically,
while the system is in operation, while allowing for modular downgrades
when extensions don't work out.
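
A dispatch table of that kind is simple to sketch. The Python below is
an illustration under assumed names (TextNode, RatedTextNode, create),
not a description of any existing implementation.

```python
from typing import Dict, Type


class TextNode:
    """Stand-in for an original system class."""
    def describe(self) -> str:
        return "plain text node"


class RatedTextNode(TextNode):
    """A user's extension (subclass) of the original class."""
    def describe(self) -> str:
        return "rated " + super().describe()


# The dispatch table names the class to use for each kind of object.
dispatch: Dict[str, Type[TextNode]] = {"text": TextNode}


def create(kind: str) -> TextNode:
    """Create an object by looking up its class at call time."""
    return dispatch[kind]()


before = create("text").describe()

original = dispatch["text"]
dispatch["text"] = RatedTextNode      # upgrade while the system is running
during = create("text").describe()

dispatch["text"] = original           # downgrade if the extension misbehaves
after = create("text").describe()

print(before, "|", during, "|", after)
```

Because the lookup happens every time an object is created, swapping
one table entry is enough to upgrade or roll back a single module.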

   Starting Points
   --------------

   * Warner Ornstine's Cords/Plugs/Sockets Architecture
            http://extende.sourceforge.net/arch.htm

 


Secure
------
Security in such a system becomes an issue, unfortunately. The system
should employ whatever mechanisms exist or can be constructed to help
prevent trojan horse attacks, back door attacks, and other security
breaches in an open source system.

For example, Christine Peterson described Apache's process
as having something like 45 reviewers, 3 of whom recommend
the inclusion and none of whom object, before new code is
added to the system.

 


DKR Requirements
=============
These additional requirements begin to move the system towards a DKR.

Firewalled
---------
With respect to security, there is also the issue of "firewall"
capability. The DKR must allow professionals in many different
organizations to contribute and share knowledge. That knowledge may
largely be in the form of published papers and the means to
locate and access them, but it represents a high degree of
inter-organizational co-operation, at the level of the individual
professional.

The DKR will also be handy for individual projects, though. The
mechanisms will support collaborative designs and "on demand" education
as to corporate procedures, for example. But that information must
remain *inside* the firewall, inaccessible to
competitors.

In the ideal scenario, it will also be possible to "publish" information
stored in the inner repository at strategic times, rather like
publishing a technical paper that gives the design of the system. But
until then, the firewall must remain intact.

 


Didactic (DKR)
----------------
Eventually, the system must become a *teaching* tool. It must follow the
concept of "Education on Demand", intelligently supplying the user with
the information needed, and educating that user, whatever their initial
background. (Within reasonable limits.)



 


Operational Requirements -- Highlights
======================================
This is an outline of functional operations for the system:

  * Editing

    --Add, change, delete, move nodes
    --Copy nodes
      ..node alone, current-version subtree, whole subtree
    --Link (indirect, "soft" links, and direct "hard" links)
    --Automatic versioning
    --Automatic attribution

  * Email

    --Post
      ..Increment version number for future edits
      ..Deliver to group via server

    --Receive
      ..Automatically diff against last visited version of
        each node
      ..Highlight diffs
      .."Go to next unread" feature

  * Attribution

    --New node:    author=currUser, lastEditor=currUser
    --Copy node:   all lists unchanged
    --Modify node: lastEditor=currUser
    --Copy text:   new node created, all lists copied
    --Paste text:  Author-list + Contributor list from the
                   clipboard node merge into the contributor
                   list for the current node

      NOTE:

      This is a highly imperfect solution to the attribution
      problem. Copying a single word out of a very large node
      stands to create a highly-inaccurate contributor list.
      On the other hand, creating a new node and pasting all
      of the text from the old one would drop attributions
      altogether.

      A better alternative, if feasible, would be attributions
      attached to every phrase in the node. That requirement
      creates a third category of containment for the node,
      consisting of the text that makes it up. When originally
      created, there would only be one long phrase, and its
      author. When others make changes, the text would be
      broken up into segments. That's the same architecture
      most editors use internally, anyway, but it would require
      storing a lot more information, putting it together to
      display the node, and taking it into account when copying
      and pasting.

  * Phantom Nodes

    --Since it is possible to receive comments on nodes that
      have been deleted from the current (not yet published)
      draft, the system must maintain "phantom" nodes that
      can be used to collect such comments.

    --Phantom nodes are invisible until a comment is received.
      Theoretically, they can disappear once the current version
      is posted (since future comments will be on that version).
      In practice, though, there The comments
      themselves are always stored under the original node.

    --As an alternative, the system could operate like the CRIT
      system, where such comments go to the end of the document.

  * Trash Bin

    --Each node needs a trash bin that collects nodes which
      are deleted from under it. Trash bins are never emptied,
      except by explicit action requiring multiple explicit
      confirmations.

  * Distributed Editing Control

    --The comment/version-publishing system means that locks
      are not required for single-author documents. But for
      multiple authors to collaborate, it must be possible to
      prevent editing conflicts.

    --One possibility is to implement distributed locks.
      The major issue there is handling communication outages.

    --An equally viable possibility may be to allow
      simultaneous edits and detect their occurrence
      when a new version is received. The competing
      versions can then be displayed side-by-side
      along with user-selectable merge options.

    --Detection of competing versions may require something
      other than simple version numbers. Or perhaps the
      versionID would consist of the version number combined
      with the ID of the current writer.

    --TrashBin nodes must maintain a pointer to the phantom
      that is left behind after deletes, or to the location
      at which to create such a phantom.

  * Version Identification

    --A monotonically increasing version#, combined with the
      ID of the most recent editor *should* be sufficient to
      identify changes in a node. It may be that a timestamp
      works better, though. Even a timestamp will need to be
      combined with the most-recent-editor-ID, though, to
      identify competing versions created by different authors.
      (Although matching a millisecond-timestamp is improbable,
      it is not impossible.)

    --The version number for a node would be the maximum of
      the version numbers for all content subnodes. When
      edited, the new version number would either be a timestamp
      or the parent version# + 1. (All parents would then be
      adjusted.)

    --TimeStamps probably make more sense, since edits using
      the algorithm above will make the version# "jump around"
      quite a bit.

    --In either case, a more "user-friendly" version number is
      needed for the document as a whole.

    --The system needs to account for a "hierarchy of versions"
      of at least two levels. The first level is for a set of
      documents. (All documents for version 2.0 of the system,
      for example.) The second level is the version of the
      document itself. (Version 3 of the 2.0 Requirements Doc).
      (How deep should it go? Large subsections might have
      versions, as well. Below that?)
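
As a sketch of the version-identification idea in the outline above, the
Python below pairs a version number with the ID of the most recent
editor and uses the pair to spot competing versions. The names are
hypothetical, and the check is deliberately simple: it only catches two
editors producing the same version number, not longer chains of
unsynchronized edits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionId:
    """Hypothetical version ID: version number plus most recent editor."""
    number: int   # monotonically increasing version#
    editor: str   # ID of the most recent editor


def next_version(current: VersionId, editor: str) -> VersionId:
    """The ID an editor produces when saving a change to the current version."""
    return VersionId(current.number + 1, editor)


def competing(local: VersionId, received: VersionId) -> bool:
    """Same version number from different editors means simultaneous edits."""
    return local.number == received.number and local.editor != received.editor


v1 = VersionId(1, "ann")
# Two authors edit their own copies of v1 before seeing each other's work.
from_ann = next_version(v1, "ann")
from_bob = next_version(v1, "bob")

print(from_ann.number == from_bob.number)  # True: the number alone cannot tell
print(competing(from_ann, from_bob))       # True: the editor ID exposes it
print(competing(v1, from_ann))             # False: an ordinary successor
```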

 


Data Structure Requirements
===========================
Each node in the system should be able to track the following
information:

  * Unique identifier (so links always work)
  * List of Content sub-elements
  * List of Comment sub-elements
  * List of elements comprising the content-text,
    with attributions (if implemented)
  * Version-identifier for the node
  * Version-identifier for the content sublist
  * Author list
  * Contributor list
  * Reviewer list
  * Last editor
  * Evaluation list
  * Evaluation summary
  * Distributed Lock (unless Competing Versions is chosen)
  * Trash Bin
  * isPhantom identifier
  * pointer to own phantom
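
Pulled together, the list above might translate into a record like the
following Python dataclass. The field names and types are assumptions
chosen for the sketch (for example, integer version identifiers and a
single lock-holder field). Only the set of fields comes from the list.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical record holding the per-node information listed above."""
    # Unique identifier (so links always work)
    node_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    content: List["Node"] = field(default_factory=list)    # content sub-elements
    comments: List["Node"] = field(default_factory=list)   # comment sub-elements
    # Content text as (phrase, author) segments, if per-phrase
    # attribution is implemented
    segments: List[Tuple[str, str]] = field(default_factory=list)
    version: int = 0                   # version identifier for the node
    content_version: int = 0           # version identifier for the content sublist
    authors: List[str] = field(default_factory=list)
    contributors: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)
    last_editor: Optional[str] = None
    evaluations: List[int] = field(default_factory=list)   # e.g. ratings 1..5
    evaluation_summary: Optional[float] = None
    lock_holder: Optional[str] = None  # distributed lock, if locks are chosen
    trash: List["Node"] = field(default_factory=list)      # trash bin
    is_phantom: bool = False
    phantom: Optional["Node"] = None   # pointer to own phantom


def new_node(user: str, text: str) -> Node:
    """Apply the 'New node' rule: author and last editor are the creator."""
    return Node(segments=[(text, user)], authors=[user], last_editor=user)


node = new_node("eric", "Each node in the hierarchy needs to be versioned.")
print(node.node_id, node.authors, node.last_editor)
```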

 


Future: Using an Abstract Knowledge Representation
==================================================
A hierarchical system is created from only two relationships:

   * Containment
   * Ordering

If progress is made in the pursuit of abstract knowledge
representations, the whole of the collaborative document system may
well migrate into a knowledge representation, using those two
relationships. The document management system would then be a subset of
a much larger knowledge management repository.
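
To show that containment and ordering really are enough, here is a
Python sketch that rebuilds an outline from nothing but those two
relationships, stored as plain pairs. The relation names and the sample
data are made up for the example, and "precedes" is taken to mean
"immediately precedes" among siblings, with no cycles.

```python
from typing import List, Set, Tuple

# Hypothetical sample data: two relationships as (subject, object) pairs.
contains: Set[Tuple[str, str]] = {
    ("Doc", "Goals"), ("Doc", "Motivation"), ("Doc", "Starting Points"),
    ("Goals", "Documents"),
}
precedes: Set[Tuple[str, str]] = {
    ("Goals", "Motivation"), ("Motivation", "Starting Points"),
}


def children(parent: str) -> List[str]:
    """Children of a node, in the order given by the 'precedes' pairs."""
    kids = {c for (p, c) in contains if p == parent}
    nxt = {a: b for (a, b) in precedes if a in kids and b in kids}
    ordered: List[str] = []
    # Start from children that nothing precedes, then follow each chain.
    for start in sorted(kids - set(nxt.values())):
        node = start
        while node is not None:
            ordered.append(node)
            node = nxt.get(node)
    return ordered


def outline(node: str, depth: int = 0) -> List[str]:
    """Rebuild the indented outline below a node from the two relations."""
    lines = ["  " * depth + node]
    for child in children(node):
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline("Doc")))
```

Adding a third relationship would mean adding another set of pairs,
which is one reason the representation seems so easy to extend.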

 


One wonders what such a system will look like after it begins to be
extended with thousands of additional relationships.

It boggles the mind.
Sincerely,
Eric Armstrong
eric.armstrong@eng.sun.com