000423; Eric Armstrong: Towards an Atomic Data Structure

Eric Armstrong
eric.armstrong@eng.sun.com

Memorandum

Date: Sun, 23 Apr 2000 21:50:50 -0700

From: Eric Armstrong
eric.armstrong@eng.sun.com
Reply-To: unrev-II@egroups.com

To: unrev2

Subject: Towards an Atomic Data Structure

Background: Object-Oriented (O-O) Terminology

A "class" is a template you use to construct an "object". The class defines the behaviors that members of the class are capable of, and defines the data items that distinguish one member of the class from another. Individual objects use the behaviors defined by the class, and have specific data values that make them (usually unique) "individuals". The Car class, for example, might define behaviors for start and stop. It also might include data types for model, color, and acceleration characteristics. So a Red Ferrari would have one set of data items, while a Green Edsel would have another set. But both would have start and stop behaviors.

Text Nodes

The fundamental unit of a DKR is an item of information. Since the ideal in writing is to have "one idea per paragraph", an "information node" can be thought of as a paragraph of text. Headings stand apart from other text, as well, so a heading is a special (short) paragraph, or information node.

Node behaviors are defined in a class (object template). Every text node must contain an attribution -- a pointer to the author, or an identifying string. A copy of that node may be edited, which suggests the need for a split operation, for example. After node is split into one or more fragments, and edit operation could replace some fragments or insert new ones that have a different author. Some of the operations appropriate to a node might therefore include split, delete, replace, and insert.

Note that when the node is split, two objects exist where one did before. Every node must therefore be capable of being the root of a subtree. Although it may start out life as a simple node that contains or points to an item of text, it must also be capable of pointing to a list of text elements. (That list might also include markup elements, like HTML bold tags: .) Since each item in that list may itself point to a list of subitems, the resulting structure is a tree.

Categories

Since every node will contain some kind of content (text initially, but eventually media objects), it is relatively clear that the fundamental class should provide operations for split, delete, replace, and insert, and that it should allow for a tree-shaped substructure.

However, nodes may also acquire other subelements, such as comments. So a node needs the ability to serve as the root of multiple subtrees.

In addition, some categories require special behaviors and data structures. For example, if a node is "rated" then it needs the ability to acquire a subtree consisting of evaluations (ratings). It also needs the ability to "average" the evaluations it has received for presentations.

Dynamic Behavior

Following classic O-O principles, it is tempting to construct separate classes for evaluations, comments, and information nodes. But that process moves us further away from identifying an "atomic" structure. More importantly, it runs into the problem that node-categorization (classification) is a dynamic process.

If node is not "rated", then it should not be possible to add evaluations to it. If it is "rated", it should be possible to add evaluations, and to have the average of those evaluations summarized as the node's rating. Once a rating has been added, the node can no longer be unrated. But until then, the node could be switched at will back and forth from "unrated" to "rated".

That dynamic classification poses problems for a static object oriented class structure. Unless some language system exists that allows behaviors to be added on and taken away in a dynamic classification process, the only alternative is to build all of the behaviors into the fundamental class -- making the identification of an "atomic" data structure all the more important.

Extensibility

If the behaviors are defined in a single class, then adds to the system by extending that class. To put that class into effect, the system must be designed to create nodes using a "factory method".

To create a new information object, then, you don't merely use an existing class to make one. Instead, you ask the factory method to create one and hand it back to you. Runtime parameters can then configure the "factory" to tell it which class to use. So, if the BasicNode class is the standard class for an information node, and if you create an ExtendedNode class, then the factory would be instructed (via a command line switch or configuration file) to use the ExtendedNode class when constructing an information object.

Versioning

The basic structure for a node, then, is that it contains a pointer to a previous version of itself. For versioning to be useful, however, it must be possible for old links to acquire the newest version. That requires an indirect link -- a "virtual node". At a minimum, then, the data structure must allow for two atomic types: The virtual node that points to the latest version, and the node itself.

Some actions like edits, rearranging sublist items, or deleting those items would produce a new version. It should be possible to perform multiple operations of that kind without having a separate (persistent) version number. When "published", the node would have the next sequential version number, regardless of the number of changes. (For "undo", however, multiple non-persistent "revisions" would be kept, so that changes can be backed out. When published, all but the last revision would be removed.)

Data and Sublists

The node must contain, at a minimum, sublists (or subtrees) for content (text nodes), for comments (text nodes), and evaluations (text nodes with a rating). It may need to keep an author-list (people who are authorized to perform direct edits). It also needs a list of the categories to which it belongs. (Implementing categories as lists provides fast searching, as demonstrated by the Traction system). And it needs a substructure list. (For a heading, for example, the "content" would be the text of the heading while the "substructure" would be subheadings and paragraphs.)

A node therefore contains a variety of sublists, and at least some data items. The data items include a rating slot (for rated nodes), a version identifier, and a pointer to the previous version. (Alternatively, one of the sublists could be a version list.) A reference count would also be a good idea, in case nodes wind up with no links at all, so they can be removed. A pointer to the previous revision (during editing) would also be needed, until the node is published.

To make the structures extensible, the data items may well be kept in a tuple, where the nature of the tuple depends on the type of the node. (A text node, for example, would have a text string and an author identifier.)

Types

To make an "atomic" data structure, it would ideally be possible to construct a node that contains a list of of subtrees. Each list would be identified as, for example: content, structure, comment, evaluation, categories, and authors. Every such node would be capable of having its own list of sublists. Even if only one sublist was present, the result would be a tree.

It should also be possible to add lists dynamically. That allows a "question" node to have a sublist of "alternatives", for example, and for each alternative to have a sublist of "arguments" and/or "evaluations".

The question is: What is the best way to represent those types? They could be kept as a value in the node. (Then the list of sublists would contain pairs: type, list.) Another possibility is to keep them in a list header for each list. A third possibility is to link to them, the same way that a node's category sublist entries link to individual categories.

Another question: Do nodes need types, as well? For example, the list of arguments could have arguments for and arguments against (pro/con, plus/minus). Or possibly the arguments should be kept in separate lists? But that would make reconstruction of the original chronological sequence difficult, although it would expedite plus/minus summaries. [Overall, it seems desirable to add "type" as a data item in a node.]

If a node contains a type, and a node contains a list of sublists, then any node can be a list header. It only needs a type value that identifies it as a content list, or a structure list, etc.

Atomic Structure

The basic atomic structures, then, might look like this:

VirtualNode

pointer to Node (most recent version)
Node

type prevVersion data (tuple) sublists
A node of type "Info" would have multiple sublists, including content, structure, etc. A node with one of those types, on the other hand, would have only one sublist. So a "Content" node would have a single list containing "Text" (zero sublists) or "Markup" nodes (one sublist with zero or more entries of type Text or Markup).
Sincerely,

Eric Armstrong
eric.armstrong@eng.sun.com

From:	Eric Armstrong
	eric.armstrong@eng.sun.com Reply-To: unrev-II@egroups.com