Colloquium at Stanford
The Unfinished Revolution


Memorandum

Date: Sat, 29 Jan 2000 22:21:01 -0800

From:   Eric Armstrong
eric.armstrong@eng.sun.com
Reply-To: unrev-II@onelist.com

To:     unrev-II@onelist.com

Subject:   Source Code in XML: Data and Editing Requirements

Here is a modified version of the requirements I originally produced for the source in XML project. The gist of it is that it can be handled without using any special node types. (This is a good thing. Otherwise, you wind up defining a language that is equivalent to the original -- and which must stay in step with it.) As a result, the "smarts" in the system migrate to the XML-to-plain-text filter, until such time as an XML-aware compiler is constructed.

Apologies in advance for the fact that this should probably go to an as-yet uncreated DKR working group sublist, instead of the main list. (I suspect tonight's traffic will help motivate such a subgroup...)


-------Original Message-------

Subject:   Double Containment

For a literate style of programming in a hierarchical system, comments need to "double contain" both other comments and code. You want the ability to collapse the hierarchy so you see only comments, and expand it at certain points to see the code tucked under them. That style produces a more readable, better documented program, because code that can't be collapsed into a comment sticks out like a sore thumb.

At the same time, some comments are large, so they need to take advantage of the tree structure and the ability to collapse things, as well. So a comment needs the ability to "contain" both code and comment elements.

One way to do that is to create a meta-node that contains a comment substructure and a code substructure. But that makes the editing task very difficult -- the meta-node must be invisible, and the comment node must appear to the user to be the parent of the code. That requirement in turn implies the need for an "Editor Stylesheet" to tell the editor what to do with the nodes in the system.

But it makes a lot more sense to use existing standards, and avoid creating new ones. So it would be more desirable to use a more natural mapping of source text to an XML source tree.

At the same time, we want to avoid creating different nodes for every language construct. That ties the editor to a particular language and creates an editor that is harder to learn and harder to use. It is therefore relatively clear that the right tree structure for storing source code will be basically language-independent, so that any XML editor the developer is familiar with can be used.


Single Node Type

The bottom line is that the XML source tree really needs only *one* kind of node. Call it "line", or "code" or something. The smarts are all embedded in the XML-to-text processor, which means that different processors can handle different languages.
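
To make the idea concrete, here is a minimal sketch in Python of a tree built from a single node type. The element name "line" comes from the suggestion above; keeping each entry's text in a "text" attribute, and the outline() helper itself, are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Every entry, comment or code, is the same kind of element.
# The "text" attribute is an assumed convention, not part of the requirements.
SOURCE = """
<line text="int nextPrime(int n)">
  <line text="// Find the next prime">
    <line text="int i = n + 1"/>
    <line text="return i"/>
  </line>
</line>
"""

def outline(node, depth=0):
    """Return the tree as indented text, one entry per row."""
    rows = ["    " * depth + node.get("text", "")]
    for child in node:
        rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(SOURCE)
print("\n".join(outline(root)))
```

Nothing in the tree says which entries are comments and which are code. That knowledge would live entirely in the filter that walks it.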

At this point, the smarts should handle end-of-comment marks, braces, and semi-colons. Parentheses should be ignored, at least for now.

Semi-colons are easiest. One is supplied at the end of an entry, if it doesn't already exist. (Checking for comments embedded in the line is the only tricky part.)
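
A rough sketch of that rule in Python follows. The function names are hypothetical, and it only looks for a trailing // comment outside a double-quoted string. Embedded /* */ comments and character literals are not handled, so treat it as a starting point rather than a finished filter.

```python
def split_trailing_comment(entry):
    """Split an entry into (code, comment) at the first // outside a string literal."""
    in_string = False
    i = 0
    while i < len(entry):
        ch = entry[i]
        if in_string:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif entry.startswith("//", i):
            return entry[:i], entry[i:]
        i += 1
    return entry, ""

def ensure_semicolon(entry):
    """Supply a semicolon after the code part of an entry if it lacks one."""
    code, comment = split_trailing_comment(entry)
    code = code.rstrip()
    if not code:                    # the entry is nothing but a comment
        return entry
    if not code.endswith(";"):
        code += ";"
    return code + ("  " + comment if comment else "")

print(ensure_semicolon("x = 1 // set x"))
```

A real filter would apply this only to leaf entries, since an entry with subentries opens a block rather than ending a statement.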

For if statements and the like, the filter should expect to supply braces, if they don't already exist. Again, working around comments is the tricky bit. But the structuring information makes it possible to identify endpoints.
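
As an illustration, the following Python sketch lets the nesting decide where braces go. It assumes entries are "line" elements with a "text" attribute (an invented convention), and it takes one possible view of the comment problem: an entry that starts with // gets no braces, and the code tucked under it is emitted at the comment's own level.

```python
import xml.etree.ElementTree as ET

def emit(node, depth=0):
    """Turn a node and its subentries into output rows, supplying braces."""
    pad = "    " * depth
    text = node.get("text", "").strip()
    children = list(node)
    if text.startswith("//") or not children:
        rows = [pad + text]
        for child in children:
            rows.extend(emit(child, depth))   # code under a comment is not a block
        return rows
    opener = text if text.endswith("{") else text + " {"
    rows = [pad + opener]
    for child in children:
        rows.extend(emit(child, depth + 1))
    rows.append(pad + "}")                    # the end of the sublist is the endpoint
    return rows

tree = ET.fromstring(
    '<line text="if (n &gt; 1)">'
    '<line text="// count down"><line text="n--;"/></line>'
    '</line>'
)
print("\n".join(emit(tree)))
```

The closing brace needs no searching. It goes wherever the sublist ends.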

An entry that starts with // is terminated at the end of line automatically. If the output is wrapped, multiple //'s are supplied. What becomes an entire paragraph of explanation is therefore easily marked as a comment with two characters.
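
The wrapping step could look something like this in Python, using the standard textwrap module. The function name and the default width are arbitrary choices for the example.

```python
import textwrap

def wrap_line_comment(entry, width=72, indent=""):
    """Wrap one // entry into as many // rows as the width requires."""
    body = entry.lstrip()[2:].strip()         # drop the leading // marker
    prefix = indent + "// "
    rows = textwrap.wrap(body, width=width,
                         initial_indent=prefix, subsequent_indent=prefix)
    return rows or [indent + "//"]            # an empty comment still yields one row

paragraph = "// " + "This explanation runs on for a whole paragraph. " * 4
for row in wrap_line_comment(paragraph, width=40):
    print(row)
```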

Since we know where // comments end, we can terminate an entry that starts with /* at the end of its sublist. Subentries with /* should be converted to // when generating output for the compiler, making it trivial to comment out entire blocks of code.
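
Here is one way the compiler-bound conversion might be sketched in Python. It assumes "line" elements with a "text" attribute (an invented convention) and is meant to be called on an entry that starts with /*. A /** entry would need separate handling that this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

def comment_out(node, depth=0):
    """Emit a /* entry and everything beneath it as // rows."""
    pad = "    " * depth
    text = node.get("text", "").strip()
    if text.startswith("/*"):
        text = text[2:].strip()               # the marker becomes // on output
    rows = [pad + "// " + text] if text else []
    for child in node:
        rows.extend(comment_out(child, depth + 1))
    return rows

block = ET.fromstring(
    '<line text="/* old implementation">'
    '<line text="int i = 0;"/>'
    '<line text="/* nested note"><line text="i++;"/></line>'
    '</line>'
)
print("\n".join(comment_out(block)))
```

Because every row gets its own //, nested /* entries cause no trouble, which is not true of nested /* ... */ pairs in flat text.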

When generating output for putting back into a source control system, though, it's not clear that such a "destructive" implementation is ideal. We'd have to convert the semantics we recognize easily ("/*" == "comment-out-entire-block") into something which has the equivalent effect when read by a person, and which can be deconstructed into the original form when filtered back into the editor.

The really interesting comments are the Java /** comments. Since they can be up to a page or more in length, they too should terminate at the end of the sublist so that the advantages of the hierarchy accrue to them.

On input, each /** entry needs to become a CDATA section so that any embedded HTML tags are ignored. All subentries must become CDATA sections as well. It will take some fancy HTML parsing to identify the right subelements to create. (Personally, I would prefer source code to have links to documents. But Java was designed in a world ruled by flat-text editors, rather than HTML or XML, so keeping the comments in place seems like the right idea.)
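
A small Python sketch of the CDATA step, using the standard xml.dom.minidom module. It assumes the entry text is stored as element content, since a CDATA section cannot live in an attribute. The javadoc_entry() helper is hypothetical, and the fancy HTML parsing needed to split a comment into subentries is not attempted.

```python
from xml.dom.minidom import Document, parseString

def javadoc_entry(doc, text):
    """Wrap the text of a /** entry in a CDATA section so its HTML is not parsed."""
    node = doc.createElement("line")
    node.appendChild(doc.createCDATASection(text))
    return node

doc = Document()
entry = javadoc_entry(doc, "/** Returns the <b>next</b> prime.")
doc.appendChild(entry)
print(entry.toxml())

# Reading it back gives the original text, tags and all.
restored = parseString(entry.toxml()).documentElement.firstChild.data
```

One caveat: a comment that happens to contain the sequence ]]> cannot go into a single CDATA section, so a real filter would need to split or escape it.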

This strategy eases the development of the editor, though it complicates the input and output processing. Most of all, though, it allows most any editor to be used, for most any language, given the appropriate filters. (To find errors, though, the output processor will have to record line numbers in the structure, and the editor will need the ability to go to them.)
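
The line-number bookkeeping might be sketched as follows in Python. The "line" attribute name and both helper functions are assumptions. The idea is only that each node remembers the output row it produced, so a compiler error at row N can be traced back to an entry.

```python
import xml.etree.ElementTree as ET

def emit_numbered(node, rows, depth=0):
    """Emit one row per node and record that row's number on the node."""
    rows.append("    " * depth + node.get("text", ""))
    node.set("line", str(len(rows)))          # compilers count rows from 1
    for child in node:
        emit_numbered(child, rows, depth + 1)

def node_at_line(root, number):
    """Find the entry that produced a given output row, or None."""
    for node in root.iter():
        if node.get("line") == str(number):
            return node
    return None

root = ET.fromstring(
    '<entry text="class A">'
    '<entry text="int x"/><entry text="int y = ;"/>'
    '</entry>'
)
rows = []
emit_numbered(root, rows)
culprit = node_at_line(root, 3)               # say the compiler complains about row 3
print(culprit.get("text"))
```

A filter that supplies extra rows, such as closing braces, would append them to the same list, so the numbering stays in step with what the compiler sees.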


Structure Disconnect Issues

The system won't be "bulletproof". Two "structure-disconnect" issues remain. The advantage of hierarchies is automatically keeping things together that belong together. There are a couple of cases, though, where a separation can occur -- always by user action, of course, but an undesirable separation nonetheless.

The first issue is that an else-clause can be disconnected from the if-clause if it is coded as:


    + if ...
        ...code...
    + else
        ...code...

You could then move the if-clause and leave the else-clause behind. Still, that's an easy error to diagnose and fix, and it's one you could make with a plain text editor as well. Then, too, the whole issue can be sidestepped by virtue of coding style:


    + if (expression)
        + // then
            ...code...
        + else //
            ...code...

Both clauses now tuck under the if-statement nicely, and each is explained as well, for a bit more literacy in the program.

The other issue is the separation of /** comments and the code they attach to. The strategy outlined above produces code like:


    + /** Big Method

          - This method returns an integer
          - that is the next prime number
          - starting from its input argument...

    + int nextPrime(int n)

          - // Implement Eratosthenes' Sieve
          - for (int i=0; ...)

              - // Is i prime?
              - ...

When compacted, the view shows two elements side by side that clearly belong together:


    + /** Big Method
    + int nextPrime(int n)

That makes it possible to move one and leave the other behind. But I suspect we can live with that. (In a hierarchical system, you quickly learn to collapse the view before making selections, to avoid exactly this kind of problem.)

Again, coding style might also step to the rescue...


    + // Big Method

        + /** This method returns an integer....

        + int nextPrime(int n)

Now the documentation and code are side by side, but both are contained under a common heading.


Compensating for Manual Terminators

The one hairy part of the XML-to-plain-text translator is compensating for manually entered terminators -- especially */ and the ending-brace.

An existing language parser can be modified to generate SAX events, and then plugged into Sun's DOM parser to generate a DOM from a plain source.
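
The same pipeline can be imitated in Python as a sketch, with loud caveats. A toy indentation reader stands in for the modified language parser, the standard XMLGenerator stands in for the SAX consumer, and minidom builds the DOM. The "source" wrapper element, the "text" attribute, and outline_to_sax() are all invented for the example.

```python
import io
from xml.dom.minidom import parseString
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

def outline_to_sax(lines, handler, indent=4):
    """Fire SAX events for an indented outline; nesting follows indentation."""
    handler.startDocument()
    handler.startElement("source", AttributesImpl({}))
    open_depths = []                          # depths of the currently open entries
    for raw in lines:
        if not raw.strip():
            continue
        depth = (len(raw) - len(raw.lstrip())) // indent
        while open_depths and open_depths[-1] >= depth:
            handler.endElement("line")
            open_depths.pop()
        handler.startElement("line", AttributesImpl({"text": raw.strip()}))
        open_depths.append(depth)
    for _ in open_depths:
        handler.endElement("line")
    handler.endElement("source")
    handler.endDocument()

buffer = io.StringIO()
outline_to_sax(["if (n > 1)", "    n--", "return n"],
               XMLGenerator(buffer, encoding="utf-8"))
dom = parseString(buffer.getvalue().encode("utf-8"))
print(dom.documentElement.toxml())
```

A real parser would fire the same events from its grammar actions instead of from indentation, and could drop semicolons and braces as it goes.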

However the terminators are handled, the choice can be reflected in the data structure that results. It is a simple matter to drop the semi-colons, end-comment marks, and braces while parsing. The structure already implies where they are needed, and they can be automatically supplied on output.

The problems begin with the editing. There is nothing to *prevent* a user from entering an ending-brace, for example. At least that is true in the normal XML editor. But to disallow an ending-brace in source code and yet allow it in a comment, for example, is to require a grammar-sensitive editor. Experience shows that such beasts are too limited to become widespread, often unwieldy, and often terribly constraining.

So the user can't reasonably be prevented from entering terminators any more than they can be prevented from typing bad code. That means the XML-to-plain-text translator has to be alert for the problem -- which greatly increases the complexity it has to deal with. (I'm open to other solutions. But that's the only one I see at the moment.)
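
To give a feel for what "being alert" might involve, here is a deliberately small Python sketch that covers only braces. The function is hypothetical. It drops a hand-typed opening brace at the end of an entry that has subentries, and any subentry that consists solely of a brace, so the translator can then supply its own. Hand-typed */ marks and forms such as "} else {" are not covered.

```python
def strip_manual_braces(header, children):
    """Return (header, children) with hand-typed braces removed.

    header   -- text of an entry that has subentries
    children -- texts of those subentries, in order
    """
    header = header.rstrip()
    is_comment = header.lstrip().startswith("//")
    if not is_comment and header.endswith("{"):
        header = header[:-1].rstrip()         # the filter will add this back
    kept = [c for c in children if c.strip() not in ("{", "}")]
    return header, kept

print(strip_manual_braces("if (n > 1) {", ["n--;", "}"]))
```

Even this much needs to know whether it is looking at a comment, which shows how quickly the translator starts to need some grammar sensitivity of its own.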

Sincerely,



Eric Armstrong