[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]

Version Compare/Merge: Definition of Terms



The algorithm is actually embarassingly simple once you 
view the problem through the right set of concepts. Hence 
this message is a series of definitions of terms. These 
terms map more or less directly into the objects and 
operations used in the algorithm. Most of the terms are 
pretty self-evident; however, the definitions of "movement 
runs" and "composite runs" are worth reading carefully. The 
concept of the "movement run" is the closest thing to a 
conceptual breakthrough in this analysis.
 
The list of definitions here is pretty long; the 
algorithm itself is ironically rather short. You can 
probably skip the definitions and read the algorithm and 
get a good idea of what's going on if you're in a hurry; 
the algorithm itself is in the message following this one.
------

A RUN is a sequence of objects, often sequences of 
characters, often smaller runs, which are correctly 
sequenced by some criterion (such as, they are in the 
sequence they appear in when viewed in the final document). 
Two objects are ADJACENT inside a run if they are next to 
each other. 

-The ORIGINAL DOCUMENT is a run of characters sequenced 
as they appear in the document at the beginning of the 
transformation.

-The FINAL DOCUMENT is a run of characters sequenced as 
they appear at the end of the transformation.

-A PRIMITIVE RUN is a run of characters that have the 
same sequence in both the original document and the final 
document. The SIZE of a primitive run is the number of 
characters it contains. Primitive runs come in 3 flavors: 
there are PRIMITIVE MOVES, which appear in both the 
original and final documents; there are INSERTIONS, which 
appear only in the final document (since they don't appear 
in the original document at all, they do have the same 
sequence in the original document, sort of :-). And there 
are DELETIONS, which appear only in the original document.

Please note that primitive runs are approximately the 
kinds of objects which the backend returns to the 
application in its comparison operation. Primitive runs are 
therefore the starting material for the transformation.

One way in which these primitive runs differ from 
Xanadu is that I don't account for vcopies, i.e., the case 
where a single run appears in several places in the 
document. These can be handled with a simple add-on to the 
algorithm here, and are discussed in the section on 
deficiencies.

- The WORKING DOCUMENT is a run of runs that includes 
all the primitive runs from both the original and final 
documents, i.e., both insertions and deletions are 
included. The working document is transformed by successive 
operations, from a state in which the primitive moves and 
deletions are all sequenced according to the original 
document, to a state in which the primitive moves and 
insertions are all sequenced according to the final 
document. 

A pair of runs in the working document are PROPERLY 
ADJACENT if 

     - they are adjacent in both the working 
document and the final document, AND if
     - they are in the same sequence in the working 
document, as they are in the final document.

Such properly adjacent runs are ripe to be melded 
together into a single run, because they will remain 
together for the duration of the transformation.

To construct the working document in its initial state, 
start out with all the primitive runs from the original 
document in their proper sequence. Now insert the primitive 
insertions from the final document by placing each 
insertion properly adjacent to the larger of the 2 
primitive moves which would be its neighbors in the final 
document.

Please note that the initial state of the working 
document is exactly the document which should appear in 
Tapestry's version compare/merge display.

-A MOVEMENT RUN is a run composed of primitive runs; 
the movement run has 3 interesting properties:

     1) Neither of its neighbors are insertions or 
deletions. 
     2) Neither of its neighbors are properly 
adjacent to it. 
     3) It contains at least one primitive move run.

In other words, a movement run is a run which must 
somehow be moved with respect to both its neighbors in 
order to reach the arrangement found in the final document. 
The entire working document can, and must, be broken into 
movement runs. In a simple case, "A B C" becoming "B C A", 
the primitive runs are A and B and C, while the movement 
runs are (A) and (B C).

The SIZE of a movement run is the sum of the sizes of 
its primitive runs.

To transform the working document into a sequence of 
movement runs,  run through the following operations until 
none of these operations is still possible:

     - Combine each insertion with the larger of the 
two primitive moves which are its neighbors in the final 
document, to form a proto-movement-run.
     - Combine each deletion with the larger of the 
two primitive moves which are its neighbors in the 
original document, to form a proto-movement-run.
     - Combine all pairs of proto-movement-runs 
which are properly adjacent.

For example, in the original document "A del1 B del2 C" 
being transformed into final document "B insert1 C A", the 
working document initial state might be "A del1 B insert1 
del2 C", and the movement runs might be (A) (del1 B insert1 
del2 C).

Please note that once you have created movement runs, a 
valid (though poor) description of transformations would be 
to state that every movement run had moved to a new 
location.

It's also worth noting that, following the above 
algorithm for constructing movement runs, insertions and 
deletions are "attracted" to the largest primitive moves. 
With this clustering, movement runs will tend to be either 
very large, or very small. This will be used in the 
algorithm to maximize the extent to which small objects are 
moved large distances, following the heuristic 
specification.

- A COMPOSITE RUN is a run of movement runs which have 
been rearranged such that they are properly adjacent. The 
SIZE of a composite run has 2 components: first is the 
number of movement runs it contains. Second is the sum of 
the sizes of the contained movement runs. A composite run 
is always considered to be "larger than" a single movement 
run. One composite run is larger than the other if the 
first part of the size is greater; if the first part of the 
size is the same, it is larger if the second part of its 
size is greater.

Please note that the size of a composite run reflects 
the number of movements required to compose it: the number 
of movements required is equal to N-1, where N is the 
number of movement runs it contains.  This complex 
definition of size reduces the frequency with which moved 
objects will be moved again, thus reducing the amount of 
nesting.

Note also that composite runs do NOT contain other 
composite runs. If two composite runs are melded together, 
the resulting composite run contains all the movement runs 
from both of the originals.

- A REVERSED SEQUENCE is a series of movement runs 
which are in inverted order compared to their sequencing in 
the final document. "A B" transformed to "B A" is the 
simplest reversed sequence. "A B C D E" transformed to "E D 
C B A" is a more complex example. Reversed sequences of up 
to 3 elements are very common in editing; higher order 
reverses are rare. Indeed, it is surprising just how much 
common editing can be described with reverses of 2 or 3 
elements, as you may discover while reading the examples 
later (as an aside, all reverses of any length can be 
described as a series of 2-component reverses, though again 
this gives intuitively poor results).

- The ANCHOR is the largest single run in the working 
document. The anchor may change every time the working 
document is transformed.

If you understand all these terms, you are now equipped 
to encounter the mysteries of version comparison and 
merging. Prepare yourself for a breathtaking experience  
:-)

--marcs