[Date Prev][Date Next][Thread Prev][Thread Next][Author Index][Date Index][Thread Index]

The Equality of Decisions



The Equality of Decision, or "An Inquiry into the Nature and
			Curses of 'this->hashForEqual()'"


(Phil, apologies for making a hash of your title)

One of the most innocent seeming of our early decisions in the xpp
semantics has turned out to be surprisingly interesting and subtle.
Up in "tofux.hxx" we find:

CLASS(Heaper,Tofu) {
  DEFERRED:
    virtual BooleanVar isEqual (Heaper * other) 	RUN_TIME_DEFERRED_DECL;
    virtual unsigned long hashForEqual ()	 	RUN_TIME_DEFERRED_DECL;
    ...
};

Here is my understanding of what these requests mean, and what the
ramifications are (the following in some form should be added to the
documentation):


		Part 1:  The Equality of Decisions

If "a->isEqual(b)" returns TRUE, then "a" is guaranteeing that "a" and
"b" may be treated as the same object.  I.e., it is guaranteed that
they currently have the same behavior, and that they will continue to
have the same behavior.  For side-effect-free objects, this will
typically mean that they are the same type and have internal contents
which "mean" the same thing.  E.g., for Regions, a->isEqual(b) iff a
and b are over the same coordinate space and include exactly the same
set of positions.  For objects with internal side effects, isEqual
will typically be an EQ style identity check.  For example, two
different MuTables which at the moment have the same contents should
answer FALSE to isEqual, as they are not guaranteed of being the same
in the future.

Whereas Lisp and Smalltalk (and probably several others) have two
notions of equality (Lisp's "EQ" vs "EQUAL", Smalltalk's "=" vs "=="),
in xpp we can get away with having only one.  (Historical note: our
one notion of equality is based on a suggestion of Gerald Sussman as
to what EQ should mean.  However, he never adopted this notion into
Scheme.)  To translate between these other notions and ours:

		When Lisp says:		xpp says:
				
On Values	(EQ a b)		----
		(EQUAL a b)		a->isEqual(b)


On Mu's		(EQ a b)		a->isEqual(b)
		(EQUAL a b)		a->asImmu()->isEqual(b->asImmu())

(Where by a "value" I mean a side-effect-free object such as a Region
or Orgl, and by a "Mu" I mean an object with internal side effects
such as a MuTable or a Bert.)

The hyphens above represent our stand that the ability for EQ and
EQUAL to give different answers for values in Lisp is a mis-feature in
the semantics of Lisp.  It prevents one from implementing many
memoizing or caching tricks in a way that is guaranteed to be
semantics preserving.  

(For example: let's say that we notice that we are frequently unioning
the same pair of regions several times in a row (just because of some
peculiarity in our algorithm).  We may decide to put in a cache of the
most recent pair of inputs and the resulting output.  If a new
unioning operation matches our cache, we just return the output again.
To do this in a semantics preserving way in Lisp, we might need to
copy our output each time.)

The "asImmu()" above really represents a set of messages:
"asImmuTable()" sent to a ScruTable, "asImmuSet()" sent to a ScruSet,
"orgl()" sent to a Hand, and "stamp()" sent to a Bert.  To understand
what this is about, consider: How does one explain "(EQUAL a b)" on
two Mu objects in Lisp?  We say "They are EQUAL if they have the same
contents.".  Notice that in talking about it one refers to the
contents of the Mu, but in the Lisp code this concept never appears.
The complex expression in the above table can be read: "Are a's
current contents the same as b's current contents?".  Being able to
reify (make into an object) the contents of a Mu and interact with it
directly is just another application of the Orgls n' Berts idea at the
xpp level.

A major benefit is that our various container classes that look
something up based on equality (such as sets & tables) only have one
notion of equality to worry about.  By contrast, Lisp has *LOTS* of
hair to parameterize lots of functions wrt which notion of equality to
use (take Common Lisp.  Please).  Smalltalk, although it has much less
hair, does have both "Dictionary"s (based on "=") and
"IdentityDictionary"s (based on "==").  Smalltalk's "Dictionary"s also
do not work (and probably cannot be made to work) when a key changes
it's internal state so that it no longer hashes to the same value.
More on this below.

Caveats, Exceptions, Subleties, and Inconsistencies:

1) On some Mu objects, we do also have a message "contentsEqual" such
that "a->contentsEqual(b)" === "a->asImmu()->isEqual(b->asImmu())".
This is provided for efficiency.  As it's semantics are defined as
above, it is no serious threat to our cleanliness, although it might
seem to be.

2) There are actually two ways in which two pointers ("a" and "b") to
two different instances of the Region which "mean" the same thing are
different than a two pointers to one Region.  1) Twice as much storage
is taken up, and 2) the effect "delete a;" has on the status of "b".
These facts support Dean's argument that "delete" (as well as
"destroy()") happens at a very different semantic level than other
messages to an object.  When you "delete" an object, you are making a
statement about storage, not about regions.  When you add various
kinds of supposedly semantics preserving optimizations, you may in
fact break many delete operations.  These issues do not come up in
Lisp or Smalltalk (having neither explicit deletion nor finalization),
but they do come up in xpp.  Note that in code which is dealing with
object storage instead of objects (such as code which is deciding
whether to "delete a;"), it is perfectly valid to say "a == b".  This
DOES NOT MEAN that the above matrix should read "a == b" where is
currently has hyphens.

		When Lisp says:		xpp *DOES NOT* say:
				
On Values	(EQ a b)		a == b

because the storage concerns about which one might inquire "a == b"
don't occur in Lisp.  If there were a reasonable way to do so, we
might prefer to say "a->storage()->isEqual(b->storage())".

3) Notice above that we said: "If 'a->isEqual(b)' returns TRUE, then
'a' is guaranteeing ..." as opposed to just "If 'a->isEqual(b)'
returns TRUE, then ...".  Remember that we are coding a server
intended to simultaneously serve multiple clients who don't
necessarily trust each other.  Let's say that "a" points at a local
object (whose behavior we, therefore, wrote & trust), but that "b"
points at a proxy for a remote object (which we therefore do not
trust).  In the first case, we are depending only on equality
determining behavior which we trust, but if we commute the expression,
it means something quite different.  

3a) Actually, this is only the case if we've declared "isEqual" to be
a PROXY: message.  If we haven't done so, then even if "b" is a proxy,
"b->isEqual(a)" will execute locally with code we've written and
trust.  Also, this isn't an issue in the backend currently at all,
because "isEqual" cannot be a NOWAIT message, and so cannot be sent
from our server to a client until sometime after 1.0 when we have
multi-thread support.  (Right now our server runs as a single thread,
so it can never make a request of a client for which it needs to wait
for a response.)



		Part 2: An Inquiry into the Nature and 
		    Curses of 'this->hashForEqual()'

The specification of "hashForEqual()" started very simple.  Given that
"a->isEqual(b)" is TRUE then "a->hashForEqual() == b->hashForEqual()".
But even this simple specification turned out to be surprisingly hard
to meet well.  (Note that one could meet it poorly by always having
hashForEqual() return 3.  Doing better is clearly not a "premature
optimization".)  

Let's say that we are representing regions over an ordered coordinate
space.  We might choose several implementation classes: a singleton
region, an interval, and an interval-union (indeed, the region classes
used to be much like this).  Should we require regions which mean the
same thing to be represented the same way?  If we intersect [1,5) and
[4,7) can we answer [4,5) or must we answer singleton(4)?  In the
absense of the above spec for hashForEqual(), there would be no
problem.  Regions could compare for equality if needed by checking
(a->minus(b)->isEmpty() && b->minus(a)->isEmpty()).  Equality is easy
because you have both representations available at the time of the
check.

Given the constraint that ([1,5) intersect: [4,7))->hashForEqual() ==
singleton(4)->hashForEqual(), it is natural to say that values (side
effect free objects) must always simplify to a canonical form.  Note
that we said "canonical", not "simplest".  There isn't necessarily a
unique simplest form (this is the case for CompositeRegions and
Filters).  Once we are forced into this part of the design space,
several benefits follow.  First, we can now compare equality by simply
checking for equivalent representations.  I suspect it is a good
tradeoff to spend somewhat more time simplifying during object
creation in exchange for less time during operations on the object.
Second, when there is a specialized representation (such as
singleton), the code for the less specialized representation (such as
interval) need not work for the more specialized case.

	Meanwhile, back at the ranch...

Unfortunately, the "simple" specification for "hashForEqual()" that we
started with above isn't even sufficient to be able to build a hash
table.  The problem being that "a" can change its hash value as long
as "b" changes likewise.  Should "a" be a key stored in a hash table,
then once its hash value changes it will have been stored at the wrong
place.  We need to specify some kind of constancy of hash value for
the "same" object over time.  We felt that our specification should
also allow the following implementation for a Mu object:


BooleanVar MuFoo::isEqual (Heaper * other)
{
    return this == other;
}

unsigned long MuFoo::hashForEqual ()
{
    return (unsigned long) this;
}


which we felt provided sufficient constancy for hash tables.  (Let us
refer to objects which use the above methods as EQObjects.)  The
specification we had in mind was something like "An object (and any
object which isEqual to it) maintains the same hash value (according
to 'hashForEqual') only inside a single address space over a single
run of a program."  This has implications for Stubbled code for (for
example) an ImmuTable which is both a COPY object and internally a
hash table.  When it is transmitted between address spaces it should
use the SENDER: & RECEIVER: feature to send all its key-value
associations and rehash on the receiving side.  Similarly, if an
ImmuTable is stored on disk during one run of a program and read back
in during a later run, the same SENDER: and RECEIVER: feature will
save the day.  Now that we are actually starting to *use* disks in an
incremental fashion (via SnarfPackers and URDI), we realize things
aren't quite so simple.

The problem being (of course) that inside one address space, during
one run of a program, and object can be written to disk, flushed out
of core, and read back into core in a different location, changing its
hash value according to the above method.  To solve this, we need to
get into some of the particulars of Sheep, Flock, and Shepherds.

Objects migrate to and from the disk in groupings called "flocks".
The flock is the minimum granularity at which objects are purged from
core and restored from disk.  A flock consists of exactly one
shepherd, and an arbitrary number of sheep (as long as the entire
flock is guaranteed of fitting in a single snarf?  Is this true?
Michael?  Dean?  How can we be confident this condition is
satisfied?).  A "sheep" (what is the singular?) is simply a "COPY(X):"
object which needn't be at all cognizant of the disk.  The single
shepherd in a flock is "representative" of the flock as a whole--the
only inter-flock pointers are to shepherds.  

This means that the only on-disk pointers to a sheep must be from
objects within the same flock as this sheep.  If a sheep is an
EQObject there is no problem, because a hash table that points to it
must be within the same flock.  Given that hash-indexed structures
rehash when being read back in off disk, the flock will always be
using valid hashes during its stay in core.

What of a shepherd which would normally be an EQObject?  The shepherd
can leave core even while there are valid pointers to it, and come
back in at a different location afterwards.  However, a core-pointed-
to shepherd which leaves core must leave a shepherd stub for itself
behind (so it can be faulted back in).  (Note that a shepherd which
leaves core, and for which there are no currently remaining core
pointers doesn't directly present these problems.)  Depending on how
the mechanics of shepherd stubs work (about which I don't know the
details--Dean?  Michael?), there may be some way better than the
following for dealing with Shepherd::hashForEqual().  However, the
following is sufficient:


CLASS(Shepherd,Heaper) {
    ...
    Shepherd ();
  COPY(Shepherd):
    int /* or any simple integral type? */ sequenceNumber;
    ...
};  

Shepherd::Shepherd ()
{
    static int nextSeqNum = 0; 
	/* or rather, in order to write this in translatable */ 
	/*  Smalltalk, make this a Smalltalk class variable */
    sequenceNumber = nextSeqNum++;
    ...
}

BooleanVar Shepherd::isEqual (Heaper * other)
{
    return this == other;
}

unsigned long Shepherd::hashForEqual ()
{
    return sequenceNumber;
}


(The above code should not be taken as endorsing any position in the
"Shepherds should be classes" vs "Shepherds should be properties"
controversy.  Shepherds are classes above only for pedagogical
purposes.)

Although the "sequenceNumber" variable is much like the sequence
numbers we currently have for debugging purposes, there are two major
differences: 1) They are only allocated for Shepherds, not for every
object.  If flocks are significantly bigger than an int, they don't
cost very much.  2) We are not using sequenceNumber as a definitive
notion of identity.  If the number wraps around and we get a repeat,
it is simply a hash collision, not a disaster.  We still depend on the
actual memory address to resolve true equality.  Even though the
memory address changes from time to time, it stays stable during the
execution of the above isEqual method.


			Conclusion

Didn't someone once say that "A is A" is self evident?  Or was it that
"A is A" is a necessary condition for Objective-Oriented Philosophy?