Monday, June 18, 2012

Sattath and Tversky: "Additive Similarity Trees" (1977)

In this paper, Shmuel Sattath and Amos Tversky describe a FORTRAN program that can produce visual representations of similarity data. It constructs a tree in which similar things are grouped together and branch heights show differences. The algorithm is applied to example data sets, and its faithfulness as a representation of the original similarity matrix is compared to that of other visualization techniques.

The paper owes a lot to the mathematical legwork done by Peter Buneman. I'll try here to reconstruct the algorithm that Sattath and Tversky propose as clearly as possible and without much commentary.

Translations Between Distances and Graphs

Any distance measure can be represented by a graph: we can always simply draw a line between every pair of objects in our domain and assign the right "length" to the line.

However, not all kinds of graphs can represent all distance measures on a domain, and not all graphs can be drawn on paper in a way that maps arc "lengths" to distances on the paper. It is thus a substantial task to identify the classes of distance measures that can be represented by certain kinds of drawings.

In order to investigate this question, a number of graph classes can be defined, and arithmetical conditions can then be given which select the exact class of distance matrices that each graph type can represent. The most interesting cases are tree representations (cycle-free graphs) and totally connected graphs (corresponding to embedding the domain in 2D space).

Additive Trees and Ultrametric Trees

There are two types of tree that can represent certain classes of distance measures well: ultrametric trees and additive trees.

An ultrametric tree is a rooted binary tree in which the distance between the root and any particular leaf is always the same. It can thus be drawn with the root on top and the leaves all at the same level.

An additive tree is an unrooted binary tree with arbitrary arc lengths.


As can be seen from the definitions, the set of ultrametric trees is a subset of the set of additive trees (given that we forget about the roots). The best additive tree will thus always fit a distance measure at least as well as the best ultrametric tree.

Arithmetic Conditions

There are two inequalities which accurately describe the sets of distance matrices that can be translated into trees of the two kinds. These are the ultrametric inequality and the additive inequality.

A distance matrix satisfies the ultrametric inequality if the following holds for all A, B, and C:
d(A,B) ≤ max{d(A,C), d(B,C)}
It fulfills the tree inequality (or additive inequality, or four-point condition) if the following holds for all A, B, C, and D:
d(A,B) + d(C,D) ≤ max{d(A,C) + d(B,D), d(A,D) + d(B,C)}
For completeness, compare these two conditions with the familiar triangle inequality:
d(A,B) ≤ d(A,C) + d(B,C)
Logically, the relations between these conditions are:
  • The ultrametric inequality implies the tree inequality.
  • The tree inequality implies the triangle inequality.
The ultrametric inequality is thus the most restrictive of the three, and the triangle inequality the least restrictive. (Four distinct points on the number line will satisfy the tree inequality, but not the ultrametric inequality. The four code words 00, 01, 10, and 11 will, using Hamming distance, satisfy the triangle inequality, but not the tree inequality.)
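
To make these conditions concrete, here is a little brute-force check in Python (my own sketch, nothing from the paper) that tests a distance matrix, given as a list of lists, against each of the three inequalities:

    from itertools import permutations

    def ultrametric_ok(d):
        # d(A,B) <= max(d(A,C), d(B,C)) for all distinct A, B, C
        pts = range(len(d))
        return all(d[a][b] <= max(d[a][c], d[b][c])
                   for a, b, c in permutations(pts, 3))

    def four_point_ok(d):
        # d(A,B) + d(C,D) <= max(d(A,C) + d(B,D), d(A,D) + d(B,C))
        # for all distinct A, B, C, D
        pts = range(len(d))
        return all(d[a][b] + d[c][e] <= max(d[a][c] + d[b][e],
                                            d[a][e] + d[b][c])
                   for a, b, c, e in permutations(pts, 4))

    def triangle_ok(d):
        # d(A,B) <= d(A,C) + d(B,C) for all distinct A, B, C
        pts = range(len(d))
        return all(d[a][b] <= d[a][c] + d[b][c]
                   for a, b, c in permutations(pts, 3))

    # Four distinct points on the number line: additive but not ultrametric.
    line = [[abs(p - q) for q in (0, 1, 2, 4)] for p in (0, 1, 2, 4)]
    print(ultrametric_ok(line), four_point_ok(line))  # False True

    # Hamming distances on 00, 01, 10, 11: metric but not additive.
    words = ["00", "01", "10", "11"]
    ham = [[sum(u != v for u, v in zip(w1, w2)) for w2 in words] for w1 in words]
    print(four_point_ok(ham), triangle_ok(ham))       # False True

The two print statements confirm the two parenthetical examples above.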

Representation

We can then state the two representation theorems that are the real point of interest:
  • A distance matrix can be represented as an ultrametric tree if and only if it fulfills the ultrametric inequality.
  • A distance matrix can be represented as an additive tree if and only if it fulfills the tree inequality.
Additive trees are thus the more flexible kind of representation, as also noted above. (See Peter Buneman's 1974 paper on the topic.)

As Sattath and Tversky note, any constructive proof of the additive representation theorem will, by definition, also be a little machine that can be used to translate distance matrices into trees. However, they are interested in a method that will construct a tree from any matrix and do well when the additive inequality is satisfied or almost satisfied. I am not sure whether their algorithm produces a perfect representation when the tree inequality is satisfied exactly.

The Construction Algorithm

Sattath and Tversky's algorithm for creating an additive tree from a matrix of distance data consists of two parts: a construction method and an estimation method. The construction method chooses the shape of the tree, and the estimation method chooses the arc lengths.

The construction method can be seen as a bottom-up lumping procedure, similar in spirit to, for instance, Huffman's algorithm for constructing code trees. If we think of a single data point as a tree with only one node, then the method can further be thought of as a recursive algorithm for joining more and more subtrees together until everything is connected.

The idea is roughly this:
  1. A set {x,y,z,w} of four points can be partitioned into two pairs in three different ways. The one that gives the lowest sum of within-pair distances is the "best." We thus run through all quadruples and give 1 point to a pair {x,y} every time it occurs in a "best" partition.
  2. Now sort the pairs according to the number of points they have gotten and start by picking the pair {x,y} with the highest score. From this pair, you produce the new subtree z = ½(x + y), and then remove all other pairs that contain x or y. Continue creating new trees this way until your ordered list of pairs is empty.
  3. You have now produced a new list of trees, roughly half the size of the original domain. Repeat the aggregation procedure above with these new subtrees as inputs until there are fewer than four subtrees left on the list (in which case, you can just join them at a single node, and you're done).
I presume that the distance between two subtrees z = ½(x + y) and c = ½(a + b) should be computed as the average of the distances between one point in {x,y} and one point in {a,b}; at least, the distance between a single point w and a subtree z = ½(x + y) is the average distance between w and a point in {x,y}.
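
Here, for what it's worth, is my reading of the construction method as a rough Python sketch; the tie-breaking between equally scored pairs and the averaging rule for distances to new subtrees are my own guesses:

    from itertools import combinations

    def key(x, y):
        # distances are stored under sorted label pairs
        return tuple(sorted((x, y)))

    def score_pairs(labels, d):
        # give one point to each pair of a quadruple's "best" partition,
        # i.e. the split with the smallest sum of within-pair distances
        score = {key(*p): 0 for p in combinations(labels, 2)}
        for a, b, c, e in combinations(labels, 4):
            splits = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
            best = min(splits, key=lambda s: d[key(*s[0])] + d[key(*s[1])])
            for p in best:
                score[key(*p)] += 1
        return score

    def aggregate_once(labels, d):
        # one lumping pass: walk the pairs in descending score order and
        # join each pair whose members have not been merged away yet
        score = score_pairs(labels, d)
        current = list(labels)
        for x, y in sorted(score, key=score.get, reverse=True):
            if x not in current or y not in current:
                continue
            z = f"({x}+{y})"
            current.remove(x)
            current.remove(y)
            # distance from the new subtree to every remaining item:
            # the average of the distances from its two halves (my guess)
            for w in current:
                d[key(z, w)] = (d[key(x, w)] + d[key(y, w)]) / 2
            current.append(z)
        return current

    def construct(labels, d):
        # repeat the lumping pass until fewer than four subtrees remain,
        # then join the survivors at a single node
        while len(labels) >= 4:
            labels = aggregate_once(labels, d)
        return tuple(labels)

The nested string labels double as a record of the tree topology, which saves the sketch a separate tree data structure.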

James Corter, in his PASCAL implementation of the algorithm, uses a variation in which the best of the three partitions gets 2 points and the second-best gets 1 point. I am not sure how this variation differs from the original in practice.

An Example

Consider this (pretty arbitrarily constructed) matrix of distances between five objects:

    A  B  C  D  E
A   0  1  2  3  4
B   1  0  4  3  2
C   2  4  0  5  4
D   3  3  5  0  1
E   4  2  4  1  0

Since there are 5 elements in this domain, we have to consider 5 different quadruples (one for each element left out) and in each case choose the "best" partition. For instance, in considering the quadruple {A,B,D,E}, we have to compare the following three sums:
  1. d(A,B) + d(D,E)  =  1 + 1  =  2
  2. d(A,D) + d(B,E)  =  3 + 2  =  5
  3. d(A,E) + d(B,D)  =  4 + 3  =  7
Since the first of these numbers is the smallest one, the partition of {A,B,D,E} into {A,B} and {D,E} is the "best" one.

Going through all quadruples this way yields the following list of optimal partitions:

Quadruple   Pair 1   Pair 2
A,B,C,D     A,C      B,D
A,B,C,E     A,C      B,E
A,B,D,E     A,B      D,E
A,C,D,E     A,C      D,E
B,C,D,E     B,C      D,E

The two pairs {A,C} and {D,E} thus both receive a score of 3 points, the four other pairs mentioned in the table receive 1 point each, and the remaining four pairs receive 0 points.

The list of candidate pairs thus begins:
{A,C}, {D,E}, {A,B}, {B,C}, {B,D}, {B,E}, {C,D}, {C,E}, ...
Following the algorithm, we then aggregate A and C into the subtree F = ½(A + C) and remove all remaining pairs that contain A or C. This leaves us with the list
{D,E}, {B,D}, {B,E}
We consequently create the subtree G = ½(D + E) and remove the pairs containing D or E.

This exhausts the list of pairs. We could now feed the three subtrees F, G, and B into the algorithm and start over, but since there are fewer than four points left, we simply join them at a single node, and we are done.
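
As a sanity check, feeding the example matrix into my sketch from above reproduces exactly this grouping (the nested labels record the tree shape):

    labels = ["A", "B", "C", "D", "E"]
    matrix = [[0, 1, 2, 3, 4],
              [1, 0, 4, 3, 2],
              [2, 4, 0, 5, 4],
              [3, 3, 5, 0, 1],
              [4, 2, 4, 1, 0]]
    d = {key(x, y): matrix[i][j]
         for i, x in enumerate(labels)
         for j, y in enumerate(labels) if i < j}
    print(construct(labels, d))  # ('B', '(A+C)', '(D+E)')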



The Estimation Algorithm

It is not entirely clear to me how Sattath and Tversky assign lengths to the arcs of the additive tree once it has been constructed.

It is clear that all distances depend linearly on the arc lengths, so for data sets that satisfy the tree inequality, we just have to solve a system of linear equations, which is doable even if cumbersome. If the data set cannot be fitted perfectly onto a tree, it is less clear how to choose the arc lengths that minimize the mean squared error (which is the statistic Sattath and Tversky use as a success criterion). However, a set of Lagrange multipliers ought to do the trick.
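
To illustrate the linear-algebra route, here is a sketch (again mine, not the paper's method) that fits arc lengths for the example tree by unconstrained least squares with numpy; the arc names are made up, and a real implementation would also have to keep the lengths nonnegative:

    import numpy as np

    # the example tree: a hub joining B, F = (A+C), and G = (D+E)
    arcs = ["hub-B", "hub-F", "hub-G", "F-A", "F-C", "G-D", "G-E"]
    path = {"A": {"F-A", "hub-F"}, "C": {"F-C", "hub-F"},
            "D": {"G-D", "hub-G"}, "E": {"G-E", "hub-G"},
            "B": {"hub-B"}}

    labels = "ABCDE"
    matrix = [[0, 1, 2, 3, 4],
              [1, 0, 4, 3, 2],
              [2, 4, 0, 5, 4],
              [3, 3, 5, 0, 1],
              [4, 2, 4, 1, 0]]

    pairs = [(i, j) for i in range(5) for j in range(i + 1, 5)]
    # the path between two leaves is the symmetric difference of their
    # leaf-to-hub paths, so every observed distance is a 0/1-weighted
    # sum of arc lengths: an ordinary linear least-squares problem
    X = np.array([[1.0 if arc in path[labels[i]] ^ path[labels[j]] else 0.0
                   for arc in arcs] for i, j in pairs])
    y = np.array([float(matrix[i][j]) for i, j in pairs])
    lengths, *_ = np.linalg.lstsq(X, y, rcond=None)
    for arc, est in zip(arcs, lengths):
        print(arc, round(est, 2))  # estimates may come out negative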

Sattath and Tversky, however, seem to have some other method in mind (p. 328). They claim that the individual arc lengths of the least-squares estimate can be computed from (i) the number of points at each end of the arc and (ii) the average distance between a point at one end of the arc and a point at the other. However, they don't describe the method that falls out of this observation any further.

Vladimir Makarenkov and Pierre Legendre (2001), though, seem to spell out a method in detail. Their algorithm is based, as far as I could see, on adjusting heights from the bottom of the tree first, moving upwards.
