% EXTENDED ABSTRACT DESCRIBING THE STANFORD GRAPHBASE --- PRELIMINARY DRAFT
\magnification\magstep1
\baselineskip12pt
\parskip3pt
\font\sc=cmcsc10 %use lower case as (Monthly)
\def\happyface % new experimental version (DEK, November 88)
{{\ooalign{\hfil\lower.06ex % a smiley face
\hbox{$\scriptscriptstyle\smile$}\hfil\crcr
\hfil\lower.7ex\hbox{\"{}}\hfil\crcr
\mathhexbox20D}}}
\def\display#1:#2:#3\par{\par\hangindent #1 \noindent
\hbox to #1{\hfill #2 \hskip .1em}\ignorespaces#3\par}
\def\disleft#1:#2:#3\par{\par\hangindent#1\noindent
\hbox to #1{#2 \hfill \hskip .1em}\ignorespaces#3\par}
\def\TeX{T\hbox{\hskip-.1667em\lower.424ex\hbox{E}\hskip-.125em X}}
\def\biba{\par\parindent 40pt\hangindent 60pt}
\centerline{\bf The Stanford GraphBase: A Platform for Combinatorial
Algorithms}
\bigskip
A highly portable collection of programs and data will soon be
available to researchers who study combinatorial algorithms and data
structures. All files will be in the public domain, and usable with
only one restriction: They must not be changed! A~``change file''
mechanism will allow local customization while the master files stay
intact.
The programs are intended to be interesting in themselves as examples
of ``literate programming.'' Thus, the Stanford GraphBase can also be
regarded as a collection of approximately 30 essays for programmers to enjoy
reading, whether or not they are doing algorithmic research. The
programs are written in {\tt CWEB}, a~combination of \TeX\ and~C that
is easy to use by anyone who knows those languages and easy to read by
anyone familiar with the rudiments of~C. (The {\tt CWEB} system is
itself portable and in the public domain.)
Four program modules constitute the {\it kernel\/} of the GraphBase:
{
\biba
{\sc gb\_$\,$flip} is a portable random number generator;
\biba
{\sc gb\_$\,$graph} defines standard data structures for graphs and
includes routines for storage allocation;
\biba
{\sc gb\_$\,$io} reads data files and makes sure they are uncorrupted;
\biba
{\sc gb\_$\,$sort} is a portable sorting routine for 32-bit keys
in linked lists of nodes.
}
\noindent
All of the other programs rely on {\sc gb\_$\,$graph} and some subset
of the other three parts of the kernel.
A dozen or so {\it generator modules\/} construct graphs that are of
special interest in algorithmic studies. For example {\tt
gb\_$\,$basic} contains 12~subroutines to produce standard graphs,
such as the graphs of queen moves on $d$-dimensional rectangular
boards with ``wrap-around'' on selected coordinates. Another generator
module, {\sc gb\_$\,$rand}, produces several varieties of
random graphs.
Each graph has a unique identifier that allows researchers all over
the world to work with exactly the same graphs, even when those graphs
are ``random.'' Repeatable experiments and standard benchmarks will
therefore be possible and widely available.
Most of the generator modules make use of {\it data sets}, which the
author has been collecting for 20~years in an attempt to provide
interesting and instructive examples for some forthcoming books on
combinatorial algorithms ({\sl The Art of Computer Programming},
Volumes 4A, 4B, and~4C). For example, one of the data sets is {\tt
words.dat}, a~collection of 5-letter words of English that the author
believes is ``complete'' from his own reading experience. Each word is
accompanied by frequency counts in various standard corpuses of text,
so that the most common terms can be singled out if desired. {\sc
gb\_$\,$words} makes a subset of words into a graph by saying that two
words are adjacent when they agree in~4 out of~5 positions. Thus, we
can get from {\tt words} to {\tt graph} in seven steps:
\disleft 30pt::
{\tt words, wolds, golds, goads, grads, grade, grape, graph.}
\noindent
This is in fact the shortest such chain obtainable from {\tt
words.dat}.
A dozen or so {\it demonstration modules\/} are also provided, as
illustrations of how the generated graphs can be used. For example,
the {\tt LADDERS} module is an interactive program to construct chains
of 5-letter words like the one just exhibited, using arbitrary subsets
of the data. If we insist on restricting our choices to the 2000 most
common words, instead of using the entire collection of about 5700, the
shortest path from {\tt words} to {\tt graph} turns out to have
length~20:
\disleft 30pt::
{\tt words, lords, loads, leads, leaps, leapt, least,}
\vskip-5pt
\disleft 30pt::
{\tt lease, cease, chase, chose, chore, shore, shone,}
\vskip-5pt
\disleft 30pt::
{\tt phone, prone, prove, grove, grave,
grape, graph.}
Several variations on this theme have also been implemented: If we consider
the distance between adjacent words to be alphabetic distance, for
example, the shortest path from {\tt words} to {\tt graph} turns out
to be
\disleft 30pt::
{\tt words} (3) {\tt woods} (16) {\tt goods} (14) {\tt goads} (3)
{\tt grads} (14) {\tt grape} (3) {\tt graph},
\noindent
total length 65.
The {\tt LADDERS} module makes use of another GraphBase module called
{\sc gb\_$\,$dijk}, which carries out Dijkstra's algorithm for
shortest paths and allows the user to plug in arbitrary
implementations of priority queues so that the performance of
different queuing methods can be compared.
The graphs produced by {\sc gb\_$\,$words} are undirected. Other
generator modules, like {\sc gb\_$\,$roget}, produce directed graphs.
Roget's {\sl Thesaurus\/} of 1882 classified all concepts into 1022
categories, which we can call the vertices of a graph; an arc goes
from~$u$ to~$v$ when category~$u$ contains a cross reference to
category~$v$ in Roget's book. A~demonstration module called {\sc
roget\_$\,$components} determines the strong components of graphs
generated by {\sc gb\_$\,$roget}. This program is an exposition of
Tarjan's algorithm for strong components and topological sorting of
directed graphs.
Similarly,
world literature leads to further interesting families of undirected
graphs via
the {\sc gb\_$\,$books} module. Five data sets {\tt anna.dat}, {\tt
david.dat}, {\tt homer.dat}, {\tt huck.dat}, and {\tt jean.dat} give
information about {\sl Anna Karenina}, {\sl David Copperfield}, {\sl
The Iliad}, {\sl Huckleberry Finn}, and {\sl Les Mis\'erables\/}; as
you might expect, the characters of each work become the vertices of a
graph. Two vertices are adjacent if the corresponding characters
encounter each other, in selected chapters of the book.
A~demonstration program called
{\sc book\_$\,$components} finds the blocks (i.e., biconnected
components) of these graphs using the elegant algorithm of Hopcroft
and Tarjan.
Another module, {\sc gb\_$\,$games}, generates graphs based on college
football scores. All the games from the 1990 season
between America's leading 120
teams are recorded in {\tt games.dat}; this data leads to ``cliquey''
graphs, because most of the teams belong to leagues and they play
every other team in their league. The overall graph is, however,
connected. A~demonstration module called {\sc football} finds long
chains of scores, to prove for instance that Stanford might have trounced
Harvard by more than 2000 points if the two teams had met---because
Stanford beat Notre Dame by~5, and Notre Dame beat Air Force by~30,
and Air Force beat Hawaii by~24, and \dots~, and Yale beat Harvard
by~15. (Conversely, a~similar ``proof'' also ranks Harvard over
Stanford by more than 2000 points.) No good algorithm is known for
finding the optimum solution to problems like this, so the data
provides an opportunity for researchers to exhibit better and better
solutions with better and better techniques as algorithmic
progress is made.
The {\sc gb\_$\,$econ} module generates directed graphs based on the
flow of money between industries in the US economy. A~variety of
graphs can be obtained, as the economy can be divided into any number of
sectors from~2 to~80 in this model.
A~demonstration program {\sc econ\_$\,$order}
attempts to rank the sectors in order from ``suppliers'' to
``consumers,'' namely to permute rows and columns of a matrix so as to
minimize the sum of entries above the diagonal. Again, no good
algorithms for this problem are known; two heuristics are implemented
for comparison, one ``greedy'' and the other ``cautious.'' Greed
appears to be victorious, at least in the economic sphere.
The highway mileage between 128 North American cities appears in {\tt
miles.dat}, and the {\sc gb\_$\,$miles} module generates a variety of
graphs from~it. Of special interest is a demonstration module called
{\sc miles\_$\,$span}, which computes the minimum spanning trees of
graphs output by {\sc gb\_$\,$miles}. Four algorithms for minimum
spanning trees are implemented and compared, including some that are
theoretically appealing but do not seem to fare so well in practice.
An approach to comparison of algorithms called ``mem counting'' is
shown in this demonstration to be an easily implemented
machine-independent measure of efficiency that gives a reasonably fair
comparison between competing techniques.
A generator module called {\sc gb\_$\,$raman} produces ``Ramanujan
graphs,'' which are important because of their role as expander
graphs, useful for communication. A~demonstration module called {\sc
girth} computes the shortest circuit and the diameter of Ramanujan
graphs.
Notice that some graphs, like those produced by {\sc gb\_$\,$basic} or
{\sc gb\_$\,$raman}, have a rigid mathematical structure; others, like
those produced by {\sc gb\_$\,$roget} or {\sc gb\_$\,$miles}, are more
``organic'' in nature. It is interesting and important to test
algorithms on both kinds of graphs, in order to see if there is any
significant difference in performance.
A generator module called {\sc gb\_$\,$gates} produces graphs of logic
circuits. One family of graphs is equivalent to a simple {\sc risc}
chip, a~programmable microcomputer with a variable number of registers
and a variable number of bits per word. Using such a ``meta-network''
of gates, algorithms for design automation can be tested for a range
of varying parameters. A~demonstration module {\sc take\_$\,$risc}
simulates the execution of the chip on a sample program. Another
meta-network of gates will perform parallel multiplication of $m$-bit
numbers by $n$-bit numbers or by an $n$-bit constant; the {\sc
multiply} module demonstrates this network.
Planar graphs are generated by {\sc gb\_$\,$plane}, which includes
among other things an implementation of the best currently known
algorithm for Delaunay triangulation.
Pixel data can lead to interesting bipartite graphs. Leonardo's {\sl
Giaconda\/} is represented by {\tt mona.dat}, an array of pixels that
is converted into graphs of different kinds by {\sc gb\_$\,$mona}.
A~demonstration routine {\sc assign\_$\,$mona} solves the assignment
problem by choosing one pixel in each row and in each column so that
the total brightness of selected pixels is maximized. Although the
assignment problem being solved here has no relevance whatever to art
criticism or art appreciation, it does have great pedagogical value,
because there is probably no better way to understand the
characteristics of a large array of numbers than to perceive the array
as an image.
This lecture might well have been called ``Fun and games with the
Stanford GraphBase,'' because the demonstration programs are great
toys to play with. Indeed, the author firmly believes that the best
serious work is also good fun, and we shouldn't apologize if we enjoy
doing research.
The Stanford GraphBase is now being beta-tested, and it should be
released in 1993. A~book about it, containing in particular all the
programs together with indexes and typographic aids to the reader,
will also be published in 1993. A~module called {\sc gb\_$\,$save}
converts GraphBase graphs to and from an ASCII format that
readily interfaces with other systems for graph manipulation.
\bigskip
\rightline{\sl ---\vtop{\hbox{Donald E. Knuth}
\hbox{Stanford University}
\hbox{March 31, 1992}}}
\bye