Computer guidance
The Hype about Hyperlinks
One
The AI Problem, as it's called - of making machines behave close
enough to how humans behave intelligently - ... has not been solved.
Moreover, there is nothing on the horizon that says, I see some light.
Words like "artificial intelligence," "intelligent agents," "servants" - all
these hyped words we hear in the press - are restatements of the mess
and the problem we're in.
We would love to have a machine that could go and search the Web,
and our personal stores, knowing our preferences, and knowing what
we mean when we say something. But we just don't have anything at
that level.
Michael Dertouzos, Director, Laboratory for Computer Science, MIT 1
Successful retrieval of information is the primary goal of most
Web users. According to a Pew Foundation report: "Search
engines are highly popular among Internet users. Searching
the Internet is one of the earliest activities people try when
they first start using the Internet, and most users quickly feel
comfortable with the act of searching ... 84% of internet users
have used search engines and, on a given day, 56% of those
online use a search engine." 2 As everyone who has searched on
the Web knows, the power of search engines has changed
dramatically in the past decade. To understand the current situation
and to anticipate future developments we need to understand
the problems involved in providing quick and reliable
searches, how it was done a decade ago and how it is done now.
When I finished the manuscript of this book in 1999, the
people whose judgment I trusted were deeply pessimistic
HUBERT L. DREYFUS
On the Internet
Second edition
LONDON AND NEW YORK
about the future of information retrieval on the World Wide
Web. The issues they raised are still relevant, although, as we
shall soon see, their pessimism is not. In this second edition I
will retain a shortened and lightly edited version of my opening
remarks from the first edition up to the point where the then-current
understanding of the problem of search became history,
and the attitude of the reliable researchers changed
almost overnight. Then, in the new material that makes up the
second half of this chapter, I'll explain what is possible now,
how it became possible, and, based on these new developments,
I'll predict where search is going from here. In 1999 I wrote:
The Web is vast and growing exuberantly. At a recent count, it
had over a billion pages and it continues to grow at the rate of
at least a million pages a day. 3 (It is characteristic of the Web
that these statistics, as you read them, are already far out of
date.) There is an amazing amount of useful information on
the Web but it is getting harder and harder to find. The problem
arises from the way information is organized (or, better,
disorganized) on the Web. The way the Web works, each
element of this welter of information is linked to many other
elements by hyperlinks. Such links can link any element of
information to any other element for any reason that happens
to occur to whoever is making the link. No authority or agreed-upon
catalogue system constrains the linker's associations. 4
Hyperlinks have not been introduced because they are
more useful for retrieving relevant information than the old
systematic ordering. Rather, they are the natural way to use the
speed and processing power of computers to relate a vast
amount of information without needing to understand it or
impose any authoritarian or even generally accepted structure
on it. But, when everything can be linked to everything else
without regard for purpose or meaning, the vast size of the
Web and the arbitrariness of the links make it extremely difficult
for people desiring specific information to find the
information they seek.
The traditional way of ordering information depends on
someone - a zoologist, a librarian, a philosopher - having
worked out a classification scheme according to the meanings
of the terms involved and the interests of the users. 5 People can
then enter new information into this classification scheme on
the basis of what they understand to be the meaning of the
categories and of the new information. If one wants to use
the information, one has to depend on those who developed
the classifications to have organized the information on the
basis of its meaning so that users can find the information that
is relevant given their interests.
Since Aristotle, we have been accustomed to organize information
in a hierarchy of broader and broader classes, each
including the narrower ones beneath it. So we descend from
things, to living things, to animals, to mammals, to dogs, to
collies, to Lassie. When information is organized in such a
vertical database, the user can follow out the meaningful links,
but the user is forced to commit to a certain class of information
before he can view more specific data that fall under that
class. For example, I have to commit to an interest in animals
before I can find out what I want to know about tortoises; and
once having made that commitment to the animal line in the
database, I can't then examine the data on problems of infinity
without backtracking through the commitments I have made.
When information is organized horizontally by hyperlinks,
however, as it is on the Web, instead of the relation between a
class and its members, the organizing principle is simply the
inter-connectedness of all elements. There are no hierarchies;
everything is linked to everything else on a single level, and
meaning is irrelevant. Thus hyperlinks allow the user to move
directly from one data entry to any other, as long as they are
related in at least some tenuous fashion. The whole of the
Web lies only a few links away from any page. With a hyperlinked
database, the user is encouraged to traverse a vast network
of information, all of which is equally accessible and
none of which is privileged. So, for instance, among the sites
that contain information on tortoises suggested to me by
my browser, I might click on the one called "Tortoises -
compared to hares", and be transported instantly to an entry
on Zeno's paradox.
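The contrast between the two ways of organizing information can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the class names in PARENT, the pages in LINKS, and both function names are hypothetical and are not taken from any real catalogue or browser. The hierarchy can only be crossed by climbing back to a shared class, while the hyperlinked version follows whatever links happen to exist.

```python
from collections import deque

# Hypothetical hierarchy: each class points to the broader class above it.
PARENT = {
    "living things": "things",
    "animals": "living things",
    "reptiles": "animals",
    "tortoises": "reptiles",
    "abstract problems": "things",
    "problems of infinity": "abstract problems",
    "Zeno's paradox": "problems of infinity",
}

# Hypothetical hyperlinks: any page may point to any other page.
LINKS = {
    "tortoises": ["Tortoises - compared to hares", "reptiles"],
    "Tortoises - compared to hares": ["Zeno's paradox"],
    "Zeno's paradox": ["problems of infinity"],
}


def path_to_root(node):
    """List the node followed by each broader class up to the top."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def hierarchy_route(start, goal):
    """Backtrack from start to the nearest shared class, then descend to goal."""
    up = path_to_root(start)
    down = path_to_root(goal)
    shared = next(n for n in up if n in down)
    return up[: up.index(shared) + 1] + down[: down.index(shared)][::-1]


def link_route(start, goal):
    """Breadth-first search over hyperlinks; returns a shortest chain of pages."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in LINKS.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


tree_path = hierarchy_route("tortoises", "Zeno's paradox")
web_path = link_route("tortoises", "Zeno's paradox")
print(len(tree_path) - 1, "steps through the hierarchy")
print(len(web_path) - 1, "steps through hyperlinks")
```

In this toy data the hierarchical route has to undo every commitment back up to "things" before it can descend again, whereas the hyperlinked route reaches the same entry in two clicks, with nothing in the links themselves saying why the pages are connected.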
We can focus the old and new ways of organizing and
retrieving information, and see the attraction of each, by contrasting
the old library culture and the new kind of libraries
made possible by hyperlinks. Table 1 contrasts a meaning-driven,
semantic structuring of information with a formal,
syntactic structuring, where meaning plays no role.
Clearly, the user of a hyper-connected library would no
longer be a modern subject with a fixed identity who desires a
more complete and reliable model of the world, 6 but rather a
postmodern, protean being ready to be opened up to ever new
horizons. Such a new being is not interested in collecting what is
significant but in connecting to as wide a web of information as possible.
Web surfers embrace proliferating information as a contribution
to a new form of life in which surprise and wonder
are more important than meaning and usefulness. This
approach appeals especially to those who like the idea of rejecting
hierarchy and authority and who don't have to worry
about the practical problem of finding relevant information.
OLD LIBRARY CULTURE                  HYPERLINKED CULTURE
Classification                       Diversification
  a. stable                            a. flexible
  b. hierarchically organized          b. single-level
  c. defined by specific interests     c. allowing all possible associations
Careful selection                    Access to everything
  a. quality of editions               a. inclusiveness of editions
  b. authenticity of the text          b. availability of texts
  c. eliminate old material            c. save everything
Permanent collections                Dynamic collections
  a. preservation of a fixed text      a. intertextual evolution
  b. interested browsing               b. playful surfing
Table 1: Opposition between old and new systems of information retrieval
So postmodern theorists and artists embrace hyperlinks as a
way of freeing us from anonymous specialists organizing our
databases and deciding for us what is relevant to what. Quantity
of connections is valued above the quality of these connections.
The idea has an all-American democratic ring. As Fareed
Zakaria, the managing editor of Foreign Affairs, observes: "The
Internet is profoundly disrespectful of tradition, established
order, and hierarchy, and that is very American." 7
Those who want to use the available data, however, have to
find the information that is meaningful and relevant to them
given their current concerns. But, given that in a hyperlinked
database anything may be linked to anything else, this is a very
challenging task. Since hyperlinks are made for all sorts of
reasons and since there is only one basic type of link, the
searcher cannot use the meaning of the links to arrive at the
information he is seeking. The problem is that, as far as
meaning is concerned, all hyperlinks are alike. As one
researcher puts it, the retrieval job is worse than looking for a
needle in a haystack; it's like looking for a specific needle in a
needle stack. Given the lack of any semantic content determining
the connections, it looks like any means for searching
the Web must be a formal, syntactic technique called data
mining that tracks statistical relations such as frequency
between meaningless data.
The difficulty of using meaningless mechanical operations
to retrieve meaningful information did not await the arrival
of the Net. It arises whenever anyone seeks to retrieve information
relevant to a specific purpose from a database not
organized to serve that particular purpose. In a typical case,
researchers may be looking for published papers on a topic
they are interested in, but the mere words in the titles of the
papers do not enable a search engine to return just those
documents or websites that meet a specific searcher's needs.
To understand the problem it helps to distinguish Data
Retrieval (DR) from Information Retrieval (IR). David Blair,
Professor of Computer and Information Systems at the University
of Michigan, 8 explains the difference:
Data Base Management Systems have revolutionized the
management and retrieval of data - we can call directory
assistance and get the phone number of just about anyone
anywhere in the US or Canada; we can walk to an ATM in a city
far away from our home town and withdraw cash from our
home bank account; we can go to a ticket office in Michigan
and buy a reserved seat for a play in San Francisco; etc. All of
this is possible, in part, because of the large-scale, reliable
database management systems that have been developed
over the last 35 years.
Data retrieval operates on entities like "names,"
"addresses," "phone numbers," "account balances,"
"social security numbers" - all items that typically have
clear, unambiguous references. But although some of the
representations of documents have clear senses and
references - like the author or title of a document - many IR
searches are not based on authors or titles, but are interested
in the "intellectual content" of the documents (e.g., "Get
me any reports that analyse Central European investment
prospects in service industries"). Descriptions of intellectual
content are almost never determinate, and on large retrieval
systems, especially the WWW, subject descriptions are
usually hopelessly imprecise/indeterminate for all but the
most general searching. 9
So searching for a known URL on the WWW is simple and
easy; it has the precision and directedness of data retrieval. But
searching for a Web page with specific intellectual content
using Web search engines can be very difficult, sometimes
impossible.
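A minimal sketch may help fix the distinction Blair draws. The directory entry, the two document titles, and the function names below are all invented for illustration. The first lookup has a single determinate answer; the second counts shared words between a query and each title, which is only one crude stand-in for what a word-matching engine does, and it rewards the wrong document.

```python
# Data retrieval: a hypothetical directory keyed by an unambiguous name.
DIRECTORY = {"Ada Example": "555-0100"}

# Information retrieval: hypothetical titles standing in for documents.
DOCS = {
    "doc1": "Outlook for banking and tourism ventures across Hungary and Poland",
    "doc2": "A history of Central European folk music",
}


def data_lookup(name):
    """Exact-key lookup: the entry is either present or it is not."""
    return DIRECTORY.get(name)


def keyword_scores(query):
    """Count the words each title shares with the query (no grasp of meaning)."""
    query_words = set(query.lower().split())
    return {
        doc_id: len(query_words & set(title.lower().split()))
        for doc_id, title in DOCS.items()
    }


phone = data_lookup("Ada Example")
scores = keyword_scores(
    "Central European investment prospects in service industries"
)
print(phone)
print(scores)
```

Under these assumptions doc1, which a human reader would judge relevant to the request, shares no words with the query, while doc2 scores higher only because it happens to contain "Central" and "European".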
The difference between Data Retrieval and Document
Retrieval can be summed up as shown in Table 2.
Before the advent of the Web and Web search engines, the
attempted solution to the document retrieval problem was to
have human beings - that is, indexers who understood the
documents - help describe their contents so that they might
be retrieved by those who wanted them. But there simply
aren't enough cataloguers to index the Web - it's too large
and it's growing too fast.
The early search engines simply created an index of words
associated with a list of documents that contained them, with
scoring based on whether or not the word was in the title,
where Google is concerned, the more votes as to importance,
that is the more hyper-connected websites, the better. Thus,
with the arrival of Google, pessimism turned to optimism
overnight. Page and Brin conclude: "We are optimistic that
our centralized Web search engine architecture will improve
in its ability to cover the pertinent text information over time
and that there is a bright future for search. "24
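The two approaches mentioned above, a word index scored by whether the word appears in the title, and a ranking in which more links count as more votes, can be sketched as follows. This is a toy under stated assumptions: the three pages, their URLs, the score values 1 and 2, and the function names are hypothetical, and the plain count of inbound links used here is a simplification rather than Google's actual ranking method.

```python
from collections import defaultdict

# Hypothetical pages: a title, body text, and outgoing hyperlinks.
PAGES = {
    "a.example": {"title": "Tortoise care",
                  "body": "feeding a tortoise at home",
                  "links": ["b.example"]},
    "b.example": {"title": "Reptile notes",
                  "body": "tortoise and lizard facts",
                  "links": []},
    "c.example": {"title": "Pets",
                  "body": "see the tortoise page",
                  "links": ["b.example", "a.example"]},
}


def build_index(pages):
    """Map each word to {url: score}; score 2 if in the title, else 1."""
    index = defaultdict(dict)
    for url, page in pages.items():
        for word in page["body"].lower().split():
            index[word].setdefault(url, 1)
        for word in page["title"].lower().split():
            index[word][url] = 2
    return index


def inbound_votes(pages):
    """Count, for each page, how many links from other listed pages point to it."""
    votes = {url: 0 for url in pages}
    for page in pages.values():
        for target in page["links"]:
            if target in votes:
                votes[target] += 1
    return votes


def search(word, index, votes=None):
    """Rank pages containing the word by title score, or by votes if given."""
    hits = index.get(word.lower(), {})
    if votes is None:
        return sorted(hits, key=lambda url: (-hits[url], url))
    return sorted(hits, key=lambda url: (-votes[url], url))


index = build_index(PAGES)
votes = inbound_votes(PAGES)
print(search("tortoise", index))         # ordered by word placement
print(search("tortoise", index, votes))  # ordered by inbound links
```

The word index ranks pages by what their own authors wrote; the vote count ranks them by what other pages link to, which is why the two orderings can differ for the same query.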
ONE THE HYPE ABOUT HYPERLINKS
1 National Public Radio, "The Future of Computing", Talk of the Nation,
Science Friday, July 7, 2000.
2 Fallows, D., "Search Engine Users", Pew Foundation, URL
http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf, 2005.12.
3 S. Lawrence and C. L. Giles, NEC Research Institute, "Searching the
World Wide Web", Science, 280, April 3, 1998, p. 98. Moreover, the size
isn't just the number of Websites or pages; the number of hyperlinks
embedded in the Web pages is even larger.
4 There has been some interesting litigation of late trying to stop this
"free-linking" of anything to anything, in which parties have sued
others who made links to the plaintiffs' Web page. Of course, this is a
fraction of a fraction of a per cent, and is unlikely to have any significant
effect on the way the Web is run which has been called a "loose
ad-hocracy". It no doubt just reflects the dying gasp of the old guard
who would like to place at least some limits on the eventual linking of
everything to everything.
5 The Dewey decimal system was organized in this way. It did not even
allow the same item to be filed under two different categories, but now
librarians have more leeway and file the same information under several
different headings. For example, Philosophy of Religion would presumably
be filed under Philosophy and Religion. Still, however, there is an
agreed-upon hierarchical taxonomy.
6 What people now refer to as the modern subject came into being in
the early seventeenth century as - thanks to Luther, the printing press,
and the new science - people began to think of themselves as self-sufficient
individuals. Descartes introduced the idea of the subject as
what underlay changing mental states, and Kant argued that, as the
objectifier of everything, the subject must be free and autonomous. As
we shall see in Chapter 4, Søren Kierkegaard concluded that each one of
us is a subject called upon to take on a fixed identity that defines who
one is and what is meaningful in one's world.
7 Steve Lohr, "Ideas and Trends: Net Americana; Welcome to the
Internet, the First Global Colony," The New York Times, January 9, 2000.
8 David Blair's book, Language and Representation in Information Retrieval, New
York, Elsevier Science, 1990, was chosen "Best Information Science
Book of the Year" in 1999 by the American Society for Information
Science, and Blair himself was named "Outstanding Researcher of the
Year" by the same society in the same year.
9 David Blair, Wittgenstein, Language and Information, Springer, 2006, 287.
10 See H. Dreyfus, What Computers (Still) Can't Do, 3rd edn, Cambridge, MA,
MIT Press, 1992.