Computer guidance

The Hype about Hyperlinks

One

The AI Problem, as it's called - of making machines behave close

enough to how humans behave intelligently - ... has not been solved.

Moreover, there is nothing on the horizon that says, I see some light.

Words like "artificial intelligence," "intelligent agents," "servants" - all

these hyped words we hear in the press - are restatements of the mess

and the problem we're in.

We would love to have a machine that could go and search the Web,

and our personal stores, knowing our preferences, and knowing what

we mean when we say something. But we just don't have anything at

that level.

Michael Dertouzos, Director, Laboratory for Computer Science, MIT 1

Successful retrieval of information is the primary goal of most

Web users. According to a Pew Foundation report: "Search

engines are highly popular among Internet users. Searching

the Internet is one of the earliest activities people try when

they first start using the Internet, and most users quickly feel

comfortable with the act of searching ... 84% of internet users

have used search engines and, on a given day, 56% of those

online use a search engine." 2 As everyone who has searched on

the Web knows, the power of search engines has changed

dramatically in the past decade. To understand the current situation

and to anticipate future developments we need to understand

the problems involved in providing quick and reliable

searches, how it was done a decade ago and how it is done now.

When I finished the manuscript of this book in 1999, the

people whose judgment I trusted were deeply pessimistic

HUBERT L. DREYFUS

On the Internet

Second edition

LONDON AND NEW YORK

about the future of information retrieval on the World Wide

Web. The issues they raised are still relevant, although, as we

shall soon see, their pessimism is not. In this second edition I

will retain a shortened and lightly edited version of my opening

remarks from the first edition up to the point where the then

current understanding of the problem of search became history,

and the attitude of the reliable researchers changed

almost overnight. Then, in the new material that makes up the

second half of this chapter, I'll explain what is possible now,

how it became possible, and, based on these new developments,

I'll predict where search is going from here. In 1999 I wrote:

The Web is vast and growing exuberantly. At a recent count, it

had over a billion pages and it continues to grow at the rate of

at least a million pages a day. 3 (It is characteristic of the Web

that these statistics, as you read them, are already far out of

date.) There is an amazing amount of useful information on

the Web but it is getting harder and harder to find. The problem

arises from the way information is organized (or, better,

disorganized) on the Web. The way the Web works, each

element of this welter of information is linked to many other

elements by hyperlinks. Such links can link any element of

information to any other element for any reason that happens

to occur to whoever is making the link. No authority or agreed-upon

catalogue system constrains the linker's associations. 4

Hyperlinks have not been introduced because they are

more useful for retrieving relevant information than the old

systematic ordering. Rather, they are the natural way to use the

speed and processing power of computers to relate a vast

amount of information without needing to understand it or

impose any authoritarian or even generally accepted structure

on it. But, when everything can be linked to everything else

without regard for purpose or meaning, the vast size of the

Web and the arbitrariness of the links make it extremely difficult

for people desiring specific information to find the

information they seek.

The traditional way of ordering information depends on

someone - a zoologist, a librarian, a philosopher - having

worked out a classification scheme according to the meanings

of the terms involved and the interests of the users. 5 People can

then enter new information into this classification scheme on

the basis of what they understand to be the meaning of the

categories and of the new information. If one wants to use

the information, one has to depend on those who developed

the classifications to have organized the information on the

basis of its meaning so that users can find the information that

is relevant given their interests.

Since Aristotle, we have been accustomed to organize information

in a hierarchy of broader and broader classes, each

including the narrower ones beneath it. So we descend from

things, to living things, to animals, to mammals, to dogs, to

collies, to Lassie. When information is organized in such a

vertical database, the user can follow out the meaningful links,

but the user is forced to commit to a certain class of information

before he can view more specific data that fall under that

class. For example, I have to commit to an interest in animals

before I can find out what I want to know about tortoises; and

once having made that commitment to the animal line in the

database, I can't then examine the data on problems of infinity

without backtracking through the commitments I have made.
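The commitment and backtracking described above can be sketched with a toy class hierarchy. This is a minimal illustration only: the `TAXONOMY` contents (including the "reptiles", "abstract things", and "mathematics" classes) and the helper names are hypothetical and do not come from the text.

```python
# Hypothetical toy taxonomy; only things/living things/animals/mammals/dogs/
# collies/Lassie, tortoises, and problems of infinity are named in the passage.
TAXONOMY = {
    "things": {
        "living things": {
            "animals": {
                "mammals": {"dogs": {"collies": {"Lassie": {}}}},
                "reptiles": {"tortoises": {}},
            }
        },
        "abstract things": {"mathematics": {"problems of infinity": {}}},
    }
}


def find_path(tree, target, path=()):
    """Return the chain of classes from the root down to target, or None."""
    for name, children in tree.items():
        here = path + (name,)
        if name == target:
            return here
        found = find_path(children, target, here)
        if found is not None:
            return found
    return None


def steps_between(tree, a, b):
    """Count moves from a to b: back up to the shared class, then down again."""
    pa, pb = find_path(tree, a), find_path(tree, b)
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return (len(pa) - shared) + (len(pb) - shared)


print(find_path(TAXONOMY, "Lassie"))
print(steps_between(TAXONOMY, "tortoises", "problems of infinity"))
```

In this sketch, getting from tortoises to problems of infinity means undoing every commitment down the animal line before descending a different branch.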

When information is organized horizontally by hyperlinks,

however, as it is on the Web, instead of the relation between a

class and its members, the organizing principle is simply the

inter-connectedness of all elements. There are no hierarchies;

everything is linked to everything else on a single level, and

meaning is irrelevant. Thus hyperlinks allow the user to move

directly from one data entry to any other, as long as they are

related in at least some tenuous fashion. The whole of the

Web lies only a few links away from any page. With a hyperlinked

database, the user is encouraged to traverse a vast network

of information, all of which is equally accessible and

none of which is privileged. So, for instance, among the sites

that contain information on tortoises suggested to me by

my browser, I might click on the one called "Tortoises -

compared to hares", and be transported instantly to an entry

on Zeno's paradox.
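By contrast, a hyperlinked collection can be modeled as a flat graph in which any page may point to any other. The following is a rough sketch under stated assumptions: the `LINKS` table and the page names other than "Tortoises - compared to hares" and Zeno's paradox are made up for illustration, and breadth-first search is simply one standard way to count how many link-follows separate two pages.

```python
from collections import deque

# Hypothetical pages and links, loosely modeled on the passage's example.
LINKS = {
    "tortoise care": ["Tortoises - compared to hares", "reptile vets"],
    "Tortoises - compared to hares": ["Zeno's paradox"],
    "Zeno's paradox": ["problems of infinity", "tortoise care"],
    "reptile vets": [],
    "problems of infinity": [],
}


def clicks_needed(links, start, goal):
    """Breadth-first search: fewest link-follows from start to goal, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        if page == goal:
            return dist
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(clicks_needed(LINKS, "tortoise care", "problems of infinity"))
```

Nothing in the graph records why a link exists, so the search can count clicks but cannot tell a relevant hop from an arbitrary one.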

We can focus the old and new ways of organizing and

retrieving information, and see the attraction of each, by contrasting

the old library culture and the new kind of libraries

made possible by hyperlinks. Table 1 contrasts a meaning-driven,

semantic structuring of information with a formal,

syntactic structuring, where meaning plays no role.

Clearly, the user of a hyper-connected library would no

longer be a modern subject with a fixed identity who desires a

more complete and reliable model of the world, 6 but rather a

postmodern, protean being ready to be opened up to ever new

horizons. Such a new being is not interested in collecting what is

significant but in connecting to as wide a web of information as possible.

Web surfers embrace proliferating information as a contribution

to a new form of life in which surprise and wonder

are more important than meaning and usefulness. This

approach appeals especially to those who like the idea of rejecting

hierarchy and authority and who don't have to worry

about the practical problem of finding relevant information.

OLD LIBRARY CULTURE                    HYPERLINKED CULTURE

Classification                         Diversification
  a. stable                              a. flexible
  b. hierarchically organized            b. single-level
  c. defined by specific interests       c. allowing all possible associations

Careful selection                      Access to everything
  a. quality of editions                 a. inclusiveness of editions
  b. authenticity of the text            b. availability of texts
  c. eliminate old material              c. save everything

Permanent collections                  Dynamic collections
  a. preservation of a fixed text        a. intertextual evolution
  b. interested browsing                 b. playful surfing

Table 1: Opposition between old and new systems of information retrieval

So postmodern theorists and artists embrace hyperlinks as a

way of freeing us from anonymous specialists organizing our

databases and deciding for us what is relevant to what. Quantity

of connections is valued above the quality of these connections.

The idea has an all-American democratic ring. As Fareed

Zakaria, the managing editor of Foreign Affairs, observes: "The

Internet is profoundly disrespectful of tradition, established

order, and hierarchy, and that is very American." 7

Those who want to use the available data, however, have to

find the information that is meaningful and relevant to them

given their current concerns. But, given that in a hyperlinked

database anything may be linked to anything else, this is a very

challenging task. Since hyperlinks are made for all sorts of

reasons and since there is only one basic type of link, the

searcher cannot use the meaning of the links to arrive at the

information he is seeking. The problem is that, as far as

meaning is concerned, all hyperlinks are alike. As one

researcher puts it, the retrieval job is worse than looking for a

needle in a haystack; it's like looking for a specific needle in a

needle stack. Given the lack of any semantic content determining

the connections, it looks like any means for searching

the Web must be a formal, syntactic technique called data

mining that tracks statistical relations such as frequency

between meaningless data.

The difficulty of using meaningless mechanical operations

to retrieve meaningful information did not await the arrival

of the Net. It arises whenever anyone seeks to retrieve information

relevant to a specific purpose from a database not

organized to serve that particular purpose. In a typical case,

researchers may be looking for published papers on a topic

they are interested in, but the mere words in the titles of the

papers do not enable a search engine to return just those

documents or websites that meet a specific searcher's needs.

To understand the problem it helps to distinguish Data

Retrieval (DR) from Information Retrieval (IR). David Blair,

Professor of Computer and Information Systems at the University

of Michigan, 8 explains the difference:

Data Base Management Systems have revolutionized the

management and retrieval of data - we can call directory

assistance and get the phone number of just about anyone

anywhere in the US or Canada; we can walk to an ATM in a city

far away from our home town and withdraw cash from our

home bank account; we can go to a ticket office in Michigan

and buy a reserved seat for a play in San Francisco; etc. All of

this is possible, in part, because of the large-scale, reliable

database management systems that have been developed

over the last 35 years.

Data retrieval operates on entities like "names,"

"addresses," "phone numbers," "account balances,"

"social security numbers" - all items that typically have

clear, unambiguous references. But although some of the

representations of documents have clear senses and

references - like the author or title of a document - many IR

searches are not based on authors or titles, but are interested

in the "intellectual content" of the documents (e.g., "Get

me any reports that analyse Central European investment

prospects in service industries"). Descriptions of intellectual

content are almost never determinate, and on large retrieval

systems, especially the WWW, subject descriptions are

usually hopelessly imprecise/indeterminate for all but the

most general searching. 9

So searching for a known URL on the WWW is simple and

easy; it has the precision and directedness of data retrieval. But

searching for a Web page with specific intellectual content

using Web search engines can be very difficult, sometimes

impossible.
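The contrast can be made concrete with a small sketch. Everything here is hypothetical: the `PHONE_BOOK` record, the `REPORTS` titles, and the naive word-overlap matcher are illustrative stand-ins, not a description of any real database or search engine. The query is the one quoted from Blair above.

```python
# Data retrieval: a key with a clear, unambiguous reference.
PHONE_BOOK = {"Ada Example": "555-0100"}  # hypothetical record

# Document retrieval: hypothetical report titles standing in for documents.
REPORTS = [
    "Outlook for Polish and Hungarian banking and tourism ventures",
    "Central European weather patterns",
    "Investment prospects in Asian manufacturing",
]


def keyword_search(docs, query):
    """Return the docs that share at least one lowercase word with the query."""
    words = set(query.lower().split())
    return [d for d in docs if words & set(d.lower().split())]


QUERY = "Central European investment prospects in service industries"

print(PHONE_BOOK["Ada Example"])       # exact lookup: one determinate answer
print(keyword_search(REPORTS, QUERY))  # word overlap: misses the relevant report
```

In this toy collection the one report a human would judge relevant shares no words with the query, while the two irrelevant ones match on surface vocabulary.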

The difference between Data Retrieval and Document

Retrieval can be summed up as shown in Table 2.

Before the advent of the Web and Web search engines, the

attempted solution to the document retrieval problem was to

have human beings - that is, indexers who understood the

documents - help describe their contents so that they might

be retrieved by those who wanted them. But there simply

aren't enough cataloguers to index the Web - it's too large

and it's growing too fast.

The early search engines simply created an index of words

associated with a list of documents that contained them, with

scoring based on whether or not the word was in the title, [...]

where Google is concerned, the more votes as to importance,

that is the more hyper-connected websites, the better. Thus,

with the arrival of Google, pessimism turned to optimism

overnight. Page and Brin conclude: "We are optimistic that

our centralized Web search engine architecture will improve

in its ability to cover the pertinent text information over time

and that there is a bright future for search." 24
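The two ideas mentioned in this paragraph, a word index with a bonus for title matches and a count of links treated as votes, can be sketched as follows. This is only a rough illustration: the `PAGES` data, the `TITLE_BONUS` weight, and the plain inbound-link count are assumptions made for the example and should not be read as the actual scoring used by any early search engine or by Google.

```python
from collections import defaultdict

# Hypothetical pages: title, body text, and outgoing links.
PAGES = {
    "a": {"title": "Tortoise care", "body": "feeding a tortoise", "links": ["b"]},
    "b": {"title": "Reptile guide", "body": "tortoise and lizard basics", "links": []},
    "c": {"title": "Pet shop", "body": "we sell tortoise food", "links": ["b"]},
}

TITLE_BONUS = 2  # assumed weight; the text gives no numbers


def build_index(pages):
    """Map each lowercase word to the set of page ids containing it."""
    index = defaultdict(set)
    for pid, page in pages.items():
        for word in (page["title"] + " " + page["body"]).lower().split():
            index[word].add(pid)
    return index


def word_score(pages, index, word):
    """One point for containing the word, plus a bonus if it is in the title."""
    word = word.lower()
    scores = {}
    for pid in index.get(word, ()):
        in_title = word in pages[pid]["title"].lower().split()
        scores[pid] = 1 + (TITLE_BONUS if in_title else 0)
    return scores


def inbound_votes(pages):
    """Count the links pointing at each page, treating every link as one vote."""
    votes = {pid: 0 for pid in pages}
    for page in pages.values():
        for target in page["links"]:
            if target in votes:
                votes[target] += 1
    return votes


index = build_index(PAGES)
print(word_score(PAGES, index, "tortoise"))
print(inbound_votes(PAGES))
```

Here the word-based score favors page "a" because the query word is in its title, while the link count favors page "b" because two other pages point to it.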

ONE THE HYPE ABOUT HYPERLINKS

1 National Public Radio, "The Future of Computing", Talk of the Nation,

Science Friday, July 7, 2000.

2 Fallows, D., "Search Engine Users", Pew Foundation, URL

http://www.pewinternet.org/pdfs/PIP_Searchengine_users.pdf, 2005.12.

3 S. Lawrence and C. L. Giles, NEC Research Institute, "Searching the

World Wide Web", Science, 280, April 3, 1998, p. 98. Moreover, the size

isn't just the number of Websites or pages; the number of hyperlinks

embedded in the Web pages is even larger.

4 There has been some interesting litigation of late trying to stop this

"free-linking" of anything to anything, in which parties have sued

others who made links to the plaintiffs' Web page. Of course, this is a

fraction of a fraction of a per cent, and is unlikely to have any significant

effect on the way the Web is run which has been called a "loose

ad-hocracy". It no doubt just reflects the dying gasp of the old guard

who would like to place at least some limits on the eventual linking of

everything to everything.

5 The Dewey decimal system was organized in this way. It did not even

allow the same item to be filed under two different categories, but now

librarians have more leeway and file the same information under several

different headings. For example, Philosophy of Religion would presumably

be filed under Philosophy and Religion. Still, however, there is an

agreed-upon hierarchical taxonomy.

6 What people now refer to as the modern subject came into being in

the early seventeenth century as - thanks to Luther, the printing press,

and the new science - people began to think of themselves as self-sufficient

individuals. Descartes introduced the idea of the subject as

what underlay changing mental states, and Kant argued that, as the

objectifier of everything, the subject must be free and autonomous. As

we shall see in Chapter 4, Søren Kierkegaard concluded that each one of

us is a subject called upon to take on a fixed identity that defines who

one is and what is meaningful in one's world.

7 Steve Lohr, "Ideas and Trends: Net Americana; Welcome to the

Internet, the First Global Colony," The New York Times, January 9, 2000.

8 David Blair's book, Language and Representation in Information Retrieval, New

York, Elsevier Science, 1990, was chosen "Best Information Science

Book of the Year" in 1999 by the American Society for Information

Science, and Blair himself was named "Outstanding Researcher of the

Year" by the same society in the same year.

9 David Blair, Wittgenstein, Language and Information, Springer, 2006, 287.

10 See H. Dreyfus, What Computers (Still) Can't Do, 3rd edn, Cambridge, MA,

MIT Press, 1992.