Newsgroups: alt.nettime




Subject: <nettime> Brewster Kahle: Archiving the Internet
From: mail@mail.thing.at (mail)
Date: 7 Oct 1997 07:03:30 +0100


* * * * *

To: nettime-l@Desk.nl
Date: Mon, 6 Oct 1997 23:15:57 +0100 (MET)
From: "Pit Schultz" <pit@uropax.contrib.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 8bit
Sender: owner-nettime-l@Desk.nl
Precedence: bulk

[ another classic text... of a 'good' digerati.
but what defines 'public material' on the web, anyway?
and who owns its archive? the pragmatist says: try it.
the _net museum_ connects to a longer low intensity thread
about archiving, publishing, content economy. if you have
more stuff contextualizing *public content* instead
of following the agenda of privatizing human and not
so human communication then send it. btw. there is a whole
issue of mediamatic on 'storage mania'. and don't forget
to burn your own cd's now ... /p ]

http://www.archive.org/sciam_article.html

Archiving the Internet

Brewster Kahle
Internet Archive
11/4/96

Bold efforts to record the entire Internet are expected
to lead to new services.

Submitted to Scientific American for March 1997 Issue

The early manuscripts at the Library of Alexandria were
burned, much of early printing was not saved, and many
early films were recycled for their silver content.
While the Internet’s World Wide Web is unprecedented in
spreading the popular voice of millions that would never
have been published before, no one recorded these
documents and images from 1 year ago. The history of
early materials of each medium is one of loss and
eventual partial reconstruction through fragments. A
group of entrepreneurs and engineers is determined not
to let this happen to the early Internet.

Even though the documents on the Internet are the easy
documents to collect and archive, the average lifetime
of a document is 75 days and then it is gone. While the
changing nature of the Internet brings a freshness and
vitality, it also creates problems for historians and
users alike. A visiting professor at MIT, Carl Malamud,
wanted to write a book citing some documents that were
only available on the Internet’s World Wide Web system,
but was concerned that future readers would get a
familiar error message "404 Document not found" by the
time the book was published. He asked if the Internet
was "too unreliable" for scholarly citation.

Where libraries serve this role for books and
periodicals that are no longer sold or easily
accessible, no such equivalent yet exists for digital
information. With the rise of the importance of digital
information to the running of our society and culture,
accompanied by the drop in costs for digital storage and
access, these new digital libraries will soon take
shape.

The Internet Archive is one such new organization. It is
collecting the public materials on the Internet to
construct a digital library. The first step is to
preserve the contents of this new medium. This
collection will include all publicly accessible World
Wide Web pages, the Gopher hierarchy, the Netnews
bulletin board system, and downloadable software.

If the example of paper libraries is a guide, this new
resource will offer insights into human endeavor and
lead to the creation of new services. Never before has
this rich a cultural artifact been so easily available
for research. Where historians have scattered club
newsletters and fliers, physical diaries and letters,
from past epochs, the World Wide Web offers a
substantial collection that is easy to gather, store,
and sift through when compared to its paper antecedents.
Furthermore, as the Internet becomes a serious
publishing system, then these archives and similar ones
will also be available to serve documents that are no
longer "in print".

Apart from historical and scholarly research uses, these
digital archives might be able to help with some common
infrastructure complaints:

o Internet seems unreliable: "Document not
found"
o Information lacks context: "Where am I? Can I
trust this information?"
o Navigation: "Where should I go next?"

When working with books, libraries help with some of
these issues, with "the stacks" of books, links to other
libraries and librarians to help patrons.

Preservation of our Digital History

Where we can read the 400-year-old books printed by
Gutenberg, it is often difficult to read a 15-year-old
computer disk. The Commission on Preservation and
Access in Washington DC has been researching the thorny
problems faced trying to ensure the usability of the
digital data over a period of decades. Where the
Internet Archive will move the data to new media and new
operating systems every 10 years, this only addresses
part of the problem of preservation.

Using the saved files in the future may require
conversion to new file formats. Text, images, audio, and
video are undergoing changes at different rates. Since
the World Wide Web currently has most of its textual and
image content in only a few formats, we hope that it
will be worth translating in the future, whereas we
expect that the short-lived or seldom-used formats will
not be worth the future investment. Saving the software to
read discarded formats often poses problems of
preserving or simulating the machines that they ran on.

The physical security of the data must also be
considered. Natural and political forces can destroy the
data collected. Political ideologies change over time,
making what was once legal illegal. We are
looking for partners in other geographic and national
locations to provide a robust archive system over time.
To give some level of security from commercial forces
that might want exclusive access to this archive, the
data is donated to a special non-profit trust for
long-term care taking. This non-profit organization is
endowed with enough money to perform the necessary
maintenance on the storage media over the years.

Packaging enough meta-data (information about the
information) is necessary to inform future users. Since
we do not know what future researchers will be
interested in, we are documenting the methods of
collection and attempting to be complete in those
collections. As researchers start to use these data, the
methods and data recorded can be refined.

Technical Issues of Gathering Data

Building the Internet Archive involves gathering,
storing, and serving the terabytes of information that
at some point were publicly accessible on the Internet.

Gathering these distributed files requires computers to
constantly probe the servers looking for new or updated
files. The Internet has several different subsystems to
make information available such as the World Wide Web
(WWW), File Transfer Protocol (FTP), Gopher, and
Netnews. New systems for three-dimensional environments,
chat facilities, and distributed software require new
efforts to gather these files. Each of these systems
requires special programs to probe and download
appropriate files. Estimating the current size,
turnover, and growth of the public Internet has proven
tricky because of the dynamic nature of the systems
being probed.

Protocol   Number of Sites      Total Data   Change rate

WWW        400,000              1,500GB      600GB/month
Gopher     5,000                100GB        declining (from Veronica Index)
FTP        10,000               5,000GB      not known
Netnews    20,000 discussions   240GB        16GB/month

The World Wide Web is vast, growing rapidly, and filled
with transient information. Estimated at 50 million
pages with the average page online for only 75 days, the
turnover is considerable. Furthermore, the number of
pages is reported to be doubling every year. Using the
average web page size of 30 kilobytes (including
graphics) brings the current size of the Web to 1.5
terabytes (or 1.5 million megabytes).
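
The arithmetic behind this estimate can be checked in a few
lines of Python. This is only a sketch using the article's
decimal units (1 GB = 1,000,000 KB). The turnover function
rests on an assumption that is not stated in the text: that
pages are replaced at a steady rate over their 75-day
average lifetime.

```python
# Back-of-the-envelope estimate of Web size and turnover, using the
# figures quoted in the article and decimal units (1 GB = 1,000,000 KB).

PAGES = 50_000_000          # estimated number of Web pages
AVG_PAGE_KB = 30            # average page size, including graphics
AVG_LIFETIME_DAYS = 75      # average time a page stays online


def web_size_gb(pages: int, avg_page_kb: float) -> float:
    """Total size of the Web in gigabytes."""
    return pages * avg_page_kb / 1_000_000


def monthly_turnover_gb(total_gb: float, lifetime_days: float,
                        month_days: float = 30) -> float:
    """Data replaced per month, ASSUMING steady replacement over the lifetime."""
    return total_gb * month_days / lifetime_days


total = web_size_gb(PAGES, AVG_PAGE_KB)
turnover = monthly_turnover_gb(total, AVG_LIFETIME_DAYS)
print(f"Web size: {total:,.0f} GB ({total / 1000:.1f} TB)")
print(f"Turnover: {turnover:,.0f} GB/month")
```

Under that steady-replacement assumption the result agrees
with the 600GB/month change rate listed for the WWW in the
table above.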

To gather the World Wide Web requires computers
specifically programmed to "crawl" the net by
downloading a web page, then finding the links to
graphics and other pages on it, and then downloading
those and continuing the process. This is the technique
that the search engines, such as Altavista, use to
create their indices to the World Wide Web. The Internet
Archive currently holds 600GB of information of all
types. In 1997 we will have collected a snapshot of the
documents and images.
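
The crawl loop described above (download a page, find its
links, download those, continue) might be sketched as
follows. This is an illustration, not the Archive's or any
search engine's actual crawler. The `fetch` callback, the
`FAKE_WEB` dictionary and the example.org URLs are
hypothetical stand-ins so that the sketch runs without a
network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the targets of <a href> and <img src> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.links.append(value)


def crawl(start_url, fetch, max_pages=1000):
    """Breadth-first crawl: download a page, queue its links, repeat.

    `fetch` is a caller-supplied function returning the content for a
    URL, or None if the document cannot be retrieved.
    """
    seen = {start_url}
    queue = deque([start_url])
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # e.g. "404 Document not found"
        archive[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return archive


# A tiny hypothetical "web" held in memory so the sketch runs offline.
FAKE_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <img src="logo.gif">',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/gone.html">x</a>',
    "http://example.org/logo.gif": "",
}
snapshot = crawl("http://example.org/", FAKE_WEB.get)
print(sorted(snapshot))
```

The `seen` set keeps the crawler from downloading the same
document twice when pages link back to each other.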

The information collected by these "crawlers" is not,
unfortunately, all the information that can be seen on
the Internet. Much of the data is restricted by the
publisher, or stored in databases that are accessible
through the World Wide Web but are not available to the
simple crawlers. Other documents might have been
inappropriate to collect in the first place, so authors
can mark files or sites to indicate that crawlers are
not welcome. Thus the collected Web will be able to give
a feel of what the web looked like at a particular time,
but will not simulate the full online environment.
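
The text does not say how authors mark files or sites as
off limits. One widely used convention is a robots.txt file
at the root of a site, and the sketch below assumes that
mechanism. It uses Python's standard `urllib.robotparser`
module. The rules, the "example-archiver" crawler name and
the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents a site owner might publish.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_filter(robots_txt: str, user_agent: str):
    """Return a function telling whether `user_agent` may fetch a URL."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return lambda url: rules.can_fetch(user_agent, url)


allowed = make_filter(ROBOTS_TXT, "example-archiver")
print(allowed("http://example.org/index.html"))
print(allowed("http://example.org/private/diary.html"))
```

A crawler would consult such a filter before each download
and skip any URL for which it returns False.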

While the current sizes are large, the Internet is
continuing to grow rapidly. When it is common to connect
one’s home camcorder to the upcoming high bandwidth
Internet, it will not be practical to archive it all. At
some point we will have to become more selective about
what data will be of the most value in the future, but
currently we can afford to gather it all.

Storing Terabytes of Data Cost Effectively

Crucial to archiving the Internet, and digital libraries
in general, is the cost effective storage of terabytes
of data while still allowing timely access. Since the
cost of storage has been dropping rapidly, the
archiving cost is dropping. The flip side, of course, is
that people are making more information available.

To stay ahead of this onslaught of text, images, and
soon video information we believe we have to store the
information for much less money than the original
producers paid for their storage. It would be
impractical to spend as much on our storage as everyone
else combined.

Storage Technology     Cost per GigaByte   Random access time

Memory (RAM)           $12,000/GB          70 nanoseconds
Hard Disk              $200/GB             15 milliseconds
Optical Disk Jukebox   $140/GB             10 seconds
Tape Jukebox           $20/GB              4 minutes
Tapes on shelf         $2/GB               human assistance required

(1 GigaByte = 1000 MegaBytes, 1 TeraByte = 1000 GigaBytes.
A GigaByte is roughly enough to store 1000 books or 1
hour of compressed video)

With these prices, we chose hard disk storage for a
small amount of the frequently accessed data combined
with tape jukeboxes. In most applications we expect a
small amount of information to be accessed much more
frequently than the rest, leveraging the use of the
faster disk technology rather than the tape jukebox.
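
To see why the mix matters, the following sketch compares
the cost of keeping everything on disk with a disk-plus-tape
split, using the per-gigabyte prices from the table above.
The 10% "hot" fraction is a hypothetical figure chosen for
illustration. The text says only that a small amount of data
is accessed frequently.

```python
# Cost per gigabyte from the table above (US dollars).
COST_PER_GB = {"disk": 200, "tape_jukebox": 20}


def tiered_cost(total_gb: float, hot_fraction: float) -> float:
    """Cost of keeping `hot_fraction` of the data on disk, the rest on tape."""
    hot_gb = total_gb * hot_fraction
    cold_gb = total_gb - hot_gb
    return (hot_gb * COST_PER_GB["disk"]
            + cold_gb * COST_PER_GB["tape_jukebox"])


WEB_GB = 1500          # the article's estimate of the Web's size
HOT_FRACTION = 0.10    # hypothetical share of frequently accessed data

all_disk = tiered_cost(WEB_GB, 1.0)
mixed = tiered_cost(WEB_GB, HOT_FRACTION)
print(f"All on disk:        ${all_disk:,.0f}")
print(f"10% disk, 90% tape: ${mixed:,.0f}")
```

With these inputs the split costs roughly a fifth of an
all-disk store, at the price of a four-minute wait for the
rarely used data.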

Providing Access and New Services

After gathering and storing the public contents of the
Internet, what services would then be of greatest value
with such a repository? While it is impossible to be
certain, digital versions of paper services might prove
useful.

For instance, we can provide a "reliability service" for
documents that are no longer available from the original
publisher. This is similar to one of the roles of a
library. In this way, one document can refer, through a
hypertext link, to a document on another server and a
reader will be able to follow that link even if the
original is gone. We see this as an important piece of
infrastructure if the global hypertext system is to
become a medium for scholarly publishing.
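
A "reliability service" of this kind amounts to a fallback
lookup: try the original publisher first, then the archive.
The sketch below shows the idea with in-memory dictionaries.
The `resolve` function, the `fetch_live` callback and the
example URLs are hypothetical and do not describe an
existing Archive interface.

```python
from typing import Callable, Dict, Optional


def resolve(url: str,
            fetch_live: Callable[[str], Optional[str]],
            archive: Dict[str, str]) -> Optional[str]:
    """Return the live document if available, else the archived copy."""
    document = fetch_live(url)
    if document is not None:
        return document
    return archive.get(url)  # None if the archive never saw the URL


# Hypothetical data: one page still online, one that has vanished.
live = {"http://example.org/new.html": "current text"}
archived = {"http://example.org/old.html": "text saved before removal"}

print(resolve("http://example.org/new.html", live.get, archived))
print(resolve("http://example.org/old.html", live.get, archived))
```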

Another application for a central archive would be to
store an "official copy of record" of public
information. These records are often of legal interest,
helping to determine what was said or known at a
particular time.
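
Answering "what was said at a particular time" requires
keeping dated copies and finding the one in force on a given
day. A minimal sketch, assuming each URL has a list of
snapshots sorted by capture date (the data and the `as_of`
helper are hypothetical):

```python
from bisect import bisect_right
from datetime import date

# Hypothetical snapshots of one URL, sorted by capture date.
SNAPSHOTS = [
    (date(1996, 10, 1), "version 1"),
    (date(1996, 11, 4), "version 2"),
    (date(1997, 1, 15), "version 3"),
]


def as_of(snapshots, when):
    """Return the most recent snapshot captured on or before `when`."""
    dates = [captured for captured, _ in snapshots]
    index = bisect_right(dates, when)
    if index == 0:
        return None  # nothing had been captured yet
    return snapshots[index - 1][1]


print(as_of(SNAPSHOTS, date(1996, 10, 20)))
```

A snapshot shows only what the crawler saw on its capture
date. Changes made between two captures are not recorded.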

Historians have already found the material useful. David
Allison of the Smithsonian Institution has used the
materials for an exhibit on Presidential Election
websites, which he thinks might be the equivalent to
saving videotapes of early TV campaign advertisements.
David Eddy Spicer of Harvard’s Kennedy School of
Government has used the materials for their "case
studies" in much the same way they collect old
newspaper articles to capture a point in time.

With copies of the Internet over time and cross
correlation of data from multiple sources, new services
might help users understand what they are reading, when
it was created, and what other people thought of it.
With these services, people might be able to give a
context to the information they are seeing and therefore
know if they can trust it. Furthermore, the coordination
of this meta-information and usage data can help build
services for navigating the sea of data that is
available.

Companies are also interested in saving similar
information and building similar services based on their
internal information to help employees effectively learn
from the experiences of others.

The technologies and the services that will grow out of
building digital archives and digital libraries could
lead towards building a reliable system of information
interchange based on electrons rather than paper. Using
the "library" might be done many times a day to use
documents that are no longer available on the Internet.

Legal and Social Issues

Creating an archive of informal and personal information
has many difficult legal and social issues even if the
material was intended to be publicly accessible at some
point. Such a collection treads into the murky area of
intellectual property in the digital era. What can be
done with the digital works that are collected gets into
the area of copyright, privacy, import/export
restrictions, and possession of stolen property.

To give a few examples: what if a college student made a
web page that had pictures of her then-current
boyfriend, but later wanted to take it down and "tear it
up", yet it lived on in digital archives (whether
accessible or not)? Should she have the right to remove
that document? Should a candidate for political office
be able to go back 15 years to erase his postings to
public bulletin boards that have been saved in the
Archive? What if a software program that is legal to
publish in Denmark but illegal in the United States is
collected by an archive: should this program be removed
and hidden even from historians and scholars? The legal
and social issues raised by the construction of the
Archive are not easily resolved.

By allowing authors to exclude their information from
the Archive we hope to avoid some of the immediate
issues, and allow enough time to pass to understand the
larger issues at hand.

The Internet Archive might be able to help resolve some
of these issues by publicly drawing the issues out and
by participating in the debates. While many of these
questions will take years to resolve, we feel it is
important to proceed with the collection of the material
since it can never be recovered in the future.

Where does it go from here?

The new technologies and services currently being
created might be useful in all digital libraries and
help make the Internet more robust and useful.

Through an archive of what millions of people are
interested in making public, we might be able to detect
new trends and patterns. Since these materials are in
computer readable form, searching them, analyzing them,
and distributing them has never been easier. A variety
of services built on top of large data sets will allow
us to connect people and ideas in new ways.

For instance, Firefly Inc. is using the individual
tastes in music and movies to help suggest other CD’s
and videos based on finding "similar" people. They have
even found that people are interested in communicating
with the other "similar" people directly thus forming
communities based on similar interests. This kind of
computer matchmaking which is based on detailed
portraits of people’s preferences suggests similar
services based on reading habits.
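
The idea of suggesting items by finding "similar" people can
be illustrated with a very small sketch. This is not
Firefly's method, which the text does not describe. It is a
generic nearest-neighbour approach using set overlap
(Jaccard similarity), and the names and preference data are
hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of liked items (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def recommend(user: str, likes: dict) -> list:
    """Suggest items liked by the most similar other user."""
    others = [name for name in likes if name != user]
    nearest = max(others, key=lambda name: jaccard(likes[user], likes[name]))
    return sorted(likes[nearest] - likes[user])


# Hypothetical preference data.
LIKES = {
    "ann": {"album_1", "album_2", "film_1"},
    "bob": {"album_1", "album_2", "album_3"},
    "cat": {"film_2"},
}
print(recommend("ann", LIKES))
```

The same structure would apply to reading habits by
replacing the sets of albums and films with sets of
documents each person has read.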

Trends in academic fields might be able to be detected
more easily by studying gross statistics of the
communications in the field. The hypertext links of the
World Wide Web form an informal citation system similar
to the footnote system already in use. Studying the
topography of these links and their evolution might
provide insights into what any given community thought
was important.
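
One simple way to treat links as citations is to count how
many distinct pages point at each document. The sketch below
does this for a small hypothetical link graph. It is one
possible measure, not a method described in the text.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "a.html": ["c.html", "b.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def inbound_counts(links: dict) -> Counter:
    """Count how many distinct pages link to each target."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each source once per target
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts


print(inbound_counts(LINKS).most_common(1))
```

Running the same count on snapshots taken at different times
would show how a community's attention shifted.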

If archiving cultural and personal histories becomes
useful commercially, then the efforts can be expanded to
record radio and video broadcasts. These systems might
allow us to study these effects and influences on our
lives.

Current terabyte technologies (storage hardware and
management software) are relatively rare and specialized
because of their costs, but as the costs drop we might
see new applications that have traditionally used
non-computer media. For instance,

o A video store holds about 5,000 video titles,
or about 7 terabytes of compressed data.
o A music radio station holds about 10,000 LP’s
and CD’s or about 5 terabytes of uncompressed
data.
o The Library of Congress contains about 20
million volumes, or about 20 terabytes of text if
typed into a computer.
o A semester of classroom lectures of a small
college is about 18 terabytes of compressed
data.

Therefore the continued reduction in price of data
storage, and also data transmission, could lead to
interesting applications as all the text of a library,
music of a radio station, and video of a video store
become cost effective to store and later transmit in
digital form.
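
The estimates in the list above can be cross-checked by
computing the implied size of a single item. A short sketch,
using the article's decimal units (1 TB = 1,000,000 MB). The
classroom-lecture entry is left out because the text gives
no item count for it.

```python
# Storage estimates from the list above: (number of items, terabytes).
COLLECTIONS = {
    "video store titles": (5_000, 7),
    "radio station albums": (10_000, 5),
    "Library of Congress volumes": (20_000_000, 20),
}


def megabytes_per_item(items: int, terabytes: float) -> float:
    """Average size of one item, using 1 TB = 1,000,000 MB."""
    return terabytes * 1_000_000 / items


for name, (items, terabytes) in COLLECTIONS.items():
    print(f"{name}: {megabytes_per_item(items, terabytes):,.0f} MB each")
```

The implied figure of about one megabyte per volume matches
the earlier rule of thumb that a gigabyte holds roughly 1000
books.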

In the end, our goal is to help people answer hard
questions. Not "what is my bank balance?", or "where can
I buy the cheapest shoes", or "where is my friend Bill?"
- these will be answered by smaller commercial services.
Rather, it is to help answer hard questions like: "Should I go
back to graduate school?" or "How should I raise my
children?" or "What book should I read next?". Questions
such as these can be informed by the experiences of
others. Can machines and digital libraries really help
in answering such questions? In the long term, we
believe yes, but perhaps in new ways which would have
importance in education and day-to-day life.

Further Reading:

Preserving Digital Objects: Recurrent Needs and
Challenges, December 1995 presentation at 2nd NPO
conference on Multimedia Preservation, Brisbane,
Australia.

The Vanished Library, Luciano Canfora. University of
California Press, 1990.

Biography:

Brewster Kahle helped found the Internet Archive in
April 1996. Before that, he invented the Wide
Area Information Servers (WAIS) system in 1989 and
founded WAIS Inc in 1992. WAIS helped bring commercial
and government agencies onto the Internet by selling
Internet publishing tools and production services to
companies such as Encyclopaedia Britannica, New York
Times, and the Government Printing Office.

Schooled at MIT (BSEE ’82), Brewster designed super
computers in the 80's at Thinking Machines Corporation.


Contact us at: info@archive.org or call 415.561.6900

---
# distributed via nettime-l : no commercial use without permission
# <nettime> is a closed moderated mailinglist for net criticism,
# collaborative text filtering and cultural politics of the nets
# more info: majordomo@icf.de and "info nettime" in the msg body
# URL: http://www.desk.nl/~nettime/ contact: nettime-owner@icf.de