Vinton G. Cerf
Google

As the second decade of the 21st century dawns, predictions of global Internet digital transmissions reach as high as 667 exabytes (10^18 bytes; http://en.wikipedia.org/wiki/SI_prefix#List_of_SI_prefixes) per year by 2013 (see http://telephonyonline.com/global/news/cisco-ip-traffic-0609/). Based on this prediction, traffic levels might easily exceed many zettabytes (10^21 bytes, or 1,000 exabytes) by the end of the decade. Setting aside the challenge of somehow transporting all that traffic and wondering about the sources and sinks of it all, we might also focus on the nature of the information being transferred, how it’s encoded, whether it’s stored for future use, and whether it will always be possible to interpret as intended.

Storage Media

Without exaggerating, it seems fair to say that storage technology costs have dropped dramatically over time. A 10-Mbyte disk drive, the size of a shoe box, cost US$1,000 in 1979. In 2010, a 1.5-Tbyte disk drive costs about $120 retail. That translates into about 10^4 bytes/$ in 1979 and more than 10^10 bytes/$ in 2010. If storage technology continues to increase in density and decrease in cost per Mbyte, we might anticipate consumer storage costs dropping by at least a factor of 100 in the next 10 years, suggesting petabyte (10^15 bytes) disk drives costing between $100 and $1,000. Of course, the rate at which data can be transferred to and from such drives will be a major factor in their utility. Solid-state storage is faster but also more expensive, at least at present. A 150-Gbyte solid-state drive was available for $460 in late 2009. At that price point, a 1.5-Tbyte drive would cost about $4,600. These prices apply to low-end consumer products. Larger-scale systems holding petabyte- to exabyte-range content are commensurately more expensive in absolute terms but possibly cheaper per Mbyte. As larger-scale systems are contemplated, operational costs, including housing, electricity, operators, and the like, contribute increasing percentages to the annual cost of maintaining large-scale storage systems.

The point of these observations is simply that it will be both possible and likely that the amount of digital content stored by 2020 will be extremely large, integrating over government, enterprise, and consumer storage systems. The question this article addresses is whether we’ll be able to persistently and reliably retrieve and interpret the vast quantities of digital material stored away in various places.
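The disk-drive cost figures quoted earlier in this section can be checked with a short calculation. The sketch below is only a back-of-the-envelope illustration: it assumes decimal prefixes (1 Mbyte = 10^6 bytes), and the function and variable names are hypothetical, chosen for this example rather than taken from the article.

```python
# Back-of-the-envelope check of the disk-drive cost figures quoted in the text.
# Assumption: decimal prefixes (1 Mbyte = 10**6 bytes, 1 Tbyte = 10**12 bytes).

MBYTE = 10**6
TBYTE = 10**12
PBYTE = 10**15


def bytes_per_dollar(capacity_bytes: float, price_usd: float) -> float:
    """Return how many bytes one dollar buys at the given capacity and price."""
    return capacity_bytes / price_usd


density_1979 = bytes_per_dollar(10 * MBYTE, 1_000)  # 10-Mbyte drive, US$1,000
density_2010 = bytes_per_dollar(1.5 * TBYTE, 120)   # 1.5-Tbyte drive, about $120

# Hypothetical projection: apply the text's "factor of 100" improvement
# over the following 10 years to the 2010 figure.
density_2020 = density_2010 * 100
petabyte_price_2020 = PBYTE / density_2020

print(f"1979: {density_1979:.2e} bytes/$")
print(f"2010: {density_2010:.2e} bytes/$")
print(f"Projected price of a petabyte drive: ${petabyte_price_2020:,.0f}")
```

Under these assumptions the 1979 figure works out to 10^4 bytes/$, the 2010 figure to slightly more than 10^10 bytes/$, and the projected petabyte drive to roughly $800, which falls inside the $100 to $1,000 range mentioned above.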
Published by the IEEE Computer Society 1089-7801/10/$26.00 © 2010 IEEE IEEE INTERNET COMPUTING
Storage media have finite lifetimes. How many 7-track tapes can still be read, even if you can find a 7-track tape drive to read them? What about punched paper tape? CD-ROM, DVD, and other polycarbonate media have uncertain lifetimes, and even when we can rely on them to be readable for many years, the equipment that can read these media might not have a comparable lifetime. Digital storage media such as thumb drives or memory sticks have migrated from Personal Computer Memory Card International Association (PCMCIA) formats to USB and USB 2.0 connectors, and older devices might not interconnect to newer computers, desktops, and laptops. Where can you find a computer today that can read 8” Wang word processing disks, or 5 1/4” or 3 1/2” floppies? Most likely in a museum or perhaps in a specialty digital archive.

Digital Formats

The digital objects we store are remarkably diverse and range from simple text to complex spreadsheets, encoded digital images and video, and a wide range of text formats suitable for editing, printing, or display among many other application-specific formats. Anyone who has used local or remote computing services, and who has stored information away for a period of years, has encountered problems with properly interpreting the stored information. Trivial examples are occurring as new formats of digital images are invented and older formats are abandoned. Unless you have access to comprehensive conversion tools or the applications you’re using continue to be supported by new operating system versions, it’s entirely possible to lose the ability to interpret older file formats. Not all applications maintain backward compatibility with their own versions, to say nothing of ability to convert into and from a wide range of formats other than their own. Conversion often isn’t capable of 100 percent fidelity, as anyone who has moved from one email application to another has discovered, for example. The same can be said for various word processing formats, spreadsheets, and other common applications.

How can we increase the likelihood that data generated in 2010 or earlier will still be accessible in useful form in 2020 and later? To demonstrate that this isn’t a trivial exercise, consider that the providers of applications (whether open source or proprietary) are free to evolve, adapt, and abandon support for earlier versions. The same can be said for operating system providers. Applications are often bound to specific operating system versions and must be “upgraded” to deal with changes in the operating environment. In extreme cases, we might have to convert file formats as a consequence of application or operating system changes.

If we don’t find suitable solutions to this problem, we face a future in which our digital information, even if preserved at the bit and byte level, will “rot” and become uninterpretable.

Solution Spaces

Among the more vexing problems is the evolution of application and operating system software or migration from one operating system to another. In some cases, older versions of applications don’t work with new operating system releases or aren’t available on the operating system platform of choice. Application providers might choose not to support further evolution of the software, including upgrades to operate on newer versions of the underlying operating system. Or, the application provider might choose to cease supporting certain application features and formats.

If users of digital objects can maintain the older applications or operating environments, they might be able to continue to use them, but sometimes this isn’t a choice that a user can make. I maintained two operational Apple IIe systems with their 5 1/4” floppy drives for more than 10 years but ultimately acquired a Macintosh that had a special Apple IIe emulator and I/O systems that could support the older disk drives. Eventually, I copied everything onto newer disk drives and relied on conversion software to map the older file formats. This worked for some but not all of the digital objects I’d created in the preceding decade. Word processing documents were transferable, but the formatting conventions weren’t directly transformable between the older and newer word processing applications. Although special-purpose converters might have been available or could have been written (and in some cases were written), this isn’t something we can always rely on.

If the rights holder to the application or operating system in question were to permit third parties to offer remote access in a cloud-based computing environment, it might be possible to run applications or operating systems that developers no longer supported. This kind of arrangement would plainly require creative licensing and access controls, especially for proprietary software. If a software supplier goes out of business, we might wonder about provisions for access to source code to allow for support in the future, if anyone is willing to provide it, or acquisition by those depending on the software for interpretation of files of data created with it. Open source [...]

Digital Vellum

Among the most reliable and survivable formats for text and imagery preservation is vellum (calf, goat, or sheep skin). Manuscripts prepared more than a thousand years ago on this writing material can be read today and are often as beautiful and colorful as they were when first written. We have only to look at some of the illuminated manuscripts or codices dating from the 10th century to appreciate this. What steps might we take to create a kind of digital vellum that could last as long as this or longer?

Adobe Systems has made one interesting attempt with its PDF archive format (PDF/A-1; www.digitalpreservation.gov/formats/fdd/fdd000125.shtml) that the ISO has standardized as ISO 19005-1. Widespread use of this format and continued support for it throughout Adobe’s releases of new PDF versions have created at least one instance of an intended long-term digital archival format. In this case, a company has made a commitment to the notion of long-term archiving. It remains an open question, of course, as to the longevity of the company itself and access to its software. All the issues raised in the preceding section are relevant to this example.

Various other attempts at open document formats exist, such as OpenDocument format 1.2 (and further versions) developed by OASIS (see www.oasis-open.org). The Joint Photographic Experts Group has developed standards for still imagery (JPEG; www.jpeg.org), and the Moving Picture Experts Group has developed them for motion pictures and video (MPEG; www.mpeg.org). Indeed, standards in general play a major role in helping reduce the number of distinct formats that might require support, but even these standards evolve with time, and transformations from older to newer ones might not always be feasible or easily implemented.

The World Wide Web application on the Internet uses HTML to describe Web page layouts. The W3C is just reaching closure on its HTML5 specification (http://dev.w3.org/html5/spec/Overview.html). Browsers have had to adapt to interpreting older and newer formats. XML (www.w3.org/XML/) is a data description language. High-level language text (such as Java [...] to the mix of conventions that need to be supported. Anyone exploring this space will find hundreds if not thousands of formats in use.

Finding Objects on the Internet

Related to the format of digital objects is also the ability to identify and find them. It’s common on the Internet today to reference Web pages using Uniform Resource Identifiers (URIs), which come in two flavors: Uniform Resource Locators (URLs) and Uniform Resource Names (URNs). The URL is the most common, and many examples of these appear in this article. Embedded in most URLs is a domain name (such as www.google.com). Domain names aren’t necessarily stable because they exist only as long as the domain name holder (also called the registrant) continues to pay the annual fee to keep the name registered and resolvable (that is, translatable from the name to an Internet address). If the registrant loses the registration or the domain name registry fails, the associated URLs might no longer resolve, losing access to the associated Web page. URNs are generally not dependent on specific domain names but still need to be translated into Internet addresses before we can access the objects.

An interesting foray into this problem area is called the Digital Object Identifier (DOI; www.doi.org), which is based on earlier work
at the Corporation for National Research Initiatives (www.cnri.reston.va.us) on digital libraries and the Handle System (www.cnri.reston.va.us/doa.html) in particular. Objects are given unique digital identifiers that we can look up in a directory intended to be accessible far into the future. The directory entries point to object repositories where the digital objects are stored and can be retrieved via the Internet. The system can use but doesn’t depend on the Internet’s Domain Name System and includes metadata describing the object, its ownership, formats, access modes, and a wide range of other salient facts.

As we look toward a future filled with an increasingly large store of digital objects, it’s vital that we solve the problems of long-term storage, retrieval, and interpretation of our digital treasures. Absent such attention, we’ll preside over an increasingly large store of rotting bits whose meaning has leached away with time. We can hope that the motivation to circumvent such a future will spur creative solutions and the means to implement them.

Vinton G. Cerf is vice president and chief Internet evangelist at Google. His research interests include computer networking, space communications, inter-cloud communications, and security. Cerf has a PhD in computer science from the University of California, Los Angeles. Contact him at email@example.com.