This is the text version of a talk I gave at Personal Digital Archiving 2012 on Feb 24, 2012.
Implicit in archiving is a requirement for long-term durability. There’s no point in
building a technically flawless archive if the institution around it can’t last. Nobody
cares how strong the shelves were in the library of Alexandria.
Right now there’s an agency problem with large web businesses that collect user data.
People upload photos, videos, email, and all kinds of valuable personal information to
websites large and small on the assumption that someone there will take appropriate
technical measures to safeguard their stuff.
Most of those websites don’t get their revenue from users. Instead, they rely on some
form of advertising, or on investor money they receive in return for telling a credible
story about future advertising.
And since the job of advertisers, by definition, is to persuade people to buy things they
would not otherwise purchase, the third leg of this relationship is somewhat adversarial.
But the real problem with this triangular model is that it gives users no visibility into or
control over the relationship that ultimately pays for their long-term data storage. If the
advertising market collapses, or if storage costs rise unexpectedly, the site might give
them little warning before going belly-up.
Meanwhile, since user attention is their most precious commodity, ad-supported sites
have incentives to make it hard for users to get their data back out in a useful form.
Consumer protection and liability laws here are weak. Terms of service typically allow
the company to sever the relationship with a user at any time, and indemnify the company
from legal responsibility for losing data. There is basically no case law.
So the most common way we store important data online right now has shaky foundations.
Pinboard is a paid bookmarking service founded in 2009. It has about 20K active users.
People pay a one-time signup fee when they join, and they can elect to pay $25/year for
archiving and full-text search. [Short demo of the site]
People have tried three different business models around social bookmarking:
This is a groundbreaking new approach where a site accepts fungible tokens of value in
return for a product or service.
The first three are interesting as they offer a free web version and a paid iPhone app.
The bottom four sell a paid version of the site with added features. The sites vary in
scope and ambition (Evernote will happily OCR your beer bottle label!). Some archive
content, others just save bookmarks.
All of these sites are still with us.
Find a sponsor with deep pockets and run at a loss indefinitely.
This is the oldest and arguably the most successful model – all four sites are alive and
well after many years, although Delicious had to change sponsors in late 2011 at the price
of a fairly comprehensive redesign.
The problem with this model from a user perspective is that without knowing the
rationale that keeps the parent entity signing checks, it’s impossible to know how long
the site will continue to exist.
For example, while it’s likely that Delicious has a long-term strategy for becoming a
profitable business, it’s almost certain that Yahoo has simply forgotten that Yahoo
Bookmarks still exists, and will shut it down as soon as a manager accidentally stumbles
across their office.
Previously known as Windows Live Favorites
The most popular business model has been to offer a free service and then shut down (or
transition to model #1).
The case of Xmarks is particularly interesting. After trying and failing for years to fund
the site by selling aggregate user data, the site owners conceded defeat and announced
that the site would be closing, only to face a revolt by angry users demanding that they
charge for the service.
While I have no idea how well the paid model works for other sites, I can give the
relevant data for Pinboard:
There are two things to note here. First, costs are relatively low. The figures here
represent a transitional year where the site was moving from dedicated hosting to its own
hardware, so the total hosting and hardware costs are about double what they could have been.
Second, while the business is healthy and profitable, no angel investor or venture
capitalist would touch it with a long stick, as it sits on the wrong side of the risk/reward
The combination of low startup costs and investor aversion means there are all kinds of
opportunities lying around for a developer to run a profitable small business, provided he
or she remembers to charge money.
Labor is by far the dominant cost in running a little web site. As the tech blister
reinflates for the eighth year with no sign of popping, Bay Area salaries for developers
and sysadmins have climbed into the six figures, and competent contractors (if you can
find them) will seldom charge less than $100/hour. In practical terms, this means if you
self-finance, you can’t afford hired help.
Hosting, on the other hand, is cheap, particularly for those willing to run their own
hardware. It’s important to note that ‘cloud’ services remain far more costly than leased
or dedicated hardware, despite offering comparable reliability.
As an example, here are estimated annual hosting costs for Pinboard (20K active users, 6
TB data, three servers) on each of three services:
Storage costs have not been falling so much. The various cloud services remain at a
steady (and fairly high) ~$0.13/GB*month. Meanwhile, floods in Thailand have caused
disk drive prices to double.
Still, these costs show a similar structure depending on how ‘cloudy’ you want to get.
Here’s what it costs me to store four terabytes for one year:
Assuming hardware costs amortized over 36 months
RAID 6 with off-site backup on a RAID 6 storage appliance
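As a rough sketch of the arithmetic behind these storage figures, the Python below compares a cloud service billed per GB-month against owned hardware amortized over 36 months. Only the ~0.13/GB*month rate, the four terabytes, and the 36-month amortization come from the talk. The purchase price and colocation fee are hypothetical placeholders, as is the choice of 1000 GB per TB.

```python
# Assumption: decimal terabytes (1 TB = 1000 GB); the talk does not specify.
GB_PER_TB = 1000


def cloud_storage_cost_per_year(terabytes, dollars_per_gb_month=0.13):
    """Yearly cost of keeping `terabytes` in a store billed per GB-month."""
    return terabytes * GB_PER_TB * dollars_per_gb_month * 12


def owned_hardware_cost_per_year(purchase_price,
                                 amortization_months=36,
                                 monthly_colo_fee=0.0):
    """Yearly cost of owned hardware, spreading the purchase price evenly
    over the amortization period and adding any recurring monthly fee."""
    return (purchase_price / amortization_months + monthly_colo_fee) * 12


# Four terabytes at the cloud rate quoted in the talk.
cloud = cloud_storage_cost_per_year(4)

# Hypothetical numbers for illustration only, not Pinboard's actual costs.
owned = owned_hardware_cost_per_year(purchase_price=3000.0,
                                     monthly_colo_fee=50.0)

print(f"cloud: ${cloud:,.0f}/year")
print(f"owned: ${owned:,.0f}/year")
```

Swapping in your own purchase price and monthly fee shows how quickly the per-GB cloud rate dominates once the data set reaches a few terabytes.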
There are certain things that make running a bookmarking site different from the usual
First, users tend to treat a bookmarking site like a bank. They are very loath to switch
services unless forced to, and have little tolerance for perceived or real risk. As an
example, I’ll re-post the Pinboard server logs from December 2010, when Yahoo
inadvertently disclosed plans to divest themselves of Delicious:
The blue bar shows web traffic one week before the storm hit; the green Himalayas show
how traffic spiked immediately after the plans to ‘sunset’ Delicious became public. I’m
sure every other bookmarking site could show a similar graph from that week.
Note that Delicious did not change in any way, all the export tools still worked, and it
would be over half a year before users noticed anything different on the site.
Nonetheless, people stampeded away at the first sign of danger.
This risk aversion illustrates how important people’s bookmarks are to them, and how
gently a bookmarking site has to tread when considering significant changes.
The second peculiarity is the significantly longer time horizon involved with archiving.
People want their data and metadata to stick around permanently. The most important
part of running a bookmarking business is to have a credible rationale for why you can
stick around, and to be very slow and careful in how you change the site.
This is at odds with Web startup culture, which is happy-go-lucky, tends to look down on
‘lifestyle businesses’, and values swinging for the fences. Services come and go;
redesigns and major changes in focus are normal. There is no shame in selling out for a
large payday, or in shutting things down to pursue some other project.
Meanwhile an archive needs to have a credible plan for offering the same basic feature
set over a time scale of decades. Any major redesign risks spooking users who will
perceive it as a sign of instability. And the last thing people want to hear is that you’re
swinging for the fences – real archivists bunt.
A good role model here is craigslist, which has endured sustained derision for its
‘ancient’ UI for years from a succession of more modern websites, nearly all of which are
now out of business.
Finally, here’s a ranked list of the things I worry about:
Sometimes I think there may be some in Pinboard.
It’s very easy to inadvertently destroy data in various creative ways in the course of
administering the site.
This would not ordinarily figure high on my list, but the FBI confiscated a Pinboard
server in the summer of 2011. Turns out they were interested in someone else using the
same physical enclosure, but that didn’t make the server any less gone.
The lesson here is not so much to fear the FBI, but rather that there’s no such thing as a
‘cloud service’. Bits have to physically exist somewhere, and strange things can happen
to them. Jurisdictional redundancy is just as important as physical redundancy.
The legal status of personal archiving sites is far from clear, and while my conscience is
clear I certainly don’t want to be the reference case.
Bad people breaking into the site to intentionally destroy stuff.
I don’t particularly care about this one (since I won’t be around for the consequences),
but for a one-person shop it’s important to have a credible answer for users worried about
what happens to their stuff if the developer goes to the great cubicle farm in the sky.
Problem Exists Between Keyboard And Chair