
Does digital data disappear?

August 8th, 2011

I’ve seen this kind of article many times, but is it correct? I’d say that I’ve generated several million words in papers, newspaper articles, blog posts and so on since I got my first Mac in 1984 (a bit over 100kw/yr for 25+ years, for something like 3 million), and also attracted maybe 10 million more in blog comments (over 100k non-spam comments). Of that, I’ve lost:
* a fair bit of material I produced before 1990, when hard disk space was very expensive and stuff had to be stored in various external disk formats. Sadly that includes my first econ theory book and a book of satirical songs I turned out in the 1980s
* The first year or so of comments on my blog in the now-obsolete Haloscan system.
* The blog has also suffered a lot of linkrot, including internal links to its older incarnations
* A lot of my older text is in formats that can now only be read by extracting a text-only version, and some old stuff (e.g. pre-.qif financial records) is in formats that are no longer readable in any way. But again, that’s mostly a problem with pre-1990 stuff.

Compared to my slightly obsessive desire to preserve every revision of every piece I’ve ever written, those are substantial losses. But compared to my paper records, my digital stuff is almost perfectly complete, not to mention instantly accessible and searchable.

Categories: Metablogging
  1. aidan
    August 8th, 2011 at 15:26 | #1

    Successful digital data archiving requires active curation. If you don’t keep “active” copies of material it will disappear.

    The tech columnist David Pogue found this out the hard way.

    Effectively you have to make active decisions about what you keep and what you throw away. You are better off discarding stuff you don’t want, or temporary stuff, or perhaps having separate partitions on your hard drive, one for material you wish to keep long term, the other for probably useless material, or ephemera.

  2. aidan
    August 8th, 2011 at 15:33 | #2

    The BBC Domesday Project is a great example (of electronic, not actually digital, data-rot). Less than 20 years after it was first published it was almost lost forever. Only a dedicated restoration project managed to save it.

    Which also shows that even active curation can’t help if there is no way to migrate data from the old platform to the new. As a young research student, my lab played host to a fellow researcher for a couple of days, as we were the only place on campus that still had the washing-machine-sized magnetic disk drives, with an attached PDP-11, that he could use to read some old data sets. Good luck with doing that now.

  3. Greg G
    August 8th, 2011 at 17:42 | #3

    The data are accessible to you but would present a challenge to pass on to an archive or manuscripts library if you were inclined to do so. Not to say there aren’t methods in place for digital preservation but they are in early(ish) days.

  4. Ikonoclast
    August 9th, 2011 at 00:11 | #4

    It’s called entropy. All data will be lost eventually. The heat death of the universe (maximum disorder) will see to that.

    An interesting philosophical question is this. Did anything happen once all data is lost? That is, could it be said at the point of total heat death of the universe that it had a history to get there? I personally suspect not, since once all information about that history is lost, not only can it not be recounted, it cannot be accounted as ever having existed.

    The heat death of the universe might be hypothetical rather than real if the universe recycles by collapsing to a new singularity. Is data lost in the singularity? According to one source:

    “According to general relativity, when matter falls into a black hole, all information is lost and disappears into a singularity: many different physical states evolve into the same state. In quantum theory, quantum information must be conserved: complete information about a physical system at one point in time should determine its state at any other time. This incompatibility is known as the information paradox.”

  5. dylwah
    August 9th, 2011 at 00:42 | #5

    Sorry to hear about the loss of songs, Prof Q. The year I saw you at the ACT (or was it the National at ANU, it was a while ago) the sound dude was making bootlegs of everybody, so there is probably a dusty TDK sitting on a shelf somewhere. Maybe a kind of Time Team will find it one day. More likely a crack team of blaggers from the gutter press who will reverse-engineer the magnetic tape to insert paeans of respect for Lenin, Nero and Burley Griffin, the rotters.

  6. John Mashey
    August 9th, 2011 at 04:39 | #6

    (Earlier post seems to have gotten lost; trying again.)

    1) I’m a long-time Trustee @ The Computer History Museum in Mountain View, CA. We worry about this all the time.

    2) Old physical media degrades, but perhaps worse, it is really difficult to find the older readers and keep them working. A few places keep old readers for data conversions, but for some media, no working readers exist at all.

    3) We sometimes restore old computers, such as:
    IBM 1620
    PDP-1 (play the original “spacewar”!)
    IBM 1401 (the most common business computer of the 1960s; IBM built ~15,000, of which ~6-8 are left). We have several, located in a classic 1960s computer room: raised floor, keypunches, tape drives, IBM 1403 chain printers, card reader/punches, all the Good Old Stuff.

    Of course, we will not be able to keep these working forever, especially as the wonderful DEC and IBM systems engineers who worked on these, and kept schematics in their garages, fade away.

    4) Fortunately, there is a thriving “simulator” community of people who write instruction-simulators that run on Linux, Windows, etc … and people are archiving old versions of application software, which gives a fighting chance to interpret old bits. Fortunately, there has been a tendency for more manually-created text and images to be kept in forms (like PDFs or Word files) for which there are multiple independent programs that can read them.

    5) Back in the 1401 era, most of the world’s digital data was on old magnetic tapes and punch cards. In the 1980s/1990s, PCs used floppy disks, tapes, CDs and DVDs, but by now, when one can easily get a TB of disk for under $100, it’s a lot easier to back up and just migrate.
    A person’s lifetime production of words and even still photos is easy to keep around.

    6) Of course, migrating into the “cloud” makes a lot of the issues lessen for an individual, although…

    7) A cautionary tale in one direction is Hal Draper’s classic old MS Fnd in a Lbry, a short story available online.

    8) A cautionary tale in the other direction (software lasts longer than you’d ever imagine) can be found in my Languages, Levels, Libraries, and Longevity from ACM Queue. For this topic, just read the first page, derived from Vernor Vinge.

  7. Robert (not from UK)
    August 9th, 2011 at 08:41 | #7

    If it’s any consolation, a recent WASHINGTON POST article referred to “manuscripts stored on Reagan-era floppy disks and unreadable on the modern computer.”

    Here’s the whole article:

    http://www.washingtonpost.com/local/education/u-va-school-celebrates-the-embattled-book/2011/07/22/gIQADR61fI_story.html

  8. Jim Birch
    August 9th, 2011 at 13:09 | #8

    There’s a big advantage in using standard, modern, public file formats, especially XML-based ones that are properly documented. ODF would be a great candidate, but Microsoft’s XML format and even HTML variants are probably OK for a very long time. XML-based formats are inherently easy to convert and upgrade.

    The early versions of the Microsoft Word (and Excel, etc.) file formats were strange beasts. They were designed to be read in large chunks, bit-for-bit, from a floppy disk directly into the computer’s memory and be immediately ready for editing without further processing – on the slow processors of the day. This design met its objective of a rapid load but produced almost intractable file formats. Intractable, that is, except to the keepers of the code at Microsoft, and even then they had problems where some documents could no longer be read correctly as new features were added between versions. This approach is positively weird these days, when an extra layer or three of processing is a small cost for a robust file format. (I guess a similar story applies to the old formats of other proprietary word-processing software.)

    In the 90s Microsoft was accused of withholding document interoperability as a business strategy, but its reluctance to release a specification for Word was probably due as much to the format’s sheer complexity and uncertainty as to anything else. The situation has improved out of sight with the newer XML formats, but even these contain a certain amount of weird junk to support conversion of older-format documents.
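    The claim that XML-based formats are easy to process can be made concrete with a small sketch. A modern .docx file is just a ZIP archive containing XML parts, so the text can be pulled out with nothing but the Python standard library; the function name below is my own, and this is a minimal sketch (it recovers paragraph text only, not formatting):

    ```python
    import zipfile
    import xml.etree.ElementTree as ET

    # WordprocessingML namespace used inside word/document.xml
    W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

    def docx_to_text(path):
        """Extract plain paragraph text from a .docx (a ZIP of XML parts)."""
        with zipfile.ZipFile(path) as z:
            xml_bytes = z.read("word/document.xml")
        root = ET.fromstring(xml_bytes)
        paragraphs = []
        for para in root.iter(W + "p"):        # <w:p> = paragraph
            runs = [t.text or "" for t in para.iter(W + "t")]  # <w:t> = text run
            paragraphs.append("".join(runs))
        return "\n".join(paragraphs)
    ```

    Even if every copy of Word vanished tomorrow, bits stored like this would stay interpretable with generic tools, which is the whole point.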

    As for physical degradation of storage, this does occur, but at very low bit-error rates, and it can be just about overcome by copying everything to the next disk/thing, which has several times the storage, as it comes along. You should have at least two physical repositories of data at separate locations, e.g. home and work. Cloud storage could be one of the options. People are sceptical of the cloud, but I’d expect a cloud storage outfit to have more reliable storage systems than any interested amateur. This strategy covers things like flood, fire, physical failure, attack and larceny.
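    The copy-and-verify discipline described above can be sketched in a few lines. This is only an illustration, not anyone’s actual backup tool; the function names are made up, and it assumes the destinations are ordinary mounted directories. The key point is verifying the checksum after the copy, so silent corruption is caught at copy time rather than years later:

    ```python
    import hashlib
    import shutil
    from pathlib import Path

    def sha256(path):
        """Hash a file in chunks so large archives don't exhaust memory."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def replicate(source, destinations):
        """Copy `source` into each destination directory, then re-read and
        re-hash the copy to confirm it matches the original bit-for-bit."""
        source = Path(source)
        expected = sha256(source)
        for dest_dir in destinations:
            dest = Path(dest_dir) / source.name
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, dest)  # copy2 also preserves timestamps
            if sha256(dest) != expected:
                raise IOError(f"verification failed for {dest}")
        return expected
    ```

    Run against two separate repositories (say, a home disk and a work disk), this gives exactly the two-location protection the comment recommends.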

    The ultimate danger is, of course, personal stupidity: deleting stuff without realizing, assuming you have another copy, and so on, then replicating the error to all repositories. A well-designed archive strategy and periodic do-not-touch snapshots can help here, but human stupidity has been found to have a transcendent quality that periodically beats the best systems.

  9. John Mashey
    August 9th, 2011 at 14:37 | #9

    @Robert (not from UK)

    “manuscripts stored on Reagan-era floppy disks and unreadable on the modern computer.”

    This is certainly true, since the Reagan era covers 5.25″ and 3.5″ floppies, not normally found on modern systems. But if they were written by IBM-compatible PCs or Apples, and have not succumbed to bit-rot, there still exist vast numbers of PCs and Apples that can read them.
    We (the Museum) already have more than we can use of those and must politely refuse offers of more.

    [But, if you happen to have a Burroughs B5000 in your basement, PLEASE contact us.
    Even another IBM 1401 would be nice, if only for spare parts, since neither Fry's Electronics nor Weird Stuff has them.]

  10. sam
    August 9th, 2011 at 20:16 | #10

    I use external USB hard drives, held on two separate computers. The backup one only switches on briefly every day to copy changes. This means there is minimal extra power consumption and almost no exposure to damaging voltage surges. I think for the foreseeable future all successors to USB will be backward compatible. Barring a house fire, I feel pretty confident.
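    The brief-daily-copy scheme described here amounts to a one-way incremental sync: copy a file only when the backup copy is missing or out of date. A minimal sketch of that idea (my own illustration, using the cheap timestamp-and-size test that tools like rsync apply by default):

    ```python
    import shutil
    from pathlib import Path

    def sync_changed(src_root, dst_root):
        """Copy into dst_root every file under src_root whose backup copy is
        missing, older, or a different size; return the paths copied."""
        src_root, dst_root = Path(src_root), Path(dst_root)
        copied = []
        for src in src_root.rglob("*"):
            if not src.is_file():
                continue
            dst = dst_root / src.relative_to(src_root)
            st = src.stat()
            if dst.exists():
                dt = dst.stat()
                if dt.st_mtime >= st.st_mtime and dt.st_size == st.st_size:
                    continue  # unchanged since the last run
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 preserves the source timestamp
            copied.append(src.relative_to(src_root))
        return copied
    ```

    Because only changed files are touched, the backup machine need run for just a few minutes a day, which is the whole appeal of the scheme.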

  11. Ron
    August 19th, 2011 at 06:55 | #11

    “located in a classic 1960s computer room, raised floor, keypunches, tape drives IBM 1403 chain printers, card reader/punches, all the Good Old Stuff.”

    That brings back some memories. I started in IP (or ADP as it was called then) in 1969 on an IBM 360/20 with all the above.

Comments are closed.