Organize Your Data, It’s Going To Be Mine One Day
This article was originally written for and published at The New Tech on June 8th, 2012. It was a collaboration between Moonlit and myself. Enjoy 🙂
– Famicoman –
I think I’ve always been an archivist. A vital ally in the digital world. I’m the guy that saves a file from six years ago and pulls it up when people wonder whatever happened to it. I’m the guy who is going to make sure you can still find The New Tech episodes in 20 years, whether anyone would want to or not.
Some might call me a hoarder. Technically, by definition, they are correct. But just like how the word “hacker” has been usurped and manipulated by mass media, so has this term. The word conjures up television-tinted images of people living in trash and debris. It isn’t always like that. Things I save are organized, studied, and shared with the world, not rotting away in some closed off building. Not sealed from the world. If anything, I save because these items may be important to someone else. I’m not always part of the equation.
One could argue that you’re born with an archivist instinct. My philosophy has always been that to be able to look forward, we must look back. Besides digital data, I collect physical artifacts of our technological past. You can learn a lot about Blu-ray by looking at Betamax. This resonates in all archiving. There will always be someone wanting to know how we got to where we are, and hopefully they aren’t left with puzzled faces.
My digital archiving habits started with the world of internet video. In the beginning, I was maxing out my DSL connection and throwing videos up onto Google Video. That later evolved into the IPTV Archive and ultimately my current efforts with archiving Revision3 and a wider range of digital content.
Archiving isn’t an easy task. It isn’t just plucking files off of a download page. It’s mastering wget. It’s manipulating URLs. It’s fighting tooth and nail with a server for weeks, months. It’s talking to people, some of whom don’t want to be talked to. It really stops becoming a hobby and starts being a mindset. You begin to look at things differently, communicate differently, prioritize differently.
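As a rough illustration of what “manipulating URLs” can look like in practice, here is a minimal Python sketch that builds a list of candidate download URLs to hand to a tool like wget. Everything in it is an assumption made for the example: the host name, the show name, the zero-padded numbering, and the file formats are all hypothetical, and a real site will have its own (often inconsistent) layout.

```python
from itertools import product


def candidate_urls(base, show, episodes, formats):
    """Yield one candidate URL per (episode, format) pair.

    The URL layout here is hypothetical; adjust the template to
    whatever pattern the real site turns out to use.
    """
    for number, fmt in product(episodes, formats):
        # Zero-pad the episode number to four digits, e.g. 7 -> "0007".
        yield f"{base}/{show}/{number:04d}.{fmt}"


# Hypothetical host and show name, used purely for illustration.
urls = list(
    candidate_urls("http://media.example.com", "someshow", range(1, 4), ["mp4", "avi"])
)

for url in urls:
    print(url)
```

Writing that list out to a text file gives you something a downloader can work through one line at a time, and the dead links you hit along the way become a record of what is already missing.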
When I started out with the IPTV Archive, things were simpler. I could just go download episodes from show sites and be on my way. Now, I get to sites that don’t want to be downloaded in their entirety, and are definitely not set up to be. For example, last year I worked on backing up portions of good.net. After a while, they’d lock me out of their servers and the only way to keep downloading was to get a new IP address or wait the block out. This year with Revision3, their CDN throttles me, which ultimately just means I’m going to be waiting longer for their files. For whatever reason, corporations are not fans of someone downloading their entire library of material. Some entities are set up with commercial content, meaning eyeballs are numbers. If you mirror their content, they don’t get as many viewers, and fewer viewers mean less money. In this light, I’m an enemy. I’m a thief. More importantly, I’m a necessity. Without me and those like me, entire cultures could be snuffed out like a flame. Many already have. It’s a strange feeling when you’re contacted by a show creator asking if he could download his episodes from you.
Archiving someone’s digital work is a weird concept to get your head around. Think if you were approached and someone wanted a copy of your entire website. Every little detail becomes theirs to thumb through, spread to others, and replicate for years after you’ve brought the original down. It’s weird, but it’s necessary. When someone years down the road says, “Man, I wish I could watch some old Revision3,” I’ll be there to say, “Here is a copy of all their content. Ever. Enjoy.”
It would be wonderful if it was all as easy as hitting a button and someone’s site downloads for you, but it’s never that simple. Most websites are not designed to be cloned so readily. They lack internal organization. When you peel back the layers, you’d be surprised to see how clumsily some large sites are maintained and held together with rubber bands and paper clips. Out of convenience, we can pull up the Revision3 example again. So many episodes are mislabeled, so many links are dead, the formats for each episode can vary at will, and there are so many episodes and full shows that are just outright gone to the point that if you had no prior knowledge, you wouldn’t know them to have ever existed. It feels like someone ripping pages out of a book and passing it off as if nothing happened.
You have to be one part resources, one part nice guy, one part detective, one part historian and one part hacker. You have to learn about the missing files, you have to track them down, you have to communicate with others who may have them, you have to have the storage and bandwidth to get them, and you have to do it all no matter what is trying to stop you. You have to do all these things, be all these people, at the same time. Sometimes, you have to do it as quickly as possible.
After you gather everything, there is always the question of how to preserve it and disperse it. You have to keep the files up, and make sure they’ll stay up. More importantly, you need to make sure that people can get to them without jumping through hoops. I’ve tried everything on this front: torrent sites, FTP drops, streaming services, and so on. Ultimately, I’ve cemented my toolbox with archive.org. For the uninitiated, the Internet Archive is a non-profit digital library offering permanent data storage. It’s big, and it’s growing every day. Anyone can upload content provided it’s licensed to be distributed openly. It makes things easy when I can be bringing things in through the front door, and flipping them right out the back to archive.org.
Digital archiving is a brutal but rewarding process that most people don’t see on the front lines. The next time you’re going to put something up online, take a minute to think about it. Your files are going to live much longer than you could imagine. You might as well make it easy for them to.
– Moonlit –
I’ve been a wannabe archivist for some time, but through a mixture of altruistic and less altruistic means, which just so happen to coincide.
On one hand I can’t bear the thought that there is so much recent history that may be, or in some cases already is, needlessly lost forever. Whether it be hardware, software or media, much of what is produced today has no vision for the future: it’s created, it’s used and, ultimately, it’s destined to be lost to whatever forces may eventually whittle its existence down to extinction. Those forces might be failed storage media, the thought that “if I delete it, somebody else will still have it” or even just plain old waning interest in a flash in the pan which is no longer relevant tomorrow.
On the other hand I find it somewhat distressing that the content I grew up with, much of which came from TV rather than the internet, is very difficult to find. It’s just that little bit too old to have been swept up by a thousand torrent sites or archived to the ever expanding YouTube. It appears to me to exist in a narrow void between content old and popular enough to have made its way to public release via VHS or DVD as a nostalgia trip for the previous generations and the modern piracy scene, who will capture and upload almost anything as pristine digital clones of the broadcast content we enjoy.
Luckily, the two often overlap, so one can be the driving inspiration to accomplish both. But as long as the end result is shared, I don’t view the selfishness of the latter to be a problem. In fact it could very well be a boon, because if everybody was selfish enough to demand copies of the content they thought they’d lost, it means that content still exists, and given that everybody likes different things, meshing all that together would create a patchwork of content from that point in time.
Now, I’ve erred somewhat on the side of piracy so far, but I don’t mean to imply that I’m only interested in commercial media, or indeed in breaking the law. Before moving on though, I’d like to say that I think it’s a colossal shame that in order to capture and preserve certain parts of our lives, we often have to resort to methods which might seem unsavoury to those who disseminate that content. I don’t think it’s unreasonable to suggest that there are indeed large archives these days maintained by large media producers and broadcasters, yet those of us the content was created to be viewed by have no access. Whether that be through music or video clip copyright and licensing issues, laziness or cost, it’s still a great loss to us, and will continue to be until such a time that the content is opened up. This history should not suffer for the sake of a few contracts and a slew of many-digit bank balances. Please, somehow, let this content see the light of day again.
Whew. Got a little bit heavy there. User-created content, there’s a good place to jump to. Podcasts and video podcasts exploded in the mid-2000s along with the proliferation of high speed broadband and cheap consumer cameras. The trouble is, many of those shows had small numbers of fans who, along with the creators themselves, have moved on and left behind their content. This is an important chunk of internet history to me; it got me involved in a large percentage of what I do and who I speak to every day. That’s why I tried my best to help Famicoman build the IPTV Archive when we originally began trying to preserve this stuff. With my pitiful upload speeds and meagre hard drive space, which was frustrating enough, I helped transcode and re-host piles of videos. Those videos were then uploaded to DivX’s Stage6 video hosting site, all neatly encoded in DivX format, with their own special DivX player plugin. Then they took the service down. After countless weeks of pulling down videos, transcoding where necessary, uploading back to Stage6, straining my resources as I went, it was all for naught. Once bitten, twice shy, as they say, and since then I’ve been very wary of trying to do it again, but I’m slowly getting back on the horse. Lesson learned: redundancy. Redundancy and backups. Everywhere. Never rely on any single service to host this kind of stuff; it might be gone tomorrow.
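One way to act on that lesson is to check, every so often, that a second copy really does match the first. The sketch below is only an illustration under stated assumptions, not a description of how the IPTV Archive was actually managed: it assumes both copies are reachable as local directories, and the function names (sha256_of, build_manifest, compare_copies) are made up for this example. It uses only Python’s standard hashlib and pathlib modules.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large video files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_copies(primary, mirror):
    """Return (missing, mismatched) for the mirror, relative to the primary.

    missing: files in the primary that the mirror lacks.
    mismatched: files present in both whose contents differ.
    """
    a, b = build_manifest(primary), build_manifest(mirror)
    missing = sorted(name for name in a if name not in b)
    mismatched = sorted(name for name in a if name in b and a[name] != b[name])
    return missing, mismatched
```

Calling compare_copies("archive", "backup") on two such directories (hypothetical names) returns two lists. If both are empty, the backup holds everything the archive does, byte for byte. Saving the manifest alongside the files also gives whoever inherits the collection a way to verify it later.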
Things get a bit weird somewhere in the middle of those two areas of content, though, with companies like Revision3. They began as a show, or later a couple of shows, which very much fit into the user-created content model: a couple of guys with a camera drinking and talking out of their arses for 20-30 minutes. But then it changed. It became the Revision3 we have today, the corporate ad-driven sludge that could very well have been taken direct from the TV and uploaded wholesale to the internet. I’m not against making a profit on content, but stop sucking the soul out of it; it feels like it’s hurting the product. But I’m not here to rag on content creators. My point here is that no matter how poor, tasteless or boring I believe the content or its presentation to be, it still deserves to be archived. What’s crap for me might be gold to somebody else, and it’s not my job to curate history in the making. If I even began to try I would doubtlessly decide that something which later turned out to be pivotal was actually the naffest thing to ever grace a visual display. I believe Jason Scott made a similar point about the preservation of GeoCities. Yes, it might be full of weakly written, poorly laid out, eye-damaging animated horribleness, but it’s historical weakly written, poorly laid out, eye-damaging animated horribleness. It’s a snapshot of what the internet was at that time, and as such it should not be forgotten. So go forth and grab it, grab it all, because as hard as it might be to believe, one day it will all be gone.