Benefits and challenges of archiving the Internet


Almost anyone who hasn’t spent most of their life cut off from the modern world knows that far more material exists online than a person could read, or even skim, in many lifetimes.

The Internet is a bottomless well of practically infinite knowledge, instantly accessible to anyone with a computing device. Of course, all that wonderful information is mixed in with massive amounts of worthless information as well. But as always, an old cliché proves apt: “One man’s trash is another man’s treasure.” Likewise, a second timeless truism still holds: “The more things change, the more they stay the same.”

Both truisms pose real problems for modern knowledge lovers with a yearning for learning and a burning desire to seek out information. Fortunately, a solution exists. Known as Web archiving, it lets you capture the information you want, and it is largely automated, which makes the process easy.

What is Web archiving?

All Web archival methods preserve portions of the Internet by capturing data for long-term storage so that it can be reused in future research. Three distinct archival methods predominate, each with its own features:

Remote harvesting

This is reportedly the most commonly used method. It employs automated Web crawlers to collect webpage content. Because such crawlers access pages in much the same way as conventional browsers, would-be Web archivists can get started with relatively simple tools. Examples of crawlers include Heritrix, HTTrack and Wget. Two popular free services for on-demand remote harvesting are WebCite and the Wayback Machine.
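To make the crawling idea concrete, here is a minimal sketch in Python using only the standard library. It is not how Heritrix, HTTrack or Wget work internally. The names `harvest`, `extract_links` and `fetch_over_http` are hypothetical, and the sketch ignores robots.txt, rate limiting and non-HTML resources, all of which a real harvester would need to handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) linked from the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def fetch_over_http(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def harvest(start_url, fetch=fetch_over_http, max_pages=10):
    """Breadth-first crawl restricted to the start URL's host.

    Returns a dict mapping each captured URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be retrieved
        archive[url] = html
        for link in extract_links(html, url):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive
```

Passing the fetcher in as a parameter is a design choice for this sketch: it lets the crawl logic be tried against an in-memory set of pages before it is pointed at a live site.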

Database archiving

This method archives online databases by extracting their data into a standardized schema, often expressed in XML. Once the desired content is stored in the chosen format, archived data compiled from several different sources can be accessed through a single portal, even from a home PC.
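As a rough illustration of the extraction step, the following Python sketch exports a database table to XML. It assumes an in-memory SQLite database as a stand-in for an online one, and the `<archive>/<row>/<field>` layout is an invented schema for demonstration rather than any established archival standard. The function name `table_to_xml` is likewise hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(connection, table):
    """Export every row of one table into a simple XML tree.

    The <archive>/<row>/<field> layout is a made-up schema for illustration.
    """
    # Table names cannot be bound as SQL parameters, so validate before use.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]

    root = ET.Element("archive", {"source_table": table})
    for row in cursor:
        row_element = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_element, "field", {"name": column})
            field.text = "" if value is None else str(value)
    return root


# Demo: an in-memory database standing in for an online one.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE articles (id INTEGER, title TEXT)")
connection.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [(1, "Web archiving"), (2, "Remote harvesting")],
)
xml_text = ET.tostring(table_to_xml(connection, "articles"), encoding="unicode")
print(xml_text)
```

Because every source is reduced to the same element layout, exports from several databases could in principle be searched or displayed together by a single tool.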

Transactional archiving

As the name implies, this Web archival technique captures data on a per-event basis during specific interactions between server and browser. Transactional archiving is typically used to document proof that a particular webpage was viewed on a certain date. This transaction-based approach is most common in industry sectors subject to legal or regulatory requirements for data retention and disclosure.
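The per-event idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the functions `record_transaction` and `matches` are hypothetical, the log is an in-memory list, and only a SHA-256 digest of each response is kept. A production system would typically also store the full response and protect the log against tampering.

```python
import hashlib
from datetime import datetime, timezone


def record_transaction(log, client, url, response_body, when=None):
    """Append one request/response event to the log and return the entry."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "timestamp": when.isoformat(),
        "client": client,
        "url": url,
        # The digest identifies exactly which bytes were served.
        "sha256": hashlib.sha256(response_body).hexdigest(),
        "size": len(response_body),
    }
    log.append(entry)
    return entry


def matches(entry, candidate_body):
    """Check whether a body is byte-identical to the one that was served."""
    return entry["sha256"] == hashlib.sha256(candidate_body).hexdigest()
```

Given a logged entry, `matches(entry, body)` can later confirm whether a stored copy of a page is the same content that was delivered at the recorded time.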

Benefits vs. drawbacks of content archival

“Universal access to all knowledge.”

That’s the official slogan of the Internet Archive, a California-based online library. However, the slogan should not be read only at face value. Beyond that one organization, it conveys perhaps the most vital motivation for preserving digitized information: ensuring that future generations can reuse it. Despite this laudable aim, Web archiving raises many novel technical and legal issues that remain largely uncharted territory in both law and IT.

Just a handful of these thorny questions: What should be archived? Which entity or entities, if any, should hold primary responsibility for archiving? What should be accessible, to whom, and when is granting access proper? Even more troubling is determining which entity or entities are qualified to decide these matters.

Besides those quandaries, even the most experienced and highly skilled Web archivists face an uphill battle to avoid allegations of intellectual property (IP) misappropriation that can easily lead to litigation. That’s because the vast majority of Web-based content is proprietary intellectual property in which the original creators hold legally protectable copyrights.

Still another huge challenge is that online data lacks the permanence inherent in comparable print media. According to Internet Archive founder and Chairman Brewster Kahle, the average lifespan of online content is 100 days before it disappears or is deleted. He observed that this life expectancy applies almost universally, even to services from big-name vendors such as Google Video, Yahoo Video and Apple’s MobileMe.

Thus, the catchphrase for 21st-century Information Age data seekers, catchers, movers and makers hasn’t changed one whit in several millennia, since a sage Biblical phrase was first penned to advise those who tilled the earth: “Make haste while the sun still shines. For when darkness falls, no man can work at all.”