Found online # 4 – web archives

The front page of my first-ever website, back in 2000, care of the Internet Archive

Next in this occasional series of handy resources to be found online is web archives. Too many of us think of the web as being its own archive. Everything is there, and if it is not there then it is not worth bothering about because there will always be something else like it that will do instead. Archives thrive on scarcity, but the web offers super-abundance. Much education is needed to show how much of the web will disappear unless we archive it, and how to find what has been rescued. It is estimated that the average lifespan of a web page is 100 days, while the British Library’s web archiving team has calculated that 50% of the UK web either disappears within a year or is no longer discoverable at the original URL.

Web archives are the new archives. Inevitably they are very much in their infancy (certainly not every country has one), but here is a list of some of the most prominent and useful:

  • Internet Archive Wayback Machine – With 284 billion web pages and counting, the IA’s web archive was launched in 2001. It captures web pages every few months or so. Discovery is via URL (i.e. you have to know the address of a site or page), not word-searching (yet). See also the new-look beta version
  • Archive-it – DIY web archive tool developed by the Internet Archive for curating your own web archive (with full text-searching) via the IA, which lists some 4,000 collections made using it
  • UK Web Archive – The UK has three web archives – this one is the open access version, with selected websites for which the owners have given permission for their archived sites to be published. The much larger Legal Deposit UK Web Archive has billions of pages archived since 2013 but can only be used in the reading rooms of the UK’s legal deposit libraries, including the British Library. It is fully word-searchable
  • Shine – The third UK web archive is a collaboration with the Internet Archive. Shine is a search engine for UK web sites 1996-2013 available on the Internet Archive, with full word-searching
  • Library of Congress – Websites archived by the Library of Congress (selectively, openly) plus a handy list of collections with archived web sites (e.g. the Iraq War 2003 Web Archive, the September 11, 2001 Web Archive)
  • UK Government Web Archive – UK government web pages archived by the National Archives, searchable by keyword, category or website. See also the associated UK Government Twitter archive
  • UK Parliament Web Archive – a collection of archived web content selected for preservation by the Parliamentary Archives, all word-searchable
  • Pandora – Australia’s web archive, established by the National Library of Australia in 1996 and now maintained by a consortium of Australian libraries and other institutions
  • International Internet Preservation Consortium – The IIPC is the international web archiving body, and its list of members points you to the several national web archives now in operation (Spain, France, Iceland, Japan etc) with varying degrees of online access depending on local legislation
  • Internet Memory Foundation – European foundation with archived websites on open access including its own collection and collections from partner institutions (including CERN and UK official archives)
  • List of Web archiving initiatives – As ever, Wikipedia has a handy list of web archiving operations around the world, with the varied means of access (from open to all to controlled onsite access only) showing that the challenges involved are as much legislative as they are technical
  • Trump Twitter Archive – The web archive of the moment, and one of the best, capturing every Trump tweet (including those subsequently deleted) and with some great use of categorisation (What’s the Worst? Personal Superlatives, Media Disdain etc). A private enterprise though, so strictly speaking not an archive since there is no long-term policy in place. But don’t worry, the Internet Archive is archiving it…
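
Since discovery in the Wayback Machine is by URL rather than by word search, it can help to see what those addresses look like. What follows is only a rough sketch, not anything official: the function names `wayback_url` and `availability_query` are my own inventions for illustration, and the sketch assumes the Wayback Machine's usual address pattern (a timestamp, or `*` for the list of all captures, placed in front of the original address) and the Internet Archive's availability lookup at `archive.org/wayback/available`. It only builds the addresses; it does not fetch anything.

```python
from urllib.parse import urlencode

# Assumed base addresses (not stated in the post itself).
WAYBACK_BASE = "https://web.archive.org/web"
AVAILABILITY_API = "https://archive.org/wayback/available"


def _check_timestamp(timestamp: str) -> None:
    """Accept an empty string or up to 14 ASCII digits (YYYYMMDDhhmmss or a prefix)."""
    if not timestamp:
        return
    if not (timestamp.isascii() and timestamp.isdigit()) or len(timestamp) > 14:
        raise ValueError("timestamp must be up to 14 digits, e.g. '2000' or '20000101'")


def wayback_url(page_url: str, timestamp: str = "") -> str:
    """Build a Wayback Machine address for a page (hypothetical helper).

    With a timestamp, the address asks for a capture near that date.
    Without one, '*' asks for the listing of all captures of the page.
    """
    _check_timestamp(timestamp)
    stamp = timestamp or "*"
    return f"{WAYBACK_BASE}/{stamp}/{page_url}"


def availability_query(page_url: str, timestamp: str = "") -> str:
    """Build an availability-lookup address for a page (hypothetical helper).

    The lookup is expected to answer with JSON describing the closest capture.
    """
    _check_timestamp(timestamp)
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_API}?{urlencode(params)}"


if __name__ == "__main__":
    # example.com is a placeholder; substitute the address you want to look up.
    print(wayback_url("example.com", "2000"))
    print(wayback_url("example.com"))
    print(availability_query("example.com", "20000101"))
```

Paste any of the printed addresses into a browser to see what, if anything, has been captured. The point remains the one made above: you have to know the address you are looking for.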
