A few months ago, the British Library launched the Mechanical Curator. This was a tool built out of the BL Labs project, which automatically extracted images from 65,000 or so out-of-copyright 19th century books which Microsoft had digitised for the Library back in the days when it thought it might compete with Google in the mass book digitisation game (it changed its mind, but we got to keep what had been digitised). By an ingenious process of image recognition and metadata extraction, the Mechanical Curator automatically lifted images from these digitised pages where it could find them, uploaded them to a Tumblr site complete with book title and a link to the catalogue record (with downloadable PDF of the complete book), and proceeded to publish these images one an hour, every hour, announcing each new image via its Twitter account. Not too many people actually followed the Twitter account, but it proceeded methodically to add the images, hour by hour.
Now the project has gone one stage further, and has uploaded the images to Flickr under a Creative Commons licence, so they are free to use by anyone. There are one million of them. In one fell swoop the Library has created, attributed, catalogued, uploaded and shared a unique image collection of prodigious size and infinite application. There are portraits, sketches, landscapes, cartoons, illustrations of fictional characters, maps, diagrams, photographs, advertisements, ornamental letters and many examples of elaborate chapter titles and page borders. The effect is of some mad scrapbook from the Victorian era, a fabulous treasure trove from an age when people learned so much about the world around them through illustrations. It is a monument to the great skills of the (often anonymous) artists of that age and a delight to the twenty-first-century eye.
But this is just the start. The images have their catalogue records, and often they come with a caption underneath because the Mechanical Curator captures parts of the text around the image to help with its identification. But in general we do not know what the images are. The plan therefore is to combine the crowd with automation to enrich the images with specific information as to their significance. As the man behind the Mechanical Curator, Ben O’Steen, describes it:
We plan to launch a crowdsourcing application at the beginning of next year, to help describe what the images portray. Our intention is to use this data to train automated classifiers that will run against the whole of the content. The data from this will be as openly licensed as is sensible (given the nature of crowdsourcing) and the code, as always, will be under an open licence.
And there could be so much more to this. After all, this resource has been built out of 65,000 books, and the British Library is sitting on tens of millions of books, newspapers, magazines and journals which will contain hundreds of millions of images. Of course many of those images will be in copyright, and the vast majority are not held in digital form as yet. But the potential image archive that could be unlocked boggles the mind.
However, it’s not just about unlocking, releasing and describing images. In sharing them in this way, the Library wants to see how people become inspired to take the images and the data and reapply these in new ways. Certainly there is inspiration there for designers, historians, anthropologists, cartographers, artists, authors, social scientists and indeed illustrators, but IT developers can also work with the files (manifests of the images, together with their descriptions, have been released on GitHub). This is moving the library out of its four walls into the public domain, in every sense. We wait with eager anticipation to see who takes us up on the challenge, and indeed what other collections start to do likewise.
There are a number of interesting issues that arise from this initiative. One is the question of scale. We may think that the Internet age has loaded us with more information than we can possibly handle, but really things have only just got started. I wrote a while back how the 750 million pages of our newspaper library, representing over 300 years of publishing history, will be trumped by 1 billion pages of our web archive to be acquired in a single crawl of the UK web space. Here now is an image archive of huge scale created out of a small fraction of our book collection – in theory it could be a thousand times bigger.
Then there is the rough-edged nature of the images. Traditionally image archives present just the image itself. This archive presents the untidiness around that image. This was a result of the limitations of image recognition software, but there is nothing lost and a great deal gained in speed, context and transparency of method. It is a new, pragmatic way of presenting an image archive.
Finally there is the notion of mechanical curation. What are curators for, if technology can achieve so much to make a library’s content available to all? Any additional description may be undertaken by the knowledgeable general public rather than wait for the Library expert to catch up. If the impulse is towards opening up our collections so all may share in them online, where should the expertise lie? Of course specialist collections will have been identified, collected, built up and cared for by curators who recognise the value of those objects. The curator brings intelligence to the process of collection, description and preservation. But how to balance such intelligent understanding with the wisdom (and folly) of the crowd? Curators have great knowledge, but it is seldom exclusive to them.
How might you curate these images, now that we have let them go?
- Ben O’Steen writes about the release of the images on the Digital Scholarship blog
- You can browse through the images on the British Library’s Flickr pages, or through the Mechanical Curator’s Tumblr site, or simply follow the Mechanical Curator on Twitter
- There are Digital Scholarship posts about the Mechanical Curator’s content and about its mechanics
3 thoughts on “Mechanical curation”
Another excellent feature, which I hadn’t spotted until now, is that each Flickr record provides a link to all the other images from that publication, and there’s another link to all other illustrations published in books in the same year. Each image has also been automatically tagged (year, place of publication, author etc). And there’s a link to a high quality version of the image should you wish to purchase one. Inspired all round.
Thoughtful post, Luke, thanks. In response to Ben’s informal request for new ways to use this wonderful image repository, I’ve posted a piece about how the collection could be used in a social game for kids to teach robots (machine-learning programs) how to “see” and understand the world around them… a ‘Seeing Eye Kid’ Robot Adoption Agency. Check it out here: http://goo.gl/xYlymt.