Searching for words

Internet Archive servers, via Mashable

The times are busy at the moment, and I’m not finding much time for writing. And when I do find the time I discover after a couple of paragraphs that I so dislike the results that I have to click on the ‘Move to Trash’ option WordPress understandingly provides for me. Ah well.

But I have to write about a great advance for anyone who researches historical texts on the Web. Just about the greatest boon to the modern researcher, after the web itself, has been Optical Character Recognition, or OCR. The magical thing about OCR is that it turns printed text into word-searchable text. It’s magic because when a book is digitised the words are recorded as images, despite appearances, and these images have to be translated back into words that a machine can read so that we can then search for words within those digitised texts.

For modern texts with nice, sharp type, it’s an easy process for the machine to convert image into true words. For older texts with less distinct type it gets to be more of a challenge, which is why searching through newspaper archives produces uneven results, simply because the machine (i.e. the software programme used) cannot always interpret what it sees. This is what they call uncorrected OCR, and it’s always worth remembering that beneath what looks look a clear text is often a jumble of words and half-words. The term that you have been looking for might still be there but the machine can’t see it.

Digitised newspaper with its uncorrected OCR, via Chronicling America

Most of the great digitised newspaper online collections – Newspaperarchive.com, Trove, British Newspaper Archive – offer word searching across the wholeof their digitised texts. Among the major digitised book depositories Hathi Trust does so, while Project Gutenberg used to but does no longer. The largest of them, the Internet Archive, until now had not done so, offering only searching across titles and short descriptions. But now – hallelujah – the IA has launched a Beta full text search facility for its books. Just enter in your search term or phrase (in inverted commas, of course) in the search box on the front page, and under the search results you will see a check box with the words ‘Search full text of books’. Check the box, and never leave it unchecked from now on. (The facility is not available on the Advanced Search option, as yet)

This must be quite new, as I’ve seen no report on it, and there’s nothing on the IA blog about it. It follows on from the IA’s release of Beta word-search of its Wayback Machine for archived websites, and its captions searching for TV news programmes. As I have said here on more than one occasion, everything is and should be word-searchable (text, video, audio), and when we bring together properly we will have something truly powerful. We might even start to learn stuff.

Highlighting search terms in an individual Internet Archive text

Of course the Internet Archive has offered us searching within its individual digitised books before now, with those handy orange markers which show you every point across the full text at which you search term occurs. But until it has not offered word searching across every text, all at once. It is a Beta service, so one must expect to find quirks which will get ironed out in time. What may never be ironed out is the limited classification of IA books, whose clumsy user-generated terms mean that filtering down results is a haphazard exercise. But we wouldn’t be researchers if it was made too easy for us.

I do a lot of speculative word-searching across book and newspaper archives. Some of this is professional (I have care of quite a few newspapers as part of my day job), but in my own time I rely on some creative word and phrase searching strategies to find texts for my Picturegoing and Theatregoing sites, which reproduce selections from archive texts for their respective themes. It’s become a habit. It’s just magical how you can find things, and an intoxicating challenge to apply smart searching strategies to uncover what has been only obliquely described.

As said, OCR is not always perfect, though I regularly amazed by the quality of the OCR from even very old texts on Hathi Trust, which is why it is my preferred source for locating relevant passages of archive texts. OCR techniques have improved greatly over the years, and some institutions are now thinking about whether it is worth re-OCRing documents they digitised ages ago to get better results. That doesn’t necessarily mean re-digitising the original, though – software programmes such as Tesseract have been shown greatly to improve OCR results from the existing digital files, so what was previously unclear becomes that much clearer. If the results are significantly and measurably clearer, investing in re-OCRing will become worth it.

The word-searchable text is one of the wonders of our time. Not only has it opened up hidden collections and answered countless enquiries, but it has reinvented what an archive text is. What was read linearly or occasionally interpreted through an index has become infinitely computable. It’s a move from a building with a door to one with an unending multiplicity of windows, each with its own unique view, whether you are looking in or looking out. We have been taken all that closer into the mind of the author. We uncover the world in which their thoughts grew. As OCR shows, each word is an image, and each translated image a new way of seeing.

About

View all posts by

Leave a Reply

Your email address will not be published.