It is good to be involved in a good thing. Last week, after years of development and the coming together of assorted initiatives, the British Library made one million pages from historic newspapers freely available online. Next year it will publish one million more, and a million the year after that. At the same time raw text from many digitised newspapers is also being made available for free, as a separate resource, for specialist researchers who want to analyse history through the words we have left behind.
In 2010 the British Library, which holds the nation’s newspaper collection, did a deal with family history company brightsolid (now known as Findmypast). The BL would make its newspaper collection available to brightsolid; they in turn would digitise the newspapers, produce a text copy (using OCR, or Optical Character Recognition), publish the results on a website, and deliver a digital preservation file to the Library, which the latter would add to its digital library store. For all in-copyright newspapers, brightsolid would need to obtain licences from the rights owners. They would run a business out of this, charging a subscription for use of what was launched in 2011 as the British Newspaper Archive. They would select all the content; they owned all the derived data.
The BNA, as it is commonly known, has been a huge success. Over a period of ten years 44 million pages have been added to the site, with the current rate being something over six million pages a year. They have been able to do so, firstly by having a very efficient management system in place; secondly by digitising mostly from microfilm. Around a third of the British Library’s newspaper collection has been copied onto microfilm (while still keeping the print, of course), and digitising from microfilm is far faster than doing so from print. People doing family history benefitted hugely from the BNA, but so did researchers in many other fields, from political studies to sports history to media history, as well as novelists looking for ideas.
But things weren’t perfect.
At the British Library we had no say in what was digitised. That was OK up to a point, since what was chosen to be digitised was properly representative across time periods and geographical areas, but it was a little frustrating to have such a passive role. Moreover, many worthwhile titles were not getting digitised simply because they had not been copied onto microfilm and were often in poor condition (meaning that the print could be troublesome to digitise). In addition, the years after the deal was signed had seen a huge rise in academic interest in newspaper data, to support ‘big data’ studies, where sophisticated software could be used to answer new kinds of historical enquiry. But we did not own the data. Other national institutions were developing open newspaper archives – why not the British Library?
The history of the negotiations can be told at another time, but essentially we used the opportunity of an extension to our contract with Findmypast to broker change, something that they warmly supported. Firstly, the British Library would digitise selected newspapers itself, from print, which Findmypast would post-process (creating the OCR and other metadata) and publish. There was a BL programme called Heritage Made Digital, which aimed to rebalance what was digitised from the collections. We used this to digitise out-of-copyright British newspapers whose print copies were in a poor or unfit condition and where no microfilm access copy existed, focussing on titles that were published in London but read beyond London. This was for three good reasons: most 19th-century London-based titles had not been digitised, they complemented Findmypast’s focus on regional titles, and they would have value for the widest audience. It was our intention from the start to make these titles freely available online, by whatever means available.
We also negotiated ownership of the derived data; that is, the data created as part of the digitisation process, including the OCR. This we would make available as a separate resource to support data science. At much the same time, the BL entered into partnership with the new Alan Turing Institute for data science, securing funding for a major historical study of the British industrial revolution, entitled Living with Machines, using data science tools, for which access to large amounts of newspaper data would be essential.
We were no longer passive. We were active.
The results you can now find in two places. On the British Newspaper Archive there is now a ‘Free to View’ filter on their search engine. You have to sign up to BNA beforehand, but once you have done so there are one million pages from 158 newspaper titles that you can view and download without charge. Thanks to the agreement with Findmypast, we’ll be adding a million more next year, and so on. Meanwhile, they will keep on digitising newspapers at the same extraordinary rate (or even more) as the subscription site grows and grows, supporting the free element.
We’re also publishing the OCR data on the BL’s Research Repository. There are just a few titles there at present, representing a few hundred thousand pages, but in time we’ll be adding millions of pages of text, all free to use.
So, what have we made ‘free to view’? As said, it needs to be out-of-copyright, and by the guidelines that the BL follows, that means 140 years old. It’s a conservative rule of thumb adopted by other institutions (different countries have different copyright laws, please note), designed to identify a ‘safe’ period. That to me is important – identify the conditions under which you can make the most available with the fewest restrictions, and make the best of that.
The sort of titles we have made available are dotted throughout this post. There are establishment titles weighty with Victorian authority: The Morning Chronicle, Morning Herald, The Sun (no, not that one), The Press, The St James’s Chronicle. There are radical titles produced by those demanding social change: The Poor Man’s Guardian, The Bee-Hive, Cobbett’s Weekly Political Register. There are illustrated newspapers rich in depictions of Victorian society: the Illustrated Sporting and Theatrical News, The Lady’s Newspaper, Colored News (a short-lived foray into colour printing in 1855). There are regional titles: The Brighton Patriot, Manchester Times, Swansea and Glamorgan Herald, The Liverpool Standard and General Commercial Advertiser. There are special interest newspapers such as The British Emancipator (an anti-slavery title), British Miner and General Newsman, and the Jewish Record.
Some titles come from the Heritage Made Digital project, some from Living with Machines, some from newspapers we digitised years ago with funding from the Joint Information Systems Committee. More, as said, will follow. We’ll also be able to make them available via the British Library catalogue, and not just the British Newspaper Archive site, once we have completed some technical work on our digital archives (by next year, fingers crossed).
Finally, some big numbers. There are 44 million pages on the British Newspaper Archive. We have opened up just over 2% of that. But there are 450 million newspaper pages held by the British Library, so we’ve not yet digitised a tenth of the collection (though nearly so). There is a whole mountain range to get through, rights and money permitting (all of this costs money). It will take decades. But, as far as free access, and hence a wider audience, are concerned, we have established base camp. All we can now do is climb.
- The free-to-view newspapers are available on the British Newspaper Archive, with a guide to their use here: https://blog.britishnewspaperarchive.co.uk/2021/08/09/introducing-free-to-view-pages-on-the-british-newspaper-archive/
- A blog post I wrote for the British Library’s Newsroom blog lists all 158 titles: https://blogs.bl.uk/thenewsroom/2021/08/free-to-view-online-newspapers.html
- The Research Repository is at https://bl.iro.bl.uk and the news datasets it holds can be found at https://bl.iro.bl.uk/collections/353c908d-b495-4413-b047-87236d2573e3
- There’s a fine video tour of highlights from the free-to-view titles given by Findmypast’s Mary McKee, below: