Digitising a Century of History with Data
For several years we’ve been working on an internal project, codenamed “Vintage”, to digitise our archives. The process involves several detailed steps, many of them manual and time-consuming.
Turning our “dusty” archives into digital artefacts in our data warehouse would enable us to leverage our legacy for a myriad of purposes:
- Making the history of Hong Kong and China searchable and accessible for educational institutions and research
- Increasing efficiency and ease of reference for our newsroom internally
- Syndicating content to partners, news agencies and businesses
- Making selected content available to SCMP readers
- Licensing archival content to individuals, companies or institutions for commercial purposes
The first step is taking the microfilms from the archives and turning them into high-resolution digital scans. We scanned these at 300 DPI (dots per inch), although 600 DPI is generally recommended; the higher the resolution the better, within time and storage constraints, particularly for a large-format broadsheet. Small fonts can be difficult to decipher because of distortion from wear and tear on the print copy over time, or smudges on the newsprint.
A scan via microfilm from SCMP’s first day of publication on November 6, 1903.
Once the high-resolution scans are completed, we need to transform them into text via OCR (optical character recognition) so that we can begin mapping each article into a semi-structured or structured format. We chose XML (Extensible Markup Language) because it is both human- and machine-readable, as shown below:
Sample of XML output from the OCR process
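As a rough illustration of how that XML output gets turned into article records, the sketch below parses OCR output of this general shape into one record per article. The tag names here (article, headline, paragraph) are hypothetical placeholders, not the actual schema our OCR engine emits.

```python
# Minimal sketch: parse OCR XML output into per-article records.
# Tag names (<article>, <headline>, <paragraph>) are hypothetical placeholders.
import xml.etree.ElementTree as ET

def parse_ocr_page(xml_path: str) -> list[dict]:
    """Return one dict per article found in a single OCR'd page."""
    root = ET.parse(xml_path).getroot()
    articles = []
    for article in root.iter("article"):
        headline = article.findtext("headline", default="").strip()
        body = " ".join((p.text or "").strip() for p in article.iter("paragraph"))
        articles.append({"headline": headline, "body": body})
    return articles
```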
As you can see from the results above, the mapping has some inconsistencies and requires further cleaning and transformation: removing extra spaces, special characters and erroneous letters.
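A minimal version of that cleaning pass might look like the sketch below, which drops stray special characters and collapses runs of whitespace; the rules we actually apply in production are more involved.

```python
import re

def clean_ocr_text(text: str) -> str:
    """Basic OCR clean-up: strip stray special characters and
    collapse runs of whitespace into single spaces."""
    # Keep letters, digits, whitespace and common punctuation; drop the rest.
    text = re.sub(r"[^A-Za-z0-9\s.,;:!?()'\"-]", " ", text)
    # Collapse repeated whitespace (spaces, tabs, newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()
```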
The final step in the process is to convert that text into structured data and to transfer it to our data warehouse.
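That transfer can take many forms. As one illustrative sketch (not our actual pipeline), the cleaned articles could be written out as date-partitioned Parquet files, a columnar format that most warehouses can bulk-load; the publish_date field here is a hypothetical column name.

```python
import pandas as pd

def to_warehouse_files(articles: list[dict], out_dir: str) -> None:
    """Write cleaned articles as year-partitioned Parquet files
    that a data warehouse can bulk-load."""
    df = pd.DataFrame(articles)
    df["publish_date"] = pd.to_datetime(df["publish_date"])  # hypothetical field
    df["year"] = df["publish_date"].dt.year
    df.to_parquet(out_dir, partition_cols=["year"], index=False)
```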
In the past few months, our data engineering team has taken a century’s worth of our historical archives and transformed it into structured data, which now sits in our data warehouse. Taking a first look at this data, we found some interesting insights.
Plotting our average article output per week, we see a small dip during World War I and then a substantial drop during 1941-1945, when Japan occupied Hong Kong in World War II. The SCMP nonetheless continued to grow its volume of coverage over time, through the ’70s and into the late ’90s.
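A sketch of that weekly aggregation, assuming the partitioned article files described above and a hypothetical publish_date column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sketch: weekly article counts across the full archive.
# Path and column names are hypothetical placeholders.
df = pd.read_parquet("articles/")
weekly = (
    df.set_index(pd.to_datetime(df["publish_date"]))
      .resample("W")
      .size()
)
weekly.plot(title="Articles published per week, 1903 onwards")
plt.ylabel("articles")
plt.show()
```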
Once this data is available in our warehouse, we can run various NLP (natural language processing) models against it, including sentiment analysis, readability scoring, keyword tagging and topic analysis. Archival content poses particular challenges, however: the news cycle is ever-changing alongside the world we live in, and training an algorithm across a century’s worth of topics is far from straightforward.
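As one hedged example of the topic-analysis side, a simple LDA pass over the article bodies could look like the following scikit-learn sketch; this is an illustration of the general technique, not our production model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def top_topics(texts, n_topics=10, n_words=8):
    """Fit a basic LDA topic model and return the top words per topic."""
    vectoriser = CountVectorizer(stop_words="english", max_features=20_000)
    counts = vectoriser.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectoriser.get_feature_names_out()
    return [
        [vocab[i] for i in topic.argsort()[-n_words:][::-1]]
        for topic in lda.components_
    ]
```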
Leveraging unsupervised keyword tagging, counting recurring words while excluding “stop words” such as “the”, “is” and “and”, may be a more effective approach to extracting recurring themes over time in our content. Doing this, we find, not surprisingly, that China, Hong Kong, British, Chinese, government and police are among our top keywords.
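A minimal sketch of that counting step is below; the stop-word set shown is only a small placeholder for a fuller list.

```python
import re
from collections import Counter

# Placeholder stop-word set; a real list would be much longer.
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "for", "on", "at"}

def top_keywords(texts, n=10):
    """Count recurring words across all articles, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)
```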
Looking at the keyword “China”, we see that its usage on a per-article basis has fluctuated over time. Having only recently completed importing this data into our warehouse, we are just starting to scratch the surface of the insights that digitising a century of historical news coverage can reveal.
By bringing historical perspectives to life with today’s data technology, we look forward to sharing more insights, new findings and learnings in the near future!