How to Text Mine Old Newspaper Articles

I had an interesting request from a student earlier this semester, who was looking for ways to pull and analyze news articles and social media for a big research paper. Because I’m working at a college library, I’ll keep it anonymous and create a hybrid of several students’ experiences for illustration purposes. We’ll say that Anna* was researching girl scout cookies*. Everyone loves Samoas, am I right?

Some senior librarians have suggested I look up a topic before a student arrives (based on appointment emails) to find resources for them. I have mixed feelings on this: is it too easy? Sure, I can find and present several useful resources. I can be librarian as ‘gatekeeper’, where I have some super-secret store of knowledge that the user + Google doesn’t have.

But I hate that role. It’s fake. I learn the same way that my professors and students do: by trial and error. So even while I’m looking for background info for Anna, I’m wondering: when do students learn this? When do they learn the exploring, asking around, and messing around with library databases and online tools that I’m using to pick up new skills?

Maybe that happens when they become research librarians. Or ‘social media experts.’ Or entrepreneurs. Or any other white-collar job that’s focused vaguely on technology, people, and information.

This is a long way of saying that before Anna* even got to my office, I was searching for text-mining tools or scrape-able news articles and social media data. I knew I wanted something simple to use: she shouldn’t need to code, set up an API, or use specialized file formats for a class project. For a senior thesis, sure. But for now, learning to find and run data through an analysis tool (and write intelligently about the results) should be enough.

Easy Text Mining for Books: the Google nGram Viewer

I’ve mentioned Google nGrams before–a free online tool that lets you analyze the frequency of words across time and languages. But this only includes what Google Books has taken the time to digitize. Here are words in close proximity to genetically modified organisms (i.e. half of what we eat, as Americans):


Search for *=>GMO on Google NGram Viewer

NGrams’ specialized search notation also lets you look for specific combinations of words, like nouns connected to GMOs in print books over time:


GMOs=>*_NOUN searches books for nouns in relation to GMOs on Google NGram Viewer

Nice. That’s somewhat interesting, but it’s focused on books, and not on online media.

Quick Text Mining for Tweets: Topsy

An interesting tool for getting an overview of Twitter is Topsy, which shows what people are tweeting, or blogging and then sharing on Twitter. I can search for ‘Kazakhstan’ by time period:

Topsy Kazakhstan

And the analytics page tracks the number of tweets about each search topic, by day:

Topsy tweets by day

You could probably use Kimono Labs or a similar tool to scrape and analyze the tweets… but because Topsy just aggregates tweets, it’s kind of like a glorified search engine.

Similarly, Facebook has a mostly useless API that lets you see the technical structure of your own posts and pages. That’s revealing in itself, so you should totally check it out if tech-inclined. However, Facebook doesn’t let you search public posts or hashtags anymore. That’s yet another reason not to share data with a company that won’t share back.

Data Mining for News: The New York Times Article Search API

The NYT has an article search API that returns structured data from their news articles. You know, metadata–the information about the article (header, title, date, people mentioned). Sadly, it doesn’t include full-text, which doesn’t help my student. Below, the fields that get indexed:

Metadata for articles at the New York Times

As an example in action, I can search for girl scout cookies and get structured JSON data. Which… I don’t know what to do with (other than that I should probably learn to work with d3.js for data visualization, at some point).

I guess the NYT won’t share full-text so that they can keep their stuff private, but it’s unfortunate for anyone trying to do systematic academic research:

Girl scout cookies on NYT article api
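For the tech-inclined, here is a minimal sketch of one thing you could do with that JSON: flatten it into simple rows and count articles per year. The sample below is made up for illustration, and the field names (response, docs, headline.main, pub_date, web_url) are my assumption about the response layout, so check them against the NYT developer documentation before relying on this.

```python
import json

# Made-up sample for illustration only. The field names are an assumption
# about the Article Search response layout; verify them in the NYT docs.
SAMPLE = """
{
  "response": {
    "docs": [
      {"headline": {"main": "Example headline one"},
       "pub_date": "2015-02-01T00:00:00Z",
       "web_url": "https://example.com/one"},
      {"headline": {"main": "Example headline two"},
       "pub_date": "2014-03-15T00:00:00Z",
       "web_url": "https://example.com/two"}
    ]
  }
}
"""


def flatten_docs(raw_json):
    """Turn the nested JSON into a sorted list of (date, headline, url) rows."""
    data = json.loads(raw_json)
    rows = []
    for doc in data.get("response", {}).get("docs", []):
        headline = doc.get("headline", {}).get("main", "")
        date = doc.get("pub_date", "")[:10]  # keep only YYYY-MM-DD
        rows.append((date, headline, doc.get("web_url", "")))
    return sorted(rows)


def count_by_year(rows):
    """Count how many articles fall in each year."""
    counts = {}
    for date, _headline, _url in rows:
        year = date[:4]
        counts[year] = counts.get(year, 0) + 1
    return counts


if __name__ == "__main__":
    rows = flatten_docs(SAMPLE)
    for row in rows:
        print(" | ".join(row))
    print(count_by_year(rows))
```

Rows like these paste straight into a spreadsheet, which is probably the easiest next step for a class project.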

Sentiment Analysis for online chatter: Opinion Crawl

Sentiment analysis is technically complex, but Opinion Crawl gives a quick view of the positive and negative chatter surrounding current events. Worth a mention, but it’s too informal to recommend for academic research:

Opinion crawl results for Vladimir Putin March 2015

Key Terms in Academic Articles: Data for Research (beta) on JSTOR

Google, Yahoo, the NYT, and Vogue (via Yale scholars using a ProQuest database) have all recently released ways to mine their archives and explore changes in topics over time. Below, I compare the mention of pantalons, skorts, and suffragettes in Vogue over time:

Bookwords Reading Vogue

Pantalons, skorts, and suffragettes in Vogue

More and more of these databases are starting to allow rudimentary text-mining for academic purposes.

So it makes sense that JSTOR would jump on the bandwagon. JSTOR’s Data for Research (DfR) is a curious tool, still in beta, that lets you compare topics in academic articles over time. For instance, I can search for articles on economic anthropology over the past 10 years, and see JSTOR’s article-derived “key terms” (which are actually surprisingly relevant):

DfR JSTOR search results for economic anthropology

I can also see what ‘subject group’ the articles are published in. In the case of economic anthropology (the study of culture, trade, and resources), articles are in social science and area studies (Africa, Asia) journals:

DfR discipline covering economic anthropology

Overall: I like that I can view ‘key terms’ extracted from the article itself, rather than just the high-prestige keywords an author wants to label their work with. But I’m not sure what else this can do. DfR seems underdeveloped at the moment–and it wasn’t helpful for Anna because JSTOR doesn’t index newspapers.

A Source of News Articles: LexisNexis Academic

When Anna got to my office, we ended up pulling news articles on girl scout cookies from LexisNexis:

lexis-nexis academic search

We had also tried ProQuest newspapers, but many results were in PDF and hard to copy-paste. LexisNexis, on the other hand, returned substantial full-text results, which we could narrow down by US or international newspaper, and by year of interest. Here are their results for Girl Scout Cookies in 2014-2015:

lexis-nexis newspaper options

And below, a sample resulting article. This was easy to copy and paste into a text analysis program or tool:

how cookie crumbles article

It’s also easy, I later found, to multiselect a batch of 25-1000 articles and download them as a single text file. I can also select them one at a time and download, but that takes much more time. (However, as I’ll show below, individual .txt files are useful for analysis in the AntConc tool).

Girl scout cookies news article, download screenshot on LexisNexis
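If you take the batch download but still want one .txt file per article, a short script can split the combined file. This is only a sketch: I'm assuming each article in the download starts with a line like "1 of 25 DOCUMENTS", so check your own file and adjust the pattern if it looks different. The function names and the article_001.txt naming scheme are my own invention.

```python
import re
from pathlib import Path

# Assumption: each article begins with a line such as "1 of 25 DOCUMENTS".
# Change this pattern if your download uses a different separator.
DELIMITER = re.compile(r"^[ \t]*\d+ of \d+ DOCUMENTS[ \t]*$", re.MULTILINE)


def split_articles(combined_text, delimiter=DELIMITER):
    """Split one big download into a list of article texts."""
    parts = delimiter.split(combined_text)
    return [part.strip() for part in parts if part.strip()]


def write_articles(articles, out_dir):
    """Write each article to its own numbered .txt file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, article in enumerate(articles, start=1):
        path = out / f"article_{number:03d}.txt"
        path.write_text(article, encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    demo = (
        "1 of 2 DOCUMENTS\n\nFirst article text.\n\n"
        "2 of 2 DOCUMENTS\n\nSecond article text.\n"
    )
    for article in split_articles(demo):
        print(repr(article))
```

To use it on a real download, read the file with Path("your_download.txt").read_text(), pass the result to split_articles, and hand that list to write_articles along with a folder name.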

A simple textual analysis: Textalyser and ProWritingAid

I feel like I’m barely scratching the surface here, as this was my first text mining request. Anna* and I looked at discourse analysis as a textual research method, which requires close reading of a few articles (cf tutorial at Politics East Asia).

But she wanted something easier than close analysis and detailed qualitative coding (surprise, surprise). As a trial, we pasted the article from above into Textalyser, to at least get basic frequency statistics:

textalyser word stats

Like other quantitative text analysis tools, Textalyser let us compare repeated phrases of 2-5 words each, looking for common themes or topics:

textalyser five word frequency
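The idea behind this kind of phrase counting is simple enough to sketch in a few lines of Python. This is not how Textalyser itself works (I don't know its internals), just a rough illustration of counting repeated 2-5 word phrases. The function names and the min_count cutoff are my own choices.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9'’]+", text.lower())


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    words = tokenize(text)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)


def repeated_phrases(text, min_n=2, max_n=5, min_count=2):
    """For each phrase length, list the phrases that appear at least min_count times."""
    results = {}
    for n in range(min_n, max_n + 1):
        counts = ngram_counts(text, n)
        results[n] = [(p, c) for p, c in counts.most_common() if c >= min_count]
    return results


if __name__ == "__main__":
    sample = "Thin mints sell out. Thin mints sell fast."
    for n, phrases in repeated_phrases(sample).items():
        print(n, phrases)
```

Note that this sketch ignores sentence boundaries, so a phrase can span the end of one sentence and the start of the next. A more careful version would split on sentences first and drop very common words.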

Anna was happy with this, and went to explore further on her own. I’m still not sure that Textalyser was the *best* tool, as Anna wants to analyze hundreds of newspaper articles against each other.

Another program that came to mind is ProWritingAid. It’s meant for writers, but catches repeated phrases, cliches, and awkward wording. I love it for writing, but it could perhaps be used for surface-level research as well:

Pro Writing Aid

Repeated words and phrases analysis in ProWritingAid

Update: After meeting with Anna, I found AntConc and related tools by Laurence Anthony. This free Windows / Mac OS X tool allows students to compare multiple files in a concordance, with word lists, keyword lists, and collocation of words. I’ll recommend it next time students ask for easy ways of analyzing multiple texts together. (Source: Basic Text Mining intro at Macroscope).

Screenshot of AntConc in action
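If you're curious what a concordance actually does, here is a bare-bones keyword-in-context sketch in Python. AntConc does far more than this, and this is not its algorithm, just my own illustration of the basic idea: find each occurrence of a word and show a few words on either side, across several texts. The function names and sample file names are hypothetical.

```python
import re


def concordance(text, keyword, width=4):
    """Return keyword-in-context lines: up to `width` words on each side of a hit."""
    words = re.findall(r"\S+", text)
    target = keyword.lower()
    lines = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation before comparing to the keyword.
        if word.strip(".,;:!?\"'()").lower() == target:
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


def concordance_across(texts, keyword, width=4):
    """Run the concordance over a dict of {name: text}; return (name, line) pairs."""
    hits = []
    for name, text in texts.items():
        for line in concordance(text, keyword, width):
            hits.append((name, line))
    return hits


if __name__ == "__main__":
    # Hypothetical file names and made-up text, for illustration only.
    texts = {
        "article_001.txt": "The cookies sold fast. Scouts love cookies!",
        "article_002.txt": "Sales of cookies rose this year.",
    }
    for name, line in concordance_across(texts, "cookies", width=2):
        print(f"{name}: {line}")
```

Reading real files in is a matter of building the dict from a folder of .txt files, for example with pathlib's glob("*.txt") and read_text().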


I still feel like I’m learning the sources myself, as students come to me with widely disparate requests (detailed GIS points of a city in a remote country? economic indicators in 1980s Latin America? local Maine ecology and economic development?). Anna and I did end up looking at text-analysis and digital research tools, including those in the DiRT Directory, before she left, and DiRT is where I’d recommend others start in the future to find textual analysis tools.

However, I’ve shared the full process above so that you could see a couple of things:

a) how messy it is. I hit dozens of dead-end websites and tools, which I haven’t shown.

b) how variable the results are. Some things above are high-quality, and others are not.

c) how complex it can be. Some tools were too intricate to learn in the 2-3 hours I allotted for exploring with and without the student.

d) how fun it can be. The tools above are open, fairly easy, and hopefully fun for you to play with and explore!
