Data Files: getting information from websites with Kimono

Have you ever wanted to grab information from a website but found it time-consuming to copy-paste everything? Well, here’s your way around that!

A few weeks ago I ran across a Chrome extension that lets you gather and transform information from webpages. Normally, this sort of thing requires some technical skill, but Kimono makes it easy for almost anyone to pull and reuse what’s posted online. Here, I’ll show you how to use Kimono to pull small sets of information from a single webpage in a structured form, in either RSS or Excel format.

Getting the Data

Let’s suppose I’d like to get regular updates on news concerning Augusta, the state capital of Maine. Unfortunately, the Morning Sentinel doesn’t provide an RSS feed (that I can see) for just one town. I can narrow it down to get 100 articles about all of Central Maine each day… but who has time to read that?

1. So, I can go to Kimono Labs and create an account, and glance at the tutorials to see how things work.

2. Then, I install the extension on Chrome to allow me to easily select which parts of a website I want to extract.

3. At this point, I follow the easy instructions on the Kimono website to create an API. Basically, what I’ve done is:

4. Gone to the Central Maine website and searched for Augusta in local news.

5. Clicked the Kimono extension in my browser and started selecting the bits of information I want to extract from the search page.

I’ll let you follow the tutorials for this. Then, I label each selected part of the page (title, date, excerpt) so that an RSS reader can access it (see the Want RSS? button below? Yeah, click that).

From there, I can create my API, and then view results. There are many uses, and I’ll just highlight two:

Uses: Research

The first thing I could do would be to download the data as a CSV file, and then sort and label it in Excel. There are tutorials that show you how to do this, even for 50 pages of results. In this way, I could, say, trawl the web for a research project, download the results into Excel, and then categorize what I found in some sort of content analysis.
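
If you’d rather script the labeling than do it by hand in Excel, here is a minimal Python sketch of the same idea. It assumes the CSV export has the columns I labeled earlier (title, date, excerpt); the sample rows and the keyword lists are made up for illustration and are not real Kimono output.

```python
import csv
import io

# Hypothetical sample standing in for a CSV export. The column names
# (title, date, excerpt) are assumed to match the labels chosen in Kimono.
SAMPLE_CSV = """title,date,excerpt
City council approves budget,2014-05-01,Augusta councilors voted on the budget Thursday.
High school wins title,2014-05-02,The Cony team took the championship in Augusta.
New bridge opens,2014-05-03,Traffic began crossing the new bridge this week.
"""

# Hypothetical coding scheme for a content analysis: category -> keywords.
CATEGORIES = {
    "government": ["council", "budget", "legislature"],
    "sports": ["team", "championship", "game"],
}


def categorize(row):
    """Return the first category whose keyword appears in the title or excerpt."""
    text = (row["title"] + " " + row["excerpt"]).lower()
    for category, keywords in CATEGORIES.items():
        if any(word in text for word in keywords):
            return category
    return "other"


def label_rows(csv_text):
    """Parse CSV text and add a 'category' column to every row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["category"] = categorize(row)
    return rows


labeled = label_rows(SAMPLE_CSV)

# Sort by date and print one line per article.
for row in sorted(labeled, key=lambda r: r["date"]):
    print(row["date"], row["category"], row["title"], sep=" | ")
```

For a real project you would swap the sample text for the contents of your downloaded file and replace the keyword lists with your own coding scheme.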

Uses: News and Updates

The second evident use is to create my own RSS feed. (If you don’t know what RSS is, google it and you’ll see it’s an easy way to follow your favorite blogs, cartoons, news sites, and job postings all in one place!)

To get an RSS feed, I can use my API, above, and click Endpoints → RSS (see screenshot above). This pops up a new URL (here is the URL for the Augusta feed) that I can copy and add to my RSS reader. Once you add that link, you’ll be able to view results just like you would for any other blog or feed:
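
If you’re curious what an RSS reader does with that URL, here is a rough Python sketch. It assumes the feed is ordinary RSS 2.0 with title, link, pubDate, and description fields; the sample feed text and the example.com links are invented stand-ins, not the real Augusta feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample of an RSS 2.0 feed. A real feed would be fetched
# from the endpoint URL, and its fields may differ from this assumption.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Augusta news</title>
    <item>
      <title>City council approves budget</title>
      <link>http://example.com/articles/1</link>
      <pubDate>Thu, 01 May 2014 12:00:00 GMT</pubDate>
      <description>Augusta councilors voted on the budget Thursday.</description>
    </item>
    <item>
      <title>New bridge opens</title>
      <link>http://example.com/articles/2</link>
      <pubDate>Sat, 03 May 2014 09:30:00 GMT</pubDate>
      <description>Traffic began crossing the new bridge this week.</description>
    </item>
  </channel>
</rss>
"""


def parse_items(rss_text):
    """Return one dictionary per <item> element in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "date": item.findtext("pubDate", default=""),
            "excerpt": item.findtext("description", default=""),
        })
    return items


items = parse_items(SAMPLE_RSS)
for entry in items:
    print(entry["date"], "-", entry["title"])
```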

And there you go! I’ve created a customized RSS feed for articles that mention Augusta (which, Augusta being the capital, turns out to be way too many articles; this would be more useful for a smaller town!). I’ve also seen that I can download information about articles and then analyze it in Excel. I have to admit I haven’t quite sorted out the details of RSS, as it seems to pull only the 10 most recent items at a time, but you can set how frequently it will update.
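
One possible way around the 10-item limit, if you want a longer record, is to save each snapshot of the feed and merge the snapshots yourself. This Python sketch assumes each item is a dictionary with a link field that uniquely identifies the article; the sample items are hypothetical.

```python
def merge_items(archive, new_items):
    """Return the archive plus any new items whose link is not already in it."""
    seen = {item["link"] for item in archive}
    merged = list(archive)
    for item in new_items:
        if item["link"] not in seen:
            merged.append(item)
            seen.add(item["link"])
    return merged


# Hypothetical snapshots taken on two different days; article 2 appears in both.
monday = [
    {"title": "City council approves budget", "link": "http://example.com/articles/1"},
    {"title": "New bridge opens", "link": "http://example.com/articles/2"},
]
tuesday = [
    {"title": "New bridge opens", "link": "http://example.com/articles/2"},
    {"title": "Library extends hours", "link": "http://example.com/articles/3"},
]

archive = merge_items([], monday)
archive = merge_items(archive, tuesday)
print(len(archive), "unique articles saved")
```

Matching on the link rather than the title keeps a retitled article from being counted twice, though that is a design choice and not something Kimono does for you.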

A final note, of course, is that this is most easily justified for personal use, RSS feeds, or small student web projects. Do keep in mind the ethics of data extraction, especially if you’re planning commercial use, and check the website’s terms to make sure you’re not extracting and analyzing data that you aren’t permitted to use! In this case, I don’t find any notices telling me not to use info in this way, I’m using it for teaching purposes, and I’m pulling only publicly available data, nothing subscription-based, so I believe that I’m fine.

More questions on Kimono? Find their help pages here!
