In previous posts I have looked at several aspects of Earth and Space Science citations in Wikipedia. As part of a project I am working on, I’m interested in expanding this work to look at some mechanics of citations in Wikipedia to articles in Geophysical Research Letters (e.g., when do they appear, who writes the edit, on what Wikipedia pages, etc.). In this post, I want to walk through my method for getting the data that I will analyze. All the code is available (note that I am not a good R programmer).
Data on Wikipedia mentions are aggregated by Altmetric. rOpenSci built a tool to get altmetric data (rAltmetric) using the Altmetric API. rAltmetric works by retrieving data for each paper using the paper’s DOIs — so I need the DOIs for any/all papers before I do anything. Fortunately, rOpenSci has a tool for this too — rcrossref — which queries the Crossref database for relevant DOIs given some condition.
Since my project is focused on Geophysical Research Letters, I only need the DOIs for papers published in GRL. Using the ISSN for GRL, I downloaded 36,960 DOIs associated with GRL and then the associated Altmetric data (using rAltmetric).
The data from rAltmetric returns the number of times a given article is cited in Wikipedia. But I want some extra detail:
- The name of the Wikipedia article where the GRL citation appears
- When the edit was made
- and Who made the edit
This information is returned through the Altmetric commercial API — you can email Stacy Konkiel at Altmetric to get a commerical API key through Altmetric’s ‘researcher data access program’ (free access for those doing research). I got the data another way, via webscraping. To keep everything in R, I used rvest to scrape the Altmetric page (for each GRL article) to get Wikipedia information — the Wikipedia page that was edited, the author, and the edit date. Here is an example of an Altmetric.com page for a GRL article:
The Wikipedia page (‘Overwash’), the user (‘Ebgoldstein’ — hey that’s me!), and the edit date (’10 Feb 2017′) are all mentioned… this is the data that I scraped for.
Additionally I scraped the GRL article page to get the date that the GRL article first appeared online (not when it was formally typeset and published). Here is an exampLE of a GRL article landing page:
Notice that the article was first published on 15 Dec 2016. However, if you click the ‘Full Publication History’ link, you find out that the article first appeared online 24 Nov 2016 — so potential Wikipedia editors could add a citation prior to the formal ‘publication date’ of the GRL article.
So now that I have that data, what does it look like? Out of 36,960 GRL articles, 921 appear in Wikipedia, some are even cited multiple times. Below is a plot with the number of GRL articles (y-axis) that appear in Wikipedia, tallied by the number of times they are cited in Wikipedia — note the log y-axis.
GRL articles are spread over a range of Wikipedia pages, but some Wikipedia pages have many references to GRL articles (note the log scale of the y-axis):
553 Wikipedia Articles have a reference to only a single GRL article, while some articles contain many GRL references. Take for instance the ‘Cluster II (spacecraft)‘ page, with 25 GRL citations, or ‘El Niño‘ with 11 GRL references).
I’ll dive into data I collected over the next few weeks in a series of blog posts, but I want to leave you with some caveats about the code and the data so far. (Edited after the initial posting)
Altmetric.com only shows the data for up to 5 Wikipedia mentions for a given journal articles unless you have paid (instituitonal) access. Several GRL articles were cited in >5 Wikipedia articles, so I manually added the missing data. Hopefully i will make a programmatic work-around sometime. After I wrote this post, I was informed that the commerical Altmetric API gives all of the Wikipedia data (edit, editor, date). To get a commerical API key through Altmetric’s ‘researcher data access program’ (free access for those doing research), email Stacy Konkiel at Altmetric (thanks Stacy!).
Furthermore, many of the edit times that you see here could be re-edits, therefore ‘overprinting’ the date and editor for the first appearance of the wikipedia citation. This will be the subject of a future post, though I haven’t yet found an easy way to get the original edit…