Category Archives: data-driven

Data Collection: getting data for GRL articles

In previous posts I have looked at several aspects of Earth and Space Science citations in Wikipedia. As part of a project I am working on, I’m interested in expanding this work to look at some mechanics of citations in Wikipedia to articles in Geophysical Research Letters (e.g., when do they appear, who writes the edit, on what Wikipedia pages, etc.). In this post, I want to walk through my method for getting the data that I will analyze. All the code is available (note that I am not a good R programmer).

Data on Wikipedia mentions are aggregated by Altmetric. rOpenSci built a tool to get altmetric data (rAltmetric) using the Altmetric API. rAltmetric works by retrieving data for each paper using the paper’s DOIs — so I need the DOIs for any/all papers before I do anything. Fortunately, rOpenSci has a tool for this too — rcrossref — which queries the Crossref database for relevant DOIs given some condition.

Since my project is focused on Geophysical Research Letters, I only need the DOIs for papers published in GRL. Using the ISSN for GRL, I downloaded 36,960 DOIs associated with GRL and then the associated Altmetric data (using rAltmetric).
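My pipeline is in R (rcrossref + rAltmetric), but for the curious, the same two queries can be sketched directly against the underlying REST APIs. The two endpoints (Crossref's `journals/{issn}/works` route and Altmetric's free `v1/doi/` route) and the `cited_by_wikipedia_count` field are documented; the helper names and the hard-coded ISSN are my own, so double-check them before reusing this.

```python
import json
from urllib.error import HTTPError
from urllib.parse import quote, urlencode
from urllib.request import urlopen

GRL_ISSN = "0094-8276"  # GRL's print ISSN -- verify against the journal page

def crossref_url(issn, rows=1000, cursor="*"):
    """Build one page of a Crossref deep-paging query for a journal's DOIs."""
    qs = urlencode({"rows": rows, "cursor": cursor, "select": "DOI"})
    return f"https://api.crossref.org/journals/{issn}/works?{qs}"

def altmetric_url(doi):
    """Altmetric's free per-DOI endpoint (it 404s for untracked DOIs)."""
    return "https://api.altmetric.com/v1/doi/" + quote(doi)

def wikipedia_count(doi):
    """Number of Wikipedia citations Altmetric reports for one DOI."""
    try:
        with urlopen(altmetric_url(doi)) as resp:
            return json.load(resp).get("cited_by_wikipedia_count", 0)
    except HTTPError:
        return 0  # untracked DOI: treat as zero Wikipedia mentions
```

Looping `wikipedia_count` over the ~37k DOIs is slow and should be rate-limited; rAltmetric handles that plumbing for you in R.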

The data from rAltmetric returns the number of times a given article is cited in Wikipedia. But I want some extra detail:

  • The name of the Wikipedia article where the GRL citation appears
  • When the edit was made
  • Who made the edit

This information is returned through the Altmetric commercial API — you can email Stacy Konkiel at Altmetric to get a commercial API key through Altmetric’s ‘researcher data access program’ (free access for those doing research). I got the data another way, via webscraping. To keep everything in R, I used rvest to scrape the Altmetric page (for each GRL article) to get Wikipedia information — the Wikipedia page that was edited, the author, and the edit date. Here is an example of a page for a GRL article:


The Wikipedia page (‘Overwash’), the user (‘Ebgoldstein’ — hey that’s me!), and the edit date (‘10 Feb 2017’) are all mentioned… this is the data that I scraped for.
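I did the scrape with rvest in R, and I haven't reproduced the real selectors here. Just to show the shape of the extraction, here is a stdlib-Python sketch that runs on a made-up snippet — the class names (`title`, `author`, `date`) are hypothetical stand-ins, not Altmetric's actual markup:

```python
from html.parser import HTMLParser

# Made-up markup loosely imitating an Altmetric Wikipedia tab; the real
# page's classes and structure would need to be inspected first.
SNIPPET = """
<div class="post">
  <a class="title">Overwash</a>
  <span class="author">Ebgoldstein</span>
  <span class="date">10 Feb 2017</span>
</div>
"""

class WikiMentions(HTMLParser):
    """Collect (page, editor, date) records from class-tagged elements."""
    FIELDS = {"title", "author", "date"}

    def __init__(self):
        super().__init__()
        self.current = None   # which field the open tag belongs to, if any
        self.record = {}
        self.mentions = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self.current = cls

    def handle_data(self, data):
        if self.current:
            self.record[self.current] = data.strip()
            self.current = None
            if self.FIELDS <= self.record.keys():   # record complete
                self.mentions.append(self.record)
                self.record = {}

parser = WikiMentions()
parser.feed(SNIPPET)
```

In rvest the same idea is `read_html() %>% html_nodes(".some-class") %>% html_text()`, with the node selectors read off the live page.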

Additionally, I scraped the GRL article page to get the date that the GRL article first appeared online (not when it was formally typeset and published). Here is an example of a GRL article landing page:


Notice that the article was first published on 15 Dec 2016. However, if you click the ‘Full Publication History’ link, you find out that the article first appeared online 24 Nov 2016 — so potential Wikipedia editors could add a citation prior to the formal ‘publication date’ of the GRL article.

So now that I have that data, what does it look like? Out of 36,960 GRL articles, 921 appear in Wikipedia, some are even cited multiple times. Below is a plot with the number of GRL articles (y-axis) that appear in Wikipedia, tallied by the number of times they are cited in Wikipedia — note the log y-axis.
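The tally behind this plot is just a frequency-of-frequencies count. With some hypothetical per-article Wikipedia counts (the numbers below are made up for illustration):

```python
from collections import Counter

# Hypothetical: Wikipedia citation count for each of a handful of GRL papers
wiki_mentions = [1, 1, 1, 2, 1, 3, 2, 5]

# times-cited-in-Wikipedia -> number of GRL articles at that count
tally = Counter(wiki_mentions)
```

On the real data this is the same one-liner over the 921 scraped counts; the log y-axis in the plot just reflects how quickly the tally drops as the mention count grows.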


GRL articles are spread over a range of Wikipedia pages, but some Wikipedia pages have many references to GRL articles (note the log scale of the y-axis):


553 Wikipedia articles have a reference to only a single GRL article, while some articles contain many GRL references. Take for instance the ‘Cluster II (spacecraft)’ page, with 25 GRL citations, or ‘El Niño’, with 11 GRL references.

I’ll dive into the data I collected over the next few weeks in a series of blog posts, but I want to leave you with some caveats about the code and the data so far. (Edited after the initial posting.) Altmetric.com only shows the data for up to 5 Wikipedia mentions for a given journal article unless you have paid (institutional) access. Several GRL articles were cited in >5 Wikipedia articles, so I manually added the missing data. Hopefully I will make a programmatic work-around sometime. After I wrote this post, I was informed that the commercial Altmetric API gives all of the Wikipedia data (edit, editor, date). To get a commercial API key through Altmetric’s ‘researcher data access program’ (free access for those doing research), email Stacy Konkiel at Altmetric (thanks Stacy!).

Furthermore, many of the edit times that you see here could be re-edits, thereby ‘overprinting’ the date and editor of the first appearance of the Wikipedia citation. This will be the subject of a future post, though I haven’t yet found an easy way to get the original edit…
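One possible work-around — my speculation, not something this post actually does — is to walk a Wikipedia page's revision history (e.g., via the MediaWiki API's `prop=revisions`, oldest first) and keep the first revision whose wikitext contains the DOI. The pure search step, on a toy history with a made-up DOI, looks like:

```python
def first_mention(revisions, doi):
    """Given revisions as (timestamp, user, wikitext) tuples in
    chronological order, return (timestamp, user) of the first
    revision whose text cites the DOI, or None if it never appears."""
    for ts, user, text in revisions:
        if doi in text:
            return ts, user
    return None

# Toy history (fabricated DOI): the citation appears in the second
# revision and survives a later edit, which would otherwise 'overprint' it.
history = [
    ("2017-01-01", "A", "Overwash is ..."),
    ("2017-02-10", "Ebgoldstein", "Overwash is ... doi:10.1029/2016GL000000"),
    ("2017-03-01", "B", "Overwash is ... doi:10.1029/2016GL000000 and more"),
]
```

Fetching full wikitext for every revision of a big page is expensive, so in practice you would want to page through revisions in batches or binary-search on revision dates.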


Peering into the Nature Geoscience author dataset

A recent Nature Geoscience editorial looked at the reviewer suggestions of submitting authors.  The editorial examined many different issues, including:

  • The geographic breakdown of submitting authors.
  • The geographic breakdown of author-suggested reviewers.
  • The geographic and gender breakdown for submitting authors whose paper was sent for review.
  • The gender breakdown of suggested reviewers by submitting author gender.

Fortunately, the data behind the editorial was also provided as a supplement. So let’s take a peek and investigate some other aspects of the data. First, let’s check out the gender breakdown of submitting authors by geographic region.


For reference, ‘f’ is female, ‘m’ is male, and ‘u’ is unknown. The disproportion is clear across all regions (note that Australia and NZ seem to be the least disproportionate).

Next, let’s check out the geography of suggested reviewers by submitting author geography. Here is the number of authors who suggested reviewers, by geography:

Now from this set of authors, the proportion of suggested reviewers broken down by geography:


One major trend I see, aside from the lack of balance across all recommendations, is that North American authors recommend North American reviewers most of the time (~65%). No other geographic region recommends itself as much (even European + Middle East authors recommend European + Middle East reviewers only about as often as they recommend North Americans).

I can think of data that is missing from this dataset —  in particular, the breakdown of assigned reviewers by geography. However the editorial alludes to some answers:

“Nevertheless, the geographical distribution of editor-assigned reviewers resembles the biases of author-suggested reviewers”

The R code for my analysis is here — this post was a good excuse to continue learning R (keep in mind that I am learning R as you look at the messy, verbose code).


AGU publications on Sci-Hub

Sci-Hub, the web service with over 60 million academic papers, released the DOIs of its article holdings earlier this year:

The data was quickly put into a figshare repository (Hahnel, 2017), and some analysis has already been done by Greshake (2017) on this list of DOIs.

Here I want do some simple analysis with this dataset, and look at how many papers published by AGU are part of this collection. Keep in mind a few things:

  1. I believe in acquiring papers through legal means, and do not advocate searching for/using illegally distributed copies (through Sci-Hub or ResearchGate). Great new tools for finding free versions of manuscripts are Unpaywall and oadoi.
  2. Papers in AGU journals from 1997 up to 24 months ago are freely available, so many of the papers on Sci-Hub are already free.

OK, back to the Sci-Hub dataset: article Digital Object Identifiers (DOIs) are broken into 2 parts: a prefix and a suffix. As I understand it, prior to being published by Wiley, AGU articles used the prefix ‘10.1029’, so I first extracted all entries with this older prefix. This really restricts the analysis to pre-2013 AGU articles (I’m not sure exactly when the change occurred), but some of the older articles might have been published prior to the 1997 ‘open access’ cutoff.

I was left with a list of 171,752 articles (from the original 62 million).

The suffix of AGU DOIs corresponds to a single article from a specific journal, and letter codes in the suffix are used to denote the journal — for example, a GRL article has a suffix that includes ‘GL’ or ‘gl’.

For example, a GRL article has a DOI that looks like this: 10.1029/2006GL028162

Each AGU journal has a unique letter combo in the suffix, so the list of DOIs can be counted based on this suffix. I used a single line of Julia code to search through the list of ‘10.1029’ DOIs and find the GRL articles.
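My actual one-liner is in Julia; here is an equivalent sketch in Python. Only the ‘GL’ code is confirmed by the examples above — the rest of the letter-code table is my reconstruction and should be verified against real DOIs before trusting the counts:

```python
import re
from collections import Counter

# Partial table of AGU journal letter codes seen in 10.1029/ DOI suffixes.
# 'GL' (GRL) is confirmed in the post; the others are my best guesses.
CODES = {"GL": "GRL", "WR": "WRR", "EO": "EOS",
         "JA": "JGR-Space Physics", "JB": "JGR-Solid Earth",
         "JC": "JGR-Oceans", "JD": "JGR-Atmospheres", "JE": "JGR-Planets",
         "JF": "JGR-Earth Surface", "JG": "JGR-Biogeosciences"}

def journal_code(doi):
    """First two-letter run in the DOI suffix, e.g.
    10.1029/2006GL028162 -> 'GL', 10.1029/GL017i013p02289 -> 'GL'."""
    suffix = doi.split("/", 1)[1]
    m = re.search(r"[A-Za-z]{2}", suffix)
    return m.group().upper() if m else None

def tally_journals(dois):
    """Count DOIs per journal, keeping only the codes listed in CODES."""
    codes = (journal_code(d) for d in dois)
    return Counter(c for c in codes if c in CODES)
```

Matching the first two-letter run handles both the newer ‘2006GL028162’ style and the older ‘GL017i013p02289’ style of suffix; DOIs whose suffix matches no known code fall out of the tally, which is roughly the ~5% of non-conforming entries mentioned below.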


Parsing the 171,752 articles into specific journals yields:


  • For these AGU journals, GRL has the most articles on Sci-Hub; note also the huge volume of EOS articles (?!?), and Water Resources Research.
  • The older and higher-volume JGR sections (Oceans, Atmospheres, Solid Earth and Space Physics) outweigh the newer, smaller sections (Biogeosciences, Earth Surface, Planets). Here are some JGR publication stats.
  • ~5% of the 171,752 AGU DOIs did not conform to this search — they might be books, chapters, or other documents.

Here is another interesting article on Sci-Hub, from Science (Bohannon, 2016): Who’s Downloading Pirated Papers? Everyone

‘Sleeping Beauties’ of Geomorphology: cases from the Journal of Geology

To recap from a previous post:

“Most papers in disciplinary geomorphology journals are cited at some point, but citations to papers do not always accrue immediately upon publication — ideas and papers might take time to be used by researchers and therefore cited. Extreme examples of delayed recognition (‘Sleeping Beauties‘) — where papers receive no citations for long stretches of time only to receive a large, late burst in citations — have been identified and investigated previously.

Do geomorphology ‘Sleeping Beauties’ exist? Using the methods of Ke et al. (2015) to find and score ‘Sleeping Beauties’, it turns out that 9 out of the 20 most delayed papers in GSA Bulletin are focused on quantitative geomorphology. What other papers show this interesting signature of delayed recognition?”

Today I want to look for Sleeping Beauties in ‘The Journal of Geology‘. 

JG has been published since 1893, and has been the venue for some classic geomorphology papers (e.g., Wolman and Miller, 1960, ‘Magnitude and Frequency of Forces in Geomorphic Processes’, which I will discuss in a future post).

In January 2017 I downloaded the citation time series for the 500 most cited Journal of Geology articles and used the algorithm of Ke et al. (2015) to find papers with the highest ‘delayed recognition’ score — a ranking of each paper’s citation time series based on the largest, latest peak (read Ke et al. (2015) to learn more about the method).
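Ke et al.’s ‘beauty coefficient’ compares each year’s citation count against a straight line drawn from the count at publication to the count at the peak year. As I read their definition (check the paper before reusing this), a minimal sketch is:

```python
def beauty_coefficient(c):
    """Beauty coefficient B from Ke et al. (2015), as I read their
    definition: c[t] = citations received t years after publication.
    B sums, from year 0 to the peak year, the gap between the straight
    line from (0, c[0]) to the peak and the actual count, with each
    term normalized by max(1, c[t])."""
    tm = max(range(len(c)), key=c.__getitem__)   # year of the citation peak
    if tm == 0:
        return 0.0        # peaked immediately: no sleep, B = 0
    slope = (c[tm] - c[0]) / tm
    return sum((slope * t + c[0] - c[t]) / max(1, c[t])
               for t in range(tm + 1))
```

A paper whose citations climb linearly to the peak scores 0, while a long-dormant series with a late spike scores high — which is exactly the ‘largest, latest peak’ ranking described above.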

The top four papers, published from 1922 to 1935, are all focused on grain size and shape:

  1. Wentworth, C. K. (1922). A scale of grade and class terms for clastic sediments. The Journal of Geology, 30(5), 377-392. (pdf here)
  2. Wadell, H. (1935). Volume, shape, and roundness of quartz particles. The Journal of Geology, 43(3), 250-280. (article here)
  3. Wadell, H. (1932). Volume, shape, and roundness of rock particles. The Journal of Geology, 40(5), 443-451. (article here)
  4. Wadell, H. (1933). Sphericity and roundness of rock particles. The Journal of Geology, 41(3), 310-331. (article here)

The citation time series for each paper is shown below:

As with the last post, I will not offer any ‘reasons’ why these papers have an explosion in citations in the past 10 years. To do this, a first step would be a careful look at co-citation networks — what papers often co-occur with the citations — and the actual in-text usages and citations.

I took a cursory look at co-cited papers, and all of the papers show an affinity to two recent well-cited papers:

  • Blott, S. J., & Pye, K. (2001). GRADISTAT: a grain size distribution and statistics package for the analysis of unconsolidated sediments. Earth Surface Processes and Landforms, 26(11), 1237-1248.
  • Blott, S. J., & Pye, K. (2008). Particle shape: a review and new methods of characterization and classification. Sedimentology, 55(1), 31-63.

Last I looked Blott and Pye (2001) was the most cited paper in ESPL, and is cited in a policy document, a rare occurrence for a geomorphology paper.

‘Sleeping Beauties’ of Geomorphology: a case from the American Journal of Science

Most papers in disciplinary geomorphology journals are cited at some point, but citations to papers do not always accrue immediately upon publication — ideas and papers might take time to be used by researchers and therefore cited. Extreme examples of delayed recognition (‘Sleeping Beauties‘) — where papers receive no citations for long stretches of time only to receive a large, late burst in citations — have been identified and investigated previously.

Do geomorphology ‘Sleeping Beauties’ exist? Using the methods of Ke et al. (2015) to find and score ‘Sleeping Beauties’, it turns out that 9 out of the 20 most delayed papers in GSA Bulletin are focused on quantitative geomorphology.

What other papers show this interesting signature of delayed recognition?

I have looked in other journals and found a few neat examples, which I hope to chronicle in a series of posts. Today, I will look at an example from the American Journal of Science (AJS):

The AJS has been published since 1818, and has long been a geology venue. In January 2017 I downloaded the 500 most cited AJS articles from the Web of Science. I used the algorithm presented in Ke et al. (2015) to find the papers with the highest ‘delayed recognition’ score — a ranking of each paper’s citation time series based on the largest, latest peak (I urge you all to read Ke et al. (2015), which describes the method).

The most delayed paper is about brachiopods, but I want to focus on research related to geomorphology, so let’s look at the 2nd most delayed paper:

Rubey, W. W. (1933). Settling velocities of gravel, sand, and silt particles. American Journal of Science, Series 5, 25(148), 325–338. doi:10.2475/ajs.s5-25.148.325

(n.b., settling velocity has a special place in my heart)

Rubey’s paper has a score that is similar to the papers from GSA Bulletin. Here is the citation time series for the Rubey paper:

So the natural question is — what happened that caused this 2014 burst of citations? As far as I can tell (from looking at the papers that cited Rubey), nothing in particular… Most papers that cite Rubey are focused on typical sediment transport questions. A close read of all the citing papers would be needed to figure out what is going on here, if there is some ‘signal’. Not a satisfying answer, and I apologize — leave a comment if you have an idea and I’ll update the post if I find anything out.

References to AGU Journals in Wikipedia: JGR-B, JGR-P and JGR-ES

Wikipedia page views are immense. Editing Wikipedia to include more references to journals is one way to get more science into the public eye. Additionally, Wikipedia is a portal to peer-reviewed science.  But how many Earth and Space science papers are actually cited in Wikipedia?

For this post, I’m focusing on articles published by AGU. From an earlier investigation, I found 1599 citations to AGU publications in Wikipedia. But how are these 1599 citations spread across the journals? Let’s look at works published in JGR-Planets, JGR-Biogeoscience and JGR-Earth Surface because they have a similar number of publications per year — with 123, 196 and 126 articles published in 2016 (see the AGU publication stats). (Compare these numbers to the other 4 sections of JGR: ~400 articles in 2016 for Solid Earth and Oceans, and ~800 articles in 2016 for Space Physics and Atmospheres).

A quick note on the data: I first downloaded all of the article records for a given journal from the Web of Science. Using the article DOI numbers, I used the rAltmetric package created by rOpenSci to find Wikipedia mentions listed in the Altmetric database. Note that this was done in Dec. 2016 and Wikipedia changes constantly, so treat this data as a snapshot.

The top panel is the percent of articles (published in a given year) that are referenced in Wikipedia. The bottom panel is the number of articles (published in a given year) referenced in Wikipedia. Also plotted is the data for GRL.


JGR-Planets steals the show here.

For the number of articles cited, GRL does well too.

I’ll post results for the other 4 JGR sections in a future post. In the meantime:

  • Here is an open dataset of scholarly citations in Wikipedia, from Wikipedia Research.
  • Here is an early analysis of the issue of scholarly citations in Wikipedia.
  • This type of analysis has also been done for the PLoS Journals.
  • I wrote an article that compared month page views of relevant Wikipedia pages, my website, and one of my articles (the only one with publicly available article level metrics) — Wikipedia page views are orders of magnitude higher.

Twitter mentions of GRL papers

Last week I had a guest post on the AGU Blog ‘The Plainspoken Scientist’ regarding the percent of Geophysical Research Letters (GRL) papers that are mentioned somewhere on the web. Today I want to dig further into the data regarding Twitter mentions — specifically, how many Twitter mentions does a typical article in GRL receive?

To recap, almost every recent GRL paper has at least one Twitter mention. Here is the percentage of GRL papers published in a given year with at least one mention:


In addition to more GRL articles being mentioned on Twitter, the total of all Twitter mentions to GRL articles published in a given year is increasing:


Parsing this data further, here are (yearly) histograms for the percent of articles from GRL with a given number of Twitter mentions:

Three observations:

  • ~40% of recent papers receive only a single Twitter mention (perhaps from bots?).
  • The tail is long — a handful of papers are well off the chart, with several hundred Twitter mentions.
  • The tail seems to grow fatter each year.

Crossing this data with the number of GRL papers published per year, here is the median number of Twitter mentions per paper in a given year:


(Keep in mind that this data comes from all time periods — for example, tweets referencing a paper from 2013 can come from any year.)