Category Archives: data-driven

AGU publications on Sci-Hub

Sci-Hub, the web service with over 60 million academic papers, released the DOIs of its article holdings earlier this year.

The data was quickly put into a figshare repository (Hahnel, 2017), and some analysis has already been done by Greshake (2017) on this list of DOIs.

Here I want to do some simple analysis with this dataset and look at how many papers published by AGU are part of this collection. Keep in mind a few things:

  1. I believe in acquiring papers through legal means, and do not advocate searching for or using illegally distributed copies (through Sci-Hub or ResearchGate). Great new tools for finding free versions of manuscripts are Unpaywall and oaDOI.
  2. Papers published in AGU journals between 1997 and 24 months ago are freely available, so many of the papers on Sci-Hub are already free.

OK, back to the Sci-Hub dataset. Article Digital Object Identifiers (DOIs) are broken into two parts: a prefix and a suffix. As I understand it, prior to being published by Wiley, AGU articles used the prefix 10.1029, so I first extracted all entries with this older prefix. This really restricts the analysis to pre-2013 AGU articles (I’m not sure exactly when the change occurred), and some of the older articles might have been published prior to the 1997 ‘open access’ cutoff.

I was left with a list of 171,752 articles (from the original 62 million).

The suffix of AGU DOIs corresponds to a single article from a specific journal, and letter codes in the suffix are used to denote the journal — for example, a GRL article has a suffix that includes ‘GL’ or ‘gl’.

A GRL DOI looks like this: 10.1029/2006GL028162

Each AGU journal has a unique letter combination in the suffix, so the list of DOIs can be counted based on this suffix. Here is the line of Julia code that I used to search through the list of ‘10.1029’ DOIs to find GRL articles.

GRL=length(matchall(r"gl"i,s))
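For readers who want to try the same kind of count outside Julia, here is a minimal Python sketch of the two steps described above: keep only DOIs with the 10.1029 prefix, then bin them by the letter code in the suffix. This is not the code I ran. It takes the first two-letter run in the suffix rather than matching ‘gl’ anywhere in the text, and the names `journal_code`, `count_by_journal` and the `JOURNAL_CODES` table are made up for illustration. Only the ‘GL’ code comes from this post; the ‘WR’ entry is an assumption.

```python
import re
from collections import Counter

AGU_PREFIX = "10.1029/"

# Only "GL" -> GRL is stated in the post; "WR" is an illustrative assumption.
JOURNAL_CODES = {"GL": "GRL", "WR": "Water Resources Research"}


def journal_code(doi):
    """Return the first two-letter run in an AGU DOI suffix (upper-cased), or None."""
    doi = doi.strip()
    if not doi.startswith(AGU_PREFIX):
        return None
    match = re.search(r"[A-Za-z]{2}", doi[len(AGU_PREFIX):])
    return match.group(0).upper() if match else None


def count_by_journal(dois):
    """Count 10.1029 DOIs per journal; unknown or missing codes go to 'unmatched'."""
    counts = Counter()
    for doi in dois:
        if not doi.strip().startswith(AGU_PREFIX):
            continue  # not an old-style AGU DOI, so skip it
        counts[JOURNAL_CODES.get(journal_code(doi), "unmatched")] += 1
    return counts


sample = [
    "10.1029/2006GL028162",  # the GRL example from the post
    "10.1029/2005wr004097",  # made-up lower-case suffix
    "10.1002/esp.261",       # different prefix, ignored
    "10.1029/123456",        # made-up suffix with no letter code
]
print(count_by_journal(sample))
```

The ‘unmatched’ bin plays the role of DOIs that do not conform to the journal search.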

Parsing the 171,752 articles into specific journals yields:

Papers.jpg

  • Among these AGU journals, GRL has the most articles on Sci-Hub. Note also the huge volume of EOS articles (?!?) and Water Resources Research articles.
  • The older and higher-volume JGR sections (Oceans, Atmospheres, Solid Earth and Space Physics) outweigh the newer, smaller sections (Biogeosciences, Earth Surface, Planets). Here are some JGR publication stats.
  • ~5% of the 171,752 AGU DOIs did not conform to this search — they might be books, chapters, or other documents.

Here is another interesting article on Sci-Hub, from Science (Bohannon, 2016): Who’s Downloading Pirated Papers? Everyone

‘Sleeping Beauties’ of Geomorphology: cases from the Journal of Geology

To recap from a previous post:

“Most papers in disciplinary geomorphology journals are cited at some point, but citations to papers do not always accrue immediately upon publication — ideas and papers might take time to be used by researchers and therefore cited. Extreme examples of delayed recognition (‘Sleeping Beauties‘) — where papers receive no citations for long stretches of time only to receive a large, late burst in citations — have been identified and investigated previously.

Do geomorphology ‘Sleeping Beauties’ exist? Using the methods of Ke et al. (2015) to find and score ‘Sleeping Beauties’, it turns out that 9 out of the 20 most delayed papers in GSA Bulletin are focused on quantitative geomorphology. What other papers show this interesting signature of delayed recognition?”

Today I want to look for Sleeping Beauties in The Journal of Geology (JG).

JG has been published since 1893, and has been the venue for some classic geomorphology papers (e.g., Wolman and Miller, 1960, Magnitude and Frequency of Forces in Geomorphic Processes, which I will discuss in a future post).

In January 2017 I downloaded the citation time series for the 500 most cited Journal of Geology articles and used the algorithm of Ke et al. (2015) to find papers with the highest ‘delayed recognition’ score, a ranking of each paper’s citation time series based on the largest, latest peak (read Ke et al. (2015) to learn more about the method).
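To give a feel for the score, here is a minimal Python sketch of the ‘beauty coefficient’ B as I read it in Ke et al. (2015). Draw a straight line from the citations in the publication year to the citations in the peak year, then sum the gap between that line and the actual yearly citations, each term divided by max(1, c_t). The function name and the toy series are made up, and taking the first year of the maximum when there are ties is my assumption. Check the paper before relying on this.

```python
def beauty_coefficient(citations):
    """Beauty coefficient B for a yearly citation series (index 0 = publication year).

    Sketch of the score in Ke et al. (2015); ties in the maximum are resolved
    by taking the earliest peak year, which is an assumption of this sketch.
    """
    if not citations:
        return 0.0
    c_max = max(citations)
    t_m = citations.index(c_max)  # year of the citation peak
    if t_m == 0:
        return 0.0  # peak in the publication year: no delay at all
    c_0 = citations[0]
    slope = (c_max - c_0) / t_m
    b = 0.0
    for t in range(t_m + 1):
        line = slope * t + c_0  # reference line from (0, c_0) to (t_m, c_max)
        b += (line - citations[t]) / max(1, citations[t])
    return b


# Toy series: a paper ignored for years, then a late burst.
print(beauty_coefficient([0, 0, 0, 0, 10]))  # 15.0
# Toy series: steady linear growth sits on the reference line.
print(beauty_coefficient([0, 5, 10]))        # 0.0
```

A long flat stretch followed by a tall, late peak gives a large B. A paper whose citations grow steadily, or peak right away, scores near zero.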

The top four papers, published from 1922 to 1935, are all focused on grain size and shape:

  1. Wentworth, C. K. (1922). A scale of grade and class terms for clastic sediments. The Journal of Geology, 30(5), 377-392. (pdf here)
  2. Wadell, H. (1935). Volume, shape, and roundness of quartz particles. The Journal of Geology, 43(3), 250-280. (article here)
  3. Wadell, H. (1932). Volume, shape, and roundness of rock particles. The Journal of Geology, 40(5), 443-451. (article here)
  4. Wadell, H. (1933). Sphericity and roundness of rock particles. The Journal of Geology, 41(3), 310-331. (article here)

The citation time series for each paper is shown below:

JG.jpg

As with the last post, I will not offer any ‘reasons’ why these papers have an explosion in citations in the past 10 years. To do this, a first step would be a careful look at co-citation networks — what papers often co-occur with the citations — and the actual in-text usages and citations.

I did a cursory look at co-cited papers, and all of the papers show an affinity to two recent well-cited papers:

  • Blott, S. J., & Pye, K. (2001). GRADISTAT: a grain size distribution and statistics package for the analysis of unconsolidated sediments. Earth Surface Processes and Landforms, 26(11), 1237-1248. http://doi.org/10.1002/esp.261
  • Blott, S. J., & Pye, K. (2008). Particle shape: a review and new methods of characterization and classification. Sedimentology, 55(1), 31-63. http://doi.org/10.1111/j.1365-3091.2007.00892.x

Last I looked, Blott and Pye (2001) was the most cited paper in ESPL, and it is cited in a policy document, a rare occurrence for a geomorphology paper.

‘Sleeping Beauties’ of Geomorphology: a case from the American Journal of Science

Most papers in disciplinary geomorphology journals are cited at some point, but citations to papers do not always accrue immediately upon publication: ideas and papers might take time to be used by researchers and therefore cited. Extreme examples of delayed recognition (‘Sleeping Beauties’), where papers receive no citations for long stretches of time only to receive a large, late burst in citations, have been identified and investigated previously.

Do geomorphology ‘Sleeping Beauties’ exist? Using the methods of Ke et al. (2015) to find and score ‘Sleeping Beauties’, it turns out that 9 out of the 20 most delayed papers in GSA Bulletin are focused on quantitative geomorphology.

What other papers show this interesting signature of delayed recognition?

I have looked in other journals and found a few neat examples, which I hope to chronicle in a series of posts. Today, I will look at an example from the American Journal of Science (AJS):

The AJS has been published since 1818, and has long been a geology venue. In January 2017 I downloaded the 500 most cited AJS articles from the Web of Science. I used the algorithm presented in Ke et al. (2015) to find the papers with the highest ‘delayed recognition’ score, a ranking of each paper’s citation time series based on the largest, latest peak (I urge you all to read Ke et al. (2015), which describes the method).

The most delayed paper is about brachiopods, but I want to focus on research related to geomorphology, so let’s look at the 2nd most delayed paper:

W. W. Rubey (1933). Settling velocities of gravel, sand, and silt particles. Am J Sci, April 1, 1933, Series 5, Vol. 25: 325-338; doi:10.2475/ajs.s5-25.148.325

(n.b., settling velocity has a special place in my heart)

Rubey’s paper has a score that is similar to the papers from GSA Bulletin. Here is the citation time series for the Rubey paper:

Rubey CTS.jpg

So the natural question is: what caused this 2014 burst of citations? As far as I can tell (from looking at the papers that cited Rubey), nothing in particular. Most papers that cite Rubey are focused on typical sediment transport questions. A close read of all the citing papers would be needed to figure out what is going on here, and whether there is some ‘signal’. Not a satisfying answer, and I apologize. Leave a comment if you have an idea, and I’ll update the post if I find anything out.

References to AGU Journals in Wikipedia: JGR-B, JGR-P and JGR-ES

Wikipedia page views are immense. Editing Wikipedia to include more references to journals is one way to get more science into the public eye. Additionally, Wikipedia is a portal to peer-reviewed science.  But how many Earth and Space science papers are actually cited in Wikipedia?

For this post, I’m focusing on articles published by AGU. From an earlier investigation, I found 1599 citations to AGU publications in Wikipedia. But how are these 1599 citations spread across the journals? Let’s look at works published in JGR-Planets, JGR-Biogeosciences and JGR-Earth Surface because they have a similar number of publications per year: 123, 196 and 126 articles, respectively, in 2016 (see the AGU publication stats). Compare these numbers to the other 4 sections of JGR: ~400 articles in 2016 for Solid Earth and Oceans, and ~800 articles in 2016 for Space Physics and Atmospheres.

A quick note on the data: I first downloaded all of the article records for a given journal from the Web of Science. With the article DOIs, I used the rAltmetric package created by rOpenSci to find Wikipedia mentions listed in the Altmetric database. Note that this was done in Dec. 2016 and Wikipedia changes constantly, so treat this data as a snapshot.

The top panel is the percent of articles (published in a given year) that are referenced in Wikipedia. The bottom panel is the number of articles (published in a given year) referenced in Wikipedia. Also plotted is the data for GRL.

JGR-wiki.jpeg
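The two panels boil down to a simple per-year tally. Here is a minimal Python sketch of that step, assuming the Altmetric lookups have already been reduced to (publication year, number of Wikipedia mentions) pairs. The function name, the record format and the toy numbers are all hypothetical; nothing here calls Altmetric or Web of Science.

```python
from collections import defaultdict


def wikipedia_coverage(records):
    """Return {year: (percent of articles cited in Wikipedia, number cited)}.

    `records` is an iterable of (publication_year, wikipedia_mentions) pairs,
    a hypothetical format chosen for this sketch.
    """
    totals = defaultdict(int)
    cited = defaultdict(int)
    for year, mentions in records:
        totals[year] += 1
        if mentions > 0:
            cited[year] += 1
    return {
        year: (100.0 * cited[year] / totals[year], cited[year])
        for year in sorted(totals)
    }


# Toy records, not real data.
toy = [(2015, 0), (2015, 2), (2016, 1), (2016, 0), (2016, 0), (2016, 3)]
for year, (percent, count) in wikipedia_coverage(toy).items():
    print(year, percent, count)
```

The first value of each pair corresponds to the top panel (percent) and the second to the bottom panel (count).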

JGR-Planets steals the show here.

For the number of articles cited, GRL does well too.

I’ll post results for the other 4 JGR sections in a future post. In the meantime:

  • Here is an open dataset of scholarly citations in Wikipedia, from Wikipedia Research.
  • Here is an early analysis of the issue of scholarly citations in Wikipedia.
  • This type of analysis has also been done for the PLoS Journals.
  • I wrote an article that compared monthly page views of relevant Wikipedia pages, my website, and one of my articles (the only one with publicly available article-level metrics). Wikipedia page views are orders of magnitude higher.

Twitter mentions of GRL papers

Last week I had a guest post on the AGU Blog ‘The Plainspoken Scientist’ regarding the percent of Geophysical Research Letters (GRL) papers that are mentioned somewhere on the web. Today I want to dig further into the data regarding Twitter mentions. Specifically, how many Twitter mentions does a typical article in GRL receive?

To recap, almost every recent GRL paper has at least one Twitter mention. Here is the percentage of GRL papers published in a given year with at least one mention:

F2.jpg

In addition to more GRL articles being mentioned on Twitter, the total of all Twitter mentions to GRL articles published in a given year is increasing:

mentions per year.jpg

Parsing this data further, here are (yearly) histograms for the percent of articles from GRL with a given number of Twitter mentions:

Histograms.jpg

Three observations:

  • ~40% of recent papers receive only a single Twitter mention (perhaps from bots?).
  • The tail is long: a handful of papers are well off the chart, with several hundred Twitter mentions.
  • The tail seems to grow fatter each year.

Crossing this data with the number of GRL papers published per year, here is the median number of Twitter mentions per paper in a given year:

median mentions.jpg

(Keep in mind that this data comes from all time periods — for example, tweets referencing a paper from 2013 can come from any year.)
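For anyone who wants to redo the histograms and medians on their own data, here is a minimal Python sketch. It assumes you already have, for each publication year, a list with one Twitter mention count per paper. Padding papers without mentions as zeros is my assumption about how to cross the counts with the number of papers published, and the function names and toy numbers are made up.

```python
from collections import Counter
from statistics import median


def mention_histogram(mentions):
    """Percent of papers at each Twitter mention count."""
    counts = Counter(mentions)
    n = len(mentions)
    return {k: 100.0 * v / n for k, v in sorted(counts.items())}


def yearly_median(mentions_by_year):
    """Median mentions per paper for each publication year."""
    return {year: median(m) for year, m in sorted(mentions_by_year.items())}


# Toy data, not real GRL numbers; zeros stand for papers with no mentions.
toy = {
    2013: [0, 1, 1, 4],
    2015: [1, 2, 3, 250, 1],
}
print(mention_histogram(toy[2013]))
print(yearly_median(toy))
```

The median is a sensible summary here because the long tail (the 250 in the toy data) would drag a mean far above what a typical paper receives.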

Citations to GRL articles conform to Benford’s law

I have written about Benford’s law in a previous post. The ‘law’ states that the first digits (1, 2, ..., 9) in a set of data occur with non-uniform frequency: lower digits (1, 2) occur more frequently than higher digits (8, 9). From work on this blog, I have a data set composed of bibliographic and citation data on ~30,000 articles from Geophysical Research Letters downloaded from the Web of Science. I have used this data for authorship trends, number of papers published yearly, and to look at the number of references cited in a typical article. Each article is associated with the number of times it has been cited. This data (times cited for each article) conforms (almost exactly) to Benford’s law:

GRLBenford.jpg
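The comparison behind a plot like this is short to write down. Benford’s law predicts that a first digit d appears with probability log10(1 + 1/d), so about 30% of values should start with 1 and under 5% with 9. Here is a minimal Python sketch that computes the expected and observed first-digit frequencies for a list of citation counts. The function names and toy counts are made up, and skipping uncited articles (which have no leading nonzero digit) is an assumption of this sketch.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_frequencies(counts):
    """Observed first-digit frequencies for positive integer counts.

    Zero counts are skipped (an assumption of this sketch), since they
    have no leading nonzero digit.
    """
    digits = [int(str(c)[0]) for c in counts if c > 0]
    if not digits:
        return {d: 0.0 for d in range(1, 10)}
    tally = Counter(digits)
    return {d: tally.get(d, 0) / len(digits) for d in range(1, 10)}


# Toy citation counts, not the GRL data.
toy_counts = [1, 12, 130, 2, 0, 9]
expected = benford_expected()
observed = first_digit_frequencies(toy_counts)
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```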

 

Three Types of Coastal Dune Papers

I loaded the titles and abstracts from the 4,342 Coastal Dune papers discussed in my previous post into the network visualization tool VOSviewer. The co-occurrence of keywords in abstracts and titles leads to three prominent groups:

Coastal Dune network map.jpg

Group 1 (purple) is the biology, botany, and ecology literature, with words such as ‘habitat’, ‘species’, ‘biodiversity’, ‘nutrient’, and ‘biomass’.

Group 2 (blue) is the sedimentology, geology, and (long timescale) geomorphology literature, with words such as ‘sea level’, ‘age’, ‘progradation’, ‘holocene’ and ‘luminescence’.

Group 3 (yellow) is the coastal engineering and (short timescale) geomorphology literature, with words such as ‘wave’, ‘storm’, ‘surge’, ‘dune erosion’, ‘vulnerability’, and ‘coastal management’.

The coastal engineering, sedimentology, and geomorphology groups (yellow and blue) are near one another (more words co-occur), but the purple biological literature is further apart (fewer co-occurring words).
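To make ‘co-occurrence’ concrete, here is a minimal Python sketch that counts how often pairs of terms show up in the same abstract. This is not what VOSviewer does internally (it has its own term extraction, normalization and clustering); it only illustrates the raw counting that such a map is built on. The function name, term list and toy abstracts are made up, and the plain substring matching is a deliberate simplification.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, terms):
    """Count, for each pair of terms, the documents in which both appear.

    Matching is by lower-cased substring, a simplification: a short term
    like 'age' would also match inside 'management'.
    """
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        present = sorted(t for t in terms if t in lowered)
        counts.update(combinations(present, 2))  # each pair once per document
    return counts


# Toy abstracts, not from the 4,342 papers.
docs = [
    "Storm surge and wave runup drive dune erosion.",
    "Species richness and habitat in dune slacks.",
    "Wave climate and storm impacts on foredunes.",
]
terms = ["wave", "storm", "surge", "habitat", "species"]
for pair, n in cooccurrence_counts(docs, terms).most_common():
    print(pair, n)
```

Terms that share many documents end up close together on a map, while terms that never share a document (‘habitat’ and ‘wave’ in the toy data) end up far apart.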