Category Archives: data-driven

My Time at CSDMS 2019


(This post originally appeared on the Coast and Ocean Collective Blog)

In May I went to my first annual meeting of CSDMS, the Community Surface Dynamics Modeling System. It was great to see old friends and meet new ones.

CSDMS is involved in a range of different projects and provides a suite of different services to the earth surface processes modeling community. You might know about CSDMS from its model repository (with metadata and links to source code) and the handy tools developed by CSDMS to link models together. For more background on CSDMS, check out their webpage.

One nice aspect of CSDMS is that the keynotes and panels are recorded and put on YouTube, and many poster presenters upload PDFs of their poster. I have spent a few hours skimming through these videos and PDFs from past meetings — lots of interesting ideas.

The annual meeting theme this year was ‘Bridging Boundaries’, and there was a range of interesting talks, posters, clinics, breakout sessions, and panels. I want to mention just a few highlights from those 3 packed days.

  • I really enjoyed the wide range of keynotes. Two particularly interesting ones were:
  • I really enjoyed the 2 panel discussions:
  • A real highlight for me was Dan Buscombe’s deep learning clinic. Dan walked us through a comprehensive Jupyter notebook based on his work on pixel-scale image classification. It was great to hear Dan explain his workflow, and it was great to meet him in person. I urge you to check out his work!
  • There were too many amazing posters to cover in one post. I recommend scrolling through the abstracts and poster PDFs online.
  • I live-tweeted the 3rd day through the CSDMS and AGU EPSP twitter accounts. This was really fun and I’m grateful for the opportunity from the AGU EPSP social media team.
  • I am very grateful to CSDMS for inviting me to give a keynote this year — it was exciting to share my ideas with such a talented group of people. My talk — video, slides — focused on ML work that I have done with the Coast and Ocean Collective (and others), specifically work on swash, runup, ‘hybrid’ models, and the ML review paper that was just published.
  • Lastly, I ate a lot of (good) pizza.



Twitter record of #AGU17

I missed the 2017 Fall AGU meeting, but I did follow along on Twitter. However, the coverage was spotty: some sessions were mentioned, some not at all. From this experience I kept wondering about the digital traces of the meeting on Twitter. Lo and behold, I saw this tweet from Dr. Christina K. Pikas (@cpikas) at the beginning of this year:

So let’s look at this awesome dataset that Dr. Pikas collected and published on figshare. First, this data was collected using TAGS, and contains tweets from Nov. 4th, 2017 to Jan. 4th, 2018 that used the hashtag #AGU17. There are a total of 31,909 tweets in this dataset. In this post I am subsetting the data to look only at the meeting (with a 1 day buffer, so Sunday Dec. 10, 2017 to Saturday Dec. 16, 2017), a total of 25,531 tweets during the 7 days:


I noticed:

  • Twitter activity decays through the week (fatigue? do most people just tweet their arrival? Daily attendance variations?)
  • There is a noticeable lunch break on M, W, Th, and F
  • Each day twitter activity starts suddenly, but has a slow(er) decay at the end of the day (late night activities?)

Retweets account for 44% of the 25,531 tweets during the meeting. Removing RTs yields an almost identical plot, but there is a small peak that appears at the end of each day (pre-bedtime tweets?):


Lastly, the biggest #AGU17 twitter user is @theAGU (by far), which sent 1063 tweets during the week. Here is the timeseries with only @theAGU tweets:


I see the lunch break and not as many late nights for the organization.
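For anyone who wants to try the same kind of subsetting, here is a minimal Python sketch of the steps used in this post: keep only tweets from the meeting week, flag retweets, and bin by hour. It is not the code linked below. The record layout, the example tweets, and the "RT @" test for retweets are all assumptions made for illustration, and the real TAGS export may use different column names.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records standing in for rows of the TAGS archive:
# (user, tweet text, timestamp). Real column names may differ.
tweets = [
    ("theAGU", "Welcome to #AGU17!", datetime(2017, 12, 11, 8, 5)),
    ("someone", "RT @theAGU: Welcome to #AGU17!", datetime(2017, 12, 11, 8, 40)),
    ("someone", "Poster hall is packed #AGU17", datetime(2017, 12, 12, 14, 15)),
    ("latecomer", "Still thinking about #AGU17", datetime(2018, 1, 2, 9, 0)),
]

# Meeting week plus a 1 day buffer on each side; END is exclusive.
START = datetime(2017, 12, 10)
END = datetime(2017, 12, 17)


def in_meeting(rows, start=START, end=END):
    """Keep only rows whose timestamp falls in [start, end)."""
    return [row for row in rows if start <= row[2] < end]


def is_retweet(text):
    """Assumption: retweets are recognizable by a leading 'RT @'."""
    return text.startswith("RT @")


def hourly_counts(rows):
    """Count tweets per hour by truncating each timestamp to the hour."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for _, _, ts in rows
    )


meeting = in_meeting(tweets)
retweet_share = sum(is_retweet(text) for _, text, _ in meeting) / len(meeting)
originals = [row for row in meeting if not is_retweet(row[1])]
agu_only = [row for row in meeting if row[0] == "theAGU"]
```

Passing `meeting`, `originals`, or `agu_only` to `hourly_counts` gives the three time series discussed above, ready for plotting.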

Thanks @cpikas for collecting and publishing the data! It is available on figshare:

My code is on github here

Data Collection: getting data for GRL articles

In previous posts I have looked at several aspects of Earth and Space Science citations in Wikipedia. As part of a project I am working on, I’m interested in expanding this work to look at some mechanics of citations in Wikipedia to articles in Geophysical Research Letters (e.g., when do they appear, who writes the edit, on what Wikipedia pages, etc.). In this post, I want to walk through my method for getting the data that I will analyze. All the code is available (note that I am not a good R programmer).

Data on Wikipedia mentions are aggregated by Altmetric. rOpenSci built a tool to get altmetric data (rAltmetric) using the Altmetric API. rAltmetric works by retrieving data for each paper using the paper’s DOIs — so I need the DOIs for any/all papers before I do anything. Fortunately, rOpenSci has a tool for this too — rcrossref — which queries the Crossref database for relevant DOIs given some condition.

Since my project is focused on Geophysical Research Letters, I only need the DOIs for papers published in GRL. Using the ISSN for GRL, I downloaded 36,960 DOIs associated with GRL and then the associated Altmetric data (using rAltmetric).
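I did this step in R with rcrossref, but the same idea can be sketched in Python against the public Crossref REST API. This is a rough sketch, not my actual code: it assumes the `/journals/{issn}/works` route with cursor-based paging, and the function names are made up for illustration. The fetch function is passed in as an argument so that the paging logic can be checked without touching the network.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Crossref REST API route for works published in a journal with a given ISSN.
API = "https://api.crossref.org/journals/{issn}/works"


def page_url(issn, cursor="*", rows=1000):
    """Build the URL for one page of results, requesting only the DOI field."""
    query = urlencode({"select": "DOI", "rows": rows, "cursor": cursor})
    return API.format(issn=quote(issn)) + "?" + query


def fetch_page(issn, cursor):
    """Download and decode one page of results (makes a network request)."""
    with urlopen(page_url(issn, cursor)) as response:
        return json.load(response)


def iter_dois(issn, fetch=fetch_page):
    """Yield every DOI for the journal, following cursors until a page is empty."""
    cursor = "*"
    while True:
        message = fetch(issn, cursor)["message"]
        items = message.get("items", [])
        if not items:
            break
        for item in items:
            yield item["DOI"]
        cursor = message["next-cursor"]
```

Something like `dois = list(iter_dois("0094-8276"))` would then collect the list, assuming 0094-8276 is the right ISSN for GRL (check before relying on it). Each DOI can then be handed to whatever tool retrieves the Altmetric data.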

The data from rAltmetric returns the number of times a given article is cited in Wikipedia. But I want some extra detail:

  • The name of the Wikipedia article where the GRL citation appears
  • When the edit was made
  • and Who made the edit

This information is returned through the Altmetric commercial API: you can email Stacy Konkiel at Altmetric to get a commercial API key through Altmetric’s ‘researcher data access program’ (free access for those doing research). I got the data another way, via webscraping. To keep everything in R, I used rvest to scrape the Altmetric page (for each GRL article) to get Wikipedia information: the Wikipedia page that was edited, the author, and the edit date. Here is an example of a page for a GRL article:


The Wikipedia page (‘Overwash’), the user (‘Ebgoldstein’ — hey that’s me!), and the edit date (’10 Feb 2017′) are all mentioned… this is the data that I scraped for.

Additionally, I scraped the GRL article page to get the date that the GRL article first appeared online (not when it was formally typeset and published). Here is an example of a GRL article landing page:


Notice that the article was first published on 15 Dec 2016. However, if you click the ‘Full Publication History’ link, you find out that the article first appeared online 24 Nov 2016 — so potential Wikipedia editors could add a citation prior to the formal ‘publication date’ of the GRL article.

So now that I have that data, what does it look like? Out of 36,960 GRL articles, 921 appear in Wikipedia, and some are even cited multiple times. Below is a plot with the number of GRL articles (y-axis) that appear in Wikipedia, tallied by the number of times they are cited in Wikipedia (note the log y-axis).


GRL articles are spread over a range of Wikipedia pages, but some Wikipedia pages have many references to GRL articles (note the log scale of the y-axis):


553 Wikipedia articles have a reference to only a single GRL article, while some articles contain many GRL references. Take for instance the ‘Cluster II (spacecraft)’ page, with 25 GRL citations, or ‘El Niño’ with 11 GRL references.
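The two tallies behind these plots are simple to reproduce. Here is a small Python sketch (my analysis was in R), assuming the scraped data has been reduced to one (DOI, Wikipedia page) pair per citation. The DOIs and the pairings below are invented placeholders, not real records.

```python
from collections import Counter

# Hypothetical scraped records: one (GRL DOI, Wikipedia page) pair per citation.
mentions = [
    ("10.1029/example-a", "Overwash"),
    ("10.1029/example-a", "Barrier island"),
    ("10.1029/example-b", "El Niño"),
    ("10.1029/example-c", "El Niño"),
]


def tally(pairs):
    """Return two histograms: articles by citation count, pages by GRL reference count."""
    unique_pairs = set(pairs)  # count each article at most once per page
    per_article = Counter(doi for doi, _ in unique_pairs)
    per_page = Counter(page for _, page in unique_pairs)
    articles_by_count = Counter(per_article.values())
    pages_by_count = Counter(per_page.values())
    return articles_by_count, pages_by_count


articles_by_count, pages_by_count = tally(mentions)
```

Here `articles_by_count[n]` is the number of articles cited on exactly `n` Wikipedia pages, and `pages_by_count[n]` is the number of Wikipedia pages with exactly `n` GRL references, which are the quantities on the two log-scale y-axes.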

I’ll dive into the data I collected over the next few weeks in a series of blog posts, but I want to leave you with some caveats about the code and the data so far. (Edited after the initial posting) The Altmetric page only shows the data for up to 5 Wikipedia mentions for a given journal article unless you have paid (institutional) access. Several GRL articles were cited in >5 Wikipedia articles, so I manually added the missing data. Hopefully I will make a programmatic work-around sometime. After I wrote this post, I was informed that the commercial Altmetric API gives all of the Wikipedia data (edit, editor, date). To get a commercial API key through Altmetric’s ‘researcher data access program’ (free access for those doing research), email Stacy Konkiel at Altmetric (thanks Stacy!).

Furthermore, many of the edit times that you see here could be re-edits, therefore ‘overprinting’ the date and editor for the first appearance of the wikipedia citation. This will be the subject of a future post, though I haven’t yet found an easy way to get the original edit…

Peering into the Nature Geoscience author dataset

A recent Nature Geoscience editorial looked at the reviewer suggestions of submitting authors.  The editorial examined many different issues, including:

  • The geographic breakdown of submitting authors.
  • The geographic breakdown of author-suggested reviewers.
  • The geographic and gender breakdown for submitting authors whose paper was sent for review.
  • The gender breakdown of suggested reviewers by submitting author gender.

Fortunately, the data behind the editorial was also provided as a supplement. So let’s take a peek and investigate some other aspects of the data. First, let’s check out the gender breakdown of submitting authors by geographic region:


For reference, ‘f’ is female, ‘m’ is male, and ‘u’ is unknown. The disproportion is clear across all regions (note that Australia and NZ seem to be least disproportionate).

Next, let’s check out the geography of suggested reviewers by submitting author geography. Here is the number of authors who suggested reviewers, by geography:

Now from this set of authors, the proportion of suggested reviewers broken down by geography:


One major trend I see, aside from the lack of balance across all recommendations, is that North American authors recommend North American reviewers most of the time (~65%). No other geographic location recommends itself as much (see even the European + Middle East authors, who recommend European + Middle East reviewers equally with North Americans).
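The proportions above are row-normalized counts: for each author region, the share of that region's suggestions going to each reviewer region. Here is a minimal Python sketch of that calculation (my analysis was in R). The input pairs are invented for illustration and do not come from the editorial's supplement.

```python
from collections import Counter, defaultdict

# Hypothetical (author region, suggested reviewer region) pairs.
suggestions = [
    ("North America", "North America"),
    ("North America", "North America"),
    ("North America", "Europe + Middle East"),
    ("Europe + Middle East", "Europe + Middle East"),
    ("Europe + Middle East", "North America"),
]


def reviewer_shares(pairs):
    """For each author region, return the fraction of suggestions per reviewer region."""
    counts = defaultdict(Counter)
    for author_region, reviewer_region in pairs:
        counts[author_region][reviewer_region] += 1
    shares = {}
    for author_region, row in counts.items():
        total = sum(row.values())
        shares[author_region] = {region: n / total for region, n in row.items()}
    return shares


shares = reviewer_shares(suggestions)
```

Each row of `shares` sums to 1, so regions with very different numbers of submitting authors can be compared directly.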

I can think of data that is missing from this dataset, in particular the breakdown of assigned reviewers by geography. However, the editorial alludes to some answers:

“Nevertheless, the geographical distribution of editor-assigned reviewers resembles the biases of author-suggested reviewers”

The R code for my analysis is here. This post was a good excuse to continue learning R (keep in mind that I am learning R as you look at the messy, verbose code).


AGU publications on Sci-Hub

Sci-Hub, the web service with over 60 million academic papers, released the DOIs of its article holdings earlier this year:

The data was quickly put into a figshare repository (Hahnel, 2017), and some analysis has already been done by Greshake (2017) on this list of DOIs.

Here I want to do some simple analysis with this dataset, and look at how many papers published by AGU are part of this collection. Keep in mind a few things:

  1. I believe in acquiring papers through legal means, and do not advocate searching for/using illegally distributed copies (through Sci-Hub or ResearchGate). New great tools to look for free versions of manuscripts are Unpaywall and oadoi.
  2. Papers in AGU journals published from 1997 up to 24 months ago are freely available, so many of the papers on Sci-Hub are already free.

OK, back to the Sci-Hub dataset: article Digital Object Identifiers (DOIs) are broken into 2 parts, a prefix and a suffix. As I understand it, prior to being published by Wiley, AGU articles used the prefix 10.1029, so I first extracted all entries with this older prefix. This really restricts this analysis to pre-2013 AGU articles (I’m not sure when exactly the change occurred), but some of the older articles might have been published prior to the 1997 ‘open access’ cutoff.

I was left with a list of 171,752 articles (from the original 62 million).

The suffix of AGU DOIs corresponds to a single article from a specific journal, and letter codes in the suffix are used to denote the journal — for example, a GRL article has a suffix that includes ‘GL’ or ‘gl’.

For example, a GRL article has a DOI that looks like this: 10.1029/2006GL028162

Each AGU journal has a unique letter combo in the suffix, so the list of DOIs can be counted based on this suffix. Here is the line of julia code that I used to search through the list of ‘10.1029’ DOIs to find GRL articles.


Parsing the 171,752 articles into specific journals yields:


  • Among these AGU journals, GRL has the most articles on Sci-Hub. Note also the huge volume of EOS articles (?!?) and Water Resources Research articles.
  • The older and higher volume JGR sections (Oceans, Atmospheres, Solid Earth and Space Physics) outweigh the newer, smaller sections (Biogeosciences, Earth Surface, Planets). Here are some JGR publication stats.
  • ~5% of the 171,752 AGU DOIs did not conform to this search — they might be books, chapters, or other documents.
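The prefix filter and the journal-code counting described above can be sketched in a few lines of Python. This is not the Julia code I used. The regular expression is an assumption about how the suffix is laid out (an optional year, a two-letter journal code, then digits), and only the first DOI below is a real example taken from this post.

```python
import re
from collections import Counter

OLD_AGU_PREFIX = "10.1029/"

# Assumed suffix layout: optional year digits, a two-letter journal code, then a digit.
JOURNAL_CODE = re.compile(r"^\d{0,4}([A-Za-z]{2})\d")


def journal_code(doi):
    """Return the upper-cased two-letter journal code, or None if the DOI does not conform."""
    if not doi.startswith(OLD_AGU_PREFIX):
        return None
    suffix = doi[len(OLD_AGU_PREFIX):]
    match = JOURNAL_CODE.match(suffix)
    return match.group(1).upper() if match else None


def count_by_journal(dois):
    """Count DOIs with the old AGU prefix by journal code; non-conforming ones go to 'other'."""
    agu = [doi for doi in dois if doi.startswith(OLD_AGU_PREFIX)]
    return Counter(journal_code(doi) or "other" for doi in agu)


dois = [
    "10.1029/2006GL028162",      # the GRL example from this post
    "10.1029/2006gl000000",      # hypothetical lower-case variant
    "10.1029/misc-item",         # hypothetical non-conforming entry
    "10.1016/j.example.2010.01", # hypothetical non-AGU DOI, ignored
]
counts = count_by_journal(dois)
```

Upper-casing the code handles the ‘GL’ versus ‘gl’ variation, and the "other" bucket plays the role of the DOIs that did not conform to the search.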

Here is another interesting article on Sci-Hub, from Science (Bohannon, 2016): Who’s Downloading Pirated Papers? Everyone

‘Sleeping Beauties’ of Geomorphology: cases from the Journal of Geology

To recap from a previous post:

“Most papers in disciplinary geomorphology journals are cited at some point, but citations to papers do not always accrue immediately upon publication — ideas and papers might take time to be used by researchers and therefore cited. Extreme examples of delayed recognition (‘Sleeping Beauties‘) — where papers receive no citations for long stretches of time only to receive a large, late burst in citations — have been identified and investigated previously.

Do geomorphology ‘Sleeping Beauties’ exist? Using the methods of Ke et al. (2015) to find and score ‘Sleeping Beauties’, it turns out that 9 out of the 20 most delayed papers in GSA Bulletin are focused on quantitative geomorphology. What other papers show this interesting signature of delayed recognition?”

Today I want to look for Sleeping Beauties in ‘The Journal of Geology‘. 

JG has been published since 1893, and has been the venue for some classic geomorphology papers (e.g., Wolman and Miller, 1960; Magnitude and Frequency of Forces in Geomorphic Processes; which I will discuss in a future post).

In January 2017 I downloaded the citation time series for the 500 most cited Journal of Geology articles and used the algorithm of Ke et al. (2015) to find papers with the highest ‘delayed recognition’ score, a ranking of each paper’s citation time series based on the largest, latest peak (read Ke et al. (2015) to learn more about the method).
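For readers who want a feel for the score, here is a small Python sketch of the ‘beauty coefficient’ B as I understand it from Ke et al. (2015): draw a straight line from the citation count in the publication year to the count in the peak year, then sum the gaps between that line and the actual yearly counts, each divided by max(1, yearly count). This is my reading of the method, not the code I ran, so check it against the paper before relying on it.

```python
def beauty_coefficient(citations):
    """Beauty coefficient B for a yearly citation series.

    citations[t] is the number of citations received t years after publication.
    Assumptions: ties for the peak year go to the earliest year, and B is
    taken to be 0 when the peak falls in the publication year.
    """
    if not citations:
        return 0.0
    t_max = max(range(len(citations)), key=lambda t: citations[t])
    if t_max == 0:
        return 0.0
    c_0, c_max = citations[0], citations[t_max]
    slope = (c_max - c_0) / t_max  # reference line from year 0 to the peak year
    return sum(
        (slope * t + c_0 - c_t) / max(1, c_t)
        for t, c_t in enumerate(citations[: t_max + 1])
    )


sleeper = beauty_coefficient([0, 0, 0, 0, 10])  # long silence, then a late burst
steady = beauty_coefficient([0, 1, 2, 3, 4])    # citations grow along the line
```

A paper whose citations grow steadily sits on the reference line and scores about zero, while a long quiet stretch followed by a late, large peak produces a high score.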

The top four papers, published from 1922 to 1935, are all focused on grain size and shape:

  1. Wentworth, C. K. (1922). A scale of grade and class terms for clastic sediments. The Journal of Geology, 30(5), 377-392. (pdf here)
  2. Wadell, H. (1935). Volume, shape, and roundness of quartz particles. The Journal of Geology, 43(3), 250-280. (article here)
  3. Wadell, H. (1932). Volume, shape, and roundness of rock particles. The Journal of Geology, 40(5), 443-451. (article here)
  4. Wadell, H. (1933). Sphericity and roundness of rock particles. The Journal of Geology, 41(3), 310-331. (article here)

The citation time series for each paper is shown below:

As with the last post, I will not offer any ‘reasons’ why these papers have an explosion in citations in the past 10 years. To do this, a first step would be a careful look at co-citation networks — what papers often co-occur with the citations — and the actual in-text usages and citations.

I did a cursory look at co-cited papers, and all of the papers show an affinity to two recent well-cited papers:

  • Blott, S. J., & Pye, K. (2001). GRADISTAT: a grain size distribution and statistics package for the analysis of unconsolidated sediments. Earth Surface Processes and Landforms, 26(11), 1237-1248.
  • Blott, S. J., & Pye, K. (2008). Particle shape: a review and new methods of characterization and classification. Sedimentology, 55(1), 31-63.

Last I looked, Blott and Pye (2001) was the most cited paper in ESPL, and it is cited in a policy document, a rare occurrence for a geomorphology paper.

‘Sleeping Beauties’ of Geomorphology: a case from the American Journal of Science

Most papers in disciplinary geomorphology journals are cited at some point, but citations to papers do not always accrue immediately upon publication: ideas and papers might take time to be used by researchers and therefore cited. Extreme examples of delayed recognition (‘Sleeping Beauties’), where papers receive no citations for long stretches of time only to receive a large, late burst in citations, have been identified and investigated previously.

Do geomorphology ‘Sleeping Beauties’ exist? Using the methods of Ke et al. (2015) to find and score ‘Sleeping Beauties’, it turns out that 9 out of the 20 most delayed papers in GSA Bulletin are focused on quantitative geomorphology.

What other papers show this interesting signature of delayed recognition?

I have looked in other journals and found a few neat examples, which I hope to chronicle in a series of posts. Today, I will look at an example from the American Journal of Science (AJS):

The AJS has been published since 1818, and has long been a geology venue. In January 2017 I downloaded the 500 most cited AJS articles from the Web of Science. I used the algorithm presented in Ke et al. (2015) to find the papers with the highest ‘delayed recognition’ score, a ranking of each paper’s citation time series based on the largest, latest peak (I urge you all to read Ke et al. (2015), which describes the method).

The most delayed paper is about brachiopods, but I want to focus on research related to geomorphology, so let’s look at the 2nd most delayed paper:

W.W.Rubey (1933): Settling velocities of gravel, sand, and silt particles. Am J Sci April 1, 1933 Series 5 Vol. 25:325-338; doi:10.2475/ajs.s5-25.148.325

(n.b., settling velocity has a special place in my heart)

Rubey’s paper has a score that is similar to the papers from GSA Bulletin. Here is the citation time series for the Rubey paper:

So the natural question is: what happened that caused this 2014 burst of citations? As far as I can tell (from looking at the papers that cited Rubey), nothing in particular… Most papers that cite Rubey are focused on typical sediment transport questions. A close read of all the citing papers would be needed to figure out what is going on here, if there is some ‘signal’. Not a satisfying answer, and I apologize. Leave a comment if you have an idea and I’ll update the post if I find anything out.