Citation distributions for JGR-ES (using the data that underlies the 2017 and 2018 JIF calculation)

The 2019 edition of the Journal Citation Reports was released a few months ago; it includes the new journal impact factor calculations for 2018. To recap from my previous posts discussing this issue, a journal impact factor is calculated by dividing the total number of citations in a given year (to articles published in the previous 2 years) by the total number of articles published in the previous 2 years. It is not a metric for individual papers, and applying it to individual papers is a misapplication. A way to see the full range of citations garnered by papers published in the previous two years is the journal citation distribution: the number of papers published in the past 2 years that have a given number of citations in the following year. Citation distributions can be calculated using the same set of data as a Journal Impact Factor (1 year of citations for the previous 2 years of papers). I became aware of citation distributions after reading a preprint by Larivière et al. (2016).
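To make the two quantities concrete, here is a minimal sketch in Python. The citation counts are made up for illustration (they are not the JGR-ES data), and the function names are my own. It also treats every paper as a citable item, which is a simplification of the official calculation.

```python
from collections import Counter

def impact_factor(citations_per_paper):
    """Total citations divided by the number of papers (a JIF-style mean)."""
    return sum(citations_per_paper) / len(citations_per_paper)

def citation_distribution(citations_per_paper):
    """Map each citation count to the number of papers with that count."""
    return dict(sorted(Counter(citations_per_paper).items()))

# Hypothetical data: citations received in one year by ten papers
# published in the previous two years.
example_counts = [0, 1, 1, 2, 3, 3, 3, 5, 8, 24]

jif = impact_factor(example_counts)
dist = citation_distribution(example_counts)
print(f"JIF-style mean: {jif:.1f}")
print(dist)
```

In this toy example one paper with 24 citations pulls the mean well above what most of the papers received, which is exactly what the distribution shows and the single number hides.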

Below I calculate the citation distributions for AGU’s JGR-ES with data for the 2017 and 2018 journal impact factor calculations (for the 2018 calculation, 2018 citations to 2017 and 2016 publications; for the 2017 calculation, 2017 citations to 2016 and 2015 publications). All of the data was downloaded from Scopus.


Just to clarify further: each plot reports the number of papers (on the y axis) with a given number of citations (on the x axis), with 256 papers in the left plot and 243 papers in the right plot.

Distributions are skewed, as expected. The 2017 distribution has a spike in papers with 3 citations, which disappears in the 2018 distribution. The center of mass moves slightly rightward in 2018, meaning more papers had more citations. So what is in the tail (far to the right)? In 2018, it’s a bunch of cryosphere papers. I’m not going to name names…

To be honest, I’m not sure what I learned from these graphs and data tables. There are no strong signals like SfM in the 2014 and 2015 data.

I’ve done this for several years now. Here is a list of the previous posts:


Five questions (and answers) about preprints

(This post originally appeared on the Coast and Ocean Collective Blog)

After the last post, Giovanni asked me to respond to five questions.

1) What about reviewing, are you saying that reviewing of scientific manuscripts is not necessary?

My previous post purposely avoids discussion of peer-review. If you look at the graphic in the earlier post, taken from the Wouters et al. (2019) article, one role of a journal is to certify that published work has been evaluated. I think readers need to understand that preprints are not evaluated. On EarthArXiv, all manuscripts that are not peer-reviewed must carry a notice stating as much.

But thinking about peer-review and evaluation brings up some deeper questions. Assuming that an article is true or valid just because it is typeset in a journal is a ridiculous idea; the burden of convincing a reader is on the author, based on their work. My point is that perhaps we should remember that most of us read and evaluate papers for ourselves, regardless of where they are published.

Additionally, most journals hide peer review reports (the Copernicus/EGU journals being an exception), so most of the time we are just assuming that the peer review process was performed adequately for published papers. When peer review reports are published (e.g., in the Copernicus/EGU journals), a reader can actually read and understand the peer review process for a given paper. Hidden review reports force a reader to trust that the journal publisher, the editors, and the reviewers all did an adequate job.

Lastly, there are initiatives to decouple review from journals. For example, PREreview is an initiative focused on reviews for preprints. I think it’s a neat idea if journal clubs focus on preprints, and produce review reports that could be relayed to the author. Another example is PubPeer.

2) Are preprints just a stop-gap solution, or are they a general solution to problems in scholarly publishing?

I think preprints are part of a larger solution to some problems in scholarly publishing. There are bigger issues though, many having to do with the costs to authors (page charges; OA fees) and the cost of journal subscriptions for libraries. A clear next step is to start developing free open access journals, and potentially even journals that work on top of the preprint servers. Here are some really inspiring examples:

Journals like VOLCANICA are breaking new ground for the Earth sciences by creating a free, open access journal with a very low operating cost (500 euros per year for the entire journal) and transparency in the breakdown of those costs. Here is the editorial:

Another journal with low operating costs and transparency on how costs are used is JOSS, the Journal of Open Source Software, with operating costs of ~$3.50 per article. JOSS runs on GitHub, and here is the editorial describing the model and costs:

Lastly, an example of a journal that explicitly leverages preprint infrastructure is Discrete Analysis. An author submits an article to a preprint server, in this case arXiv, and tells the journal they would like to submit. The journal organizes reviews, and if the article is accepted, it is given a special LaTeX template to indicate it was reviewed, and the new version is resubmitted to arXiv.

3) What about papers that get rejected? Especially if a paper is rejected from a short-form journal and needs to go to a long-form journal?

Preprints have version control, so if the paper needs to change to respond to criticism, or the authors need to adjust formatting for a different journal, then the preprint is replaced with a new copy.

4) Can I cite preprints?

Yes, preprints can and should be cited. Preprints are given a DOI, and this can be used in the journal citation. EarthArXiv preprints are also indexed by Google Scholar. In the case that people have cited your preprint and you now have a published journal article, Dan Ibarra wrote a nice summary on twitter on how to merge these two entries in Google Scholar.

5) Are you sure about how good it is for early career researchers? Are there risks?

I think preprints are really great for early career researchers. There are two issues with preprints that might be perceived as risks for researchers: first, a paper may be rejected and need to be reformatted for a new journal; second, there may be an error in a paper that needs to be corrected. If there are errors in a paper, the author can easily issue a correction in the form of a new version of the preprint. Error correction becomes easier and hopefully faster than dealing with a journal article. Reformatting is solved in the same way: a new version of the preprint can be produced. Perhaps having multiple versions of a preprint might make a researcher feel self-conscious or embarrassed, but as a community we should destigmatize this process of versioning preprints (and all scholarly artifacts, for that matter). Error identification and correction is an important part of the scientific process, and journal choices are often not controlled by early career researchers. Regardless, it’s likely that only a rare few people will dig through old preprint versions to determine the changes that authors made.

My personal belief is that the rewards of visibility to written scientific work outweigh anything that can be called a risk.

Preprints Exist!

(This post originally appeared on the Coast and Ocean Collective Blog)

Preprints are non-peer-reviewed scholarly documents that precede publication in a peer-reviewed journal. Several disciplines like Physics, Astronomy, and Computer Science have been using preprints through arXiv for decades. Other disciplines are catching on, notably the biological sciences (see bioRxiv), along with a variety of other discipline-specific preprint services (e.g., here). There have been many great articles and blog posts discussing preprints recently (common questions, critiques, misconceptions, concerns, etc.). Here are three especially useful introductions:

1) Bourne PE, Polka JK, Vale RD, Kiley R (2017) Ten simple rules to consider regarding preprint submission. PLoS Comput Biol 13(5): e1005473.

2) Sarabipour S, Debat HJ, Emmott E, Burgess SJ, Schwessinger B, Hensel Z (2019) On the value of preprints: An early career researcher perspective. PLoS Biol 17(2): e3000151.

3) A recent comprehensive look by Sheila Saia for the Young Hydrologic Society website is particularly useful for Earth and environmental scientists.

Full disclosure: I am a big advocate for preprints, interested in preprint adoption as a topic of study, and a current member of the EarthArXiv community. EarthArXiv is a community-run preprint server for the Earth sciences (Narock et al. 2019). We have a very active community (especially on Twitter!) so please bring us your questions/comments/concerns/clarifications.

To me, there are too many interesting facets of preprints to discuss in a single post. Here, I focus on some ways in which preprints complement existing, more traditional ways of publishing, so we need to start by looking at scholarly communication and scholarly publishing, specifically journals.

A recent comment by Wouters et al. (2019) outlined 5 roles for journals:

[Graphic from Wouters et al. (2019)]

I want to focus on discussing preprints in relation to tasks 1 (registering) and 5 (archiving). These are tasks that a scientific journal currently does by giving submission dates and assigning a persistent identifier to a journal article (i.e., the digital object identifier; DOI). In my opinion, these are tasks that we do not necessarily need a scientific journal to do. Instead, a preprint can accomplish these tasks by establishing precedent for an idea and providing a means of citing the idea via a DOI.

If we rely on journals to do these tasks, the process can be drawn out. Peer-review can take months (or even years) before an article is published and visible to a community of peers. This is not a complaint against peer-review or the peer-review process; I am arguing here that several steps can occur before peer-review. My opinion is that bundling the registration of an idea (Task 1) and the archiving of an idea (Task 5) with the peer-review process is suboptimal for one key reason:

No one can read, cite, or respond to an idea when the paper is hidden in review — only the editor, AE, reviewers, and coauthors can read, engage with, explore, and think about the work. These ideas may be presented at conferences, but in the written record, they do not exist (e.g., many journals have policies discouraging citations to conference abstracts). Ideas that are preprinted have persistent identifiers (DOIs) and can (and should!) be cited and discussed by others — preprints exist.

For early career scientists, this is especially important. Scholarly work in review with no preprint remains invisible to the broader community. Early career scientists often mention ‘in prep’ or ‘in review’ articles on CVs. I’d argue that this is far less meaningful than linking to a preprint version (where people could actually read and cite your work). Again, preprints exist.

Being unable to read and cite articles that are in review in a transparent way hampers our ability to do science. Hiding articles through the review process is a form of information asymmetry — and a bizarre, imperfect hiding. I know about lots of work that remains hidden — I read about it as a grant or paper reviewer, I hear about it in passing from colleagues, and conference presentations give a glimpse of what will be published in the next few years — but I cannot cite these ideas or these works unless there are preprints. Put another way — there is a subset of ideas that I know about, but can’t share with colleagues. This is strange.

This is where preprints come into the picture. Preprint services like EarthArXiv can 1) store papers (i.e., registering intellectual claims associated with author names and submission timestamps), and 2) assign DOIs and archive scholarly artifacts. Therefore preprint services accomplish Task 1 (registering) and Task 5 (archiving) in the Wouters et al. (2019) taxonomy. Preprints leave the other tasks (curating, evaluating, and disseminating) for other services such as scientific journals.

My argument here is that we should unbundle the services that journals provide to increase the flow of information. Preprints can accomplish some of these tasks faster, cheaper, and better than traditional journals.

My Time at CSDMS 2019


(This post originally appeared on the Coast and Ocean Collective Blog)

In May I went to my first annual meeting of CSDMS, the Community Surface Dynamics Modeling System. It was great to see old friends and meet new ones.

CSDMS is involved in a range of different projects and provides a suite of different services to the earth surface processes modeling community. You might know about CSDMS from its model repository (with metadata and links to source code) and the handy tools developed by CSDMS to link models together. For more background on CSDMS, check out their webpage.

One nice aspect of CSDMS is that the keynotes and panels are recorded and put on YouTube, and many poster presenters upload PDFs of their poster. I have spent a few hours skimming through these videos and PDFs from past meetings — lots of interesting ideas.

The annual meeting theme this year was ‘Bridging Boundaries’, and there was a range of interesting talks, posters, clinics, breakout sessions, and panels. I want to mention just a few highlights from those 3 packed days.

  • I really enjoyed the wide range of keynotes. Two particularly interesting ones were:
  • I really enjoyed the 2 panel discussions:
  • A real highlight for me was Dan Buscombe’s deep learning clinic. Dan walked us through a comprehensive Jupyter notebook based on his work on pixel-scale image classification. It was great to hear Dan explain his workflow, and it was great to meet him in person. I urge you to check out his work!
  • There were too many amazing posters to cover in one post. I recommend scrolling through the abstracts and poster pdfs online.
  • I live-tweeted the 3rd day through the CSDMS and AGU EPSP twitter accounts. This was really fun and I’m grateful for the opportunity from the AGU EPSP social media team.
  • I am very grateful to CSDMS for inviting me to give a keynote this year — it was exciting to share my ideas with such a talented group of people. My talk — video, slides — focused on ML work that I have done with the Coast and Ocean Collective (and others), specifically work on swash, runup, ‘hybrid’ models, and the ML review paper that was just published.
  • Lastly, I ate a lot of (good) pizza.



Twitter record of #AGU17

I missed the 2017 Fall AGU meeting, but I did follow along on twitter. However the coverage was spotty — some sessions were mentioned, some not at all. From this experience I kept wondering about the digital traces of the meeting on twitter. Lo and behold I saw this tweet from Dr. Christina K. Pikas (@cpikas) at the beginning of this year:

So let’s look at this awesome dataset that Dr. Pikas collected and published on figshare. First, this data was collected using TAGS, and contains tweets from Nov. 4th, 2017 to Jan. 4th, 2018 that used the hashtag #AGU17. There are a total of 31,909 tweets in this dataset. In this post I am subsetting the data to look only at the meeting (with a 1 day buffer, so Sunday Dec. 10, 2017 to Saturday Dec. 16, 2017), a total of 25,531 tweets during the 7 days:
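The subsetting and binning behind a plot like this can be sketched in a few lines of Python. This is only an illustration, not my actual code. The timestamps are made up, the function name is my own, and I assume the tweet times have already been parsed into `datetime` objects in the meeting's local time.

```python
from collections import Counter
from datetime import datetime

def hourly_counts(timestamps, start, end):
    """Count tweets per hour, keeping only timestamps with start <= t < end."""
    in_window = [t for t in timestamps if start <= t < end]
    # Truncate each timestamp to the top of its hour and tally.
    return Counter(t.replace(minute=0, second=0, microsecond=0) for t in in_window)

# Hypothetical tweet times standing in for the archive.
tweet_times = [
    datetime(2017, 12, 9, 23, 50),   # before the window
    datetime(2017, 12, 11, 9, 5),
    datetime(2017, 12, 11, 9, 40),
    datetime(2017, 12, 11, 12, 15),
    datetime(2017, 12, 17, 0, 30),   # after the window
]

window_start = datetime(2017, 12, 10)
window_end = datetime(2017, 12, 17)  # exclusive, so the window spans 7 full days

tweets_per_hour = hourly_counts(tweet_times, window_start, window_end)
for hour, n in sorted(tweets_per_hour.items()):
    print(hour, n)
```

Using an exclusive end bound keeps the window at exactly seven calendar days without having to think about the last second of Saturday.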


I noticed:

  • Twitter activity decays through the week (fatigue? do most people just tweet their arrival? Daily attendance variations?)
  • There is a noticeable lunch break on M, W, Th, and F
  • Each day twitter activity starts suddenly, but has a slow(er) decay at the end of the day (late night activities?)

Retweets account for 44% of the 25,531 tweets during the meeting. Removing RTs yields an almost identical plot, but there is a small peak that appears at the end of each day (pre-bedtime tweets?):
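One simple way to separate retweets from original tweets is sketched below. It assumes a retweet can be recognized by text beginning with "RT @", which is a common convention but not necessarily how the archive marks them. The example texts and function names are made up.

```python
def is_retweet(text):
    """Assumption: retweets are archived with text that begins with 'RT @'."""
    return text.startswith("RT @")

def retweet_share(texts):
    """Fraction of tweets that are retweets (0.0 for an empty list)."""
    if not texts:
        return 0.0
    return sum(is_retweet(t) for t in texts) / len(texts)

# Hypothetical tweet texts.
sample_texts = [
    "RT @someone: great session on coastal change #AGU17",
    "Heading to the poster hall #AGU17",
    "RT @another: slides are online #AGU17",
    "Coffee line is long today #AGU17",
    "Packed room for the keynote #AGU17",
]

share = retweet_share(sample_texts)
originals = [t for t in sample_texts if not is_retweet(t)]
print(f"{share:.0%} retweets, {len(originals)} original tweets")
```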


Lastly, the biggest #AGU17 twitter user is @theAGU (by far), which sent 1063 tweets during the week. Here is the timeseries with only @theAGU tweets:


I see the lunch break and not as many late nights for the organization.

Thanks @cpikas for collecting and publishing the data! It is available on figshare:

My code is on GitHub here

Data Collection: getting data for GRL articles

In previous posts I have looked at several aspects of Earth and Space Science citations in Wikipedia. As part of a project I am working on, I’m interested in expanding this work to look at some mechanics of citations in Wikipedia to articles in Geophysical Research Letters (e.g., when do they appear, who writes the edit, on what Wikipedia pages, etc.). In this post, I want to walk through my method for getting the data that I will analyze. All the code is available (note that I am not a good R programmer).

Data on Wikipedia mentions are aggregated by Altmetric. rOpenSci built a tool to get Altmetric data (rAltmetric) using the Altmetric API. rAltmetric works by retrieving data for each paper using the paper’s DOI, so I need the DOIs for any/all papers before I do anything. Fortunately, rOpenSci has a tool for this too: rcrossref, which queries the Crossref database for relevant DOIs given some condition.
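For readers who do not use R, the same DOI harvest can be sketched in Python against the Crossref REST API directly. This is not the rcrossref code I ran; it is a minimal illustration of cursor-based paging, and the function names are my own. The fetch function is passed in as an argument so the paging logic can be exercised without touching the network.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/journals/{issn}/works"

def fetch_page(issn, cursor, rows=1000):
    """Fetch one page of works for a journal and return the 'message' part."""
    query = urllib.parse.urlencode({"select": "DOI", "rows": rows, "cursor": cursor})
    url = CROSSREF_WORKS.format(issn=issn) + "?" + query
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["message"]

def collect_dois(issn, fetch=fetch_page):
    """Follow Crossref cursors until a page comes back with no items."""
    dois, cursor = [], "*"  # "*" asks Crossref to start a new cursor
    while True:
        message = fetch(issn, cursor)
        items = message.get("items", [])
        if not items:
            return dois
        dois.extend(item["DOI"] for item in items)
        cursor = message["next-cursor"]

# Usage (network call, so left commented out). The ISSN here is what I
# believe is the GRL print ISSN; check it before relying on it.
# grl_dois = collect_dois("0094-8276")
```

Requesting only the DOI field keeps each response small, which matters when a journal has tens of thousands of records.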

Since my project is focused on Geophysical Research Letters, I only need the DOIs for papers published in GRL. Using the ISSN for GRL, I downloaded 36,960 DOIs associated with GRL and then the associated Altmetric data (using rAltmetric).
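As a rough Python counterpart to the rAltmetric step, the sketch below looks up one DOI through Altmetric's public details endpoint and pulls out the Wikipedia tally. The field name `cited_by_wikipedia_count` and the use of a 404 response for DOIs with no Altmetric record reflect my understanding of that API and should be checked against its documentation. The function names are my own.

```python
import json
import urllib.error
import urllib.request

def fetch_altmetric(doi):
    """Return the Altmetric record for a DOI, or None if Altmetric has none."""
    url = "https://api.altmetric.com/v1/doi/" + doi
    try:
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # assumed: no attention data tracked for this DOI
            return None
        raise

def wikipedia_mentions(record):
    """Number of Wikipedia citations in a record (0 if missing or no record)."""
    if record is None:
        return 0
    # Assumed field name; absent when an article has no Wikipedia mentions.
    return int(record.get("cited_by_wikipedia_count", 0))

# Usage (network call, so left commented out):
# count = wikipedia_mentions(fetch_altmetric("10.0000/example"))
```

For tens of thousands of DOIs, a loop over this would also need to pause between requests to stay within whatever rate limits apply.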

The data from rAltmetric returns the number of times a given article is cited in Wikipedia. But I want some extra detail:

  • The name of the Wikipedia article where the GRL citation appears
  • When the edit was made
  • and Who made the edit

This information is returned through the Altmetric commercial API; you can email Stacy Konkiel at Altmetric to get a commercial API key through Altmetric’s ‘researcher data access program’ (free access for those doing research). I got the data another way, via web scraping. To keep everything in R, I used rvest to scrape the Altmetric page (for each GRL article) to get Wikipedia information: the Wikipedia page that was edited, the author, and the edit date. Here is an example of a page for a GRL article:


The Wikipedia page (‘Overwash’), the user (‘Ebgoldstein’, hey that’s me!), and the edit date (‘10 Feb 2017’) are all mentioned… this is the data that I scraped for.

Additionally, I scraped the GRL article page to get the date that the GRL article first appeared online (not when it was formally typeset and published). Here is an example of a GRL article landing page:


Notice that the article was first published on 15 Dec 2016. However, if you click the ‘Full Publication History’ link, you find out that the article first appeared online 24 Nov 2016 — so potential Wikipedia editors could add a citation prior to the formal ‘publication date’ of the GRL article.
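Once both dates are scraped, the lag between an article appearing and its Wikipedia citation is a simple date difference. The sketch below uses the dates quoted in this post purely as an illustration; pairing them assumes the two example pages describe the same article. The function name is my own, and the `%b` month parsing assumes an English locale.

```python
from datetime import datetime

def days_until_cited(article_date, wiki_edit_date, fmt="%d %b %Y"):
    """Days between an article date and a Wikipedia edit date."""
    article = datetime.strptime(article_date, fmt)
    edit = datetime.strptime(wiki_edit_date, fmt)
    return (edit - article).days

# Illustrative dates taken from the examples in this post.
lag_from_first_online = days_until_cited("24 Nov 2016", "10 Feb 2017")
lag_from_publication = days_until_cited("15 Dec 2016", "10 Feb 2017")
print(lag_from_first_online, lag_from_publication)
```

The two lags differ by three weeks, which is why the choice of 'first online' versus 'published' matters for any timing analysis.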

So now that I have that data, what does it look like? Out of 36,960 GRL articles, 921 appear in Wikipedia, and some are even cited multiple times. Below is a plot with the number of GRL articles (y-axis) that appear in Wikipedia, tallied by the number of times they are cited in Wikipedia (note the log y-axis).


GRL articles are spread over a range of Wikipedia pages, but some Wikipedia pages have many references to GRL articles (note the log scale of the y-axis):


553 Wikipedia articles have a reference to only a single GRL article, while some articles contain many GRL references. Take for instance the ‘Cluster II (spacecraft)’ page, with 25 GRL citations, or ‘El Niño’ with 11 GRL references.
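Both tallies (articles by number of Wikipedia mentions, and Wikipedia pages by number of GRL references) come from the same table of mentions. Here is a minimal Python sketch, assuming the scraped data has been reduced to (DOI, Wikipedia page) pairs. The DOIs and pairings below are made up.

```python
from collections import Counter

# Hypothetical (DOI, Wikipedia page) pairs, one per scraped mention.
mentions = [
    ("10.0000/a", "Overwash"),
    ("10.0000/a", "Barrier island"),
    ("10.0000/b", "El Niño"),
    ("10.0000/c", "El Niño"),
    ("10.0000/d", "El Niño"),
]

# Drop exact duplicates so an article counts once per page.
unique_mentions = set(mentions)

# How many Wikipedia pages cite each article, then how many articles
# fall at each mention count (the first histogram).
pages_per_article = Counter(doi for doi, _ in unique_mentions)
articles_by_mentions = Counter(pages_per_article.values())

# How many GRL articles each page cites, then how many pages fall at
# each reference count (the second histogram).
articles_per_page = Counter(page for _, page in unique_mentions)
pages_by_references = Counter(articles_per_page.values())

print(dict(articles_by_mentions))
print(dict(pages_by_references))
```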

I’ll dive into the data I collected over the next few weeks in a series of blog posts, but I want to leave you with some caveats about the code and the data so far. (Edited after the initial posting) The Altmetric page only shows the data for up to 5 Wikipedia mentions for a given journal article unless you have paid (institutional) access. Several GRL articles were cited in >5 Wikipedia articles, so I manually added the missing data. Hopefully I will make a programmatic work-around sometime. After I wrote this post, I was informed that the commercial Altmetric API gives all of the Wikipedia data (edit, editor, date). To get a commercial API key through Altmetric’s ‘researcher data access program’ (free access for those doing research), email Stacy Konkiel at Altmetric (thanks Stacy!).

Furthermore, many of the edit times that you see here could be re-edits, therefore ‘overprinting’ the date and editor for the first appearance of the Wikipedia citation. This will be the subject of a future post, though I haven’t yet found an easy way to get the original edit…