OKCupid Data Leak – Framing the Debate

You’ve probably heard by now that a ‘researcher’ by the name of Emil Kirkegaard released the sensitive data of 70,000 individuals from OKCupid on the Open Science Framework. This is an egregious violation of research ethics and we’re already beginning to see mainstream media coverage of this unfolding story. I’ve been following this pretty closely as it involves my PhD alma mater Aarhus University. All I want to do here is collect relevant links and facts for those who may not be aware of the story. This debacle is likely going to become a key discussion piece in future debates over how to conduct open science. Jump to the bottom of this post for a live-updated collection of news coverage, blogs, and tweets as this issue unfolds.

Emil himself continues to fan the flames by being totally unapologetic:

An open letter has been formed here, currently with the signatures of over 150 individuals (myself included) petitioning Aarhus University for a full statement and investigation of the issue:


Meanwhile Aarhus University has stated that Emil acted without oversight or any affiliation with AU, and that if he has claimed otherwise they intend to take (presumably legal) action:


I’m sure a lot more is going to be written as this story unfolds; the implications for open science are potentially huge. Already we’re seeing scientists wonder if this portends previously unappreciated risks of sharing data:

I just want to try and frame a few things. In the initial dust-up of this story there was a lot of confusion. I saw multiple accounts describing Emil as a “PI” (principal investigator), asking for his funding to be withdrawn, etc. At the time the details surrounding this were rather unclear. Now, as more and more details emerge, they paint a rather different picture, which is not being accurately portrayed so far in the media coverage:

Emil is not a ‘researcher’. He acted without any supervision or direct affiliation to AU. He is a masters student who claims on his website that he is ‘only enrolled at AU to collect SU [government funds]’. I’m seeing that most of the outlets describe this as ‘researchers release OKCupid data’. When considering the implications of this for open science and data sharing, we need to frame this as what it is: a group of hacktivists exploiting a security vulnerability under the guise of open science. NOT a university-backed research program.

What implications does this have for open science? From my perspective it looks like we need to discuss the role of oversight and data protection. Ongoing twitter discussion suggests Emil violated EU data protection laws and the OKCupid terms of service. But other sources argue that this kind of scraping ‘attack’ is basically data-gathering 101 and that nearly any undergraduate with the right education could have done this. It seems like we need to have a conversation about our digital rights to data privacy, and whether those are doing enough to protect us. Doesn’t OKCupid itself hold some responsibility for allowing this data to be accessed so easily? And what is the responsibility of the Open Science Framework? Do we need to put stronger safeguards in place? Could an organization like Anonymous, or even ISIS, ‘dox’ thousands of people and host the data there? These are extreme situations, but I think we need to frame them now before people walk away with the idea that this is an indictment of data sharing in general.

Below is a collection of tweets, blogs, and news coverage of the incident:


Brian Nosek on the Open Science Framework’s response:

More tweets on larger issues:


Emil has stated he is not acting on behalf of AU:


News coverage:










Here is a great example of how bad this is; Wired runs a story with the headline ‘OKCupid study reveals perils of big data science’:

OkCupid Study Reveals the Perils of Big-Data Science

This is not a study! It is not ‘science’! At least not by any principled definition!





Here is a defense of Emil’s actions:


Birth of a New School: How Self-Publication can Improve Research

Edit: click here for a PDF version and citable figshare link!

Preface: What follows is my attempt to imagine a radically different future for research publishing. Apologies for any overlooked references – the following is meant to be speculative and purposely walks the line between paper and blog post. Here is to a productive discussion regarding the future of research.

Our current systems of producing, disseminating, and evaluating research could be substantially improved. For-profit publishers enjoy extremely high taxpayer-funded profit margins. Traditional closed-door peer review is creaking under the weight of an exponentially growing knowledge base, delaying important communications and often resulting in seemingly arbitrary publication decisions1–4. Today’s young researchers are frequently dismayed to find their painstaking work producing quality reviews overlooked or discouraged by journalistic editorial practices. In response, the research community has risen to the challenge of reform, giving birth to an ever-expanding multitude of publishing tools: statistical methods to detect p-hacking5, numerous open-source publication models6–8, and innovative platforms for data and knowledge sharing9,10.

While I applaud the arrival and intent of these tools, I suspect that ultimately publication reform must begin with publication culture – with the very way we think of what a publication is and can be. After all, how can we effectively create infrastructure for practices that do not yet exist? Last summer, shortly after igniting #pdftribute, I began to think more and more about the problems confronting the publication of results. After months of conversations with colleagues I am now convinced that real reform will come not in the shape of new tools or infrastructures, but rather in the culture surrounding academic publishing itself. In many ways our current publishing infrastructure is the product of a paper-based society keen to produce lasting artifacts of scholarly research. In parallel, the exponential arrival of networked society has led to an open-source software community in which knowledge is not a static artifact but rather an ever-expanding living document of intelligent productivity. We must move towards “research 2.0” and beyond11.

From Wikipedia to GitHub, open-source communities are changing the way knowledge is produced and disseminated. Already this movement has begun to reach academia, with researchers across disciplines flocking to social media, blogs, and novel communication infrastructures to create a new movement of post-publication peer review4,12,13. In math and physics, researchers have already embraced self-publication, uploading preprints to the online repository arXiv, with more and more disciplines using the site to archive their research. I believe that the inevitable future of research communication is in this open-source metaphor, in the form of pervasive self-publication of scholarly knowledge. The question is thus not where are we going, but rather how do we prepare for this radical change in publication culture. In asking these questions I would like to imagine what research will look like 10, 15, or even 20 years from today. This post is intended as a first step towards bringing to light specific ideas for how this transition might be facilitated. Rather than this being a prescriptive essay, here I am merely attempting to imagine what that future may look like. I invite you to treat what follows as an ‘open beta’ for these ideas.

Part 1: Why self-publication?

I believe the essential metaphor is within the open-source software community. To this end over the past few months I have  feverishly discussed the merits and risks of self-publishing scholarly knowledge with my colleagues and peers. While at first I was worried many would find the notion of self-publication utterly absurd, I have been astonished at the responses – many have been excitedly optimistic! I was surprised to find that some of my most critical and stoic colleagues have lost so much faith in traditional publication and peer review that they are ready to consider more radical options.

The basic motivation for research self-publication is pretty simple: research papers cannot be properly evaluated without first being read. Now, by evaluation, I don’t mean for the purposes of hiring or grant giving committees. These are essentially financial decisions, e.g. “how do I effectively spend my money without reading the papers of the 200+ applicants for this position?” Such decisions will always rely on heuristics and metrics that must necessarily sacrifice accuracy for efficiency. However, I believe that self-publication culture will provide a finer grain of metrics than ever dreamed of under our current system. By documenting each step of the research process, self-publication and open science can yield rich information that can be mined for increasingly useful impact measures – but more on that later.

When it comes to evaluating research, many admit that there is no substitute for opening up an article and reading its content – regardless of journal. My prediction is, as post-publication peer review gains acceptance, some tenured researcher or brave young scholar will eventually decide to simply self-publish her research directly onto the internet, and when that research goes viral, the resulting deluge of self-publications will be overwhelming. Of course, busy lives require heuristic decisions and it’s arguable that publishers provide this editorial service. While I will address this issue specifically in Part 3, for now I want to point out that growing empirical evidence suggests that our current publisher/impact-based system provides an unreliable heuristic at best14–16. Thus, my essential reason for supporting self-publication is that in the worst-case scenario, self-publications must be accompanied by the disclaimer: “read the contents and decide for yourself.” As self-publishing practices are established, it is easy to imagine that these difficulties will be largely mitigated by self-published peer reviews and novel infrastructures supporting these interactions.

Indeed, with a little imagination we can picture plenty of potential benefits of self-publication to offset the risk that we might read poor papers. Researchers spend exorbitant amounts of their time reviewing, commenting on, and discussing articles – most of that rich content and meta-data is lost under the current system. In documenting the research practice more thoroughly, the ensuing flood of self-published data can support new quantitative metrics of reviewer trust, and be further utilized in the development of rich information about new ideas and data in near real-time. To give just one example, we might calculate how many subsequent citations or retractions a particular reviewer generates, yielding a reviewer impact factor and reliability index. The more aspects of research we publish, the greater the data-mining potential. Incentivizing in-depth reviews that add clarity and conceptual content to research, rather than merely knocking down or propping up equally imperfect artifacts, will ultimately improve research quality. By self-publishing well-documented, open-sourced pilot data and accompanying digital reagents (e.g. scripts, stimulus materials, protocols, etc), researchers can get instant feedback from peers, preventing uncounted research dollars from being wasted. Previously closed-door conferences can become live records of new ideas and conceptual developments as they unfold. The metaphor here is research as open-source – an ever evolving, living record of knowledge as it is created.
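To make the reviewer metric idea a little more concrete, here is a rough sketch of how such numbers might be computed. Everything in it is hypothetical: the `Review` record, the field names, and the two definitions (mean later citations per reviewed paper as "impact", and the share of reviewed papers that were not retracted as "reliability") are my own illustrative assumptions, not an established metric.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Review:
    """Hypothetical record of one self-published review."""
    reviewer: str
    citations: int   # citations later accrued by the reviewed paper
    retracted: bool  # whether the reviewed paper was later retracted


def reviewer_metrics(reviews):
    """Return illustrative per-reviewer impact and reliability scores."""
    grouped = defaultdict(list)
    for review in reviews:
        grouped[review.reviewer].append(review)

    metrics = {}
    for name, items in grouped.items():
        n = len(items)
        # Assumed definition: mean citations of the papers this person reviewed.
        impact = sum(r.citations for r in items) / n
        # Assumed definition: fraction of reviewed papers not retracted.
        reliability = 1 - sum(r.retracted for r in items) / n
        metrics[name] = {"impact": impact, "reliability": reliability}
    return metrics


# Made-up example data, purely for illustration.
reviews = [
    Review("A", 10, False),
    Review("A", 30, False),
    Review("B", 4, True),
    Review("B", 0, False),
]
metrics = reviewer_metrics(reviews)
```

A real index would need to handle far messier questions (who counts as a reviewer, how to weight positive versus negative reviews, time windows), so treat this only as a toy illustration of the kind of data-mining that self-published reviews could enable.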

Now, let’s contrast this model to the current publishing system. Every publisher (including open-access) obliges researchers to adhere to randomly varied formatting constraints, presentation rules, submission and acceptance fees, and review cultures. Researchers perform reviews for free for often publicly subsidized work, so that publishers can then turn around and sell the finished product back to those same researchers (and the public) at an exorbitant mark-up. These constraints introduce lengthy delays – ranging from 6+ months in the sciences all the way up to two years in some humanities disciplines. By contrast, how you self-publish your research is entirely up to you – where, when, how, the formatting, and the openness. Put simply, if you could publish your research how and when you wanted, and have it generate the same “impact” as traditional venues, why would you use a publisher at all?

One obvious reason to use publishers is copy-editing, i.e. the creation of pretty manuscripts. Another is the guarantee of high-profile distribution. Indeed, under the current system these are legitimate worries. While it is possible to produce reasonably formatted papers, open-source, easy-to-use copy-editing software is ideally needed to facilitate mainstream self-publication. Innovators like figshare are already leading the way in this area. In the next section, I will try to theorize some different ways in which self-publication can overcome these and other potential limitations, in terms of specific applications and guidelines for maximizing the utility of self-published research. To do so, I will outline a few specific cases with the most potential for self-publication to make a positive impact on research right away, and hopefully illuminate the ‘why’ question a bit further with some concrete examples.

Part 2: Where to begin self-publishing

What follows is the “how-to” part of this document. I must preface by saying that although I have written so far with researchers across the sciences and humanities in mind, I will now focus primarily on the scientific examples with which I am more experienced.  The transition to self-publication is already happening in the forms of academic tweets, self-archives, and blogs, at a seemingly exponential growth rate. To be clear, I do not believe that the new publication culture will be utopian. As in many human endeavors the usual brandism3, politics, and corruption can be expected to appear in this new culture. Accordingly, the transition is likely to be a bit wild and woolly around the edges. Like any generational culture shift, new practices must first emerge before infrastructures can be put in place to support them. My hope is to contribute to that cultural shift from artifact to process-based research, outlining particularly promising early venues for self-publication. Once these practices become more common, there will be huge opportunities for those ready and willing to step in and provide rich informational architectures to support and enhance self-publication – but for now we can only step into that wild frontier.

In my discussions with others I have identified three particularly promising areas where self-publication is either already contributing or can begin contributing to research. These are: the publication of exploratory pilot-data, post-publication peer reviews, and trial pre-registration. I will cover each in turn, attempting to provide examples and templates where possible. Finally, Part 3 will examine some common concerns with self-publication. In general, I think that successful reforms should resemble existing research practices as much as possible: publication solutions are most effective when they resemble daily practices that are already in place, rather than forcing individuals into novel practices or infrastructures with an unclear time-commitment. A frequent criticism of current solutions such as the comments section on Frontiers, PLOS One, or the newly developed PubPeer, is that they are rarely used by the general academic population. It is reasonable to conclude that this is because already over-worked academics currently see little plausible benefit from contributing to these discussions given the current publishing culture (worse still, they may fear other negative repercussions, discussed in Part 3). Thus a central theme of the following examples is that they attempt to mirror practices in which many academics are already engaged, with complementary incentive structures (e.g. citations).

Example 1: Exploratory Pilot Data 

This previous summer witnessed a fascinating clash of research cultures, with the eruption of intense debate between pre-registration advocates and pre-registration skeptics. I derived some useful insights from both sides of that discussion. Many were concerned about what would happen to exploratory data under these new publication regimes. Indeed, a general worry with existing reform movements is that they appear to emphasize a highly conservative and somewhat cynical “perfect papers” culture. I do not believe in perfect papers – the scientific model is driven by replication and discovery. No paper can ever be 100% flawless – otherwise there would be no reason for further research! Inevitably, some will find ways to cheat the system. Accordingly, reform must incentivize better reporting practices over stricter control, or at least balance between the two extremes.

Exploratory pilot data are an excellent avenue for this. By their very nature such data are not confirmatory – they are exciting in that they do not conform well to prior predictions. Such data benefit from rapid communication and feedback. Imagine an intuition-based project – a side or pet project conducted on the fly for example. The researcher might feel that the project has potential, but also knows that there could be serious flaws. Most journals won’t publish these kinds of data. Under the current system these data are lost, hidden, obscured, or otherwise forgotten.

Compare to a self-publication world: the researcher can upload the data, document all the protocols, make the presentation and analysis scripts open-source, and provide some well-written documentation explaining why she thinks the data are of interest. Some intrepid graduate student might find it, and follow up with a valuable control analysis, pointing out an excellent feature or fatal flaw, which he can then upload as a direct citation to the original data. Both publications are citable, giving credit to originator and reviewer alike. Armed with this new knowledge, the original researcher could now pre-register an altered protocol and conduct a full study on the subject (or alternatively, abandon the project entirely). In this exchange, it is likely that hundreds of hours and research dollars will have been saved. Additionally, the entire process will have been documented, making it both citable and minable for impact metrics. Tools already exist for each of these steps – but largely cultural fears prevent it from happening. How would it be perceived? Would anyone read it? Will someone steal my idea? To better frame these issues, I will now examine a self-publication practice that has already emerged in force.

Example 2: Post-publication peer review

This is a particularly easy case, precisely because high-profile scholars are already regularly engaged in the practice. As I’ve frequently joked on twitter, we’re rapidly entering an era where publishing in a glam-mag has no impact guarantee if the paper itself isn’t worthwhile – you may as well hang a target on your head for post-publication peer reviewers. However, I want to emphasize the positive benefits and not just the conservative controls. Post-publication peer review (PPPR) has already begun to change the way we view research, with reviewers adding lasting content to papers, enriching the conclusions one can draw, and pointing out novel connections that were not extrapolated upon by the authors themselves. Here I like to draw an analogy to the open source movement, where code (and its documentation) is forkable, versioned, and open to constant revision – never static but always evolving.

Indeed, just last week PubMed launched their new “PubMed Commons” system, an innovative PPPR comment system, whereby any registered person (with at least one paper on PubMed) can leave scientific comments on articles. Inevitably, the reception on twitter and Facebook mirrored previous attempts to introduce infrastructure-based solutions – mixed excitement followed by a lot of bemused cynicism – “bring out the trolls,” many joked. To wit, a brief scan of the average comment on another platform, PubPeer, revealed a generally (but not entirely) poor level of comment quality. While many comments seem to be on topic, most had little to no formatting and were given with little context. At times comments can seem trollish, pointing out minor flaws as if they render the paper worthless. In many disciplines like my own, few comments could be found at all. This compounds the central problem with PPPR; why would anyone acknowledge such a system if the primary result is poorly formed nitpicking of your research? The essential problem here is again incentive – for reviews to be of high quality there needs to be an incentive. We need a culture of PPPR that values positive and negative comments equally. This is common to both traditional and self-publication practices.

To facilitate easy, incentivized self-publication of comments and PPPRs, my colleague Hauke Hillebrandt and I have attempted to create a simple template that researchers can use to quickly and easily publish these materials. The idea is that by using these templates and uploading them to figshare or similar services, Google Scholar will automatically index them as citations, provide citation alerts to the original authors, and even include the comments in its h-index calculation. This way researchers can begin to get credit for what they are already doing, in an easy to use and familiar format. While the template isn’t quite working yet (oddly enough, Scholar is counting citations from my blog, but not the template), you can take a look at it here and maybe help us figure out why it isn’t working! In the near future we plan to get this working, and will follow up on this post with the full template, ready for you to use.

Example 3: Pre-registration of experimental trials

As my final example, I suggest that for many researchers, self-publication of trial pre-registrations (PR) may be an excellent way to test the waters of PR in a format with a low barrier to entry. Replication attempts are a particularly promising venue for PR, and self-publication of such registrations is a way to quickly move from idea to registration to collection (as in the above pilot data example), while ensuring that credit for the original idea is embedded in the infamously hard to erase memory of the internet.

One benefit of PR self-publication, rather than relying on for-profit publishers, is that PR templates can be easily open-sourced themselves, allowing various research fields to generate community-based specialized templates adhering to the needs of that field. Self-published PRs, as well as high quality templates, can be cited – incentivizing the creation and dissemination of both. I imagine the rapid emergence of specialized templates within each community, tailored to the needs of that research discipline.

Part 3: Criticism and limitations

Here I will close by considering some common concerns with self-publication:

Quality of data

A natural worry at this point is quality control. How can we be sure that what is published without the seal of peer review isn’t complete hooey? The primary response is that we cannot, just like we cannot be sure that peer reviewed materials are quality without first reading them ourselves. Still, it is for this reason that I tried to suggest a few particularly ripe venues for self-publication of research. The cultural zeitgeist supporting full-blown scholarly self-publication has not yet arrived, but we can already begin to prepare for it. With regard to filtering noise, I argue that by coupling post-publication peer review and social media, quality self-publications will rise to the top. Importantly, this issue points towards flaws in our current publication culture. In many research areas there are effects that are repeatedly published but that few believe, largely due to the presence of biases against null-findings. Self-publication aims to make as much of the research process publicly available as possible, preventing this kind of knowledge from slipping through the editorial cracks and improving our ability to evaluate the veracity of published effects. If such data are reported cleanly and completely, existing quantitative tools can further incorporate them to better estimate the likelihood of p-hacking within a literature. That leads to the next concern – quality of presentation.

Hemingway's thoughts on data.

Quality of presentation

Many ask: how in this brave new world will we separate signal from noise? I am sure that every published researcher already receives at least a few garbage citations a year from obscure places in obscure journals with little relevance to actual article contents. But, so the worry goes, what if we are deluged with a vast array of poorly written, poorly documented, self-published crud? How would we separate the signal from the noise?

The answer is Content, Presentation, and Clarity. These must be treated as central guidelines for self-publication to be worth anyone’s time. The Internet memesphere has already generated one rule for ranking interest: content rules. Content floats and is upvoted, blogspam sinks and is downvoted. This is already true for published articles – twitter, reddit, facebook, and email circles help us separate the wheat from the chaff at least as much as impact factor if not more. But presentation and clarity are equally important. Poorly conducted research is not shared, or at least is shared with vehemence. Similarly, poorly written self-publications, or poorly documented data/reagents are unlikely to generate positive feedback, much less impact-generating eyeballs. I like to imagine a distant future in which self-publication has given rise to a new generation of well-regarded specialists: reviewers who are prized for their content, presentation, and clarity; coders who produce cleanly documented pipelines; behaviorists producing powerful and easily customized paradigm scripts; and data collection experts who produce the smoothest, cleanest data around. All of these future specialists will be able to garner impact for the things they already do, incentivizing each step of the research process rather than only the end product.

Being scooped, intellectual credit

Another common concern is “what if my idea/data/pilot is scooped?” I acknowledge that particularly in these early days, the decision to self-publish must be weighed against this possibility. However, I must also point out that in the current system authors must also weigh the decision to develop an idea in isolation against the benefits of communicating with peers and colleagues. Both have risks and benefits – an idea or project developed in isolation can easily lead its author to over-estimate its quality or impact. The decision to self-publish must similarly be weighed against the need for feedback. Furthermore, a self-publication culture would allow researchers to move more quickly from project to publication, ensuring that they are readily credited for their work. And again, as research culture continues to evolve, I believe this concern will increasingly fade. It is notoriously difficult to erase information from The Internet (see the “Streisand effect”) – there is no reason why self-published ideas and data cannot generate direct credit for the authors. Indeed, I envision a world in which these contributions can themselves be independently weighted and credited.

Prevention of cheating, corruption, self-citations

To some, this will be an inevitable point of departure. Without our time-tested guardian of peer review, what is to prevent a flood of outright fabricated data? My response is: what prevents outright fabrication under the current system? To misquote Jeff Goldblum in Jurassic Park, cheaters will always find a way. No matter how much we tighten our grip, there will be those who respond to the pressures of publication by deliberate misconduct. I believe that the current publication system directly incentivizes such behavior by valuing end product over process. By creating incentives for low-barrier post-publication peer review, pre-registration, and rich pilot data publication, researchers are given the opportunity to generate impact for each step of the research process. When faced with the vast penalties of cheating due to a null finding, versus doing one’s best to turn those data into something useful for someone, I suspect most people will choose the honest and less risky option.

Corruption and self-citations are perhaps a subtler, more sinister factor. In my discussions with colleagues, a frequent concern is that there is nothing to prevent high-impact “rich club” institutions from banding together to provide glossy post-publication reviews, citation farming, or promoting one another’s research to the top of the pile regardless of content. I again answer: how is this any different from our current system? Papers are submitted to an editor who makes a subjective evaluation of the paper’s quality and impact, before sending it to four out of a thousand possible reviewers who will make an obscure decision about the content of the paper. Sometimes this system works well, but increasingly it does not2. Many have witnessed great papers rejected for political reasons, or poor ones accepted for the same. Lowering the barrier to post-publication peer review means that even when these factors drive a paper to the top, it will be far easier to contextualize that research with a heavy dose of reality. Over time, I believe self-publication will incentivize good research. Cheating will always be a factor – and this new frontier is unlikely to be a utopia. Rather, I hope to contribute to the development of a bridge between our traditional publishing models and a radically advanced not-too-distant future.


Our current systems of producing, disseminating, and evaluating research increasingly seem to be out of step with cultural and technological realities. To take back the research process and bolster the ailing standard of peer-review I believe research will ultimately adopt an open and largely publisher-free model. In my view, these new practices will be entirely complementary to existing solutions such as the p-curve5, open-source publication models6–8, and innovative platforms for data and knowledge sharing such as PubPeer, PubMed Commons, and figshare9,10. The next step from here will be to produce useable templates for self-publication. You can expect to see a PDF version of this post in the coming weeks as a further example of self-publishing practices. In attempting to build a bridge to the coming technological and social revolution, I hope to inspire others to join in the conversation so that we can improve all aspects of research.


Thanks to Hauke Hillebrandt, Kate Mills, and Francesca Fardo for invaluable discussion, comments, and edits of this work. Many of the ideas developed here were originally inspired by this post envisioning a self-publication future. Thanks also to PubPeer, PeerJ,  figshare, and others in this area for their pioneering work in providing some valuable tools and spaces to begin engaging with self-publication practices.


Excellent resources already exist for many of the ideas presented here. I want to give special notice to researchers who have already begun self-publishing their work either as preprints, archives, or as direct blog posts. Parallel publishing is an attractive transitional option where researchers can prepublish their work for immediate feedback before submitting it to a traditional publisher. Special notice should be given to Zen Faulkes whose excellent pioneering blog posts demonstrated that it is reasonably easy to self-produce well formatted publications. Here are a few pioneering self-published papers you can use as examples – feel free to add your own in the comments:

The distal leg motor neurons of slipper lobsters, Ibacus spp. (Decapoda, Scyllaridae), Zen Faulkes


Eklund, Anders (2013): Multivariate fMRI Analysis using Canonical Correlation Analysis instead of Classifiers, Comment on Todd et al. figshare.


Automated removal of independent components to reduce trial-by-trial variation in event-related potentials, Dorothy Bishop


Deep Impact: Unintended consequences of journal rank

Björn Brembs, Marcus Munafò


A novel platform for open peer to peer review and publication:


A platform for open PPPRs:


Another PPPR platform:



1. Henderson, M. Problems with peer review. BMJ 340, c1409 (2010).

2. Ioannidis, J. P. A. Why Most Published Research Findings Are False. PLoS Med 2, e124 (2005).

3. Peters, D. P. & Ceci, S. J. Peer-review practices of psychological journals: The fate of published articles, submitted again. Behav. Brain Sci. 5, 187 (2010).

4. Hunter, J. Post-publication peer review: opening up scientific conversation. Front. Comput. Neurosci. 6, 63 (2012).

5. Simonsohn, U., Nelson, L. D. & Simmons, J. P. P-Curve: A Key to the File Drawer. (2013). at <http://papers.ssrn.com/abstract=2256237>

6. MacCallum, C. J. ONE for All: The Next Step for PLoS. PLoS Biol. 4, e401 (2006).

7. Smith, K. A. The frontiers publishing paradigm. Front. Immunol. 3, 1 (2012).

8. Wets, K., Weedon, D. & Velterop, J. Post-publication filtering and evaluation: Faculty of 1000. Learn. Publ. 16, 249–258 (2003).

9. Allen, M. PubPeer – A universal comment and review layer for scholarly papers? | Neuroconscience on WordPress.com. Website/Blog (2013). at <https://neuroconscience.com/2013/01/25/pubpeer-a-universal-comment-and-review-layer-for-scholarly-papers/>

10. Hahnel, M. Exclusive: figshare a new open data project that wants to change the future of scholarly publishing. Impact Soc. Sci. blog (2012). at <http://eprints.lse.ac.uk/51893/1/blogs.lse.ac.uk-Exclusive_figshare_a_new_open_data_project_that_wants_to_change_the_future_of_scholarly_publishing.pdf>

11. Yarkoni, T., Poldrack, R. A., Van Essen, D. C. & Wager, T. D. Cognitive neuroscience 2.0: building a cumulative science of human brain function. Trends Cogn. Sci. 14, 489–496 (2010).

12. Bishop, D. BishopBlog: A gentle introduction to Twitter for the apprehensive academic. Blog/website (2013). at <http://deevybee.blogspot.dk/2011/06/gentle-introduction-to-twitter-for.html>

13. Hadibeenareviewer. Had I Been A Reviewer on WordPress.com. Blog/website (2013). at <http://hadibeenareviewer.wordpress.com/>

14. Tressoldi, P. E., Giofré, D., Sella, F. & Cumming, G. High Impact = High Statistical Standards? Not Necessarily So. PLoS One 8, e56180 (2013).

15. Brembs, B. & Munafò, M. Deep Impact: Unintended consequences of journal rank. (2013). at <http://arxiv.org/abs/1301.3748>

16. Eisen, J. A., MacCallum, C. J. & Neylon, C. Expert Failure: Re-evaluating Research Assessment. PLoS Biol. 11, e1001677 (2013).


PubPeer – A universal comment and review layer for scholarly papers?

Lately I’ve had a plethora of discussions with colleagues concerning the possible benefits of a reddit-like “democratic review layer”, which would index all scholarly papers and let authenticated users post reviews subject to karma. We’ve navel-gazed about various implementations ranging from a full-out reddit clone to a wiki, or even a full-blown torrent tracker with rated comments and mass piracy. So you can imagine I was pleasantly surprised to see that someone actually went ahead and put together a simple app to do exactly that.


PubPeer states that its mission is to “create an online community that uses the publication of scientific results as an opening for fruitful discussion.” Users create accounts using an academic email address and must have at least one first-author publication to join. Once registered, any user can leave anonymous comments on any article, which are themselves subject to up/down votes and replies.

My first action was of course to search for my own name:


Hmm, no comments. Let’s fix that:


Hah! Peer review is easy! Just kidding, I deleted this comment after testing to see if it was possible. Ostensibly commenting on your own papers is allowed so that authors can reply to comments, but it does raise the concern that you can leave whatever ratings you like on your own papers. In theory with enough users, good comments will be quickly distinguished from bad, regardless of who makes them. In theory…

This is what an article looks like in PubPeer with a few comments:


Pretty simple: any paper can be found in the database and users then leave comments associated with those papers. On the one hand I really like the simplicity and usability of PubPeer. I think any endeavor along these lines must very much follow the twitter design mentality of doing one (and only one) thing really well. I also like the use of threaded comments and upvotes/downvotes, but I would like to see child comments being subject to votes as well. I’m not sure if I favor the anonymous approach the developers went for, but I can see costs and benefits to both public and anonymous comments, so I don’t have any real suggestions there.

What I found really interesting was just to see this idea in practice. While I’ve discussed it endlessly, a few previously unforeseen worries leaped out right away. After browsing a few articles it seems (somewhat unsurprisingly) that most of the comments are pretty negative and nit-picky. Considering that most early adopters of such a system are likely to be graduate students, this isn’t too surprising. For one thing there is no such entity as a perfect paper, and graduate students are often fans of the kind of boilerplate nit-picks that form the ticks and fleas of any paper. If comments add mostly doubt and negativity to papers, it seems like the whole commenting process would become a lot of extra work for little author pay-off, since no matter what, your article is going to end up looking bad.

In a traditional review, a paper’s flaws and merits are assessed privately and then the final (if accepted) paper is generally put forth as a polished piece of research that stands on its own merits. If a system like PubPeer were popular, becoming highly commented would almost certainly mean having tons of nitpicky and highly negative comments associated with that manuscript. This could manipulate reader perceptions: highly commented PubPeer articles might receive fewer citations regardless of their actual quality.

So that bit seems very counter-productive to me and I am not sure of the solution. It might be something similar to establishing light top-down comment moderation and a sort of “reddiquette” or user code of conduct that emphasizes fair and balanced comments (no sniping). Or, perhaps my “worry” isn’t actually troubling at all. Maybe such a system would be substantially self-policing and refreshing, shifting us from an obsession with ‘perfect papers’ to an understanding that no paper (or review) should be judged on anything but its own merits. Given the popularity of pun threads on reddit, I’m not convinced the wholly democratic solution will work. Whatever the result, as with most solutions to scholarly publishing, it seems clear that if PubPeer is to add substantial value to peer review then a critical mass of active users is the crucial missing ingredient.

What do you think? I’d love to hear your thoughts in the comments.

How to reply to #icanhazpdf in 3 seconds

Yesterday my friend Hauke and I theorized about a kind of dream scenario- a totally distributed, easy to use, publication liberation system. This is perhaps not feasible at this point [1]. Today we’re going to present something that will be useful right now. The essential goal here is to make it so that anyone, anywhere, can access the papers they need in a timely manner. The idea is to take advantage of existing strategies and tools to streamline paper sharing as much as possible. Folks already do this- every day on twitter or in private, requests for papers are made and fulfilled. Our goal is to completely streamline this process down to a few clicks of your mouse. That way a small but dedicated group of folks – the Papester Collective – can ensure that #icanhazpdf requests are fulfilled almost instantly. This is a work in progress. Leave comments on how to improve and further streamline this system and join the collective!


Tweet (for example): “#icanhazpdf http://dx.doi.org/10.1523/JNEUROSCI.4568-12.2013”

Click: Here you can find more detailed instructions.



  1. Twitter: monitor #icanhazpdf requests
  2. Zotero and the Zotero browser plugin: after clicking on a DOI link or abstract page, just click the ‘Save to Zotero’ button to auto-grab PDFs
  3. Zotfile: automatically copies newly saved Zotero PDF files to a public Dropbox folder
  4. Dropbox: cloud storage system to seamlessly share files with anyone without login
  5. Dropbox linker: automatically adds links from your public folder to your clipboard
  6. Reply to request tweets: paste the URL from your clipboard and, if you want, add #papester

That’s it! Now you can just click request links, click the Zotero get PDF button, and CTRL+V a Dropbox direct download link in response!
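As a rough illustration of the last step, here is a minimal Python sketch of the reply logic: pull the DOI out of a request tweet and compose the reply text. This is only a sketch under stated assumptions. The function names, the simple DOI pattern, and the example link are hypothetical placeholders, and nothing here talks to Twitter, Zotero, or Dropbox.

```python
import re

# Assumption: a simple pattern that catches typical modern DOIs ("10.<registrant>/<suffix>").
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(tweet: str):
    """Return the first DOI found in a request tweet, or None if there is none."""
    match = DOI_PATTERN.search(tweet)
    if match is None:
        return None
    # Strip trailing punctuation or quotes that are not part of the DOI.
    return match.group(0).rstrip('.,;"”')


def compose_reply(requester: str, pdf_url: str, tag: bool = True) -> str:
    """Build the reply text: requester handle, download link, optional #papester tag."""
    handle = requester if requester.startswith("@") else "@" + requester
    parts = [handle, pdf_url]
    if tag:
        parts.append("#papester")
    return " ".join(parts)


if __name__ == "__main__":
    tweet = "#icanhazpdf http://dx.doi.org/10.1523/JNEUROSCI.4568-12.2013"
    print(extract_doi(tweet))
    # Hypothetical link; in practice this is whatever ends up on your clipboard.
    print(compose_reply("requester", "https://example.org/paper.pdf"))
```

The point of keeping it this small is that the human still does the clicking; the script only saves the typing.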


[1] The fundamental problem: uploading huge repositories of scientific papers is not sensible for now. It’s too much data (50 million papers × 0.5–1.5 megabytes each comes to roughly 25–75 terabytes), and the likelihood of any given paper being downloaded is more uniformly distributed than for traditionally shared files like music. For instance, there are 100 million songs × 3.5 MB per song, and it is difficult to find exotic songs online. Some songs have decent availability now only because demand is concentrated on a few favourites; that is not so with papers. Also, fewer people will share papers than songs, which makes it even more difficult to sustain a complete repository. Thus, we need a system that fulfills requests individually.

Disclaimer: Please make sure you only share papers with friends who also have the copyrights to the papers you share.

Could a papester button irreversibly break down the research paywall?

A friend just sent me the link to the Aaron Swartz Memorial JSTOR liberator. We started talking about it and it led to a pretty interesting idea.

As soon as I saw this it clicked: we need papester. We need a simple browser plugin that can recognize, download and re-upload any research document automatically (think zotero) to BitTorrent (this was Aaron’s original idea, just crowdsourced). These would then be automatically turned into torrents with an associated magnet link. The plugin would interact with a lightweight torrent client, using a set limit of your bandwidth (say 5%) to constantly seed back any files you have in your (zotero) library folder. Also, it would automatically use part of the bandwidth to seed missing papers (first working through a queue of DOIs of papers that were searched for by others and then just for any missing paper in reverse chronological order), so that over time all papers would be on BitTorrent. The links would be archived by google; any search engine could then find them and the plug-in would show the PDF download link.
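To make the torrent step a little more concrete, here is a minimal Python sketch of how a plugin might derive a magnet link from a PDF it has just downloaded. It assumes the standard BitTorrent v1 single-file layout (a bencoded info dictionary hashed with SHA-1). The names `bencode` and `magnet_for_pdf`, the 256 KiB piece size, and the file name are illustrative placeholders, not part of any existing papester tool.

```python
import hashlib
from urllib.parse import quote


def bencode(value) -> bytes:
    """Encode ints, str/bytes, lists and dicts using BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must be byte strings, emitted in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def magnet_for_pdf(name: str, data: bytes, piece_length: int = 256 * 1024) -> str:
    """Build a v1 magnet link for a single file held in memory."""
    # Concatenated SHA-1 digests of each fixed-size piece of the file.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "length": len(data),
        "name": name,
        "piece length": piece_length,
        "pieces": pieces,
    }
    # The info-hash is the SHA-1 of the bencoded info dictionary.
    info_hash = hashlib.sha1(bencode(info)).hexdigest()
    return f"magnet:?xt=urn:btih:{info_hash}&dn={quote(name)}"


if __name__ == "__main__":
    # Hypothetical file contents standing in for a real PDF.
    print(magnet_for_pdf("paper.pdf", b"%PDF-1.4 example bytes"))
```

Because the info-hash depends only on the file name, length, piece size, and contents, two users who package the same PDF the same way would end up with the same magnet link, which is what lets a swarm form without central coordination.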

Once this system is in place, a pirate-bay/reddit mash-up could help sort the magnet links as a meta-data rich papester torrent tracker. Users could post comments and reviews, which would themselves be subject to karma. Over time a sorting algorithm could give greater weight to reviews from authors who consistently review unretracted papers, creating a kind of front page where “hot” would give you the latest research and “lasting” would give you timeless classics. Separating the sorting mechanism – which can essentially be any tracker – and the rating/meta-data system ensures that neither can be easily brought down. If users wish they could compile independent trackers for particular topics or highly rated papers, form review committees, and request new experiments to address flagged issues in existing articles. In this way we would ensure not only an everlasting and loss-protected research database, but irreversibly push academic publishing into an open-access and democratic review system. Students and people without access to scientific knowledge could easily find forgotten classics and the latest buzz with a simple sort. We need a “research-reddit” rating layer – why not solve Open Access and peer review in one go?
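Here is one way the karma-weighted sort could look, as a toy Python sketch. Everything in it is an assumption made for illustration: the rating scale, the smoothed reviewer weight, the 30-day half-life for “hot”, and the one-year threshold for “lasting” are arbitrary choices, not a worked-out design.

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    rating: float  # assumed scale: -1.0 (poor) to +1.0 (excellent)


def reviewer_weight(reviewed: int, retracted: int) -> float:
    """Smoothed share of a reviewer's reviewed papers that were never retracted.

    A brand-new reviewer starts at 0.5 and drifts toward 1.0 or 0.0 over time.
    """
    return (reviewed - retracted + 1) / (reviewed + 2)


def paper_score(reviews, weights) -> float:
    """Sum of ratings, each scaled by its reviewer's weight (0.5 if unknown)."""
    return sum(weights.get(r.reviewer, 0.5) * r.rating for r in reviews)


def hot_rank(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """'Hot': the score decays by half every half_life_days, favouring new work."""
    return score * 0.5 ** (age_days / half_life_days)


def lasting_rank(score: float, age_days: float, min_age_days: float = 365.0) -> float:
    """'Lasting': only papers past a minimum age qualify, ranked by raw score."""
    return score if age_days >= min_age_days else 0.0


if __name__ == "__main__":
    weights = {"ann": reviewer_weight(8, 0), "bob": reviewer_weight(8, 8)}
    reviews = [Review("ann", 1.0), Review("bob", -1.0)]
    score = paper_score(reviews, weights)
    print(score, hot_rank(score, 30.0), lasting_rank(score, 400.0))
```

Keeping the scoring functions separate from any particular tracker fits the idea above that the sorting mechanism and the rating layer should be independent of one another.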

Is this feasible? There are about 50 million papers in existence[1]. If we estimate about 500 kilobytes on average per paper, that’s 25 million MB of data, or 25 terabytes. While that may sound like a lot, remember that most torrent trackers already list much more data than this and that available bandwidth increases annually. If we can archive a ROM of every videogame created, why not papers? The entire collection of magnet links could take up as little as 1GB of data, making it easy to periodically back up the archive, ensure the system is resilient to take-downs, and re-seed less known or sought after papers. Just imagine it: all of our knowledge stored safely in a completely open collection, backed by the power of the swarm, organized by reviews, comments, and ratings, accessible to all. It would revolutionize the way we learn and share knowledge.
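The back-of-the-envelope numbers above are easy to check. The short Python sketch below reproduces the 25 terabyte estimate and shows one way to arrive at a link index of about 1 GB, namely by storing only a raw 20-byte SHA-1 info-hash per paper. That per-link size is an assumption for illustration; full magnet URIs with names and trackers would take more space.

```python
PAPERS = 50_000_000        # estimated number of papers in existence [1]
AVG_PAPER_BYTES = 500_000  # about 500 kilobytes per paper (decimal units)
INFOHASH_BYTES = 20        # assumption: store only the raw SHA-1 info-hash


def archive_size_tb(n_papers: int, avg_bytes: int) -> float:
    """Total archive size in terabytes (1 TB = 10**12 bytes)."""
    return n_papers * avg_bytes / 1e12


def link_index_size_gb(n_papers: int, bytes_per_link: int) -> float:
    """Size of the magnet-link index in gigabytes (1 GB = 10**9 bytes)."""
    return n_papers * bytes_per_link / 1e9


if __name__ == "__main__":
    print(archive_size_tb(PAPERS, AVG_PAPER_BYTES), "TB of papers")
    print(link_index_size_gb(PAPERS, INFOHASH_BYTES), "GB of info-hashes")
```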

Of course there would be ruthless resistance to this sort of thing from publishers. It would be important to take steps to protect yourself, perhaps through TOR. The small size of the files would facilitate better encryption. When universities inevitably move to block uploads, tags could be used to later upload acquired files quickly on a public-wifi hotspot. There are other benefits as well: currently there are untold numbers of classic papers available online in reference only. What incentive is there for libraries to continue scanning these? A papester-backed uploader karma system could help bring thousands of these documents irreversibly into the fold. Even in the case that publishers found some way to stifle the system, as with Napster the damage would be done. Just as we were pushed irrevocably towards new forms of music consumption – direct download, streaming, donate-to-listen – big publishers would be forced toward an open access model to recover costs. Finally such a system might move us closer to a self-publishing arXiv model. In the case that you couldn’t afford open access, you could self-publish your own PDF to the system. User reviews and ratings could serve as a first layer of feedback for you to improve the article. The idea or data – with your name behind it – would be out there fast and free.


Another cool feature would be a DOI search. When a user searches for a paper that isn’t available, papester would automatically add that paper to a request queue.
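A request queue like this is simple to sketch. The Python below is a hypothetical in-memory version: the class and method names are made up for illustration, and a real system would need shared, persistent storage. The only real-world fact it leans on is that DOIs are case-insensitive, so they are lower-cased before comparison.

```python
from collections import deque


class RequestQueue:
    """Tracks which DOIs are available and queues the ones that were searched for but missing."""

    def __init__(self, available=()):
        self.available = {doi.strip().lower() for doi in available}
        self.pending = deque()

    def search(self, doi: str) -> bool:
        """Return True if the paper is available; otherwise queue it once and return False."""
        doi = doi.strip().lower()
        if doi in self.available:
            return True
        if doi not in self.pending:
            self.pending.append(doi)
        return False

    def fulfil_next(self):
        """Mark the oldest pending request as available and return its DOI (None if empty)."""
        if not self.pending:
            return None
        doi = self.pending.popleft()
        self.available.add(doi)
        return doi


if __name__ == "__main__":
    queue = RequestQueue(available=["10.1087/20100308"])
    print(queue.search("10.1087/20100308"))                # already available
    print(queue.search("10.1523/JNEUROSCI.4568-12.2013"))  # missing, so it is queued
    print(queue.fulfil_next())
```

Seeders would then work through the pending DOIs in first-come, first-served order before falling back to any other missing papers.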


This is a thought experiment about an illegal solution and its possible consequences and benefits. Do with it what you will but recognize the gap between the theoretical and the actual!

[1] Arif Jinha (2010). Article 50 million: an estimate of the number of scholarly articles in existence. Learned Publishing, 23 (3), 258-263. DOI: 10.1087/20100308. A free pre-print is available from the author.

Researchers begin posting article PDFs to twitter in #pdftribute to Aaron Swartz

Yesterday, as I was completing my morning coffee and internet ritual, @le_feufollet broke the sad news to me of Aaron Swartz’s death. Aaron was a leader online, a brilliant coder and developer, and sadly a casualty in the fight for freedom of information. He was essential in the development of two tools I use every day (RSS and Reddit), and though his guerilla attempt to upload all papers on JSTOR was perhaps unstrategic, it was certainly noble enough in cause. Before his death Aaron was facing nearly 35 years in prison for his role in the JSTOR debacle, an insane penalty for attempting to share information. We don’t know why Aaron chose to take his life, but when @la_feufollet and I tried to brainstorm a tribute to him, my first thought was a guerilla PDF uploading campaign in honor of his fight for open access. I’m not much of an organizer, so I posted in one of the many rising reddit threads and hoped for the best:


My posts on reddit are usually ignored, so I went about my business and assumed it was the last I’d hear of it. It was amazing to wake up this morning and see that redditors had responded strongly to the idea and that a flood of tweets tagged #pdftribute had appeared:


Eva Vivalt coined the #pdftribute hashtag, and helped bring Anonymous on board. Currently there are thousands of authors posting their PDFs. It’s amazing to see that the original promise of the internet, the spread of ideas, is thriving. Lately I’ve been feeling a bit pessimistic, worried that the net was becoming an overly gamed, astroturf-ridden meme-preserve for advertisers to groom to their financial needs. It’s great to see that the most exciting power of our newfound connectivity, driving ideas to spread freely and have impact without the restrictions of traditional hierarchical barriers, continues to thrive. I hope #pdftribute lives on in both force and spirit, and that we can all begin working toward a world in which all publicly funded research is available to anyone with net access.

UPDATE 13/1/13 4:00 EST:

For those of you who don’t feel comfortable violating your copyright, but want to join in #pdftribute, your best bet is to check the specifics of your publisher agreement. Most journals allow you to upload a pre-print manuscript to your personal website. Then you can go ahead and tweet the link to your website or the individual pre-print PDFs. Jonathan Eisen has a helpful list of 10 ways to post your papers on twitter here.

Otherwise, hide in the swarm today as a show of support for Aaron. By standing together we show that the future of research publishing is freedom of information. But tomorrow remember that we need to push through real copyright reform. You can start by reading Aaron’s wonderful Guerilla Open Access Manifesto. If you are ready to commit to open access, you can sign the petition at http://thecostofknowledge.com/. There was also a We The People petition demanding legislation requiring journals to use an open-access publishing model. Woops, that petition has expired, so start a new one! As Matthew Green put it, let’s push for an Aaron Swartz copyright reform act.


Some nice folks have put together a link scraper to collect PDFs tagged #pdftribute here:



If I may make a humble suggestion- it may be useful to follow a specific format for sharing your papers. This will make them easier to find later, and for journalists to compile some sharing stats. Here is my suggested example.



Eva Vivalt reports #pdftribute getting 500 tweets/hr, >2.5 million impressions!
