Language Selection

English French German Italian Portuguese Spanish

Accurate Conclusions from Bogus Data: Methodological Issues in “Collaboration in the open-source arena: The WebKit case”

Filed under
Sci/Tech

Nearly five years ago, when I was in grad school, I stumbled across the paper Collaboration in the open-source arena: The WebKit case when trying to figure out what I would do for a course project in network theory (i.e. graph theory, not computer networking; I’ll use the words “graph” and “network” interchangeably). The paper evaluates collaboration networks, which are graphs where collaborators are represented by nodes and relationships between collaborators are represented by edges. Our professor had used collaboration networks as examples during lecture, so it seemed at least mildly relevant to our class, and I wound up writing a critique on this paper for the class project. In this paper, the authors construct collaboration networks for WebKit by examining the project’s changelog files to define relationships between developers. They perform “community detection” to visually group developers who work closely together into separate clusters in the graphs. Then, the authors use those graphs to arrive at various conclusions about WebKit (e.g. “[e]ven if Samsung and Apple are involved in expensive patent wars in the courts and stopped collaborating on hardware components, their contributions remained strong and central within the WebKit open source project,” regarding the period from 2008 to 2013).

At the time, I contacted the authors to let them know about some serious problems I found with their work. Then I left the paper sitting in a short-term to-do pile on my desk, where it has been sitting since Obama was president, waiting for me to finally write this blog post. Unfortunately, nearly five years later, the authors’ email addresses no longer work, which is not very surprising after so long — since I’m no longer a student, the email I originally used to contact them doesn’t work anymore either — so I was unable to contact them again to let them know that I was finally going to publish this blog post. Anyway, suffice to say that the conclusions of the paper were all correct; however, the networks used to arrive at those conclusions suffered from three different mistakes, each of which was, on its own, serious enough to invalidate the entire work.

So if the analysis of the networks was bogus, how did the authors arrive at correct conclusions anyway? The answer is confirmation bias. The study was performed by visually looking at networks and then coming to non-rigorous conclusions about the networks, and by researching the WebKit community to learn what is going on with the major companies involved in the project. The authors arrived at correct conclusions because they did a good job at the later, then saw what they wanted to see in the graphs.

I don’t want to be too harsh on the authors of this paper, though, because they decided to publish their raw data and methodology on the internet. They even published the python scripts they used to convert WebKit changelogs into collaboration graphs. Had they not done so, there is no way I would have noticed the third (and most important) mistake that I’ll discuss below, and I wouldn’t have been able to confirm my suspicions about the second mistake. You would not be reading this right now, and likely nobody would ever have realized the problems with the paper. The authors of most scientific papers are not nearly so transparent: many researchers today consider their source code and raw data to be either proprietary secrets to be guarded, or simply not important enough to merit publication. The authors of this paper deserve to be commended, not penalized, for their openness. Mistakes are normal in research papers, and open data is by far the best way for us to be able to detect mistakes when they happen.

Read more

More in Tux Machines

digiKam 7.7.0 is released

After three months of active maintenance and another bug triage, the digiKam team is proud to present version 7.7.0 of its open source digital photo manager. See below the list of most important features coming with this release. Read more

Dilution and Misuse of the "Linux" Brand

Samsung, Red Hat to Work on Linux Drivers for Future Tech

The metaverse is expected to uproot system design as we know it, and Samsung is one of many hardware vendors re-imagining data center infrastructure in preparation for a parallel 3D world. Samsung is working on new memory technologies that provide faster bandwidth inside hardware for data to travel between CPUs, storage and other computing resources. The company also announced it was partnering with Red Hat to ensure these technologies have Linux compatibility. Read more

today's howtos

  • How to install go1.19beta on Ubuntu 22.04 – NextGenTips

    In this tutorial, we are going to explore how to install go on Ubuntu 22.04 Golang is an open-source programming language that is easy to learn and use. It is built-in concurrency and has a robust standard library. It is reliable, builds fast, and efficient software that scales fast. Its concurrency mechanisms make it easy to write programs that get the most out of multicore and networked machines, while its novel-type systems enable flexible and modular program constructions. Go compiles quickly to machine code and has the convenience of garbage collection and the power of run-time reflection. In this guide, we are going to learn how to install golang 1.19beta on Ubuntu 22.04. Go 1.19beta1 is not yet released. There is so much work in progress with all the documentation.

  • molecule test: failed to connect to bus in systemd container - openQA bites

    Ansible Molecule is a project to help you test your ansible roles. I’m using molecule for automatically testing the ansible roles of geekoops.

  • How To Install MongoDB on AlmaLinux 9 - idroot

    In this tutorial, we will show you how to install MongoDB on AlmaLinux 9. For those of you who didn’t know, MongoDB is a high-performance, highly scalable document-oriented NoSQL database. Unlike in SQL databases where data is stored in rows and columns inside tables, in MongoDB, data is structured in JSON-like format inside records which are referred to as documents. The open-source attribute of MongoDB as a database software makes it an ideal candidate for almost any database-related project. This article assumes you have at least basic knowledge of Linux, know how to use the shell, and most importantly, you host your site on your own VPS. The installation is quite simple and assumes you are running in the root account, if not you may need to add ‘sudo‘ to the commands to get root privileges. I will show you the step-by-step installation of the MongoDB NoSQL database on AlmaLinux 9. You can follow the same instructions for CentOS and Rocky Linux.

  • An introduction (and how-to) to Plugin Loader for the Steam Deck. - Invidious
  • Self-host a Ghost Blog With Traefik

    Ghost is a very popular open-source content management system. Started as an alternative to WordPress and it went on to become an alternative to Substack by focusing on membership and newsletter. The creators of Ghost offer managed Pro hosting but it may not fit everyone's budget. Alternatively, you can self-host it on your own cloud servers. On Linux handbook, we already have a guide on deploying Ghost with Docker in a reverse proxy setup. Instead of Ngnix reverse proxy, you can also use another software called Traefik with Docker. It is a popular open-source cloud-native application proxy, API Gateway, Edge-router, and more. I use Traefik to secure my websites using an SSL certificate obtained from Let's Encrypt. Once deployed, Traefik can automatically manage your certificates and their renewals. In this tutorial, I'll share the necessary steps for deploying a Ghost blog with Docker and Traefik.