Code and Tools


Overview

I have authored and continue to maintain a number of Open Source projects, primarily R code that compiles open access data sets. My open access software repository with CERN can be accessed via Zenodo.

Active development occurs on GitHub.

You can also view and download my Linux configuration (e.g. dot files, package lists, install scripts) in a continuously updated GitHub repository.

Principles

When I engage in data science, I am serious about the science part. That is why I am an avid proponent of Open Source software and strive to make my publications based on computational results fully reproducible. I fully endorse the famous admonition of Donoho (2010: 385):

An article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code and data, that produced the result.1

Naturally, this is easier said than done. As a matter of course I try to release all datasets I create as open access, publish the full source code (including version numbers of all dependencies) and make my computational results available with stable identifiers in long-term storage on Zenodo, the scientific repository of the high-energy physics research organization CERN.

Favorite Open Source Tools

My research is only made possible through the countless Open Source tools that I have been fortunate to be able to access and use for free. I would like to mention a few of my favorites as a way of saying ’thank you’ and just in case someone else finds them as useful as I do.

General

  • The Firefox Web Browser is my window to the Web (and probably the world, too, given the importance of the Web). Its speed and features are on par with Chrome, but it is developed as Open Source software by the non-profit Mozilla Foundation and is much more privacy-friendly.

  • The Thunderbird E-Mail Client is a strong Open Source e-mail client that handles multiple e-mail accounts, a calendar and a large address book well. I’ve used Thunderbird for almost two decades and it has always done what I needed it to. Thunderbird is also developed as Open Source software by a subsidiary of the non-profit Mozilla Foundation and is much more privacy-friendly than commercial alternatives.

  • The Mastodon Microblogging Software for the Fediverse makes the top of the list in my current communication toolbox. The Fediverse (decentralized social media) is my favorite place to hang out and engage with fellow scientists, lawyers and the general public. While I do maintain a robust presence on Linkedin, I prefer the avant-garde spirit of the Fediverse and the many fascinating and unique personalities who use Mastodon or any of the many other Fediverse platforms. You’ll likely see me posting my work to Mastodon first and being much more chatty and post-happy than elsewhere.

  • The Signal Messenger App is the primary tool I use to chat with friends and colleagues. While there are other encrypted messenger apps out there, Signal is in all likelihood the best and most trustworthy.

  • The KeepassXC Password Manager handles the ridiculous number of accounts and passwords that the modern web requires of its users. I do not know how many accounts I maintain and at this point I am afraid to count them.

  • The DuD-Poll scheduling service is a zero-footprint, host-proof and Open Source scheduling application that I like to use for scheduling meetings with many participants. It is hosted by the Technical University of Dresden and was developed by a consortium of several European universities.

Operating Systems

  • The Debian distribution of the Linux operating system serves my day-to-day and scientific computing needs very well.

  • You can download my current base Linux configuration (e.g. dot files, package lists, install scripts) for Debian from GitHub. For custom OS configurations please check the Docker files of individual projects.

  • Docker and Docker Compose are at the core of my reproducible programming workflow, as they allow me to freeze the versions of the OS, system packages and, via Rocker, the R version and R packages.

  • The Rocker Project (Boettiger and Eddelbuettel 2023) is my first port of call if I need version-controlled Docker images for R. I usually use the simple ‘r-ver’ line of images as a base for my own.
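To sketch what this version pinning looks like in practice (the version numbers below are placeholders, not the ones I actually use):

```dockerfile
# Base image pins the exact R version via the Rocker 'r-ver' line
FROM rocker/r-ver:4.3.2

# Pin R packages to specific versions as well
RUN Rscript -e "install.packages('remotes'); \
                remotes::install_version('data.table', version = '1.14.8')"
```

Because the r-ver images also freeze the underlying Debian package snapshot, rebuilding this image later reproduces the same system libraries, R version and R packages.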

Writing (Editor)

Give the Plain Person’s Guide to Plain Text Social Science (Healy 2019) a read to find out whether a plaintext-based workflow is for you!
  • Emacs (GNU Project 2023) is my primary editor for all serious writing and coding.

    • Emacs Speaks Statistics (ESS) (Maechler et al 2021) sees heavy use as an Emacs extension for the R Programming Language.
    • AUCTeX (GNU Project 2023) is my Emacs extension of choice for LaTeX.
    • markdown-mode (Blevins 2017) helps me edit Markdown files in Emacs.
    • Polymode permits the combination of several Emacs major modes. I use it most often to work on R Markdown files (.Rmd).
  • JabRef has loyally managed my literature database for years. I’m especially happy that the database is saved natively in BibTeX, which allows easy integration with LaTeX. My BibTeX database is also easily searchable and editable from Emacs, another plus.

Writing (Markup and Layout)

  • Markdown syntax (Gruber and Swartz 2004) is very good for notes and simple documents, and lies at the heart of literate programming with R Markdown.

  • The {rmarkdown} (Allaire et al 2023) and {knitr} (Xie 2021) R packages are my go-to solutions for writing reproducible reports and literate programming.
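A minimal literate document, just to illustrate the idea (title and chunk contents are invented):

````markdown
---
title: "Example Report"
output: html_document
---

The mean of the toy data is `r mean(c(1, 2, 3))`.

```{r scatter}
plot(cars)
```
````

Rendering such a file (e.g. with rmarkdown::render("report.Rmd")) executes the R code and weaves the results into the final document.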

  • The LaTeX (Knuth 1978; Lamport 1984) document preparation and typesetting system is my language of choice for writing and laying out complex documents. LaTeX can be intimidating at first, but Wikibooks offers a good introduction and reference.

Data Science (General)

On this page I follow the convention of naming R packages with braces. Example: {data.table}
  • The R Programming Language (R Foundation for Statistical Computing 2023) is the core of my data science workflow. I do almost everything complex with data or math in R. R is a fantastic language for statistical computing, no matter what the Python people say.

  • The Stan Modeling Language (Stan Development Team 2023) for Bayesian statistics is one of the most fascinating new tools that I am exploring at the moment. I am deeply dissatisfied with traditional frequentist statistics and it appears that the Bayesian approach provides convincing solutions to many of the challenges I face in my work. There are many excellent interfaces for R and other programming languages to integrate with Stan.
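To give a flavor of the language, here is a textbook coin-flip model (a standard example, not taken from my own work):

```stan
// Bernoulli model: estimate the success probability theta
data {
  int<lower=0> N;                     // number of trials
  array[N] int<lower=0, upper=1> y;   // observed outcomes
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(1, 1);                 // uniform prior
  y ~ bernoulli(theta);
}
```

From R, such a model can be compiled and sampled via interfaces like {rstan} or {cmdstanr}.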

  • The incredibly fast and stable {data.table} (Dowle and Srinivasan 2023) R package is part of every (!) project I work on. The speed is ridiculous, check out the benchmarks. The syntax is also very concise, which is excellent for lazy people such as myself.
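The concise dt[i, j, by] syntax in a nutshell (toy data, obviously):

```r
library(data.table)

dt <- data.table(group = c("a", "a", "b"), value = c(1, 2, 3))

# filter (i), compute (j) and aggregate (by) in a single call
dt[value > 0, .(total = sum(value)), by = group]
```

One bracketed expression replaces what would otherwise be a chain of subset, summarize and group operations.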

  • The {testthat} R package (Wickham 2023) sees more and more use the more experience I gain as a programmer. Automated testing is incredibly important to ensure that the assumptions you make during programming do not surprise you later. Surprises in programming never end well.
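A hypothetical test in this spirit, guarding an assumption that could otherwise fail silently later:

```r
library(testthat)

test_that("IDs are unique", {
  ids <- c(1, 2, 3)   # stand-in for a real identifier column
  # anyDuplicated() returns 0 when no duplicates exist
  expect_equal(anyDuplicated(ids), 0)
})
```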

Data Science (Workflows)

  • {targets} (Landau 2023) is one of my all-time favorite R packages. It forces me to divide my data pipelines into separate components, which can then be run individually in individual sessions. Results are stored for each successfully run component and re-runs of the pipeline only compute the parts for which the code, data or results have changed. The time savings are immense.
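A _targets.R pipeline sketch (file names and modeling steps are invented for illustration):

```r
# _targets.R
library(targets)

list(
  tar_target(raw_file, "data.csv", format = "file"),  # re-run if the file changes
  tar_target(raw, read.csv(raw_file)),
  tar_target(clean, na.omit(raw)),
  tar_target(model, lm(y ~ x, data = clean))
)
```

Running tar_make() executes the pipeline; on later runs only targets whose code or upstream data changed are recomputed.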

  • {tarchetypes} (Landau 2023) provides further useful functions for working with {targets} pipelines. Have a look around once you’ve gotten some basic {targets} workflows up and running!

  • {future} (Bengtsson 2023) is an excellent parallel computing framework for R. I use (almost) nothing else these days (and in fact, {targets} is built on {future} as its parallel framework). The {future} package allows for many different frontend and backend choices to customize parallel computing. I tend to use the base {parallel} or {callr} backends and the {future.apply} frontend API. See the overview of futureverse packages for more advice.

  • {future.apply} (Bengtsson 2023) provides parallel drop-in replacements for base R apply functions, such as future_lapply() for lapply(). I use it a lot.
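A minimal sketch of the pattern:

```r
library(future.apply)

plan(multisession, workers = 2)  # two background R sessions

# identical call signature to base lapply(), but evaluated in parallel
squares <- future_lapply(1:4, function(i) i^2)

plan(sequential)  # back to serial execution
```

Swapping the plan() is all it takes to move the same code between serial, multicore and cluster execution.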

Data Science (Text Analysis)

  • The {quanteda} (Benoit et al 2023) R framework for the quantitative analysis of text is the centerpiece of my natural language processing (NLP) workflows. It is clean, structured, well-documented and interfaces with many other NLP packages. Note that many of the core Quanteda functionalities from earlier versions have been moved to add-on packages like {quanteda.textstats}, {quanteda.textmodels} and {quanteda.textplots}.
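The current corpus → tokens → document-feature matrix workflow, on two toy documents:

```r
library(quanteda)

corp <- corpus(c(d1 = "Law and order.", d2 = "Order in the court."))

toks  <- tokens(corp, remove_punct = TRUE)  # tokenize, drop punctuation
dfmat <- dfm(toks)                          # document-feature matrix
```

The resulting dfm is the input for most downstream functions in {quanteda.textstats} and {quanteda.textmodels}.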

  • The {stringi} (Gagolewski 2023) R package is a key building block of quanteda. It is also my package of choice when I write low-level string-processing operations myself.

Data Science (Web Scraping)

  • The {rvest} (Wickham 2022) R package usually provides everything I need for web scraping static HTML pages.
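A typical static-page scrape, using example.com as a stand-in target:

```r
library(rvest)

# fetch the page once, then extract all link targets via a CSS selector
page  <- read_html("https://example.com")
links <- page |> html_elements("a") |> html_attr("href")
```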

  • Hard cases require {RSelenium} (Harrison 2022), which drives a headless Selenium browser that can do pretty much everything a human can do, but automatically. Most of the time I use Selenium only to render JavaScript pages and then process them with {rvest}.
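A sketch of that render-then-parse handoff, assuming a Selenium server is already listening on the default port 4444:

```r
library(RSelenium)

# connect to an already running Selenium server (address/port are assumptions)
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://example.com")

# hand the fully rendered HTML over to rvest for parsing
page <- rvest::read_html(remDr$getPageSource()[[1]])
remDr$close()
```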

  • {xml2} (Wickham, Hester and Ooms 2023) is useful for parsing XML files. XML comes up every once in a while, for example with the table of contents and the content files of the primary source of German federal legislation.
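The basic parse-and-query pattern, on a made-up XML fragment:

```r
library(xml2)

doc <- read_xml("<laws><law id='1'>First law</law><law id='2'>Second law</law></laws>")

xml_text(xml_find_all(doc, "//law"))        # node contents via XPath
xml_attr(xml_find_all(doc, "//law"), "id")  # attribute values
```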

Data Science (Visualization)

  • {ggplot2} (Wickham et al 2023) is probably one of the very best data visualization tools out there. It allows you to build beautiful graphics layer-by-layer and a massive number of add-on packages exist that extend the functionality of ggplot.
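The layer-by-layer idea in miniature, using the built-in mtcars data:

```r
library(ggplot2)

# points, then a linear trend, then labels -- each "+" adds a layer
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```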

  • {ggraph} (Pedersen 2022) is an extension of {ggplot2} for visualizing graphs/networks and is what I almost always use when I create formal network diagrams for publications.

  • Gephi (Bastian et al 2022) is an excellent GUI tool for plotting graphs/networks and very easy to use. I mainly run Gephi to poke and prod graphs a bit and to get a feel for them, with formal analysis left to {igraph} and {ggraph}. That being said, there are many researchers who don’t code and successfully use Gephi as their primary tool for network analysis.

Website

  • The {blogdown} R package (Xie, Thomas and Hill 2021) was used to create this website. It’s all written in Markdown and then compiled.

  • The Hugo website builder (Hugo Authors 2023) is the workhorse that powers blogdown.

  • Coder (de Prá 2021) is the theme of this website.

  • KaTeX typesets mathematical equations and symbols (Eisenberg and Alpert 2023).

  • Mermaid.js is a JavaScript diagramming and charting tool that renders flow charts and UML diagrams.
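Mermaid diagrams are written as plain text; this toy pipeline shows the flow-chart syntax:

```mermaid
flowchart LR
    A[Raw data] --> B[Cleaning]
    B --> C[Analysis]
    C --> D[Report]
```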

  1. The original sentiment — paraphrasing an idea of geophysicist Jon Claerbout — was published in Buckheit and Donoho 1995, although the exact quote is from Donoho 2010. See: Buckheit, Jonathan B and David L Donoho. 1995. ‘WaveLab and Reproducible Research’. In Wavelets and Statistics, edited by Anestis Antoniadis and Georges Oppenheim, 55–81. New York: Springer. See also: Donoho, David L. 2010. ‘An Invitation to Reproducible Computational Research’. Biostatistics 11 (3): 385–388. ↩︎