Programming Leftovers
-
Reproducible man-db databases
I’ve released man-db 2.11.0 (announcement, NEWS), and uploaded it to Debian unstable.
The biggest chunk of work here was fixing some extremely long-standing issues with how the database is built. Despite being in the package name, man-db’s database is much less important than it used to be: most uses of man(1) haven’t required it in a long time, and both hardware and software improvements mean that even some searches can be done by brute force without needing prior indexing. However, the database is still needed for the whatis(1) and apropos(1) commands.
The database has a simple format - no relational structure here, it’s just a simple key-value database using old-fashioned DBM-like interfaces and composing a few fields to form values - but there are a number of subtleties involved. The issues tend to amount to this: what does a manual page name mean? At first glance it might seem simple, because you have file names that look something like /usr/share/man/man1/ls.1.gz and that’s obviously ls(1). Some pages are symlinks to other pages (which we track separately because it makes it easier to figure out which entries to update when the contents of the file system change), and sometimes multiple pages are even hard links to the same file.
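The "first glance" reading described above can be sketched in a few lines of Python. This is only an illustration of the naive mapping from a file name to a page name, not man-db's actual code; the function name and the list of compression suffixes are assumptions made for the example, and it ignores the symlink and hard-link cases the post goes on to mention.

```python
from pathlib import PurePosixPath

# Compression suffixes to strip; this list is an assumption for the sketch.
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst", ".Z")


def page_name_from_path(path: str) -> str:
    """Turn a path like /usr/share/man/man1/ls.1.gz into 'ls(1)'."""
    filename = PurePosixPath(path).name
    for suffix in COMPRESSION_SUFFIXES:
        if filename.endswith(suffix):
            filename = filename[: -len(suffix)]
            break
    # Split on the last dot: the part before it is the page name,
    # the part after it is the section.
    name, dot, section = filename.rpartition(".")
    if not dot or not name or not section:
        raise ValueError(f"cannot parse manual page name from {path!r}")
    return f"{name}({section})"


print(page_name_from_path("/usr/share/man/man1/ls.1.gz"))
```

Splitting on the last dot keeps page names that themselves contain dots intact, which is one small example of why the naive reading needs care.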
-
Bayesian biostatistics procedures matching frequentist confidence intervals
Confidence intervals are commonly misinterpreted as there being, after observing the data, a 95% probability that the true parameter lies within the confidence interval. The usual explanation why this is incorrect is that the true parameter is not random, and so is either inside or outside the confidence interval. This explanation holds in the ‘relative frequency’ interpretation of probability associated with frequentist statistics.
However, as I have discussed previously, in the ‘subjective’ interpretation of probability associated with Bayesian statistics, we can assign a probability to the true parameter lying within a given interval. To do so implies that we are thinking of a particular likelihood function for the data, and a prior that would allow us to assign a probability to the true parameter lying within the interval before observing any data.
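As a concrete illustration, one textbook case where a Bayesian interval matches the frequentist one is a normal mean with known standard deviation and a flat (improper) prior: the posterior is then normal, centred on the sample mean with standard deviation sigma / sqrt(n). The Python sketch below compares the two intervals. It is not taken from the linked post, and the function names and sample data are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def confidence_interval(data, sigma, level=0.95):
    """Frequentist z-interval for a normal mean with known sigma."""
    n = len(data)
    xbar = mean(data)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width


def flat_prior_credible_interval(data, sigma, level=0.95):
    """Equal-tailed credible interval under a flat prior on the mean.

    With a flat prior, the posterior is Normal(xbar, sigma / sqrt(n)).
    """
    n = len(data)
    posterior = NormalDist(mu=mean(data), sigma=sigma / sqrt(n))
    tail = (1 - level) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)


data = [4.9, 5.3, 5.1, 4.7, 5.0]  # hypothetical measurements
print(confidence_interval(data, sigma=0.2))
print(flat_prior_credible_interval(data, sigma=0.2))
```

The two functions return the same endpoints up to floating-point error, but only the second supports the statement "there is a 95% probability that the parameter lies in this interval", and only under that particular likelihood and prior.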
-
Inference on Adaptively Collected Data
It is increasingly common for data to be collected adaptively, where experimental costs are reduced progressively by assigning promising treatments more frequently. However, adaptivity also poses great challenges for post-experiment inference, since observations are dependent, and standard estimates can be skewed and heavy-tailed. We propose a treatment-effect estimator that is consistent and asymptotically normal, allowing for constructing frequentist confidence intervals and testing hypotheses.
-
Highlights from Shiny in Production (2022)
Last week, we were very excited to host our first Shiny in Production conference! Attendees gathered in The Catalyst in Newcastle for two days of workshops and talks focusing on all things related to Shiny, building dashboards, and cool things you can do in R.
-
Common Statistical Tests in R - Part I – Musings on R – A blog on all things R and Data Science by Martin Chan
This post will focus on common statistical tests in R to understand and validate the relationship between two variables.
There must be tons of similar tutorials around, you may be thinking. So why?
The primary (and selfish) goal of the post is to create a guide that is practical enough for myself to refer to from time to time. This post is edited from my own notes from learning statistics and R, and has been applied to a data example/scenario that I am familiar with. This means that the examples should be easily generalisable and mostly consistent with my usual coding approach (mostly ‘tidy’ and using pipes). Along the way, this will hopefully benefit others who are learning statistics and R too.
-
Extract patterns in R? - Data Science Tutorials
R’s str_extract() function can be used to extract matching patterns from strings. It is part of the stringr package.
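For readers who do not have R to hand, the idea can be sketched in Python with the standard re module. This is an analogue rather than the stringr API: extract_first is a hypothetical helper that returns the first match in each string, or None where nothing matches (stringr returns NA in that case).

```python
import re


def extract_first(strings, pattern):
    """Return the first regex match in each string, or None if there is none."""
    compiled = re.compile(pattern)
    results = []
    for text in strings:
        match = compiled.search(text)
        results.append(match.group(0) if match else None)
    return results


# Hypothetical input strings, made up for the example.
print(extract_first(["order 123 and 456", "no digits here"], r"\d+"))
```

Only the first match per string is returned; collecting every match would need re.findall instead.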
-
The Missing Prelude to The Little Typer's Trickiest Chapter
It’s hard to find a textbook series garnering more effusive praise than The Little Schemer, The Little Prover, The Little Typer & co. The Little Typer introduces dependent type theory and is the first of the series I’ve read. I
-
Hackday - Group Solar Forecasts - Terence Eden’s Blog
Last week, I attended BrumPropHack - a hackathon in Birmingham which looked at problems with retrofitting homes to make them more energy efficient.
There were some great talks about the scale of the problem - both in terms of the number of properties which need improving and the cost of retrofitting. A bunch of teams showed off some impressive demos which aimed to tackle the issues.
-
"A Plea for Lean Software" by Prof. Niklaus Wirth
Memory requirements of today’s workstations typically jump substantially, from several to many megabytes, whenever there’s a new software release. When demand surpasses capacity, it’s time to buy add-on memory. When the system has no more extensibility, it’s time to buy a new, more powerful workstation. Do increased performance and functionality keep pace with the increased demand for resources? Mostly the answer is no. About 25 years ago, an interactive text editor could be designed with as little as 8,000 bytes of storage. (Modern program editors request 100 times that much!) An operating system had to manage with 8,000 bytes, and a compiler had to fit into 32 Kbytes, whereas their modern descendants require megabytes. Has all this inflated software become any faster? On the contrary. Were it not for a thousand times faster hardware, modern software would be utterly unusable.
-
CRAN and the Isoband Incident - Is Your Project at Risk and How to Fix It - R programming
The R community had a recent scare with the isoband package risking archival on CRAN. The reason why this incident made waves is that isoband is a ggplot2 dependency and when a package gets removed from CRAN all other packages that depend on it get removed as well (see CRAN policy). If isoband fell, ggplot2 would be at risk. And this would cascade with the removal of even more packages.
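The cascade described above is a walk over a reverse-dependency graph. The Python sketch below shows the idea with a breadth-first search. Only the isoband to ggplot2 edge comes from the text; the other package names are hypothetical, and the function is an illustration rather than any CRAN tooling.

```python
from collections import deque


def packages_at_risk(reverse_deps, removed):
    """Return every package that depends, directly or indirectly, on `removed`.

    `reverse_deps` maps a package name to the packages that depend on it.
    """
    at_risk = set()
    queue = deque([removed])
    while queue:
        current = queue.popleft()
        for dependent in reverse_deps.get(current, ()):
            if dependent not in at_risk:
                at_risk.add(dependent)
                queue.append(dependent)
    at_risk.discard(removed)  # guard against cycles leading back to the start
    return at_risk


# Hypothetical graph: only "isoband" -> "ggplot2" is taken from the text.
reverse_deps = {
    "isoband": ["ggplot2"],
    "ggplot2": ["plot_helper_a", "plot_helper_b"],
    "plot_helper_a": ["report_tool"],
}
print(sorted(packages_at_risk(reverse_deps, "isoband")))
```

Starting from a single low-level package, the search reaches everything above it, which is why the archival of one small dependency can threaten a large part of an ecosystem.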
-
Acing Virtual Events with Networking Sessions and Collaboration - R Consortium
The R Consortium recently caught up with Alyssa Columbus of R-Ladies Irvine (also on MeetUp and Twitter) to discuss the group’s progress during the pandemic. Alyssa discussed the group’s efforts to remain active and provide networking opportunities for its members. The group has also formed strong collaborative ties with other R user groups in Southern California.