Programming Leftovers
-
We could run out of data to train AI language programs
The trouble is, the types of data typically used for training language models may be used up in the near future—as early as 2026, according to a paper by researchers from Epoch, an AI research and forecasting organization, that is yet to be peer reviewed. The issue stems from the fact that, as researchers build more powerful models with greater capabilities, they have to find ever more texts to train them on. Large language model researchers are increasingly concerned that they are going to run out of this sort of data, says Teven Le Scao, a researcher at AI company Hugging Face, who was not involved in Epoch’s work.
-
Congestion control algorithms are not fair
The Internet has many flows. It is important to have a mechanism that decides what share of the total available bandwidth is allocated to each flow. Given the large number of flows on the Internet, it is infeasible to do this in a centralized fashion. Hence more distributed and scalable mechanisms are needed. Congestion Control Algorithms (CCAs) form a key component of this infrastructure.
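As background for what such a distributed mechanism looks like, here is a minimal Python sketch of AIMD (additive increase, multiplicative decrease), the classic rule behind TCP-style congestion control; the window sizes, constants, and shared loss signal are illustrative assumptions, not anything taken from the article. In this idealized setting, synchronized losses pull the two windows toward roughly equal shares, which is the textbook fairness story that the article's title pushes back on.

```python
# Minimal AIMD (additive increase, multiplicative decrease) sketch.
# Each sender adjusts its own congestion window from local feedback only
# (were my packets dropped?), with no central coordinator deciding shares.

def aimd_step(cwnd: float, loss_detected: bool,
              increase: float = 1.0, decrease: float = 0.5) -> float:
    """Return the next congestion window given the current one and a loss signal."""
    if loss_detected:
        return max(1.0, cwnd * decrease)  # back off multiplicatively on congestion
    return cwnd + increase                # otherwise probe for more bandwidth

# Toy simulation: two flows share a link that can carry 100 packets per tick.
capacity = 100.0
windows = [10.0, 40.0]
for _ in range(200):
    congested = sum(windows) > capacity            # both flows see the same loss signal
    windows = [aimd_step(w, congested) for w in windows]

print([round(w, 1) for w in windows])  # the two windows end up close to an equal share
```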
-
Recognizing patterns in memory
Something I find frustrating is how hard it is to teach debugging skills. I think the biggest reason is that there are many things that can only be learned through experience. This is true for anything that requires pattern recognition. Our brains are great at recognizing patterns, but it often takes a large amount of practice to be able to identify useful patterns in data.
I can’t instantly give you pattern recognition skills with a short blog post, but I can tell you about some of the patterns that I look for so you can start to train your brain to see these as well. Recognizing patterns in memory can be useful as it can give you a hint for things like memory corruption, which are often some of the hardest errors to debug from a postmortem analysis. Getting a rough idea of what type of data is overwriting other data in a process can tell you where to look next for the source of memory corruption. It can help narrow down where an issue might be because the bug is usually near the code that wrote this data.
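As a concrete illustration of the kind of pattern spotting the post describes, here is a small hexdump helper in Python; it is a hypothetical sketch, not code from the post, and the "corrupted" buffer below is invented.

```python
# Print a hex + ASCII dump of a byte range so that common "shapes" of data
# stand out: runs of printable ASCII suggest string data, repeated 8-byte
# little-endian values in a plausible address range suggest pointers, and
# fills like 0x00/0xcc/0xdd often mean padding or allocator poison values.

def hexdump(data: bytes, width: int = 16) -> str:
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{width * 3}} |{ascii_part}|")
    return "\n".join(lines)

# Example: a struct-like buffer whose tail was clobbered by a C string.
corrupted = bytes.fromhex("40e2010000000000") + b"GET /index.html "
print(hexdump(corrupted))
```

In the printed dump, the ASCII column makes the overwriting request string jump out next to the pointer-sized value, which is the kind of hint about "what wrote this" that the post is talking about.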
-
Why is it so difficult to retrain neural networks and get the same results?
I’m writing this post because I find it fascinating that our systems have become so complex that an algorithm like matrix multiply that can be described in a few lines of pseudo-code can have implementations that produce such surprising and unexpected results. I’m barely scratching the surface of all the complexities of floating point math here, but many other causes of similar issues are emerging as we do more and more calculations on massive distributed networks of computers. Honestly, that’s one reason I enjoy working on embedded systems so much these days: I have a lot more confidence in my mental model of how those chips work. It does feel like it’s no longer possible for any one person to have a full and deep understanding of the whole stack on any modern platform, even just for software. I love digging into the causes of weird and surprising properties like non-determinism in training because it helps me understand more about what I don’t know!
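A concrete way to see one piece of what the post is describing: floating point addition is not associative, so reordering the same additions (as parallel or blocked matrix-multiply implementations routinely do) can change the result slightly. This is a minimal sketch, not from the post; the random values and seed are arbitrary.

```python
# Floating point addition is not associative, so the order in which partial
# sums are accumulated changes the result slightly. Parallel and blocked
# matrix-multiply kernels reorder these accumulations, which is one source
# of run-to-run differences in training.

import random

random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(values)             # accumulate left to right
backward = sum(reversed(values))  # same numbers, different order

print(forward == backward)        # frequently False
print(abs(forward - backward))    # typically a tiny but nonzero discrepancy
```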
-
Writing docs well: why should a software engineer care?
The first one is about building shared understanding among a document’s stakeholders. One of the hardest problems in software engineering is getting multiple people to have a sufficient understanding of some technical aspect, like the actual problem being solved, or a proposed solution. This is ostensibly the only real goal of technical writing.
Shared understanding is related to the idea of common ground that you’ll sometimes hear the safety folks talk about.
-
A Programmer-Friendly I/O Abstraction Over io_uring and kqueue
Consider this tale of I/O and performance. We’ll start with blocking I/O, explore io_uring and kqueue, and take home an event loop very similar to some software you may find familiar.
This is a twist on King’s talk at Software You Can Love Milan ‘22.
-
Animated population tree maps | Guy Abel
The global population hit 8 billion today. To mark the passing of this absolute population milestone, I created some animated tree map plots in R to visualize relative past and future population totals for all countries.
-
Cross-validation in Machine Learning - Data Science Tutorials
Cross-validation is a term that everyone who works with machine learning techniques will come across at some point.
We provide you with a quick overview of cross-validation in this blog post.
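Since this excerpt is only a teaser, here is a minimal sketch of what k-fold cross-validation looks like in practice. It uses scikit-learn in Python rather than the R workflow the tutorial covers, and the dataset and model are arbitrary stand-ins.

```python
# k-fold cross-validation: split the data into k folds, train on k-1 folds,
# evaluate on the held-out fold, and repeat so every fold is used for
# evaluation exactly once. The spread of scores estimates generalization.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
print(scores)         # one accuracy score per fold
print(scores.mean())  # average as the overall estimate
```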
-
Rapid dashboard prototyping - scottishsnow
I work in a startup called the Smart Data Foundry. We work with financial data to improve society. As a startup we have lots of ideas to try out, either internally or with collaborators. This is the idea of failing quickly – don’t spend a lot of time trying to find out if an idea works, get it to a testable state as fast as you can so you know if it is worth pursuing.
Here’s a talk I gave at FOSS4GUK local 2022 which explains a couple of ways we’re approaching this. If you’d like to know more about Shiny then check out the Shiny Developer Series.
-
essayreg2: Linear Regression (Cloze with Essay and File Upload)
Exercise template for interpreting a regression with two explanatory variables based on randomly-generated data (with either a linear, semi-logarithmic, or log-log relationship) in the form of a cloze including essay and file upload.
-
PCA for Categorical Variables in R
Using Principal Component Analysis (PCA) to reduce the dimensionality of your data frame may have crossed your mind.
However, can PCA be applied to a data set with categorical variables?
You’ll discover how to apply Principal Component Analysis (PCA) to data frames that include categorical variables in this course.
Additionally, you’ll discover how to use the R programming language to put these approaches into practice.
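The tutorial itself works in R. As a rough Python analogue of one common approach (one-hot encode the categorical columns so everything is numeric, scale, then run ordinary PCA), here is a hedged sketch; the column names and data are invented, and this is only one of several ways to combine PCA with categorical data.

```python
# One common workaround: one-hot encode categorical columns so the whole
# frame is numeric, standardize it, then apply ordinary PCA.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [42, 55, 31, 60, 48],
    "age":    [23, 35, 29, 41, 38],
    "region": ["north", "south", "south", "east", "north"],  # categorical
})

encoded = pd.get_dummies(df, columns=["region"])  # expand categories to 0/1 columns
scaled = StandardScaler().fit_transform(encoded)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print(pca.explained_variance_ratio_)  # share of variance captured by each component
```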
-
Kernel 5.15.79 compiled with vmd.ko builtin
Some time ago, Ramachandra Iyer purchased a new HP laptop, but found that EasyOS and some of the pups would not recognize the NVMe SSD at early bootup. This meant that he was unable to install any of these distros to the internal drive.
-
An objective criteria for deprecating community platforms | dean [blogs.perl.org]
Perl has been around for a couple of years longer than Python and Linux. Perl 5 was released in 1993, the same year as FreeBSD and NetBSD.
In the 90's, the "community platforms" for Open Source projects were Usenet newsgroups and mailing lists run on Listserv or Majordomo (Mailman didn't show up until 1999). IRC was used for text-based chat, but without SSL! CVS was the open source version control system of choice, or you might have been unlucky enough to use Visual SourceSafe at work, whilst Subversion wouldn't show up until 2000.
But the 90's are more than 20 years in the past and IPv6 is actually seeing meaningful adoption now. Many of the above technologies are as completely foreign to people with 10+ years of industry experience as Compact Cassettes, VHS, LaserDisc and maybe CDs or even DVDs.
-
Parsing RFC 3339 timestamps using strptime in Perl
An RFC 3339 timestamp can look like this: 2022-11-25T09:26:04+01:00. This is the format required by the Atom and JSONfeed specifications.
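The linked post does the parsing in Perl. As an illustration of the same strptime-style approach, here is the Python equivalent, where the %z directive accepts the colon-separated +01:00 offset on Python 3.7 and later:

```python
# Parse an RFC 3339 timestamp with strptime-style format codes.
# %z accepts a colon-separated UTC offset like +01:00 on Python 3.7+.

from datetime import datetime

stamp = "2022-11-25T09:26:04+01:00"
parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S%z")

print(parsed)              # 2022-11-25 09:26:04+01:00
print(parsed.tzinfo)       # UTC+01:00
print(parsed.timestamp())  # seconds since the Unix epoch
```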
-
2022 Malcolm Tredinnick Memorial Prize
Of course I’m happy that what I tried to do for Django was appreciated, and I really thank Sarah Abderemane, not only for the wonderful words she wrote in the nomination, but also for her great commitment to Django. She is a valuable member of our community, and I was very pleased to have met her in person during DjangoCon Europe 2022 in Porto.
-
What is the size of a byte[] array in Java?
How much memory does this array take? If you have answered “4 bytes”, you are wrong. A more likely answer is 24 bytes.
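For context, the arithmetic behind an answer like 24 bytes, assuming a typical 64-bit HotSpot JVM with compressed class pointers: an array carries an object header (8-byte mark word plus 4-byte class pointer) and a 4-byte length field before its payload, and the total is rounded up to an 8-byte boundary. The sketch below encodes that estimate; the layout constants describe one common configuration and are my assumption, not something stated in the excerpt.

```python
# Rough estimate of a Java byte[] array's footprint on a 64-bit HotSpot JVM
# with compressed class pointers (an assumed, but common, configuration):
# 8-byte mark word + 4-byte class pointer + 4-byte length field + payload,
# rounded up to an 8-byte boundary.

HEADER_BYTES = 8 + 4   # mark word + compressed class pointer
LENGTH_FIELD = 4       # arrays store their length next to the header
ALIGNMENT = 8          # object sizes are padded to 8-byte multiples

def estimated_byte_array_size(n_elements: int) -> int:
    raw = HEADER_BYTES + LENGTH_FIELD + n_elements
    return -(-raw // ALIGNMENT) * ALIGNMENT  # round up to the alignment boundary

for n in (0, 1, 4, 8, 9):
    print(n, estimated_byte_array_size(n))
# 0 -> 16, 1 -> 24, 4 -> 24, 8 -> 24, 9 -> 32
# A small array of a few bytes lands on 24, matching the answer quoted above.
```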