~jelford / blog:

Longevity in software engineering

How long should we expect themes and technologies in software engineering to last? In this post, we'll look at how long individual projects "last" based on data from GitHub and try to draw conclusions around what that means for choosing foundational software.

Software comes and goes - except when it doesn't. We can all think of technologies like Fortran and COBOL that have been around forever, and continue to fulfil a useful purpose in their particular niche. When we're building something new, what kind of technologies should we build on? Should we seek maturity and stability - perhaps reducing the constant churn of keeping up with the latest updates, and benefiting from the battle-hardening effect of exposure to the real world and the passing of time? Or should we eschew staleness, and build on foundations that aren't already approaching the end of their shelf-life - benefiting from the latest advances in technology, and learning from the mistakes of those who went before? When we look at our dependencies on other projects, can we guess how long they will continue to be around for?

It goes without saying that our choice will depend on what it is we're building; are we most concerned with getting an idea in front of users ASAP, or with the ongoing cost of maintenance once we've shipped? Are we reasonably sure about the problem space, or do we expect a lot of iteration before we have the basics right?

what can we know?

One way to start answering these questions is to talk about the longevity of tools and projects. Can we make predictions about how long a given technology will be around, given its history? That's what we'll look at in the rest of this post: can we judge whether a project will still be around in a few years' time based on how long it has been around so far?

prior art

Inspired by both of these ideas, let's make a hypothesis:

The longer a project has been around, the longer its expected remaining life.

Notice, this prediction is obviously false for things with naturally bounded lifespans, since remaining life expectancy must decrease as you approach that upper bound. But what about software projects?
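
To make the contrast concrete, the quantity in question can be written as the mean remaining life m(t) of a lifetime T, given survival to age t. The two distributions below are illustrative assumptions of mine, not claims about the data: a uniform lifetime on [0, b] as an example with a hard upper bound, and a Pareto lifetime (scale x_m, shape alpha > 1) as an example with a heavy tail.

```latex
m(t) = \mathbb{E}[\,T - t \mid T > t\,] = \frac{1}{S(t)} \int_t^{\infty} S(x)\,dx, \qquad S(x) = \Pr(T > x)

\text{Uniform on } [0, b]:\quad S(x) = \frac{b - x}{b} \;\Rightarrow\; m(t) = \frac{b - t}{2} \quad (0 \le t < b)

\text{Pareto}(x_m, \alpha),\ \alpha > 1:\quad S(x) = \left(\frac{x_m}{x}\right)^{\alpha} \;\Rightarrow\; m(t) = \frac{t}{\alpha - 1} \quad (t \ge x_m)
```

In the bounded case the remaining life shrinks as age increases; in the heavy-tailed case it grows in proportion to age, which is the behaviour the hypothesis describes.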

what can we measure?

GitHub provide a BigQuery dataset comprising 3TB of data on activity around GitHub-hosted open source projects.

This dataset gives us a few things, but most usefully (I hope): commit history. Let's simplify things and say that a project's age is the time between the first and last commit, and that it's said to be "done" if there has been no commit in the last 3 months (i.e. exclude still-active projects from the sample).
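
As a minimal sketch of those two definitions, assuming commit times are Unix timestamps in seconds and approximating "3 months" as 90 days (both the function names and the 90-day figure are my own choices for illustration):

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
# Assumption: "3 months" is approximated as 90 days.
INACTIVITY_CUTOFF = timedelta(days=90)


def project_age_years(first_commit_sec: int, last_commit_sec: int) -> float:
    """Age = time between the first and last commit, in years."""
    return (last_commit_sec - first_commit_sec) / SECONDS_PER_YEAR


def is_done(last_commit_sec: int, now: datetime) -> bool:
    """A project is "done" if its last commit is older than the cutoff."""
    last_commit = datetime.fromtimestamp(last_commit_sec, tz=timezone.utc)
    return now - last_commit > INACTIVITY_CUTOFF


# Hypothetical project: first commit 2010-01-01, last commit 2015-01-01,
# evaluated as of 2020-01-01 (all UTC).
first = int(datetime(2010, 1, 1, tzinfo=timezone.utc).timestamp())
last = int(datetime(2015, 1, 1, tzinfo=timezone.utc).timestamp())
as_of = datetime(2020, 1, 1, tzinfo=timezone.utc)

age = project_age_years(first, last)
done = is_done(last, as_of)
print(f"age: {age:.2f} years, done: {done}")
```

Passing the reference time in explicitly, rather than reading the clock inside the function, keeps the classification reproducible when the analysis is re-run later.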

Here's the query I used to pull information about the first and last commits for each project:

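-- One row per (repository, commit); a single commit can appear in several repositories.
WITH commits_by_repo AS (
  SELECT rname AS repo_name, cs.committer.time_sec AS ts, commit
  FROM `bigquery-public-data.github_repos.commits` cs
  CROSS JOIN UNNEST(cs.repo_name) AS rname
  JOIN `bigquery-public-data.github_repos.sample_repos` sam_rs
    ON rname = sam_rs.repo_name
  -- Keep only repositories with more than 3 watchers.
  WHERE sam_rs.watch_count > 3
  -- Keep only commits timestamped at least 2674800 s (about 31 days) after the Unix epoch.
  AND cs.committer.time_sec >= 2674800
)
-- First and last commit time, plus the number of distinct commits, per repository.
SELECT
  cs.repo_name AS project_name,
  MIN(cs.ts) AS earliest_commit_sec,
  MAX(cs.ts) AS latest_commit_sec,
  COUNT(DISTINCT commit) AS number_of_commits
FROM commits_by_repo cs
GROUP BY project_name
ORDER BY earliest_commit_sec ASC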

And here's the notebook I used to analyze the results and produce the plots below. If you want to follow along you'll have to pull the data out of BigQuery and point the notebook at your own CSV file.
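
If you'd rather not use the notebook, here is a rough sketch of reading such an export with only the standard library. It assumes the CSV has a header row with the column names the query produces; the two data rows and the `load_project_ages` helper are made up for illustration.

```python
import csv
import io

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# Stand-in for the exported file; both rows are invented for illustration.
SAMPLE_CSV = """project_name,earliest_commit_sec,latest_commit_sec,number_of_commits
example/alpha,1262304000,1420070400,120
example/beta,1388534400,1391212800,7
"""


def load_project_ages(csv_file):
    """Return {project_name: age_in_years} from a CSV with the query's columns."""
    ages = {}
    for row in csv.DictReader(csv_file):
        first = int(row["earliest_commit_sec"])
        last = int(row["latest_commit_sec"])
        ages[row["project_name"]] = (last - first) / SECONDS_PER_YEAR
    return ages


# For a real export, pass open("your_export.csv", newline="") instead.
ages = load_project_ages(io.StringIO(SAMPLE_CSV))
for name, age in ages.items():
    print(f"{name}: {age:.2f} years")
```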

A few remarks:

In the query above, I've taken a few steps to try to filter out things that I'm guessing are going to be more noise than signal:

So, what do we find?

exploration

Here are some plots of the distributions of project age.

First, a straightforward plot of the distribution of project lifetimes. On the left is the whole dataset, and on the right are just those whose lifetimes exceed 10 years:

Graph showing the distribution plot of project lifetimes in years

There are only 32 projects in the dataset with an age > 20 years. That's such a small number compared to the rest of the sample that I'm going to exclude anything with age > 20 years from the following distribution plots. Chopping off this long tail may change the shape of the distribution, but I'll justify it by mentioning that even within this small sample, there are quite a few projects that are clearly noise rather than signal:

The cost is excluding (several forks of...) projects like Emacs.

Each of the following graphs is a seaborn distplot showing project age (in years) on the x-axis and the kernel density estimate on the y-axis.

First, the distribution of what's left after filtering:

Graph showing the distribution plot of project lifetimes in years, excluding those with age > 20 years

Next, the same data on a log scale (filtered for only projects with age < 20 years):

Graph showing the distribution plot of project lifetimes in years with a logarithmic y-axis

Finally, the same data on a log-log scale:

Graph showing the distribution plot of project lifetimes in years with a logarithmic y-axis and logarithmic x-axis
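
The reason for trying both scalings: an exponentially decaying density is a straight line when only the y-axis is logarithmic, while a power-law density is a straight line on log-log axes. The sketch below checks this on two synthetic curves (the decay rate 0.5 and the exponent -2 are arbitrary choices of mine, and neither curve is fitted to the GitHub data).

```python
import math


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


ages = [1, 2, 4, 8, 16]  # years; an arbitrary grid for illustration

# Two synthetic densities (not the GitHub data): exponential and power law.
exponential = [math.exp(-0.5 * a) for a in ages]
power_law = [a ** -2 for a in ages]

log_ages = [math.log(a) for a in ages]

# Log y-axis only: the exponential is a straight line with slope -0.5.
semi_log_slope = least_squares_slope(ages, [math.log(d) for d in exponential])
# Log-log axes: the power law is a straight line with slope -2.
log_log_slope = least_squares_slope(log_ages, [math.log(d) for d in power_law])

print(f"semi-log slope: {semi_log_slope:.3f}, log-log slope: {log_log_slope:.3f}")
```

Real data will be noisier than this, so a roughly straight stretch on either plot is a hint about the shape of the tail rather than proof of it.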

A couple of observations:

results

Now, back to our hypothesis:

The longer a project has been around, the longer its expected remaining life.
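
One way to test a hypothesis like this against a sample of finished projects is to take every project that lived beyond a given age and average how much longer each one lasted. Here is a minimal sketch of that calculation; the `mean_remaining_life` helper and the sample lifetimes are invented for illustration and this is not necessarily how the notebook computes it.

```python
def mean_remaining_life(lifetimes, current_age):
    """Mean of (lifetime - current_age) over projects that lived beyond current_age.

    Returns None when no project in the sample survived that long.
    """
    survivors = [t for t in lifetimes if t > current_age]
    if not survivors:
        return None
    return sum(t - current_age for t in survivors) / len(survivors)


# Made-up lifetimes in years, for illustration only (not the GitHub data).
sample = [0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 8.0, 12.0, 20.0]

for age in (0, 2, 5, 10):
    print(age, mean_remaining_life(sample, age))
```

Note that the number of survivors shrinks as the age threshold rises, so estimates at high ages rest on very few projects and can swing wildly.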

Let's look at two graphs of expected (mean) lifespan (on the y-axis), conditional on current lifespan in years (on the x-axis). The nth-percentile lines show the distribution of our sample data. First, let's zoom out and get the fullest picture we can; we'll re-include the 32 projects with a lifespan > 20 years:

Graph showing a steep upward slope of projects' remaining life expectancy after they pass 6 years of age

Woah there. A couple of things jump out:

What can we do about this?

I've done... a mix:

So, I have imposed an artificial limit on project life (30 years), but hopefully that's high enough that it's not going to skew the results for projects we consider (10 years or younger). I'd love to hear from someone more statistically literate if there's a better way to go about this - I'm sure there must be.

So, finally, here's a graph showing the remaining life expectancy of a project, given its current age.

Graph showing remaining life expectancy of projects up to 10 years of age

Nothing really jumps out here, so I'll venture some more modest conclusions:

further work

Better analysis:

Sources of bias in the data: