15 Oct 2022

Nobody knows why the rank-frequency distribution of instructions in code follows Zipf's law. But does it have to be this way?

All software is made up of the instructions of some programming language. Instructions, comprising operators and core functions, are the atomic elements of code that do things such as adding two numbers or calculating the length of a sequence of characters.

If we look at any large piece of software it is clear that a few instructions are very common while many others are hardly ever used. Plotting the frequency of each instruction in descending order always gives approximately the same inverse power law distribution, in which an instruction's frequency is roughly inversely proportional to its rank. This is called Zipf's law.
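As a rough illustration of how such a ranking could be produced, the sketch below counts operators and built-in function calls in a piece of Python source and lists them from most to least frequent. It is only a sketch under stated assumptions: treating Python's binary operators, comparison operators and built-in functions as the "instructions" is a choice made for this example, and the names instruction_counts and rank_frequency are hypothetical. A tiny sample like this one is far too small to show a power law; a real measurement would need a large body of code.

```python
import ast
import builtins
from collections import Counter


def instruction_counts(source: str) -> Counter:
    """Count operators and built-in function calls in Python source text.

    Assumption for this sketch: "instructions" means binary operators,
    comparison operators and calls to names found in the builtins module.
    """
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp):
            counts[type(node.op).__name__] += 1
        elif isinstance(node, ast.Compare):
            for op in node.ops:
                counts[type(op).__name__] += 1
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if hasattr(builtins, node.func.id):
                counts[node.func.id] += 1
    return counts


def rank_frequency(counts: Counter) -> list:
    """Return (rank, name, count) tuples, most frequent first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, name, count)
            for rank, (name, count) in enumerate(ordered, start=1)]


# Hypothetical sample program used only to exercise the functions above.
sample = """
def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    m = mean(xs)
    return sum((x - m) * (x - m) for x in xs) / len(xs)
"""

for rank, name, count in rank_frequency(instruction_counts(sample)):
    print(rank, name, count)
```

Plotting count against rank on logarithmic axes for a large code base is the usual way to check for the straight line that Zipf's law predicts.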

Zipf's law was originally noted in the context of linguistics, where it applies to the word frequency of texts in human languages. Curiously, it is also observed in artificial languages as well as in a wide range of human-created systems, human and animal behaviour, and natural phenomena. These include the distribution of share prices, the populations of cities and even the sizes of craters on the Moon. Various explanations have been proffered for why this law is so ubiquitous, but it remains a mystery. A single cause for such a wide range of distributions would seem unlikely. Also, the tautological view that 'the most common are the most common ...' does nothing to explain the shape of the distribution.

In software development there are many possible factors that might influence the frequency of use of instructions in programs. Different problem domains and programming languages might be expected to change the distribution of instruction frequency, yet the same distribution appears regardless. The underlying machine instruction set certainly has a strong influence on the availability of programming language instructions but does not in itself dictate how they should be used. This leaves a range of mostly social considerations including developer education, collaboration and various forms of knowledge sharing. In effect, the relative frequency of instructions used in code is most likely a set of norms that are developed, disseminated and held collectively across teams, and ultimately the entire software development community. As with human language, it could be that focussing on a small vocabulary reduces cognitive load. In any event, repeated reuse of the same patterns utilising mostly the same subset of instructions is a self-reinforcing feedback process. This can also be viewed as an example of the golden hammer principle.

In any programming language there are often many different ways to write the same program. For a given problem most developers will produce some variation of a standard solution, using mostly identical instructions. At the same time it is also possible to produce many functionally equivalent programs that are very different, and most of these will use different subsets of instructions. Some of these solutions will be better or worse than the standard ones in some respect such as performance. However, the non-standard solutions will exhibit a much larger degree of variety. This increased variety is certain to include different and sometimes better algorithms. Human bias with respect to instruction frequency implicitly constrains how we develop code and means that we are missing out on possible advances in software.
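A small illustration of this point, not taken from any particular code base: the two hypothetical functions below are functionally equivalent for non-negative integers, yet they draw on different instructions. The first is the sort of loop most developers would write; the second uses multiplication and integer division and avoids the loop entirely.

```python
def sum_to_n_loop(n: int) -> int:
    """Add the integers 1..n one at a time (the 'standard' solution)."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total


def sum_to_n_formula(n: int) -> int:
    """Compute the same sum with the closed form n * (n + 1) / 2."""
    return n * (n + 1) // 2


# Both versions agree on every input tried here.
for n in range(0, 20):
    assert sum_to_n_loop(n) == sum_to_n_formula(n)
```

The loop uses addition and iteration over a range, while the closed form uses multiplication and integer division and runs in constant time. This is one example of a less common choice of instructions leading to a better algorithm.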

If Zipf's law in software development is largely a form of human bias, then will this also apply to software that is produced by an AI? The answer is probably 'yes' if the AI is created through training on a set of example programs written by humans. However, it doesn't have to be this way. Zoea is an AI that is built using explicit software development knowledge rather than through training with a set of examples. As a result it is not so constrained in terms of which instructions it uses in the code it creates. Taking this approach increases the potential for AI-generated software to produce innovative and optimal results.

In a sense, Zipf's law is a manifestation of the status quo in software development. Developer bias with respect to instruction usage is an expedient that likely makes it easier for people to learn and communicate about software development. However, it constrains our ability to deliver better solutions, or even any solution, to some problems. It probably isn't possible for human developers to operate any other way. AI coders like Zoea, on the other hand, have the potential to move beyond instruction frequency bias.