11 Sep 23

Instruction subsets generalise beyond the training set due to increased instruction co-occurrence variety obtained through clustering. Zoea uses the same principle to further boost generalisation significantly in a process called amplification.

If fifty years of research has taught us anything, it is that hardest part in generating software automatically is determining which instructions are involved. Clearly, if we could in some way predict the required instructions then the problem becomes much simpler. This insight was the key inspiration for the instruction subset approach in Zoea.

Every high level programming language provides a set of instructions, comprising operators and core functions. We can define any number of subsets of the complete instruction set and treat each of these as a prediction about the required instructions for any problem. However, there are very many possible subsets. So what combinations of instructions do people use when developing software?

For a start, human coders don't use all instructions with equal frequency. A few instructions are really common while most are rarely used. This situation becomes increasingly more extreme when we talk about pairs and larger groups of instructions. That means the vast majority of possible instruction combinations are never used in programs by human developers.

At the same time most programs use relatively few instructions. Many programming languages provide over 200 instructions, while the majority of problems will require less than 20 unique instructions and often less than 10. As a result we can limit and also quantise the size of any subsets we create.

So the key conjecture about instruction subsets is that coders use a relatively small number of combinations of instructions to solve the vast majority of problems. This is easily demonstrated by extracting the instruction subsets employed in a large sample of human originated code. We can also tidy up the subsets we extract in various ways such as by combining smaller subsets and discarding duplicates or any that also occur in larger subsets.

The resulting subsets enable us to produce solutions to any problem requiring a combination of instructions that was also present in the training set. But what about other combinations of instructions that weren't encountered during training? When we train an AI we expect it to generalise beyond the examples it has seen.

It turns out that the act of combining (or clustering) smaller subsets also increases the amount of instruction variety present. This is analogous to the way in which developers combine different pieces of code through composition. As a result, clustering gives a modest but useful increase in the generality of the subsets. A massive increase would be even better.

Now, we know that clustering increases instruction variety but we can also increase variety through the creation of a large number of additional artificial subsets. We do this by simply by combining every combination of two of the smaller subsets with one another. This gives roughly ((N^2)/2)-N artificial subsets - where N is the number of small subsets.

The creation of these artificial subsets is equivalent to the use of a very much larger code sample. This is useful since the subset distribution within code is highly skewed and increasingly rarer combinations of instructions occur with vanishingly smaller frequencies. In effect we have artificially boosted the distribution of rare but plausible combinations of instructions.

We call this artificial subset creation 'amplification' due to the vague similarity with the well known DNA technique. Amplification is an additional step in the subset creation process which occurs after subset extraction and before clustering.

The additional combinations of instructions that are produced through amplification are not random. This is because they are composed of groups that reflect human usage patterns. Instead, it is as though different chunks of code were combined in many novel ways and the resulting code was then added to the sample. For this reason the artificial subsets retain similar instruction frequency and co-occurrence distributions to that of human code.

We can show the benefit of amplification by training with only a small fraction of the code sample and then determining how much of the remaining code could, in principle, be generated using the subsets we created. With only 1% of the code sample we can produce subsets that can generate over 77% of the entire sample.

Amplification causes the subset coverage to quickly converge towards 100%. A large number of additional subsets are created but most of these are subsumed by bigger subsets during clustering. As a result amplification does not increase the number of resulting subsets dramatically. It does, however, have a dramatic impact on the ability of the subsets to generalise and create code using combinations of instructions not previously seen.

Depending on the subset size we can end up with hundreds or thousands of subsets. In a sense these subsets summarise the millions of programs used to produce them. The numbers of subsets involved may seem high, however, only particularly rare combinations of instructions would require most or all of the subsets to be used. Even then, the effort involved is an infinitesimal fraction of that required by other approaches. In the majority cases a solution can be found in the first 10 subsets.

The instruction co-occurrence patterns used by human coders represent tacit programming knowledge. Amplification allows us to expand that knowledge while still retaining the characteristics of human developed code.