28 Jul 23

Implementing instruction digrams in Zoea turns out to be a simple matter of creating additional instruction subsets. These are produced up-front, meaning that instruction digrams have zero additional processing overhead at run-time.

The biggest problem that has always faced inductive programming is the enormous size of the search space. This is because most programming languages contain hundreds of instructions that can be combined in a multitude of different ways. As the size of target programs increase, the corresponding search space grows exponentially. This has led many people to conclude that the problem is basically impossible to solve. Fortunately, this is an incorrect assessment.

While programming languages may contain hundreds of instructions, the vast majority of individual programs use relatively few of them - often less than twenty. Therefore, if it were somehow possible to predict the instructions that are actually required for a given problem, then inductive programming would be much more tractable. This intuition led to the concept of instruction subsets in Zoea, where many subsets correspond to different predictions of the required instructions.

Human coders don't use all instructions with equal frequency and the co-occurrence of instructions within programs is highly skewed. This is effectively tacit programming knowledge and we can capture it by creating instruction subsets from a large sample of existing code. Zoea uses these subsets to reduce the size of the search space, often by tens of orders of magnitude.

Support for instruction subsets was designed into Zoea from the very outset. This means that every Zoea knowledge source can be activated with respect to any subset - which can be a single instruction, the complete instruction set or anything between.

Instruction subsets specify the combinations of instructions that can be used together to form programs. However, they say nothing about the actual structure of programs. Instruction digrams on the other hand prescribe how instructions can be assembled.

Each digram is simply a pair of instruction identifiers, which signify that the output of instruction 1 can be used as the input to instruction 2. If an instruction has multiple inputs then these correspond to different digrams.

Both instruction subsets and digrams can be viewed as constraints that apply to all generated programs. Each candidate solution only uses instructions from one subset, although very often it will use just some of them. Similarly, each instruction application in every program will correspond to one of the digrams.

The frequency distribution of instruction digrams in human code is also highly skewed and this is what makes them useful. Coders combine instructions in relatively few ways. Therefore, Zoea can potentially avoid the many combinations that people never use. As with subsets, the instruction digrams used in Zoea were also derived from a large code sample. Creating digrams is one thing but they also need to be integrated into Zoea in some way.

Implementing instruction digrams on top of Zoea's existing instruction subset mechanism could be accomplished in a number of ways. For example, we might simply enforce every digram as a constraint, for each instruction that we add to a candidate solution. Whilst this approach would work, the high amount of additional processing involved would seriously impact performance. A better approach is to apply instruction digrams through pre-processing.

One way to think about instruction digrams is in terms of their impact on instruction subsets. Although instruction digrams are primarily intended to describe how instructions can be put together, the side effect of applying instruction digrams to any subset is to create another subset.

In data flow terms, we can think of every instruction in a program as being some distance from the input. Instructions that use input directly are level 1, while those that use input with one level of indirection are level 2, and so on.

For any given subset, let us assume that all subset instructions could potentially be applied to program input. These level 1 instructions also dictate which digrams will apply, since instruction 1 in the digram needs to be one of these instructions. Clearly, instruction 2 in the digram also needs to be in the subset, but each of these is also an instruction at level 2. In effect, applying the digrams determines the possible instructions at level 2, and while this new subset can be the same size, it is frequently smaller.

We can repeat this process for any number of levels, producing different subsets for each level. At some stage, usually between levels 3 and 10, the subset size stabilises. What we end up with are variants of all instruction subsets for each data flow level. These can easily be pre-computed.

In effect, we have compiled all of the instruction digrams to produce a larger number of subsets. However, this is not a problem since Zoea already applies an instruction subset at each level, so it doesn't matter if the subsets are different. Also, implementing digrams in this way requires only a trivial change in the Zoea codebase.

With this approach, Zoea doesn't have to do any extra work at run-time in order to enforce instruction digrams. Rather, as we have noted previously, digrams provide a further huge reduction in the size of the search space. As a bonus, since digram compilation transforms digrams to subsets, this approach is applicable to other types of production system, such as rules, rewriting systems and grammars.