Over the last few years Zoea has developed a unique set of heuristics that can reduce the size of the inductive programming search space by up to 100 orders of magnitude. Creating heuristics for Zoea knowledge sources could achieve a further reduction of a similar scale and in a much shorter timeframe.
Zoea is a knowledge-based inductive programming system that can generate code automatically from a set of test cases. It includes a lot of clever and innovative components that encode explicit knowledge about programming languages and software development. These reason about the provided test cases and incrementally assemble many partial and complete candidate solutions when solving a particular problem.
Sometimes, however, there are simply not enough clues in the test cases for any fruitful line of reasoning to be identified. In such cases, all inductive programming systems must resort to search, which involves systematically testing every possible combination of instructions until a solution is found. Very often, the number of possible code solutions that need to be considered can be astronomical. This has led some people to conclude that inductive programming is basically an impossible problem and that the technology is only suitable for toy programs.
From the outset, Zoea was conceived with the intention of overcoming this problem. Rather than simply being an insurmountable law of the universe, our view was that some fresh thinking might be required. After all, people routinely solve problems with apparent ease, which appear virtually impossible when formulated in AI terms. Perhaps, in the fifty or so years that scientists have been investigating this topic, they might have missed something really important.
Zoea is composed of a large number of autonomous experts that we call knowledge sources. Each of these encodes one or more elements of programming language or software development knowledge. Every knowledge source also produces a different type of code fragment that might appear in generated solutions. The knowledge sources are organised using a modern form of the blackboard architecture. Each knowledge source identifies situations where it can contribute to any one of a set of evolving partial solutions. Eventually, one or more of the candidate solutions will exhibit the required behaviour as described in the test cases.
Only a few of the hundreds of Zoea knowledge sources actually engage in search. Nevertheless, these used to consume the vast majority of the time and resources required to produce a solution. Over the last few years Zoea has developed a unique set of powerful heuristics that can shrink the search space - currently by up to 100 orders of magnitude. This means that Zoea can now produce software using search that is much larger than was previously possible.
The heuristics achieve this by splitting the search space in different ways and only exploring the tiny fractions that correspond to human-like code. Effectively, we exclude any code that has non-human-like characteristics - such as that which uses combinations of instructions which people never use. Our assumption is that solutions can be found in the human-like parts of the search space - a minuscule subset of the whole. By definition, every program that people have ever coded manually can also be found in these regions.
The upshot of all this is that search in Zoea requires much less effort than it did previously. While we are by no means done with the current work on search heuristics, it also means that the amount of work required for the majority of non-search-based knowledge sources now deserves some attention. Perhaps knowledge sources could also benefit from having their own heuristics - this is an idea that has been on the backlog for as long as Zoea has existed.
An obvious approach would be to develop a set of heuristics for knowledge sources that are similar to those employed in search. Indeed, there are clear parallels. For example, the key insight that resulted in the development of instruction subsets was based on the observation that each particular solution uses only a tiny selection of the available instructions. Therefore, if we could somehow predict the correct subset of instructions in advance then the search problem becomes very much easier. In the same way, each solution also involves a relatively small subset of all knowledge sources. There are hundreds of instructions in the Zoea instruction set and there are also hundreds of knowledge sources. This suggests that the potential benefits of knowledge source heuristics could also be considerable. Similar reasoning can be used to make a case for knowledge source equivalents of all search heuristics, including digrams, type signatures and instruction probabilities.
However, there is one problem with this approach. All of the existing search heuristics were ultimately derived from instruction co-occurrence patterns that were extracted from a large sample of human code. We don't yet have a sufficiently large corpus of Zoea solutions that we can use as the basis for knowledge source heuristics. It would also take an inordinate amount of time, effort and cost to produce such a dataset. Furthermore, we could never really be sure that we had enough or a sufficiently diverse set of examples.
Yet, there is another way that this can be accomplished. Here is the core principle of the approach - in simple terms. As we noted earlier, each knowledge source corresponds to a resulting fragment of a code solution. These fragments have similarities to the elements in a programming language grammar - but they are not the same thing. If we take any existing program and decompose it into a hierarchy of the right fragments then we can directly identify the knowledge source that corresponds to every fragment. To do this, we need to produce a simple reverse-mapping from each code fragment to the corresponding knowledge source. Matching fragments to knowledge sources can be made easier if the code is first standardised - for example, by replacing identifiers and values with something canonical such as their data type.
In combination, these mappings can be thought of as a very simple rewrite system and they operate in a manner similar to a chart parser. For any given program, this mapping would allow us to identify the set of Zoea knowledge sources that would be required to produce that code, and with little effort.
If we repeat the above process with very many programs then we would have the information required to produce the full set of knowledge source heuristics. Crucially, we would have the knowledge source co-occurrence patterns required to produce all of the solutions in whatever sample of code we used. These co-occurrences could then be clustered to produce knowledge source subsets. All of the other knowledge source heuristics could then be derived in ways that are analogous to the corresponding search heuristics.
As with the search heuristics, we believe that the extraction of knowledge source co-occurrence information from existing code is entirely ethical. We have no interest in anyone else's code or their algorithms and no such information would be captured, stored or integrated in any way. Rather, we would be applying an abstract categorisation and the same end result would be produced for many different programs by different authors. Clearly, a simple list of Zoea knowledge source identifiers cannot be anyone's intellectual property.
The benefits of knowledge source heuristics can only be quantified once they are actually constructed. However, this is expected to be significant - certainly 10s and possibly over 100 orders of magnitude. It could even be considerably more. It is worth noting that this reduction is completely separate to that achieved by the existing search heuristics and is therefore cumulative.
This work may have broader benefits, beyond simply improving performance. In producing these heuristics we are also discovering and surfacing tacit knowledge. Search is conceptually central to much of AI but its use can be an indication that we don't understand enough about the knowledge required to solve a particular problem. Sufficiently powerful heuristics trivialise search to the extent that we no longer need to worry or even think about it. Heuristics - or rather the tacit knowledge behind them - effectively constitute new forms of more or less direct reasoning.
People use a lot of explicit knowledge when they code but we have found that they also employ a great deal of tacit knowledge. Until recently, nobody had a clue that this is the case, let alone what that knowledge is. The same is probably also true in other domains beyond inductive programming.
It took just over 3 years to develop the Zoea search heuristics and achieve a search space size reduction of 100 orders of magnitude. Knowledge source heuristics could deliver similar benefits in a fraction of that time.
© Copyright 2025. All Rights Reserved. Zoea is a registered trademark in the UK.