The Input and Output Encodings of the Neural Network

From the outside, a neural network is defined by its input and output encodings; its internal structure is an implementation detail.

For our present purposes, I'll assume that ml4debugging works on programs written in the Python programming language. Python ships with a module called ast, short for Abstract Syntax Trees, which can parse a fragment of Python code into an abstract syntax tree. The module's NodeVisitor class provides a visit method for walking the tree node by node (and ast.walk offers an even simpler traversal), so we can traverse the tree and emit a token for each node we encounter. With a master list of tokens, each mapped to a numerical vector, the sequence of nodes produced by the traversal becomes a sequence of numerical vectors, which is exactly the form of input a neural network needs. Using these tokens for the mapping is good news, because the master list of tokens is relatively small. A small space of inputs and outputs means the job can be done by a correspondingly small neural network, despite all the hype about deep networks, and a small network gives us a better chance of developing intuitions about its internal semantics.
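As a concrete illustration, here is a minimal sketch of that encoding step. The choice of token list (the AST node class names) and the example line are my own placeholder assumptions; the real project would fix its master token list up front.

```python
import ast

# A sketch of the encoding step, assuming the master token list is simply the
# set of AST node class names defined in the ast module.
NODE_TYPES = sorted(name for name, obj in vars(ast).items()
                    if isinstance(obj, type) and issubclass(obj, ast.AST))
TOKEN_TO_INDEX = {name: i for i, name in enumerate(NODE_TYPES)}

def encode_line(source):
    """Parse a fragment of Python code and map each AST node to a token index."""
    tree = ast.parse(source)
    return [TOKEN_TO_INDEX[type(node).__name__] for node in ast.walk(tree)]

# The printed indices stand in for Module, Assign, Name, BinOp, Call, ... and
# would typically be looked up in an embedding table to obtain the vectors.
print(encode_line("x = foo(1) + 2"))
```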

Next, we need to feed the neural network the error message generated by the Python interpreter as an additional input. Again, the vocabulary used in Python error messages is not very large, so it too can be mapped to a relatively small space of numerical input vectors.
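Here is a small, hedged sketch of how an error message might be captured and encoded. The tiny vocabulary shown is purely illustrative; in practice it would be built from the error messages seen in the training data.

```python
# Illustrative vocabulary; a real one would be built from observed messages.
ERROR_VOCAB = {"<unk>": 0, "nameerror": 1, "name": 2, "is": 3, "not": 4, "defined": 5}

def capture_error_message(source):
    """Run a code fragment and return the error message it raises, if any."""
    try:
        exec(compile(source, "<fragment>", "exec"), {})
        return None
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def encode_error(message):
    """Map each word of the error message to an index in the small vocabulary."""
    words = message.replace(":", "").replace("'", "").lower().split()
    return [ERROR_VOCAB.get(word, ERROR_VOCAB["<unk>"]) for word in words]

msg = capture_error_message("print(undefined_variable)")
print(msg)                # NameError: name 'undefined_variable' is not defined
print(encode_error(msg))  # [1, 2, 0, 3, 4, 5]
```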

Finally, we come to the English-language outputs of the neural network. The space of words used here is larger than in the previous two cases, but it is certainly smaller than the vocabularies handled successfully by recent machine translation projects, such as English to French or French to English.

A Recurrent Neural Network is the natural choice for such a project. To begin with, we will input only the line of code that the interpreter flags as containing the bug. The input sequence will then consist of just a few tokens, so we can start with a vanilla Recurrent Neural Network for its simplicity. Later, when we want to use information from multiple lines of code and the input sequences grow correspondingly longer, we can switch to LSTM networks, which are better at retaining information across longer sequences. For the implementation we can use Keras, a high-level framework well suited to prototyping and building neural networks.
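Below is a minimal Keras sketch along these lines. The vocabulary size, sequence length, and layer widths are placeholder assumptions, and as a first simplification the model picks one explanation from a fixed set of English sentences; a sequence-to-sequence decoder would replace the final layer once free-form output is needed.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A minimal sketch, assuming the flagged line of code and the error message
# have already been merged into one padded token sequence.
INPUT_VOCAB_SIZE = 500      # code tokens plus error-message words (assumed)
NUM_EXPLANATIONS = 50       # size of the fixed explanation set (assumed)
MAX_SEQUENCE_LENGTH = 40

model = keras.Sequential([
    keras.Input(shape=(MAX_SEQUENCE_LENGTH,)),
    layers.Embedding(INPUT_VOCAB_SIZE, 32, mask_zero=True),
    layers.SimpleRNN(64),   # vanilla RNN to start; swap in layers.LSTM later
    layers.Dense(NUM_EXPLANATIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```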

The primary challenge in putting this neural network together is generating a sufficiently large set of input-output pairs as training data. Ideally, there should be a way to automate at least part of this rather than hand-crafting every pair.
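One hedged sketch of such automation: start from fragments that are known to run, break them mechanically, execute the broken version to capture the real interpreter error, and attach a templated English explanation. The fragments, the mutation, and the explanation template below are illustrative assumptions, not a finished data-generation pipeline.

```python
# Known-good fragments; a real pipeline might mine these from open-source code.
GOOD_FRAGMENTS = [
    "total = 1 + 2\nprint(total)",
    "greeting = 'hello ' + 'world'\nprint(greeting)",
]

def drop_first_assignment(source):
    """Remove the first line (an assignment) so a later use raises NameError."""
    lines = source.splitlines()
    name = lines[0].split("=")[0].strip()
    return "\n".join(lines[1:]), name

def make_training_pair(source):
    """Return (buggy code, interpreter error, English explanation) or None."""
    buggy, name = drop_first_assignment(source)
    try:
        exec(compile(buggy, "<buggy>", "exec"), {})
        return None  # the mutation did not actually break the fragment
    except Exception as exc:
        error = f"{type(exc).__name__}: {exc}"
        explanation = f"The variable '{name}' is used here but was never assigned a value."
        return buggy, error, explanation

for fragment in GOOD_FRAGMENTS:
    print(make_training_pair(fragment))
```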