Python Code Suggestions Using a Long Short-Term Memory RNN

This is a guest post by Daniël Heres, a software engineer & Computing Science student. Want to contribute your own how-to post? Let us know.
Algorithm for predicting Python code suggestions

An important tool for programmers is the code editor, or an integrated development environment (IDE). Code editors and IDEs often include, or are extended with, lots of useful features, like syntax highlighting, debugging, and code suggestions.

We created the Python Code Prediction microservice because code suggestions (or code completions) are often not very smart. The editor just shows you all possible matches without considering the context. Often, editors consider only part of the context, like the syntactical structure.

Example Python Code Completion

Example Javascript Code Completion

To make coding smarter, we trained a Recurrent Neural Network model (specifically an LSTM, using TensorFlow) on a dataset of open source Python projects available on GitHub.

The LSTM model learns to predict the next word given the word that came before. For training this model, we used more than 18,000 Python source code files, from 31 popular Python projects on GitHub, and from the Rosetta Code project.
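
To make the idea of next-word prediction concrete, here is a minimal sketch of how source code could be turned into (previous token, next token) training pairs. It uses Python's standard tokenize module and replaces numeric literals with a <NUM> placeholder. The helper names and this exact preprocessing are assumptions for illustration, not a description of how the model was actually trained.

```python
import io
import tokenize

# Token types kept as "words"; layout tokens (NEWLINE, INDENT, ...) are dropped.
KEPT_TYPES = {tokenize.NAME, tokenize.OP, tokenize.NUMBER, tokenize.STRING}


def to_tokens(source):
    """Split Python source into a flat list of token strings (hypothetical helper)."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in KEPT_TYPES:
            continue
        # Assumption: numeric literals are collapsed into one placeholder token.
        tokens.append("<NUM>" if tok.type == tokenize.NUMBER else tok.string)
    return tokens


def to_training_pairs(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


tokens = to_tokens("for i in range(10):\n    pass\n")
pairs = to_training_pairs(tokens)
print(tokens)
print(pairs[:3])
```

Each pair is one training example: the model sees the first element and is asked to predict the second.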

Don’t know what an LSTM is? LSTM stands for Long Short-Term Memory, a type of Recurrent Neural Network. Learn more about LSTMs.

From Model File to API Endpoint

We’ll add our model to Algorithmia, where it’ll become an API endpoint we can use to generate code predictions. We’ll go over the basics, but read the guide to hosting your TensorFlow model for in-depth instructions.

Start by creating a free account and a new data collection. This will make your model file available in the Data API. From the Data Collections page, select “Add Collection” and then “My Collections.” Upload your pickled model to your new data collection. The path to your file will look like this:

data://:username/:collection/:filename

With your model uploaded, let’s create the algorithm needed to generate a prediction. In this example, we’re writing a Python algorithm. Start by adding your dependencies from PyPI, with one line per package, such as tensorflow (or whatever tools you use).

We’ll then load our model using pickle.load and the TensorFlow checkpoints using the Saver.restore function. The data is accessible in this format:

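import pickle

import Algorithmia
import tensorflow as tf

client = Algorithmia.client()

# Load the pickle file: download it from the Data API to a local temp file
file_location = 'data://:username/:collection/:filename'
file_path = client.file(file_location).getFile().name
with open(file_path, 'rb') as f:
    my_loaded_file = pickle.load(f)

# Restore the TensorFlow model: download the checkpoint first
tf_model = 'data://:username/:collection/model.ckpt'
tf_path = client.file(tf_model).getFile().name

# Build the model graph before creating the Saver (TensorFlow 1.x API)
saver = tf.train.Saver()
# Then, inside a session: saver.restore(sess, tf_path)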

Read this guide to hosting your models for more information. From here, write the rest of your algorithm to process the user input and return the output (prediction). After some testing, the API is ready for release.
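
As a rough sketch of what the rest of your algorithm might look like: Algorithmia Python algorithms expose an apply(input) entry point, which receives the caller's input and returns the output. The predict_next function below is a hypothetical stand-in for the restored model (it returns canned values), and the input validation and default are assumptions, not the published algorithm's actual code.

```python
def predict_next(code, num_results):
    """Hypothetical stand-in for the restored LSTM; returns [token, probability] pairs."""
    canned = [["range", 0.63], ["xrange", 0.06], ["self", 0.03]]
    return canned[:num_results]


def apply(input):
    """Entry point: validate the request, run the model, return the predictions."""
    if not isinstance(input, dict) or "code" not in input:
        raise ValueError("Expected a JSON object with a 'code' field")
    num_results = int(input.get("num_results", 5))  # default chosen for illustration
    return predict_next(input["code"], num_results)


print(apply({"code": "for i in", "num_results": 2}))
```

Keeping model loading outside apply, at module level, means the model is loaded once rather than on every request.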

Predicting Code Suggestions

Here are some examples of code suggestions our API can predict:

Prediction (next token)
for i in
import numpy as
from __future__
def fun

As you can see, the predictions are pretty smart! Sample a longer sequence from our model by changing the input parameters.

Example API Call

Let’s call our algorithm and predict the next word for the string "for i in". In this example, we use the parameter "code" for our user’s input code, and "num_results" for the number of samples we want returned. We call it like this:

import Algorithmia

input = {"code": "for i in", "num_results": 5}
client = Algorithmia.client('API KEY')
algo = client.algo('PetiteProgrammer/pythoncodeprediction/1.0.3')
print(algo.pipe(input).result)

The result is an array of possible next words, each paired with its probability:

[
  ["range", 0.6302603483200073],
  ["", 0.11741144955158236],
  ["xrange", 0.06367699056863788],
  ["(", 0.02869287505745888],
  ["self", 0.02565359137952328]
]
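
In an editor plugin you would typically rank these pairs and drop unhelpful entries before showing them. Here is a small sketch that uses the response above as literal data. The helper name top_suggestions, the probability cutoff, and the choice to skip empty tokens are assumptions.

```python
result = [
    ["range", 0.6302603483200073],
    ["", 0.11741144955158236],
    ["xrange", 0.06367699056863788],
    ["(", 0.02869287505745888],
    ["self", 0.02565359137952328],
]


def top_suggestions(pairs, limit=3, min_probability=0.05):
    """Return the most likely non-empty tokens, best first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [tok for tok, prob in ranked if tok and prob >= min_probability][:limit]


print(top_suggestions(result))
```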

Now, let’s predict the next token sequence for the same string. This will return a prediction that completes our "for i in" statement. We want one result containing five tokens. Change the input to this, and call it again:

input = {"code": "for i in", "num_tokens": 5, "num_results": 1}

…and the result is:

[["range", "(", "<NUM>", ")", ":"]]

Meaning, the model predicted that you were going to type:

for i in range(<some numbers>):
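
Turning a predicted token sequence back into text for display is a matter of joining the tokens. The sketch below is one naive way to do it. The spacing rules and the render helper are assumptions, and <NUM> is left as a visible placeholder because it stands for some number rather than a concrete value.

```python
NO_SPACE_BEFORE = {"(", ")", ":", ",", "."}
NO_SPACE_AFTER = {"(", "."}


def render(prefix, tokens):
    """Append predicted tokens to the typed prefix with simple spacing rules."""
    text = prefix
    for token in tokens:
        tight = (
            not text
            or token in NO_SPACE_BEFORE
            or text[-1] in NO_SPACE_AFTER
        )
        text += ("" if tight else " ") + token
    return text


prediction = [["range", "(", "<NUM>", ")", ":"]]
print(render("for i in", prediction[0]))
```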

Cool, right? We can tune this further. We can adjust how conservative our predictions are by using the diversity parameter. The parameter ranges from 1.0 (most diverse) to 0 (very conservative). The default value is 0.5. A complete list of parameters is available on the Python Code Prediction algorithm page.
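
The post does not say how diversity is implemented. One common technique for this kind of knob is temperature scaling, where each probability is raised to the power 1/diversity and the result is renormalized. The sketch below illustrates that technique only; that the API works this way is a guess, and the rescale helper is hypothetical.

```python
def rescale(probabilities, diversity):
    """Sharpen (low diversity) or keep (diversity 1.0) a probability distribution."""
    if diversity <= 0:
        # Limit case: put all mass on the most likely entry.
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(probabilities))]
    weights = [p ** (1.0 / diversity) for p in probabilities]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.6, 0.3, 0.1]
for diversity in (1.0, 0.5, 0.0):
    print(diversity, [round(p, 3) for p in rescale(probs, diversity)])
```

Under this scheme, lower values concentrate probability on the likeliest token, which matches the "very conservative" end of the range described above.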

Here’s a code snippet showing how to use the Python Code Prediction API in your Python project. Fork a complete example from the Smart Python Code Suggestions GitHub repository.

import Algorithmia

# TODO: Replace this with your API key
client = Algorithmia.client('')
algo = client.algo('PetiteProgrammer/pythoncodeprediction/1.0.3')

# Retrieve 10 most likely next words
code = 'for i in'
options = {'code': code}
print(algo.pipe(options).result)

# Get samples of next 2 tokens
options = {'code': code, 'num_tokens': 2, 'diversity': 0.3}
print(algo.pipe(options).result)

That’s how easy it is to create an API from your trained model. Play with the Python Code Prediction algorithm in the console. Let us know @Algorithmia and @daniel_heres how the code predictions worked for you.

Algorithmia will provide 1M platform credits to any developer who creates a Python code predictions package/plugin for a code editor. If you have a feature request, comment on the algorithm page.

Daniël is a software engineer & Computing Science student currently researching whether Machine Learning models can improve Source Code Plagiarism Detection.
