Algorithmia Blog - Deploying AI at scale

Train a Machine to Turn Documents into Keywords, via Document Classification


Figuring out the meaning of a document was once a very hard problem for computers to solve… even for humans, understanding the complexity of natural language can be tricky!

Fortunately, there are some great tools that can help address those concerns. The Document Classifier turns your existing documents and associated keywords into a model which can be used to predict the most appropriate keywords for new blocks of text.

What is the Document Classifier?

The Document Classifier is an algorithm that brings the power of document classification to everyone. It uses word vectors and the doc2vec algorithm to generate a latent representation of each document. This latent representation (also known as a document vector) is stored along with the other training data to form a document vector space, which is saved on our servers as a model file.

After your model file(s) are generated, you can start making predictions! A prediction is made by comparing the unlabelled document with nearby documents in the document vector space, using the k-nearest neighbors algorithm.
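To make the nearest-neighbor step concrete, here is a minimal sketch of k-nearest neighbors voting over a toy document vector space. It is an illustration only, not the Document Classifier's actual implementation: the three-dimensional vectors, the use of cosine similarity, the simple majority vote, and the names `cosine_similarity` and `knn_predict` are all assumptions made for this example.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    # Similarity of two vectors, ignoring their lengths (assumed metric).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn_predict(query_vector, labelled_vectors, k=3):
    # Rank the stored document vectors by similarity to the query.
    ranked = sorted(
        labelled_vectors,
        key=lambda item: cosine_similarity(query_vector, item[0]),
        reverse=True,
    )
    # Let the k closest documents vote with their keywords.
    votes = Counter(label for _, label in ranked[:k])
    total = sum(votes.values())
    return [(label, count / total) for label, count in votes.most_common()]

# Toy vector space: (document vector, keyword) pairs, invented for illustration.
vector_space = [
    ([0.9, 0.1, 0.0], "deep learning"),
    ([0.8, 0.2, 0.1], "deep learning"),
    ([0.1, 0.9, 0.0], "solar"),
    ([0.0, 0.8, 0.3], "solar"),
    ([0.1, 0.1, 0.9], "biological"),
]

predictions = knn_predict([0.85, 0.15, 0.05], vector_space, k=3)
print(predictions)
```

In this toy space, two of the three nearest neighbors carry the keyword "deep learning", so it comes out as the top prediction. A real model works with much longer vectors produced by doc2vec.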

TLDR:
• To train the algorithm, you provide documents (blocks of text), each of which is associated with keywords.
• Later, you provide a new document, and it tells you which keywords are most likely associated with it.

What can I do with it?

Since the algorithm is trainable on your own dataset and labels, it can be used for a wide variety of tasks. Here are a few examples…

Help desk auto-assignment


Does your team have a digital help desk? Just sifting through and separating customer feedback can be daunting, and is normally accomplished by forcing the user to create their messages via complicated and annoying support request tickets. With the Document Classifier, it’s possible to create your training dataset from your existing help desk data with little modification! Once you’ve trained a model, you can start making automatic predictions for each message — saving your organization time and money that could be used to improve your products.

Functional chatbots

Have you ever wanted to create your own Alexa or Google Home? The result might be a bit different from conventional classification tasks, but the internal processing is essentially the same. With a dataset encompassing directions and actions, you too can build a chatbot that can automate communications for your team and clients.

How can I use it?

This algorithm has two modes: training and prediction. Before you begin, you’ll need to create your training dataset, similar to this example.
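As a rough sketch of preparing such a file, the snippet below writes labelled documents to JSON. The `"text"` field mirrors the prediction input shown later in this post, but the `"label"` field name, the flat list layout, and the helper `build_training_dataset` are assumptions made for illustration; check the linked example for the schema the algorithm actually expects.

```python
import json
import os
import tempfile

# Hypothetical labelled documents: (text, keyword) pairs.
examples = [
    ("My invoice shows the wrong amount for March.", "billing"),
    ("The app crashes when I open the settings page.", "bug report"),
]

def build_training_dataset(pairs):
    # "text" matches the prediction input; "label" is an assumed field name.
    return [{"text": text, "label": label} for text, label in pairs]

dataset = build_training_dataset(examples)

# Write the dataset to a local JSON file, ready to upload.
local_dataset = os.path.join(tempfile.mkdtemp(), "training_dataset.json")
with open(local_dataset, "w", encoding="utf-8") as handle:
    json.dump(dataset, handle, indent=2)
```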

Once your dataset is ready, the most convenient way of passing it into an algorithm is by uploading it to the Data API. You can do this via the web interface, or in code:

import Algorithmia
remote_url = 'data://.my/collection/training_data.json'
local_dataset = '/path/to/my/training_dataset.json'
client = Algorithmia.client('YOUR_API_KEY_HERE')
client.file(remote_url).putFile(local_dataset)

Once your dataset is uploaded, you’re ready to train your own Document Classification model, which will be stored in its own namespace:

import Algorithmia
client = Algorithmia.client('YOUR_API_KEY_HERE')
input = {
    "data": "data://.my/collection/training_data.json",
    "namespace": "data://.my/my_new_classifier",
    "mode": "train"
}
result = client.algo('algo://nlp/DocumentClassifier/0.3.0').set_options(timeout=3000).pipe(input)

Once the training process completes, you’ll have a brand new classifier model that’s ready for prediction!

import Algorithmia
client = Algorithmia.client('YOUR_API_KEY_HERE')
input = {
    "data": [{
        "text": "Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as the and of. Other words that may seem visual can often be predicted reliably just from the language model e.g., sign after behind a red stop or phone following talking on a cell."
    }],
    "namespace": "data://.my/my_new_classifier",
    "mode": "predict"
}
result = client.algo('algo://nlp/DocumentClassifier/0.3.0').pipe(input)
print(result.result)

The prediction for the document above looks like this:

{
    "text": "Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as the and of. Other words that may seem visual can often be predicted reliably just from the language model e.g., sign after behind a red stop or phone following talking on a cell.",
    "topN": [{
        "prediction": "language",
        "confidence": 0.2066967636346817
    }, {
        "prediction": "deep learning",
        "confidence": 0.2053375542163849
    }, {
        "prediction": "world war 2",
        "confidence": 0.20009714365005493
    }, {
        "prediction": "solar",
        "confidence": 0.19612550735473633
    }, {
        "prediction": "biological",
        "confidence": 0.19174303114414215
    }]
}
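Note that the confidences in this sample are all close to 0.2, so the top keyword is only barely ahead of the others. A small post-processing step can help decide when a prediction is strong enough to act on. The sketch below assumes a prediction is a dictionary shaped like the sample output above; the helper `confident_predictions` and its `min_confidence` threshold are hypothetical names introduced for illustration, not part of the Document Classifier.

```python
def confident_predictions(prediction, min_confidence=0.5):
    """Return (keyword, confidence) pairs at or above min_confidence, best first."""
    ranked = sorted(
        prediction["topN"], key=lambda p: p["confidence"], reverse=True
    )
    return [
        (p["prediction"], p["confidence"])
        for p in ranked
        if p["confidence"] >= min_confidence
    ]

# The topN list from the sample output above.
sample = {
    "topN": [
        {"prediction": "language", "confidence": 0.2066967636346817},
        {"prediction": "deep learning", "confidence": 0.2053375542163849},
        {"prediction": "world war 2", "confidence": 0.20009714365005493},
        {"prediction": "solar", "confidence": 0.19612550735473633},
        {"prediction": "biological", "confidence": 0.19174303114414215},
    ]
}

strong = confident_predictions(sample, min_confidence=0.5)
borderline = confident_predictions(sample, min_confidence=0.2)
print(strong)
print(borderline)
```

With a threshold of 0.5, nothing in this sample qualifies, which is a hint that the model needs more training data before its keywords are assigned automatically.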

Want to improve your model on the fly? Just run the training step again, with new data but the same namespace, and it’ll augment your existing model with your brand new data!

import Algorithmia
client = Algorithmia.client('YOUR_API_KEY_HERE')
input = {
    "data": "data://.my/collection/more_training_data.json",
    "namespace": "data://.my/my_new_classifier",
    "mode": "train"
}
result = client.algo('algo://nlp/DocumentClassifier/0.3.0').set_options(timeout=3000).pipe(input)

That’s it! With fewer lines of code than a good algorithms quiz, we’ve created a specialized document classifier that can be continuously improved and will automatically scale to match your organization’s demands. Start using the Document Classifier and become your company’s tech-savvy, timesaving hero!

Algorithm Engineer at Algorithmia, empowering users by building state-of-the-art, production-ready algorithms to solve their unique challenges
