Introduction to Language Identification

Identifying the language of text programmatically
Quick, what language is this sentence written in?

“Hey bana bir sorununuz olur mu?”

What about this one?

“Halló ég er með vandamál getur þú hjálpað mér?”

Not easy, right?

Figuring out a document’s source language is an essential first step for many cross-language tools, and that’s why we’ve implemented a Language Identification algorithm.

Detecting languages falls under natural language processing, the field concerned with how computers can extract meaning and value from human languages.

What Is Language Identification?

Our implementation of Language Identification is a microservice that uses the Apache Tika framework and its LangIdentifier module. It uses an ensemble of pre-trained language models to determine which language, or languages, an unknown input document is written in.

This is how the Apache Tika Language Identification works:

First, a text corpus is built for each language by combining text from books, papers, reports, etc. into a large dataset that is as diverse as the language itself.

Tika then uses a tri-gram model structure, which simply trains a neural network to predict the next character given the previous three characters in the document. After scanning the whole corpus, you end up with a trained, language-specific character prediction model.
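To make the tri-gram idea concrete, here is a minimal sketch of how (previous three characters, next character) training samples could be pulled out of a piece of text with a sliding window. The function name trigram_samples is hypothetical and this is only an illustration of the sampling step, not Tika's own code.

```python
def trigram_samples(text):
    """Yield (previous three characters, next character) pairs from text."""
    # Slide a window over the text: three characters of context,
    # followed by the single character the model should learn to predict.
    for i in range(len(text) - 3):
        yield text[i:i + 3], text[i + 3]


samples = list(trigram_samples("hello"))
# samples == [("hel", "l"), ("ell", "o")]
```

Text shorter than four characters produces no samples at all, since there is no "next character" to predict.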

Now that you have a bunch of trained language models, here’s how Tika predicts languages:

Tika starts by scanning over the unknown input document and passing each tri-gram sample to every language model.

Each language model then tries to predict the next character in the sequence. The errors from each model are calculated and averaged over the tri-gram samples, producing an accuracy score for each language.

Finally, after the whole document has been scanned, the language accuracies are compared and the results are returned.
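The steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Tika's implementation: it swaps the trained prediction model for simple follower counts, the corpora are single made-up sentences, and the names train, accuracy, and identify are hypothetical.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Count which character follows each three-character context."""
    model = defaultdict(Counter)
    for i in range(len(corpus) - 3):
        model[corpus[i:i + 3]][corpus[i + 3]] += 1
    return model


def accuracy(model, document):
    """Fraction of positions where the model's top prediction is correct."""
    hits = total = 0
    for i in range(len(document) - 3):
        context, actual = document[i:i + 3], document[i + 3]
        total += 1
        followers = model.get(context)  # None if the context was never seen
        if followers and followers.most_common(1)[0][0] == actual:
            hits += 1
    return hits / total if total else 0.0


def identify(models, document):
    """Score the document against every model, best match first."""
    scores = {lang: accuracy(m, document) for lang, m in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpora (assumption: real models are trained on far larger datasets)
models = {
    "en": train("the quick brown fox jumps over the lazy dog and the cat"),
    "tr": train("hangi dilde konustugumu anlayabiliyor musun ben turkce konusuyorum"),
}
ranking = identify(models, "the dog and the fox")
```

With corpora this small the scores are only meaningful relative to each other, but the shape of the process is the same: one model per language, one score per model, and a final comparison.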

Why You Need Language Identification

Natural language models are generally specific to a particular language. If you aren’t sure what language an incoming document is written in, it’ll be incredibly difficult to provide your users with a good experience.

Imagine a technical support chat bot that could determine the language a user is writing in and reply with documentation in the same language. Or consider a sentiment analysis tool that could automatically detect both the language and the sentiment of text in any human language.

How Do I Use Language Identification?

Using the Language Identification algorithm is easy. All you need to do is provide the text you want to identify as a string, and the algorithm will take it from there.

Sample Input

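import Algorithmia

# The algorithm takes a JSON object with the text under the "sentence" key
input = {
  "sentence": "Hi, ben Turkce konusuyorum. Hangi dilde konustugumu anlayabiliyor musun?"
}

# Replace the placeholder with your own API key
client = Algorithmia.client('your api key here')
algo = client.algo('nlp/LanguageIdentification/1.0.0')
# Recent versions of the client return a response object; the payload is in .result
print(algo.pipe(input).result)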

Sample Output

[{"confidence": "0.9999969", "language": "tr"}]

That’s it! The algorithm is more than 99.99% confident that the input text is in Turkish (see the language key for the algorithm here). Now you have a tool you can use to determine the language of any kind of text or document with confidence!
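If you want to act on the response in code, note that the sample output reports the confidence as a string, so it needs converting before it can be compared. Here is a minimal sketch that picks the most likely language. The helper name best_guess is hypothetical, and it assumes that when several candidates are returned, each entry has the same two keys as the sample output.

```python
def best_guess(results):
    """Return (language, confidence) for the highest-confidence entry."""
    # Confidence arrives as a string in the sample output, so convert it
    # to a float before comparing candidates.
    top = max(results, key=lambda r: float(r["confidence"]))
    return top["language"], float(top["confidence"])


results = [{"confidence": "0.9999969", "language": "tr"}]
language, confidence = best_guess(results)
```

From there you could, for example, only trust the detected language when the confidence clears a threshold you choose.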

Want to get even more information from text? Take a look at our implementations of the Named Entity Recognition, Parsey McParseface, and Text Summarization algorithms to extract even more from your documents.

Algorithm Engineer at Algorithmia, empowering users by building state-of-the-art, production-ready algorithms to solve their unique challenges
