Algorithmia Blog - Deploying AI at scale

Investigating User Experience with Natural Language Analysis

User experience and customer support are integral to every company’s success. But it’s not easy to understand what users are thinking or how they are feeling, even when you read every single user message that comes in through feedback forms or customer support software. With Natural Language Processing and Machine Learning techniques it becomes somewhat easier to understand trends in user sentiment, identify the main topics discussed, and detect anomalies in user message data.

A couple of weeks ago, we gave a talk about investigating user experience with natural language analysis at Sentiment Symposium and thought we’d share the talk, along with the speaker notes for anyone who is interested.

This talk will cover:

  • The benefits of investigating user feedback data
  • What kind of project checklist you should have before you start
  • The types of helpful techniques commonly used in text analysis
  • Steps to take when your analysis fails
  • What to do with your insights

Users have some important insights and suggestions for making your product better because they are the ones using it the most!

By analyzing user feedback you can:

  • Find hidden trends and patterns by doing a sentiment or topic based time series analysis
  • Find bugs and design issues that might have been missed in QA by doing anomaly detection
  • Gain new opportunities to run A/B tests based off of user feedback
  • Learn about feature requests/needs from users
  • Segment your customers for targeted messaging – for instance you can identify all the unhappy customers or people who asked for certain features and reach out to them if you have non-anonymous data

Before you start, run through a project checklist:

  • Have specific goals – for example, finding anomalies in user sentiment in real time
  • Know your data – is it big or small? Is it coming from multiple data sources, or do you have an interrupted time series?
  • Choose environments/libraries/APIs wisely
  • Choose analysis according to your goals
  • Decide who the analysis is going to and what they are going to do with it – is it for an A/B test? Is it going to the product lead? Are you generating reports or creating a real-time dashboard?

Knowing your data means understanding where your data is coming from. It will help you make smart decisions before you get started on your analysis. If your data is in data streams, you might want to choose to work with Spark Streaming or Apache Flink, especially if you’re working with large data that requires distributed computing resources to process it. If your use case doesn’t require streaming data but that’s your data source, you can always choose to create static files from your streams. This makes sense if you’re generating reports versus working on a real-time dashboard.

What if you have small data? This will affect what models you should consider when you’re doing your analysis – not all of them work well on small datasets or perform well on multi-class problems.

It’s important to understand the types of pre-processing techniques that might be needed to clean your data. User feedback data is notoriously dirty, containing HTML tags, grammatical errors, emoticon-like text (which you would want to transform into a token a sentiment classifier would pick up), and swear words that you would also want to keep when analyzing user sentiment.
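As a rough illustration of that kind of cleanup, here is a minimal sketch using only Python's standard library. The emoticon map and the token names (`emo_pos`, `emo_neg`) are made up for this example; you'd pick tokens that your own sentiment classifier knows about.

```python
import html
import re

# Hypothetical emoticon-to-token map; extend it for your own data.
EMOTICON_TOKENS = {
    ":)": "emo_pos",
    ":-)": "emo_pos",
    ":D": "emo_pos",
    ":(": "emo_neg",
    ":-(": "emo_neg",
}

TAG_RE = re.compile(r"<[^>]+>")


def clean_message(text):
    """Strip HTML tags, decode entities, and map emoticons to tokens."""
    text = TAG_RE.sub(" ", text)   # drop HTML tags
    text = html.unescape(text)     # e.g. "&amp;" becomes "&"
    # Replace longer emoticons first so ":-)" is handled before ":)".
    for emoticon in sorted(EMOTICON_TOKENS, key=len, reverse=True):
        text = text.replace(emoticon, f" {EMOTICON_TOKENS[emoticon]} ")
    return " ".join(text.split())  # collapse runs of whitespace


print(clean_message("<p>Love the new docs :)</p> but uploads &amp; auth fail :-("))
```

Note that nothing here removes swear words or lowercases the text, so those signals are still there for the sentiment model.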

You’ll also want to create custom stopword lists and ignore short comments that contain fewer than some threshold number of tokens. “I love this” isn’t insightful data: in a sentiment analysis it will skew your total sentiment scores without telling you what the user is referring to or why they feel the way they do. But if you’re looking at user sentiment after a big design release, you might want to keep such comments if you have a good idea what the user is referring to.
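Here is a small sketch of that filtering step. The stopword list and the threshold of three tokens are placeholders I picked for the example, and I'm assuming the token count is taken after stopwords are removed; adjust all of that to your own data.

```python
# Hypothetical custom stopword list and length threshold; tune both for your data.
CUSTOM_STOPWORDS = {"the", "a", "an", "is", "it", "this", "to", "i", "and", "when"}
MIN_TOKENS = 3


def tokenize(comment):
    """Lowercase, strip basic punctuation, and drop custom stopwords."""
    tokens = (word.strip(".,!?") for word in comment.lower().split())
    return [token for token in tokens if token and token not in CUSTOM_STOPWORDS]


def keep_informative(comments, min_tokens=MIN_TOKENS):
    """Keep only comments with at least `min_tokens` tokens after stopword removal."""
    kept = []
    for comment in comments:
        tokens = tokenize(comment)
        if len(tokens) >= min_tokens:
            kept.append(tokens)
    return kept


comments = [
    "I love this",
    "The Python client throws a timeout error when I upload a model",
]
print(keep_informative(comments))
```

With these settings “I love this” is dropped and the longer comment is kept as a token list.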

And of course, you should be using a stemmer or lemmatizer, and you will want to prune sparse (rarely occurring) terms, for example with the min_df setting of Scikit-Learn’s CountVectorizer or removeSparseTerms from R’s tm package, to help develop meaningful clusters and decrease the chance of creating a model that overfits to your data.
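For example, one way to drop rare terms with Scikit-Learn's CountVectorizer is its `min_df` parameter. The sketch below assumes scikit-learn is installed and uses a few made-up messages; the cutoff of two documents is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration only.
docs = [
    "python client error on upload",
    "python client timeout error",
    "model deploy error",
    "love the colorizer demo",
]

# min_df=2 drops terms that appear in fewer than two documents.
vectorizer = CountVectorizer(min_df=2)
matrix = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # terms that survived pruning
print(matrix.shape)                    # (number of documents, number of kept terms)
```

Only the terms shared by at least two messages remain, which is the kind of vocabulary you want feeding a clustering or classification model.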

Remember: Garbage in, garbage out.

Beyond your data, you should also think about the third party libraries, APIs, and environments you’re willing to rely on, because a library or API that doesn’t have backwards compatibility from version to version might break your model. This can be a headache if there are features you want with new releases, but you can’t upgrade your version because of breaking changes in the library.

Strong community and development support is also important, because without it libraries might be put out to pasture. Theano was a great deep learning framework, but it is no longer in active development due to lack of support and maintenance, so you’re likely to be looking at shifting to a framework that has both a big-name company behind it and a large community to support it, like TensorFlow or the up-and-coming PyTorch.

For example, on the slide you’re looking at is an infographic from Docker (a popular container platform) which has an active community of open source development and users from around the world, who go to Docker meetups and conferences and write blog posts and tutorials about using Docker. It’s a testament to what a community can do for a library or platform.

Next we’ll talk about popular environments and libraries before we move onto the types of analysis you might consider.

Now, when talking about environments, I’m talking about everything from Docker containers to Jupyter notebooks, to virtual environments and using pip as a package manager, or Conda.

I’ll do a brief survey of a few of these options, so you can start deciding which ones you want to choose for your project.

Much of this, of course, depends on whether you’re putting your code into production.

If you are, you might want to go with Docker to have a testing and production environment that’s consistent, as that way you won’t have any nasty surprises when pushing to prod. Also, if you’re putting your code into production, re-think using Jupyter notebooks. While they are great for collaboration, education, and experimentation, they don’t always promote writing production worthy code. For instance you likely aren’t using functions or classes in your notebook code and that makes it hard to write tests, which makes your code more likely to break in production.

As for dealing with library dependencies in Python, you’ll probably be choosing between pip (using virtual environments) or Conda, which has become the de-facto package manager for many data science projects.

But, something you might want to think about is that Conda does have a lot of bloat – as in libraries that are automatically installed – which can be annoying because you have to manually remove them. For a smaller footprint, there’s the miniconda option.

Now, a benefit of Conda is that it installs all dependencies – not just Python ones – which is handy, for instance, if you have C dependencies.

Something to note is that you can use pip to install dependencies within a Conda environment, but if you’re using Conda for package management, you can only install within its own environment. Yet, because of the versatility of Conda, it’s become my choice over time.

It’s important to think about your priorities when choosing libraries, because like environments they all have pros and cons, and you don’t want to end up relying on libraries that don’t fit your needs.

For instance, there are some important differences between NLTK and spaCy, which are both commonly used for data pre-processing: both offer lemmatization and stop word removal, and NLTK also ships several stemmers.

NLTK has more models to choose from than spaCy. For instance, NLTK has multiple stemmers available, but it also has machine learning models such as Naive Bayes and Decision Tree models, and languages other than just English; all features which spaCy lacks.

Also, NLTK is (right now at least) more stable than spaCy. What I mean by stable is that because spaCy is rather new, they haven’t ironed out some of their main modules, such as their lemmatizer. For instance, they’ve reversed certain decisions, such as removing the label -PRON- (for pronoun) in their lemmatizer; if your code expects that label and you update to spaCy 2 to take advantage of other features, the change can break your whole model. Note, though, that spaCy has great defaults and mostly works well out of the box, so again, it all depends on what you’re looking for.

When comparing spaCy and NLTK for model performance, it all depends on the model, so test them both to see what works best on your data. You might, in some cases, be trading accuracy for speed.

As an endnote to choosing libraries, I want to add that while Pandas (like notebooks) is one of those incredibly useful tools for data analysis and local development, any use of data frames is not great for production, because they are inherently slow. You might want to think about using NumPy arrays for production environments.

So when you’re thinking about investigating your user feedback, you’ll probably start with basic text mining techniques to find patterns in your data using clustering models such as K-means or dendrograms or topic modeling with LDA. Then you can move on to sentiment analysis, document classification, and even time series analysis and anomaly detection.

You can use different unsupervised techniques such as K-means clustering, dendrogram clustering, word clouds, named entity recognition, and more, but I want to point out that a lot of these techniques produce similar results (especially because many use Euclidean distance), and it’s probably not worth your time to create them all for one analysis. Choose the one that best fits your goal.

If you are creating a dashboard or generating reports, then maybe choose one with a visualization like a dendrogram to see clustering of words that often appear together. If you are doing information retrieval and document classification, then maybe choose LDA to get documents by topics.

If you are doing document classification and don’t have labeled data, then you could use LDA (Latent Dirichlet Allocation), which discovers topics in documents and assigns probabilities to the terms making up the topics.

Notice that on our slide we don’t have a label for our topics yet, but we have terms that might make up a topic. They would have probabilities assigned to them that indicate how well they fit into that grouping.
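To make the idea of unlabeled topics with term probabilities concrete, here is a minimal sketch using scikit-learn's LatentDirichletAllocation. It is not the model used for the slide; the messages are made up, and two topics is an arbitrary choice for such a tiny corpus.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages for illustration only.
docs = [
    "python client error message when calling the api",
    "error message from python client on timeout",
    "how do I deploy my model to the platform",
    "deploy model question about hosting my model",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # rows: documents, columns: topic probabilities

# Normalize each topic's term weights so they can be read as probabilities.
term_probs = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
# Terms in column order, so terms[i] matches column i of term_probs.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

for topic_idx, probs in enumerate(term_probs):
    top = probs.argsort()[::-1][:4]
    print(f"Topic {topic_idx}:", [(terms[i], round(float(probs[i]), 2)) for i in top])
```

The topics come out numbered rather than named; deciding what to call each one is still up to you.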

We could possibly infer that Topic 3 is about people asking questions regarding deploying their models to Algorithmia, while Topic 4 might be about problems learning how to use the Python client. Going back to our term frequency chart we might decide to call Topic 4 “Error Message”, or we could be more specific and call it “Python API Error Message”.

By pulling documents that have frequent terms you can focus on patterns in the UI that users might be having trouble with. For instance I made changes to our docs to make our Python and our model hosting docs clearer.

The above example is the result of mapping documents to their associated topic. This example is from using the Mallet LDA package on Algorithmia which generally gives decent results for an out-of-the-box model.

This document had a 60% probability of belonging to this topic, which I’ll call “Client API Errors”, so “node client option request server response cors error” would get the label “Client API Errors.”

I’d then use these mappings to create a training set for document classification since I don’t already have a labeled dataset.
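A sketch of that mapping step might look like the following. The topic labels, the probability vectors, and the 0.6 cutoff are all assumptions for the example; in practice the probabilities would come from whatever topic model you ran.

```python
# Hypothetical topic labels, chosen by a person after reading each topic's top terms.
TOPIC_LABELS = {0: "Client API Errors", 1: "Model Deployment"}
MIN_PROBABILITY = 0.6  # assumed cutoff for a "confident" topic assignment


def label_documents(docs, doc_topic_probs, min_probability=MIN_PROBABILITY):
    """Return (document, label) pairs for documents with a confident dominant topic."""
    training_set = []
    for doc, probs in zip(docs, doc_topic_probs):
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_topic] >= min_probability:
            training_set.append((doc, TOPIC_LABELS[best_topic]))
    return training_set


docs = [
    "node client option request server response cors error",
    "thanks for the quick reply",
]
doc_topic_probs = [[0.60, 0.40], [0.52, 0.48]]  # made-up topic model output
print(label_documents(docs, doc_topic_probs))
```

Documents without a clear dominant topic are left out, since a shaky label would only add noise to the training set.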

Once you have your labeled documents with their associated topics you can sort the messages by topic, or by sentiment if you’ve done a sentiment analysis, or even by both. That way you can triage your user messages so you can:

  • In real time, prioritize groups of users that have problems using your product or site and respond to them quickly. This is different from a chatbot system because its purpose is to identify larger trends or anomalies in real time rather than to respond in an automated way to single messages. For instance, you would want to know that a large number of users can’t access one of your APIs or are having issues with a form or a dropdown. Perhaps not surprisingly, QA doesn’t catch everything, and sometimes the problem is simply an unanticipated friction point in the UI or confusing text.
  • Be alerted to issues via anomaly detection on sentiment or on topics such as “API errors”, so your pipeline can automatically send those messages to the appropriate teams for immediate action or further investigation.
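As a simple picture of triage, the sketch below groups already-classified messages by topic and puts the most negative topics and messages first. The messages, labels, and the sentiment scale of -1 to 1 are invented for the example.

```python
from collections import defaultdict

# Made-up classified messages: (text, topic label, sentiment score in [-1, 1]).
messages = [
    ("Dropdown on the signup form is stuck", "UI Issues", -0.6),
    ("Getting a CORS error from the node client", "Client API Errors", -0.8),
    ("Docs were really clear, thanks!", "Docs", 0.9),
    ("API returns 500 on every call", "Client API Errors", -0.9),
]


def triage(messages):
    """Group messages by topic, most negative topics and messages first."""
    by_topic = defaultdict(list)
    for text, topic, sentiment in messages:
        by_topic[topic].append((sentiment, text))
    for items in by_topic.values():
        items.sort()  # ascending sentiment: most negative first
    # Order topics by their average sentiment, most negative first.
    return sorted(
        by_topic.items(),
        key=lambda pair: sum(score for score, _ in pair[1]) / len(pair[1]),
    )


for topic, items in triage(messages):
    print(topic, items)
```

From here each topic group could be routed to the team that owns it.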

There are several different classification models available for document or sentiment classification, and the right choice mostly depends on how many classes you are trying to predict and how large a dataset you have access to, because more classes require more data.

Here are some broad guidelines, but you really have to test your data on various models to find out for sure what will work well for you.

For instance, you might want to look at Naive Bayes if you don’t have a ton of data, but if you have more data and want a more robust model you can try Random Forest. Note that choosing Random Forest might give you a boost in accuracy, but you will likely lose speed, which is important in real-world applications.
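If you want to try Naive Bayes as a baseline, a minimal scikit-learn sketch could look like this. The six training messages and two labels are made up and far too few for a real model; they are only there to show the shape of the pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set; real data would need far more examples per class.
train_texts = [
    "python client error timeout",
    "cors error from node client",
    "client request error response",
    "how to deploy my model",
    "model hosting and deploy question",
    "deploy a trained model",
]
train_labels = ["Client API Errors"] * 3 + ["Model Deployment"] * 3

# Bag-of-words counts feeding a multinomial Naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

predictions = classifier.predict(["error from the python client", "deploy model"])
print(list(predictions))
```

Swapping `MultinomialNB()` for another scikit-learn classifier in the same pipeline is an easy way to compare models on your own data.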

For decent accuracy at the cost of scalability, you might want to try basic Word2Vec, Doc2Vec, or even LDA2Vec (which combines Word2Vec with LDA’s topic modeling).

If you have GPU infrastructure available, a lot of data for each class, and time on your side for training, then you can use deep learning. With neural nets for natural language processing, particularly convolutional neural nets, you won’t have to focus on feature engineering and you should get good results.

So make sure you understand your priorities – performance or accuracy – and base decisions on your data size as well.

One of the most useful analyses you can do on user data is to discover how user sentiment changes over time.

Even more insightful is when you integrate release dates as benchmarks in order to gauge sentiment changes after a major design release.
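Here is a bare-bones sketch of using a release date as a benchmark. The dates, scores, and release date are invented, and I'm assuming each message already has a sentiment score between -1 and 1.

```python
from datetime import date
from statistics import mean

# Made-up (date, sentiment score) pairs; scores assumed to be in [-1, 1].
scored_messages = [
    (date(2018, 3, 1), 0.4),
    (date(2018, 3, 3), 0.2),
    (date(2018, 3, 6), -0.5),
    (date(2018, 3, 8), -0.3),
]
RELEASE_DATE = date(2018, 3, 5)  # hypothetical design release


def sentiment_around_release(scored, release_date):
    """Return mean sentiment before the release and on/after it."""
    before = [score for day, score in scored if day < release_date]
    after = [score for day, score in scored if day >= release_date]
    return mean(before), mean(after)


before, after = sentiment_around_release(scored_messages, RELEASE_DATE)
print(f"before: {before:.2f}, after: {after:.2f}, change: {after - before:+.2f}")
```

A real analysis would look at more than two averages, but even this kind of before/after comparison tells you where to start reading messages.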

Now, depending on the vendor you use for collecting user message or feedback form data, you might have the URI that the form or comment was made on so you can organize sentiment by pages.

For example, you can find the sentiment scores for a landing page and if you’re running an A/B test on different styles, you can gain some extra insight beyond conversion rates and pull up the negative comments to see what issues users had with that style.

But, I want to mention here, that while we are looking at trends first, we are then drilling down with a narrowed scope to view our highest priority areas and then we’re checking those messages manually and responding accordingly. Machine learning and automation only go so far in understanding what users are thinking and how they are feeling.

For sentiment analysis on user feedback data, if you have enough data and have a labeled dataset, you can create your own model using anything from Naive Bayes to a Support Vector Machine. And if you have lots of data and access to GPUs you can always try neural nets such as LSTMs (Long Short-Term Memory models), which are a flavor of Recurrent Neural Networks. But when you don’t have enough data to train your own model, don’t discard the many pre-trained sentiment models out there. Since user feedback data is fairly similar to social data (lots of misspelled words, fragmented sentences, slang, etc.), it’s useful to check out a pre-trained model if the data it was trained on is somewhat similar to your own.

Besides sentiment time series, you can do a time series analysis on topics.

For instance, on this slide, you can see a huge spike that occurred during mid-2017 for the bigram “error message” and this was due to a demo we created of our colorizer algorithm that developers could use in their apps to colorize their black and white photos via our API. The reason we were getting so many user messages that contained the bigram “error message” was that the demo landed on a few genealogy and photography sites and non-developers were sending us messages that they were having difficulties with the demo.

If you have a live data stream and are building out a dashboard, you might want to track a few important n-grams that you get from a simple term frequency chart, or topics from a topic modeler like LDA, so you can get a feel for what users are talking about and set up alerts that notify you when a count goes above or below a certain threshold for anomaly detection.
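A toy version of that kind of n-gram tracking is sketched below, using only the standard library. The messages, dates, and threshold are made up; in a real pipeline the batches would come from your stream rather than a dictionary.

```python
from collections import Counter

WATCHED_BIGRAM = ("error", "message")
ALERT_THRESHOLD = 2  # assumed: alert when a day's count exceeds this


def bigram_counts(messages):
    """Count adjacent word pairs across a batch of messages."""
    counts = Counter()
    for message in messages:
        tokens = message.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Made-up messages grouped by day.
daily_messages = {
    "2017-06-01": ["got an error message on upload"],
    "2017-06-02": [
        "error message when colorizing my photo",
        "the demo shows an error message",
        "error message again",
    ],
}

for day, messages in daily_messages.items():
    count = bigram_counts(messages)[WATCHED_BIGRAM]
    if count > ALERT_THRESHOLD:
        print(f"ALERT {day}: 'error message' appeared {count} times")
```

A fixed threshold is the simplest possible rule; it only makes sense once you know what a normal day looks like for the n-grams you watch.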

You can see that even with this kind of unexpected example, it’s helpful to track what users are talking about in real-time so you can help them use your app, API, or in this case redirect them to an iOS app to colorize their black and white photos when they aren’t your target user.

If you want to understand the overall trends in your data and even forecast future trends in topics or sentiment you might want to try some of these or other time series models:

  • ARIMA (autoregressive integrated moving average)
  • Holt-Winters (a triple exponential smoothing model)
  • Hidden Markov Model

Also, you might want to check out Facebook’s open source time series forecaster called Prophet which deals well with multi-seasonal linear and non-linear data.
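To give a feel for what the smoothing models in this family do, here is simple (single) exponential smoothing in plain Python. It is only the building block that Holt-Winters extends with trend and seasonal terms, not Holt-Winters itself, and the weekly scores and `alpha` value are made up.

```python
def exponential_smoothing(series, alpha):
    """Return the smoothed series; the last value serves as a one-step-ahead forecast."""
    smoothed = [series[0]]
    for value in series[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


weekly_sentiment = [0.2, 0.4, 0.0, 0.4]  # made-up weekly mean sentiment
smoothed = exponential_smoothing(weekly_sentiment, alpha=0.5)
print(smoothed)
```

A higher `alpha` follows recent weeks more closely, while a lower one gives a steadier line; for anything with seasonality you would reach for one of the fuller models listed above.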

If you’re looking to detect anomalies such as our colorizer spike, you could use a support vector machine algorithm or K-Nearest Neighbors. Twitter has also put out a promising anomaly detection package called AnomalyDetection, available for Python and R, which is built upon the Generalized Extreme Studentized Deviate (ESD) test and handles more than one outlier.
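Before reaching for any of those, a quick baseline can be useful. The sketch below is a much simpler technique than the ones named above: it flags points using a modified z-score based on the median and the median absolute deviation (MAD). The daily counts are made up, and 3.5 is a commonly used cutoff rather than anything tuned.

```python
from statistics import median


def mad_anomalies(counts, threshold=3.5):
    """Return indexes whose modified z-score exceeds `threshold`."""
    center = median(counts)
    mad = median(abs(c - center) for c in counts)
    if mad == 0:
        return []  # no spread to measure against
    return [
        i for i, c in enumerate(counts)
        if 0.6745 * abs(c - center) / mad > threshold
    ]


daily_error_mentions = [3, 4, 2, 5, 3, 4, 41, 3]  # made-up counts with one spike
print(mad_anomalies(daily_error_mentions))
```

Because it relies on medians, one huge spike doesn't distort the baseline the way it would with a mean and standard deviation.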

And of course, when you have a lot of data, a Long Short-Term Memory Model (LSTM) works well for anomaly detection or other classification problems.

By detecting anomalies you can detect bugs or find other issues with your product or site such as a confusing drop down or form.

Sometimes your models don’t perform well at all. Say you’re using a logistic regression model for document classification, your p-values and log odds are within acceptable bounds, and you’ve run tests and compared models using bigrams versus unigrams. But while everything looks generally OK, you aren’t impressed with the results.

The next step you might want to take is to make sure you’ve processed your data properly. That means reducing noise, deciding whether to prune more sparse terms from your data, and for sentiment analysis keeping emoticons or other features that relay emotion.

You also might want to implement a part of speech tagger to keep syntax for sentiment analysis or check out a document or phrase similarity model like doc2vec or LDA2vec as mentioned earlier.

For document classification in particular, you might want to check out other newer approaches such as Word Mover’s Distance (available in Gensim), which assesses the “distance” between two documents even when they have no words in common. It uses word2vec embeddings and has outperformed other models such as K-Nearest Neighbors.

And lastly, I’ve said it before, and I’ll say it again: if you have the data and the infrastructure to train on huge amounts of it, then take a look at neural nets for both sentiment and document classification. Try experimenting with convolutional or recurrent neural net based models such as LSTM, or check out hierarchical attention networks (HANs), which use the natural hierarchical structure that occurs in text documents. HANs also acknowledge that sentences and words have different levels of impact in the document, and they use context to understand whether a sequence of tokens is relevant or not, so you might want to try them out.


Whether you’re building out user dashboards tracking topic-based anomalies or doing a post-design release review of user sentiment, analyzing user feedback is an important and often overlooked step in the customer support pipeline. Hopefully this post will inspire you to check out your user message data to help your customers succeed in their goals when using your product or site.