Algorithmia

Building a Sentiment Analysis Pipeline for Web Scraping

In an earlier post, we introduced the Sentiment Analysis algorithm and showed how easy it is to retrieve a sentiment score for text content through an API call.

In this post, we’ll show how to build a sentiment analysis pipeline that grabs all the links from a web page, extracts the text content from each URL, and then returns the sentiment of each page.

While text content can sometimes be retrieved with a simple database query, other times it’s not so easy to extract. For instance, it’s notoriously difficult to retrieve text content from websites, especially when you don’t want everything from a URL, just the text content of specific pages.

For example, if you want to derive the sentiment from specific pages on a website, you can easily spend hours finding an appropriate web scraper, and then weeks labeling data and training a model for sentiment analysis.

Maybe you want to find the sentiment of specific pages of your website to make sure your content is sending the right message. Or, perhaps you are trying to analyze news articles by department and sentiment.

Confused? Check out our handy introduction to sentiment analysis.

No matter what your use case is, we’ll show you how to retrieve all the URLs from a website, extract only the text, and then get the sentiment of each page’s content in just a few API calls.

Step 1: Install the Algorithmia Client

This tutorial is in Python, but it could be built using any of the supported clients, such as JavaScript, Ruby, or Java. See the Python client guide for more information on using the Algorithmia API.

Install the Algorithmia client from PyPI:

pip install algorithmia

You’ll also need a free Algorithmia account, which includes 5,000 free credits a month – more than enough to get started with crawling, extracting, and analyzing web data.

Sign up here, and then grab your API key.
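Rather than pasting the key into your script, you may prefer to read it from the environment. The snippet below is a minimal sketch of that idea; the helper load_api_key and the variable name ALGORITHMIA_API_KEY are assumptions made for illustration, not part of the Algorithmia client.

```python
import os


def load_api_key(env=os.environ, var_name="ALGORITHMIA_API_KEY"):
    """Return the API key stored in an environment variable.

    Both this helper and the variable name are hypothetical; use whatever
    name fits your setup.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (requires the Algorithmia client installed above):
# import Algorithmia
# client = Algorithmia.client(load_api_key())
```

Keeping the key out of the source file makes it harder to leak by accident when you share or commit the script.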

Step 2: Retrieve Links from URL

The first step is to gather our list of links. We’ll use the Get Links algorithm, which takes a URL and returns a list of the links found on that page.

To start using Get Links, replace your_api_key with the key you got in the previous step and change the input to your URL.

import Algorithmia

client = Algorithmia.client("your_api_key")

def get_links():
    """Gets links from URL"""
    url = "https://algorithmia.com/"
    # Accept only http:// and https:// URLs
    if url.startswith(("http://", "https://")):
        algo = client.algo('web/GetLinks/0.1.5')
        links = algo.pipe(url).result
        return links
    else:
        print("Please enter a properly formed URL")
        # Return an empty list so callers can still iterate over the result
        return []

We’ve used the Algorithmia website as an example and here is a sample of the URLs extracted:

['http://developers.algorithmia.com', 'https://algorithmia.com/terms', 'https://algorithmia.com/algorithms/TimeSeries/OutlierDetection', 'https://algorithmia.com/algorithms/opencv/FaceDetection', 'https://algorithmia.com/algorithms/SummarAI/Summarizer', 'http://www.technologyreview.com/news/530406/a-dating-site-for-algorithms']
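Before passing a list like this along, it can be worth tidying it up. The following is a rough sketch, using only the standard library, of one way to keep well-formed web links and drop repeats. The helper names is_web_url and unique_web_links are made up for this example, and whether Get Links ever returns duplicates or non-web links is an assumption rather than something documented here.

```python
from urllib.parse import urlparse


def is_web_url(url):
    """Return True if url has an http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def unique_web_links(links):
    """Keep well-formed http(s) links, dropping duplicates but preserving order."""
    seen = set()
    result = []
    for link in links:
        if is_web_url(link) and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Parsing the URL is a little stricter than a prefix check: a string such as "httpfoo" begins with "http" but has no scheme or host, so it is rejected here.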

Step 3: Extract Content from URLs

Next, we’ll pass the links to the URL-2-Text algorithm, which extracts the text content from each URL returned by Get Links.

Since we want only the content returned from the Algorithmia blog, we’ll limit the links used to extract content to ones that start with “http://blog.algorithmia.com” and create a dictionary holding the URL and the content.

def get_content():
    """Get text content from URL."""
    data = get_links()
    algo = client.algo("util/Url2Text/0.1.4")
    # Limit content extracted to only blog articles
    content = [{"url": link, "content": algo.pipe(
        link).result} for link in data if link.startswith("http://blog.algorithmia.com")]
    return content
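The prefix test in get_content matches only links that begin with exactly “http://blog.algorithmia.com”. If you would rather match on the host name, one possible approach is sketched below. The helper is_from_host is hypothetical and relies only on Python’s standard urllib.parse module.

```python
from urllib.parse import urlparse


def is_from_host(link, host="blog.algorithmia.com"):
    """Return True if link is an http(s) URL served from the given host."""
    parts = urlparse(link)
    # parts.hostname is lowercased and excludes any port number
    return parts.scheme in ("http", "https") and parts.hostname == host
```

Comparing host names accepts both http and https links, and it rejects look-alike addresses such as http://blog.algorithmia.com.example.org/, which a plain prefix check would let through. To try it, swap the startswith condition in get_content for is_from_host(link).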

Step 4: Find the Sentiment of Text Content

In our last step, we’re going to pass the extracted text content to our sentiment analysis algorithm. This algorithm takes either a dictionary holding a single document or a list of dictionaries holding many documents; here we pass a list. Each document could be a word, a sentence, a paragraph, or even an article.

This algorithm returns a list of dictionaries holding each document’s sentiment and the document itself. This makes it easy to create a new list that holds the URL, content, and the sentiment of each web page that we’ve passed through our algorithms.

def find_sentiment():
    """Get sentiment from web content."""
    data = get_content()
    algo = client.algo("nlp/SentimentAnalysis/1.0.2")
    try:
        # Find the sentiment score for each article
        algo_input = [{"document": item["content"]} for item in data]
        algo_response = algo.pipe(algo_input).result

        # Pair each page with its result (assumes results keep the input order)
        algo_final = [{"url": doc["url"], "sent": sent["sentiment"], "content": sent[
            "document"]} for doc, sent in zip(data, algo_response)]
        print(algo_final)
        return algo_final

    except Exception as e:
        print(e)

find_sentiment()

Here is a sample of the output with some of the text shortened for brevity:

[{'content': "...Face detection also allows us to detect the presence of multiple people in an image...", 'sent': 0.24829046, 'url': 'http://blog.algorithmia.com/post/121967357859/isitnude'}]

And that’s it! You’ve now built a sentiment analysis pipeline that retrieves the links you want from a URL, extracts the text content from each page, and then finds the sentiment for each document.
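With the results in hand, a simple follow-up is to rank the pages by score. Here is a minimal sketch that works on a list shaped like the output of find_sentiment. The summarize helper, the sample URLs, and the scores are all made up for illustration.

```python
from statistics import mean


def summarize(pages):
    """Sort pages from most to least positive and report the average score."""
    ranked = sorted(pages, key=lambda page: page["sent"], reverse=True)
    average = mean(page["sent"] for page in ranked) if ranked else None
    return ranked, average


# Made-up results in the same shape as the sample output
sample = [
    {"url": "http://blog.algorithmia.com/a", "sent": 0.25, "content": "..."},
    {"url": "http://blog.algorithmia.com/b", "sent": -0.5, "content": "..."},
    {"url": "http://blog.algorithmia.com/c", "sent": 0.75, "content": "..."},
]
ranked, average = summarize(sample)
for page in ranked:
    print(f'{page["sent"]:+.2f}  {page["url"]}')
```

In your own script you would pass the list returned by find_sentiment() in place of sample.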

Next, it might be fun to get the noun phrases from your text by using the named entities algorithm or utilize other natural language processing techniques, such as profanity detection.

Tools Used:

Get Links (web/GetLinks)
URL-2-Text (util/Url2Text)
Sentiment Analysis (nlp/SentimentAnalysis)

For easy reference, find this code in our Recipes repo, or just copy the whole code snippet:

"""Retrieve urls, text content, and sentiment from URL."""

import Algorithmia

client = Algorithmia.client("your_api_key")


def get_links():
    """Gets links from URL"""
    url = "https://algorithmia.com/"
    # Accept only http:// and https:// URLs
    if url.startswith(("http://", "https://")):
        algo = client.algo('web/GetLinks/0.1.5')
        links = algo.pipe(url).result
        return links
    else:
        print("Please enter a properly formed URL")
        # Return an empty list so callers can still iterate over the result
        return []


def get_content():
    """Get text content from URL."""
    data = get_links()
    algo = client.algo("util/Url2Text/0.1.4")
    # Limit content extracted to only blog articles
    content = [{"url": link, "content": algo.pipe(
        link).result} for link in data if link.startswith("http://blog.algorithmia.com")]
    return content


def find_sentiment():
    """Get sentiment from web content."""
    data = get_content()
    algo = client.algo("nlp/SentimentAnalysis/1.0.2")
    try:
        # Find the sentiment score for each article
        algo_input = [{"document": item["content"]} for item in data]
        algo_response = algo.pipe(algo_input).result

        # Pair each page with its result (assumes results keep the input order)
        algo_final = [{"url": doc["url"], "sent": sent["sentiment"], "content": sent[
            "document"]} for doc, sent in zip(data, algo_response)]
        print(algo_final)
        return algo_final

    except Exception as e:
        print(e)

find_sentiment()