Algorithmia

Automatic SlideShare Text Summarization

In a recent post, we introduced an automatic text summarizer algorithm that summarizes documents with a quick API call.

A text summarizer lets you skim articles, PDFs, and other documents quickly. Sometimes, though, getting the content can be just as difficult as summarizing it.

Luckily, we made both easy for you.

In this recipe we’ll show how to scrape and summarize the text of a SlideShare presentation by combining Summarizer with our Extract Text algorithm.

Step 1: Install the Algorithmia Client

This tutorial is in Python, but it could be built using any of the supported clients, such as Scala, Ruby, Java, Node, and others. Here’s the Python client guide for more information on using the Algorithmia API.

Install the Algorithmia client from PyPI:

pip install algorithmia

You’ll also need a free Algorithmia account, which includes 5,000 free credits a month – more than enough to get started with crawling, extracting, and analyzing web data.

Sign up here, and then grab your API key.
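
Rather than pasting the key straight into the script, one option is to keep it in an environment variable. The sketch below is only an illustration: the helper load_api_key and the variable name ALGORITHMIA_API_KEY are assumptions made for this example, not part of the recipe.

```python
import os


def load_api_key(env_var="ALGORITHMIA_API_KEY"):
    """Return the API key stored in the given environment variable.

    Both the helper and the default variable name are hypothetical,
    chosen for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            "Set the %s environment variable to your API key" % env_var)
    return key


# In the script, the hard-coded key could then become:
# client = Algorithmia.client(load_api_key())
```

This keeps the key out of any file you might share or commit.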

Step 2: Extract Text from a Presentation

Now, we’ll extract the text from a SlideShare presentation. We’ll use our HackerNews Meetup deck as an example, but feel free to try your own.

import re
import Algorithmia

client = Algorithmia.client("your_api_key")

def extract_text(url):
    """Retrieve text content from URL."""
    algo = client.algo("util/ExtractText/0.1.1")
    response = algo.pipe(url).result
    return response

The Extract Text algorithm pulls text from a URL or from a file such as a PDF or PowerPoint. Passing in our URL, we get all the text content from the SlideShare page, but it’s pretty messy. In the next step we’ll clean it up so that only the slide content remains.

Step 3: Cleaning the Data

Here we’ll employ the unglamorous technique of regular expressions to replace every character other than letters, digits, periods, and spaces, find the start of the slides, and retrieve the text of those slides.

def get_slide_text(url):
    """Clean text content and return content only from slides."""
    text = extract_text(url)
    # Replace every character except letters, digits, periods and spaces
    new_data = re.sub(r'[^0-9a-zA-Z. ]', ' ', text)

    # Find all the separate slides (slides are extracted with 1., 2.
    # preceding each slide)
    nums = re.findall(r'[1-9]\.|[1-9][0-9]\.', new_data)
    if not nums:
        raise ValueError("No numbered slides found in the extracted text")

    last_slide = nums[-1]

    # Look for text between 1. and the last slide; the marker is escaped
    # so that its period is matched literally
    match = re.search(
        r'1\..*(?=' + re.escape(last_slide) + r')', new_data)
    # Fall back to the full text if the slide range cannot be found
    cleaned_data = match.group(0) if match else new_data

    # Artificially create sentences that the summarizer needs.
    remove_slide_numbers = re.sub(
        r'[1-9]\.|[1-9][0-9]\.', '.', cleaned_data)

    # Remove extra whitespace
    super_clean = ' '.join(remove_slide_numbers.lower().split())
    return super_clean

The code above discards everything else that the Extract Text algorithm gathered from the page and returns only the slide text, with each slide turned into a separate sentence for the Summarizer algorithm to parse.
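
To see what each regex step does, here is a small worked example that runs the same patterns on a made-up string. The sample text is invented for illustration and is not real Extract Text output. Note that the text of the last numbered slide is dropped, since the match stops just before the last slide marker.

```python
import re

# Made-up stand-in for the raw text Extract Text might return.
raw = ("Share | Like 1. Hello, World! 2. Algorithms & APIs "
       "3. Thanks! Recommended decks")

SLIDE_NUM = r'[1-9]\.|[1-9][0-9]\.'

# Step 1: replace everything except letters, digits, periods and spaces.
no_punct = re.sub(r'[^0-9a-zA-Z. ]', ' ', raw)

# Step 2: collect the slide markers ("1.", "2.", ...).
markers = re.findall(SLIDE_NUM, no_punct)

# Step 3: keep the text from "1." up to (not including) the last marker.
match = re.search(r'1\..*(?=' + re.escape(markers[-1]) + r')', no_punct)
slides = match.group(0)

# Step 4: turn each marker into a period, lowercase, collapse whitespace.
cleaned = ' '.join(re.sub(SLIDE_NUM, '.', slides).lower().split())

print(markers)  # ['1.', '2.', '3.']
print(cleaned)  # . hello world . algorithms apis
```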

Step 4: Retrieve the Presentation Summary

Now we’ll get the summary of our slides by passing the cleaned-up slide text through the Summarizer algorithm.

def summarizer(input):
    """Summarize text content from slides."""
    data = get_slide_text(input)
    algo = client.algo('nlp/Summarizer/0.1.3')
    result = algo.pipe(data).result
    print("Summarizer output: ", result)

The above call to the Summarizer is pretty straightforward. We get our content from the previous function, which we set to our data variable, and then pass that into the Summarizer algorithm to return the result.

Add a call to summarizer, such as one of the examples below, to the end of your file. Then save it as summarizer.py and run it from the command line with:

python summarizer.py
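
If you would rather pass the presentation URL on the command line than edit the file each time, a minimal sketch using the standard library’s argparse could look like this. The helper parse_url is a hypothetical name introduced for this example.

```python
import argparse


def parse_url(argv=None):
    """Return the presentation URL given as the command-line argument."""
    parser = argparse.ArgumentParser(
        description="Summarize a SlideShare presentation.")
    parser.add_argument(
        "url", help="address of the SlideShare deck to summarize")
    return parser.parse_args(argv).url


# At the bottom of summarizer.py, the call could then look like:
# if __name__ == "__main__":
#     summarizer(parse_url())
```

With that in place, the script would be run as python summarizer.py followed by the URL of the deck.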

Example input:

summarizer("http://www.slideshare.net/doppenhe/algorithmia-at-hackernews-meetup-seattle")

Example output:

"donald knuth info algorithmia.com . our goal with algorithmia is to make applications smarter by building a community around algorithm development where state of the art algorithms are always live and accessible to anyone. productivity increases when fewer inputs are used in the production of a unit of output we went from hunter gatherer to agriculture to industrial to the next revolution interpretation of data."

Here’s another example:

summarizer("http://www.slideshare.net/MattKiser/algorithms-as-microservices")

Example output:

"algorithmia algorithms as microservices 10 ways the algorithm economy and containers are changing how we build and deploy software today . algorithms are where the real value lies. into a single package running in the cloud ."

Step 5: Create a File and Add to Your Dropbox Folder

For our final step, we’ll add a bit of code to the summarizer function that creates a file holding our presentation’s summary and saves it to a Dropbox folder.

If you’ve never used the Dropbox connector on Algorithmia, check out our docs on how to set up a Dropbox connection. Be sure to make this the default Dropbox data source.

Modify the function you created in Step 4 so that it writes the file to Dropbox and handles any exception raised along the way:

def summarizer(input):
    """Summarize text content from slides."""
    data = get_slide_text(input)
    algo = client.algo('nlp/Summarizer/0.1.3')
    result = algo.pipe(data).result
    print("Summarizer output: ", result)

    # Add this code for step 5:
    try: 
        # Write the text summary result to Dropbox
        client.file("dropbox://SlideSummary/slide_summary.txt").put(result)
        print("Printed contents to dropbox file")
    except Exception as e:
        print(e)

In this step, we added the code necessary to write our slide presentation’s summary to Dropbox. Note that while we chose the folder name “SlideSummary,” you can save your file to any folder you want.

The path will be: “dropbox://SlideSummary/slide_summary.txt” (remember to make the Dropbox connection the default).

The line we added creates the folder “SlideSummary” and the file “slide_summary.txt”, and writes our text summary to the file:

client.file("dropbox://SlideSummary/slide_summary.txt").put(result)
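
Because the path above is fixed, each run writes to the same file. If you want one file per presentation, one possible approach is to derive the file name from the deck’s URL. The helper summary_path and its naming scheme are assumptions made for this sketch.

```python
from urllib.parse import urlparse


def summary_path(url, folder="SlideSummary"):
    """Build a dropbox:// path named after the last segment of the deck URL.

    Hypothetical helper: falls back to "slide_summary" when the URL has
    no path segment to use as a name.
    """
    slug = urlparse(url).path.rstrip("/").split("/")[-1] or "slide_summary"
    return "dropbox://%s/%s.txt" % (folder, slug)


print(summary_path(
    "http://www.slideshare.net/MattKiser/algorithms-as-microservices"))
# dropbox://SlideSummary/algorithms-as-microservices.txt
```

Inside summarizer, the write could then become client.file(summary_path(input)).put(result).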

And there you have it. In this recipe we covered all the steps necessary to extract text from a SlideShare deck, clean up the data, summarize the presentation, and write the summary to a file in your Dropbox folder.

Tools Used: Extract Text (util/ExtractText) and Summarizer (nlp/Summarizer).

For ease of reference, here is the whole script we created. You can also get it from the Algorithmia sample apps repo on GitHub.

import re
import Algorithmia

client = Algorithmia.client("your_api_key")

def extract_text(url):
    """Retrieve text content from URL."""
    algo = client.algo("util/ExtractText/0.1.1")
    response = algo.pipe(url).result
    return response

def get_slide_text(url):
    """Clean text content and return content only from slides."""
    text = extract_text(url)
    # Replace every character except letters, digits, periods and spaces
    new_data = re.sub(r'[^0-9a-zA-Z. ]', ' ', text)

    # Find all the separate slides (slides are extracted with 1., 2.
    # preceding each slide)
    nums = re.findall(r'[1-9]\.|[1-9][0-9]\.', new_data)
    if not nums:
        raise ValueError("No numbered slides found in the extracted text")

    last_slide = nums[-1]

    # Look for text between 1. and the last slide; the marker is escaped
    # so that its period is matched literally
    match = re.search(
        r'1\..*(?=' + re.escape(last_slide) + r')', new_data)
    # Fall back to the full text if the slide range cannot be found
    cleaned_data = match.group(0) if match else new_data

    # Artificially create sentences that the summarizer needs.
    remove_slide_numbers = re.sub(
        r'[1-9]\.|[1-9][0-9]\.', '.', cleaned_data)

    # Remove extra whitespace
    super_clean = ' '.join(remove_slide_numbers.lower().split())
    return super_clean

def summarizer(url):
    """Summarize text content from slides and save the summary to Dropbox."""
    data = get_slide_text(url)
    algo = client.algo('nlp/Summarizer/0.1.3')
    result = algo.pipe(data).result
    print("Summarizer output: ", result)

    # Step 5: write the text summary result to Dropbox
    try:
        client.file("dropbox://SlideSummary/slide_summary.txt").put(result)
        print("Printed contents to dropbox file")
    except Exception as e:
        print(e)

# Run the pipeline on the example deck when the script is executed directly
if __name__ == "__main__":
    summarizer("http://www.slideshare.net/doppenhe/algorithmia-at-hackernews-meetup-seattle")