Algorithmia

Extract Structured Data From Web Sites Using Analyze URL

Scraping and extracting structured data from web pages can often be a challenge. There are typically issues with fetching data, dealing with pagination, handling AJAX, and more.

The Analyze URL microservice is a useful tool for transforming messy unstructured data into clean structured data via a simple REST API.

We created this microservice by combining several simple functions into a more complex web service. Check out the Analyze URL source code here.

By chaining these functions together, where the result of one is applied to the next, we get what computer science calls a pipeline.
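As a minimal sketch of the idea (not the actual Analyze URL implementation), a pipeline can be written in Python as a helper that feeds each function's result into the next. The names `pipeline`, `strip_whitespace`, `collapse_spaces`, and `first_sentence` are hypothetical stand-ins for real stages.

```python
from functools import reduce
from typing import Any, Callable


def pipeline(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a function that applies each step to the previous step's result."""
    def run(value: Any) -> Any:
        return reduce(lambda acc, step: step(acc), steps, value)
    return run


# Hypothetical stages; each one takes a string and returns a string.
def strip_whitespace(text: str) -> str:
    return text.strip()


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


def first_sentence(text: str) -> str:
    # Naive rule: keep everything up to and including the first period.
    head, sep, _ = text.partition(".")
    return head + sep


clean_summary = pipeline(strip_whitespace, collapse_spaces, first_sentence)
print(clean_summary("  Hello   world.  More text here. "))
```

Because every stage has the same shape (one value in, one value out), stages can be reordered or swapped without touching the others.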

Let’s take a look at how Analyze URL composes a few functions into a microservice that scrapes a URL and returns structured data from the web page.

tl;dr Don’t like reading? Just want to extract structured data from web pages? We got you. Check out the Web Page Inspector demo built off of Analyze URL. Web Page Inspector instantly retrieves clean, structured data from any URL.

What is Analyze URL?

The Analyze URL microservice is an easy way to scrape metadata from any URL. The service accepts a URL as a string, and returns a summary of the web page (using the Summarizer microservice), timestamp, thumbnail, plain text content, the title, URL, and status code for the page.

The data is returned as a useful JSON object. From here, you could pipe the text into Auto-Tag to generate keyword topic tags. Or, maybe you want to resize and crop the image using Smart Thumbnail.

Either way, the microservice is built from several underlying functions (or algorithms, if you will) that extract and produce structured data.

Here we’re making an important distinction between algorithms and microservices (for convenience, we’ll skip the semantics and use algorithm and function interchangeably):

  • Algorithms are deterministic, meaning that given the same input, an algorithm will always produce the same output.
  • When we compose several algorithms together we get a microservice, a sort of self-contained service that achieves a specific task predictably.

When we say Analyze URL is a microservice, we’re implying that it’s the result of several algorithms (or functions) working behind the scenes to produce some desired result. In this case, a bunch of structured data from a URL.

Check out the source code behind Analyze URL to see what we mean by functions working together.

The upshot of all this is that while Analyze URL is written in Java, the Algorithmia API makes it interoperable: you can call this Java microservice using any of the supported Algorithmia clients, like Python, JavaScript, Scala, and more. (Side note: because Analyze URL is written in Java, it’s technically not a function but rather a method.)

Why You Need Analyze URL

While it’s pretty straightforward to scrape a single web page, it’s not as simple when you want to scrape any URL and consistently extract structured data. It’s harder still when scraping hundreds or thousands of different URLs at a time.

Analyze URL works as a simple API endpoint that’s always on and available.

How to Extract Structured Data

Create a free Algorithmia account, and install your client. Then, call the Analyze URL microservice by passing in the URL you want to generate structured data for. Boom. You’re done.

Sample Input

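import Algorithmia

# URL of the page to analyze
input = "http://blog.algorithmia.com/predictive-algorithms-track-real-time-health-trends/"
client = Algorithmia.client('YOUR API KEY')
algo = client.algo('web/AnalyzeURL/0.2.14')
# In current versions of the Python client, pipe() returns a response
# object whose .result attribute holds the algorithm's output
print(algo.pipe(input).result)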

Sample Output

{
  "summary": "In this tutorial, we're going to build a real-time health dashboard using predictive algorithms to track a person's blood pressure trends over time.",
  "date": "2016-08-15T12:31:55-07:00",
  "thumbnail": "http://blog.algorithmia.com/wp-content/uploads/2016/08/predictive-algorithm-dashboard.jpg",
  "marker": true,
  "text": "Build tomorrow's smart apps today This is a guest post by Chris Hannam, a professional Python and Java developer. Want to contribute your own how-to post Let us know contact us here. We’ve shown how to use predictive algorithms to track economic development. In this tutorial, we’re going to build a real-time health dashboard for tracking a person’s blood pressure readings, do time series analysis, and then graph the trends over time using predictive algorithms. ...",
  "title": "Predictive Algorithms to Track Your Health Data In Real Time - Algorithmia",
  "url": "http://blog.algorithmia.com/predictive-algorithms-track-real-time-health-trends/",
  "statusCode": 200
}
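Once you have that object, you will usually want to check the status code and pull out only the fields you need. The sketch below is one way to do that, assuming the result has already been parsed into a Python dict. `PageInfo` and `to_page_info` are hypothetical names, and the sample is an abbreviated copy of the output above.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class PageInfo:
    """The subset of Analyze URL fields this sketch cares about."""
    title: str
    url: str
    summary: str
    status_code: int


def to_page_info(result: Dict[str, Any]) -> PageInfo:
    """Validate a result dict and copy selected fields into a PageInfo."""
    status = result.get("statusCode")
    if status != 200:
        raise ValueError(f"page fetch failed with status {status!r}")
    return PageInfo(
        title=result.get("title", ""),
        url=result.get("url", ""),
        summary=result.get("summary", ""),
        status_code=status,
    )


# Abbreviated copy of the sample output, parsed from JSON text.
sample = json.loads("""
{
  "summary": "In this tutorial, we're going to build a real-time health dashboard.",
  "title": "Predictive Algorithms to Track Your Health Data In Real Time - Algorithmia",
  "url": "http://blog.algorithmia.com/predictive-algorithms-track-real-time-health-trends/",
  "statusCode": 200
}
""")

page = to_page_info(sample)
print(page.title)
```

Failing fast on a non-200 status keeps bad pages out of whatever you pipe the data into next.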

This is just the start of composing algorithms into microservices on Algorithmia. Think of it as Algorithms as a Microservice where you have access to an ever-growing library of more than 2,200 algorithms.

For instance, you could create a new microservice by running a URL through Site Map, and then piping the output into Analyze URL to extract structured data for an entire domain. Oh, hey, look: a simple Python script to do just that.
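A rough sketch of that idea, not the linked script itself, could look like the following. To keep it runnable without an API key, the two services are passed in as plain callables. In practice each would wrap a `client.algo(...).pipe(...)` call. The function name `analyze_domain`, the parameter names, and the assumption that the site map step yields a flat list of URLs are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List


def analyze_domain(
    root_url: str,
    list_urls: Callable[[str], Iterable[str]],
    analyze: Callable[[str], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run `analyze` on every URL that `list_urls` finds under `root_url`.

    A failure on one page is recorded and does not stop the rest of the crawl.
    """
    results: List[Dict[str, Any]] = []
    for url in list_urls(root_url):
        try:
            results.append(analyze(url))
        except Exception as exc:  # keep going when a single page fails
            results.append({"url": url, "error": str(exc)})
    return results


# Stub services so the sketch runs offline; swap in real API calls here.
def fake_site_map(root: str) -> List[str]:
    return [root, root + "about", root + "broken"]


def fake_analyze(url: str) -> Dict[str, Any]:
    if url.endswith("broken"):
        raise RuntimeError("could not fetch page")
    return {"url": url, "statusCode": 200}


pages = analyze_domain("http://example.com/", fake_site_map, fake_analyze)
for page in pages:
    print(page)
```

Recording errors per URL matters once you scrape hundreds of pages, since a single unreachable page should not discard the rest of the results.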

Because Algorithmia makes it easy to turn business logic into a web service, you can now instantly deploy your backend code as an API for public or private consumption. Every algorithm runs as its own microservice, making them composable, interoperable, and secure. No servers, NoOps. You’re good to go.

Check out the Web Page Inspector demo that uses Analyze URL to instantly retrieve clean, structured data from any URL.

Let us know what you think @Algorithmia.

Product manager at Algorithmia helping to give developers super powers.
