
Web Scraping with Python: How To Crawl, Scrape, and Analyze URLs

Web Scraping 101: How to crawl, scrape, and analyze websites in Python

How do you convert an entire website into JSON when an API isn’t available? Many people would write a web crawler to first discover every URL on the domain, then write a web scraper for each type of page to transform it into structured data. After that, they’d have to de-dupe, strip HTML, and more just to get the data into a structured state. That sounds like a lot of work.

Last week we taught you how to extract structured data from websites using the Analyze URL microservice. Now, let’s use web scraping to crawl and analyze an entire website in fewer than 50 lines of Python.

When we’re done you’ll have a script that will crawl a domain, scrape the metadata and content, and put it in a useful JSON format.

From here you can do any number of things, from building a content recommender, to doing competitive keyword research, to finding orphaned pages within your domain. You could pair this with a script that finds broken links and generates a 404 Not Found email report. Maybe you’re doing a site audit and need to understand all the pages within the domain.

This script can do the heavy lifting to turn unstructured data into useful structured data.

Step 1: Install the Algorithmia Client

This tutorial is in Python, but it could just as easily be built with any of the supported clients, like JavaScript, Ruby, or Java. See the Python client guide for more information on using the Algorithmia API.

Install the Algorithmia client from PyPi:

pip install algorithmia

You’ll also need a free Algorithmia account, which includes 5,000 free credits a month – more than enough to get started with crawling, extracting, and analyzing web data.

Sign up here, and then grab your API key.

Step 2: Crawl the Website

The first thing we need to do is crawl our domain using Site Map. This microservice follows every link it finds within the domain and returns a graph representing its link structure.

We set our input to the top-level domain and, optionally, the depth of the search (the default is two). Depth is how many levels away from the starting point Site Map will traverse. We’re setting it to one to keep the demonstration quick and easy; crawling gets slower with each additional level of depth.

Try running the following code:

import Algorithmia
import json

input = ["http://algorithmia.com",1]

client = Algorithmia.client('YOUR API KEY')

res = client.algo('web/SiteMap/0.1.7').pipe(input)

siteMap = res.result

print(siteMap)

The script will run, piping your input into the Site Map algorithm, which crawls the site and returns an object containing all the links it found. If you changed the depth from one to two, you’ll notice that each link now has an array of additional links. These are the links found on the pages one additional level away from the root.
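
For reference, the code in Step 3 treats siteMap as a dictionary whose keys are the URLs found on the domain, each mapped to a list of the URLs discovered on that page (empty at a depth of one). The URLs below are illustrative placeholders, not real output:

# Illustrative shape only -- the actual keys and values come from Site Map
siteMap = {
    "https://algorithmia.com/about": [],
    "https://algorithmia.com/use-cases": []
}

At a depth of two, each of those lists would instead contain the links found on that page.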

Step 3: Web Scraping and Analyzing URLs

Next, we flatten the site map graph by iterating through its key-value pairs, adding every URL to a links array. This gives us a clean, flat list to iterate over when analyzing our web pages.

Then we use list(set(links)) to remove duplicate links from the array.

Now we have a flat array of URLs from our domain. We iterate through each link and pipe it into the Analyze URL microservice. The result is an object containing the page summary, date, thumbnail, text content, title, status code, and more. Learn more about how to use Analyze URL to extract metadata from any web page.

As the results from Analyze URL are returned, we add them to our output array. We use json.dumps(output, indent=4) to prettify our results.

links = []
output = []

for keyLink in siteMap:
    links.append(keyLink)
    for valLink in siteMap[keyLink]:
        links.append(valLink)

links = list(set(links))

for l in links:
    analyze = client.algo('web/AnalyzeURL/0.2.14').pipe(l)
    output.append(analyze.result)

print(json.dumps(output, indent=4))
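
Note that this loop assumes every Analyze URL call succeeds. On a larger crawl, a single unreachable page could raise an exception and stop the run, so you may want a more defensive variant along these lines (a sketch; it catches exceptions broadly rather than naming a specific client error type):

# Optional: skip pages that fail to analyze instead of stopping the whole run
for l in links:
    try:
        analyze = client.algo('web/AnalyzeURL/0.2.14').pipe(l)
        output.append(analyze.result)
    except Exception as e:
        print('Skipping {}: {}'.format(l, e))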

Save your file as sitemap2analyzeUrl.py and then run the script from the command line using:
python sitemap2analyzeUrl.py

If all goes well, you’ll get a response that’s an array of objects, each representing structured data for the crawled URLs. It’ll look something like this:

[    
    {
        "title": "About - Algorithmia",
        "url": "https://algorithmia.com/about",
        "text": "Algorithmia was born in 2013 with the goal of advancing the art of algorithm …",
        "marker": true,
        "date": "",
        "thumbnail": "",
        "statusCode": 200
    },
    {
        "title": "Use Case Gallery- Algorithmia",
        "url": "https://algorithmia.com/use-cases",
        "text": "We are a community of users ranging from independent developers to established enterprises powering …",
        "marker": true,
        "date": "",
        "thumbnail": "",
        "statusCode": 200
    },
    {
        "title": "Reset Password - Algorithmia",
        "url": "https://algorithmia.com/reset",
        "text": "Toggle navigation Algorithms requests Algorithms Docs More App Developers Getting Started Browse Use Cases Explore Algorithms Pricing …",
        "marker": true,
        "date": "",
        "thumbnail": "",
        "statusCode": 200
    },
    …
]
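
If you’d rather keep the results around for later analysis than print them to the terminal, you can write them to disk instead (output.json is just an example filename):

# Write the structured results to a file instead of printing them
with open('output.json', 'w') as f:
    json.dump(output, f, indent=4)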

And there you go. That’s how you use web scraping to crawl an entire website and turn it into structured data.

This script could be extended further to generate topic tags for each URL, perfect for competitive SEO research. Or, run each link through Share Counts to retrieve the Facebook, LinkedIn, and Pinterest social media engagement for each page.
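
Remember the broken-link report mentioned earlier? Since every result carries a statusCode, finding pages that didn’t return 200 is a simple filter over the output array (a minimal sketch using only the fields shown in the sample above):

# Report any crawled pages that didn't return HTTP 200
broken = [page for page in output if page.get('statusCode') != 200]
for page in broken:
    print('{} returned {}'.format(page['url'], page['statusCode']))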

Want to build your own content recommendation engine? There are just a few extra steps:

  • Run the text content for each page through the keywords for document set algorithm, a simple implementation of TF-IDF.
  • Generate recommendations using keyword set similarity to determine how closely related each page is (a rough sketch of this step follows the list).
  • Store it all in a SQL table, and incrementally add to it as new pages are added to your site.
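
To make the similarity step concrete, here’s a rough local stand-in: given a set of keywords per page, score every pair with Jaccard similarity and recommend the closest matches. This is not the marketplace algorithm, just a sketch of the idea, and the keyword sets below are placeholders for what a TF-IDF step would produce:

# Hypothetical keyword sets per URL -- in practice these would come from
# running a TF-IDF style keyword extraction over each page's text content
keywords = {
    "https://algorithmia.com/about": {"algorithm", "marketplace", "developers"},
    "https://algorithmia.com/use-cases": {"developers", "enterprise", "use-cases"}
}

def jaccard(a, b):
    # Size of the intersection divided by the size of the union
    return len(a & b) / float(len(a | b)) if (a | b) else 0.0

# Score every other page against each URL and keep the closest matches
for url, kw in keywords.items():
    scores = [(other, jaccard(kw, other_kw))
              for other, other_kw in keywords.items() if other != url]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    print('{} -> {}'.format(url, scores[:3]))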

Let us know what you think of our sitemap to analyze URL microservice @Algorithmia.

Tools Used:

  • Site Map – crawls a domain and returns a graph of its link structure
  • Analyze URL – extracts metadata and content from a web page
  • The Algorithmia Python client

The full script is available on GitHub, or just copy and paste this:

import Algorithmia
import json

# The domain to crawl and number of links deep
# More here: https://algorithmia.com/algorithms/web/SiteMap
input = ["http://algorithmia.com",1]

# Replace YOUR API KEY with your free Algorithmia API key
# https://algorithmia.com/signup
client = Algorithmia.client('YOUR API KEY')

# Here we call the Site Map algorithm
res = client.algo('web/SiteMap/0.1.7').pipe(input)

siteMap = res.result

links = []
output = []

# Iterate through the key-value pairs from the site map graph
# adding every URL to the links array
for keyLink in siteMap:
    links.append(keyLink)
    for valLink in siteMap[keyLink]:
        links.append(valLink)

# Remove duplicate links from the links array 
links = list(set(links))

# Iterate through the links calling Analyze URL on each 
# Then add the object to the output array
# More here: https://algorithmia.com/algorithms/web/AnalyzeURL
for l in links:
    analyze = client.algo('web/AnalyzeURL/0.2.14').pipe(l)
    output.append(analyze.result)

# Clean up JSON and print the result
print(json.dumps(output, indent=4))

# Run: python sitemap2analyzeUrl.py
