Algorithmia Blog - Deploying AI at scale

Use the 404 Error Scanner Algorithm to Find Broken Links

The 404 Error Scanner microservice is a useful tool for developers, data analysts, and content managers who need to validate URLs and keep tabs on broken links in a scalable and systematic way. This microservice makes finding and fixing broken links easy. It’s perfect for enterprise sites, social networks, blogs, and publishers.

What the 404 Error Scanner Does

The 404 error scanner microservice uses the site mapper algorithm to crawl through the pages on your website.

It then checks the status code of each URL it finds. If a URL doesn’t return a 100-, 200-, or 300-level status code, the algorithm presumes it is broken, meaning it returned a 400- or 500-level error.
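The rule above can be sketched in a few lines of Python. This is only an illustration of the classification described here, not the microservice's actual source; the function name is_broken is made up for the example.

```python
def is_broken(status_code: int) -> bool:
    """Treat anything outside the 1xx-3xx range as a broken link.

    Hypothetical helper: it mirrors the rule described in the post,
    not the scanner's real implementation.
    """
    return not (100 <= status_code < 400)


# A redirect counts as fine; a missing page or server error does not.
for code in (200, 301, 404, 500):
    print(code, "broken" if is_broken(code) else "ok")
```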

Because the scanner crawls every URL it finds within a domain, the algorithm can take a considerable amount of time to process. For large sites, this could take 10 minutes or more to run.

Why You Need To Check Broken Links

Broken links within your site can lead to a poor and frustrating user experience. They can also damage your SEO and hurt your ability to attract new customers. Plus, broken links are annoying!

Use the 404 error scanner to:

  • Identify link rot on your site
  • Improve search engine rankings
  • Provide a positive experience to users and customers

How to Check for Dead Links

This microservice is provided as an API endpoint that can be implemented in five lines of code. To use it, you’ll need a free Algorithmia account. Next, choose and install the Algorithmia client for the programming language you’re comfortable with.

Now you’re ready to make your first API call. All you need is the base URL of the site you want to check. And… that’s it.

Sample API Call

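import Algorithmia

# Base URL to crawl and how many levels of links to follow
scan_input = {
  "url": "https://blog.algorithmia.com/",
  "depth": 2
}
client = Algorithmia.client('API KEY')  # replace with your own API key
algo = client.algo('web/ErrorScanner/0.1.2')
# pipe() returns a response object; .result holds the JSON output
print(algo.pipe(scan_input).result)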

Sample Output

{
  "brokenLinks": [
    {
      "brokenLink": "http://developers.algorithmia.com/algorithm-development/guides/nltk-guide/",
      "refPage": "https://blog.algorithmia.com//page/3"
    },
    {
      "brokenLink": "https://algorithmia.com/data#add-stream",
      "refPage": "https://blog.algorithmia.com//page/3"
    },
    {
      "brokenLink": "http://developers.algorithmia.com/algorithm-development/guides/scikit-guide/",
      "refPage": "https://blog.algorithmia.com//page/3"
    },
    ...
    {
      "brokenLink": "http://developers.algorithmia.com/algorithm-development/guides/scikit-guide/",
      "refPage": "https://blog.algorithmia.com/machine-learning-trends-future-artificial-intelligence-2016/"
    }
  ]
}

The 404 Error Scanner returns an array of objects, each containing a broken link and the page it was found on. This makes it easy to identify and fix dead links on your website.
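Since the same page often contains several dead links, it can help to group the results by page before fixing them. Here is a minimal sketch that assumes the output shape shown in the sample above; group_by_page is a hypothetical helper, and the sample dictionary is a shortened copy of that output.

```python
from collections import defaultdict


def group_by_page(result):
    """Map each referring page to the list of broken links found on it."""
    pages = defaultdict(list)
    for entry in result.get("brokenLinks", []):
        pages[entry["refPage"]].append(entry["brokenLink"])
    return dict(pages)


# Shortened copy of the sample output above
sample = {
    "brokenLinks": [
        {
            "brokenLink": "http://developers.algorithmia.com/algorithm-development/guides/nltk-guide/",
            "refPage": "https://blog.algorithmia.com//page/3",
        },
        {
            "brokenLink": "https://algorithmia.com/data#add-stream",
            "refPage": "https://blog.algorithmia.com//page/3",
        },
    ]
}

for page, links in group_by_page(sample).items():
    print(page)
    for link in links:
        print("   ", link)
```

Grouping this way gives you one to-do list per page rather than one per link.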

The microservice follows every link it finds and checks whether each URL returns a valid status code. By default, it checks the first two levels of links: it starts on your homepage and follows all the links it finds there (the first level), then follows each link it finds on those pages (the second level). You can tune this by setting the depth parameter.
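To make the depth parameter concrete, here is a rough sketch of a depth-limited, breadth-first walk over a toy site. It is an assumption about how such a crawl could work, not the scanner's real code: links_to_check is a hypothetical function, and the site dictionary stands in for real pages so the example runs without network access.

```python
from collections import deque


def links_to_check(start, links_on_page, depth=2):
    """Return every link discovered within `depth` levels of `start`.

    `links_on_page` maps a page to the links it contains (a stand-in
    for actually fetching and parsing the page).
    """
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        page, level = queue.popleft()
        if level == depth:
            continue  # deep enough: do not follow this page's links
        for link in links_on_page.get(page, []):
            if link not in seen:
                seen.add(link)
                found.append(link)
                queue.append((link, level + 1))
    return found


# Toy site: "/" is the homepage
site = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/a", "/d"],
    "/c": ["/e"],
}

print(links_to_check("/", site, depth=1))  # ['/a', '/b']
print(links_to_check("/", site, depth=2))  # ['/a', '/b', '/c', '/d']
```

In this toy site, "/e" is three levels from the homepage, so it is only reached with a depth of 3 or more. A larger depth finds more links but takes longer to run.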

The 404 Error Scanner is a useful tool that developers, data analysts, and content managers can integrate into their sites to catch link rot before it becomes a problem. Give it a try and let us know what you think @Algorithmia.

Product manager at Algorithmia helping to give developers super powers.
