In 1996 Larry Page and Sergey Brin started BackRub, a research project about a new kind of search engine. The concept was a link analysis algorithm that measured the relative importance of each item within a set. Based on this one algorithm, the company Google was created, and PageRank became one of the most famous algorithmic concepts in history.
Knowing how important it is to be indexed for the right thing, we here at Algorithmia were inspired when one of our users added an implementation of PageRank to the marketplace (note that Google moved beyond its original algorithm more than a decade ago). We realized that by combining the PageRank algorithm with other algorithms available in the Algorithmia API, we would be able to get an idea of how search engines view a site or domain. So we’ve essentially integrated every intermediate step that a crawler goes through when examining a site, using the algorithms already available in our marketplace, which allows us to understand how machines see the web.
Understanding how pages are linked to each other on a site gives us a snapshot of the connectedness of a domain, and a glimpse into how important a search engine might consider each individual page. Additionally, a machine-generated summary and topic tags give us an even better picture of how a web crawler might classify your domain.
Modern web crawlers use sophisticated algorithms to follow links across the entire internet, but even limiting our application to the pages of a single domain gives us significant insight into how a site might look to one of these crawlers.
Building a site explorer
We broke up the task of exploring a site into three steps:
- Crawl the site and retrieve as a graph
- Analyze the connectivity of the graph
- Analyze each node for its content
We found algorithms for each part already in the Algorithmia API, which allowed us to build out the pipeline quickly:
- GetLinks (retrieves all the URLs at the given URL)
- PageRank (simple implementation of the Backrub algorithm)
- Url2text (converts documents and strips tags to return the text from any URL)
- AutoTag (uses Latent Dirichlet Allocation from MALLET to produce topic tags)
Here is the result (click on a node for more info):
The individual steps
We built this app using three core technologies: AngularJS, D3.js, and the Algorithmia API.
The first thing we needed to do was crawl the supplied domain. The user chooses the number of pages to crawl (capped at 120 in our demo, since beyond that point your browser will really start hurting). For each page crawled, we retrieve every single link on that page and plot it on the graph. Then, once we have reached the maximum number of pages to crawl, we apply PageRank to the result.
Get the links:
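As a rough illustration of this step, here is a minimal sketch of cleaning up the links that come back for one page. The function name `sameDomainLinks` is hypothetical, `rawLinks` stands in for whatever the GetLinks algorithm returns, and keeping only same-domain links is an assumption based on the single-domain limit described earlier.

```javascript
// Hypothetical helper: normalise the links found on one page and keep
// only those that stay on the same host as the page itself.
// `rawLinks` stands in for the output of the GetLinks algorithm (assumed
// here to be an array of URL strings, absolute or relative).
function sameDomainLinks(pageUrl, rawLinks) {
  const origin = new URL(pageUrl);
  const seen = new Set();
  for (const href of rawLinks) {
    let target;
    try {
      target = new URL(href, origin); // resolves relative links against the page
    } catch (e) {
      continue; // skip anything that is not a parseable URL
    }
    if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
    if (target.hostname !== origin.hostname) continue;
    target.hash = ''; // /page#a and /page#b are the same node in the graph
    seen.add(target.href);
  }
  return Array.from(seen);
}
```

Dropping fragments and duplicates up front keeps the graph from filling with several nodes that all point at the same page.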
Iterate over site:
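A minimal sketch of the crawl loop, assuming a breadth-first order (the original order is not stated). `crawlSite` is a hypothetical name, and `getLinks` is a hypothetical async function that stands in for the call to the GetLinks algorithm and resolves to an array of URLs.

```javascript
// Breadth-first crawl sketch.
// `getLinks` is a hypothetical function (url) => Promise<string[]> that
// stands in for the GetLinks algorithm call; it is injected so the loop
// itself has no dependency on any particular API client.
async function crawlSite(startUrl, maxPages, getLinks) {
  const graph = {};              // page URL -> array of outgoing link URLs
  const queue = [startUrl];
  const queued = new Set(queue); // every URL discovered so far
  let crawled = 0;

  while (queue.length > 0 && crawled < maxPages) {
    const url = queue.shift();
    let links;
    try {
      links = await getLinks(url);
    } catch (e) {
      links = []; // a page that fails to load simply has no outgoing links
    }
    graph[url] = links;
    crawled += 1;
    for (const link of links) {
      if (!queued.has(link)) {
        queued.add(link);
        queue.push(link);
      }
    }
  }

  // Pages that were linked to but never crawled still become nodes.
  for (const url of queued) {
    if (!(url in graph)) graph[url] = [];
  }
  return graph;
}
```

The result is a plain adjacency map, which is easy to hand to a ranking step and to a graph renderer.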
Apply PageRank:
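To make the ranking step concrete, here is a small power-iteration sketch of the PageRank idea. It is not the marketplace implementation: the function name, the damping factor of 0.85, the iteration count, and the handling of pages without outgoing links are all assumptions chosen for illustration. The input is an adjacency map of the form `{ page: [pages it links to] }`.

```javascript
// Power-iteration PageRank sketch over an adjacency map.
// Assumptions (not from the original app): damping = 0.85, a fixed
// number of iterations, and dangling pages spread their rank evenly.
function pageRank(graph, damping = 0.85, iterations = 50) {
  const nodes = Object.keys(graph);
  const n = nodes.length;
  const has = (key) => Object.prototype.hasOwnProperty.call(graph, key);

  let rank = {};
  for (const node of nodes) rank[node] = 1 / n; // start from a uniform score

  for (let i = 0; i < iterations; i++) {
    const next = {};
    for (const node of nodes) next[node] = (1 - damping) / n;

    let dangling = 0; // total rank held by pages with no outgoing links
    for (const node of nodes) {
      const out = graph[node].filter(has); // ignore links to unknown nodes
      if (out.length === 0) {
        dangling += rank[node];
        continue;
      }
      const share = (damping * rank[node]) / out.length;
      for (const target of out) next[target] += share;
    }

    // Redistribute dangling rank so the scores keep summing to 1.
    for (const node of nodes) next[node] += (damping * dangling) / n;
    rank = next;
  }
  return rank;
}
```

For example, `pageRank({ a: ['b'], b: ['a'], c: ['a'] })` gives `a` the highest score, because both other pages link to it, and `c` the lowest, because nothing links to it.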
Once the graph is built, we render it using a D3 force layout. Clicking on any individual node retrieves the content from that page, cleans up the HTML so we are left with just the text, and processes the text through both the summarizer and topic tagger algorithms.
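A force layout works from a list of nodes and a list of links, so the crawl result has to be reshaped before rendering. The sketch below shows one way to do that, assuming the crawl produced an adjacency map `{ url: [linked urls] }` and the ranking step produced a map `{ url: score }`. The name `toForceGraph` and the `url` and `rank` fields are hypothetical; links refer to nodes by their index in the `nodes` array.

```javascript
// Reshape an adjacency map plus rank scores into { nodes, links }.
// Each link refers to its endpoints by index into the `nodes` array.
function toForceGraph(graph, ranks) {
  const urls = Object.keys(graph);
  const index = new Map(urls.map((url, i) => [url, i]));

  // One node per page; the rank can later drive the circle size or colour.
  const nodes = urls.map((url) => ({ url: url, rank: ranks[url] || 0 }));

  const links = [];
  for (const url of urls) {
    for (const target of graph[url]) {
      // Skip links that point at pages which are not nodes in the graph.
      if (index.has(target)) {
        links.push({ source: index.get(url), target: index.get(target) });
      }
    }
  }
  return { nodes: nodes, links: links };
}
```

Keeping the rank on each node object means the click handler and the renderer can both read it without another lookup.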
Really, the hardest part about building this was figuring out the quirks of D3, since the Algorithmia API just allowed us to stitch together all the algorithms we wanted for the process and start using them, without worrying about a single download or dependency.
Don’t take our word for it: try it yourself. We have made the AngularJS code available here, so feel free to fork it, modify it, and use it in your own applications.
It’s really easy to expand the capabilities of the site mapper. Here are some ideas:
- Sentiment by Term (understand how a term is seen across the site, e.g. Apple Watch on GeekWire)
- Count Social Shares (understand how popular any link-node is on a number of social media sites)
- many more that can be found in Algorithmia…
Made it this far?
If you tweet “I am [username] on @algorithmia” we will add another free 5k credits to your account.
– Diego (@doppene)