
Acquiring Data for Document Classification

The Document Classifier is a powerful tool for generating keyword predictions for your documents, whether they are chat transcripts, emails, historical documents, scientific abstracts, or any number of other possible sources.

However, your predictions will only be as good as the dataset you’ve trained on. Since the Document Classifier algorithm supports retraining, you can build up your model in chunks right on the Algorithmia platform: grab a batch of data, train your classifier on it, then come back immediately afterward or weeks later to add even more training data.

Let’s take a look at one possible data source: we’ll use the publicly-available PubMed Entrez API to search for scientific articles by keyword. We’ll grab each article’s Abstract, using it as the “text” value in our dataset, and the keyword we searched on as the “label”. We’ll repeat this process for several different keywords, submitting each batch individually to the Document Classifier so that we stay within the maximum time limit of 50 minutes (3,000 seconds) per algorithm call.

Since both the PubMed API and Algorithmia are language-agnostic, we could use just about any programming language we want, but this time we’ll stick to a nice straightforward Python script.
 

Step 1: setup

If you already have an Algorithmia account, head to the Credentials page and copy your API key. If not, sign up using the code “docclassify” to get 100,000 free initial credits plus the usual 5,000 free credits monthly.

We’ll be using the Python libraries requests and BeautifulSoup to fetch and parse the data. If you don’t already have them, simply run pip install requests beautifulsoup4 lxml (the script uses BeautifulSoup’s lxml parser).

Now, on to the code… we’ll import the libraries we need (including the Algorithmia client), and set up some constants, including the list of keywords we’ll search PubMed for, the number of articles we’ll retrieve for each keyword, and a namespace that will identify this specific classifier. Make sure to replace ‘YOUR_API_KEY’ in the code with your own API Key:

import Algorithmia
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

#config
client = Algorithmia.client('YOUR_API_KEY')  #replace with your own API key
search_phrases = ['heart disease','cancer','stroke','alzheimer','diabetes','influenza','pneumonia','hiv','copd','hypertension']
articles_per_search = 100  #abstracts to retrieve per keyword
namespace = 'docclassify_pubmed'  #identifies this specific classifier
#esearch: finds IDs of articles that have an abstract and match our keyword
search_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax={}&term=science%5bjournal%5d+AND+hasabstract+AND+'.format(articles_per_search)
#efetch: retrieves the full record (including the abstract) for a single article ID
article_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id='

 

Step 2: retrieve and parse data

The Entrez API is a bit complex, but the search_url above is configured to find articles that have an Abstract (and match the science[journal] filter); we’ll append a specific keyword to it for each individual search. The esearch call returns (among other things) a list of Article IDs. By appending these IDs individually to the article_url above, we can then fetch the Abstract of each specific article.
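
For example, using the search_url and article_url constants defined above, with articles_per_search set to 100 and the search phrase ‘heart disease’, the two kinds of request end up looking like this (the article ID in the second URL is just a placeholder):

#esearch: returns an XML list of matching article IDs
print(search_url + quote_plus('heart disease'))
#https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax=100&term=science%5bjournal%5d+AND+hasabstract+AND+heart+disease

#efetch: returns the article's full XML record, including its abstract
print(article_url + '12345678')
#https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id=12345678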

Since Entrez returns XML by default (JSON is only available for a subset of their APIs), we’ll use BeautifulSoup to extract just the elements we need. The Article IDs are <id> elements inside an <idlist> (BeautifulSoup’s lxml parser lowercases tag names, which is why we search for idlist rather than IdList); we extract and iterate over them, retrieve the article details (article_url+article_id) for each one, and grab the text of the <abstracttext> element in that response, stripping out double-quotes, which can interfere with the text parsing.

Lastly, we add the Abstract and the keyword into a list called training_data, in the format which DocumentClassifier expects: {“text”:”(the Abstract’s text)”, “label”:”(the search keyword)”}
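
For example, after processing a couple of articles for the search phrase ‘cancer’, training_data would look something like this (the abstract snippets here are invented purely to illustrate the shape of the data):

training_data = [
   {'text': 'We evaluated tumor response rates in a cohort of ...', 'label': 'cancer'},
   {'text': 'Long-term survival was compared across treatment groups ...', 'label': 'cancer'}
]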

#examine each search phrase individually
for search_phrase in search_phrases:
	training_data = []
	#find IDs of articles matching search
	search_xml = requests.get(search_url+quote_plus(search_phrase)).content.decode('utf-8')
	search = BeautifulSoup(search_xml, 'lxml')
	article_ids = []
	for idlist in search.find_all('idlist'):
		for article_id in idlist.find_all('id'):
			article_ids.append(article_id.text)
	print('Found {} articles for {}'.format(len(article_ids),search_phrase))
	#extract abstracts from articles
	for article_id in article_ids:
		article_xml = requests.get(article_url+article_id).content.decode('utf-8')
		article = BeautifulSoup(article_xml, 'lxml')
		abstract = []
		for a in article.find_all('abstracttext'):
			abstract.append(a.text)
		abstract = ' '.join(abstract)
		if(len(abstract)==0):
			print('WARNING: no abstract for article {}'.format(article_id))
		else:
			training_data.append({'text':abstract.replace('"',''), 'label':search_phrase})
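
One practical note: this loop makes a separate efetch request for every article, so it’s worth being polite to the Entrez servers. Below is a minimal sketch of how we might throttle and retry those requests; the fetch_xml helper is our own addition (not part of the Entrez or Algorithmia APIs), and assumes we simply skip any article that fails to download:

import time

def fetch_xml(url, pause=0.4, retries=3):
	#our own helper: throttle and retry a GET request
	for attempt in range(retries):
		time.sleep(pause)  #brief pause before every request, to be polite to the Entrez servers
		try:
			response = requests.get(url, timeout=30)
			response.raise_for_status()
			return response.content.decode('utf-8')
		except requests.RequestException:
			pass  #transient failure: loop around and try again
	return None  #give up on this article after several failed attempts

#inside the article loop, we could then write:
#	article_xml = fetch_xml(article_url+article_id)
#	if article_xml is None:
#		continue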

 

Step 3: train the classifier

Now that we’ve built up our training_data, we simply send it to nlp/DocumentClassifier, along with the namespace we want to use, and “mode”:”train” to indicate that we’re training (or retraining) the classifier.

Note that this segment of code goes inside the loop above, so it runs once for each keyword (see it in context at the end of this post). Also take note of set_options(timeout=3000), which raises the call’s timeout to 50 minutes per execution. Most of the time this will be sufficient for up to several hundred documents, but if we did start hitting timeouts, we’d simply break the training_data into smaller chunks and run them one after another (a sketch of that appears just after the code below):

	#train on this search_phrase
	training_input = {  
	   'data':training_data,
	   'namespace':'data://.my/'+namespace,
	   'mode':'train'
	}
	client.algo('nlp/DocumentClassifier/0.3.0').set_options(timeout=3000).pipe(training_input)
	print('Trained {}'.format(search_phrase))
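
If a batch ever did run long, a simple way to split it up might look like the following, replacing the single pipe() call above; chunk_size is just an illustrative number, not anything prescribed by the algorithm:

	#optional: train in smaller chunks instead of a single large call
	chunk_size = 250  #illustrative only; tune based on how close each call gets to the timeout
	for start in range(0, len(training_data), chunk_size):
		chunk_input = {
			'data': training_data[start:start+chunk_size],
			'namespace': 'data://.my/'+namespace,
			'mode': 'train'
		}
		client.algo('nlp/DocumentClassifier/0.3.0').set_options(timeout=3000).pipe(chunk_input)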

 

Step 4: make a prediction!

We’re done training! That took a little while, but getting individual predictions is now relatively fast. To get a prediction, we simply call nlp/DocumentClassifier again, but this time we only supply some “text” (not a “label”), and we set “mode”:”predict”:

#test a prediction
testing_input = {
   'data':[{'text':'This treatment targets several specific tumor antigens'}],
   'namespace':'data://.my/'+namespace,
   'mode':'predict'
}
print(client.algo('nlp/DocumentClassifier/0.3.0').pipe(testing_input))

This gives us back a list of predicted keywords and a confidence for each, so now we can automatically categorize any new Abstracts or similar scientific text we come across! We might use it for building a website or other structured repository, for automatically flagging new articles or messages of interest, or for evaluating whether a draft document will be labeled appropriately by humans or other software.
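
If you want to work with the result programmatically rather than just print it, the Algorithmia Python client exposes the parsed output on the response’s result attribute. The exact shape of DocumentClassifier’s output is something to confirm in the algorithm’s documentation; the commented-out lines below assume a list of label/confidence dicts, which is purely an assumption on our part:

response = client.algo('nlp/DocumentClassifier/0.3.0').pipe(testing_input)
predictions = response.result  #the parsed output of the algorithm call

#assuming each prediction is a dict with 'label' and 'confidence' keys (check the algorithm docs):
#best = max(predictions, key=lambda p: p['confidence'])
#print('Top prediction: {} ({:.2f})'.format(best['label'], best['confidence']))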

 

Follow-up: find your own API data source, or pull from webpages

Now that we’ve tried grabbing data from one API, try finding another and adapting the code. It could be a JSON data API (in which case you can use json.loads instead of BeautifulSoup), or something much more complicated… perhaps the Gmail API (your folders become the keywords) or your company’s chat logs with assigned tags. The possibilities are endless, so long as your data has a keyword/label for each block of text.
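
As a rough sketch, if your alternative source returned JSON records, the adaptation could be as small as the snippet below; the endpoint URL and the ‘body’ and ‘tag’ field names are hypothetical stand-ins for whatever your API actually provides:

import json
import requests

#hypothetical endpoint and field names: adjust to match your API's actual response
records = json.loads(requests.get('https://example.com/api/messages').text)
training_data = [{'text': r['body'], 'label': r['tag']} for r in records]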

If the data you want to grab exists on the web but doesn’t have an API, you can use a tool such as the HTMLDataExtractor, which allows you to perform XPath Queries on a URL, so you can drill down to specific elements on the page and get them back as structured JSON.

We hope you enjoy making use of the Document Classifier, and we look forward to hearing how you put it to work for you!

 

Appendix: the complete code

Grab the full code here or on GitHub.

import Algorithmia
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

#config
client = Algorithmia.client('YOUR_API_KEY')  #replace with your own API key
search_phrases = ['heart disease','cancer','stroke','alzheimer','diabetes','influenza','pneumonia','hiv','copd','hypertension']
articles_per_search = 100  #abstracts to retrieve per keyword
namespace = 'docclassify_pubmed'  #identifies this specific classifier
#esearch: finds IDs of articles that have an abstract and match our keyword
search_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmax={}&term=science%5bjournal%5d+AND+hasabstract+AND+'.format(articles_per_search)
#efetch: retrieves the full record (including the abstract) for a single article ID
article_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&rettype=abstract&id='

#examine each search phrase individually
for search_phrase in search_phrases:
	training_data = []
	#find IDs of articles matching search
	search_xml = requests.get(search_url+quote_plus(search_phrase)).content.decode('utf-8')
	search = BeautifulSoup(search_xml, 'lxml')
	article_ids = []
	for idlist in search.find_all('idlist'):
		for article_id in idlist.find_all('id'):
			article_ids.append(article_id.text)
	print('Found {} articles for {}'.format(len(article_ids),search_phrase))
	#extract abstracts from articles
	for article_id in article_ids:
		article_xml = requests.get(article_url+article_id).content.decode('utf-8')
		article = BeautifulSoup(article_xml, 'lxml')
		abstract = []
		for a in article.find_all('abstracttext'):
			abstract.append(a.text)
		abstract = ' '.join(abstract)
		if(len(abstract)==0):
			print('WARNING: no abstract for article {}'.format(article_id))
		else:
			training_data.append({'text':abstract.replace('"',''), 'label':search_phrase})
	#train on this search_phrase
	training_input = {  
	   'data':training_data,
	   'namespace':'data://.my/'+namespace,
	   'mode':'train'
	}
	client.algo('nlp/DocumentClassifier/0.3.0').set_options(timeout=3000).pipe(training_input)
	print('Trained {}'.format(search_phrase))

#test a prediction
testing_input = {
   'data':[{'text':'This treatment targets several specific tumor antigens'}],
   'namespace':'data://.my/'+namespace,
   'mode':'predict'
}
print(client.algo('nlp/DocumentClassifier/0.3.0').pipe(testing_input))