Algorithmia Blog

Incorporating Datasets from data.world in your Algorithms

We are happy to share a new set of algorithms from our partner data.world.  If you aren’t familiar with data.world, they have an amazing marketplace of datasets covering a wide spectrum of topics.  I can’t think of a better pairing for the data.world datasets than the many algorithms available in the Algorithmia.com directory!

It’s very easy to consume and publish datasets via Algorithmia.  The data.world team has published four helper utility algorithms that you can use in your own algorithms.  Since you can easily compose and chain algorithms, piping the output of one into the next, you have many possibilities for processing datasets from data.world.  You can find all of their new algorithms on their organization page on Algorithmia.com.
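As a rough illustration of chaining, here is a minimal sketch that feeds a value through a list of algorithms in order. The helper name chain_algorithms is hypothetical. It assumes only the client.algo(name).pipe(input).result call pattern that the example later in this post uses, and it assumes each algorithm accepts the previous algorithm's output as its input, which will not hold for every pair of algorithms.

```python
def chain_algorithms(client, algo_names, data):
    """Feed data through each named algorithm in order and return the final result.

    client is expected to behave like an Algorithmia client object;
    algo_names is a list of "owner/AlgorithmName" strings.
    """
    result = data
    for name in algo_names:
        # Same call pattern as the NBA example in this post:
        # client.algo(name).pipe(input).result
        result = client.algo(name).pipe(result).result
    return result
```

In practice the output of one algorithm often needs a little reshaping before it matches the next algorithm's expected input, so a real pipeline will usually have a few lines of glue code between the calls.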

First-Time Configuration

Before you start incorporating datasets or publishing to your datasets, you must configure your Algorithmia account to store your data.world credentials.  The easiest way is with the “data.world configure” helper algorithm, which stores your credentials in a known location in your Hosted Data storage on Algorithmia.com.  Call it once, and you’re all ready to go!

Using Open Datasets in Your Algorithm Pipeline

Once you have your credentials configured, you can simply call the “data.world query” helper algorithm to pull data from data.world directly into any algorithm you create.

Let’s put together a quick NBA Annual Team Attendance History analysis function that takes advantage of one of the most popular Time Series algorithms available on Algorithmia.com.  There’s an awesome dataset on data.world from Gabe Salzer with attendance metrics, where we can home in on the Lakers.

import Algorithmia

def apply(foo):
    # foo is the algorithm's input; this example does not use it
    client = Algorithmia.client()

    # SQL query against the data.world dataset
    query_input = {
        "dataset_key": "gmoney/nba-team-annual-attendance",
        "query": "SELECT home_total_attendance FROM `nba_team_annual_attendance` WHERE team='Lakers'",
        "query_type": "sql",
        "parameters": []
    }

    # load dataset
    algo = client.algo("datadotworld/query")
    dataset = algo.pipe(query_input).result["data"]

    # process dataset: collect one attendance value per row
    all_values = [d["home_total_attendance"] for d in dataset]
    metrics = client.algo("TimeSeries/TimeSeriesSummary").pipe({"uniformData": all_values}).result

    return metrics

This algorithm then returns a summary of Time Series metrics for the annual attendance of the Lakers:

{  
  "correlation": 0.33052435441847194,
  "geometricMean": 766172.0799467923,
  "intercept": 747591.4338235295,
  "kurtosis": 15.675479207885754,
  "max": 778877,
  "mean": 767146.0625000001,
  "min": 626901,
  "populationVariance": 1322295814.9335918,
  "rmse": 34319.671109063434,
  "skewness": -3.9439454970253087,
  "slope": 2607.283823529394,
  "standardDeviation": 37555.94319495252,
  "var": 1410448869.262498
}

Very nice!
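To make a few of those fields more concrete, here is a minimal local sketch that computes similar statistics with Python's standard library. It is not the TimeSeriesSummary implementation. It assumes the trend line is fitted against the index 0, 1, ..., n - 1, that "var" is the sample variance (dividing by n - 1), and that "populationVariance" divides by n. The output above is consistent with those assumptions for 16 data points: "var" is 16/15 times "populationVariance", and intercept + 7.5 × slope equals the mean. The function name summarize and the input values are hypothetical.

```python
import statistics


def summarize(values):
    """Compute a few summary fields for an evenly spaced series."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")

    mean = statistics.mean(values)

    # Least-squares line y = intercept + slope * x, with x = 0, 1, ..., n - 1.
    x_mean = (n - 1) / 2
    sxy = sum((x - x_mean) * (y - mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean - slope * x_mean

    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "populationVariance": statistics.pvariance(values),  # divides by n
        "var": statistics.variance(values),  # divides by n - 1
        "standardDeviation": statistics.stdev(values),  # square root of "var"
        "slope": slope,
        "intercept": intercept,
    }


# Hypothetical values, not taken from the data.world dataset.
summary = summarize([10, 12, 14, 16])
print(summary["mean"], summary["slope"], summary["intercept"])
```

A positive slope, as in the Lakers result, means attendance trends upward across the series, while the strongly negative skewness suggests that a small number of unusually low seasons pull the distribution to the left.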

There is a really awesome interactive Time Series algorithms demo if you want to check out how they work with other types of datasets.

I’m impressed by the number of datasets already available in the data.world directory.  We are excited to see how developers are going to leverage pairing the algorithms and microservices available from Algorithmia with the wealth of datasets at data.world!  To show how much we appreciate members of the data.world community, we have a special promo code to get 100,000 additional credits after signing up for Algorithmia: DATADOTWORLD

We’re looking forward to seeing how you make use of the new data.world functions!