Algorithmia

Machine Learning Datasets For Data Scientists

Finding a good machine learning dataset is often the biggest hurdle a developer has to cross before starting any data science project. Whether you’re new to machine learning or a professional data scientist, the right dataset is the key to extracting actionable insights. We’ve compiled 11 of our favorite publicly available datasets below.


/r/datasets

Reddit’s /r/datasets provides a social way of sharing and requesting datasets. Here, you’ll find a wide variety of data, ranging from links to deep learning datasets, such as the Indian Movie Face database with images of 100 Indian actors from various movies, to a DMV dataset of the prohibited license plates in each state. This is probably the most random and unorganized compilation of datasets on the list, but it does turn up some serendipitous finds.


Natural Earth Data

Natural Earth is a relatively small collection of geospatial data that’s optimized for use in web mapping applications. The datasets include map data as both raster and vector layers, covering cultural data, countries, boundary lines, population attribution, water boundaries, and more. The vector data is available as ESRI shapefiles, while the raster data is available in TIFF format. This data source makes it easy to integrate mapping utilities into native or web apps.


UC Irvine Machine Learning Repository

The well-known and heavily used UCI Machine Learning Repository contains a variety of datasets useful for machine learning applications. Here you’ll find datasets categorized by task (e.g. classification, regression), data type (e.g. multivariate, time-series), and more. Recent datasets include air quality in Italy, student alcohol consumption, and GPS trajectories. Other interesting datasets include the chemical composition of wines grown in Italy, abalone physical characteristics, and heart disease data. Each dataset comes with rich metadata, including information about relevant papers, data sources, data types, and more. Datasets come in a variety of formats, including .CSV and .ZIP.
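CSV files from repositories like this one often ship without a header row, with the column names documented separately in the metadata. A minimal Python sketch of loading such a file, assuming a three-column numeric layout; the column names and the two sample rows are invented for illustration and do not come from any real UCI file:

```python
import csv
import io

# Hypothetical column names; in practice, take them from the dataset's metadata.
COLUMNS = ["length", "diameter", "rings"]

# Placeholder rows standing in for a downloaded file (not real data).
SAMPLE_FILE = "0.45,0.36,15\n0.35,0.26,7\n"


def load_rows(text, columns):
    """Parse header-less CSV text into a list of dicts with float values."""
    reader = csv.reader(io.StringIO(text))
    return [
        {name: float(value) for name, value in zip(columns, row)}
        for row in reader
        if row  # skip blank lines
    ]


rows = load_rows(SAMPLE_FILE, COLUMNS)
```

For a real download, replace SAMPLE_FILE with the contents of the file and adjust COLUMNS to match the dataset's documentation.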


Google Trends Datastore

Google Trends is perfect for exploratory and proof-of-concept exercises using search queries over time. Google Trends shows trending queries through visualizations, regional maps, interest over time, and query history. Refine your searches by region or by category, like business, entertainment, sports, and more. It’s also interesting to explore how different trends compare with each other over time, such as how the queries for climate change, global warming, recession, and economy compare with one another over the past ten years. The homepage shows real-time trending data, and all datasets are available to download as CSV files.
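Once a comparison has been downloaded as CSV, the terms can be compared with a few lines of Python. This is only a sketch: it assumes the export has been trimmed to a header row followed by one row per period, and the values below are invented placeholders rather than real Trends data.

```python
import csv
import io
from statistics import mean

# Placeholder content standing in for a trimmed export (values are invented).
SAMPLE_EXPORT = """period,climate change,global warming
2016-01,40,30
2016-02,50,20
2016-03,60,10
"""


def average_interest(text):
    """Return {term: mean value} for every column except the first (the period)."""
    reader = csv.DictReader(io.StringIO(text))
    _period, *terms = reader.fieldnames
    values = {term: [] for term in terms}
    for row in reader:
        for term in terms:
            values[term].append(float(row[term]))
    return {term: mean(series) for term, series in values.items()}


averages = average_interest(SAMPLE_EXPORT)
```

The resulting dictionary maps each search term to its average interest over the exported period, which makes it easy to rank terms against each other.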

Want more? Check out this guide to hacking the Google Trends API.


Machine Learning Data Set Repository

The Machine Learning Data Set Repository is a collection of datasets ranging from labor strike data to network analytics data. The datasets include metadata, like licensing, dependencies, and attribute types. They’re available in various formats like .ZIP, .TAR, .CSV, and .XML. The machine learning datasets are searchable, categorized, and labeled with star ratings, download counts, and comments, so finding what you need should be straightforward.


USGS.gov

The U.S. Geological Survey is a goldmine for natural resources and geological data. Explore topics ranging from biology to climate change to mineral data. The site also offers access to real-time data, scientific research data, and, of course, GIS datasets. USGS also provides access to various endpoints, like the Waterservices API, and offers a data catalog tool for browsing the geospatial and natural resource data. The datasets in the data catalog include biodiversity counts, groundwater depletion, geothermal data, and more. Each data portal includes extensive metadata documentation, and its datasets are available as shapefiles in .ZIP format or as raster datasets in .GXF format. Non-GIS data is available in .CSV, JSON, XML, and more.


Deep Learning Datasets

The Deep Learning dataset collection is, not surprisingly, perfect for deep learning algorithms! These datasets cover everything from symbolic music to natural images, faces, text, and speech. Most of these datasets are well known, such as Penn Treebank and MNIST, but having them all in one place is extremely helpful.


Pew Research Center Datasets

The Pew Research Center covers social and demographic trends with datasets on religion, politics, science, tech, and media. These machine learning datasets are based on citizen polls, surveys, and questionnaires. The datasets are available for download after filling out a basic form and accepting their use agreement. Since the data comes from polls, it usually consists of boolean and unstructured text data.


Open Data Network (Socrata)

The Socrata Open Data Network is great for finding and accessing open government data. Since government sites are notoriously hard to navigate, the Socrata API is a great way to find open datasets, although not all agencies, regions, and datasets are available. The Socrata API is also good for answering specific questions such as: What’s the population in Norman, OK? This will return a map and the population data. The API also lets you compare that data with similar towns across the country, and suggests other questions about the location in your query.
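Datasets hosted on Socrata-backed portals can typically be queried over HTTP. A rough Python sketch of building such a request, assuming a SODA-style endpoint of the form /resource/<dataset id>.json with $where and $limit query parameters; the domain, dataset id, and column name below are hypothetical, and the helper build_query_url is our own illustration, not part of any Socrata library:

```python
from urllib.parse import urlencode


def build_query_url(domain, dataset_id, where=None, limit=10):
    """Build a SODA-style query URL; nothing is sent over the network."""
    params = {"$limit": limit}
    if where:
        params["$where"] = where
    # Keep "$" unescaped so the parameter names stay readable.
    query = urlencode(params, safe="$")
    return f"https://{domain}/resource/{dataset_id}.json?{query}"


# Hypothetical portal, dataset id, and column name (assumptions for illustration).
url = build_query_url("data.example.gov", "abcd-1234", where="population > 100000")
```

The resulting URL can be passed to any HTTP client. Check the portal's own API documentation for the actual dataset ids and column names before relying on it.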


Open Data Stack Exchange

On Stack Exchange there’s a section called Open Data. It’s dedicated to questions and answers about where to find specific datasets, so browsing it is a good way to discover new ones. The questions range from the physical data of soccer players to data sources for disease counts organized by their ICD-10 codes. Not all questions posted here have been answered.


Data Is Plural

This Google Sheet is a compilation of links to interesting datasets and data sources. It’s regularly updated and also available as a weekly newsletter. Links to data sources range from Flint water samples (an entry that points to multiple data sources and datasets related to Flint water samples) to every place name in the US.


While this is just a sample of the machine learning datasets available, we hope it offers a starting point for finding the dataset you’re looking for. What’s your favorite source for datasets? Let us know @Algorithmia, and we’ll add it to the list!

When you’re ready to host your machine learning model, we’re here to help. We host scikit-learn, nltk, Caffe, TensorFlow, and Theano models, and turn them into scalable web services.