Apache Spark is one of the most popular tools for large-scale data processing. It supports ingestion, aggregation, analysis, and more on massive amounts of data, and it has been widely adopted by data engineers and other data professionals.
With Spark Streaming and Spark SQL, you can perform ETL operations in real time on data coming from a variety of sources, such as Kafka or Flume. And if you want to do some basic machine learning, you can do that with Spark ML, a library that brings core statistical models like k-means clustering and decision trees to users through a high-level API.
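To make the k-means idea concrete, here is a toy, pure-Python sketch of the algorithm that Spark ML's KMeans runs at scale. This is illustrative only, not Spark code: points are one-dimensional for brevity, whereas a real Spark job would cluster feature vectors in a DataFrame.

```python
# Toy k-means: alternate between assigning points to their nearest
# centroid and recomputing each centroid as the mean of its points.

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            for p in points]

def update(points, labels, k):
    """Recompute each centroid as the mean of its assigned points."""
    sums, counts = [0.0] * k, [0] * k
    for p, lbl in zip(points, labels):
        sums[lbl] += p
        counts[lbl] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def kmeans(points, centroids, iters=10):
    """Run a fixed number of assign/update iterations."""
    for _ in range(iters):
        centroids = update(points, assign(points, centroids), len(centroids))
    return centroids
```

Spark's version distributes the assignment and averaging steps across the cluster, but the underlying loop is the same.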
But what if you want to analyze thousands of tweets in real time, yet you don't have a trained model to discover the sentiment of those tweets? Or maybe you want to classify documents on the fly, remove profanity from text, or detect nudity in images?
Algorithmia’s catalog of over 4,000 pre-trained models and functions covers all of the above use cases and perfectly complements Spark’s core functionality. These pre-trained models integrate easily into Spark via REST API endpoints. And just like Spark, Algorithmia has Python, R, Java, and Scala clients, so you can stay in the language you’re familiar with while building robust machine learning and deep learning pipelines that scale with your data.
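As a sketch of what that REST integration looks like, the snippet below builds (but does not send) a POST request in the shape Algorithmia's HTTP API expects. The algorithm path `nlp/SentimentAnalysis/1.0.5` and the API key are illustrative placeholders; in practice you would more likely use one of the official clients.

```python
import json
import urllib.request

API_BASE = "https://api.algorithmia.com/v1/algo"  # Algorithmia's documented endpoint pattern

def build_algorithm_request(algo_path, payload, api_key):
    """Construct a POST request for an Algorithmia-style REST endpoint.

    Returned unsent so it can be inspected; pass it to
    urllib.request.urlopen() to actually call the service.
    """
    return urllib.request.Request(
        f"{API_BASE}/{algo_path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Simple " + api_key,  # Algorithmia's simple-key auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Hypothetical sentiment-analysis call, e.g. on a tweet pulled from a Spark stream:
req = build_algorithm_request(
    "nlp/SentimentAnalysis/1.0.5",
    {"document": "I love Spark"},
    "YOUR_API_KEY",
)
```

From a Spark job you would typically wrap a call like this in a `mapPartitions` step so each executor reuses one connection for its batch of records.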
Let’s play a game: can you tell the difference between these two sentences?
“Most of the time, travellers worry about their luggage.”
“Most of the time travellers worry about their luggage.”
Whoa, remove the comma and all of a sudden we’re having an entirely different conversation!
The little nuances of language can be hard enough for a human to understand, let alone a computer! How could we possibly teach a computer to understand the difference?
Just like a music producer creates a beat, then combines it with instrumentals and a bassline to form something catchy that lyrics can be applied to… developers need a way to compose algorithms together in a clean and elegant way.
Whether you’re creating a sentiment analysis pipeline for your social data or doing image processing on thousands of photos, you’ll need an easy way to combine the various tools available so you aren’t writing spaghetti code.
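One lightweight way to avoid that spaghetti is to compose processing steps as plain functions. A minimal sketch, where the stages are hypothetical stand-ins for real cleaning and analysis calls:

```python
from functools import reduce

def pipeline(*steps):
    """Compose unary functions left-to-right into a single callable."""
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)

# Hypothetical stages of a text pipeline; in a real system each stage
# might be a local function or a remote model call.
clean = str.strip
lowercase = str.lower
tokenize = str.split

process = pipeline(clean, lowercase, tokenize)
```

Each stage stays independently testable and swappable, which is exactly the property you want when some stages are local code and others are remote model endpoints.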
It isn’t always easy to combine the libraries you need. Sometimes a library or machine learning model is written in a different language than the one you’re using. Other times there might simply be a performance difference between languages (which is why we chose Rust to create a Video Metadata Extraction pipeline). And even though GitHub offers thousands of libraries, frameworks, and models to choose from, it’s sometimes difficult to find the one you need to solve your problem.
To solve these problems, and to let you write elegant code while using machine learning models, Algorithmia provides an easy way to find, combine, and reuse models regardless of language. Each one gets a REST API endpoint, so you can mix and match them with each other and with external code. Read More…
The new Open Images dataset gives us everything we need to train computer vision models, and it happens to be perfect for a demo! TensorFlow’s Object Detection API and its ability to handle large volumes of data make it a perfect choice, so let’s jump right in…
We are happy to share a new set of algorithms from our partner data.world. If you aren’t familiar with data.world, they host an amazing marketplace of datasets covering a wide spectrum of domains. I can’t think of a better pairing for data.world’s datasets than the many algorithms available in the Algorithmia.com directory! Read More…