All posts by Ahmad AlNaimi

Using H2O.ai to Classify Domains in Production


Modern cyber attacks, such as botnets and ransomware, increasingly depend on (seemingly) randomly generated domain names. Those domains are used to establish command and control with their operators, a technique called domain fluxing. The recent WannaCry ransomware was famously stopped simply by registering one of those domain names.

The ability to quickly classify a domain name as *safe* or *malicious* is a critical task in the cybersecurity world. It can help alert security experts to suspicious activity, or even block that activity outright. Such a system has two requirements:

  • It needs to be accurate: you don’t want to block your users from accessing safe websites
  • It needs to be scalable: it must handle thousands of transactions per second

There are plenty of approaches to this problem, especially in the academic world (S. Yadav – 2010, J. Munro – 2013). The fine folks at H2O.ai also have an excellent code sample we found here. This blog post will briefly describe how H2O’s implementation works and how you can deploy and scale it on Algorithmia. Read More…
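
To make this concrete, below is a minimal sketch of a domain classifier built with H2O’s Python API. It is not H2O’s actual sample: the lexical features (label length, Shannon entropy, digit ratio) and the four-row toy dataset are illustrative assumptions, and a production model would be trained on a large labeled corpus.

```python
import math
from collections import Counter

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator

def entropy(s):
    """Shannon entropy of the characters in a string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def features(domain):
    # Hypothetical lexical features; H2O's sample uses its own feature set.
    label = domain.split(".")[0]
    return {
        "length": len(label),
        "entropy": entropy(label),
        "digit_ratio": sum(ch.isdigit() for ch in label) / len(label),
    }

# Toy labeled data -- purely illustrative.
domains = ["google.com", "wikipedia.org", "xjwqpzkd3f9a.net", "q1w2e3r4t5y6.biz"]
labels = ["safe", "safe", "malicious", "malicious"]

h2o.init()
cols = {"length": [], "entropy": [], "digit_ratio": [], "label": labels}
for d in domains:
    for k, v in features(d).items():
        cols[k].append(v)
frame = h2o.H2OFrame(cols)
frame["label"] = frame["label"].asfactor()  # classification, not regression

model = H2ORandomForestEstimator(ntrees=50, seed=42)
model.train(x=["length", "entropy", "digit_ratio"], y="label", training_frame=frame)

# Score an unseen domain.
test = h2o.H2OFrame({k: [v] for k, v in features("ryzzmbelgeuq.com").items()})
print(model.predict(test))
```

Once trained, a model like this still needs to be packaged and served behind an API that can sustain thousands of requests per second, which is the part that deploying on Algorithmia addresses.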

Building an Operating System for AI

This post is a summary of a talk Diego Oppenheimer recently gave at the 2017 GeekWire Cloud Tech Summit, titled “Building an Operating System for AI”. You can listen to the original talk here, and for a better view of the slides you can follow along here.

The operating system on your laptop is running tens or hundreds of processes concurrently. It gives each process just the resources it needs (RAM, CPU, I/O), isolates each one in its own virtual address space, locks it down to a set of predefined permissions, allows processes to communicate with one another, and allows you, the user, to safely monitor and control them. The operating system abstracts away the hardware layer (writing to a flash drive looks the same as writing to a hard drive), and it doesn’t care what programming language or technology stack you used to write those apps – it just runs them, smoothly and consistently.

As machine learning penetrates the enterprise, companies will soon find themselves productionizing more and more models, and at a faster clip. Deployment efficiency, resource scaling, monitoring, and auditing will become harder and more expensive to sustain over time. Data scientists from different corners of the company will each have their own preferred technology stacks (R, Python, Julia, TensorFlow, Caffe, Deeplearning4j, H2O, etc.), and data center strategies will shift from a single cloud to hybrid. Running, scaling, and monitoring heterogeneous models in a cloud-agnostic way is a responsibility analogous to an operating system’s – that’s what this talk is about. Read More…
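
As a hedged illustration of that uniform interface (the API key, algorithm paths, versions, and input formats below are placeholders, not specific endpoints from the talk), calling two completely different models through Algorithmia’s Python client looks identical regardless of the stack each model was built on:

```python
import Algorithmia

# Placeholder API key and algorithm paths -- purely illustrative.
client = Algorithmia.client("YOUR_API_KEY")

# One model might be TensorFlow, another R or Java; the caller can't tell,
# just as an OS runs processes without caring what language they're written in.
sentiment = client.algo("nlp/SentimentAnalysis/1.0.5")
summarizer = client.algo("nlp/Summarizer/0.1.8")

print(sentiment.pipe({"document": "Production ML is finally getting easier."}).result)
print(summarizer.pipe("A long article to be summarized...").result)
```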

Data Science & Machine Learning Platforms for the Enterprise


TL;DR A resilient Data Science Platform is a necessity for every centralized data science team within a large corporation. It helps them centralize, reuse, and productionize their models at peta-scale. We’ve built Algorithmia Enterprise for that purpose.


You’ve built that R/Python/Java model. It works well. Now what?

“It started with your CEO hearing about machine learning and how data is the new oil. Someone in the data warehouse team just submitted their budget for a 1PB Teradata system, the CIO heard that FB is using commodity storage with Hadoop and that it’s super cheap, and a perfect storm is unleashed: now you have a mandate to build a data-first innovation team. You hire a group of data scientists, everyone is excited, and people start coming to you for some of that digital magic to Googlify their business. Your data scientists don’t have any infrastructure and spend all their time building dashboards for the execs, the return on investment is negative, and everyone blames you for not pouring enough unicorn blood over their P&L.” – Vish Nandlall (source)

Sharing, reusing, and running models at peta-scale is not part of the data scientist’s workflow. This inefficiency is amplified in a corporate environment, where data scientists need to coordinate every move with IT, continuous deployment is a mess (if not impossible), reusability is low, and the pain snowballs as different corners of the company start to “Googlify their business”. Read More…