All posts by Algorithmia

The state of machine learning in financial services

Machine learning for finance

The financial services industry has often been at the forefront of using new technology to solve business problems. It’s no surprise that many firms in this sector are embracing machine learning, especially now that increased compute power, network connectivity, and cloud infrastructure are cheaper and more accessible. 

This post will detail five important machine learning use cases that are currently providing value within financial services organizations. 

Fraud detection 

The cost of financial fraud for a financial services company jumped 9 percent between 2017 and 2018, resulting in a cost of $2.92 for every dollar of fraud. We have previously discussed machine learning applications in fraud detection in detail, but it’s worth mentioning some additional reasons why this is one of the most important applications for machine learning in this sector. 

Most fraud prevention models are based on a set of human-created rules that result in a binary classification of “fraud” or “not fraud.” The problem with these models is that they can create a high number of false positives. It’s not good for business when customers receive an abnormally high number of unnecessary fraud notifications. Trust is lost, and actual fraud may continue to go on undetected. 

Machine learning clustering and classification algorithms can help reduce the problem of false positives. They continually modify the profile of a customer whenever they take a new action. With these multiple points of data, the machine can take a nuanced approach to determine what is normal and abnormal behavior. 

Creditworthiness

Creditworthiness is a natural and obvious use of machine learning. For decades, banks have used very rudimentary logistic regression models with inputs like income 30-60-90-day payment histories to determine likelihood of default, or the payment and interest terms of a loan. 

The logistic model can be problematic as it can penalize individuals with shorter credit histories or those who work outside of traditional banking systems. Banks also miss out on additional sources of revenue from rejected borrowers who would likely be able to pay.

With the growing number of alternative data points about individuals related to their financial histories (e.g., rent and utility bill payments or social media actions), lenders are able to use more advanced models to make more personalized decisions about creditworthiness. For example, a 2018 study suggests that a neural network machine learning model may be more accurate at predicting likelihood of default as compared to logistic regression or decision-tree modeling. 

Despite the optimism around increased equitability for customers and a larger client base for banks, there is still some trepidation around using black box algorithms for making lending decisions. Regulations, including the Fair Credit Reporting Act, require creditors to give individuals specific reasons for an outcome. This has been a challenge for engineers working with neural networks. 

Credit bureau Equifax suggests that it has found a solution to this problem, releasing a “regulatory-compliant machine learning credit scoring system” in 2018. 

Algorithmic trading

Simply defined, algorithmic trading is automated trading using a defined set of rules. A basic example would be a trader setting up automatic buy and sell rules when a stock falls below or rises above a particular price point. More sophisticated algorithms exploit arbitrage opportunities or predict stock price fluctuations based on real-world events like mergers or regulatory approvals. 

The previously mentioned models require thousands of lines of human-written code and have become increasingly unwieldy. Relying on machine learning makes trading more efficient and less prone to mistakes. It is particularly beneficial in high frequency trading, when large volumes of orders need to be made as quickly as possible. 

Automated trading has been around since the 1970s, but only recently have companies had access to the technological capabilities able to handle advanced algorithms. Many banks are investing heavily in machine learning-based trading. JPMorgan Chase recently launched a foreign exchange trading tool that bundles various algorithms including time-weighted average price and volume-weighted average price along with general market conditions to make predictions on currency values.

Robo-advisors

Robo-advisors have made investing and financial decision-making more accessible to the average person. Their investment strategies are derived from an algorithm based on a customer’s age, income, planned retirement date, financial goals, and risk tolerance. They typically follow traditional investment strategies and asset allocation based on that information. Because robo-advisors automate processes, they also eliminate the conflict of financial advisors not always working in a client’s best interest.

While robo-advisors are still a small portion of assets under management by financial services firms ($426 billion in 2018), this value is expected to more than triple by 2023. Customers are enticed by lower account minimums (sometimes $0), and wealth management companies save on the costs of employing human financial advisors. 

Cybersecurity and threat detection 

Although not unique to the financial services industry, robust cybersecurity protocols are absolutely necessary to demonstrate asset safety to customers. This is also a good use case to demonstrate how machine learning can play a role in assisting humans rather than attempting to replace them. Specific examples of how machine learning is used in cybersecurity include: 

Malware detection: Algorithms can detect malicious files by flagging never-before-seen software attempting to run as unsafe. 

Insider attacks: Monitoring network traffic throughout an organization looking for anomalies like repeated attempts to access unauthorized applications or unusual keystroke behavior

In both cases, the tedious task of constant monitoring is taken out of the hands of an employee and given to the computer. Analysts can then devote their time to conducting thorough investigations and determining the legitimacy of the threats.

It will be important to watch the financial sector closely because its use of machine learning and other nascent applications will play a large role in determining those technologies’ use and regulation across countless other industries.

Best Practices in Machine Learning Infrastructure

Topographic map with binary tree

Developing processes for integrating machine learning within an organization’s existing computational infrastructure remains a challenge for which robust industry standards do not yet exist. But companies are increasingly realizing that the development of an infrastructure that supports the seamless training, testing, and deployment of models at enterprise scale is as important to long-term viability as the models themselves. 

Small companies, however, struggle to compete against large organizations that have the resources to pour into the large, modular teams and processes of internal tool development that are often necessary to produce robust machine learning pipelines.

Luckily, there are some universal best practices for achieving successful machine learning model rollout for a company of any size and means. 

The Typical Software Development Workflow

Although DevOps is a relatively new subfield of software development, accepted procedures have already begun to arise. A typical software development workflow usually looks something like this:

Software Development Workflow: Develop > Build > Test > Deploy

This is relatively straightforward and works quite well as a standard benchmark for the software development process. However, the multidisciplinary nature of machine learning introduces a unique set of challenges that traditional software development procedures weren’t designed to address. 

Machine Learning Infrastructure Development

If you were to visualize the process of creating a machine learning model from conception to production, it might have multiple tracks and look something like these:

Machine Learning Development Life Cycle

Data Ingestion

It all starts with data.

Even more important to a machine learning workflow’s success than the model itself is the quality of the data it ingests. For this reason, organizations that understand the importance of high-quality data put an incredible amount of effort into architecting their data platforms. First and foremost, they invest in scalable storage solutions, be they on the cloud or in local databases. Popular options include Azure Blob, Amazon S3, DynamoDB, Cassandra, and Hadoop.

Often finding data that conforms well to a given machine learning problem can be difficult. Sometimes datasets exist, but are not commercially licensed. In this case, companies will need to establish their own data curation pipelines whether by soliciting data through customer outreach or through a third-party service.

Once data has been cleaned, visualized, and selected for training, it needs to be transformed into a numerical representation so that it can be used as input for a model. This process is called vectorization. The selection process for determining what’s important in the dataset for training is called featurization. While featurization is more of an art then a science, many machine learning tasks possess associated featurization methods that are commonly used in practice.

Since common featurizations exist and generating these features for a given dataset takes time, it behooves organizations to implement their own feature stores as part of their machine learning pipelines. Simply put, a feature store is just a common library of featurizations that can be applied to data of a given type. 

Having this library accessible across teams allows practitioners to set up their models in standardized ways, thus aiding reproducibility and sharing between groups.

Model Selection

Current guides to machine learning tend to focus on standard algorithms and model types and how they can best be applied to solve a given business problem. 

Selecting the type of model to use when confronted with a business problem can often be a laborious task. Practitioners tend to make a choice informed by the existing literature and their first-hand experience about which models they’d like to try first. 

There are some general rules of thumb that help guide this process. For example, Convolutional Neural Networks tend to perform quite well on image recognition and text classification, LSTMs and GRUs are among the go-to choices for sequence prediction and language modeling, and encoder-decoder architectures excel on translation tasks.

After a model has been selected, the practitioner must then decide which tool to implement the chosen model. The interoperability of different frameworks has improved greatly in recent years due to the introduction of universal model file formats such as the Open Neural Network eXchange (ONNX), which allow for the porting of models trained in one library to be exported for use in another. 

What’s more, the advent of machine learning compilers such as Intel’s nGraph, Facebook’s Glow, or the University of Washington’s TVM promise the holy grail of being able to specify your model in a universal language of sorts and have it be compiled to seamlessly target a vast array of different platforms and hardware architectures.

Model Training

Model training constitutes one of the most time consuming and labor-intensive stages in any machine learning workflow. What’s more, the hardware and infrastructure used to train models depends greatly on the number of parameters in the model, the size of the dataset, the optimization method used, and other considerations.

In order to automate the quest for optimal hyperparameter settings, machine learning engineers often perform what’s called a grid search or hyperparameter search. This involves a sweep across parameter space that seeks to maximize some score function, often cross-validation accuracy. 

Even more advanced methods exist that focus on using Bayesian optimization or reinforcement learning to tune hyperparameters. What’s more, the field has recently seen a surge in tools focusing on automated machine learning methods, which act as black boxes used to select a semi-optimal model and hyperparameter configuration. 

After a model is trained, it should be evaluated based on performance metrics including cross-validation accuracy, precision, recall, F1 score, and AUC. This information is used to inform either further training of the same model or the next iterate in the model selection process. Like all other metrics, these should be logged in a database for future use.

Visualization

Model visualization can be integrated at any point in the machine learning pipeline, but proves especially valuable at the training and testing stages. As discussed, appropriate metrics should be visualized after each stage in the training process to ensure that the training procedure is tending towards convergence. 

Many machine learning libraries are packaged with tools that allow users to debug and investigate each step in the training process. For example, TensorFlow comes bundled with TensorBoard, a utility that allows users to apply metrics to their model, view these quantities as a function of time as the model trains, and even view each node in a neural network’s computational graph.

Model Testing

Once a model has been trained, but before deployment, it should be thoroughly tested. This is often done as part of a CI/CD pipeline. Each model should be subjected to both qualitative and quantitative unit tests. Many training datasets have corresponding test sets which consist of hand-labeled examples against which the model’s performance can be measured. If a test set does not yet exist for a given dataset, it can often be beneficial for a team to curate one. 

The model should also be applied to out-of-domain examples coming from a distribution outside of that on which the model was trained. Often, a qualitative check as to the model’s performance, obtained by cross-referencing a model’s predictions with what one would intuitively expect, can serve as a guide as to whether the model is working as hoped. 

For example, if you trained a model for text classification, you might give it the sentence “the cat walked jauntily down the street, flaunting its shiny coat” and ensure that it categorizes this as “animals” or “sass.”

Deployment

After a model has been trained and tested, it needs to be deployed in production. Current practices often push for the deploying of models as microservices, or compartmentalized packages of code that can be queried through and interact via API calls.

Successful deployment often requires building utilities and software that data scientists can use to package their code and rapidly iterate on models in an organized and robust way such that the backend and data engineers can efficiently translate the results into properly architected models that are deployed at scale. 

For traditional businesses, without sufficient in-house technological expertise, this can prove a herculean task. Even for large organizations with resources available, creating a scalable deployment solution is a dangerous, expensive commitment. Building an in-house solution like Uber’s Michelangelo just doesn’t make sense for any but a handful of companies with unique, cutting-edge ML needs that are fundamental to their business. 

Fortunately, commercial tools exist to offload this burden, providing the benefits of an in-house platform without signing the organization up for a life sentence of proprietary software development and maintenance. 

Algorithmia’s AI Layer allows users to deploy models from any framework, language, or platform and connect to most all data sources. We scale model inference on multi-cloud infrastructures with high efficiency and enable users to continuously manage the machine learning life cycle with tools to iterate, audit, secure, and govern.

No matter where you are in the machine learning life cycle, understanding each stage at the start and what tools and practices will likely yield successful results will prime your ML program for sophistication. Challenges exist at each stage, and your team should also be primed to face them. 

Face the challenges

From Crawling to Sprinting: Advances in Natural Language Processing 

Gif highlighting different Natural Language Processing branches e.g. prediction, translation

Natural language processing (NLP) is one of the fastest evolving branches in machine learning and among the most fundamental. It has applications in diplomacy, aviation, big data sentiment analysis, language translation, customer service, healthcare, policing and criminal justice, and countless other industries.

NLP is the reason we’ve been able to move from CTRL-F searches for single words or phrases to conversational interactions about the contents and meanings of long documents. We can now ask computers questions and have them answer. 

Algorithmia hosts more than 8,000 individual models, many of which are NLP models and complete tasks such as sentence parsing, text extraction and classification, as well as translation and language identification. 

Allen Institute for AI NLP Models on Algorithmia 

The Allen Institute for Artificial Intelligence (Ai2), is a non-profit created by Microsoft co-founder Paul Allen. Since its founding in 2013, Ai2 has worked to advance the state of AI research, especially in natural language applications. We are pleased to announce that we have worked with the producers of AllenNLP—one of the leading NLP libraries—to make their state-of-the-art models available with a simple API call in the Algorithmia AI Layer.

Among the algorithms new to the platform are:

  • Machine Comprehension: Input a body of text and a question based on it and get back the answer (strictly a substring of the original body of text).
  • Textual Entailment: Determine whether one statement follows logically from another
  • Semantic role labeling: Determine “who” did “what” to “whom” in a body of text

These and other algorithms are based on a collection of pre-trained models that are published on the AllenNLP website.  

Algorithmia provides an easy-to-use interface for getting answers out of these models. The underlying AllenNLP models provide a more verbose output, which is aimed at researchers who need to understand the models and debug their performance—this additional information is returned if you simply set debug=True.

The Ins and Outs of the AllenNLP Models 

Machine Comprehension: Create natural-language interfaces to extract information from text documents. 

This algorithm provides the state-of-the-art ability to answer a question based on a piece of text. It takes in a passage of text and a question based on that passage, and returns a substring of the passage that is guessed to be the correct answer.

This model could feature into the backend of a chatbot or provide customer support based on a user’s manual. It could also be used to extract structured data from textual documents, such as a collection of doctors’ reports could be turned into a table that says (for every report) the patient’s concern, what the patient should do, and when they should schedule a follow-up appointment.

Machine Comprehension screenshot

Screenshot from Machine Comprehension model on Algorithmia.


Entailment: This algorithm provides state-of-the-art natural language reasoning. It takes in a premise, expressed in natural language, and a hypothesis that may or may not follow up from. It determines whether the hypothesis follows from the premise, contradicts the premise, or is unrelated. The following is an example:

Input
The input JSON blob should have the following fields:

  • premise: a descriptive piece of text
  • hypothesis: a statement that may or may not follow from the premise of the text

Any additional fields will pass through into the AllenNLP model.

Output
The following output field will always be present:

  • contradiction: Probability that the hypothesis contradicts the premise
  • entailment: Probability that the hypothesis follows from the premise
  • neutral: Probability that the hypothesis is independent from the premise
Entailment

Screenshot from Entailment model on Algorithmia.


Semantic role labeling: This algorithm provides state-of-the-art natural language reasoning—decomposing a sentence into a structured representation of the relationships it describes.

The concept of this algorithm is considering a verb and the entities involved in it as its arguments (like logical predicates). The arguments describe who or what does the action of this verb, to whom or what it is done, etc.

Semantic role labeling

Screenshot from Semantic role labeling model on Algorithmia.

NLP Moving Forward

NLP applications are rife in everyday life, and applications will only continue to expand and improve because the possibilities of a computer understanding written and spoken human language and executing on it are endless. 

CTA for browsing NLP algorithms

Data Scientists Should be Able to Deploy and Iterate Their Own Models

Data science in production is still in its infancy. Everything is new and nearly everyone involved is doing this for the first time. Every quarter, we talk with more than a hundred teams working to operationalize their machine learning and we see a clear pathway out of this difficult stage that many companies cannot get past: operationalizing the deployment process.

The problem

Massive effort goes into any well-trained model—research shows that less than 25 percent of a data scientist’s time is spent on training models; the rest is spent collecting and engineering data and conducting DIY DevOps.

Collecting and cleaning data is a fairly easy-to-understand problem with a maturing ecosystem of best practices, products, and training. As such, let’s focus on the problem of DIY DevOps.

DIY DevOps massively increases the risk of product failure.

We often hear from data scientists that they tried to deploy a model themselves, got stuck as they attempted to fit it into existing architecture, and found out that building machine learning infrastructure is a much harder endeavor than they imagined.

A common story arc we hear:
  • A project starts with a clear problem to solve but maybe not a clear path forward
  • Collecting and cleaning the data is difficult but doable
  • Everyone realizes that deploying and iterating will yield real data and improve the model
  • Data scientists are tasked with deploying models for the project themselves
  • Data scientists get spread thin as they work on tasks outside their areas of strength
  • Months later it becomes clear that DevOps needs to be called in
  • Deploying a model gets tasked to an engineer
  • The engineer realizes that the model was created in Python when the application that needs to use it is in Java
  • The engineer heads over to Stack Overflow and decides the best path forward is to rewrite the model in Java
  • The model gets deployed weeks or months later
  • Down the line, a new version of the original Python model is trained

The solution

Companies that are serious about machine learning need an abstraction layer in their machine learning operationalization infrastructure that allows data scientists to deploy and iterate their own models in any language, using any framework.

We think about this abstraction layer as being an operating system for machine learning. But, what does it look like to help data scientists deploy and iterate their own models? It looks like using whatever tools are best for a given project. 

What do you need in your infrastructure?

Model versioning
Models aren’t static. As data science practices mature, there is a constant need to monitor model drift and retrain models as more data is made available. You’ll need a versioning system that allows you to continuously iterate on models without breaking or disrupting applications that are using older versions.

ML for DevOps
Traditional software development has a well-understood DevOps process: developers commit code to a versioned repository, then a Continuous Integration (CI) process builds, tests, and validates the most recent master branch. If everything meets deployment criteria, a Continuous Delivery or Continuous Deployment (CD) pipeline releases the latest valid version of the product to customers. As a result, developers can focus on writing code, release engineers can build and govern the software delivery process, and end users can receive the most functional code available, all within a system that can be audited and rolled back at any time.

Software development life cycle

The machine learning life cycle can, and should, borrow heavily from the traditional CI/CD model, but it’s important to account for several key differences, including:

Workflow patterns
Enterprise and commercial software development is far more likely to have many developers iterating concurrently on a single piece of code. In ML development, there may be far fewer data scientists iterating on individual models, but those models are likely to be pipelined together in myriad ways.

Dependency management
Traditional application development often leverages a single framework and one, or possibly two, languages. For example, many enterprise applications are written in C#, using the .NET framework, with small bits of highly performant code written in C++. ML models, on the other hand, are typically written using a wide range of languages and frameworks (R, Scala, Python, Java, etc.), so each model can have a unique set of dependencies.

Containerizing these models with all necessary dependencies while still optimizing start times and performance is a far more complex task than compiling a monolithic app built in a single framework.

ML development life cycle

The infrastructure for deploying and managing a machine learning portfolio is crucial for data scientists to do their best work. Though it’s a relatively new field, good practices exist for moving data science forward efficiently and effectively.

Data Providers Added for Algorithmia Users

Algorithmia data providers list

Azure Blob and Google Cloud Storage

In an effort to constantly improve products for our customers, this month we introduced two additional data providers into Algorithmia’s data abstraction service: Azure Blob Storage and Google Cloud Storage. This update allows algorithm developers to read and write data without worrying about the underlying data source. Additionally, developers who consume algorithms never need to worry about passing sensitive credentials to an algorithm since Algorithmia securely brokers the connection for them.

How Easy is it?

By creating an Algorithmia account, you automatically have access to our Hosted Data Source where you can store your data or algorithm output. If you have a Dropbox, Azure Blob Storage, Google Cloud Storage, or an Amazon S3 account, you can configure a new data source to permit Algorithmia to read and write files on your behalf. All data sources have a protocol and a label that you will use to reference your data.

We create these labels because you may want to add multiple connections to the same data provider account and they will each need a unique label for later reference in your algorithm. You might want to have multiple connections to the same source so you can set different access permissions to each connection, such as read from one file and write to a different folder.

We’re Flexible

These providers are available now in addition to Amazon S3, Dropbox, and the Algorithmia Hosted Data service. These options will provide our users with even more flexibility when incorporating Algorithmia’s services into their infrastructures.

Learn more about how Algorithmia enables data connection on our site.

We’d love to know which other data providers developers are interested in, and we’ll keep shipping new providers in future releases. Get in touch if you have suggestions!