Optimizing Large Language Model Performance with ONNX on DataRobot MLOps


In our previous blog post we talked about how to simplify the deployment and monitoring of foundation models with DataRobot MLOps. We took a pre-trained model from HuggingFace using Tensorflow, wrote a simple inference script, and uploaded the script and the saved model as a custom model package to DataRobot MLOps. We then easily deployed the pre-trained foundation model on DataRobot servers in just a few minutes.

In this blog post, we'll showcase how you can effortlessly gain a significant improvement in the inference speed of the same model while reducing its resource consumption. In our walkthrough, you'll learn that the only thing needed is to convert your language model to the ONNX format. The native support for the ONNX Runtime in DataRobot will take care of the rest.

Why Are Large Language Models Challenging for Inference?

Previously, we talked about what language models are. The neural architecture of large language models can have billions of parameters. Having a huge number of parameters means these models will be hungry for resources and slow to predict. Because of this, they are challenging to serve for inference with high performance. In addition, we want these models to not only process one input at a time, but also process batches of inputs and consume them more efficiently. The good news is that we have a way of improving their performance and throughput by accelerating their inference process, thanks to the capabilities of the ONNX Runtime.

What Is ONNX and the ONNX Runtime?

ONNX (Open Neural Network Exchange) is an open standard format to represent machine learning (ML) models built on various frameworks such as PyTorch, Tensorflow/Keras, and scikit-learn. ONNX Runtime is also an open source project that's built on the ONNX standard. It's an inference engine optimized to accelerate the inference process of models converted to the ONNX format across a wide range of operating systems, languages, and hardware platforms.

[Figure: ONNX Runtime Execution Providers. Source: ONNX Runtime]

ONNX and its runtime together form the basis of standardizing and accelerating model inference in production environments. Through certain optimization techniques¹, ONNX Runtime accelerates model inference on different platforms, such as mobile devices, cloud, or edge environments. It provides an abstraction by leveraging these platforms' compute capabilities through a single API interface.

In addition, by converting models to ONNX, we gain the advantage of framework interoperability: we can export models trained in various ML frameworks to ONNX and vice versa, load previously exported ONNX models into memory, and use them in our ML framework of choice.
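
As a quick illustration of this interoperability (a toy example, not part of the walkthrough below), a model trained in PyTorch can be exported to ONNX and then served through ONNX Runtime with an explicitly chosen execution provider:

import numpy as np
import torch
import onnxruntime as ort

# A toy PyTorch model; any torch.nn.Module can be exported the same way.
model = torch.nn.Linear(4, 2)
dummy_input = torch.randn(1, 4)

# Export to the ONNX format.
torch.onnx.export(model, dummy_input, "toy_model.onnx",
                  input_names=["input"], output_names=["output"])

# Load the exported model with ONNX Runtime; the providers list selects the backend.
sess = ort.InferenceSession("toy_model.onnx", providers=["CPUExecutionProvider"])
outputs = sess.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0].shape)  # (1, 2)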

Accelerating Transformer-Based Model Inference with ONNX Runtime

Numerous benchmarks executed by independent engineering teams in the industry have demonstrated that transformer-based models can significantly benefit from ONNX Runtime optimizations to reduce latency and increase throughput on CPUs.

Some examples include Microsoft's work on BERT model optimization using ONNX Runtime², Twitter's benchmarking results for transformer CPU inference in Google Cloud³, and sentence transformers acceleration with Hugging Face Optimum⁴.

[Figure: Performance gains of the quantized BERT 12-layer model with ONNX Runtime. Source: Microsoft⁵]

These benchmarks demonstrate that we can significantly improve throughput and performance for transformer-based NLP models, especially through quantization. For example, in the Microsoft team's benchmark above, the quantized BERT 12-layer model with Intel® DL Boost: VNNI and ONNX Runtime achieves up to 2.9x performance gains.
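
While we don't apply it in this walkthrough, post-training dynamic quantization of an exported ONNX model takes only a few lines with ONNX Runtime's quantization tooling; a minimal sketch (the file names are placeholders):

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the FP32 model weights to INT8; input/output paths are placeholders.
quantize_dynamic("model.onnx", "model-quantized.onnx", weight_type=QuantType.QInt8)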

How Does DataRobot MLOps Natively Support ONNX?

For your modeling or inference workflows, you can integrate your custom code into DataRobot with these two mechanisms:

  1. As a custom task: While DataRobot provides hundreds of built-in tasks, there are situations where you need preprocessing or modeling methods that are not currently supported out-of-the-box. To fill this gap, you can bring a custom task that implements a missing method, plug that task into a blueprint inside DataRobot, and then train, evaluate, and deploy that blueprint in the same way as you would for any DataRobot-generated blueprint. You can review how the process works here.
  2. As a custom inference model: This might be a pre-trained model or user code prepared for inference, or a combination of both. An inference model can have a predefined input/output schema for classification/regression/anomaly detection or be completely unstructured. You can read more details on deploying your custom inference models here.

In both cases, in order to run your custom models on the DataRobot AI Platform with MLOps support, you first select one of our public_dropin_environments such as Python3 + PyTorch, Python3 + Keras/Tensorflow, or Python3 + ONNX. These environments each define the libraries available in the environment and provide a template. Your own dependency requirements can be applied to one of these base environments to create a runtime environment for your custom tasks or custom inference models.

The bonus perk of DataRobot execution environments is that if you have a single model artifact and your model conforms to certain input/output structures, meaning you don't need a custom inference script to transform the input request or the raw predictions, you don't even need to provide a custom script in your uploaded model package. In the ONNX case, if you only want to predict with a single .onnx file, and this model file conforms to the structured specification, then once you select the Python3 + ONNX base environment for your custom model package, DataRobot MLOps will know how to load this model into memory and predict with it. To learn more and get your hands on easy-to-reproduce examples, please visit the custom inference model templates section in our custom models repository.
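
Conceptually, the scoring logic applied to such a structured, single-artifact ONNX model boils down to something like the following simplified sketch (an illustration only, not DataRobot's actual implementation):

import numpy as np
import onnxruntime as ort
import pandas as pd

def score(data: pd.DataFrame, model_path: str = "model.onnx") -> pd.DataFrame:
    # Load the single .onnx artifact and feed the request DataFrame to its first input.
    sess = ort.InferenceSession(model_path)
    input_name = sess.get_inputs()[0].name
    preds = sess.run(None, {input_name: data.to_numpy(dtype=np.float32)})[0]
    return pd.DataFrame(preds)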

Walkthrough

After reading all this information about the performance benefits and the relative simplicity of implementing models through ONNX, I'm sure you're more than excited to get started.

To demonstrate an end-to-end example, we'll perform the following steps:

  1. Grab the same foundation model from our previous blog post and save it on your local drive.
  2. Export the saved Tensorflow model to ONNX format.
  3. Package the ONNX model artifact along with a custom inference (custom.py) script.
  4. Upload the custom model package to DataRobot MLOps on the ONNX base environment.
  5. Create a deployment.
  6. Send prediction requests to the deployment endpoint.

For brevity, we'll only show the additional and modified steps you'll perform on top of the walkthrough from our previous blog, but the end-to-end implementation is available in this Google Colab notebook under the DataRobot Community repository.

Converting the Foundation Model to ONNX

For this tutorial, we'll use the transformers library's ONNX conversion tool to convert our question answering LLM into the ONNX format as below.

FOUNDATION_MODEL = "bert-large-uncased-whole-word-masking-finetuned-squad"

!python -m transformers.onnx --model=$FOUNDATION_MODEL --feature=question-answering $BASE_PATH
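
Before packaging the artifact, it can be worth a quick sanity check that the export succeeded, for example by opening it in an ONNX Runtime session and listing its inputs and outputs (BASE_PATH is the export directory defined in the notebook):

import os
import onnxruntime as ort

sess = ort.InferenceSession(os.path.join(BASE_PATH, "model.onnx"))
print([i.name for i in sess.get_inputs()])   # e.g. ['input_ids', 'attention_mask', 'token_type_ids']
print([o.name for o in sess.get_outputs()])  # e.g. ['start_logits', 'end_logits']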


Modifying Your Custom Inference Script to Use Your ONNX Model

For inferencing with this model on DataRobot MLOps, we'll have our custom.py script load the ONNX model into memory in an ONNX Runtime session and handle the incoming prediction requests as follows:

%%writefile $BASE_PATH/custom.py
"""
Copyright 2021 DataRobot, Inc. and its affiliates.

All rights reserved.

This is proprietary source code of DataRobot, Inc. and its affiliates.

Released under the terms of DataRobot Tool and Utility Agreement.
"""
import json
import os
import io

from transformers import AutoTokenizer
import onnxruntime as ort
import numpy as np
import pandas as pd


def load_model(input_dir):
    """Load the ONNX model into an ONNX Runtime session along with its tokenizer."""
    global model_load_duration
    onnx_path = os.path.join(input_dir, "model.onnx")
    tokenizer_path = os.path.join(input_dir)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
    sess = ort.InferenceSession(onnx_path)
    return sess, tokenizer


def _get_answer_in_text(output, input_ids, idx, tokenizer):
    """Decode the answer span between the predicted start and end positions."""
    answer_start = np.argmax(output[0], axis=1)[idx]
    answer_end = (np.argmax(output[1], axis=1) + 1)[idx]
    answer = tokenizer.convert_tokens_to_string(
        tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end])
    )
    return answer


def score_unstructured(model, data, query, **kwargs):
    global model_load_duration
    sess, tokenizer = model
    # Assume batch input is sent with mimetype:"text/csv"
    # Treat as single prediction input if no mimetype is set
    is_batch = kwargs["mimetype"] == "text/csv"

    if is_batch:
        input_pd = pd.read_csv(io.StringIO(data), sep="|")
        input_pairs = list(zip(input_pd["context"], input_pd["question"]))
        inputs = tokenizer.batch_encode_plus(
            input_pairs, add_special_tokens=True, padding=True, return_tensors="np"
        )
        input_ids = inputs["input_ids"]
        output = sess.run(["start_logits", "end_logits"], input_feed=dict(inputs))
        responses = []
        for i, row in input_pd.iterrows():
            answer = _get_answer_in_text(output, input_ids[i], i, tokenizer)
            response = {
                "context": row["context"],
                "question": row["question"],
                "answer": answer,
            }
            responses.append(response)
        to_return = json.dumps({"predictions": responses})
    else:
        data_dict = json.loads(data)
        context, question = data_dict["context"], data_dict["question"]
        inputs = tokenizer(
            question,
            context,
            add_special_tokens=True,
            padding=True,
            return_tensors="np",
        )
        input_ids = inputs["input_ids"][0]
        output = sess.run(["start_logits", "end_logits"], input_feed=dict(inputs))
        answer = _get_answer_in_text(output, input_ids, 0, tokenizer)
        to_return = json.dumps(
            {"context": context, "question": question, "answer": answer}
        )
    return to_return
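
Before uploading the package, you can optionally smoke-test these hooks locally in the notebook; a minimal sketch (the sample context/question pair below is made up):

import json
import sys

sys.path.append(BASE_PATH)  # make the custom.py we just wrote importable
from custom import load_model, score_unstructured

model = load_model(BASE_PATH)
single_input = json.dumps({
    "context": "ONNX Runtime is an inference engine for models in the ONNX format.",
    "question": "What is ONNX Runtime?",
})
print(score_unstructured(model, single_input, None, mimetype="application/json"))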

Creating Your Custom Inference Model Deployment on DataRobot's ONNX Base Environment

As the final change, we'll create this custom model's deployment on the ONNX base environment of DataRobot MLOps as below:

deployment = deploy_to_datarobot(BASE_PATH,
                                 "ONNX",
                                 "bert-onnx-questionAnswering",
                                 "Pretrained BERT model, fine-tuned on SQuAD for question answering")

When the deployment is ready, we'll test our custom model with our test input and make sure that we're getting our questions answered by our pre-trained LLM:

datarobot_predict(test_input, deployment.id)
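
The test_input itself comes from the previous blog's notebook. For reference, a batch request for this custom.py would simply be a pipe-separated CSV with context and question columns, sent with the text/csv mimetype, along these lines (the rows below are made up):

batch_input = (
    "context|question\n"
    "ONNX Runtime accelerates model inference across platforms.|What does ONNX Runtime do?\n"
    "DataRobot MLOps natively supports the ONNX Runtime.|What does DataRobot MLOps support?\n"
)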

Performance Comparison

Now that we have everything ready, it's time to compare our previous Tensorflow deployment with the ONNX alternative.

For our benchmarking purposes, we constrained our custom model deployment to only 4GB of memory in our MLOps settings so that we could compare the Tensorflow and ONNX alternatives under resource constraints.

As can be seen from the results below, our model in ONNX predicts 1.5x faster than its Tensorflow counterpart. And this result comes from just a basic ONNX export, i.e. without any further optimization configurations such as quantization.

[Figure: Prediction duration, Tensorflow vs. ONNX deployment]

Regarding resource consumption, somewhere after ~100 rows our Tensorflow model deployment starts returning Out of Memory (OOM) errors, meaning that this model would require more than 4GB of memory to process and predict for ~100 rows of input. On the other hand, our ONNX model deployment can calculate predictions for up to ~450 rows without throwing an OOM error. As a conclusion for our use case, the fact that the Tensorflow model handles up to 100 rows while its ONNX equivalent handles up to 450 rows shows that the ONNX model is more resource efficient, because it uses much less memory.
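
For reference, such a row-count sweep can be scripted against each deployment roughly as follows, assuming the datarobot_predict helper from the notebook accepts the raw CSV payload (the sample row is made up):

import time

row = "ONNX Runtime accelerates model inference across platforms.|What does ONNX Runtime do?"
for n_rows in [10, 50, 100, 200, 450]:
    batch = "context|question\n" + "\n".join([row] * n_rows)
    start = time.perf_counter()
    datarobot_predict(batch, deployment.id)  # repeat with the Tensorflow deployment to compare
    print(n_rows, "rows:", round(time.perf_counter() - start, 2), "s")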

Start Leveraging ONNX for Your Custom Models

By leveraging the open source ONNX standard and the ONNX Runtime, AI builders can take advantage of framework interoperability and accelerated inference performance, especially for transformer-based large language models, on their preferred execution environment. With its native ONNX support, DataRobot MLOps makes it easy for organizations to gain value from their machine learning deployments with optimized inference speed and throughput.

In this blog post, we showed how simple it is to use a large language model for question answering in ONNX as a DataRobot custom model, and how much inference performance and resource efficiency you can gain with a simple conversion step. To replicate this workflow, you can find the end-to-end notebook under the DataRobot Community repo, along with many other tutorials to empower AI builders with the capabilities of DataRobot.


1. ONNX Runtime, Model Optimizations

2. Microsoft, Optimizing BERT model for Intel CPU Cores using ONNX runtime default execution provider

3. Twitter Blog, Speeding up Transformer CPU inference in Google Cloud

4. Philipp Schmid, Accelerate Sentence Transformers with Hugging Face Optimum

5. Microsoft, Optimizing BERT model for Intel CPU Cores using ONNX runtime default execution provider

About the author

Aslı Sabancı Demiröz

Senior Machine Learning Engineer, DataRobot

Aslı Sabancı Demiröz is a Senior Machine Learning Engineer at DataRobot. She holds a BS in Computer Engineering with a double major in Control Engineering from Istanbul Technical University. Working in the office of the CTO, she enjoys being at the heart of DataRobot's R&D to drive innovation. Her passion lies in the deep learning space and she particularly enjoys creating powerful integrations between platform and application layers in the ML ecosystem, aiming to make the whole greater than the sum of the parts.

