BentoML Explained: An Optimal Model Serving Framework

Serve ML models with ease using BentoML! This walkthrough explains its core concepts and model serving functionality.


A perfectly packaged ML model, at ease in his bento.

Join us in an exploration of BentoML, where every concept, from the simplest to the most complex, is presented with clarity and simplicity to make your learning journey enjoyable and insightful! 🎉

Without further ado, let’s get started: Model Serving and Deployment are terms often used interchangeably in the machine learning world, yet they encapsulate distinct phases in the transition from model development to production.

Model deployment is the process of transitioning a machine learning model into a production environment, ensuring an appropriate format for practical use and, if necessary, establishing additional infrastructure like servers and databases to support it.

On the other hand, model serving is the practice of making a machine learning model available through APIs, which allow users to send input data and receive predictions.

On the free tier, BentoML’s focus is solely on the model serving phase. While it does define the way a model will interact with its production environment, this tool aims specifically at encapsulating the model into a Docker image that can easily be deployed in a production environment such as Google Cloud Run or Kubernetes 🐳

Model serving and deployment are vital in machine learning workflows, bridging the gap between experimental models and practical applications by enabling models to deliver real-world predictions and insights.

In this article, we will focus exclusively on BentoML’s model serving capabilities. If you want to explore the deployment options available for your model, feel free to read my article on the subject:

Join us in the next section, where we analyze the capabilities of BentoML so that you gain a solid understanding of its functionalities and how it can be a game-changer in your ML journey! 🚀

BentoML Demystified

“Live as if you were to die tomorrow. Learn as if you were to live forever.” - Mahatma Gandhi

In this section, we’ll give a brief introduction of BentoML’s functionalities and features and how it can be evaluated in various machine learning workflows.

Understanding BentoML

BentoML is a library that simplifies the process of deploying machine learning models. It encapsulates models, regardless of their originating framework, into a format that can be deployed, whether in cloud environments, on local machines, or edge devices, offering a versatile approach to model deployment.

By generating a Docker Image of the packaged model, BentoML facilitates a flexible array of deployment options.

Evaluating BentoML

Understanding the strengths and weaknesses of BentoML is crucial to determine if it fits your use case, as evaluating the pros and cons of any tool is essential for informed decision-making.

Advantages of BentoML

BentoML brings several advantages to the table when it comes to deploying machine learning models:

  • Easy Serving: BentoML streamlines the serving process, enabling a smooth transition of ML models into production-ready APIs.
  • Integration Capabilities: It offers robust integration, working seamlessly with various platforms and tools such as ZenML, Airflow, Spark, MLflow and more.
  • Optimized Performance: Through the use of micro-batching, BentoML maximizes resource usage and allows for separate scaling specifically for model inference.
  • Consistent Format: BentoML provides a consistent format for model serving and deployment, ensuring uniformity across different use cases.
  • Platform Flexibility: Not limited to Kubernetes, BentoML supports deployment across a variety of platforms, offering notable flexibility.
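
To make the micro-batching idea mentioned above more concrete, here is a minimal, framework-free sketch. It is not BentoML's implementation: `fake_model` and `micro_batch` are hypothetical names, and a real micro-batcher also deals with request queues and time windows, which are left out here.

```python
from typing import Callable, List

Rows = List[List[float]]

def fake_model(batch: Rows) -> List[float]:
    """Stand-in for a model: one prediction (the row sum) per input row."""
    return [sum(row) for row in batch]

def micro_batch(requests: List[Rows], model: Callable[[Rows], List[float]]) -> List[List[float]]:
    """Merge several small requests into one model call, then split the results."""
    merged = [row for request in requests for row in request]
    predictions = model(merged)  # a single call instead of len(requests) calls
    results, start = [], 0
    for request in requests:
        results.append(predictions[start:start + len(request)])
        start += len(request)
    return results

# Three independent requests arriving close together
requests = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]], [[7.0, 8.0]]]
batched_results = micro_batch(requests, fake_model)
```

Each caller still receives only its own predictions, while the model is invoked once for all rows.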

Limitations of BentoML

As BentoML focuses specifically on the containerization of machine learning models, it’s essential to note a few drawbacks:

  • Limited Experimentation: BentoML leans heavily towards deployment, leaving experimentation aspects to be managed by additional tools like MLflow.
  • Scaling Concerns: Horizontal scaling is not handled by default in BentoML, which might require additional configurations or tools.
  • Lack of Advanced Features: Certain advanced features, such as multi-model serving and A/B testing, are not supported.
  • Basic Monitoring: While it does provide monitoring and logging, additional effort is required to establish a fully functional system.

Is BentoML the Right Choice for Your Team?

BentoML is a fitting choice for teams that prioritize quick and straightforward model deployment without the need for advanced deployment features. However, it may not suit teams that require a more complex deployment process, especially those seeking advanced features like multi-model serving and A/B testing.

Fundamental Principles of BentoML

Nope, it’s not confusing! Follow along ;)

Let’s explore the fundamental principles of BentoML, ensuring a thorough understanding of its key features and functionalities. This section will be as straightforward as possible to keep the various concepts clearly separated.

BentoML Models

In BentoML, a model contains the algorithms and learned parameters from training, enabling predictions on new data.

Model Store

BentoML’s Model Store is a local repository for saving and managing models. Key operations include:

  • Saving a model: upload a model to the Model Store.
import bentoml
saved_model = bentoml.sklearn.save_model("iris_clf", clf)
  • Retrieving a model: download a model from the Model Store.
import bentoml
from sklearn.base import BaseEstimator
model: BaseEstimator = bentoml.sklearn.load_model("iris_clf:latest")
  • Managing a model: the following operations are available from the BentoML CLI:
    • bentoml models list
    • bentoml models get
    • bentoml models delete
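
To illustrate how a tag such as iris_clf:latest can resolve to a concrete version, here is a toy, in-memory sketch. `ToyModelStore` and its `v1`, `v2` version scheme are hypothetical and only mimic the idea of a store keyed by name:version tags; BentoML's actual Model Store persists models on disk and generates its own version strings.

```python
from typing import Any, Dict, List, Tuple

class ToyModelStore:
    """In-memory stand-in for a model store keyed by `name:version` tags."""

    def __init__(self) -> None:
        self._models: Dict[str, List[Tuple[str, Any]]] = {}

    def save(self, name: str, model: Any) -> str:
        versions = self._models.setdefault(name, [])
        version = f"v{len(versions) + 1}"  # hypothetical version scheme
        versions.append((version, model))
        return f"{name}:{version}"

    def get(self, tag: str) -> Any:
        name, _, version = tag.partition(":")
        versions = self._models[name]
        if version in ("", "latest"):
            return versions[-1][1]  # most recently saved version
        for saved_version, model in versions:
            if saved_version == version:
                return model
        raise KeyError(tag)

store = ToyModelStore()
first_tag = store.save("iris_clf", {"weights": [0.1]})
second_tag = store.save("iris_clf", {"weights": [0.2]})
latest_model = store.get("iris_clf:latest")
```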

Model Runners

Runners handle model inference, simplifying direct model interactions. After loading a saved model, you can establish a runner for local inference:

import bentoml

# Retrieve the saved model
bento_model = bentoml.models.get("iris_clf:latest")

# Create a runner from the model
my_runner = bento_model.to_runner()

# Initialize the Runner in the current process (for development and testing only):
my_runner.init_local()

# Use the runner for inference (hypothetical example: input_data must match the model's expected input)
predictions = my_runner.predict.run(input_data)

Model Signature

In BentoML, the model signature specifies which of the model’s methods are exposed for inference and how they should be called, for example whether calls can be batched. It is declared when the model is saved:

bentoml.pytorch.save_model(  # Framework module assumed here; use the one matching your model
     "demo_mnist",  # Model name in the local Model Store
     trained_model,  # Model instance being saved
     signatures={   # Model signatures for Runner inference
         "classify": {
             "batchable": False,
         },
     },
)

In BentoML, batching allows multiple inputs to be handled simultaneously for faster inference. By setting the batchable parameter to True in a model’s signature, multiple calls can be merged into one batched call for efficiency:

bentoml.pytorch.save_model(  # Framework module assumed here; use the one matching your model
     "demo_mnist",  # Model name in the local Model Store
     trained_model,  # Model instance being saved
     signatures={   # Model signatures for Runner inference
         "__call__": {
             "batchable": True,
             "batch_dim": 0,
         },
     },
)

The batch_dim parameter determines the dimension along which inputs are concatenated. If set to 0, inputs [[1, 2]] and [[3, 4]] are batched as [[1, 2], [3, 4]]. If set to 1, they are batched as [[1, 2, 3, 4]].
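
The effect of the batching dimension can be reproduced with plain nested lists. The sketch below is only an illustration, assuming each input is a 2-D array with a single row and that batching amounts to concatenation along the chosen dimension; `merge_on_batch_dim` is a hypothetical helper, not a BentoML function.

```python
from typing import List

Matrix = List[List[int]]

def merge_on_batch_dim(inputs: List[Matrix], batch_dim: int) -> Matrix:
    """Concatenate 2-D inputs along batch_dim (0 = rows, 1 = columns)."""
    if batch_dim == 0:
        return [row for matrix in inputs for row in matrix]
    if batch_dim == 1:
        # Every input must have the same number of rows
        return [sum(rows, []) for rows in zip(*inputs)]
    raise ValueError("this sketch only handles batch_dim 0 or 1")

first, second = [[1, 2]], [[3, 4]]
merged_dim0 = merge_on_batch_dim([first, second], batch_dim=0)  # [[1, 2], [3, 4]]
merged_dim1 = merge_on_batch_dim([first, second], batch_dim=1)  # [[1, 2, 3, 4]]
```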

Having explored BentoML models, let’s now turn our attention to how Service and APIs play a crucial role in utilizing these models effectively.

Service and APIs

“Service to others is the rent you pay for your room here on earth.” - Muhammad Ali

Moving forward, we will explore how to create a service, understand the interaction with runners, dive into service APIs, learn about IO descriptors, and differentiate between synchronous and asynchronous APIs.

In BentoML, the service is the primary structure where users specify the logic for the model to interact with its deployment environment.

Creating a Service

A service is essentially a combination of Runners and APIs:

  • Runners: Specialized components that handle model inference.
  • APIs: Define how external requests interact with the models.

For instance, the following snippet creates a service named iris_classifier using a runner (iris_clf_runner) for a scikit-learn model:

svc = bentoml.Service("iris_classifier", runners=[iris_clf_runner])

After initializing the service, use the svc.api decorator to define APIs, set input/output formats, and link a function like classify:

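import numpy as np
from bentoml.io import NumpyNdarray

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    # Delegate inference to the runner attached to the service
    result = iris_clf_runner.predict.run(input_series)
    return result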
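
To see how a decorator can turn a plain function into a named endpoint, consider the toy sketch below. `ToyService` is a hypothetical class written for this article; it only mimics the idea of a service that holds runners and registers API functions, and it says nothing about how BentoML implements svc.api internally.

```python
from typing import Callable, Dict, List

class ToyService:
    """Minimal stand-in for a service: a name, some runners, and registered APIs."""

    def __init__(self, name: str, runners: List[object]) -> None:
        self.name = name
        self.runners = runners
        self.apis: Dict[str, Callable] = {}

    def api(self, func: Callable) -> Callable:
        """Register func as an endpoint named after the function."""
        self.apis[func.__name__] = func
        return func

toy_svc = ToyService("iris_classifier", runners=[])

@toy_svc.api
def classify(values: List[float]) -> float:
    # Placeholder logic standing in for a call to a runner
    return max(values)

routes = sorted(toy_svc.apis)
```

The function stays callable as usual, and the service now knows it under the route name "classify".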

The Interaction with Runners

In BentoML, a Runner encapsulates the serving logic of a model, optimizing throughput and resource use. It can be easily created from a saved model:

runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

Runners adapt to the ML framework’s characteristics, ensuring efficient model inference. For debugging or manual serving, you can initialize and use runners as follows:

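from service import svc

# Initialize each runner in the current process (debugging only)
for runner in svc.runners:
    runner.init_local()

# Call the API function directly ("my_endpoint" and inp are placeholders)
result = svc.apis["my_endpoint"].func(inp)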

Service APIs

Inference APIs dictate remote service calls. A service can host multiple APIs, each with its input/output specs and a function definition:

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_array: np.ndarray) -> np.ndarray:
    # `runner` is the runner created from the saved model above
    result = runner.predict.run(input_array)
    return result

Using @svc.api, the function becomes an API endpoint. For instance, the above becomes an HTTP /predict endpoint. Assuming the service is running locally on port 3000, the request can be performed with:

curl -X POST -H "content-type: application/json" \
    --data "[[5.9, 3, 5.1, 1.8]]" \
    http://127.0.0.1:3000/predict
IO Descriptors

IO descriptors define the data type for an API’s input and output. They ensure data consistency and conversion between native types. For instance, the classify API uses NumpyNdarray for both input and output:

import numpy as np
from bentoml.io import NumpyNdarray

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array: np.ndarray) -> np.ndarray:
    ...  # inference logic goes here

BentoML offers various IO descriptors such as PandasDataFrame, JSON, Image, Text and File, so you can rely on predefined types for common inputs.

IO descriptors help specify and validate expected data types and shapes. For instance, with the NumpyNdarray descriptor, you can define data type and shape using dtype and shape arguments. Enforcing strict validation is possible with enforce_shape and enforce_dtype:

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

svc = bentoml.Service("iris_classifier")

# Define IO using samples
output_descriptor = NumpyNdarray.from_sample(np.array([[1.0, 2.0, 3.0, 4.0]]))

@svc.api(
    input=NumpyNdarray(shape=(-1, 4), dtype=np.float32, enforce_dtype=True, enforce_shape=True),
    output=output_descriptor,
)
def classify(input_array: np.ndarray) -> np.ndarray:
    ...  # inference logic goes here
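
To get an intuition for what enforcing a shape such as (-1, 4) means, here is a small pure-Python sketch. It assumes that -1 acts as a wildcard for "any size" in that dimension; `infer_shape` and `shape_matches` are hypothetical helpers written for illustration, not part of BentoML.

```python
from typing import Sequence, Tuple

def infer_shape(data) -> Tuple[int, ...]:
    """Return the shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0] if data else None
    return tuple(shape)

def shape_matches(expected: Sequence[int], actual: Sequence[int]) -> bool:
    """Compare shapes, treating -1 in expected as 'any size' for that dimension."""
    return len(expected) == len(actual) and all(
        e == -1 or e == a for e, a in zip(expected, actual)
    )

# Two rows of four features: accepted by a (-1, 4) constraint
good = shape_matches((-1, 4), infer_shape([[5.9, 3.0, 5.1, 1.8], [4.9, 3.0, 1.4, 0.2]]))
# One row of three features: rejected
bad = shape_matches((-1, 4), infer_shape([[5.9, 3.0, 5.1]]))
```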

Synchronous vs Asynchronous APIs

BentoML supports both synchronous and asynchronous APIs. While synchronous APIs are straightforward and suitable for many use cases, asynchronous APIs offer better performance, especially for IO-bound tasks or when invoking multiple runners:

# Synchronous API example
@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def predict(input_array: np.ndarray) -> np.ndarray:
    return runner.predict.run(input_array)

# Asynchronous API example
import aiohttp
import asyncio

runner1 = bentoml.sklearn.get("iris_clf:version1").to_runner()
runner2 = bentoml.sklearn.get("iris_clf:version2").to_runner()

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def predict(input_array: np.ndarray) -> np.ndarray:
    # Fetch extra features from a remote store (illustrative placeholder URL)
    async with aiohttp.ClientSession() as session:
        features = await session.get('https://features/get', params=input_array[0])

    # Invoke both runners concurrently
    results = await asyncio.gather(
        runner1.predict.async_run(input_array, features),
        runner2.predict.async_run(input_array, features),
    )
    # combine_results is a user-defined helper, not part of BentoML
    return combine_results(results)
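
The benefit of awaiting several runners at once can be tried out without any model. In this minimal sketch, `fake_runner` is a hypothetical coroutine that simulates an inference call with a short sleep; it only demonstrates how asyncio.gather runs calls concurrently and returns results in argument order.

```python
import asyncio
from typing import List

async def fake_runner(name: str, value: float, delay: float) -> str:
    """Stand-in for an asynchronous runner call that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"{name}:{value * 2}"

async def predict_with_both(value: float) -> List[str]:
    # Both calls are in flight at the same time; gather keeps argument order
    return await asyncio.gather(
        fake_runner("runner1", value, delay=0.02),
        fake_runner("runner2", value, delay=0.01),
    )

gathered = asyncio.run(predict_with_both(1.5))
```

Even though the second call finishes first, its result stays in second position, which makes combining outputs predictable.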

Now that we have a grasp on managing services and APIs, let’s turn our attention to Bentos, exploring the processes of building, managing, testing, and integrating them in various scenarios.

Exploring Bentos

“Not all those who wander are lost.” – J.R.R. Tolkien

A Bento is an archive containing everything needed to run a bentoml.Service, including source code, models, data, and configurations. While bentoml.Service defines the inference API, Bento ensures it can be consistently run in production.

Building a Bento

The bentoml build CLI command creates a Bento using a bentofile.yaml build file:

service: "service:svc"  # Same as the argument passed to `bentoml serve`
labels:
    owner: bentoml-team
    stage: dev
include:
- "*.py"  # A pattern for matching which files to include in the bento
python:
    packages:  # Additional pip packages required by the service
    - scikit-learn
    - pandas

This file specifies the service, labels, included files, and required Python packages. Each Bento gets a unique version tag, but you can set a custom version with the --version argument if needed:

bentoml build --version 1.0.1

Managing Bentos

A Bento can be managed locally using bentoml CLI commands in the same fashion as managing Bento Models:

  • bentoml list
  • bentoml get
  • bentoml delete

Bentos can also be managed with the Python API:

import bentoml
bento = bentoml.get("iris_classifier:latest")

Testing Bentos

Before deploying, it is crucial to test Bentos locally to ensure correct behavior.

There are several ways to test a Bento, including:

  • BentoML CLI: Serve a Bento using the command line (replace BENTO_TAG with your tag, e.g., iris_classifier:latest):
bentoml serve BENTO_TAG
  • bentoml.Server API: For a programmatic approach, use the Python API. Especially useful for debugging:
from bentoml import HTTPServer
import numpy as np

# host value assumed here; bind to whichever interface you need
server = HTTPServer("iris_classifier:latest", production=True, port=3000, host="0.0.0.0")

# Start the server in a separate process and get a client connected to it
with server.start() as client:
    result = client.classify(np.array([[4.9, 3.0, 1.4, 0.2]]))

Pushing & Pulling Bentos

Yatai, an additional tool built by the same company, offers a Bento repository with APIs and a Web UI, storing Bentos on cloud storage like AWS S3 or GCS. It can auto-build Docker images for new Bentos. Bentos are pushed to and pulled from it with:

bentoml push iris_classifier:latest
bentoml pull iris_classifier:nvjtj7wwfgsafuqj

Directory Structure

To view Bento’s generated files, use:

» cd $(bentoml get iris_classifier:latest -o path)
» tree
├── apis
│   └── openapi.yaml
├── bento.yaml
├── env
│   ├── docker
│   │   ├── Dockerfile
│   │   └──
│   └── python
│       ├── requirements.lock.txt
│       ├── requirements.txt
│       └── version.txt
├── models
│   └── iris_clf
│       ├── latest
│       └── nb5vrfgwfgtjruqj
│           ├── model.yaml
│           └── saved_model.pkl
└── src


  • src: Files from bentofile.yaml’s include field, relative to the code’s current working directory. It allows relative module imports and file paths in user code.
  • models: Contains models needed by the Service, determined from the Service’s runners.
  • apis: API specs generated from the Service’s API definitions.
  • env: Environment files from Bento Build Options in bentofile.yaml.

Now that we’ve explored BentoML’s features in detail, let’s shift our focus to investigate how to serve these models in different cloud environments.


In our exploration of BentoML, we’ve dissected its core functionalities, highlighting its capacity to streamline the transition of models from their developmental stage right through to their practical application. 🚀

Its ability to encapsulate models into Docker images not only simplifies deployment across various platforms but also ensures that models are readily accessible and usable in diverse production environments.

This tool offers a robust framework for model serving. Unfortunately, it leaves deployment concerns entirely to the user 🤔 While it provides foundational monitoring and logging, users must build a more comprehensive monitoring setup to fully harness its capabilities in varied contexts.

I warmly encourage you to try BentoML, as it enabled me to quickly package models, deploy them promptly, and create value.

I’m sincerely grateful for your time in exploring BentoML together and I hope it gave you the insights you were searching for. 🔍
