In the previous post, we learned about inference and inference servers. Today, let's dive into some of the considerations you need to keep in mind before hosting models for inference in production.

Before we dive in, let's see how hosting statistical models differs from hosting deep learning models.

## Statistical vs Deep Learning models

Statistical models are trained to describe the relationships between different variables in your data. They don't require large datasets and are also easier to interpret, meaning it's relatively easy to understand how the model arrived at a prediction.

Examples of statistical models include linear regression for predicting continuous outcomes, logistic regression for predicting categorical outcomes (e.g., classification), and decision trees and random forests, which use tree-based algorithms to predict outcomes.

Statistical models typically have fewer parameters (compared to deep learning models) and require less computational power to make predictions. They are also smaller in size (on disk) than deep learning models (neural networks).

Deep Learning (DL) models are built from neural networks. A *network* is a structure consisting of interconnected computational nodes, or `neurons`, arranged in layers. These nodes perform mathematical operations on input data, learning underlying patterns in the data before producing output based on those patterns.

DL models are intricate and far more complex, with millions of pieces (parameters). They require much more data and computational power to learn patterns in the underlying data, and consequently require more computational power to make predictions during inference.

### Hosting statistical models for inference

For example, say you have a simple linear regression model trained to predict a student's test score from the number of hours they studied, and you want to host it for inference. To host models of this nature, you could typically pick any commodity server with a good balance of compute (CPU) and memory (RAM).
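As a rough sketch (using made-up training data), a model like this boils down to just two numbers, a slope and an intercept, fit with the closed-form least-squares solution; serving a prediction is simple arithmetic:

```python
def fit_linear(xs, ys):
    """Return (slope, intercept) minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, hours):
    """Serve a prediction: score = slope * hours + intercept."""
    slope, intercept = model
    return slope * hours + intercept

# Hypothetical training data: hours studied vs. test score
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 68, 76, 84]

model = fit_linear(hours, scores)
```

The entire "model" fits in a few bytes, which is why a commodity CPU server is plenty for inference on models like this.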

Now, before picking the right hardware for your model, it's important to gather the following information:

- Identify the amount of **data** (input size) your model needs to process. This will help determine the memory requirements for your model.
- Define your **concurrency** requirements. Will your model handle many requests simultaneously, and if so, how many?
- Define your **latency** requirements. What is the min/max time allowed to process a request and return predictions to users?

With the above information, load tests are typically performed on a dev/test inference endpoint to determine which hardware configuration yields optimal throughput and latency.
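The idea behind such a load test can be sketched as follows; `predict_fn` here is a hypothetical stand-in for a call to your inference endpoint:

```python
import time

def load_test(predict_fn, inputs, repeats=100):
    """Call predict_fn on each input repeatedly; report latency percentiles."""
    latencies = []
    for _ in range(repeats):
        for x in inputs:
            start = time.perf_counter()
            predict_fn(x)
            latencies.append(time.perf_counter() - start)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    return p50, p99
```

Real load-testing tools also drive concurrent clients and measure throughput under load, but the core output is the same: latency percentiles per hardware configuration.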

When hosting models on the cloud, to make things a bit easier, AWS provides Amazon SageMaker Inference Recommender jobs to help find the optimal hardware for a given model. You provide the model and sample inputs, and the job runs a load test on the model and reports the optimal hardware configuration for it.

Inference Recommender jobs also report the cost per inference and cost per hour. Here's an example notebook showing how to run an Inference Recommender job for an XGBoost model.

### Hosting Deep Learning models for inference

Because of their complexity and sheer number of parameters, deep learning models require more computational resources, such as powerful CPUs or even GPUs, to process data and make predictions.

For example, an image recognition DL model has to process every pixel in an image, figure out patterns like edges, textures, and shapes, and then decide what object is in the picture. To do this quickly and accurately, especially when handling many simultaneous requests (pictures), you'd need a much more powerful inference server, possibly with a GPU, to handle the workload.

So, how big are these DL models anyway? Well, they can get pretty big. For example, ResNet-50, a popular model used for image recognition, has about 25.6 million parameters.

Depending on batch size and precision, holding just the input, model parameters, and intermediate forward activations can require approximately 10 GiB of GPU memory.
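A back-of-the-envelope sketch for the weights portion of that footprint (activation memory, which scales with batch size and input resolution, has to be estimated separately):

```python
def weight_memory_gib(num_params, bytes_per_param=4):
    """GiB needed just to hold model weights (FP32 by default)."""
    return num_params * bytes_per_param / (1024 ** 3)

# ResNet-50 has roughly 25.6 million parameters, so its FP32 weights
# occupy only ~0.1 GiB; activations for a large batch of images make up
# most of the remaining GPU memory footprint.
resnet50_weights = weight_memory_gib(25.6e6)
```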

You can imagine that hosting such a model requires a powerful GPU instance with at least 16 GiB of GPU memory and a beefy CPU to handle the preprocessing and postprocessing tasks. The number of GPU instances required will depend on your concurrency requirements.
Typically, when hosting large models, input requests are combined into a **batch** and fed to the model together. The number of inputs sent to the model at once is referred to as the `batch size`.

Using a batch size larger than one during inference can lead to more efficient use of computational resources. GPUs are designed to perform parallel operations, which means they can process multiple data samples simultaneously. By grouping samples into batches, you can take advantage of this parallelism, resulting in faster overall processing compared to handling each sample individually.

Larger batch sizes allow better GPU utilization and higher throughput. However, if the batch size is too large, it can increase the inference latency of each individual request. Selecting the right batch size is therefore important for inference.
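Real inference servers implement *dynamic batching*, grouping requests that arrive within a small time window up to a maximum batch size. A stripped-down sketch of just the grouping step (ignoring the timing window) might look like:

```python
def make_batches(requests, batch_size):
    """Group individual inference requests into batches of at most batch_size."""
    batch = []
    for request in requests:
        batch.append(request)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any leftover partial batch
        yield batch

# Ten queued requests with batch_size=4 become batches of 4, 4, and 2.
batches = list(make_batches(range(10), batch_size=4))
```

In a production server, a partial batch is also flushed when a configurable timeout expires, trading a little latency for better GPU utilization.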

## Hosting Large Language Models (LLMs) for inference

Compared to DL models, Large Language Models (LLMs) have significantly more parameters and larger model sizes. For example, GPT-3 has 175 billion parameters! One reason for this vast number of parameters is that training a model to understand natural language is incredibly complex: the model needs to capture the nuances of language, including grammar, idioms, slang, and the context in which words are used.

Another reason is that LLMs are not just trained to predict the next word in a sentence; they learn to perform a wide range of tasks, such as language translation, question answering, and summarization, without needing any additional task-specific training. LLMs today are built on the transformer architecture, which is highly flexible for language modeling but also parameter-heavy.

Here are the parameter counts and model sizes of a popular open-source LLM, Llama-2.

Llama-2 has 3 model variants (as of Mar 2024):

- Llama-2 7B has 7 billion parameters, and its model weights are **14 GB** in size.
- Llama-2 13B has 13 billion parameters, with model weights adding up to **26 GB** in size.
- Llama-2 70B has a whopping 70 billion parameters, with model weights being **140 GB** in size.
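The weight sizes above follow directly from the parameter counts, assuming 2 bytes per parameter (FP16/BF16, the precision these weights are commonly distributed in):

```python
def weights_gb(params_billions, bytes_per_param=2):
    """Model weight size in GB, assuming FP16/BF16 (2 bytes per parameter)."""
    # N billion params * 2 bytes = 2N billion bytes = 2N GB
    return params_billions * bytes_per_param

for name, billions in [("Llama-2 7B", 7), ("Llama-2 13B", 13), ("Llama-2 70B", 70)]:
    print(f"{name}: ~{weights_gb(billions)} GB of weights")
```

Note this covers weights only; serving also needs GPU memory for activations and the KV cache, which grow with batch size and sequence length.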

Hosting LLMs requires instances with massive amounts of memory and GPUs. For larger variants like Llama-2 70B, a server with *multiple* (4-8) high-memory GPUs is required to handle the model weights and batch processing of input data during inference. LLMs are only getting larger by the day, so optimizing models for efficient inference is becoming increasingly important.

Optimizing a model reduces its size and complexity, which in turn reduces the compute and memory required to host it. This results in faster inference times and better performance in real-world scenarios.

In the next post we’ll discuss model optimization techniques like pruning, quantization and distillation that can help reduce model size and complexity.