The Next 700 ML Model Serving Systems

TLDR: A theoretical framework for understanding and comparing machine learning model serving systems in cloud environments, focusing on SageMaker, Vertex AI, and Azure ML. Just as previous "Next 700" papers sought to distill the essence of programming languages, we extract core concepts underlying ML model deployment systems.

1. Introduction

Today's ML engineers must choose among serving systems, each with its own abstractions, terminology, and trade-offs. These platforms differ in how they handle fundamental concerns such as:

Model containerization and packaging
Scaling and resource allocation
Version management and deployment strategies
Monitoring and observability
Resource optimization and cost management

2. A Calculus for ML Model Serving

Core Concepts

ModelArtifact ::= (code, weights, metadata)
Container ::= (ModelArtifact, runtime, deps)
Endpoint ::= (Container, scaling_config, routing)
Version ::= (Endpoint, traffic_weight)

Operations

package : ModelArtifact → Container
deploy : Container → Endpoint
scale : Endpoint × Config → Endpoint
route : Version × Version × Weight → Version
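
To make the calculus concrete, the following is a minimal Python sketch of the four types and four operations. The field types, the default `runtime` value, the single-instance starting point in `deploy`, and the reading of `route` (the result is the candidate version carrying the given weight, with the current version implicitly keeping the remainder) are assumptions made for illustration; the grammar itself does not fix them.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class ModelArtifact:
    code: str
    weights: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Container:
    artifact: ModelArtifact
    runtime: str
    deps: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Endpoint:
    container: Container
    scaling_config: Dict[str, Any]
    routing: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Version:
    endpoint: Endpoint
    traffic_weight: float


def package(artifact: ModelArtifact, runtime: str = "python3.10") -> Container:
    """package : ModelArtifact -> Container (the runtime default is an assumption)."""
    return Container(artifact=artifact, runtime=runtime)


def deploy(container: Container) -> Endpoint:
    """deploy : Container -> Endpoint, starting from a single instance."""
    return Endpoint(container=container, scaling_config={"instance_count": 1})


def scale(endpoint: Endpoint, config: Dict[str, Any]) -> Endpoint:
    """scale : Endpoint x Config -> Endpoint; returns a new value, never mutates."""
    return replace(endpoint, scaling_config={**endpoint.scaling_config, **config})


def route(current: Version, candidate: Version, weight: float) -> Version:
    """route : Version x Version x Weight -> Version.

    Assumed interpretation: the candidate receives `weight`, and the
    current version implicitly keeps the remaining 1 - weight.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return replace(candidate, traffic_weight=weight)


# Compose the operations: package, deploy, scale out, then send 10% to a canary.
artifact = ModelArtifact(code="inference.py", weights="model.bin")
endpoint = scale(deploy(package(artifact)), {"instance_count": 3})
canary = route(Version(endpoint, 1.0), Version(endpoint, 0.0), 0.1)
```

Every operation returns a new value instead of mutating its argument, which keeps the sketch close to the function signatures above.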

3. Platform Analysis

3.1 Amazon SageMaker

SageMaker's approach closely mirrors our theoretical model, with explicit container building and endpoint management. Key mappings include:

Model artifacts are stored in S3 and served from container images hosted in ECR
Endpoints provide real-time inference, with automatic scaling available as a configurable option
Production variants enable traffic splitting
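
As a rough illustration of the last mapping, the sketch below builds the `ProductionVariants` list that boto3's `create_endpoint_config` call accepts. The helper name, the variant and model names, and the instance type are hypothetical; only the dictionary keys come from the SageMaker API.

```python
from typing import Any, Dict, List, Tuple


def production_variants(
    variants: List[Tuple[str, str, float]],
    instance_type: str,
    instance_count: int,
) -> List[Dict[str, Any]]:
    """Build ProductionVariants entries from (variant_name, model_name, weight) tuples."""
    if not variants:
        raise ValueError("at least one variant is required")
    return [
        {
            "VariantName": variant_name,
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": instance_count,
            "InitialVariantWeight": weight,
        }
        for variant_name, model_name, weight in variants
    ]


# Hypothetical names: two models behind one endpoint, weighted 9 to 1.
payload = production_variants(
    [("stable", "hf-model-v1", 9.0), ("canary", "hf-model-v2", 1.0)],
    instance_type="ml.g5.2xlarge",
    instance_count=1,
)
```

The resulting list would be passed as the `ProductionVariants` argument of `create_endpoint_config`; the sketch stops short of making the call so that it runs without AWS credentials.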

Basic Model Deployment

Theoretical Representation:

# SageMaker instantiation of our core grammar
ModelArtifact ::= (
    code = "s3://bucket/model.tar.gz",    # Model code and artifacts
    weights = "s3://bucket/weights",       # Model weights
    metadata = {                           # Essential metadata only
        "framework": str,     # e.g., "huggingface"
        "version": str,      # e.g., "4.37"
        "py_version": str    # e.g., "py310"
    }
)

Container ::= (
    ModelArtifact,
    runtime = {
        "image": str,           # ECR image URI
        "execution_role": str   # IAM role
    },
    deps = {
        "environment": dict,    # Environment variables
        "entry_point": str     # Inference script
    }
)

Endpoint ::= (
    Container,
    scaling_config = {
        "instance_count": int,
        "instance_type": str
    },
    routing = {
        "variants": list[str],  # Production variant names
        "weights": list[float]  # Traffic weights
    }
)

Version ::= (
    Endpoint,
    traffic_weight = float    # Simple weight for this version
)

# Core operations
package : ModelArtifact → Container     # Create SageMaker model
deploy : Container → Endpoint           # Deploy to endpoint
scale : Endpoint × Config → Endpoint    # Update instance count/type
route : Version × Version × Weight → Version  # Update traffic split

Implementation:

from sagemaker.huggingface import (
    HuggingFaceModel,
    HuggingFacePredictor,
    get_huggingface_llm_image_uri,
)

# Placeholders for illustration; substitute your own values
SAGEMAKER_ROLE = "arn:aws:iam::<account-id>:role/<role-name>"
endpoint_name = "hf-llm-demo"
env = {"HF_MODEL_ID": "<hub-model-id>"}  # Environment variables for the container
instance_count, instance_type = 1, "ml.g5.2xlarge"

# Define the model image
image_uri = get_huggingface_llm_image_uri(
    "huggingface",
    version="1.4.2"
)

# Create the model (packaging step)
# An explicit image_uri takes precedence over the version arguments
huggingface_model = HuggingFaceModel(
    env=env,
    role=SAGEMAKER_ROLE,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
    image_uri=image_uri
)

# Deploy the model (endpoint creation); deploy() already returns a predictor
predictor = huggingface_model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    endpoint_name=endpoint_name,
)

# Inference invocation: reattach to the endpoint by name, as a separate client would
predictor = HuggingFacePredictor(endpoint_name=endpoint_name)
response = predictor.predict({"inputs": "What is model serving?"})
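
The listing above covers package and deploy but not route. On SageMaker, variant weights are relative: each variant receives its weight divided by the sum of all weights. The sketch below computes those shares and builds the `DesiredWeightsAndCapacities` list accepted by boto3's `update_endpoint_weights_and_capacities`. The helper names and variant names are hypothetical.

```python
from typing import Any, Dict, List


def traffic_shares(weights: Dict[str, float]) -> Dict[str, float]:
    """Convert relative variant weights into fractions of total traffic."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total for name, weight in weights.items()}


def desired_weights(weights: Dict[str, float]) -> List[Dict[str, Any]]:
    """Build the DesiredWeightsAndCapacities entries for a weight update."""
    return [
        {"VariantName": name, "DesiredWeight": weight}
        for name, weight in weights.items()
    ]


# A 9:1 split sends one request in ten to the canary variant.
shares = traffic_shares({"stable": 9.0, "canary": 1.0})

# Moving to 3:1 raises the canary's share to one request in four.
update = desired_weights({"stable": 3.0, "canary": 1.0})
```

Because weights are relative, `{"stable": 9, "canary": 1}` and `{"stable": 90, "canary": 10}` describe the same split.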

3.2 Azure ML SDK

Azure ML takes a workspace-centric approach built on managed online endpoints, emphasizing environment management and model registry integration. Key mappings include:

Managed deployments handle container creation implicitly
Scaling is defined through deployment configurations
Blue-green deployments manage version transitions

Theoretical Representation:

# Azure ML implementation of our core grammar
ModelArtifact ::= (
    code = "model/path",          # Local or registry path
    weights = "weights/path",
    metadata = {
        "name": str,              # e.g., "hf-model"
        "type": AssetTypes,       # e.g., AssetTypes.CUSTOM_MODEL
        "description": str,
        "registry": Optional[str]  # e.g., "HuggingFace"
    }
)

Container ::= (
    ModelArtifact,
    runtime = {
        "image": str,     # e.g., "mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04"
        "conda_file": {
            "channels": list[str],
            "dependencies": list[str]
        }
    },
    deps = {
        "environment_variables": dict[str, str],
        "pip_packages": list[str]
    }
)

Endpoint ::= (
    Container,
    scaling_config = {
        "instance_type": str,     # e.g., "Standard_DS3_v2"
        "instance_count": int,
        "min_replicas": int,
        "max_replicas": int
    },
    routing = {
        "deployment_name": str,
        "traffic_percentage": int
    }
)

Version ::= (
    Endpoint,
    traffic_weight = {
        "blue_green_config": {
            "active": str,         # blue or green
            "percentage": int,
            "evaluation_rules": dict
        }
    }
)

# Core operations
package : ModelArtifact → Container           # Creates Azure container environment
deploy : Container → Endpoint                 # Deploys to Azure managed endpoint
scale : Endpoint × Config → Endpoint          # Updates endpoint scaling
route : Version × Version × Weight → Version  # Updates traffic routing

Implementation:

import time

from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import (
    Environment,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

# Placeholders for illustration; substitute your own values
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
workspace_name = "<workspace-name>"
model_id = "<hub-model-id>"

# Initialize workspace client
credential = DefaultAzureCredential()
ml_client = MLClient(
    credential=credential,
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name
)

# Define environment with dependencies
environment = Environment(
    name="bert-env",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    conda_file={
        "channels": ["conda-forge", "pytorch"],
        "dependencies": [
            "python=3.11",
            "pip",
            "pytorch",
            "transformers",
            "numpy"
        ]
    }
)

# Describe the model asset to deploy
model = Model(
    path=f"hf://{model_id}",
    type=AssetTypes.CUSTOM_MODEL,
    name="hf-model",
    description="HuggingFace model from Model Hub"
)

# Create and configure endpoint; result() blocks until the operation finishes
endpoint_name = f"hf-ep-{int(time.time())}"
ml_client.online_endpoints.begin_create_or_update(
    ManagedOnlineEndpoint(name=endpoint_name)
).result()

# Deploy model; pass the Model object defined above, not the raw model ID
deployment = ml_client.online_deployments.begin_create_or_update(
    ManagedOnlineDeployment(
        name="demo",
        endpoint_name=endpoint_name,
        model=model,
        environment=environment,
        instance_type="Standard_DS3_v2",
        instance_count=1,
    )
).result()

# Update traffic rules: send all traffic to the "demo" deployment
endpoint = ml_client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"demo": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
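
The traffic update above routes everything to one deployment. A blue-green transition instead moves traffic in steps between two deployments. The sketch below computes the successive `endpoint.traffic` dictionaries for such a rollout, keeping the integer percentages summing to 100 as Azure ML expects. The helper name, the deployment names, and the 25-point step are assumptions for illustration.

```python
from typing import Dict, List


def shift_traffic(
    traffic: Dict[str, int], source: str, target: str, step: int
) -> Dict[str, int]:
    """Move up to `step` percentage points from `source` to `target`."""
    if source not in traffic:
        raise KeyError(f"unknown deployment: {source}")
    if step < 0:
        raise ValueError("step must be non-negative")
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    moved = min(step, traffic[source])  # never drive the source below zero
    updated = dict(traffic)
    updated[source] = traffic[source] - moved
    updated[target] = traffic.get(target, 0) + moved
    return updated


# Walk from all-blue to all-green in 25-point increments.
traffic = {"blue": 100, "green": 0}
stages: List[Dict[str, int]] = []
while traffic["blue"] > 0:
    traffic = shift_traffic(traffic, "blue", "green", 25)
    stages.append(traffic)
```

Each dictionary in `stages` could be assigned to `endpoint.traffic` and applied with the same update call as above, with monitoring checks between steps.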

3.3 Google Cloud Vertex AI

Vertex AI takes a streamlined approach to model deployment, with strong integration with Google Cloud's container infrastructure and emphasis on GPU acceleration.

Theoretical Representation:

# Vertex AI implementation of our core grammar
ModelArtifact ::= (
    code = "gs://model/path",    # GCS path
    weights = "gs://weights/path",
    metadata = {
        "model_id": str,         # e.g., "hf-bert-base"
        "framework": str,        # e.g., "huggingface"
        "generation_config": dict
    }
)

Container ::= (
    ModelArtifact,
    runtime = {
        "image_uri": str,  # e.g., "us-docker.pkg.dev/vertex-ai/prediction/..."
        "accelerator": str # e.g., "NVIDIA_TESLA_A100"
    },
    deps = {
        "env_vars": {
            "MODEL_ID": str,
            "MAX_INPUT_LENGTH": str,
            "MAX_TOTAL_TOKENS": str,
            "NUM_SHARD": str
        }
    }
)

Endpoint ::= (
    Container,
    scaling_config = {
        "machine_type": str,      # e.g., "a2-highgpu-4g"
        "min_replica_count": int,
        "max_replica_count": int,
        "accelerator_count": int
    },
    routing = {
        "traffic_split": dict[str, int],
        "prediction_config": dict
    }
)

Version ::= (
    Endpoint,
    traffic_weight = {
        "split_name": str,
        "percentage": int,
        "monitoring_config": dict
    }
)

# Core operations
package : ModelArtifact → Container           # Creates Vertex AI container
deploy : Container → Endpoint                 # Deploys to Vertex endpoint
scale : Endpoint × Config → Endpoint          # Updates endpoint scaling
route : Version × Version × Weight → Version  # Updates traffic routing

Implementation:

from google.cloud import aiplatform

def deploy_hf_model(
    project_id: str,
    location: str,
    model_id: str,
    machine_type: str = "a2-highgpu-1g",  # one A100, matching accelerator_count=1 below
):
    aiplatform.init(project=project_id, location=location)
    
    env_vars = {
        "MODEL_ID": model_id,
        "MAX_INPUT_LENGTH": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "2048",
        "NUM_SHARD": "1"
    }

    # Upload model with container configuration
    model = aiplatform.Model.upload(
        display_name=f"hf-{model_id.replace('/', '-')}",
        serving_container_image_uri=(
            "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
            "huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310"
        ),
        serving_container_environment_variables=env_vars
    )
    
    # Deploy model with compute configuration
    endpoint = model.deploy(
        machine_type=machine_type,
        min_replica_count=1,
        max_replica_count=1,
        accelerator_type="NVIDIA_TESLA_A100",
        accelerator_count=1,
        sync=True
    )
    
    return endpoint

def create_completion(
    endpoint,
    prompt: str,
    max_tokens: int = 100,
    temperature: float = 0.7
):
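    # predict() takes a list of instances; the text-generation-inference
    # container reads the prompt from the "inputs" field
    response = endpoint.predict(
        instances=[{
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": max_tokens,
                "temperature": temperature,
                "top_p": 0.95,
                "top_k": 40,
            },
        }]
    )
    return response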
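
The `traffic_split` field of the grammar maps deployed model IDs to integer percentages that sum to 100. When a new model is deployed to an existing endpoint, the key `"0"` stands for the model being deployed. The sketch below computes a canary split by scaling the existing shares down proportionally. The helper name and the deployed model IDs are hypothetical, and the largest-remainder rounding is one reasonable choice among several.

```python
from typing import Dict


def canary_split(current: Dict[str, int], canary_percent: int) -> Dict[str, int]:
    """Give the new model ("0") `canary_percent` and scale existing shares down."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must lie in [0, 100]")
    total = sum(current.values())
    if total == 0:
        # Nothing else is serving, so the new model must take all traffic.
        return {"0": 100}
    remaining = 100 - canary_percent
    # Scale existing shares down proportionally, rounding toward zero.
    scaled = {key: share * remaining // total for key, share in current.items()}
    # Hand the percentage points lost to rounding to the largest remainders.
    leftover = remaining - sum(scaled.values())
    by_remainder = sorted(
        current, key=lambda key: current[key] * remaining % total, reverse=True
    )
    for key in by_remainder[:leftover]:
        scaled[key] += 1
    return {"0": canary_percent, **scaled}


# Hypothetical deployed model IDs; the split always sums to 100.
single = canary_split({"1234567890": 100}, 10)
uneven = canary_split({"model-a": 50, "model-b": 50}, 5)
```

A dictionary like `single` could be passed as the `traffic_split` argument of `model.deploy(endpoint=...)`, with the current split read from the endpoint's `traffic_split` property.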

4. Hypothetical Frameworks

4.1 ServerlessML

ServerlessML takes a radical approach by completely eliminating the concept of endpoints and containers, instead treating models as pure functions:

Theoretical Representation:

ModelArtifact ::= (code, weights, metadata, scaling_rules)
Function ::= (ModelArtifact, memory_size, timeout)
Invocation ::= (Function, cold_start_policy)

# Key innovation: No explicit container or endpoint
deploy : ModelArtifact → Function
invoke : Function × Request → Response
scale : automatic, based on concurrent invocations

Implementation:

from serverlessml import MLFunction

model = MLFunction(
    model_path="model.pkl",
    framework="pytorch",
    memory_size="2GB",
    scaling_rules={
        "cold_start_policy": "eager_loading",
        "max_concurrent": 1000,
        "idle_timeout": "10m"
    }
)

# No container or endpoint to configure: deploy() returns a URL that is ready to serve
function_url = model.deploy()

Pros:

Zero infrastructure management - models are treated as pure functions
True pay-per-invocation pricing with no idle costs
Automatic scaling from zero to thousands of concurrent requests

Cons:

Cold starts can impact latency-sensitive applications
Limited control over underlying infrastructure
May be more expensive for constant high-throughput workloads
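
The last trade-off can be made precise with a break-even calculation: pay-per-invocation pricing wins below a certain monthly request volume, and an always-on instance wins above it. The sketch below uses made-up prices purely for illustration; they do not correspond to any real platform.

```python
def break_even_requests(instance_monthly_cost: float, price_per_request: float) -> float:
    """Monthly request volume at which both pricing models cost the same."""
    if price_per_request <= 0:
        raise ValueError("price_per_request must be positive")
    return instance_monthly_cost / price_per_request


def cheaper_option(
    requests_per_month: int,
    instance_monthly_cost: float,
    price_per_request: float,
) -> str:
    """Return which pricing model costs less at the given volume."""
    serverless_cost = requests_per_month * price_per_request
    return "serverless" if serverless_cost < instance_monthly_cost else "provisioned"


# Hypothetical prices: 300 per month for an instance, 0.0001 per invocation.
INSTANCE_COST = 300.0
PRICE_PER_REQUEST = 0.0001

threshold = break_even_requests(INSTANCE_COST, PRICE_PER_REQUEST)  # about 3 million
low_volume = cheaper_option(100_000, INSTANCE_COST, PRICE_PER_REQUEST)
high_volume = cheaper_option(10_000_000, INSTANCE_COST, PRICE_PER_REQUEST)
```

With these numbers the break-even point sits near three million requests per month, so a bursty workload favors the serverless model while a steady high-throughput one favors a provisioned instance.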

4.2 StatefulML

StatefulML introduces a novel approach by making model state and caching first-class concepts:

Theoretical Representation:

ModelArtifact ::= (code, weights, metadata)
ModelState ::= (cache, warm_weights, dynamic_config)
StateManager ::= (caching_policy, update_strategy)
Container ::= (ModelArtifact, ModelState, runtime)

# Key innovation: Explicit state management
deploy : ModelArtifact × StateManager → Container
update_state : Container × ModelState → Container
cache_forward : Container × Request → Response

Implementation:

from statefulml import MLContainer, StateManager

state_manager = StateManager(
    caching_policy={
        "strategy": "predictive_cache",
        "cache_size": "4GB",
        "eviction_policy": "feature_based_lru"
    },
    update_strategy={
        "type": "incremental",
        "frequency": "5m",
        "warm_up": True
    }
)

model = MLContainer(
    model_path="model.pkl",
    framework="tensorflow",
    state_manager=state_manager,
    dynamic_config={
        "feature_importance_tracking": True,
        "automatic_cache_tuning": True
    }
)

endpoint = model.deploy()

Pros:

Intelligent caching reduces latency for common patterns
State persistence improves warm start performance
Dynamic optimization based on actual usage patterns

Cons:

More complex deployment and management
Higher memory requirements for state maintenance
Potential consistency issues with distributed state
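
To illustrate the `cache_forward` operation, here is a minimal sketch of a container that caches responses by request. It uses a plain least-recently-used policy, not the `predictive_cache` or `feature_based_lru` strategies named in the configuration above, and it assumes requests are hashable. The class name and the stand-in model function are hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable, List


class CachingContainer:
    """Wraps a predict function with a bounded LRU response cache."""

    def __init__(self, predict_fn: Callable[[Hashable], Any], max_entries: int = 1024):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._predict = predict_fn
        self._max_entries = max_entries
        self._cache: "OrderedDict[Hashable, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def cache_forward(self, request: Hashable) -> Any:
        """Serve from cache when possible; otherwise run the model and cache the result."""
        if request in self._cache:
            self._cache.move_to_end(request)  # mark as most recently used
            self.hits += 1
            return self._cache[request]
        self.misses += 1
        response = self._predict(request)
        self._cache[request] = response
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return response


# A stand-in model that records how often it is actually invoked.
calls: List[str] = []


def slow_model(request: str) -> str:
    calls.append(request)
    return request.upper()


container = CachingContainer(slow_model, max_entries=2)
responses = [container.cache_forward(r) for r in ["a", "b", "a", "c", "b"]]
```

With a capacity of two, the second request for `"a"` is served from cache, while the second request for `"b"` misses because `"b"` was evicted when `"c"` arrived. This also hints at the consistency concern listed above: each replica would hold its own cache unless state is shared.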

5. Future Directions

The productions below sketch three directions for extending the calculus: finer-grained resource management, model sharing across endpoints, and richer monitoring.

DynamicResource ::= (
    GranularAllocation,
    ElasticScaling,
    CostAwareScheduling
)

SharedModelConfig ::= (
    CrossEndpointSharing,
    DynamicModelLoading,
    ResourcePooling
)

EnhancedMonitoring ::= (
    PredictiveAlerts,
    AutomaticDiagnosis,
    AdaptiveOptimization
)
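
As one possible reading of `CostAwareScheduling`, the sketch below picks the instance type and count that meet a target throughput at the lowest hourly cost. The instance names, prices, and throughput figures are invented for illustration, and the policy (a homogeneous fleet, with no latency or availability constraints) is a deliberate simplification.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class InstanceOption:
    name: str
    hourly_cost: float
    requests_per_second: float


def cheapest_plan(
    options: List[InstanceOption], target_rps: float
) -> Tuple[InstanceOption, int, float]:
    """Return (option, instance_count, hourly_cost) for the cheapest homogeneous fleet."""
    if not options:
        raise ValueError("at least one instance option is required")
    plans = []
    for option in options:
        if option.requests_per_second <= 0:
            raise ValueError(f"{option.name} has non-positive throughput")
        # Round up so the fleet meets the target, with a minimum of one instance.
        count = max(1, math.ceil(target_rps / option.requests_per_second))
        plans.append((option, count, count * option.hourly_cost))
    return min(plans, key=lambda plan: plan[2])


# Hypothetical catalog: a small CPU instance and a larger GPU instance.
catalog = [
    InstanceOption("small", hourly_cost=1.0, requests_per_second=10.0),
    InstanceOption("large", hourly_cost=3.5, requests_per_second=50.0),
]

light_load = cheapest_plan(catalog, target_rps=10.0)
heavy_load = cheapest_plan(catalog, target_rps=100.0)
```

At low load a single small instance is cheapest, while at high load two large instances cost less than ten small ones, which is the kind of decision a cost-aware scheduler would revisit as traffic changes.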

6. Conclusion

This framework provides a common vocabulary for understanding and comparing ML serving systems. Current platforms differ significantly, but many of those differences reflect platform-specific constraints rather than fundamental requirements of the domain. Future systems can build on this analysis to offer more consistent and powerful abstractions for ML deployment.