Real-time model inference with Inference Loop

Featurizer uses Ray Serve to let users run real-time inference, and provides an InferenceLoop abstraction for asynchronous inference on streaming data.

This allows users to bypass the typical MLOps approach of creating Docker containers and FastAPI services, along with the burden of maintaining the associated infrastructure.

InferenceLoop

InferenceLoop is a class that provides a mechanism for continuously polling a model and storing the latest request result for downstream processing. In a nutshell, it is a separate actor/process that continuously sends requests to a model and exposes the most recent results. When used in offline mode, users can provide a Clock instance to synchronize time between Featurizer event-processing pipelines and inference.
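
Conceptually, the pattern looks like the minimal sketch below. This is an illustration only, not Featurizer's actual implementation: the model actor and input source are stand-ins, and the real InferenceLoop adds Clock synchronization, error handling, and Ray Serve integration on top of this idea.

import time
import ray

# max_concurrency=2 makes this a threaded actor, so get_last_result()
# can be served while run() loops in the background
@ray.remote(max_concurrency=2)
class SimpleInferenceLoop:
    """Toy sketch: continuously poll a model, keeping only the
    most recent result for downstream consumers."""

    def __init__(self, model_actor):
        self.model_actor = model_actor  # assumed: a Ray actor exposing predict()
        self.last_result = None
        self.running = False

    def run(self, next_input, poll_interval_s=0.1):
        # next_input is a stand-in for the streaming input source
        self.running = True
        while self.running:
            features = next_input()
            if features is not None:
                # send a request to the model and overwrite the stored result
                self.last_result = ray.get(self.model_actor.predict.remote(features))
            time.sleep(poll_interval_s)

    def get_last_result(self):
        # downstream pipelines read the latest output without blocking on inference
        return self.last_result

    def stop(self):
        self.running = False

# usage sketch:
# loop = SimpleInferenceLoop.remote(model_actor)
# loop.run.remote(my_input_fn)                     # runs in the background
# latest = ray.get(loop.get_last_result.remote())  # non-blocking read of last result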

Some notable benefits of this approach:

  • No need to maintain model containerization pipelines, FastAPI services, or model registries; deploy with a simple Python API or YAML
  • Integration with MLFlow
  • Asynchronous inference enables real-time processing without blocking on model-related computations
  • Scalable inference via the adjustable num_replicas setting, which controls the number of inference workers
  • Decouples event processing from inference
  • Supports specialized hardware (e.g. GPUs)

Inference Configuration

When running an InferenceLoop in different scenarios (a real-time Featurizer pipeline or the Backtester), users can configure the following options:

  • deployment_name - optional Ray Serve deployment name
  • model_uri - path to the model's artifacts as stored in MLFlow artifact storage
  • predictor_class_name - Ray Serve predictor type (e.g. XGBoostPredictor)
  • num_replicas - number of inference actors used by this loop; requests are distributed evenly across workers (round-robin) via an HTTP proxy actor
from typing import Optional
from pydantic import BaseModel

class InferenceConfig(BaseModel):
    deployment_name: Optional[str]  # optional Ray Serve deployment name
    model_uri: str                  # URI of the model artifacts in MLFlow storage
    predictor_class_name: str       # e.g. 'XGBoostPredictor'
    num_replicas: int               # number of inference workers

The same options can be provided in a pipeline's YAML config:
...
inference_config:
  model_uri: <your-best-model-uri>
  predictor_class_name: 'XGBoostPredictor'
  num_replicas: <number-of-predictor-replicas>
...
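
For illustration, the equivalent configuration could be built directly in Python. Note that this is a hypothetical sketch: the import path and the InferenceLoop constructor shown here are assumptions, not Featurizer's documented API; only the InferenceConfig fields match the class above.

# Hypothetical sketch; the import path and InferenceLoop signature are assumed
from featurizer.inference import InferenceConfig, InferenceLoop

config = InferenceConfig(
    deployment_name=None,                     # let Ray Serve assign a default name
    model_uri='<your-best-model-uri>',        # MLFlow artifact URI of your model
    predictor_class_name='XGBoostPredictor',
    num_replicas=2,                           # two round-robin inference workers
)
loop = InferenceLoop(config)  # assumed constructor taking the config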