Featurizer Real Time Streaming

Overview

One of the most powerful tools of Featurizer is the ability to seamlessly switch between batch and real-time data processing without changing feature calculation code. Featurizer is built around a custom stream processing engine (built on top of Ray Actors and ZeroMQ) and uses Kappa-architecture to calculate offline features using online pipelines.

Featurizer provides a set of user-facing classes to build, launch and scale real-time streaming pipelines. It is built using Streamz library to declaratively define event processing logic and Ray Actors to scale the workload across cluster/CPUs.

Event Emitters

DataSourceEventEmitter is a foundation of any streaming pipeline. Users define logic to emit Event objects and how to register an arbitrary callback per feature. When building a pipeline from config, Featurizer automatically registers all necessary callbacks to connect dependent Features and calculation nodes (in case of a distributed run)

class DataSourceEventEmitter:

    @classmethod
    def instance(cls) -> 'DataSourceEventEmitter':
        raise NotImplementedError

    def register_callback(self, feature: Feature, callback: Callable[[Feature, Event], Optional[Any]]):
        raise NotImplementedError

    def start(self):
        raise NotImplementedError

    def stop(self):
        raise NotImplementedError

By default, Featurizer uses CryptofeedEventEmitter which is based on a popular Cryptofeed library.

FeatureStreamGraph

FeatureStreamGraph is a class providing simple way to build a streaming pipeline in a non-distributed setting (i.e 1 worker)

fsg = FeatureStreamGraph(
    features_or_config: Union[List[Feature], FeaturizerConfig],
    combine_outputs: bool = False,
    combined_out_callback: Optional[Callable[[GroupedNamedDataEvent], Any]] = None
)

It uses Streamz library to connect Stream objects into a graph.

combine_outputs defines whether output Feature streams should be merged into one
If combine_outputs is True, combined_out_callback: Callable[[GroupedNamedDataEvent], Any] parameter contains user-defined logic to process newly produced combined event. GroupedNamedDataEvent describes data events for all features grouped into a single object. This is useful for real-time ML inference, where models expect all feature values in a single request

There are a number of useful methods to build custom pipelines:

def emit_named_data_event(self, named_event: NamedDataEvent)

Emits new event into the graph
def set_callback(self, feature: Feature, callback: Callable[[Feature, Event], Optional[Any]])

Sets custom callback per feature event
def get_ins(self) -> List[Feature]

Lists input feature streams
def get_outs(self) -> List[Feature]

Lists output feature streams
def get_stream(self, feature: Feature) -> Stream

Gets stream for feature

When initialized with a FeaturizerConfig, Featurizer automatically builds necessary FeatureStreamGraph objects and connects them with corresponding Event Emitters/

Data Recording

Users can configure Featurizer to store features/data sources events to Featurizer Storage for further processing. See Featurizer Real Time Data Recording.

Simulating real-time stream from offline data with OfflineFeatureStreamGenerator

For backtesting/simulation/ML model validation purposes, we often need to be able to simulate a data stream from stored events. Featurizer provides OfflineFeatureStreamGenerator class, which implements a typical generator interface and can be used to run custom logic over stored data stream.

Work In Progress (make OfflineFeatureEventEmitter)

OfflineFeatureStreamGenerator example

See more in featurizer.feature_stream.offline_feature_stream_generator.py

Scalability

WIP

FeatureStreamWorkerGraph

ScalingStrategy

FeaturizerStreamWorkerActor and FeaturizerStreamManagerActor

Transport

Fault Tolerance

Work In Progress

Processing Order Guarantees

Work In Progress

Visualization

Work In Progress