Skip to content

DataSourceDefinition and Data Ingestion pipeline

DataSourceDefinition overview

Similar to FeatureDefinition, DataSourceDefinition defines a blueprint for a data source - a sequence of events which framework treats as an input to our feature calculation pipelines. DataSourceDefinition always plays a role of a terminal node (a leaf) in a FeatureDefinition/Feature tree. Unlike FeatureDefinition, DataSourceDefinition does not have any dependencies and does not perform any feature calculations - the purpose of this class is to hold metadata about event schema, possible preprocessing steps and other info. Let's take a look at an example DataSourceDefinition

class MyDataSourceDefinition(DataSourceDefinition):

    @classmethod
    def event_schema(cls) -> EventSchema:
        # This method defines schema of the event
        raise NotImplemented

    @classmethod
    def preprocess_impl(cls, raw_df: DataFrame) -> DataFrame:
        # Any preprocessing logic to get a sequence of EventSchema-like
        # records in a form of a DataFrame from raw data (raw_df)
        raise NotImplemented

    @classmethod
    def event_emitter_type(cls) -> Type[DataSourceEventEmitter]:
        # provides DataSourceEventEmitter type associated with this data source
        raise NotImplemented

Some examples of common data source definitions

See Featurizer Real Time Streaming for Event Emitter details