Featurizer Quick Start

Featurizer helps create distributed FeatureLabelSet dataframes from Feature Definitions, for use in analysis, ML training and real-time streaming.

This example considers a scenario that often occurs in financial market simulation. Note, however, that the framework is not limited to financial data and can be used with any scenario the user provides. The steps below construct mid-price and volatility features from partial order book updates, with a 5-second lookahead label as the prediction target, using data at 1-second granularity:

  • Pick an existing FeatureDefinition or define your own (see Feature Definitions)

    • Create a FeaturizerConfig

      • Define the start and end dates by setting start_date and end_date (more in Data Model)
      • Pick which features to store by setting features_to_store (more in Storage)
      • Define the label feature by setting label_feature_index and label_lookahead (more in Labeling)
      • Define features in feature_configs. Each feature is the result of applying params:feature and params:data_source to a FeatureDefinition

      Example config:

      start_date: '2023-02-01 10:00:00'
      end_date: '2023-02-01 11:00:00'
      label_feature_index: 0
      label_lookahead: '5s'
      features_to_store: [0, 1]
      feature_configs:
        - feature_definition: price.mid_price_fd.MidPriceFD
          name: mid_price
          params:
            data_source: &id001
              - exchange: BINANCE
                instrument_type: spot
                symbol: BTC-USDT
            feature:
              sampling: 1s
        - feature_definition: volatility.volatility_stddev_fd.VolatilityStddevFD
          params:
            data_source: *id001
            feature:
              sampling: 1s
      
      See MidPriceFD and VolatilityStddevFD for implementation details
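To make the config above more concrete, here is a minimal pure-Python sketch of the quantities it describes: a mid-price, a volatility measure, and a 5-second lookahead label on 1-second samples. It is not Featurizer's implementation. The snapshot data, the helper names (`mid_prices`, `rolling_stddev`, `lookahead_label`) and the 3-sample volatility window are all assumptions made for illustration; see MidPriceFD and VolatilityStddevFD for the actual definitions.

```python
from statistics import pstdev

# Hypothetical top-of-book snapshots sampled once per second:
# (timestamp_seconds, best_bid, best_ask). The values are made up.
snapshots = [(t, 100.0 + t, 102.0 + t) for t in range(8)]


def mid_prices(snaps):
    """Mid-price is the average of the best bid and the best ask."""
    return [(ts, (bid + ask) / 2) for ts, bid, ask in snaps]


def rolling_stddev(values, window):
    """Population standard deviation over a trailing window (assumed definition)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(pstdev(chunk))
    return out


def lookahead_label(series, lookahead):
    """Pair each timestamp with the value seen `lookahead` seconds later.

    Timestamps with no future sample get None; how the real framework
    treats such rows is not covered here.
    """
    by_ts = dict(series)
    return [(ts, by_ts.get(ts + lookahead)) for ts, _ in series]


mids = mid_prices(snapshots)
vol = rolling_stddev([price for _, price in mids], window=3)
labels = lookahead_label(mids, lookahead=5)

for (ts, mid), v, (_, label) in zip(mids, vol, labels):
    print(ts, mid, round(v, 4), label)
```

In this sketch the label for each row is simply the label feature (index 0, the mid-price) shifted back by the lookahead, which is one plausible reading of label_feature_index and label_lookahead.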

  • Run Featurizer

    svoe featurizer run <path_to_config> --ray-address <addr> --parallelism <num-workers>
    
    Featurizer.run(path=<path_to_config>, ray_address=<addr>, parallelism=<num_workers>)
    

    Featurizer will compile a graph of tasks, execute it in a distributed manner over the cluster, and store the resulting distributed dataframe (FeatureLabelSet) in cluster memory and, optionally, in persistent storage.
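The idea of compiling a task graph and executing independent tasks in parallel can be sketched with the Python standard library alone. This is only an illustration of the general technique: the task names and dependencies are hypothetical, a local thread pool stands in for the cluster, and none of it reflects how Featurizer builds or schedules its graph.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
graph = {
    "load_book_updates": set(),
    "mid_price": {"load_book_updates"},
    "volatility": {"mid_price"},
    "label": {"mid_price"},
    "feature_label_set": {"mid_price", "volatility", "label"},
}


def run_task(name, inputs):
    # Placeholder work: describe which upstream tasks fed this one.
    return f"{name}({', '.join(sorted(inputs))})"


def execute(task_graph, parallelism=2):
    """Run tasks in dependency order, submitting all ready tasks together."""
    sorter = TopologicalSorter(task_graph)
    sorter.prepare()
    order, results = [], {}
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # tasks whose dependencies are done
            futures = {
                name: pool.submit(run_task, name, task_graph[name])
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                order.append(name)
                sorter.done(name)  # unblocks downstream tasks
    return order, results


order, results = execute(graph)
print(order)
```

Here `volatility` and `label` become ready at the same time and are submitted together, which is the part a distributed scheduler would spread across workers.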

  • Get sampled results. Once the calculation has finished, run the following command to load a sampled FeatureLabelSet dataframe into your local machine's memory:

    svoe featurizer get-data --every-n <every_nth_row>
    

    The above config will result in the following dataframe:

          timestamp  receipt_timestamp  label_mid_price-mid_price  mid_price-mid_price  feature_VolatilityStddevFD_62271b09-volatility
    0     1.675234e+09       1.675234e+09                  23084.800            23084.435                                        0.000547
    1     1.675234e+09       1.675234e+09                  23083.760            23084.355                                        0.040003
    2     1.675234e+09       1.675234e+09                  23083.505            23084.635                                        0.117757
    3     1.675234e+09       1.675234e+09                  23084.610            23085.020                                        0.257091
    4     1.675234e+09       1.675234e+09                  23084.725            23084.800                                        0.242034
    ...            ...                ...                        ...                  ...                                             ...
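As a rough picture of what sampling every nth row means, the sketch below keeps rows at positions 0, n, 2n and so on from an in-memory list. This is an assumption about the semantics suggested by the `<every_nth_row>` placeholder, not a description of how the get-data command is implemented; the `every_nth` helper and the row data are hypothetical.

```python
def every_nth(rows, n):
    """Keep rows at positions 0, n, 2n, ... (assumed sampling semantics)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return rows[::n]


# Made-up rows standing in for a FeatureLabelSet sampled at 1-second granularity.
rows = [{"timestamp": t, "mid_price": 100.0 + t} for t in range(10)]

sampled = every_nth(rows, 3)
print([row["timestamp"] for row in sampled])
```

Downsampling like this trades resolution for a dataframe small enough to inspect or plot locally.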
    

  • Visualize the results

    svoe featurizer plot --every-n <every_nth_row>