Featurizer Quick Start
Featurizer helps create distributed FeatureLabelSet
dataframes from Feature Definitions for use in analysis, ML training,
and real-time streaming.
This example uses a scenario that often occurs in financial market simulation. The framework is not limited to financial data, however, and works with any scenario the user provides. The steps below construct mid-price and volatility features from partial order book updates at 1 second granularity, with a 5 second lookahead label as the prediction target:
- Pick an existing FeatureDefinition or define your own (see Feature Definitions).
- Create a FeaturizerConfig:
  - Define start and end dates (more in Data Model).
  - Pick which features to store by setting to_store (more in Storage).
  - Define the label feature by setting label_feature_index and label_lookahead (more in Labeling).
  - Define features in feature_configs. Each feature is the result of applying params:feature and params:data_source to a FeatureDefinition.
  Example config:

    start_date: '2023-02-01 10:00:00'
    end_date: '2023-02-01 11:00:00'
    label_feature_index: 0
    label_lookahead: '5s'
    features_to_store: [0, 1]
    feature_configs:
      - feature_definition: price.mid_price_fd.MidPriceFD
        name: mid_price
        params:
          data_source: &id001
            - exchange: BINANCE
              instrument_type: spot
              symbol: BTC-USDT
          feature:
            sampling: 1s
      - feature_definition: volatility.volatility_stddev_fd.VolatilityStddevFD
        params:
          data_source: *id001
          feature:
            sampling: 1s

  See MidPriceFD and VolatilityStddevFD for implementation details.
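Before submitting a config, it can help to sanity-check the index and duration fields. The following is a minimal sketch, not part of Featurizer itself: `check_config` and `parse_duration_seconds` are hypothetical helpers, the config is mirrored as a plain Python dict rather than loaded from YAML, and it assumes that `label_feature_index` and `features_to_store` refer to positions in `feature_configs`.

```python
import re
from datetime import datetime

# Plain-dict mirror of the example config (params omitted for brevity).
config = {
    "start_date": "2023-02-01 10:00:00",
    "end_date": "2023-02-01 11:00:00",
    "label_feature_index": 0,
    "label_lookahead": "5s",
    "features_to_store": [0, 1],
    "feature_configs": [
        {"feature_definition": "price.mid_price_fd.MidPriceFD", "name": "mid_price"},
        {"feature_definition": "volatility.volatility_stddev_fd.VolatilityStddevFD"},
    ],
}

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}
_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_duration_seconds(text):
    """Parse strings such as '5s' or '2m' into seconds (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def check_config(cfg):
    """Return a list of problems found in cfg; an empty list means none."""
    problems = []
    num_features = len(cfg["feature_configs"])

    # Assumption: indices point into the feature_configs list.
    if not 0 <= cfg["label_feature_index"] < num_features:
        problems.append("label_feature_index is out of range")
    for idx in cfg["features_to_store"]:
        if not 0 <= idx < num_features:
            problems.append(f"features_to_store index {idx} is out of range")

    start = datetime.strptime(cfg["start_date"], _DATE_FORMAT)
    end = datetime.strptime(cfg["end_date"], _DATE_FORMAT)
    if start >= end:
        problems.append("start_date must be earlier than end_date")

    try:
        parse_duration_seconds(cfg["label_lookahead"])
    except ValueError as exc:
        problems.append(str(exc))
    return problems


print(check_config(config))
```

Returning a list of problems rather than raising on the first one makes it easier to fix several config mistakes in a single pass.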
- Run Featurizer, either from the command line:

    svoe featurizer run <path_to_config> --ray-address <addr> --parallelism <num-workers>

  or from Python:

    Featurizer.run(path=<path_to_config>, ray_address=<addr>, parallelism=<num_workers>)

  Featurizer compiles a graph of tasks, executes it in a distributed manner over the cluster, and stores the resulting distributed dataframe (FeatureLabelSet) in cluster memory and, optionally, in persistent storage.
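To illustrate what compiling a graph of tasks can mean, here is a small sketch that orders tasks by their dependencies using Python's standard `graphlib` module. It is not Featurizer's actual scheduler: the task names and the dependency edges (for example, volatility depending on mid price) are assumptions made up for the illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
dependencies = {
    "load_order_book_updates": set(),
    "mid_price": {"load_order_book_updates"},
    "volatility": {"mid_price"},
    "label_mid_price": {"mid_price"},
    "feature_label_set": {"mid_price", "volatility", "label_mid_price"},
}


def execution_batches(deps):
    """Group tasks into batches; tasks in one batch do not depend on each
    other, so a cluster could run them in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = execution_batches(dependencies)
for step, batch in enumerate(batches):
    print(step, batch)
```

In this toy graph the volatility and label tasks land in the same batch, which is the kind of independence a distributed executor can exploit.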
- Get sampled results. Once the calculation is finished, run the following command to load a sampled FeatureLabelSet dataframe into your local machine's memory:

    svoe featurizer get-data --every-n <every_nth_row>

  The above config produces the following dataframe:

         timestamp     receipt_timestamp  label_mid_price-mid_price  mid_price-mid_price  feature_VolatilityStddevFD_62271b09-volatility
    0    1.675234e+09  1.675234e+09       23084.800                  23084.435            0.000547
    1    1.675234e+09  1.675234e+09       23083.760                  23084.355            0.040003
    2    1.675234e+09  1.675234e+09       23083.505                  23084.635            0.117757
    3    1.675234e+09  1.675234e+09       23084.610                  23085.020            0.257091
    4    1.675234e+09  1.675234e+09       23084.725                  23084.800            0.242034
    ...  ...           ...                ...                        ...                  ...

- Visualize the results:

    svoe featurizer plot --every-n <every_nth_row>
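The ideas behind the columns in the result (a lookahead label, a rolling volatility feature, and every-n sampling) can be sketched in plain Python. This is only an illustration under stated assumptions: rows are taken to be evenly spaced at 1 second, so a 5 second lookahead becomes a shift of 5 rows; volatility is approximated as a rolling population standard deviation; and the input prices are made up. The helper names are hypothetical, and the actual MidPriceFD, VolatilityStddevFD, and labeling logic may work differently.

```python
from statistics import pstdev


def lookahead_label(values, lookahead_steps):
    """Pair each row with the value lookahead_steps rows later.
    Rows without a future value get None."""
    size = len(values)
    return [
        values[i + lookahead_steps] if i + lookahead_steps < size else None
        for i in range(size)
    ]


def rolling_stddev(values, window):
    """Population standard deviation over the trailing window of each row."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(pstdev(chunk))
    return result


def every_nth(rows, n):
    """Keep every n-th row, similar in spirit to the --every-n option."""
    return rows[::n]


# Made-up mid prices sampled once per second.
mid_prices = [100.0, 101.0, 102.0, 103.0, 104.0, 105.0, 106.0]
labels = lookahead_label(mid_prices, lookahead_steps=5)
volatility = rolling_stddev(mid_prices, window=3)

rows = list(zip(mid_prices, labels, volatility))
for row in every_nth(rows, 2):
    print(row)
```

Rows near the end of the range have no future value to use as a label, so a real pipeline has to either drop them or read data past the end date.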