Data Ingest Pipeline

To get input data into SVOE, Featurizer provides two main methods: Data Ingest Pipeline and Real Time Data Recording. This section will describe the former.

Featurizer provides a scalable, configurable and extensible data ingest pipeline which takes (offline) raw user data and puts it in Featurizer storage. It takes care of such things as indexing, compaction, per-data type resource allocation for pipeline workers and many other data engineering related problems. It integrates with DataSourceDefinition class so users can easily add their own processing logic in a modular way without spinning up and maintaining data engineering infrastructure.

Usage

CLI: svoe featurizer run-data-ingest <path_to_config>
API: FeaturizerDataIngestPipelineRunner.run(path_to_config)

Config

TODO describe options

provider_name: cryptotick
batch_size: 12
max_executing_tasks: 10
data_source_files:
  - data_source_definition: featurizer.data_definitions.common.l2_book_incremental.cryptotick.cryptotick_l2_book_incremental.CryptotickL2BookIncrementalData
    files_and_sizes:
      - ['limitbook_full/20230201/BINANCE_SPOT_BTC_USDT.csv.gz', 252000]
      - ['limitbook_full/20230202/BINANCE_SPOT_BTC_USDT.csv.gz', 252000]
      - ['limitbook_full/20230203/BINANCE_SPOT_BTC_USDT.csv.gz', 252000]
      - ['limitbook_full/20230204/BINANCE_SPOT_BTC_USDT.csv.gz', 252000]