Data

Dataset

class replay.data.Dataset(feature_schema, interactions, query_features=None, item_features=None, check_consistency=True, categorical_encoded=False)

Universal dataset for feeding data to models.
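
A minimal construction sketch, assuming a pandas interactions log; the column names, feature types, and hints are illustrative, and the FeatureSchema, FeatureInfo, FeatureType and FeatureHint helpers used here are documented later in this section.

import pandas as pd

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType

# Toy interactions log (column names are illustrative).
interactions = pd.DataFrame({
    "query_id": [0, 0, 1],
    "item_id": [10, 11, 10],
    "rating": [5.0, 3.0, 4.0],
    "timestamp": [1, 2, 3],
})

feature_schema = FeatureSchema([
    FeatureInfo(column="query_id", feature_type=FeatureType.CATEGORICAL, feature_hint=FeatureHint.QUERY_ID),
    FeatureInfo(column="item_id", feature_type=FeatureType.CATEGORICAL, feature_hint=FeatureHint.ITEM_ID),
    FeatureInfo(column="rating", feature_type=FeatureType.NUMERICAL, feature_hint=FeatureHint.RATING),
    FeatureInfo(column="timestamp", feature_type=FeatureType.NUMERICAL, feature_hint=FeatureHint.TIMESTAMP),
])

dataset = Dataset(feature_schema=feature_schema, interactions=interactions)
print(dataset.query_count, dataset.item_count)  # counts of unique queries and items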

cache()

Persists the SparkDataFrame with the default storage level (MEMORY_AND_DISK) for interactions, item_features and user_features.

The function is only available when PySpark is installed.

property feature_schema: FeatureSchema
Returns

List of features.

property interactions: Union[DataFrame, DataFrame, DataFrame]
Returns

interactions dataset.

property is_categorical_encoded: bool
Returns

whether categorical features are encoded.

property item_count: int
Returns

The number of items.

property item_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
Returns

item features dataset.

property item_ids: Union[DataFrame, DataFrame, DataFrame]
Returns

dataset with unique item ids.

classmethod load(path, dataframe_type=None)

Load the Dataset from the provided path.

Parameters

path (str) – The file path.

dataframe_type – Dataframe type to use to store internal data. Can be spark, pandas, polars, or None. If not provided, it is set automatically to the type used when the Dataset was saved.

Returns

Loaded Dataset.

Return type

Dataset

persist(storage_level=StorageLevel(True, True, False, True, 1))

Sets the storage level to persist SparkDataFrame for interactions, item_features and user_features.

The function is only available when PySpark is installed.

Parameters

storage_level (StorageLevel) – storage level to set for persistence. default: `MEMORY_AND_DISK_DESER`.

property query_count: int
Returns

the number of queries.

property query_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
Returns

query features dataset.

property query_ids: Union[DataFrame, DataFrame, DataFrame]
Returns

dataset with unique query ids.

save(path)

Save the Dataset to the provided path.

Parameters

path (str) – Path to save the Dataset to.
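
A round-trip sketch continuing the construction example above; the save path is hypothetical.

dataset.save("artifacts/my_dataset")  # hypothetical path
restored = Dataset.load("artifacts/my_dataset")  # backend inferred from the saved data
restored_pd = Dataset.load("artifacts/my_dataset", dataframe_type="pandas")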

subset(features_to_keep)

Returns subset of features. Keeps query and item IDs even if the corresponding sources are not explicitly passed to this function.

Parameters

features_to_keep (Iterable[str]) – sequence of features to keep.

Returns

new Dataset with given features.

Return type

Dataset
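
For illustration, a subset keeping only the rating feature from the sketch above; as noted, the query and item id columns are preserved automatically.

rating_only = dataset.subset(["rating"])
print(rating_only.feature_schema.columns)  # id columns are kept alongside "rating"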

to_pandas()

Convert internally stored dataframes to pandas.DataFrame.

to_polars()

Convert internally stored dataframes to polars.DataFrame.

to_spark()

Convert internally stored dataframes to pyspark.sql.DataFrame.
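
A small backend-conversion sketch, assuming the to_* methods convert the internally stored dataframes in place, as the wording above suggests (polars is required for to_polars, PySpark for to_spark).

dataset.to_polars()
print(type(dataset.interactions))  # polars.DataFrame
dataset.to_pandas()
print(type(dataset.interactions))  # pandas.DataFrame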

unpersist(blocking=False)

Marks the SparkDataFrame as non-persistent and removes all blocks for it from memory and disk for interactions, item_features and user_features.

The function is only available when PySpark is installed.

Parameters

blocking (bool) – whether to block until all blocks are deleted. default: `False`.

DatasetLabelEncoder

class replay.data.dataset_utils.DatasetLabelEncoder(handle_unknown_rule='error', default_value_rule=None)

Categorical features encoder for the Dataset class

fit(dataset)

Fits the encoder on the categorical features of the input Dataset.

Parameters

dataset (Dataset) – the Dataset object.

Returns

fitted DatasetLabelEncoder.

Raises

AssertionError – if any of the dataset's categorical features contains an invalid FeatureSource type.

Return type

DatasetLabelEncoder

fit_transform(dataset)

Fits an encoder and transforms the input Dataset categorical features.

Parameters

dataset (Dataset) – the Dataset object.

Returns

transformed dataset.

Return type

Dataset
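
A minimal encoding sketch, reusing the dataset built earlier in this section.

from replay.data.dataset_utils import DatasetLabelEncoder

encoder = DatasetLabelEncoder()
encoded_dataset = encoder.fit_transform(dataset)  # categorical ids become label-encoded
assert encoded_dataset.is_categorical_encoded

# The fitted rules can be reapplied to other data with the same schema.
encoded_again = encoder.transform(dataset)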

get_encoder(columns)

Get the fitted encoder for the given columns.

Parameters

columns (Union[str, Iterable[str]]) – columns to filter by.

Returns

LabelEncoder.

Return type

Optional[LabelEncoder]

property interactions_encoder: Optional[LabelEncoder]
Returns

interactions LabelEncoder.

property item_features_encoder: Optional[LabelEncoder]
Returns

item features LabelEncoder.

property item_id_encoder: LabelEncoder
Returns

item id LabelEncoder.

property query_and_item_id_encoder: LabelEncoder
Returns

query id and item id LabelEncoder.

property query_features_encoder: Optional[LabelEncoder]
Returns

query features LabelEncoder.

property query_id_encoder: LabelEncoder
Returns

query id LabelEncoder.

transform(dataset)

Transforms the categorical features of the input Dataset according to the fitted rules.

Parameters

dataset (Dataset) – The Dataset object.

Returns

transformed dataset.

Return type

Dataset

FeatureType

final class replay.data.FeatureType(value)

Type of Feature.

CATEGORICAL= categorical

Type of Feature.

CATEGORICAL_LIST= categorical_list

Type of Feature.

NUMERICAL= numerical

Type of Feature.

NUMERICAL_LIST= numerical_list

Type of Feature.

FeatureSource

final class replay.data.FeatureSource(value)

Name of DataFrame.

ITEM_FEATURES= item_features

Name of DataFrame.

QUERY_FEATURES= query_features

Name of DataFrame.

INTERACTIONS= interactions

Name of DataFrame.

FeatureHint

final class replay.data.FeatureHint(value)

Hint to algorithm about column.

ITEM_ID= item_id

Hint to algorithm about column.

QUERY_ID= query_id

Hint to algorithm about column.

RATING= rating

Hint to algorithm about column.

TIMESTAMP= timestamp

Hint to algorithm about column.

FeatureInfo

class replay.data.FeatureInfo(column, feature_type, feature_hint=None, feature_source=None, cardinality=None)

Information about a feature.

property cardinality: Optional[int]
Returns

cardinality of the feature.

property column: str
Returns

the feature name.

property feature_hint: Optional[FeatureHint]
Returns

the feature hint.

property feature_source: Optional[FeatureSource]
Returns

the name of the source dataframe of the feature.

property feature_type: FeatureType
Returns

the type of feature.

reset_cardinality()

Reset cardinality of the feature to None.

FeatureSchema

class replay.data.FeatureSchema(features_list)

Key-value like collection with information about all dataset features.

property all_features: Sequence[FeatureInfo]
Returns

sequence of all features.

property categorical_features: FeatureSchema
Returns

sequence of categorical features in a schema.

property columns: Sequence[str]
Returns

list of all feature column names.

copy()

Creates a copy of all features. For the returned copy, all cardinality values will be undefined.

Returns

copy of the initial feature schema.

Return type

FeatureSchema

drop(column=None, feature_hint=None, feature_source=None, feature_type=None)

Drop features from list by column, feature_source, feature_type and feature_hint.

Parameters
  • column (Optional[str]) – Column name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Returns

new filtered feature schema without selected features.

Return type

FeatureSchema

filter(column=None, feature_hint=None, feature_source=None, feature_type=None)

Filter list by column, feature_source, feature_type and feature_hint.

Parameters
  • column (Optional[str]) – Column name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Returns

new filtered feature schema.

Return type

FeatureSchema
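
A filtering sketch, assuming feature_schema is the schema built earlier in this section.

from replay.data import FeatureHint, FeatureType

# Keep only categorical features.
categorical_only = feature_schema.filter(feature_type=FeatureType.CATEGORICAL)

# Produce a schema without the timestamp feature.
without_timestamp = feature_schema.drop(feature_hint=FeatureHint.TIMESTAMP)
print(without_timestamp.columns)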

get(k[, d]) D[k] if k in D, else d.  d defaults to None.
Return type

Optional[FeatureInfo]

property interaction_features: FeatureSchema
Returns

sequence of interaction features in a schema.

property interactions_rating_column: Optional[str]
Returns

interactions-rating column name.

property interactions_rating_features: FeatureSchema
Returns

sequence of interactions-rating features in a schema.

property interactions_timestamp_column: Optional[str]
Returns

interactions-timestamp column name.

property interactions_timestamp_features: FeatureSchema
Returns

sequence of interactions-timestamp features in a schema.

item()
Returns

a single feature extracted from the schema.

Return type

FeatureInfo

property item_features: FeatureSchema
Returns

sequence of item features in a schema.

property item_id_column: str
Returns

item id column name.

property item_id_feature: FeatureInfo
Returns

the item id feature of the schema.

items() a set-like object providing a view on D's items
Return type

ItemsView[str, FeatureInfo]

keys() a set-like object providing a view on D's keys
Return type

KeysView[str]

property numerical_features: FeatureSchema
Returns

sequence of numerical features in a schema.

property query_features: FeatureSchema
Returns

sequence of query features in a schema.

property query_id_column: str
Returns

query id column name.

property query_id_feature: FeatureInfo
Returns

the query id feature of the schema.

subset(features_to_keep)

Creates a subset of given features.

Parameters

features_to_keep (Iterable[str]) – a sequence of feature columns in original schema to keep in subset.

Returns

new feature schema of given features.

Return type

FeatureSchema

values() an object providing a view on D's values
Return type

ValuesView[FeatureInfo]

GetSchema

replay.data.get_schema(query_column='query_id', item_column='item_id', timestamp_column='timestamp', rating_column='rating', has_timestamp=True, has_rating=True)

Get Spark Schema with query_id, item_id, rating, timestamp columns

Parameters
  • query_column (str) – column name with query ids

  • item_column (str) – column name with item ids

  • timestamp_column (str) – column name with timestamps

  • rating_column (str) – column name with ratings

  • has_rating (bool) – flag to add rating to schema

  • has_timestamp (bool) – flag to add timestamp to schema
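
A usage sketch (PySpark required); the reader call in the comment and the file path are hypothetical.

from replay.data import get_schema

spark_schema = get_schema(has_timestamp=True, has_rating=True)

# e.g. read a raw interactions log with explicit column types:
# log = spark.read.csv("interactions.csv", header=True, schema=spark_schema)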

Neural Networks

This submodule is only available when PyTorch is installed.

TensorFeatureInfo

class replay.data.nn.TensorFeatureInfo(name, feature_type, is_seq=False, feature_hint=None, feature_sources=None, cardinality=None, padding_value=0, embedding_dim=None, tensor_dim=None)

Information about a tensor feature.

property cardinality: Optional[int]
Returns

Cardinality of the feature.

property embedding_dim: Optional[int]
Returns

Embedding dimensions of the feature.

property feature_hint: Optional[FeatureHint]
Returns

The feature hint.

property feature_source: Optional[TensorFeatureSource]
Returns

Source dataframe info of the feature.

property feature_sources: Optional[list[replay.data.nn.schema.TensorFeatureSource]]
Returns

List of sources the feature came from.

property feature_type: FeatureType
Returns

The type of feature.

property is_cat: bool
Returns

Flag that the feature is categorical.

property is_list: bool
Returns

Flag that the feature is a numerical or categorical list.

property is_num: bool
Returns

Flag that the feature is numerical.

property is_seq: bool
Returns

Flag that the feature is sequential.

Sequential means that the value of the feature will be determined for each element of the user’s sequence.

property name: str
Returns

The feature name.

property padding_value: int
Returns

value to pad sequences to desired length.

property tensor_dim: Optional[int]
Returns

Dimensions of the numerical feature.

TensorFeatureSource

class replay.data.nn.TensorFeatureSource(source, column, index=None)

Describes source of a feature

property column: str
Returns

column name

property index: Optional[int]
Returns

provided index

property source: FeatureSource
Returns

feature source

TensorSchema

class replay.data.nn.TensorSchema(features_list)

Key-value like collection that stores tensor features
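
A construction sketch combining TensorFeatureInfo and TensorFeatureSource documented above; the feature name mirrors the earlier examples and is illustrative.

from replay.data import FeatureHint, FeatureSource, FeatureType
from replay.data.nn import TensorFeatureInfo, TensorFeatureSource, TensorSchema

tensor_schema = TensorSchema([
    TensorFeatureInfo(
        name="item_id",
        feature_type=FeatureType.CATEGORICAL,
        is_seq=True,
        feature_hint=FeatureHint.ITEM_ID,
        feature_sources=[TensorFeatureSource(FeatureSource.INTERACTIONS, "item_id")],
    ),
])
print(tensor_schema.item_id_feature_name)  # "item_id"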

property all_features: Sequence[TensorFeatureInfo]
Returns

Sequence of all features.

property categorical_features: TensorSchema
Returns

Sequence of categorical features in a schema.

filter(name=None, feature_hint=None, is_seq=None, feature_type=None)

Filter list by name, feature_type, is_seq and feature_hint.

Parameters
  • name (Optional[str]) – Feature name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • is_seq – Sequential flag to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Returns

New filtered feature schema.

Return type

TensorSchema

get(k[, d]) D[k] if k in D, else d.  d defaults to None.
Return type

Optional[TensorFeatureInfo]

item()
Returns

A single feature extracted from the schema.

Return type

TensorFeatureInfo

property item_id_feature_name: Optional[str]
Returns

Item id feature name.

property item_id_features: TensorSchema
Returns

Sequence of item id features in a schema.

items() a set-like object providing a view on D's items
Return type

ItemsView[str, TensorFeatureInfo]

keys() a set-like object providing a view on D's keys
Return type

KeysView[str]

property names: Sequence[str]
Returns

List of all feature names.

property numerical_features: TensorSchema
Returns

Sequence of numerical features in a schema.

property query_id_feature_name: Optional[str]
Returns

Query id feature name.

property query_id_features: TensorSchema
Returns

Sequence of query id features in a schema.

property rating_feature_name: Optional[str]
Returns

Rating feature name.

property rating_features: TensorSchema
Returns

Sequence of rating features in a schema.

property sequential_features: TensorSchema
Returns

Sequence of sequential features in a schema.

subset(features_to_keep)

Creates a subset of given features.

Parameters

features_to_keep (Iterable[str]) – A sequence of feature names in original schema to keep in subset.

Returns

New tensor schema of given features.

Return type

TensorSchema

property timestamp_feature_name: Optional[str]
Returns

Timestamp feature name.

property timestamp_features: TensorSchema
Returns

Sequence of timestamp features in a schema.

values() an object providing a view on D's values
Return type

ValuesView[TensorFeatureInfo]

SequenceTokenizer

class replay.data.nn.SequenceTokenizer(tensor_schema, handle_unknown_rule='error', default_value_rule=None, allow_collect_to_master=False)

Data tokenizer for transformers. Encodes all categorical features (the ones marked as FeatureType.CATEGORICAL in the FeatureSchema) and stores all data as item sequences (sorted by time if a feature of type FeatureHint.TIMESTAMP is provided, unsorted otherwise).

fit(dataset)
Parameters

dataset (Dataset) – input dataset to fit

Returns

fitted SequenceTokenizer

Return type

SequenceTokenizer

fit_transform(dataset)
Parameters

dataset (Dataset) – input dataset to transform

Returns

SequentialDataset

Return type

SequentialDataset
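
A tokenization sketch, assuming the dataset and tensor_schema built in earlier sketches of this section.

from replay.data.nn import SequenceTokenizer

tokenizer = SequenceTokenizer(tensor_schema)
sequential_dataset = tokenizer.fit_transform(dataset)  # one encoded item sequence per query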

property interactions_encoder: Optional[LabelEncoder]
Returns

encoder for interactions

property item_features_encoder: Optional[LabelEncoder]
Returns

encoder for item features

property item_id_encoder: LabelEncoder
Returns

encoder for item id

classmethod load(path, use_pickle=False, **kwargs)

Load tokenizer object from the given path.

Parameters
  • path (str) – Path to load the tokenizer.

  • use_pickle (bool) – If False, the tokenizer is loaded from a .replay directory. If True, the tokenizer is loaded with pickle. Default: False.

Returns

Loaded tokenizer object.

Return type

SequenceTokenizer

property query_and_item_id_encoder: LabelEncoder
Returns

encoder for query and item id

property query_features_encoder: Optional[LabelEncoder]
Returns

encoder for query features

property query_id_encoder: LabelEncoder
Returns

encoder for query id

save(path, use_pickle=False)

Save the tokenizer to the given path.

Parameters
  • path (str) – Path to save the tokenizer.

  • use_pickle (bool) – If False, the tokenizer is saved to a .replay directory. If True, the tokenizer is saved with pickle. Default: False.
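
A save/load round trip in the default (non-pickle) mode; the path is hypothetical.

tokenizer.save("artifacts/tokenizer")  # written as a .replay directory
restored = SequenceTokenizer.load("artifacts/tokenizer")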

property tensor_schema: TensorSchema
Returns

tensor schema

transform(dataset, tensor_features_to_keep=None)
Parameters
  • dataset (Dataset) – input dataset to transform

  • tensor_features_to_keep (Optional[Sequence[str]]) – specified feature names to transform

Returns

SequentialDataset

Return type

SequentialDataset

PandasSequentialDataset

class replay.data.nn.PandasSequentialDataset(tensor_schema, query_id_column, item_id_column, sequences)

Sequential dataset that stores sequences in PandasDataFrame format.

filter_by_query_id(query_ids_to_keep)

Returns a SequentialDataset that contains only query ids from the specified list.

Parameters

query_ids_to_keep (ndarray) – list of query ids.

Return type

PandasSequentialDataset

get_all_query_ids()

Returns a list of all query ids.

Return type

ndarray

get_max_sequence_length()

Returns the maximum length among all sequences from the SequentialDataset.

Return type

int

get_query_id(index)

Returns the query id for a given index.

Parameters

index (int) – the row number in the dataset.

Return type

int

get_sequence(index, feature_name)

Returns a sequence based on a given index and feature name.

Parameters
  • index (Union[int, ndarray]) – single index or list of indices.

  • feature_name (str) – the name of the feature.

Return type

ndarray

get_sequence_by_query_id(query_id, feature_name)

Returns a sequence based on a given query id and feature name.

Parameters
  • query_id (Union[int, ndarray]) – single query id or list of query ids.

  • feature_name (str) – the name of the feature.

Return type

ndarray

get_sequence_length(index)

Returns the length of the sequence at the specified index.

Parameters

index (int) – the row number in the dataset.

Return type

int

static keep_common_query_ids(lhs, rhs)

Returns SequentialDatasets that contain only the query ids present in both datasets.

Parameters
  • lhs (SequentialDataset) – SequentialDataset.

  • rhs (SequentialDataset) – SequentialDataset.

Return type

tuple[SequentialDataset, SequentialDataset]

classmethod load(path, **kwargs)

Method for loading a PandasSequentialDataset object from a .replay directory.

Return type

PandasSequentialDataset

property schema: TensorSchema
Returns

List of tensor features.
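
A usage sketch on the sequential_dataset produced by the tokenizer above; the query id used here is illustrative.

import numpy as np

print(sequential_dataset.get_max_sequence_length())

# Keep only selected queries and read one of their item sequences.
subset = sequential_dataset.filter_by_query_id(np.array([0]))
item_sequence = subset.get_sequence(0, "item_id")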

TorchSequentialBatch

class replay.data.nn.TorchSequentialBatch(query_id, padding_mask, features)

Batch of TorchSequentialDataset

features: TensorMap

Alias for field number 2

padding_mask: BoolTensor

Alias for field number 1

query_id: LongTensor

Alias for field number 0

TorchSequentialDataset

class replay.data.nn.TorchSequentialDataset(sequential, max_sequence_length, sliding_window_step=None, padding_value=None)

Torch dataset for sequential recommender models

__init__(sequential, max_sequence_length, sliding_window_step=None, padding_value=None)
Parameters
  • sequential (SequentialDataset) – sequential dataset

  • max_sequence_length (int) – the maximum length of sequence

  • sliding_window_step (Optional[int]) – offset from the start of each sequence during iteration; None means the offset equals the difference between the actual sequence length and max_sequence_length. Default: None

  • padding_value (Optional[int]) – value to pad sequences to desired length
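
A construction sketch continuing the tokenizer example, assuming map-style indexing that yields a TorchSequentialBatch as documented below.

from replay.data.nn import TorchSequentialDataset

torch_dataset = TorchSequentialDataset(
    sequential=sequential_dataset,  # from the tokenizer sketch above
    max_sequence_length=50,
    padding_value=0,
)

sample = torch_dataset[0]  # TorchSequentialBatch(query_id, padding_mask, features)
print(sample.padding_mask.shape)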

TorchSequentialValidationBatch

class replay.data.nn.TorchSequentialValidationBatch(query_id, padding_mask, features, ground_truth, train)

Batch of TorchSequentialValidationDataset

features: TensorMap

Alias for field number 2

ground_truth: LongTensor

Alias for field number 3

padding_mask: BoolTensor

Alias for field number 1

query_id: LongTensor

Alias for field number 0

train: LongTensor

Alias for field number 4

TorchSequentialValidationDataset

class replay.data.nn.TorchSequentialValidationDataset(sequential, ground_truth, train, max_sequence_length, padding_value=None, sliding_window_step=None, label_feature_name=None)

Torch dataset for sequential recommender models that additionally stores ground truth

__init__(sequential, ground_truth, train, max_sequence_length, padding_value=None, sliding_window_step=None, label_feature_name=None)
Parameters
  • sequential (SequentialDataset) – validation sequential dataset

  • ground_truth (SequentialDataset) – validation ground_truth sequential dataset

  • train (SequentialDataset) – train sequential dataset

  • max_sequence_length (int) – the maximum length of sequence

  • padding_value (Optional[int]) – value to pad sequences to desired length

  • sliding_window_step (Optional[int]) – offset from the start of each sequence during iteration; None means the offset equals the difference between the actual sequence length and max_sequence_length. Default: None

  • label_feature_name (Optional[str]) – the name of the column containing the sequence of items.

Parquet processing

This module contains the implementation of ParquetDataset, a combination of a PyTorch-compatible dataset and sampler designed for working with the Parquet file format. The main advantages offered by this dataset are:

  1. Batch-wise reading and processing of data, allowing it to work with large datasets in memory-constrained settings.

  2. Full built-in support for Torch’s Distributed Data Parallel mode.

  3. Automatic padding of data according to the provided schema.

ParquetDataset is primarily configured using column schemas - dictionaries containing target columns as keys and their shape/padding specifiers as values. An example column schema:

schema = {
    "user_id": {},  # Empty metadata represents a non-array column.
    "seq_1": {"shape": 5},  # 1-D sequences of length 5 using the default padding value of -1.
    "seq_2": {"shape": [5, 6], "padding": -2},  # 2-D sequences with a custom padding value.
}

ParquetDataset

class replay.data.nn.parquet.ParquetDataset(source, metadata, partition_size, batch_size, filesystem=<pyarrow._fs.LocalFileSystem object>, make_mask_name=<function default_make_mask_name.<locals>.function>, device=device(type='cpu'), generator=None, replicas_info=<replay.data.nn.parquet.info.replicas.ReplicasInfo object>, collate_fn=<function general_collate>, **kwargs)

Combination dataset and sampler for batch-wise reading and processing of Parquet files.

This implementation allows one to read data using a PyArrow Dataset, convert it into structured columns, split it into partitions, and then into batches needed for model training. Supports distributed training and reproducible random shuffling.

During data loader operation, a partition of size partition_size is read. There may be situations where the size of the read partition is less than partition_size - this depends on the number of rows in the data fragment. A fragment is a single Parquet file in the file system.

The partition will be read by every worker, split according to the worker's replica ID, and processed; the result is returned as batches of size batch_size. Please note that the resulting batch size may be less than batch_size.

For maximum efficiency when reading and processing data, as well as improved data shuffling, it is recommended to set partition_size several times larger than batch_size.

Note:

  • ParquetDataset supports only numeric values (boolean/integer/float), therefore, the data paths passed as arguments must contain encoded data.

  • For optimal performance, set the OMP_NUM_THREADS and ARROW_IO_THREADS environment variables to match the number of available CPU cores.

__init__(source, metadata, partition_size, batch_size, filesystem=<pyarrow._fs.LocalFileSystem object>, make_mask_name=<function default_make_mask_name.<locals>.function>, device=device(type='cpu'), generator=None, replicas_info=<replay.data.nn.parquet.info.replicas.ReplicasInfo object>, collate_fn=<function general_collate>, **kwargs)
Parameters
  • source (Union[str, list[str]]) – The path or list of paths to files/directories containing data in Parquet format.

  • metadata (dict[str, dict[str, Union[bool, int, float, str]]]) –

    Metadata describing the data structure. The structure of each column is defined by the following values:

    shape - the dimension of the column being read.

    If the column contains only one value, this parameter does not need to be specified. If the column contains a one-dimensional array, the parameter must be a number or an array containing one number. If the column contains a two-dimensional array, the parameter must be an array containing two numbers.

    padding - padding value that will fill the arrays if their length is less than that specified in the shape parameter.

  • partition_size (int) – Partition size when reading data from Parquet files.

  • batch_size (int) – The size of the batch that will be returned during iteration.

  • filesystem (Union[str, FileSystem]) – A PyArrow FileSystem object used to access data, or a URI-based path from which to infer the filesystem. Default: value of DEFAULT_FILESYSTEM.

  • make_mask_name (Callable[[str], str]) – Mask name generation function. Default: value of DEFAULT_MAKE_MASK_NAME.

  • device (device) – The device on which the data will be generated. Default: value of DEFAULT_DEVICE.

  • generator (Optional[Generator]) – Random number generator for batch shuffling. If None, shuffling will be disabled. Default: None.

  • replicas_info (ReplicasInfoProtocol) – A connector object capable of fetching the total replica count and replica id at runtime. Default: value of DEFAULT_REPLICAS_INFO, a pre-built connector which assumes standard Torch DDP mode (via the torch.utils.data and torch.distributed modules).

  • collate_fn (Callable[[dict[str, Union[torch.Tensor, dict[str, Union[torch.Tensor, ForwardRef('GeneralBatch')]]]]], dict[str, Union[torch.Tensor, dict[str, Union[torch.Tensor, ForwardRef('GeneralBatch')]]]]]) – Collate function for merging batches. Default: value of DEFAULT_COLLATE_FN.
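
A construction sketch with a hypothetical path and illustrative metadata; iteration is assumed to yield padded batches of at most batch_size rows.

import torch

from replay.data.nn.parquet import ParquetDataset

metadata = {
    "user_id": {},  # scalar column
    "item_id": {"shape": 50, "padding": 0},  # 1-D sequences padded to length 50
}

parquet_dataset = ParquetDataset(
    source="data/train.parquet",  # hypothetical path to encoded data
    metadata=metadata,
    partition_size=4096,  # several times larger than batch_size, as recommended above
    batch_size=256,
    generator=torch.Generator().manual_seed(42),  # enables reproducible shuffling
)

for batch in parquet_dataset:  # assumed to be directly iterable
    ...  # process one padded batch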

ParquetModule (Lightning DataModule)

class replay.data.nn.ParquetModule(batch_size, metadata, transforms, config=None, *, train_path=None, validate_path=None, test_path=None, predict_path=None)

Standardized DataModule with batch-wise support via ParquetDataset.

Allows for unified access to all data splits across the training/inference pipeline without loading the full dataset into memory. See the Parquet processing section for details.

ParquetModule provides per-batch data loading and preprocessing via transform pipelines. See the Transforms for ParquetModule section for information about the available batch transforms.

Note:

  • ParquetModule supports only numeric values (boolean/integer/float), therefore, the data paths passed as arguments must contain encoded data.

  • For optimal performance, set the OMP_NUM_THREADS and ARROW_IO_THREADS environment variables to match the number of available CPU cores.

  • It is possible to use all train/validate/test/predict splits; in that case, pass the path of each split as the corresponding argument of ParquetModule. Alternatively, some split paths may be omitted, but then configure the PyTorch Lightning Trainer instance accordingly. For example, if you do not want to use validation data, you can leave the validate_path parameter of ParquetModule unset and set limit_val_batches=0 in the Lightning Trainer.

__init__(batch_size, metadata, transforms, config=None, *, train_path=None, validate_path=None, test_path=None, predict_path=None)
Parameters
  • batch_size (int) – Target batch size.

  • metadata (dict) –

    A dictionary that maps each data split to a dictionary of feature names, where each feature is associated with its shape and padding_value.

    Example: {"train": {"item_id": {"shape": 100, "padding_value": 7657}}}.

    For details, see the section Parquet processing.

  • config (Optional[dict]) –

    Dict specifying configuration options of ParquetDataset (generator, filesystem, collate_fn, make_mask_name, replicas_info) for each data split. Default: DEFAULT_CONFIG.

    In most scenarios, the default configuration is sufficient.

  • transforms (dict[Literal['train', 'validate', 'test', 'predict'], list[torch.nn.modules.module.Module]]) – Dict specifying sequence of Transform modules for each data split.

  • train_path (Optional[str]) – Path to the Parquet file containing train data split. Default: None.

  • validate_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing validation data split. Default: None.

  • test_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing testing data split. Default: None.

  • predict_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing prediction data split. Default: None.

Example

This is a minimal usage example of ParquetModule. It uses train data only, and the Transforms are defined to support further training of the SasRec model.

See the full example in examples/09_sasrec_example.ipynb.

from replay.data.nn import ParquetModule
from replay.nn.transform.template import make_default_sasrec_transforms

metadata = {
    "user_id": {},
    "item_id": {"shape": 50, "padding": 51},
}
# tensor_schema is assumed to be built earlier in the pipeline (see the full example notebook).
transforms = make_default_sasrec_transforms(tensor_schema, query_column="user_id")
parquet_datamodule = ParquetModule(
    batch_size=64,
    metadata=metadata,
    transforms=transforms,
    train_path="data/train.parquet",
)