Data

Dataset

class replay.data.Dataset(feature_schema, interactions, query_features=None, item_features=None, check_consistency=True, categorical_encoded=False)

Universal dataset for feeding data to models.
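
A minimal construction sketch, assuming a pandas interactions log; the column names, feature types, and hints are illustrative, and the FeatureSchema, FeatureInfo, FeatureType and FeatureHint helpers used here are documented later in this section.

import pandas as pd

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType

# Toy interactions log (column names are illustrative).
interactions = pd.DataFrame({
    "query_id": [0, 0, 1],
    "item_id": [10, 11, 10],
    "rating": [5.0, 3.0, 4.0],
    "timestamp": [1, 2, 3],
})

feature_schema = FeatureSchema([
    FeatureInfo(column="query_id", feature_type=FeatureType.CATEGORICAL, feature_hint=FeatureHint.QUERY_ID),
    FeatureInfo(column="item_id", feature_type=FeatureType.CATEGORICAL, feature_hint=FeatureHint.ITEM_ID),
    FeatureInfo(column="rating", feature_type=FeatureType.NUMERICAL, feature_hint=FeatureHint.RATING),
    FeatureInfo(column="timestamp", feature_type=FeatureType.NUMERICAL, feature_hint=FeatureHint.TIMESTAMP),
])

dataset = Dataset(feature_schema=feature_schema, interactions=interactions)
print(dataset.query_count, dataset.item_count)  # counts of unique queries and items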

cache()

Persists the SparkDataFrame with the default storage level (MEMORY_AND_DISK) for interactions, item_features and user_features.

The function is only available when PySpark is installed.

property feature_schema: FeatureSchema
Returns

List of features.

property interactions: Union[DataFrame, DataFrame, DataFrame]
Returns

interactions dataset.

property is_categorical_encoded: bool
Returns

whether categorical features are encoded.

property item_count: int
Returns

The number of items.

property item_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
Returns

item features dataset.

property item_ids: Union[DataFrame, DataFrame, DataFrame]
Returns

dataset with unique item ids.

classmethod load(path, dataframe_type=None)

Load the Dataset from the provided path.

Parameters

path (str) – The file path.

dataframe_type – Dataframe type to use to store internal data. Can be spark, pandas, polars, or None. If not provided, it is set automatically to the type used when the Dataset was saved.

Returns

Loaded Dataset.

Return type

Dataset

persist(storage_level=StorageLevel(True, True, False, True, 1))

Sets the storage level to persist SparkDataFrame for interactions, item_features and user_features.

The function is only available when PySpark is installed.

Parameters

storage_level (StorageLevel) – storage level to set for persistence. default: `MEMORY_AND_DISK_DESER`.

property query_count: int
Returns

the number of queries.

property query_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
Returns

query features dataset.

property query_ids: Union[DataFrame, DataFrame, DataFrame]
Returns

dataset with unique query ids.

save(path)

Save the Dataset to the provided path.

Parameters

path (str) – Path to save the Dataset to.
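
A round-trip sketch continuing the construction example above; the save path is hypothetical.

dataset.save("artifacts/my_dataset")  # hypothetical path
restored = Dataset.load("artifacts/my_dataset")  # backend inferred from the saved data
restored_pd = Dataset.load("artifacts/my_dataset", dataframe_type="pandas")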

subset(features_to_keep)

Returns subset of features. Keeps query and item IDs even if the corresponding sources are not explicitly passed to this function.

Parameters

features_to_keep (Iterable[str]) – sequence of features to keep.

Returns

new Dataset with given features.

Return type

Dataset
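
For illustration, a subset keeping only the rating feature from the sketch above; as noted, the query and item id columns are preserved automatically.

rating_only = dataset.subset(["rating"])
print(rating_only.feature_schema.columns)  # id columns are kept alongside "rating"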

to_pandas()

Convert internally stored dataframes to pandas.DataFrame.

to_polars()

Convert internally stored dataframes to polars.DataFrame.

to_spark()

Convert internally stored dataframes to pyspark.sql.DataFrame.
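
A small backend-conversion sketch, assuming the to_* methods convert the internally stored dataframes in place, as the wording above suggests (polars is required for to_polars, PySpark for to_spark).

dataset.to_polars()
print(type(dataset.interactions))  # polars.DataFrame
dataset.to_pandas()
print(type(dataset.interactions))  # pandas.DataFrame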

unpersist(blocking=False)

Marks the SparkDataFrame as non-persistent and removes all blocks for it from memory and disk for interactions, item_features and user_features.

The function is only available when PySpark is installed.

Parameters

blocking (bool) – whether to block until all blocks are deleted. default: `False`.

DatasetLabelEncoder

class replay.data.dataset_utils.DatasetLabelEncoder(handle_unknown_rule='error', default_value_rule=None)

Categorical features encoder for the Dataset class

fit(dataset)

Fits the encoder on the categorical features of the input Dataset.

Parameters

dataset (Dataset) – the Dataset object.

Returns

fitted DatasetLabelEncoder.

Raises

AssertionError – if any of the dataset's categorical features contains an invalid FeatureSource type.

Return type

DatasetLabelEncoder

fit_transform(dataset)

Fits an encoder and transforms the input Dataset categorical features.

Parameters

dataset (Dataset) – the Dataset object.

Returns

transformed dataset.

Return type

Dataset
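
A minimal encoding sketch, reusing the dataset built earlier in this section.

from replay.data.dataset_utils import DatasetLabelEncoder

encoder = DatasetLabelEncoder()
encoded_dataset = encoder.fit_transform(dataset)  # categorical ids become label-encoded
assert encoded_dataset.is_categorical_encoded

# The fitted rules can be reapplied to other data with the same schema.
encoded_again = encoder.transform(dataset)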

get_encoder(columns)

Get the fitted encoder for the given columns.

Parameters

columns (Union[str, Iterable[str]]) – columns to filter by.

Returns

LabelEncoder.

Return type

Optional[LabelEncoder]

property interactions_encoder: Optional[LabelEncoder]
Returns

interactions LabelEncoder.

property item_features_encoder: Optional[LabelEncoder]
Returns

item features LabelEncoder.

property item_id_encoder: LabelEncoder
Returns

item id LabelEncoder.

property query_and_item_id_encoder: LabelEncoder
Returns

query id and item id LabelEncoder.

property query_features_encoder: Optional[LabelEncoder]
Returns

query features LabelEncoder.

property query_id_encoder: LabelEncoder
Returns

query id LabelEncoder.

transform(dataset)

Transforms the categorical features of the input Dataset according to the fitted rules.

Parameters

dataset (Dataset) – The Dataset object.

Returns

transformed dataset.

Return type

Dataset

FeatureType

final class replay.data.FeatureType(value)

Type of Feature.

CATEGORICAL= categorical

Type of Feature.

CATEGORICAL_LIST= categorical_list

Type of Feature.

NUMERICAL= numerical

Type of Feature.

NUMERICAL_LIST= numerical_list

Type of Feature.

FeatureSource

final class replay.data.FeatureSource(value)

Name of DataFrame.

ITEM_FEATURES= item_features

Name of DataFrame.

QUERY_FEATURES= query_features

Name of DataFrame.

INTERACTIONS= interactions

Name of DataFrame.

FeatureHint

final class replay.data.FeatureHint(value)

Hint to algorithm about column.

ITEM_ID= item_id

Hint to algorithm about column.

QUERY_ID= query_id

Hint to algorithm about column.

RATING= rating

Hint to algorithm about column.

TIMESTAMP= timestamp

Hint to algorithm about column.

FeatureInfo

class replay.data.FeatureInfo(column, feature_type, feature_hint=None, feature_source=None, cardinality=None)

Information about a feature.

property cardinality: Optional[int]
Returns

cardinality of the feature.

property column: str
Returns

the feature name.

property feature_hint: Optional[FeatureHint]
Returns

the feature hint.

property feature_source: Optional[FeatureSource]
Returns

the name of the source dataframe of the feature.

property feature_type: FeatureType
Returns

the type of feature.

reset_cardinality()

Reset cardinality of the feature to None.

FeatureSchema

class replay.data.FeatureSchema(features_list)

Key-value like collection with information about all dataset features.

property all_features: Sequence[FeatureInfo]
Returns

sequence of all features.

property categorical_features: FeatureSchema
Returns

sequence of categorical features in a schema.

property columns: Sequence[str]
Returns

list of all feature column names.

copy()

Creates a copy of all features. For the returned copy, all cardinality values will be undefined.

Returns

copy of the initial feature schema.

Return type

FeatureSchema

drop(column=None, feature_hint=None, feature_source=None, feature_type=None)

Drop features from list by column, feature_source, feature_type and feature_hint.

Parameters
  • column (Optional[str]) – Column name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Returns

new filtered feature schema without selected features.

Return type

FeatureSchema

filter(column=None, feature_hint=None, feature_source=None, feature_type=None)

Filter list by column, feature_source, feature_type and feature_hint.

Parameters
  • column (Optional[str]) – Column name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Returns

new filtered feature schema.

Return type

FeatureSchema
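
A filtering sketch, assuming feature_schema is the schema built earlier in this section.

from replay.data import FeatureHint, FeatureType

# Keep only categorical features.
categorical_only = feature_schema.filter(feature_type=FeatureType.CATEGORICAL)

# Produce a schema without the timestamp feature.
without_timestamp = feature_schema.drop(feature_hint=FeatureHint.TIMESTAMP)
print(without_timestamp.columns)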

get(k[, d]) D[k] if k in D, else d.  d defaults to None.
Return type

Optional[FeatureInfo]

property interaction_features: FeatureSchema
Returns

sequence of interaction features in a schema.

property interactions_rating_column: Optional[str]
Returns

interactions-rating column name.

property interactions_rating_features: FeatureSchema
Returns

sequence of interactions-rating features in a schema.

property interactions_timestamp_column: Optional[str]
Returns

interactions-timestamp column name.

property interactions_timestamp_features: FeatureSchema
Returns

sequence of interactions-timestamp features in a schema.

item()
Returns

a single feature extracted from the schema.

Return type

FeatureInfo

property item_features: FeatureSchema
Returns

sequence of item features in a schema.

property item_id_column: str
Returns

item id column name.

property item_id_feature: FeatureInfo
Returns

the item id feature of the schema.

items() a set-like object providing a view on D's items
Return type

ItemsView[str, FeatureInfo]

keys() a set-like object providing a view on D's keys
Return type

KeysView[str]

property numerical_features: FeatureSchema
Returns

sequence of numerical features in a schema.

property query_features: FeatureSchema
Returns

sequence of query features in a schema.

property query_id_column: str
Returns

query id column name.

property query_id_feature: FeatureInfo
Returns

the query id feature of the schema.

subset(features_to_keep)

Creates a subset of given features.

Parameters

features_to_keep (Iterable[str]) – a sequence of feature columns in original schema to keep in subset.

Returns

new feature schema of given features.

Return type

FeatureSchema

values() an object providing a view on D's values
Return type

ValuesView[FeatureInfo]

GetSchema

replay.data.get_schema(query_column='query_id', item_column='item_id', timestamp_column='timestamp', rating_column='rating', has_timestamp=True, has_rating=True)

Get Spark Schema with query_id, item_id, rating, timestamp columns

Parameters
  • query_column (str) – column name with query ids

  • item_column (str) – column name with item ids

  • timestamp_column (str) – column name with timestamps

  • rating_column (str) – column name with ratings

  • has_rating (bool) – flag to add rating to schema

  • has_timestamp (bool) – flag to add timestamp to schema
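
A usage sketch (PySpark required); the reader call in the comment and the file path are hypothetical.

from replay.data import get_schema

spark_schema = get_schema(has_timestamp=True, has_rating=True)

# e.g. read a raw interactions log with explicit column types:
# log = spark.read.csv("interactions.csv", header=True, schema=spark_schema)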

Neural Networks

This submodule is only available when PyTorch is installed.

TensorFeatureInfo

class replay.data.nn.TensorFeatureInfo(name, feature_type, is_seq=False, feature_hint=None, feature_sources=None, cardinality=None, padding_value=0, embedding_dim=None, tensor_dim=None)

Information about a tensor feature.

property cardinality: Optional[int]
Returns

Cardinality of the feature.

property embedding_dim: Optional[int]
Returns

Embedding dimensions of the feature.

property feature_hint: Optional[FeatureHint]
Returns

The feature hint.

property feature_source: Optional[TensorFeatureSource]
Returns

Source dataframe info of the feature.

property feature_sources: Optional[list[replay.data.nn.schema.TensorFeatureSource]]
Returns

List of sources the feature came from.

property feature_type: FeatureType
Returns

The type of feature.

property is_cat: bool
Returns

Flag that the feature is categorical.

property is_list: bool
Returns

Flag that the feature is a numerical or categorical list.

property is_num: bool
Returns

Flag that the feature is numerical.

property is_seq: bool
Returns

Flag that the feature is sequential.

Sequential means that the value of the feature will be determined for each element of the user’s sequence.

property name: str
Returns

The feature name.

property padding_value: int
Returns

value to pad sequences to desired length.

property tensor_dim: Optional[int]
Returns

Dimensions of the numerical feature.

TensorFeatureSource

class replay.data.nn.TensorFeatureSource(source, column, index=None)

Describes source of a feature

property column: str
Returns

column name

property index: Optional[int]
Returns

provided index

property source: FeatureSource
Returns

feature source

TensorSchema

class replay.data.nn.TensorSchema(features_list)

Key-value like collection that stores tensor features
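
A construction sketch combining TensorFeatureInfo and TensorFeatureSource documented above; the feature name mirrors the earlier examples and is illustrative.

from replay.data import FeatureHint, FeatureSource, FeatureType
from replay.data.nn import TensorFeatureInfo, TensorFeatureSource, TensorSchema

tensor_schema = TensorSchema([
    TensorFeatureInfo(
        name="item_id",
        feature_type=FeatureType.CATEGORICAL,
        is_seq=True,
        feature_hint=FeatureHint.ITEM_ID,
        feature_sources=[TensorFeatureSource(FeatureSource.INTERACTIONS, "item_id")],
    ),
])
print(tensor_schema.item_id_feature_name)  # "item_id"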

property all_features: Sequence[TensorFeatureInfo]
Returns

Sequence of all features.

property categorical_features: TensorSchema
Returns

Sequence of categorical features in a schema.

filter(name=None, feature_hint=None, is_seq=None, feature_type=None)

Filter list by name, feature_type, is_seq and feature_hint.

Parameters
  • name (Optional[str]) – Feature name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • is_seq – Sequential flag to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Returns

New filtered feature schema.

Return type

TensorSchema

get(k[, d]) D[k] if k in D, else d.  d defaults to None.
Return type

Optional[TensorFeatureInfo]

item()
Returns

A single feature extracted from the schema.

Return type

TensorFeatureInfo

property item_id_feature_name: Optional[str]
Returns

Item id feature name.

property item_id_features: TensorSchema
Returns

Sequence of item id features in a schema.

items() a set-like object providing a view on D's items
Return type

ItemsView[str, TensorFeatureInfo]

keys() a set-like object providing a view on D's keys
Return type

KeysView[str]

property names: Sequence[str]
Returns

List of all feature names.

property numerical_features: TensorSchema
Returns

Sequence of numerical features in a schema.

property query_id_feature_name: Optional[str]
Returns

Query id feature name.

property query_id_features: TensorSchema
Returns

Sequence of query id features in a schema.

property rating_feature_name: Optional[str]
Returns

Rating feature name.

property rating_features: TensorSchema
Returns

Sequence of rating features in a schema.

property sequential_features: TensorSchema
Returns

Sequence of sequential features in a schema.

subset(features_to_keep)

Creates a subset of given features.

Parameters

features_to_keep (Iterable[str]) – A sequence of feature names in original schema to keep in subset.

Returns

New tensor schema of given features.

Return type

TensorSchema

property timestamp_feature_name: Optional[str]
Returns

Timestamp feature name.

property timestamp_features: TensorSchema
Returns

Sequence of timestamp features in a schema.

values() an object providing a view on D's values
Return type

ValuesView[TensorFeatureInfo]

SequenceTokenizer

class replay.data.nn.SequenceTokenizer(tensor_schema, handle_unknown_rule='error', default_value_rule=None, allow_collect_to_master=False)

Data tokenizer for transformers. Encodes all categorical features (the ones marked as FeatureType.CATEGORICAL in the FeatureSchema) and stores all data as item sequences (sorted by time if a feature of type FeatureHint.TIMESTAMP is provided, unsorted otherwise).

fit(dataset)
Parameters

dataset (Dataset) – input dataset to fit

Returns

fitted SequenceTokenizer

Return type

SequenceTokenizer

fit_transform(dataset)
Parameters

dataset (Dataset) – input dataset to transform

Returns

SequentialDataset

Return type

SequentialDataset
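
A tokenization sketch, assuming the dataset and tensor_schema built in earlier sketches of this section.

from replay.data.nn import SequenceTokenizer

tokenizer = SequenceTokenizer(tensor_schema)
sequential_dataset = tokenizer.fit_transform(dataset)  # one encoded item sequence per query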

property interactions_encoder: Optional[LabelEncoder]
Returns

encoder for interactions

property item_features_encoder: Optional[LabelEncoder]
Returns

encoder for item features

property item_id_encoder: LabelEncoder
Returns

encoder for item id

classmethod load(path, use_pickle=False, **kwargs)

Load tokenizer object from the given path.

Parameters
  • path (str) – Path to load the tokenizer.

  • use_pickle (bool) – If False, the tokenizer is loaded from a .replay directory. If True, the tokenizer is loaded with pickle. Default: False.

Returns

Loaded tokenizer object.

Return type

SequenceTokenizer

property query_and_item_id_encoder: LabelEncoder
Returns

encoder for query and item id

property query_features_encoder: Optional[LabelEncoder]
Returns

encoder for query features

property query_id_encoder: LabelEncoder
Returns

encoder for query id

save(path, use_pickle=False)

Save the tokenizer to the given path.

Parameters
  • path (str) – Path to save the tokenizer.

  • use_pickle (bool) – If False, the tokenizer is saved to a .replay directory. If True, the tokenizer is saved with pickle. Default: False.
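
A save/load round trip in the default (non-pickle) mode; the path is hypothetical.

tokenizer.save("artifacts/tokenizer")  # written as a .replay directory
restored = SequenceTokenizer.load("artifacts/tokenizer")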

property tensor_schema: TensorSchema
Returns

tensor schema

transform(dataset, tensor_features_to_keep=None)
Parameters
  • dataset (Dataset) – input dataset to transform

  • tensor_features_to_keep (Optional[Sequence[str]]) – specified feature names to transform

Returns

SequentialDataset

Return type

SequentialDataset

PandasSequentialDataset

class replay.data.nn.PandasSequentialDataset(tensor_schema, query_id_column, item_id_column, sequences)

Sequential dataset that stores sequences in PandasDataFrame format.

filter_by_query_id(query_ids_to_keep)

Returns a SequentialDataset that contains only query ids from the specified list.

Parameters

query_ids_to_keep (ndarray) – list of query ids.

Return type

PandasSequentialDataset

get_all_query_ids()

Returns a list of all query ids.

Return type

ndarray

get_max_sequence_length()

Returns the maximum length among all sequences from the SequentialDataset.

Return type

int

get_query_id(index)

Returns the query id for a given index.

Parameters

index (int) – the row number in the dataset.

Return type

int

get_sequence(index, feature_name)

Returns a sequence based on a given index and feature name.

Parameters
  • index (Union[int, ndarray]) – single index or list of indices.

  • feature_name (str) – the name of the feature.

Return type

ndarray

get_sequence_by_query_id(query_id, feature_name)

Returns a sequence based on a given query id and feature name.

Parameters
  • query_id (Union[int, ndarray]) – single query id or list of query ids.

  • feature_name (str) – the name of the feature.

Return type

ndarray

get_sequence_length(index)

Returns the length of the sequence at the specified index.

Parameters

index (int) – the row number in the dataset.

Return type

int

static keep_common_query_ids(lhs, rhs)

Returns SequentialDatasets that contain only the query ids present in both datasets.

Parameters
  • lhs (SequentialDataset) – SequentialDataset.

  • rhs (SequentialDataset) – SequentialDataset.

Return type

tuple[SequentialDataset, SequentialDataset]

classmethod load(path, **kwargs)

Method for loading a PandasSequentialDataset object from a .replay directory.

Return type

PandasSequentialDataset

property schema: TensorSchema
Returns

List of tensor features.
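
A usage sketch on the sequential_dataset produced by the tokenizer above; the query id used here is illustrative.

import numpy as np

print(sequential_dataset.get_max_sequence_length())

# Keep only selected queries and read one of their item sequences.
subset = sequential_dataset.filter_by_query_id(np.array([0]))
item_sequence = subset.get_sequence(0, "item_id")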

TorchSequentialBatch

class replay.data.nn.TorchSequentialBatch(query_id, padding_mask, features)

Batch of TorchSequentialDataset

features: TensorMap

Alias for field number 2

padding_mask: BoolTensor

Alias for field number 1

query_id: LongTensor

Alias for field number 0

TorchSequentialDataset

class replay.data.nn.TorchSequentialDataset(sequential, max_sequence_length, sliding_window_step=None, padding_value=None)

Torch dataset for sequential recommender models

__init__(sequential, max_sequence_length, sliding_window_step=None, padding_value=None)
Parameters
  • sequential (SequentialDataset) – sequential dataset

  • max_sequence_length (int) – the maximum length of sequence

  • sliding_window_step (Optional[int]) – offset from the start of each sequence during iteration; None means the offset equals the difference between the actual sequence length and max_sequence_length. Default: None

  • padding_value (Optional[int]) – value to pad sequences to desired length
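
A construction sketch continuing the tokenizer example, assuming map-style indexing that yields a TorchSequentialBatch as documented below.

from replay.data.nn import TorchSequentialDataset

torch_dataset = TorchSequentialDataset(
    sequential=sequential_dataset,  # from the tokenizer sketch above
    max_sequence_length=50,
    padding_value=0,
)

sample = torch_dataset[0]  # TorchSequentialBatch(query_id, padding_mask, features)
print(sample.padding_mask.shape)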

TorchSequentialValidationBatch

class replay.data.nn.TorchSequentialValidationBatch(query_id, padding_mask, features, ground_truth, train)

Batch of TorchSequentialValidationDataset

features: TensorMap

Alias for field number 2

ground_truth: LongTensor

Alias for field number 3

padding_mask: BoolTensor

Alias for field number 1

query_id: LongTensor

Alias for field number 0

train: LongTensor

Alias for field number 4

TorchSequentialValidationDataset

class replay.data.nn.TorchSequentialValidationDataset(sequential, ground_truth, train, max_sequence_length, padding_value=None, sliding_window_step=None, label_feature_name=None)

Torch dataset for sequential recommender models that additionally stores ground truth

__init__(sequential, ground_truth, train, max_sequence_length, padding_value=None, sliding_window_step=None, label_feature_name=None)
Parameters
  • sequential (SequentialDataset) – validation sequential dataset

  • ground_truth (SequentialDataset) – validation ground_truth sequential dataset

  • train (SequentialDataset) – train sequential dataset

  • max_sequence_length (int) – the maximum length of sequence

  • padding_value (Optional[int]) – value to pad sequences to desired length

  • sliding_window_step (Optional[int]) – offset from the start of each sequence during iteration; None means the offset equals the difference between the actual sequence length and max_sequence_length. Default: None

  • label_feature_name (Optional[str]) – the name of the column containing the sequence of items.

Parquet processing

This module contains the implementation of ParquetDataset, a combination of a PyTorch-compatible dataset and sampler designed for working with the Parquet file format. The main advantages offered by this dataset are:

  1. Batch-wise reading and processing of data, allowing it to work with large datasets in memory-constrained settings.

  2. Full built-in support for Torch’s Distributed Data Parallel mode.

  3. Automatic padding of data according to the provided schema.

ParquetDataset is primarily configured using column schemas - dictionaries containing target columns as keys and their shape/padding specifiers as values. An example column schema:

schema = {
    "user_id": {},  # Empty metadata represents a non-array column.
    "seq_1": {"shape": 5},  # 1-D sequences of length 5 using the default padding value of -1.
    "seq_2": {"shape": [5, 6], "padding": -2},  # 2-D sequences with a custom padding value.
}

ParquetDataset

class replay.data.nn.parquet.ParquetDataset(source, metadata, partition_size, batch_size, filesystem=<pyarrow._fs.LocalFileSystem object>, make_mask_name=<function default_make_mask_name.<locals>.function>, device=device(type='cpu'), generator=None, replicas_info=<replay.data.nn.parquet.info.replicas.ReplicasInfo object>, collate_fn=<function general_collate>, **kwargs)

Combination dataset and sampler for batch-wise reading and processing of Parquet files.

This implementation allows one to read data using a PyArrow Dataset, convert it into structured columns, split it into partitions, and then into batches needed for model training. Supports distributed training and reproducible random shuffling.

During data loader operation, a partition of size partition_size is read. There may be situations where the size of the read partition is less than partition_size - this depends on the number of rows in the data fragment. A fragment is a single Parquet file in the file system.

The partition will be read by every worker, split according to the worker's replica ID, and processed; the result is returned as batches of size batch_size. Please note that the resulting batch size may be less than batch_size.

For maximum efficiency when reading and processing data, as well as improved data shuffling, it is recommended to set partition_size several times larger than batch_size.

Note:

  • ParquetDataset supports only numeric values (boolean/integer/float), therefore, the data paths passed as arguments must contain encoded data.

  • For optimal performance, set the OMP_NUM_THREADS and ARROW_IO_THREADS environment variables to match the number of available CPU cores.

__init__(source, metadata, partition_size, batch_size, filesystem=<pyarrow._fs.LocalFileSystem object>, make_mask_name=<function default_make_mask_name.<locals>.function>, device=device(type='cpu'), generator=None, replicas_info=<replay.data.nn.parquet.info.replicas.ReplicasInfo object>, collate_fn=<function general_collate>, **kwargs)
Parameters
  • source (Union[str, list[str]]) – The path or list of paths to files/directories containing data in Parquet format.

  • metadata (dict[str, dict[str, Union[bool, int, float, str]]]) –

    Metadata describing the data structure. The structure of each column is defined by the following values:

    shape - the dimension of the column being read.

    If the column contains only one value, this parameter does not need to be specified. If the column contains a one-dimensional array, the parameter must be a number or an array containing one number. If the column contains a two-dimensional array, the parameter must be an array containing two numbers.

    padding - padding value that will fill the arrays if their length is less than that specified in the shape parameter.

  • partition_size (int) – Partition size when reading data from Parquet files.

  • batch_size (int) – The size of the batch that will be returned during iteration.

  • filesystem (Union[str, FileSystem]) – A PyArrow FileSystem object used to access data, or a URI-based path from which to infer the filesystem. Default: value of DEFAULT_FILESYSTEM.

  • make_mask_name (Callable[[str], str]) – Mask name generation function. Default: value of DEFAULT_MAKE_MASK_NAME.

  • device (device) – The device on which the data will be generated. Default: value of DEFAULT_DEVICE.

  • generator (Optional[Generator]) – Random number generator for batch shuffling. If None, shuffling will be disabled. Default: None.

  • replicas_info (ReplicasInfoProtocol) – A connector object capable of fetching the total replica count and replica id at runtime. Default: value of DEFAULT_REPLICAS_INFO, a pre-built connector which assumes standard Torch DDP mode (via the torch.utils.data and torch.distributed modules).

  • collate_fn (Callable[[dict[str, Union[torch.Tensor, dict[str, Union[torch.Tensor, ForwardRef('GeneralBatch')]]]]], dict[str, Union[torch.Tensor, dict[str, Union[torch.Tensor, ForwardRef('GeneralBatch')]]]]]) – Collate function for merging batches. Default: value of DEFAULT_COLLATE_FN.
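
A construction sketch with a hypothetical path and illustrative metadata; iteration is assumed to yield padded batches of at most batch_size rows.

import torch

from replay.data.nn.parquet import ParquetDataset

metadata = {
    "user_id": {},  # scalar column
    "item_id": {"shape": 50, "padding": 0},  # 1-D sequences padded to length 50
}

parquet_dataset = ParquetDataset(
    source="data/train.parquet",  # hypothetical path to encoded data
    metadata=metadata,
    partition_size=4096,  # several times larger than batch_size, as recommended above
    batch_size=256,
    generator=torch.Generator().manual_seed(42),  # enables reproducible shuffling
)

for batch in parquet_dataset:  # assumed to be directly iterable
    ...  # process one padded batch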

ParquetModule (Lightning DataModule)

class replay.data.nn.ParquetModule(batch_size, metadata, transforms, config=None, *, train_path=None, validate_path=None, test_path=None, predict_path=None)

Standardized DataModule with batch-wise support via ParquetDataset.

Allows for unified access to all data splits across the training/inference pipeline without loading the full dataset into memory. See the Parquet processing section for details.

ParquetModule provides per-batch data loading and preprocessing via transform pipelines. See the Transforms for ParquetModule section for information about the available batch transforms.

Note:

  • ParquetModule supports only numeric values (boolean/integer/float), therefore, the data paths passed as arguments must contain encoded data.

  • For optimal performance, set the OMP_NUM_THREADS and ARROW_IO_THREADS environment variables to match the number of available CPU cores.

  • It is possible to use all train/validate/test/predict splits; in that case, pass the path of each split as the corresponding argument of ParquetModule. Alternatively, some split paths may be omitted, but then configure the PyTorch Lightning Trainer instance accordingly. For example, if you do not want to use validation data, you can leave the validate_path parameter of ParquetModule unset and set limit_val_batches=0 in the Lightning Trainer.

__init__(batch_size, metadata, transforms, config=None, *, train_path=None, validate_path=None, test_path=None, predict_path=None)
Parameters
  • batch_size (int) – Target batch size.

  • metadata (dict) –

    A dictionary that maps each data split to a dictionary of feature names, where each feature is associated with its shape and padding_value.

    Example: {"train": {"item_id": {"shape": 100, "padding_value": 7657}}}.

    For details, see the section Parquet processing.

  • config (Optional[dict]) –

    Dict specifying configuration options of ParquetDataset (generator, filesystem, collate_fn, make_mask_name, replicas_info) for each data split. Default: DEFAULT_CONFIG.

    In most scenarios, the default configuration is sufficient.

  • transforms (dict[Literal['train', 'validate', 'test', 'predict'], list[torch.nn.modules.module.Module]]) – Dict specifying sequence of Transform modules for each data split.

  • train_path (Optional[str]) – Path to the Parquet file containing train data split. Default: None.

  • validate_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing validation data split. Default: None.

  • test_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing testing data split. Default: None.

  • predict_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing prediction data split. Default: None.

Example

This is a minimal usage example of ParquetModule. It uses train data only, and the Transforms are defined to support further training of the SasRec model.

See the full example in examples/09_sasrec_example.ipynb.

from replay.data.nn import ParquetModule
from replay.nn.transform.template import make_default_sasrec_transforms

metadata = {
    "user_id": {},
    "item_id": {"shape": 50, "padding": 51},
}
# tensor_schema is assumed to be built earlier in the pipeline (see the full example notebook).
transforms = make_default_sasrec_transforms(tensor_schema, query_column="user_id")
parquet_datamodule = ParquetModule(
    batch_size=64,
    metadata=metadata,
    transforms=transforms,
    train_path="data/train.parquet",
)