Data

Dataset

class replay.data.Dataset(feature_schema, interactions, query_features=None, item_features=None, check_consistency=True, categorical_encoded=False)

Universal dataset for feeding data to models.

cache()

Persists the SparkDataFrame with the default storage level (MEMORY_AND_DISK) for interactions, item_features and query_features.

The function is only available when PySpark is installed.

Return type

None

property feature_schema: FeatureSchema
Returns

List of features.

property interactions: Union[DataFrame, DataFrame, DataFrame]
Returns

interactions dataset.

property is_categorical_encoded: bool
Returns

whether categorical features are encoded.

property item_count: int
Returns

The number of items.

property item_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
Returns

item features dataset.

property item_ids: Union[DataFrame, DataFrame, DataFrame]
Returns

dataset with unique item ids.

classmethod load(path, dataframe_type=None)

Load the Dataset from the provided path.

Parameters

path (str) – The file path

dataframe_type – Dataframe type used to store internal data. Can be spark, pandas, polars or None. If not provided, it is set automatically to the type used when the Dataset was saved.

Return type

Dataset

Returns

Loaded Dataset.

persist(storage_level=StorageLevel(True, True, False, True, 1))

Sets the storage level used to persist the SparkDataFrame for interactions, item_features and query_features.

The function is only available when PySpark is installed.

Parameters

storage_level (StorageLevel) – storage level to set for persistence. default: `MEMORY_AND_DISK_DESER`.

Return type

None

property query_count: int
Returns

the number of queries.

property query_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
Returns

query features dataset.

property query_ids: Union[DataFrame, DataFrame, DataFrame]
Returns

dataset with unique query ids.

save(path)

Save the Dataset to the provided path.

Parameters

path (str) – Path to save the Dataset to.

Return type

None

subset(features_to_keep)

Returns a subset of features. Keeps query and item IDs even if the corresponding features are not explicitly passed to this function.

Parameters

features_to_keep (Iterable[str]) – sequence of features to keep.

Return type

Dataset

Returns

new Dataset with given features.
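
The ID-preserving behavior described above can be illustrated with a standalone sketch on a plain dictionary of columns. It does not call RePlay: `subset_columns`, `id_columns` and the sample data are hypothetical names chosen for this example only.

```python
from typing import Dict, Iterable, List


def subset_columns(
    columns: Dict[str, List],
    features_to_keep: Iterable[str],
    id_columns: Iterable[str],
) -> Dict[str, List]:
    """Keep the requested columns plus the ID columns (hypothetical helper)."""
    keep = set(features_to_keep) | set(id_columns)
    # Preserve the original column order while filtering.
    return {name: values for name, values in columns.items() if name in keep}


interactions = {
    "query_id": [0, 0, 1],
    "item_id": [10, 11, 10],
    "rating": [5.0, 3.0, 4.0],
    "timestamp": [1, 2, 3],
}

# Only "rating" is requested, yet both ID columns survive.
subset = subset_columns(interactions, ["rating"], id_columns=["query_id", "item_id"])
print(list(subset))
```

In this sketch the timestamp column is dropped, while the query and item ID columns are kept alongside the requested rating column.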

to_pandas()

Convert internally stored dataframes to pandas.DataFrame.

Return type

None

to_polars()

Convert internally stored dataframes to polars.DataFrame.

to_spark()

Convert internally stored dataframes to pyspark.sql.DataFrame.

unpersist(blocking=False)

Marks the SparkDataFrame as non-persistent and removes all blocks for it from memory and disk for interactions, item_features and query_features.

The function is only available when PySpark is installed.

Parameters

blocking (bool) – whether to block until all blocks are deleted. default: `False`.

Return type

None

DatasetLabelEncoder

class replay.data.dataset_utils.DatasetLabelEncoder(handle_unknown_rule='error', default_value_rule=None)

Categorical features encoder for the Dataset class

fit(dataset)

Fits the encoder on the categorical features of the input Dataset.

Parameters

dataset (Dataset) – the Dataset object.

Return type

DatasetLabelEncoder

Returns

fitted DatasetLabelEncoder.

Raises

AssertionError – if any categorical feature of the dataset has an invalid FeatureSource type.

fit_transform(dataset)

Fits an encoder and transforms the input Dataset categorical features.

Parameters

dataset (Dataset) – the Dataset object.

Return type

Dataset

Returns

transformed dataset.

get_encoder(columns)

Returns the fitted encoder for the given columns.

Parameters

columns (Union[str, Iterable[str]]) – columns to filter by.

Return type

Optional[LabelEncoder]

Returns

LabelEncoder.

property interactions_encoder: Optional[LabelEncoder]
Returns

interactions LabelEncoder.

property item_features_encoder: Optional[LabelEncoder]
Returns

item features LabelEncoder.

property item_id_encoder: LabelEncoder
Returns

item id LabelEncoder.

property query_and_item_id_encoder: LabelEncoder
Returns

query id and item id LabelEncoder.

property query_features_encoder: Optional[LabelEncoder]
Returns

query features LabelEncoder.

property query_id_encoder: LabelEncoder
Returns

query id LabelEncoder.

transform(dataset)

Transforms the input Dataset categorical features by rules.

Parameters

dataset (Dataset) – The Dataset object.

Return type

Dataset

Returns

transformed dataset.
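
To make the fit/transform cycle more concrete, here is a minimal standalone sketch of label encoding with the 'error' rule for unknown values. It is not the library's implementation: `LabelEncoderSketch`, the consecutive-integer coding and the choice of `ValueError` are assumptions made for illustration.

```python
from typing import Dict, Hashable, Iterable, List


class LabelEncoderSketch:
    """Toy encoder: maps each distinct value to a consecutive integer code."""

    def __init__(self, handle_unknown_rule: str = "error") -> None:
        self.handle_unknown_rule = handle_unknown_rule
        self.mapping: Dict[Hashable, int] = {}

    def fit(self, values: Iterable[Hashable]) -> "LabelEncoderSketch":
        for value in values:
            # Assign the next free code on first appearance (assumption).
            self.mapping.setdefault(value, len(self.mapping))
        return self

    def transform(self, values: Iterable[Hashable]) -> List[int]:
        encoded = []
        for value in values:
            if value not in self.mapping:
                if self.handle_unknown_rule == "error":
                    # Exception type is an assumption of this sketch.
                    raise ValueError(f"Unknown label: {value!r}")
                raise NotImplementedError("Only the 'error' rule is sketched.")
            encoded.append(self.mapping[value])
        return encoded


encoder = LabelEncoderSketch().fit(["a", "b", "a", "c"])
codes = encoder.transform(["c", "a"])
print(codes)
```

Fitting once and reusing the same mapping for later data is what keeps the codes consistent between, for example, training and validation datasets.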

FeatureType

final class replay.data.FeatureType(value)

Type of Feature.

CATEGORICAL = 'categorical'

Type of Feature.

NUMERICAL = 'numerical'

Type of Feature.
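
The `(value)` signature suggests a standard Python enumeration, where calling the class with a value looks up the matching member. The sketch below assumes that and uses a hypothetical stand-in, `FeatureTypeSketch`, rather than the library class itself.

```python
from enum import Enum


class FeatureTypeSketch(Enum):
    """Stand-in mirroring the two members listed above."""

    CATEGORICAL = "categorical"
    NUMERICAL = "numerical"


# Calling the class with a value returns the existing member, not a new object.
looked_up = FeatureTypeSketch("categorical")
print(looked_up is FeatureTypeSketch.CATEGORICAL)
print(FeatureTypeSketch.NUMERICAL.value)
```

If the same assumption holds for FeatureSource and FeatureHint, they can be read the same way: the member name is used in code, and the string after the equals sign is its value.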

FeatureSource

final class replay.data.FeatureSource(value)

Name of DataFrame.

ITEM_FEATURES = 'item_features'

Name of DataFrame.

QUERY_FEATURES = 'query_features'

Name of DataFrame.

INTERACTIONS = 'interactions'

Name of DataFrame.

FeatureHint

final class replay.data.FeatureHint(value)

Hint to algorithm about column.

ITEM_ID = 'item_id'

Hint to algorithm about column.

QUERY_ID = 'query_id'

Hint to algorithm about column.

RATING = 'rating'

Hint to algorithm about column.

TIMESTAMP = 'timestamp'

Hint to algorithm about column.

FeatureInfo

class replay.data.FeatureInfo(column, feature_type, feature_hint=None, feature_source=None, cardinality=None)

Information about a feature.

property cardinality: Optional[int]
Returns

cardinality of the feature.

property column: str
Returns

the feature name.

property feature_hint: Optional[FeatureHint]
Returns

the feature hint.

property feature_source: Optional[FeatureSource]
Returns

the name of source dataframe of feature.

property feature_type: FeatureType
Returns

the type of feature.

reset_cardinality()

Reset cardinality of the feature to None.

Return type

None

FeatureSchema

class replay.data.FeatureSchema(features_list)

Key-value like collection with information about all dataset features.

property all_features: Sequence[FeatureInfo]
Returns

sequence of all features.

property categorical_features: FeatureSchema
Returns

sequence of categorical features in a schema.

property columns: Sequence[str]
Returns

list of the column names of all features.

copy()

Creates a copy of all features.

Return type

FeatureSchema

Returns

copy of the initial feature schema.

drop(column=None, feature_hint=None, feature_source=None, feature_type=None)

Drop features from list by column, feature_source, feature_type and feature_hint.

Parameters
  • column (Optional[str]) – Column name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Return type

FeatureSchema

Returns

new filtered feature schema without selected features.

filter(column=None, feature_hint=None, feature_source=None, feature_type=None)

Filter list by column, feature_source, feature_type and feature_hint.

Parameters
  • column (Optional[str]) – Column name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Return type

FeatureSchema

Returns

new filtered feature schema.
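
The filtering parameters above can be pictured with a standalone sketch. It assumes that a criterion left as None is ignored and that the remaining criteria are combined with AND. `FeatureSketch` and `filter_features` are hypothetical helpers, and plain strings stand in for the enum values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FeatureSketch:
    """Hypothetical stand-in for a schema entry."""

    column: str
    feature_type: str
    feature_source: Optional[str] = None
    feature_hint: Optional[str] = None


def filter_features(
    features: List[FeatureSketch],
    column: Optional[str] = None,
    feature_hint: Optional[str] = None,
    feature_source: Optional[str] = None,
    feature_type: Optional[str] = None,
) -> List[FeatureSketch]:
    """Return the features matching every criterion that is not None."""
    criteria = {
        "column": column,
        "feature_hint": feature_hint,
        "feature_source": feature_source,
        "feature_type": feature_type,
    }
    # A None criterion is skipped; the rest are combined with AND (assumption).
    active = {name: wanted for name, wanted in criteria.items() if wanted is not None}
    return [
        feature
        for feature in features
        if all(getattr(feature, name) == wanted for name, wanted in active.items())
    ]


features = [
    FeatureSketch("query_id", "categorical", "interactions", "query_id"),
    FeatureSketch("item_id", "categorical", "interactions", "item_id"),
    FeatureSketch("rating", "numerical", "interactions", "rating"),
    FeatureSketch("genre", "categorical", "item_features"),
]

selected = filter_features(features, feature_source="interactions", feature_type="categorical")
print([feature.column for feature in selected])
```

With no criteria at all, the sketch returns every feature unchanged, which matches the reading of None as "do not filter by this field".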

get(key, default=None)
Return type

Optional[FeatureInfo]

property interaction_features: FeatureSchema
Returns

sequence of interaction features in a schema.

property interactions_rating_column: Optional[str]
Returns

interactions-rating column name.

property interactions_rating_features: FeatureSchema
Returns

sequence of interactions-rating features in a schema.

property interactions_timestamp_column: Optional[str]
Returns

interactions-timestamp column name.

property interactions_timestamp_features: FeatureSchema
Returns

sequence of interactions-timestamp features in a schema.

item()
Return type

FeatureInfo

Returns

the information of the single feature extracted from the schema.

property item_features: FeatureSchema
Returns

sequence of item features in a schema.

property item_id_column: str
Returns

item id column name.

property item_id_feature: FeatureInfo
Returns

item id feature information.

items()
Return type

ItemsView[str, FeatureInfo]

keys()
Return type

KeysView[str]

property numerical_features: FeatureSchema
Returns

sequence of numerical features in a schema.

property query_features: FeatureSchema
Returns

sequence of query features in a schema.

property query_id_column: str
Returns

query id column name.

property query_id_feature: FeatureInfo
Returns

query id feature information.

subset(features_to_keep)

Creates a subset of given features.

Parameters

features_to_keep (Iterable[str]) – a sequence of feature columns in original schema to keep in subset.

Return type

FeatureSchema

Returns

new feature schema of given features.

values()
Return type

ValuesView[FeatureInfo]

GetSchema

replay.data.get_schema(query_column='query_id', item_column='item_id', timestamp_column='timestamp', rating_column='rating', has_timestamp=True, has_rating=True)

Returns a Spark schema with query_id, item_id, rating and timestamp columns.

Parameters
  • query_column (str) – column name with query ids

  • item_column (str) – column name with item ids

  • timestamp_column (str) – column name with timestamps

  • rating_column (str) – column name with ratings

  • has_rating (bool) – flag to add rating to schema

  • has_timestamp (bool) – flag to add timestamp to schema

Neural Networks

This submodule is only available when PyTorch is installed.

TensorFeatureInfo

class replay.data.nn.TensorFeatureInfo(name, feature_type, is_seq=False, feature_hint=None, feature_sources=None, cardinality=None, embedding_dim=None, tensor_dim=None)

Information about a tensor feature.

property cardinality: Optional[int]
Returns

Cardinality of the feature.

property embedding_dim: Optional[int]
Returns

Embedding dimensions of the feature.

property feature_hint: Optional[FeatureHint]
Returns

The feature hint.

property feature_source: Optional[TensorFeatureSource]
Returns

Dataframe info of feature.

property feature_sources: Optional[List[TensorFeatureSource]]
Returns

List of sources feature came from.

property feature_type: FeatureType
Returns

The type of feature.

property is_cat: bool
Returns

Flag that feature is categorical.

property is_num: bool
Returns

Flag that feature is numerical.

property is_seq: bool
Returns

Flag that feature is sequential.

property name: str
Returns

The feature name.

property tensor_dim: Optional[int]
Returns

Dimensions of the numerical feature.

TensorFeatureSource

class replay.data.nn.TensorFeatureSource(source, column, index=None)

Describes source of a feature

property column: str
Returns

column name

property index: Optional[int]
Returns

provided index

property source: FeatureSource
Returns

feature source

TensorSchema

class replay.data.nn.TensorSchema(features_list)

Key-value like collection that stores tensor features

property all_features: Sequence[TensorFeatureInfo]
Returns

Sequence of all features.

property categorical_features: TensorSchema
Returns

Sequence of categorical features in a schema.

filter(name=None, feature_hint=None, is_seq=None, feature_type=None)

Filter list by name, feature_type, is_seq and feature_hint.

Parameters
  • name (Optional[str]) – Feature name to filter by. default: None.

  • feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.

  • is_seq – Whether the feature is sequential, to filter by. default: None.

  • feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.

Return type

TensorSchema

Returns

New filtered feature schema.

get(key, default=None)
Return type

Optional[TensorFeatureInfo]

item()
Return type

TensorFeatureInfo

Returns

Extract single feature from a schema.

property item_id_feature_name: Optional[str]
Returns

Item id feature name.

property item_id_features: TensorSchema
Returns

Sequence of item id features in a schema.

items()
Return type

ItemsView[str, TensorFeatureInfo]

keys()
Return type

KeysView[str]

property names: Sequence[str]
Returns

List of the names of all features.

property numerical_features: TensorSchema
Returns

Sequence of numerical features in a schema.

property query_id_feature_name: Optional[str]
Returns

Query id feature name.

property query_id_features: TensorSchema
Returns

Sequence of query id features in a schema.

property rating_feature_name: Optional[str]
Returns

Rating feature name.

property rating_features: TensorSchema
Returns

Sequence of rating features in a schema.

property sequential_features: TensorSchema
Returns

Sequence of sequential features in a schema.

subset(features_to_keep)

Creates a subset of given features.

Parameters

features_to_keep (Iterable[str]) – A sequence of feature names in original schema to keep in subset.

Return type

TensorSchema

Returns

New tensor schema of given features.

property timestamp_feature_name: Optional[str]
Returns

Timestamp feature name.

property timestamp_features: TensorSchema
Returns

Sequence of timestamp features in a schema.

values()
Return type

ValuesView[TensorFeatureInfo]

SequenceTokenizer

class replay.data.nn.SequenceTokenizer(tensor_schema, handle_unknown_rule='error', default_value_rule=None, allow_collect_to_master=False)

Data tokenizer for transformers. Encodes all categorical features (the ones marked as FeatureType.CATEGORICAL in the FeatureSchema) and stores all data as item sequences (sorted by time if a feature with the hint FeatureHint.TIMESTAMP is provided, unsorted otherwise).
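
The grouping step described above can be sketched without the library: interactions are ordered by timestamp and then collected into one item sequence per query. `build_sequences` and the tuple layout `(query_id, item_id, timestamp)` are assumptions made for this illustration, not the tokenizer's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_sequences(interactions: List[Tuple[int, int, int]]) -> Dict[int, List[int]]:
    """Group (query_id, item_id, timestamp) rows into time-ordered item sequences."""
    # Sort by timestamp so that each sequence ends with the latest interaction.
    ordered = sorted(interactions, key=lambda row: row[2])
    sequences: Dict[int, List[int]] = defaultdict(list)
    for query_id, item_id, _timestamp in ordered:
        sequences[query_id].append(item_id)
    return dict(sequences)


rows = [(1, 30, 5), (0, 10, 2), (1, 20, 1), (0, 11, 9)]
sequences = build_sequences(rows)
print(sequences[0], sequences[1])
```

Without a timestamp column, the sorting step would be skipped and each sequence would simply follow the row order of the input.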

fit(dataset)
Parameters

dataset (Dataset) – input dataset to fit

Return type

SequenceTokenizer

Returns

fitted SequenceTokenizer

fit_transform(dataset)
Parameters

dataset (Dataset) – input dataset to transform

Return type

SequentialDataset

Returns

SequentialDataset

property interactions_encoder: Optional[LabelEncoder]
Returns

encoder for interactions

property item_features_encoder: Optional[LabelEncoder]
Returns

encoder for item features

property item_id_encoder: LabelEncoder
Returns

encoder for item id

classmethod load(cls, path, use_pickle=False, **kwargs)

Load tokenizer object from the given path.

Parameters
  • path (str) – Path to load the tokenizer.

  • use_pickle (bool) – If False, the tokenizer is loaded from a .replay directory. If True, the tokenizer is loaded with pickle. Default: False.

Return type

SequenceTokenizer

Returns

Loaded tokenizer object.

property query_and_item_id_encoder: LabelEncoder
Returns

encoder for query and item id

property query_features_encoder: Optional[LabelEncoder]
Returns

encoder for query features

property query_id_encoder: LabelEncoder
Returns

encoder for query id

save(path, use_pickle=False)

Save the tokenizer to the given path.

Parameters
  • path (str) – Path to save the tokenizer.

  • use_pickle (bool) – If False, the tokenizer is saved in a .replay directory. If True, the tokenizer is saved with pickle. Default: False.

Return type

None

property tensor_schema: TensorSchema
Returns

tensor schema

transform(dataset, tensor_features_to_keep=None)
Parameters
  • dataset (Dataset) – input dataset to transform

  • tensor_features_to_keep (Optional[Sequence[str]]) – specified feature names to transform

Return type

SequentialDataset

Returns

SequentialDataset

PandasSequentialDataset

class replay.data.nn.PandasSequentialDataset(tensor_schema, query_id_column, item_id_column, sequences)

Sequential dataset that stores sequences in PandasDataFrame format.

filter_by_query_id(query_ids_to_keep)

Returns a SequentialDataset that contains only query ids from the specified list.

Parameters

query_ids_to_keep (ndarray) – list of query ids.

Return type

PandasSequentialDataset

get_all_query_ids()

Returns all query ids.

Return type

ndarray

get_max_sequence_length()

Returns the maximum length among all sequences from the SequentialDataset.

Return type

int

get_query_id(index)

Returns the query id for a given index.

Parameters

index (int) – the row number in the dataset.

Return type

int

get_sequence(index, feature_name)

Returns the sequence for a given index and feature name.

Parameters
  • index (Union[int, ndarray]) – single index or list of indices.

  • feature_name (str) – the name of the feature.

Return type

ndarray

get_sequence_by_query_id(query_id, feature_name)

Returns the sequence for a given query id and feature name.

Parameters
  • query_id (Union[int, ndarray]) – single query id or list of query ids.

  • feature_name (str) – the name of the feature.

Return type

ndarray

get_sequence_length(index)

Returns the length of the sequence at the specified index.

Parameters

index (int) – the row number in the dataset.

Return type

int

static keep_common_query_ids(lhs, rhs)

Returns SequentialDatasets that contain query ids from both datasets.

Parameters
  • lhs (SequentialDataset) – SequentialDataset.

  • rhs (SequentialDataset) – SequentialDataset.

Return type

Tuple[SequentialDataset, SequentialDataset]
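
Conceptually, this is an intersection on query ids applied to both sides. The sketch below shows the idea on plain dictionaries that map a query id to its sequence. `keep_common` and the sample data are hypothetical and do not use the library classes.

```python
from typing import Dict, List, Tuple

Sequences = Dict[int, List[int]]


def keep_common(lhs: Sequences, rhs: Sequences) -> Tuple[Sequences, Sequences]:
    """Restrict both mappings to the query ids present in each of them."""
    common = lhs.keys() & rhs.keys()
    # Rebuild each side separately so that its own order and values are preserved.
    return (
        {query_id: seq for query_id, seq in lhs.items() if query_id in common},
        {query_id: seq for query_id, seq in rhs.items() if query_id in common},
    )


train = {0: [10, 11], 1: [20], 2: [30, 31]}
ground_truth = {1: [21], 2: [32], 3: [40]}

train_common, ground_truth_common = keep_common(train, ground_truth)
print(sorted(train_common), sorted(ground_truth_common))
```

A typical reason for such a step is to align a validation set with its ground truth so that every remaining query has data on both sides.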

classmethod load(path, **kwargs)

Loads a PandasSequentialDataset object from a .replay directory.

Return type

PandasSequentialDataset

property schema: TensorSchema
Returns

List of tensor features.

TorchSequentialBatch

class replay.data.nn.TorchSequentialBatch(query_id: LongTensor, padding_mask: BoolTensor, features: Mapping[str, Tensor])

Batch of TorchSequentialDataset

features: Mapping[str, Tensor]

Alias for field number 2

padding_mask: BoolTensor

Alias for field number 1

query_id: LongTensor

Alias for field number 0
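
The "Alias for field number N" descriptions are what Python generates for named tuple fields, so the batch can presumably be accessed by name, by position, or by unpacking. The sketch below assumes that and uses a hypothetical `BatchSketch` with plain lists in place of tensors.

```python
from typing import List, Mapping, NamedTuple


class BatchSketch(NamedTuple):
    """Stand-in with the same field order as the batch described above."""

    query_id: List[int]
    padding_mask: List[List[bool]]
    features: Mapping[str, List[List[int]]]


batch = BatchSketch(
    query_id=[7],
    padding_mask=[[False, True, True]],
    features={"item_id": [[0, 4, 9]]},
)

# Field number 0 is query_id, 1 is padding_mask, 2 is features.
query_id, padding_mask, features = batch
print(batch[0] is batch.query_id, features["item_id"])
```

Access by name is usually the more readable choice, since it does not depend on remembering the field order.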

TorchSequentialDataset

class replay.data.nn.TorchSequentialDataset(sequential, max_sequence_length, sliding_window_step=None, padding_value=0)

Torch dataset for sequential recommender models

__init__(sequential, max_sequence_length, sliding_window_step=None, padding_value=0)
Parameters
  • sequential (SequentialDataset) – sequential dataset

  • max_sequence_length (int) – the maximum length of sequence

  • sliding_window_step (Optional[int]) – offset from the start of each sequence during iteration. None means the offset equals the difference between the actual sequence length and max_sequence_length. Default: None

  • padding_value (int) – value to pad sequences to desired length
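
As a rough picture of how max_sequence_length and padding_value interact, the sketch below brings every sequence to a fixed length. The padding side and which end is truncated are not stated above. Left-padding and keeping the most recent items are assumptions of this example, and `pad_or_truncate` is a hypothetical helper.

```python
from typing import List


def pad_or_truncate(
    sequence: List[int],
    max_sequence_length: int,
    padding_value: int = 0,
) -> List[int]:
    """Return a list of exactly max_sequence_length items (must be positive)."""
    # Assumption: when the sequence is too long, keep its most recent items.
    tail = sequence[-max_sequence_length:]
    # Assumption: pad on the left so that real items sit at the end.
    padding = [padding_value] * (max_sequence_length - len(tail))
    return padding + tail


short = pad_or_truncate([1, 2, 3], max_sequence_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_sequence_length=5)
print(short, long)
```

Fixed-length sequences are what allow several of them to be stacked into one batch tensor, with a padding mask marking which positions hold real items.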

TorchSequentialValidationBatch

class replay.data.nn.TorchSequentialValidationBatch(query_id: LongTensor, padding_mask: BoolTensor, features: Mapping[str, Tensor], ground_truth: LongTensor, train: LongTensor)

Batch of TorchSequentialValidationDataset

features: Mapping[str, Tensor]

Alias for field number 2

ground_truth: LongTensor

Alias for field number 3

padding_mask: BoolTensor

Alias for field number 1

query_id: LongTensor

Alias for field number 0

train: LongTensor

Alias for field number 4

TorchSequentialValidationDataset

class replay.data.nn.TorchSequentialValidationDataset(sequential, ground_truth, train, max_sequence_length, padding_value=0, sliding_window_step=None, label_feature_name=None)

Torch dataset for sequential recommender models that additionally stores ground truth

__init__(sequential, ground_truth, train, max_sequence_length, padding_value=0, sliding_window_step=None, label_feature_name=None)
Parameters
  • sequential (SequentialDataset) – validation sequential dataset

  • ground_truth (SequentialDataset) – validation ground_truth sequential dataset

  • train (SequentialDataset) – train sequential dataset

  • max_sequence_length (int) – the maximum length of sequence

  • padding_value (int) – value to pad sequences to desired length

  • sliding_window_step (Optional[int]) – offset from the start of each sequence during iteration. None means the offset equals the difference between the actual sequence length and max_sequence_length. Default: None

  • label_feature_name (Optional[str]) – the name of the column containing the sequence of items.