Data
Dataset
- class replay.data.Dataset(feature_schema, interactions, query_features=None, item_features=None, check_consistency=True, categorical_encoded=False)
Universal dataset for feeding data to models.
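A minimal construction sketch (the column names and sample values are illustrative; FeatureSchema, FeatureInfo, FeatureType and FeatureHint are documented later in this section):

```python
# Build a Dataset from a pandas interactions log.
import pandas as pd

from replay.data import Dataset, FeatureHint, FeatureInfo, FeatureSchema, FeatureType

interactions = pd.DataFrame(
    {
        "query_id": [0, 0, 1],
        "item_id": [10, 11, 10],
        "rating": [1.0, 0.5, 1.0],
    }
)

feature_schema = FeatureSchema(
    [
        FeatureInfo("query_id", FeatureType.CATEGORICAL, FeatureHint.QUERY_ID),
        FeatureInfo("item_id", FeatureType.CATEGORICAL, FeatureHint.ITEM_ID),
        FeatureInfo("rating", FeatureType.NUMERICAL, FeatureHint.RATING),
    ]
)

dataset = Dataset(feature_schema=feature_schema, interactions=interactions)
print(dataset.query_count, dataset.item_count)  # unique queries and items
```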
- cache()
Persists the SparkDataFrame with the default storage level (MEMORY_AND_DISK) for interactions, item_features and user_features.
This function is only available when PySpark is installed.
- Return type
None
- property feature_schema: FeatureSchema
- Returns
List of features.
- property interactions: Union[DataFrame, DataFrame, DataFrame]
- Returns
interactions dataset.
- property is_categorical_encoded: bool
- Returns
Whether categorical features are encoded.
- property item_count: int
- Returns
The number of items.
- property item_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
- Returns
item features dataset.
- property item_ids: Union[DataFrame, DataFrame, DataFrame]
- Returns
dataset with unique item ids.
- classmethod load(path, dataframe_type=None)
Load the Dataset from the provided path.
- Parameters
path (str) – The file path.
dataframe_type – Dataframe type to use to store internal data. Can be spark|pandas|polars|None. If not provided, automatically set to the one used when the Dataset was saved.
- Return type
Dataset
- Returns
Loaded Dataset.
- persist(storage_level=StorageLevel(True, True, False, True, 1))
Sets the storage level to persist SparkDataFrame for interactions, item_features and user_features.
This function is only available when PySpark is installed.
- Parameters
storage_level (StorageLevel) – Storage level to set for persistence. Default: MEMORY_AND_DISK_DESER.
- Return type
None
- property query_count: int
- Returns
the number of queries.
- property query_features: Optional[Union[DataFrame, DataFrame, DataFrame]]
- Returns
query features dataset.
- property query_ids: Union[DataFrame, DataFrame, DataFrame]
- Returns
dataset with unique query ids.
- save(path)
Save the Dataset to the provided path.
- Parameters
path (str) – Path to save the Dataset to.
- Return type
None
- subset(features_to_keep)
Returns a subset of features. Keeps query and item IDs even if the corresponding sources are not explicitly passed to this function.
- Parameters
features_to_keep (Iterable[str]) – sequence of features to keep.
- Return type
Dataset
- Returns
new Dataset with given features.
- to_pandas()
Convert internally stored dataframes to pandas.DataFrame.
- Return type
None
- to_polars()
Convert internally stored dataframes to polars.DataFrame.
- to_spark()
Convert internally stored dataframes to pyspark.sql.DataFrame.
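The conversions above operate on the internally stored dataframes. A short usage sketch (continuing the dataset object from the construction example above; to_spark assumes PySpark is installed):

```python
# Switch the internal storage backend of the same Dataset object.
dataset.to_polars()   # internal frames are now polars.DataFrame
dataset.to_pandas()   # and back to pandas.DataFrame
# dataset.to_spark()  # only if PySpark is installed
```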
- unpersist(blocking=False)
Marks the SparkDataFrame as non-persistent and removes all blocks for it from memory and disk for interactions, item_features and user_features.
This function is only available when PySpark is installed.
- Parameters
blocking (bool) – whether to block until all blocks are deleted. Default: False.
- Return type
None
DatasetLabelEncoder
- class replay.data.dataset_utils.DatasetLabelEncoder(handle_unknown_rule='error', default_value_rule=None)
Categorical features encoder for the Dataset class
- fit(dataset)
Fits the encoder on the categorical features of the input Dataset.
- Parameters
dataset (Dataset) – the Dataset object.
- Return type
DatasetLabelEncoder
- Returns
fitted DatasetLabelEncoder.
- Raises
- AssertionError – if any of the dataset's categorical features contains an invalid FeatureSource type.
- fit_transform(dataset)
Fits an encoder and transforms the input Dataset categorical features.
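A usage sketch (assuming the dataset object from the Dataset example above; fit_transform is assumed here to return the encoded Dataset):

```python
from replay.data.dataset_utils import DatasetLabelEncoder

encoder = DatasetLabelEncoder()
encoded_dataset = encoder.fit_transform(dataset)  # categorical columns become integer ids

# Retrieve the LabelEncoder fitted for a particular column.
item_encoder = encoder.get_encoder("item_id")
```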
- get_encoder(columns)
Get the encoder fitted for the specified columns of the Dataset.
- Parameters
columns (Union[str, Iterable[str]]) – columns to filter by.
- Return type
Optional[LabelEncoder]
- Returns
LabelEncoder.
- property interactions_encoder: Optional[LabelEncoder]
- Returns
interactions LabelEncoder.
- property item_features_encoder: Optional[LabelEncoder]
- Returns
item features LabelEncoder.
- property item_id_encoder: LabelEncoder
- Returns
item id LabelEncoder.
- property query_and_item_id_encoder: LabelEncoder
- Returns
query id and item id LabelEncoder.
- property query_features_encoder: Optional[LabelEncoder]
- Returns
query features LabelEncoder.
- property query_id_encoder: LabelEncoder
- Returns
query id LabelEncoder.
FeatureType
- final class replay.data.FeatureType(value)
Type of Feature.
- CATEGORICAL = categorical
Type of Feature.
- NUMERICAL = numerical
Type of Feature.
FeatureSource
- final class replay.data.FeatureSource(value)
Name of DataFrame.
- ITEM_FEATURES = item_features
Name of DataFrame.
- QUERY_FEATURES = query_features
Name of DataFrame.
- INTERACTIONS = interactions
Name of DataFrame.
FeatureHint
- final class replay.data.FeatureHint(value)
Hint to algorithm about column.
- ITEM_ID = item_id
Hint to algorithm about column.
- QUERY_ID = query_id
Hint to algorithm about column.
- RATING = rating
Hint to algorithm about column.
- TIMESTAMP = timestamp
Hint to algorithm about column.
FeatureInfo
- class replay.data.FeatureInfo(column, feature_type, feature_hint=None, feature_source=None, cardinality=None)
Information about a feature.
- property cardinality: Optional[int]
- Returns
cardinality of the feature.
- property column: str
- Returns
the feature name.
- property feature_hint: Optional[FeatureHint]
- Returns
the feature hint.
- property feature_source: Optional[FeatureSource]
- Returns
the name of the source dataframe of the feature.
- property feature_type: FeatureType
- Returns
the type of feature.
- reset_cardinality()
Reset cardinality of the feature to None.
- Return type
None
FeatureSchema
- class replay.data.FeatureSchema(features_list)
Key-value like collection with information about all dataset features.
- property all_features: Sequence[FeatureInfo]
- Returns
sequence of all features.
- property categorical_features: FeatureSchema
- Returns
sequence of categorical features in a schema.
- property columns: Sequence[str]
- Returns
list of all features' column names.
- copy()
Creates a copy of all features.
- Return type
FeatureSchema
- Returns
copy of the initial feature schema.
- drop(column=None, feature_hint=None, feature_source=None, feature_type=None)
Drop features from the schema by column, feature_source, feature_type and feature_hint.
- Parameters
column (Optional[str]) – Column name to filter by. Default: None.
feature_hint (Optional[FeatureHint]) – Feature hint to filter by. Default: None.
feature_source (Optional[FeatureSource]) – Feature source to filter by. Default: None.
feature_type (Optional[FeatureType]) – Feature type to filter by. Default: None.
- Return type
FeatureSchema
- Returns
new filtered feature schema without selected features.
- filter(column=None, feature_hint=None, feature_source=None, feature_type=None)
Filter the schema by column, feature_source, feature_type and feature_hint.
- Parameters
column (Optional[str]) – Column name to filter by. Default: None.
feature_hint (Optional[FeatureHint]) – Feature hint to filter by. Default: None.
feature_source (Optional[FeatureSource]) – Feature source to filter by. Default: None.
feature_type (Optional[FeatureType]) – Feature type to filter by. Default: None.
- Return type
FeatureSchema
- Returns
new filtered feature schema.
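A filtering sketch (continuing the feature_schema object built in the Dataset example; subset is documented below):

```python
from replay.data import FeatureHint, FeatureType

# Narrow the schema by type, hint, or an explicit list of columns.
categorical_only = feature_schema.filter(feature_type=FeatureType.CATEGORICAL)
without_rating = feature_schema.drop(feature_hint=FeatureHint.RATING)
ids_only = feature_schema.subset(["query_id", "item_id"])
print(categorical_only.columns, without_rating.columns, ids_only.columns)
```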
- get(key, default=None)
- Return type
Optional[FeatureInfo]
- property interaction_features: FeatureSchema
- Returns
sequence of interaction features in a schema.
- property interactions_rating_column: Optional[str]
- Returns
interactions-rating column name.
- property interactions_rating_features: FeatureSchema
- Returns
sequence of interactions-rating features in a schema.
- property interactions_timestamp_column: Optional[str]
- Returns
interactions-timestamp column name.
- property interactions_timestamp_features: FeatureSchema
- Returns
sequence of interactions-timestamp features in a schema.
- item()
- Return type
FeatureInfo
- Returns
a single feature extracted from the schema.
- property item_features: FeatureSchema
- Returns
sequence of item features in a schema.
- property item_id_column: str
- Returns
item id column name.
- property item_id_feature: FeatureInfo
- Returns
the item id feature of the schema.
- items()
- Return type
ItemsView[str, FeatureInfo]
- keys()
- Return type
KeysView[str]
- property numerical_features: FeatureSchema
- Returns
sequence of numerical features in a schema.
- property query_features: FeatureSchema
- Returns
sequence of query features in a schema.
- property query_id_column: str
- Returns
query id column name.
- property query_id_feature: FeatureInfo
- Returns
the query id feature of the schema.
- subset(features_to_keep)
Creates a subset of given features.
- Parameters
features_to_keep (Iterable[str]) – a sequence of feature columns in the original schema to keep in the subset.
- Return type
FeatureSchema
- Returns
new feature schema of given features.
- values()
- Return type
ValuesView[FeatureInfo]
GetSchema
- replay.data.get_schema(query_column='query_id', item_column='item_id', timestamp_column='timestamp', rating_column='rating', has_timestamp=True, has_rating=True)
Get a Spark schema with query_id, item_id, rating and timestamp columns.
- Parameters
query_column (str) – column name with query ids
item_column (str) – column name with item ids
timestamp_column (str) – column name with timestamps
rating_column (str) – column name with ratings
has_rating (bool) – flag to add rating to the schema
has_timestamp (bool) – flag to add timestamp to the schema
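A usage sketch (PySpark required; the alternative query column name is illustrative):

```python
from replay.data import get_schema

# Build a Spark schema for an interactions log without ratings.
schema = get_schema(
    query_column="user_id",  # illustrative column name
    item_column="item_id",
    has_timestamp=True,
    has_rating=False,
)
# The result can be passed as the schema to spark.createDataFrame or spark.read.
```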
Neural Networks
This submodule is only available when PyTorch is installed.
TensorFeatureInfo
- class replay.data.nn.TensorFeatureInfo(name, feature_type, is_seq=False, feature_hint=None, feature_sources=None, cardinality=None, embedding_dim=None, tensor_dim=None)
Information about a tensor feature.
- property cardinality: Optional[int]
- Returns
Cardinality of the feature.
- property embedding_dim: Optional[int]
- Returns
Embedding dimensions of the feature.
- property feature_hint: Optional[FeatureHint]
- Returns
The feature hint.
- property feature_source: Optional[TensorFeatureSource]
- Returns
Source dataframe info of the feature.
- property feature_sources: Optional[List[TensorFeatureSource]]
- Returns
List of sources the feature came from.
- property feature_type: FeatureType
- Returns
The type of feature.
- property is_cat: bool
- Returns
Whether the feature is categorical.
- property is_num: bool
- Returns
Whether the feature is numerical.
- property is_seq: bool
- Returns
Whether the feature is sequential.
- property name: str
- Returns
The feature name.
- property tensor_dim: Optional[int]
- Returns
Dimensions of the numerical feature.
TensorFeatureSource
- class replay.data.nn.TensorFeatureSource(source, column, index=None)
Describes the source of a feature.
- property column: str
- Returns
column name
- property index: Optional[int]
- Returns
provided index
- property source: FeatureSource
- Returns
feature source
TensorSchema
- class replay.data.nn.TensorSchema(features_list)
Key-value like collection that stores tensor features
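A construction sketch (the feature name, cardinality and embedding size are illustrative):

```python
from replay.data import FeatureHint, FeatureSource, FeatureType
from replay.data.nn import TensorFeatureInfo, TensorFeatureSource, TensorSchema

# Describe the item-id sequence consumed by a sequential model.
tensor_schema = TensorSchema(
    [
        TensorFeatureInfo(
            name="item_id_seq",                   # illustrative feature name
            feature_type=FeatureType.CATEGORICAL,
            is_seq=True,
            feature_hint=FeatureHint.ITEM_ID,
            feature_sources=[TensorFeatureSource(FeatureSource.INTERACTIONS, "item_id")],
            cardinality=1000,                     # illustrative number of items
            embedding_dim=64,
        )
    ]
)
print(tensor_schema.item_id_feature_name)  # -> "item_id_seq"
```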
- property all_features: Sequence[TensorFeatureInfo]
- Returns
Sequence of all features.
- property categorical_features: TensorSchema
- Returns
Sequence of categorical features in a schema.
- filter(name=None, feature_hint=None, is_seq=None, feature_type=None)
Filter the schema by name, feature_type, is_seq and feature_hint.
- Parameters
name (Optional[str]) – Feature name to filter by. Default: None.
feature_hint (Optional[FeatureHint]) – Feature hint to filter by. Default: None.
is_seq (Optional[bool]) – Sequential flag to filter by. Default: None.
feature_type (Optional[FeatureType]) – Feature type to filter by. Default: None.
- Return type
TensorSchema
- Returns
New filtered feature schema.
- get(key, default=None)
- Return type
Optional[TensorFeatureInfo]
- item()
- Return type
TensorFeatureInfo
- Returns
A single feature extracted from the schema.
- property item_id_feature_name: Optional[str]
- Returns
Item id feature name.
- property item_id_features: TensorSchema
- Returns
Sequence of item id features in a schema.
- items()
- Return type
ItemsView[str, TensorFeatureInfo]
- keys()
- Return type
KeysView[str]
- property names: Sequence[str]
- Returns
List of all features' names.
- property numerical_features: TensorSchema
- Returns
Sequence of numerical features in a schema.
- property query_id_feature_name: Optional[str]
- Returns
Query id feature name.
- property query_id_features: TensorSchema
- Returns
Sequence of query id features in a schema.
- property rating_feature_name: Optional[str]
- Returns
Rating feature name.
- property rating_features: TensorSchema
- Returns
Sequence of rating features in a schema.
- property sequential_features: TensorSchema
- Returns
Sequence of sequential features in a schema.
- subset(features_to_keep)
Creates a subset of given features.
- Parameters
features_to_keep (Iterable[str]) – A sequence of feature names in the original schema to keep in the subset.
- Return type
TensorSchema
- Returns
New tensor schema of given features.
- property timestamp_feature_name: Optional[str]
- Returns
Timestamp feature name.
- property timestamp_features: TensorSchema
- Returns
Sequence of timestamp features in a schema.
- values()
- Return type
ValuesView[TensorFeatureInfo]
SequenceTokenizer
- class replay.data.nn.SequenceTokenizer(tensor_schema, handle_unknown_rule='error', default_value_rule=None, allow_collect_to_master=False)
Data tokenizer for transformers; encodes all categorical features (the ones marked as FeatureType.CATEGORICAL in the FeatureSchema) and stores all data as item sequences (sorted by time if a feature of type FeatureHint.TIMESTAMP is provided, unsorted otherwise).
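A usage sketch (assuming the dataset and tensor_schema objects built in the examples above):

```python
from replay.data.nn import SequenceTokenizer

tokenizer = SequenceTokenizer(tensor_schema)
sequential_dataset = tokenizer.fit_transform(dataset)  # -> SequentialDataset
print(sequential_dataset.get_max_sequence_length())    # see PandasSequentialDataset below
```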
- fit(dataset)
- Parameters
dataset (Dataset) – input dataset to fit
- Return type
SequenceTokenizer
- Returns
fitted SequenceTokenizer
- fit_transform(dataset)
- Parameters
dataset (Dataset) – input dataset to transform
- Return type
SequentialDataset
- Returns
SequentialDataset
- property interactions_encoder: Optional[LabelEncoder]
- Returns
encoder for interactions
- property item_features_encoder: Optional[LabelEncoder]
- Returns
encoder for item features
- property item_id_encoder: LabelEncoder
- Returns
encoder for item id
- classmethod load(path, use_pickle=False, **kwargs)
Load tokenizer object from the given path.
- Parameters
path (str) – Path to load the tokenizer from.
use_pickle (bool) – If False, the tokenizer will be loaded from the .replay directory. If True, the tokenizer will be loaded with pickle. Default: False.
- Return type
SequenceTokenizer
- Returns
Loaded tokenizer object.
- property query_and_item_id_encoder: LabelEncoder
- Returns
encoder for query and item id
- property query_features_encoder: Optional[LabelEncoder]
- Returns
encoder for query features
- property query_id_encoder: LabelEncoder
- Returns
encoder for query id
- save(path, use_pickle=False)
Save the tokenizer to the given path.
- Parameters
path (str) – Path to save the tokenizer to.
use_pickle (bool) – If False, the tokenizer will be saved in the .replay directory. If True, the tokenizer will be saved with pickle. Default: False.
- Return type
None
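A save/load round-trip sketch (continuing the fitted tokenizer from the example above; the path is illustrative):

```python
tokenizer.save("./saved_tokenizer")                     # .replay format (use_pickle=False)
restored = SequenceTokenizer.load("./saved_tokenizer")
assert list(restored.tensor_schema.names) == list(tokenizer.tensor_schema.names)
```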
- property tensor_schema: TensorSchema
- Returns
tensor schema
PandasSequentialDataset
- class replay.data.nn.PandasSequentialDataset(tensor_schema, query_id_column, item_id_column, sequences)
Sequential dataset that stores sequences in PandasDataFrame format.
- filter_by_query_id(query_ids_to_keep)
Returns a SequentialDataset that contains only query ids from the specified list.
- Parameters
query_ids_to_keep (ndarray) – list of query ids.
- Return type
SequentialDataset
- get_all_query_ids()
Getting a list of all query ids.
- Return type
ndarray
- get_max_sequence_length()
Returns the maximum length among all sequences from the SequentialDataset.
- Return type
int
- get_query_id(index)
Getting a query id for a given index.
- Parameters
index (int) – the row number in the dataset.
- Return type
int
- get_sequence(index, feature_name)
Getting a sequence based on a given index and feature name.
- Parameters
index (Union[int, ndarray]) – single index or list of indices.
feature_name (str) – the name of the feature.
- Return type
ndarray
- get_sequence_by_query_id(query_id, feature_name)
Getting a sequence based on a given query id and feature name.
- Parameters
query_id (Union[int, ndarray]) – single query id or list of query ids.
feature_name (str) – the name of the feature.
- Return type
ndarray
- get_sequence_length(index)
Returns the length of the sequence at the specified index.
- Parameters
index (int) – the row number in the dataset.
- Return type
int
- static keep_common_query_ids(lhs, rhs)
Returns SequentialDatasets that contain query ids from both datasets.
- Parameters
lhs (SequentialDataset) – SequentialDataset.
rhs (SequentialDataset) – SequentialDataset.
- Return type
Tuple[SequentialDataset, SequentialDataset]
- classmethod load(path, **kwargs)
Method for loading a PandasSequentialDataset object from the .replay directory.
- Return type
PandasSequentialDataset
- property schema: TensorSchema
- Returns
List of tensor features.
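An access sketch (assuming sequential_dataset is the result of SequenceTokenizer.fit_transform above, stored as a PandasSequentialDataset, and "item_id_seq" is the illustrative feature name from the TensorSchema example):

```python
query_ids = sequential_dataset.get_all_query_ids()
first_row_sequence = sequential_dataset.get_sequence(0, "item_id_seq")
by_query = sequential_dataset.get_sequence_by_query_id(query_ids[0], "item_id_seq")
max_len = sequential_dataset.get_max_sequence_length()
```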
TorchSequentialBatch
- class replay.data.nn.TorchSequentialBatch(query_id: LongTensor, padding_mask: BoolTensor, features: Mapping[str, Tensor])
Batch of TorchSequentialDataset
- features: Mapping[str, Tensor]
Alias for field number 2
- padding_mask: BoolTensor
Alias for field number 1
- query_id: LongTensor
Alias for field number 0
TorchSequentialDataset
- class replay.data.nn.TorchSequentialDataset(sequential, max_sequence_length, sliding_window_step=None, padding_value=0)
Torch dataset for sequential recommender models
- __init__(sequential, max_sequence_length, sliding_window_step=None, padding_value=0)
- Parameters
sequential (SequentialDataset) – sequential dataset
max_sequence_length (int) – the maximum length of a sequence
sliding_window_step (Optional[int]) – offset from each sequence start during iteration; None means the offset will be equal to the difference between the actual sequence length and max_sequence_length. Default: None
padding_value (int) – value used to pad sequences to the desired length
TorchSequentialValidationBatch
- class replay.data.nn.TorchSequentialValidationBatch(query_id: LongTensor, padding_mask: BoolTensor, features: Mapping[str, Tensor], ground_truth: LongTensor, train: LongTensor)
Batch of TorchSequentialValidationDataset
- features: Mapping[str, Tensor]
Alias for field number 2
- ground_truth: LongTensor
Alias for field number 3
- padding_mask: BoolTensor
Alias for field number 1
- query_id: LongTensor
Alias for field number 0
- train: LongTensor
Alias for field number 4
TorchSequentialValidationDataset
- class replay.data.nn.TorchSequentialValidationDataset(sequential, ground_truth, train, max_sequence_length, padding_value=0, sliding_window_step=None, label_feature_name=None)
Torch dataset for sequential recommender models that additionally stores ground truth
- __init__(sequential, ground_truth, train, max_sequence_length, padding_value=0, sliding_window_step=None, label_feature_name=None)
- Parameters
sequential (SequentialDataset) – validation sequential dataset
ground_truth (SequentialDataset) – validation ground-truth sequential dataset
train (SequentialDataset) – train sequential dataset
max_sequence_length (int) – the maximum length of a sequence
padding_value (int) – value used to pad sequences to the desired length
sliding_window_step (Optional[int]) – offset from each sequence start during iteration; None means the offset will be equal to the difference between the actual sequence length and max_sequence_length. Default: None
label_feature_name (Optional[str]) – the name of the column containing the sequence of items.