Data
Dataset
- class replay.data.Dataset(feature_schema, interactions, query_features=None, item_features=None, check_consistency=True, categorical_encoded=False)
Universal dataset for feeding data to models.
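For example, a minimal construction sketch (pandas backend; the column names and feature typing are illustrative, not required):
import pandas as pd
from replay.data import Dataset, FeatureSchema, FeatureInfo, FeatureType, FeatureHint

# Toy interactions log; pandas, Polars or Spark dataframes can back a Dataset.
interactions = pd.DataFrame({
    "user_id": [0, 0, 1],
    "item_id": [3, 7, 3],
    "rating": [1.0, 1.0, 1.0],
    "timestamp": [1, 2, 3],
})

feature_schema = FeatureSchema([
    FeatureInfo(column="user_id", feature_type=FeatureType.CATEGORICAL, feature_hint=FeatureHint.QUERY_ID),
    FeatureInfo(column="item_id", feature_type=FeatureType.CATEGORICAL, feature_hint=FeatureHint.ITEM_ID),
    FeatureInfo(column="rating", feature_type=FeatureType.NUMERICAL, feature_hint=FeatureHint.RATING),
    FeatureInfo(column="timestamp", feature_type=FeatureType.NUMERICAL, feature_hint=FeatureHint.TIMESTAMP),
])

dataset = Dataset(feature_schema=feature_schema, interactions=interactions)
print(dataset.item_count, dataset.query_count)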
- cache()
Persists the SparkDataFrame with the default storage level (MEMORY_AND_DISK) for interactions, item_features and user_features.
The function is only available when PySpark is installed.
- property feature_schema: FeatureSchema
- Returns
List of features.
- property interactions: Union[SparkDataFrame, PandasDataFrame, PolarsDataFrame]
- Returns
interactions dataset.
- property is_categorical_encoded: bool
- Returns
whether categorical features are encoded.
- property item_count: int
- Returns
The number of items.
- property item_features: Optional[Union[SparkDataFrame, PandasDataFrame, PolarsDataFrame]]
- Returns
item features dataset.
- property item_ids: Union[SparkDataFrame, PandasDataFrame, PolarsDataFrame]
- Returns
dataset with unique item ids.
- classmethod load(path, dataframe_type=None)
Load the Dataset from the provided path.
- Parameters
path (str) – The file path.
dataframe_type (Optional[str]) – Dataframe type used to store internal data. Can be spark, pandas, polars or None. If not provided, it is set automatically to the type used when the Dataset was saved.
- Returns
Loaded Dataset.
- Return type
Dataset
- persist(storage_level=StorageLevel(True, True, False, True, 1))
Sets the storage level to persist SparkDataFrame for interactions, item_features and user_features.
The function is only available when PySpark is installed.
- Parameters
storage_level (StorageLevel) – storage level to set for persistence. default:
`MEMORY_AND_DISK_DESER`.
- property query_count: int
- Returns
the number of queries.
- property query_features: Optional[Union[SparkDataFrame, PandasDataFrame, PolarsDataFrame]]
- Returns
query features dataset.
- property query_ids: Union[SparkDataFrame, PandasDataFrame, PolarsDataFrame]
- Returns
dataset with unique query ids.
- save(path)
Save the Dataset to the provided path.
- Parameters
path (str) – Path to save the Dataset to.
- subset(features_to_keep)
Returns a subset of features. Keeps query and item IDs even if the corresponding sources are not explicitly passed to this function.
- Parameters
features_to_keep (Iterable[str]) – sequence of features to keep.
- Returns
new Dataset with given features.
- Return type
Dataset
- to_pandas()
Convert internally stored dataframes to pandas.DataFrame.
- to_polars()
Convert internally stored dataframes to polars.DataFrame.
- to_spark()
Convert internally stored dataframes to pyspark.sql.DataFrame.
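A usage sketch of the conversion helpers (assuming `dataset` from the example above; to_spark() additionally requires an active PySpark session):
dataset.to_polars()   # internal dataframes are now polars.DataFrame
dataset.to_pandas()   # and back to pandas.DataFrame
print(type(dataset.interactions))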
- unpersist(blocking=False)
Marks the SparkDataFrame as non-persistent and removes all blocks for it from memory and disk for interactions, item_features and user_features.
The function is only available when PySpark is installed.
- Parameters
blocking (bool) – whether to block until all blocks are deleted. default:
`False`.
DatasetLabelEncoder
- class replay.data.dataset_utils.DatasetLabelEncoder(handle_unknown_rule='error', default_value_rule=None)
Categorical features encoder for the Dataset class
- fit(dataset)
Fits an encoder by the input Dataset for categorical features.
- Parameters
dataset (Dataset) – the Dataset object.
- Returns
fitted DatasetLabelEncoder.
- Raises
AssertionError – if any of the dataset's categorical features contains an invalid FeatureSource type.
- Return type
DatasetLabelEncoder
- fit_transform(dataset)
Fits an encoder and transforms the input Dataset categorical features.
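A usage sketch (assuming `dataset` is a Dataset with categorical query and item id columns, as in the Dataset example above):
from replay.data.dataset_utils import DatasetLabelEncoder

encoder = DatasetLabelEncoder()
encoded_dataset = encoder.fit_transform(dataset)   # categorical columns become integer codes
item_id_encoder = encoder.get_encoder("item_id")   # LabelEncoder fitted for the item id column, if any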
- get_encoder(columns)
Get the encoder fitted on the Dataset for the given columns.
- Parameters
columns (Union[str, Iterable[str]]) – columns to filter by.
- Returns
LabelEncoder.
- Return type
Optional[LabelEncoder]
- property interactions_encoder: Optional[LabelEncoder]
- Returns
interactions LabelEncoder.
- property item_features_encoder: Optional[LabelEncoder]
- Returns
item features LabelEncoder.
- property item_id_encoder: LabelEncoder
- Returns
item id LabelEncoder.
- property query_and_item_id_encoder: LabelEncoder
- Returns
query id and item id LabelEncoder.
- property query_features_encoder: Optional[LabelEncoder]
- Returns
query features LabelEncoder.
- property query_id_encoder: LabelEncoder
- Returns
query id LabelEncoder.
FeatureType
- final class replay.data.FeatureType(value)
Type of Feature.
- CATEGORICAL= categorical
Type of Feature.
- CATEGORICAL_LIST= categorical_list
Type of Feature.
- NUMERICAL= numerical
Type of Feature.
- NUMERICAL_LIST= numerical_list
Type of Feature.
FeatureSource
- final class replay.data.FeatureSource(value)
Name of DataFrame.
- ITEM_FEATURES= item_features
Name of DataFrame.
- QUERY_FEATURES= query_features
Name of DataFrame.
- INTERACTIONS= interactions
Name of DataFrame.
FeatureHint
- final class replay.data.FeatureHint(value)
Hint to algorithm about column.
- ITEM_ID= item_id
Hint to algorithm about column.
- QUERY_ID= query_id
Hint to algorithm about column.
- RATING= rating
Hint to algorithm about column.
- TIMESTAMP= timestamp
Hint to algorithm about column.
FeatureInfo
- class replay.data.FeatureInfo(column, feature_type, feature_hint=None, feature_source=None, cardinality=None)
Information about a feature.
- property cardinality: Optional[int]
- Returns
cardinality of the feature.
- property column: str
- Returns
the feature name.
- property feature_hint: Optional[FeatureHint]
- Returns
the feature hint.
- property feature_source: Optional[FeatureSource]
- Returns
the name of source dataframe of feature.
- property feature_type: FeatureType
- Returns
the type of feature.
- reset_cardinality()
Reset cardinality of the feature to None.
FeatureSchema
- class replay.data.FeatureSchema(features_list)
Key-value like collection with information about all dataset features.
- property all_features: Sequence[FeatureInfo]
- Returns
sequence of all features.
- property categorical_features: FeatureSchema
- Returns
sequence of categorical features in a schema.
- property columns: Sequence[str]
- Returns
list of all features’ column names.
- copy()
Creates a copy of all features. For the returned copy, all cardinality values will be undefined.
- Returns
copy of the initial feature schema.
- Return type
FeatureSchema
- drop(column=None, feature_hint=None, feature_source=None, feature_type=None)
Drop features from the schema by column, feature_source, feature_type and feature_hint.
- Parameters
column (Optional[str]) – Column name to filter by. default: None.
feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.
feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.
feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.
- Returns
new filtered feature schema without selected features.
- Return type
FeatureSchema
- filter(column=None, feature_hint=None, feature_source=None, feature_type=None)
Filter the schema by column, feature_source, feature_type and feature_hint.
- Parameters
column (Optional[str]) – Column name to filter by. default: None.
feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.
feature_source (Optional[FeatureSource]) – Feature source to filter by. default: None.
feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.
- Returns
new filtered feature schema.
- Return type
FeatureSchema
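For example (continuing the Dataset example above; both calls return new FeatureSchema objects):
categorical_schema = feature_schema.filter(feature_type=FeatureType.CATEGORICAL)
schema_without_rating = feature_schema.drop(feature_hint=FeatureHint.RATING)
print(categorical_schema.columns)
print(schema_without_rating.columns)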
- get(k[, d]) → D[k] if k in D, else d. d defaults to None.
- Return type
Optional[FeatureInfo]
- property interaction_features: FeatureSchema
- Returns
sequence of interaction features in a schema.
- property interactions_rating_column: Optional[str]
- Returns
interactions-rating column name.
- property interactions_rating_features: FeatureSchema
- Returns
sequence of interactions-rating features in a schema.
- property interactions_timestamp_column: Optional[str]
- Returns
interactions-timestamp column name.
- property interactions_timestamp_features: FeatureSchema
- Returns
sequence of interactions-timestamp features in a schema.
- item()
- Returns
a single feature extracted from the schema.
- Return type
FeatureInfo
- property item_features: FeatureSchema
- Returns
sequence of item features in a schema.
- property item_id_column: str
- Returns
item id column name.
- property item_id_feature: FeatureInfo
- Returns
the item id feature in a schema.
- items() → a set-like object providing a view on D's items
- Return type
ItemsView[str, FeatureInfo]
- keys() → a set-like object providing a view on D's keys
- Return type
KeysView[str]
- property numerical_features: FeatureSchema
- Returns
sequence of numerical features in a schema.
- property query_features: FeatureSchema
- Returns
sequence of query features in a schema.
- property query_id_column: str
- Returns
query id column name.
- property query_id_feature: FeatureInfo
- Returns
the query id feature in a schema.
- subset(features_to_keep)
Creates a subset of given features.
- Parameters
features_to_keep (Iterable[str]) – a sequence of feature columns in original schema to keep in subset.
- Returns
new feature schema of given features.
- Return type
FeatureSchema
- values() → an object providing a view on D's values
- Return type
ValuesView[FeatureInfo]
GetSchema
- replay.data.get_schema(query_column='query_id', item_column='item_id', timestamp_column='timestamp', rating_column='rating', has_timestamp=True, has_rating=True)
Get a Spark schema with query_id, item_id, rating and timestamp columns.
- Parameters
query_column (str) – column name with query ids
item_column (str) – column name with item ids
timestamp_column (str) – column name with timestamps
rating_column (str) – column name with ratings
has_rating (bool) – flag to add rating to schema
has_timestamp (bool) – flag to add timestamp to schema
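A usage sketch (assuming an existing SparkSession named spark and a hypothetical CSV file with matching column names):
from replay.data import get_schema

schema = get_schema(query_column="user_id", item_column="item_id")
log = spark.read.csv("interactions.csv", header=True, schema=schema)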
Neural Networks
This submodule is only available when PyTorch is installed.
TensorFeatureInfo
- class replay.data.nn.TensorFeatureInfo(name, feature_type, is_seq=False, feature_hint=None, feature_sources=None, cardinality=None, padding_value=0, embedding_dim=None, tensor_dim=None)
Information about a tensor feature.
- property cardinality: Optional[int]
- Returns
Cardinality of the feature.
- property embedding_dim: Optional[int]
- Returns
Embedding dimensions of the feature.
- property feature_hint: Optional[FeatureHint]
- Returns
The feature hint.
- property feature_source: Optional[TensorFeatureSource]
- Returns
Dataframe info of feature.
- property feature_sources: Optional[list[replay.data.nn.schema.TensorFeatureSource]]
- Returns
List of sources the feature came from.
- property feature_type: FeatureType
- Returns
The type of feature.
- property is_cat: bool
- Returns
Flag that feature is categorical.
- property is_list: bool
- Returns
Flag that feature is numerical list or categorical list.
- property is_num: bool
- Returns
Flag that feature is numerical.
- property is_seq: bool
- Returns
Flag that feature is sequential.
Sequential means that the value of the feature will be determined for each element of the user’s sequence.
- property name: str
- Returns
The feature name.
- property padding_value: int
- Returns
Value used to pad sequences to the desired length.
- property tensor_dim: Optional[int]
- Returns
Dimensions of the numerical feature.
TensorFeatureSource
- class replay.data.nn.TensorFeatureSource(source, column, index=None)
Describes source of a feature
- property column: str
- Returns
column name
- property index: Optional[int]
- Returns
provided index
- property source: FeatureSource
- Returns
feature source
TensorSchema
- class replay.data.nn.TensorSchema(features_list)
Key-value like collection that stores tensor features
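A minimal construction sketch (the cardinality and embedding size here are illustrative):
from replay.data import FeatureHint, FeatureSource, FeatureType
from replay.data.nn import TensorFeatureInfo, TensorFeatureSource, TensorSchema

tensor_schema = TensorSchema([
    TensorFeatureInfo(
        name="item_id",
        feature_type=FeatureType.CATEGORICAL,
        is_seq=True,
        feature_hint=FeatureHint.ITEM_ID,
        feature_sources=[TensorFeatureSource(FeatureSource.INTERACTIONS, "item_id")],
        cardinality=1000,
        embedding_dim=64,
    ),
])
print(tensor_schema.item_id_feature_name)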
- property all_features: Sequence[TensorFeatureInfo]
- Returns
Sequence of all features.
- property categorical_features: TensorSchema
- Returns
Sequence of categorical features in a schema.
- filter(name=None, feature_hint=None, is_seq=None, feature_type=None)
Filter the schema by name, feature_type, is_seq and feature_hint.
- Parameters
name (Optional[str]) – Feature name to filter by. default: None.
feature_hint (Optional[FeatureHint]) – Feature hint to filter by. default: None.
is_seq (Optional[bool]) – Sequential flag to filter by. default: None.
feature_type (Optional[FeatureType]) – Feature type to filter by. default: None.
- Returns
New filtered feature schema.
- Return type
TensorSchema
- get(k[, d]) → D[k] if k in D, else d. d defaults to None.
- Return type
Optional[TensorFeatureInfo]
- item()
- Returns
A single feature extracted from the schema.
- Return type
TensorFeatureInfo
- property item_id_feature_name: Optional[str]
- Returns
Item id feature name.
- property item_id_features: TensorSchema
- Returns
Sequence of item id features in a schema.
- items() → a set-like object providing a view on D's items
- Return type
ItemsView[str, TensorFeatureInfo]
- keys() → a set-like object providing a view on D's keys
- Return type
KeysView[str]
- property names: Sequence[str]
- Returns
List of all features’ names.
- property numerical_features: TensorSchema
- Returns
Sequence of numerical features in a schema.
- property query_id_feature_name: Optional[str]
- Returns
Query id feature name.
- property query_id_features: TensorSchema
- Returns
Sequence of query id features in a schema.
- property rating_feature_name: Optional[str]
- Returns
Rating feature name.
- property rating_features: TensorSchema
- Returns
Sequence of rating features in a schema.
- property sequential_features: TensorSchema
- Returns
Sequence of sequential features in a schema.
- subset(features_to_keep)
Creates a subset of given features.
- Parameters
features_to_keep (Iterable[str]) – A sequence of feature names in original schema to keep in subset.
- Returns
New tensor schema of given features.
- Return type
TensorSchema
- property timestamp_feature_name: Optional[str]
- Returns
Timestamp feature name.
- property timestamp_features: TensorSchema
- Returns
Sequence of timestamp features in a schema.
- values() → an object providing a view on D's values
- Return type
ValuesView[TensorFeatureInfo]
SequenceTokenizer
- class replay.data.nn.SequenceTokenizer(tensor_schema, handle_unknown_rule='error', default_value_rule=None, allow_collect_to_master=False)
Data tokenizer for transformers. Encodes all categorical features (the ones marked as FeatureType.CATEGORICAL in the FeatureSchema) and stores all data as item sequences (sorted by time if a feature of type FeatureHint.TIMESTAMP is provided, unsorted otherwise).
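A usage sketch (assuming `dataset` is a Dataset and `tensor_schema` is a TensorSchema like the one above; the save path is illustrative):
from replay.data.nn import SequenceTokenizer

tokenizer = SequenceTokenizer(tensor_schema)
sequential_dataset = tokenizer.fit_transform(dataset)  # SequentialDataset of per-query item sequences
tokenizer.save("./sequence_tokenizer")                 # can later be restored with SequenceTokenizer.load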
- fit(dataset)
- Parameters
dataset (Dataset) – input dataset to fit
- Returns
fitted SequenceTokenizer
- Return type
SequenceTokenizer
- fit_transform(dataset)
- Parameters
dataset (Dataset) – input dataset to transform
- Returns
SequentialDataset
- Return type
SequentialDataset
- property interactions_encoder: Optional[LabelEncoder]
- Returns
encoder for interactions
- property item_features_encoder: Optional[LabelEncoder]
- Returns
encoder for item features
- property item_id_encoder: LabelEncoder
- Returns
encoder for item id
- classmethod load(path, use_pickle=False, **kwargs)
Load tokenizer object from the given path.
- Parameters
path (str) – Path to load the tokenizer.
use_pickle (bool) – If False, the tokenizer will be loaded from the .replay directory. If True, the tokenizer will be loaded with pickle. Default: False.
- Returns
Loaded tokenizer object.
- Return type
SequenceTokenizer
- property query_and_item_id_encoder: LabelEncoder
- Returns
encoder for query and item id
- property query_features_encoder: Optional[LabelEncoder]
- Returns
encoder for query features
- property query_id_encoder: LabelEncoder
- Returns
encoder for query id
- save(path, use_pickle=False)
Save the tokenizer to the given path.
- Parameters
path (str) – Path to save the tokenizer.
use_pickle (bool) – If False, the tokenizer will be saved in the .replay directory. If True, the tokenizer will be saved with pickle. Default: False.
- property tensor_schema: TensorSchema
- Returns
tensor schema
PandasSequentialDataset
- class replay.data.nn.PandasSequentialDataset(tensor_schema, query_id_column, item_id_column, sequences)
Sequential dataset that stores sequences in PandasDataFrame format.
- filter_by_query_id(query_ids_to_keep)
Returns a SequentialDataset that contains only query ids from the specified list.
- Parameters
query_ids_to_keep (ndarray) – list of query ids.
- Return type
SequentialDataset
- get_all_query_ids()
Get a list of all query ids.
- Return type
ndarray
- get_max_sequence_length()
Returns the maximum length among all sequences from the SequentialDataset.
- Return type
int
- get_query_id(index)
Get the query id for a given index.
- Parameters
index (int) – the row number in the dataset.
- Return type
int
- get_sequence(index, feature_name)
Get a sequence by the given index and feature name.
- Parameters
index (Union[int, ndarray]) – single index or list of indices.
feature_name (str) – the name of the feature.
- Return type
ndarray
- get_sequence_by_query_id(query_id, feature_name)
Get a sequence by the given query id and feature name.
- Parameters
query_id (Union[int, ndarray]) – single query id or list of query ids.
feature_name (str) – the name of the feature.
- Return type
ndarray
- get_sequence_length(index)
Returns the length of the sequence at the specified index.
- Parameters
index (int) – the row number in the dataset.
- Return type
int
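For example (a sketch assuming `sequential_dataset` is a PandasSequentialDataset produced by SequenceTokenizer and that the schema contains an "item_id" feature):
query_ids = sequential_dataset.get_all_query_ids()
first_query = sequential_dataset.get_query_id(0)
items = sequential_dataset.get_sequence_by_query_id(first_query, "item_id")
length = sequential_dataset.get_sequence_length(0)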
- static keep_common_query_ids(lhs, rhs)
Returns a pair of SequentialDatasets that contain only the query ids present in both datasets.
- Parameters
lhs (SequentialDataset) – SequentialDataset.
rhs (SequentialDataset) – SequentialDataset.
- Return type
tuple[SequentialDataset, SequentialDataset]
- classmethod load(path, **kwargs)
Method for loading PandasSequentialDataset object from .replay directory.
- Return type
PandasSequentialDataset
- property schema: TensorSchema
- Returns
List of tensor features.
TorchSequentialBatch
TorchSequentialDataset
- class replay.data.nn.TorchSequentialDataset(sequential, max_sequence_length, sliding_window_step=None, padding_value=None)
Torch dataset for sequential recommender models
- __init__(sequential, max_sequence_length, sliding_window_step=None, padding_value=None)
- Parameters
sequential (SequentialDataset) – sequential dataset
max_sequence_length (int) – the maximum length of sequence
sliding_window_step (Optional[int]) – offset from each sequence start during iteration; None means the offset will be equal to the difference between the actual sequence length and max_sequence_length. Default: None
padding_value (Optional[int]) – value to pad sequences to desired length
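A wrapping sketch (assuming `sequential_dataset` from the tokenizer example; the maximum length and padding value are illustrative, and a model-specific collate function may be needed):
from torch.utils.data import DataLoader
from replay.data.nn import TorchSequentialDataset

torch_dataset = TorchSequentialDataset(
    sequential=sequential_dataset,
    max_sequence_length=50,
    padding_value=0,
)
loader = DataLoader(torch_dataset, batch_size=64)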
TorchSequentialValidationBatch
- class replay.data.nn.TorchSequentialValidationBatch(query_id, padding_mask, features, ground_truth, train)
Batch of TorchSequentialValidationDataset
- features: TensorMap
Alias for field number 2
- ground_truth: LongTensor
Alias for field number 3
- padding_mask: BoolTensor
Alias for field number 1
- query_id: LongTensor
Alias for field number 0
- train: LongTensor
Alias for field number 4
TorchSequentialValidationDataset
- class replay.data.nn.TorchSequentialValidationDataset(sequential, ground_truth, train, max_sequence_length, padding_value=None, sliding_window_step=None, label_feature_name=None)
Torch dataset for sequential recommender models that additionally stores ground truth
- __init__(sequential, ground_truth, train, max_sequence_length, padding_value=None, sliding_window_step=None, label_feature_name=None)
- Parameters
sequential (SequentialDataset) – validation sequential dataset
ground_truth (SequentialDataset) – validation ground_truth sequential dataset
train (SequentialDataset) – train sequential dataset
max_sequence_length (int) – the maximum length of sequence
padding_value (Optional[int]) – value to pad sequences to desired length
sliding_window_step (Optional[int]) – offset from each sequence start during iteration; None means the offset will be equal to the difference between the actual sequence length and max_sequence_length. Default: None
label_feature_name (Optional[str]) – the name of the column containing the sequence of items.
Parquet processing
This module contains the implementation of ParquetDataset - a combination of PyTorch-compatible dataset and sampler designed for working with the Parquet file format.
The main advantages offered by this dataset are:
Batch-wise reading and processing of data, allowing it to work with large datasets in memory-constrained settings.
Full built-in support for Torch’s Distributed Data Parallel mode.
Automatic padding of data according to the provided schema.
ParquetDataset is primarily configured using column schemas - dictionaries containing target columns as keys and their shape/padding specifiers as values.
An example column schema:
schema = {
    "user_id": {},                              # Empty metadata represents a non-array column.
    "seq_1": {"shape": 5},                      # 1-D sequences of length 5 using the default padding value of -1.
    "seq_2": {"shape": [5, 6], "padding": -2},  # 2-D sequences with a custom padding value.
}
ParquetDataset
- class replay.data.nn.parquet.ParquetDataset(source, metadata, partition_size, batch_size, filesystem=<pyarrow._fs.LocalFileSystem object>, make_mask_name=<function default_make_mask_name.<locals>.function>, device=device(type='cpu'), generator=None, replicas_info=<replay.data.nn.parquet.info.replicas.ReplicasInfo object>, collate_fn=<function general_collate>, **kwargs)
Combination dataset and sampler for batch-wise reading and processing of Parquet files.
This implementation allows one to read data using a PyArrow Dataset, convert it into structured columns, split it into partitions, and then into batches needed for model training. Supports distributed training and reproducible random shuffling.
During data loader operation, a partition of size partition_size is read. There may be situations where the size of the read partition is less than partition_size - this depends on the number of rows in the data fragment. A fragment is a single Parquet file in the file system.
The partition will be read by every worker, split according to their replica ID, and processed; the result will be returned as a batch of size batch_size. Please note that the resulting batch size may be less than batch_size.
For maximum efficiency when reading and processing data, as well as improved data shuffling, it is recommended to set partition_size several times larger than batch_size.
Note: ParquetDataset supports only numeric values (boolean/integer/float); therefore, the data paths passed as arguments must contain encoded data.
For optimal performance, set OMP_NUM_THREADS and ARROW_IO_THREADS to match the number of available CPU cores.
- __init__(source, metadata, partition_size, batch_size, filesystem=<pyarrow._fs.LocalFileSystem object>, make_mask_name=<function default_make_mask_name.<locals>.function>, device=device(type='cpu'), generator=None, replicas_info=<replay.data.nn.parquet.info.replicas.ReplicasInfo object>, collate_fn=<function general_collate>, **kwargs)
- Parameters
source (Union[str, list[str]]) – The path or list of paths to files/directories containing data in Parquet format.
metadata (dict[str, dict[str, Union[bool, int, float, str]]]) –
Metadata describing the data structure. The structure of each column is defined by the following values:
shape – the dimension of the column being read. If the column contains only one value, this parameter does not need to be specified. If the column contains a one-dimensional array, the parameter must be a number or an array containing one number. If the column contains a two-dimensional array, the parameter must be an array containing two numbers.
padding – padding value that will fill the arrays if their length is less than that specified in the shape parameter.
partition_size (int) – Partition size when reading data from Parquet files.
batch_size (int) – The size of the batch that will be returned during iteration.
filesystem (Union[str, FileSystem]) – A PyArrow Filesystem object used to access data, or a URI-based path to infer the filesystem from. Default: value of DEFAULT_FILESYSTEM.
make_mask_name (Callable[[str], str]) – Mask name generation function. Default: value of DEFAULT_MAKE_MASK_NAME.
device (device) – The device on which the data will be generated. Default: value of DEFAULT_DEVICE.
generator (Optional[Generator]) – Random number generator for batch shuffling. If None, shuffling will be disabled. Default: None.
replicas_info (ReplicasInfoProtocol) – A connector object capable of fetching the total replica count and replica id at runtime. Default: value of DEFAULT_REPLICAS_INFO, a pre-built connector which assumes standard Torch DDP mode via the torch.utils.data and torch.distributed modules.
collate_fn (Callable[[dict[str, Union[torch.Tensor, dict[str, Union[torch.Tensor, ForwardRef('GeneralBatch')]]]]], dict[str, Union[torch.Tensor, dict[str, Union[torch.Tensor, ForwardRef('GeneralBatch')]]]]]) – Collate function for merging batches. Default: value of DEFAULT_COLLATE_FN.
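A minimal reading sketch (the path and column metadata are illustrative; the Parquet data must already be numerically encoded, and iteration is assumed to yield ready-made batches):
from replay.data.nn.parquet import ParquetDataset

train_data = ParquetDataset(
    source="data/train.parquet",
    metadata={
        "user_id": {},
        "item_id": {"shape": 50, "padding": 0},
    },
    partition_size=4096,  # recommended: several times larger than batch_size
    batch_size=512,
)
for batch in train_data:
    ...  # each batch is a dict of padded tensors plus the generated masks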
ParquetModule (Lightning DataModule)
- class replay.data.nn.ParquetModule(batch_size, metadata, transforms, config=None, *, train_path=None, validate_path=None, test_path=None, predict_path=None)
Standardized DataModule with batch-wise support via ParquetDataset.
Allows for unified access to all data splits across the training/inference pipeline without loading the full dataset into memory. See the Parquet processing section for details.
ParquetModule provides per-batch data loading and preprocessing via transform pipelines. See the Transforms for ParquetModule section for information about the available batch transforms.
Note:
ParquetModule supports only numeric values (boolean/integer/float); therefore, the data paths passed as arguments must contain encoded data.
For optimal performance, set OMP_NUM_THREADS and ARROW_IO_THREADS to match the number of available CPU cores.
It is possible to use all train/validate/test/predict splits; in that case, the paths to the splits should be passed as the corresponding arguments of ParquetModule. Alternatively, some of the split paths may be omitted, but then the PyTorch Lightning Trainer instance must be configured accordingly. For example, if you do not want to use validation data, you can omit the validate_path parameter of ParquetModule and set limit_val_batches=0 in Lightning's Trainer.
- __init__(batch_size, metadata, transforms, config=None, *, train_path=None, validate_path=None, test_path=None, predict_path=None)
- Parameters
batch_size (int) – Target batch size.
metadata (dict) –
A dictionary in which each data split maps to a dictionary of feature names, where each feature is associated with its shape and padding_value.
Example: {“train”: {“item_id” : {“shape”: 100, “padding_value”: 7657}}}.
For details, see the section Parquet processing.
config (Optional[dict]) –
Dict specifying configuration options of ParquetDataset (generator, filesystem, collate_fn, make_mask_name, replicas_info) for each data split. Default: DEFAULT_CONFIG. In most scenarios, the default configuration is sufficient.
transforms (dict[Literal['train', 'validate', 'test', 'predict'], list[torch.nn.modules.module.Module]]) – Dict specifying sequence of Transform modules for each data split.
train_path (Optional[str]) – Path to the Parquet file containing the train data split. Default: None.
validate_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing the validation data split. Default: None.
test_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing the testing data split. Default: None.
predict_path (Optional[Union[str, list[str]]]) – Path to the Parquet file or files containing the prediction data split. Default: None.
Example
This is a minimal usage example of ParquetModule. It uses train data only, and the Transforms are defined to support further training of the SasRec model.
See the full example in examples/09_sasrec_example.ipynb.
from replay.data.nn import ParquetModule
from replay.nn.transform.template import make_default_sasrec_transforms

metadata = {
    "user_id": {},
    "item_id": {"shape": 50, "padding": 51},
}
transforms = make_default_sasrec_transforms(tensor_schema, query_column="user_id")
parquet_datamodule = ParquetModule(
    batch_size=64,
    metadata=metadata,
    transforms=transforms,
    train_path="data/train.parquet",
)