Modules

Generators

Generators produce synthetic or real data for the simulation process. All generators derive from the GeneratorBase base class, and to implement your own generator you must inherit from it. In short, a generator is first fitted on a provided dataset (a real data generator simply remembers it), then creates a population of a given number of rows to sample from when generate() is called, and finally draws samples from that population with sample(). Note that sample() takes a fraction of the population size rather than a row count.

If you need several generators at once (e.g. to model multiple groups of users or to mix results from different generative models), look at CompositeGenerator. It handles a list of generators and has a weights parameter that controls each generator’s proportion at both generation and sampling time.
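The fit / generate / sample life cycle described above can be illustrated with a toy stand-in. This is a minimal sketch in plain Python, not sim4rec code: the class ToyRealGenerator and its list-of-dicts "dataframe" are assumptions made only for illustration, and sampling without replacement is one possible choice.

```python
import random


class ToyRealGenerator:
    """Toy stand-in for a real-data generator (hypothetical, NOT the sim4rec class)."""

    def __init__(self, label, seed=None):
        self.label = label
        self._rng = random.Random(seed)
        self._source = []      # rows remembered by fit()
        self._population = []  # rows produced by generate()

    def fit(self, rows):
        # A real-data generator only remembers the dataset.
        self._source = list(rows)

    def generate(self, num_samples):
        # Build the population to sample from (here: without replacement).
        self._population = self._rng.sample(self._source, num_samples)
        return self._population

    def sample(self, sample_frac):
        # sample_frac is a fraction of the population size, not a row count.
        n = round(len(self._population) * sample_frac)
        return self._rng.sample(self._population, n)


gen = ToyRealGenerator(label="users", seed=0)
gen.fit([{"user_idx": i} for i in range(100)])
population = gen.generate(num_samples=50)
batch = gen.sample(sample_frac=0.2)  # 20% of the 50 generated rows
```

The point of the sketch is the order of calls: sample() draws from whatever generate() produced last, so generate() has to be called before sampling.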

class sim4rec.modules.GeneratorBase(label: str, seed: int | None = None)

Base class for data generators

Parameters:

label – Generator string label
seed – Fixes seed sequence to use during multiple generator calls, defaults to None

fit(df: DataFrame)

Fits generator on passed dataframe

Parameters:: df – Source dataframe to fit on

abstract generate(num_samples: int)

Generates num_samples from fitted model or saved dataframe

Parameters:: num_samples – Number of samples to generate

sample(sample_frac: float) → DataFrame

Samples a fraction of rows from a dataframe, generated with generate() call

Parameters:: sample_frac – Fraction of rows
Returns:: Sampled dataframe

class sim4rec.modules.RealDataGenerator(label: str, seed: int | None = None)

Real data generator, which can sample from existing dataframe

Parameters:

label – Generator string label
seed – Fixes seed sequence to use during multiple generator calls, defaults to None

fit(df: DataFrame) → None

Parameters:: df – Dataframe for generation and sampling

generate(num_samples: int) → DataFrame

Generates a number of samples from fitted dataframe and keeps it for sampling

Parameters:: num_samples – Number of samples to generate
Returns:: Generated dataframe

class sim4rec.modules.SDVDataGenerator(label: str, id_column_name: str, model_name: str = 'gaussiancopula', parallelization_level: int = 1, device_name: str = 'cpu', seed: int | None = None)

Synthetic data generator backed by several models from the SDV library

Parameters:

label – Generator string label
id_column_name – Column name for identifier
model_name – Name of an SDV model. Possible values are: [‘copulagan’, ‘ctgan’, ‘gaussiancopula’, ‘tvae’], defaults to ‘gaussiancopula’
parallelization_level – Parallelization level, defaults to 1
device_name – PyTorch device name, defaults to ‘cpu’
seed – Fixes seed sequence to use during multiple generator calls, defaults to None

fit(df: DataFrame) → None

Fits a generation model on the passed dataframe. Pass only feature columns

Parameters:: df – Dataframe to fit on

generate(num_samples: int) → DataFrame

Generates a number of samples from the fitted model and keeps them for sampling

Parameters:: num_samples – Number of samples to generate
Returns:: Generated dataframe

static load(filename: str)

Loads the generator model from the file

Parameters:: filename – Path to the file
Returns:: Generator instance with restored model

save_model(filename: str)

Saves the generator model to a file. Note that only the fitted model is saved, not the generated dataframe

Parameters:: filename – Path to the file

setDevice(device_name: str) → None

Changes the current device. Note that the gaussiancopula model supports only cpu

Parameters:: device_name – PyTorch device name

class sim4rec.modules.CompositeGenerator(generators: List[GeneratorBase], label: str, weights: Iterable | None = None)

Wrapper for sampling from multiple generators. Use the weights parameter to control the sampling fraction of each generator

Parameters:

generators – List of generators
label – Generator string label
weights – Weights for each generator. Weights must be normalized (sum to 1), defaults to None

generate(num_samples: int) → None

Calls generate() on each generator with a number of samples proportional to its weight, so that num_samples are generated in total. Calling this method saves you from calling generate() separately on each generator

Parameters:: num_samples – Total number of samples to generate

sample(sample_frac: float) → DataFrame

Samples a fraction of rows from generators according to the weights.

Parameters:: sample_frac – Fraction of rows
Returns:: Sampled dataframe
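To make the weight-proportional split concrete, here is a minimal sketch of dividing num_samples between generators. The helper split_by_weights is hypothetical and the largest-remainder rounding is an assumption; the rounding used by CompositeGenerator itself may differ.

```python
def split_by_weights(num_samples, weights):
    """Split num_samples into integer counts proportional to weights (illustrative only)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    exact = [num_samples * w for w in weights]
    counts = [int(x) for x in exact]  # floor of each share
    # Hand the leftover rows to the largest fractional remainders.
    leftover = num_samples - sum(counts)
    order = sorted(range(len(weights)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


counts = split_by_weights(10, [0.5, 0.3, 0.2])
```

Whatever rounding is used, the per-generator counts should add up to the requested total, which is what the leftover step guarantees here.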

Embeddings

Embeddings are useful for high-dimensional or highly sparse data and should be applied before running the main simulator pipeline. An autoencoder estimator and transformer are implemented here in case the existing Spark methods are not enough for your purposes. A usage example can be found in the notebooks directory.

class sim4rec.modules.EncoderEstimator(inputCols: List[str], outputCols: List[str], hidden_dim: int, lr: float, batch_size: int, num_loader_workers: int, max_iter: int = 100, device_name: str = 'cpu', seed: int | None = None)

Estimator for encoder part of the autoencoder pipeline. Trains the encoder to process incoming data into latent representation

Parameters:

inputCols – Column names to process
outputCols – List of output column names per latent coordinate. The length of outputCols will determine the embedding dimension size
hidden_dim – Size of hidden layers
lr – Learning rate
batch_size – Batch size during training process
num_loader_workers – Number of CPUs to use for the data loader
max_iter – Maximum number of iterations, defaults to 100
device_name – PyTorch device name, defaults to ‘cpu’

class sim4rec.modules.EncoderTransformer(inputCols: List[str], outputCols: List[str], encoder: Encoder, device_name: str = 'cpu')

Encoder transformer that transforms incoming columns into latent representation. Output data will be appended to dataframe and named according to outputCols parameter

Parameters:

inputCols – Column names to process
outputCols – List of output column names per latent coordinate. The length of outputCols must be equal to embedding dimension of a trained encoder
encoder – Trained encoder
device_name – PyTorch device name, defaults to ‘cpu’

Items selectors

These Spark transformers assign items to given users when building the candidate pairs for prediction by a recommendation system. A selector is optional in your pipeline if the recommendation algorithm can recommend any item with no restrictions. Implement your own selector when item selection for certain users needs custom logic. A selector can implement business rules (e.g. some items are not available to a user), be a simple recommendation model that generates candidates, or create user-specific items (e.g. price offers) online. To implement a custom selector, derive from the ItemSelectionEstimator base class to implement pre-calculation logic and from ItemSelectionTransformer to create the pairs. Both classes inherit from Spark’s Estimator and Transformer classes, so to define fit() and transform() you only need to override _fit() and _transform().

class sim4rec.modules.ItemSelectionEstimator(userKeyColumn: str = None, itemKeyColumn: str = None)

Base class for item selection estimator

class sim4rec.modules.ItemSelectionTransformer(userKeyColumn: str = None, itemKeyColumn: str = None)

Base class for item selection transformer. transform() will be used to create user-item pairs

class sim4rec.modules.CrossJoinItemEstimator(k: int, userKeyColumn: str | None = None, itemKeyColumn: str | None = None, seed: int | None = None)

Assigns k items for every user from random items subsample

Parameters:

k – Number of items for every user
userKeyColumn – Users identifier column, defaults to None
itemKeyColumn – Items identifier column, defaults to None
seed – Random state seed, defaults to None

_fit(df: DataFrame)

Fits estimator with items dataframe

Parameters:: df – Items dataframe
Returns:: CrossJoinItemTransformer instance

class sim4rec.modules.CrossJoinItemTransformer(item_df: DataFrame, k: int, userKeyColumn: str | None = None, itemKeyColumn: str | None = None, seed: int | None = None)

Assigns k items for every user from random items subsample

_transform(df: DataFrame)

Takes a users dataframe and assigns the defined number of items

Parameters:: df – Users dataframe
Returns:: Users cross join on random items subsample
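The idea of a cross join on a random item subsample can be sketched in plain Python. This is one reading of the description above, under the assumption that a single subsample of k items is shared by all users; the function cross_join_random_items is hypothetical and works on lists rather than Spark dataframes.

```python
import random
from itertools import product


def cross_join_random_items(user_ids, item_ids, k, seed=None):
    """Pair every user with the same random subsample of k items (illustrative only)."""
    rng = random.Random(seed)
    subsample = rng.sample(list(item_ids), k)  # k distinct items
    return [(u, i) for u, i in product(user_ids, subsample)]


pairs = cross_join_random_items(user_ids=[1, 2, 3], item_ids=range(10), k=2, seed=42)
```

With 3 users and k = 2 the result holds 6 candidate pairs, which is the size of input a recommendation algorithm would then score.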

Simulator

The simulator class handles the simulation process by connecting different parts of the module, such as generators and response pipelines, and saving the results to a given directory. The simulation process consists of the following steps:

Sampling random real or synthetic users
Creation of candidate user-item pairs for a recommendation algorithm
Prediction by a recommendation system
Evaluating responses to the given recommendations
Updating the history log
Metrics evaluation
Refitting the recommendation model with new data

Some of the steps can be skipped depending on the task you perform. For example, you don’t need the second step if your algorithm doesn’t take user-item pairs as input, and you don’t need to refit the model if you just want to evaluate it on some data. For more usage details, please refer to the examples.
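The steps listed above can be sketched as a toy loop in plain Python. Nothing here is sim4rec API: the recommend and respond functions, the tuple-based log and the mean-response metric are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import random

rng = random.Random(0)
users = list(range(20))  # stand-in for the user generator's population
items = list(range(5))   # stand-in for the item generator's population
log = []                 # history log of (user, item, response, iteration)


def recommend(pairs, history):
    # Hypothetical recommender: one item per user, preferring the item
    # with the most positive responses in the history so far.
    popularity = {i: 0 for i in items}
    for _, item, response, _ in history:
        popularity[item] += response
    best = {}
    for user, item in pairs:
        if user not in best or popularity[item] > popularity[best[user]]:
            best[user] = item
    return list(best.items())


def respond(recs):
    # Hypothetical response function: even-numbered items are liked.
    return [(u, i, 1 if i % 2 == 0 else 0) for u, i in recs]


metrics = []
for iteration in range(3):
    batch = rng.sample(users, 5)                                      # 1. sample users
    pairs = [(u, i) for u in batch for i in items]                    # 2. candidate pairs
    recs = recommend(pairs, log)                                      # 3. prediction
    responses = respond(recs)                                         # 4. responses
    log.extend((u, i, r, iteration) for u, i, r in responses)         # 5. update log
    metrics.append(sum(r for _, _, r in responses) / len(responses))  # 6. metric
    # 7. refitting is implicit here: recommend() reads the updated log
```

In a real pipeline each numbered comment would map to a simulator or model call; skipping a step, such as the candidate pair creation, just removes the corresponding line from the loop.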

class sim4rec.modules.Simulator(user_gen: GeneratorBase, item_gen: GeneratorBase, data_dir: str, log_df: DataFrame | None = None, user_key_col: str = 'user_idx', item_key_col: str = 'item_idx', spark_session: SparkSession | None = None)

Simulator for recommendation systems, which uses the users and items data passed to it to simulate the users’ responses to recommended items

Parameters:

user_gen – Users data generator instance
item_gen – Items data generator instance
log_df – The history log with user-item pairs with other necessary fields. During the simulation the results will be appended to this log on update_log() call, defaults to None
user_key_col – User identifier column name, defaults to ‘user_idx’
item_key_col – Item identifier column name, defaults to ‘item_idx’
data_dir – Directory name to save simulator data
spark_session – Spark session to use, defaults to None

clear_log() → None: Clears the log

get_log(user_df: DataFrame) → DataFrame

Returns log for users listed in passed users’ dataframe

Parameters:: user_df – Dataframe with user identifiers to get log for
Returns:: Users’ history log. Will return None, if there is no log data

get_user_items(user_df: DataFrame, selector: Transformer) → Tuple[DataFrame, DataFrame]

Forms candidate pairs to pass to the recommendation algorithm based on the provided users

Parameters:

user_df – Users dataframe with features and identifiers
selector – Transformer to use for creating user-item pairs

Returns:

Tuple of user-item pairs and log dataframes which will be used by recommendation algorithm. Will return None as a log, if there is no log data

sample_items(frac_items: float) → DataFrame

Samples a fraction of random items

Parameters:: frac_items – Fraction of items to sample from item generator
Returns:: Sampled items dataframe

sample_responses(recs_df: DataFrame, user_features: DataFrame, item_features: DataFrame, action_models: PipelineModel) → DataFrame

Simulates the actions users took on their recommended items

Parameters:

recs_df – Dataframe with recommendations. Must contain user’s and item’s identifier columns. Other columns will be ignored
user_features – Users dataframe with features and identifiers
item_features – Items dataframe with features and identifiers
action_models – Spark pipeline to evaluate responses

Returns:

DataFrame with user-item pairs and the respective actions

sample_users(frac_users: float) → DataFrame

Samples a fraction of random users

Parameters:: frac_users – Fraction of users to sample from user generator
Returns:: Sampled users dataframe

update_log(log: DataFrame, iteration: int | str) → None

Appends the passed log to the existing one

Parameters:

log – The log with user-item pairs and their other necessary fields. If there was no log before, the schema of this log is remembered and future logs will be compared against it. To reset the current log and the schema, see clear_log()
iteration – Iteration label or index
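The schema-remembering behaviour of update_log() can be pictured with a toy log. This is a minimal sketch, not the Simulator implementation: the ToyLog class is hypothetical, rows are plain dicts, and raising ValueError on a mismatch is an assumption about how a failed comparison might be reported.

```python
class ToyLog:
    """Toy history log that fixes its schema on the first update (hypothetical)."""

    def __init__(self):
        self.schema = None  # column names remembered from the first log
        self.rows = []

    def update(self, rows, iteration):
        # Check every row before storing anything.
        for row in rows:
            if self.schema is None:
                self.schema = frozenset(row)
            elif frozenset(row) != self.schema:
                raise ValueError(f"schema mismatch: {sorted(row)}")
        self.rows.extend({**row, "iteration": iteration} for row in rows)

    def clear(self):
        # Resets both the stored rows and the remembered schema.
        self.schema = None
        self.rows = []


log = ToyLog()
log.update([{"user_idx": 1, "item_idx": 7, "response": 1}], iteration=0)
try:
    log.update([{"user_idx": 2, "item_idx": 3}], iteration=1)  # missing "response"
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```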

Evaluation

class sim4rec.modules.EvaluateMetrics(userKeyCol: str, itemKeyCol: str, predictionCol: str, labelCol: str, replay_label_filter: float = 1.0, replay_metrics: Dict[Metric, int | List[int]] | None = None, mllib_metrics: str | List[str] | None = None)

Metric evaluator class for recommendation systems and response functions. The class lets you evaluate the quality of a response function on historical data, or of a recommender system on historical data or on the results of an experiment in a simulator. It calculates several metrics at once using metrics from the Spark MLlib and RePlay libraries. A created instance is callable on a dataframe in the format user_id, item_id, predicted relevance/response, true relevance/response, which you can usually retrieve from the simulator’s sample_responses() or from log data with recommendation algorithm scores. When RePlay metrics are requested, it additionally filters the passed dataframe to keep only the necessary responses (e.g. those where the response equals 1).

Parameters:

userKeyCol – User identifier column name
itemKeyCol – Item identifier column name
predictionCol – Predicted scores column name
labelCol – True label column name
replay_label_filter – RePlay metrics assume that only positive responses are presented in ground truth data. All user-item pairs with col(labelCol) >= replay_label_filter condition are treated as positive responses during RePlay metrics calculation, defaults to 1.0
replay_metrics – Dictionary with replay metrics. See https://sb-ai-lab.github.io/RePlay/pages/modules/metrics.html for information about available metrics and their descriptions. The dictionary format is the same as in Experiment class of the RePlay library, defaults to None
mllib_metrics – Metrics to calculate from spark’s mllib. See REGRESSION_METRICS, MULTICLASS_METRICS, BINARY_METRICS for available values, defaults to None

__call__(df: DataFrame) → Dict[str, float]

Performs metrics calculations on passed dataframe

Parameters:: df – Spark dataframe with userKeyCol, itemKeyCol, predictionCol and labelCol columns
Returns:: Dictionary with metrics
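The effect of replay_label_filter can be shown on plain Python rows. This is a sketch of the condition label >= replay_label_filter only; the helper keep_positive is hypothetical and the real class applies the filter to a Spark dataframe.

```python
def keep_positive(rows, label_col, replay_label_filter=1.0):
    """Keep rows whose label is at least replay_label_filter (illustrative only)."""
    return [row for row in rows if row[label_col] >= replay_label_filter]


responses = [
    {"user_idx": 1, "item_idx": 10, "response": 1.0},
    {"user_idx": 1, "item_idx": 11, "response": 0.0},
    {"user_idx": 2, "item_idx": 10, "response": 0.4},
]
ground_truth = keep_positive(responses, label_col="response")
```

Lowering the threshold, for example to 0.4, would also treat the third row as a positive response.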

sim4rec.modules.evaluate_synthetic(synth_df: DataFrame, real_df: DataFrame) → dict

Evaluates the quality of synthetic data against real. The following metrics will be calculated:

LogisticDetection: The metric evaluates how hard it is to distinguish the synthetic data from the real data by using a Logistic regression model
SVCDetection: The metric evaluates how hard it is to distinguish the synthetic data from the real data by using a C-Support Vector Classification model
KSTest: This metric uses the two-sample Kolmogorov-Smirnov test to compare the distributions of continuous columns using the empirical CDF
ContinuousKLDivergence: This approximates the KL divergence by binning the continuous values to turn them into categorical values and then computing the relative entropy

Parameters:

synth_df – Synthetic data without any identifiers
real_df – Real data without any identifiers

Returns:

Dictionary with metrics on synthetic data quality

class sim4rec.modules.QualityControlObjective(userKeyCol: str, itemKeyCol: str, predictionCol: str, labelCol: str, relevanceCol: str, response_function: Transformer, replay_metrics: Dict[Metric, int | List[int]] | None)

QualityControlObjective is designed to evaluate the quality of a response function by calculating the degree of similarity between the results of a model trained on real data and a model trained with the simulator. The calculated function is

\[ 1 - KS(predictionCol, labelCol) + DKL_{norm}(predictionCol, labelCol) - \frac{1}{N} \sum_{i=1}^{N} \left| QM_{syn}^{i}(recs_{synthetic}, ground\_truth_{synthetic}) - QM_{real}^{i}(recs_{real}, ground\_truth_{real}) \right|, \]

where

\[ \begin{aligned} KS &= \sup_{x} |Q(x) - P(x)| \quad \text{(i.e. the KS test statistic)} \\ DKL_{norm} &= \frac{1}{1 + DKL} \end{aligned} \]

A greater value indicates more similarity between the models’ results, and a lower value indicates dissimilarity. As the predicted value for the KS test and KL divergence, it takes the result of response_function on pairs from the real log and compares the distributions of real and predicted responses. QM in the formula above is computed with metrics from the RePlay library: they take ground truth and predicted values for both models and measure how close the metric values are to each other.
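As a worked example of the objective, the function below combines already computed ingredients. It is a sketch under the assumption that the KS statistic, the raw KL divergence and the per-metric RePlay values are given as plain numbers; the function name and argument names are hypothetical.

```python
def quality_control_objective(ks, dkl, qm_synthetic, qm_real):
    """1 - KS + 1 / (1 + DKL) - mean absolute gap between metric values."""
    dkl_norm = 1.0 / (1.0 + dkl)
    gap = sum(abs(s - r) for s, r in zip(qm_synthetic, qm_real)) / len(qm_real)
    return 1.0 - ks + dkl_norm - gap


# 1 - 0.25 + 1 / (1 + 1.0) - (|0.5 - 0.25| + |0.25 - 0.25|) / 2
value = quality_control_objective(
    ks=0.25, dkl=1.0, qm_synthetic=[0.5, 0.25], qm_real=[0.25, 0.25]
)
best = quality_control_objective(ks=0.0, dkl=0.0, qm_synthetic=[0.3], qm_real=[0.3])
```

Under this formula the value reaches 2 when the response distributions coincide (KS = 0, DKL = 0) and all metric values match.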

Parameters:

userKeyCol – User identifier column name
itemKeyCol – Item identifier column name
predictionCol – Prediction column name, which response_function will create
labelCol – Column name with ground truth response values
relevanceCol – Relevance column name for RePlay metrics. For ground truth dataframe it should be response score and for dataframe with recommendations it should be the predicted relevance from recommendation algorithm
response_function – Spark’s transformer which predicts the response value
replay_metrics – Dictionary with replay metrics. See https://sb-ai-lab.github.io/RePlay/pages/modules/metrics.html for information about available metrics and their descriptions. The dictionary format is the same as in Experiment class of the RePlay library. Those metrics will be used as QM in the objective above

__call__(test_log: DataFrame, user_features: DataFrame, item_features: DataFrame, real_recs: DataFrame, real_ground_truth: DataFrame, synthetic_recs: DataFrame, synthetic_ground_truth: DataFrame) → float

Calculates the models’ similarity value. Note that the recommendation dataframes for both synthetic and real data must include only users from the ground truth dataframe

Parameters:

test_log – Real log dataframe with response values
user_features – Users features dataframe with identifier
item_features – Items features dataframe with identifier
real_recs – Recommendations dataframe from model trained on real dataset
real_ground_truth – Real log dataframe with only positive responses
synthetic_recs – Recommendations dataframe from model trained with simulator
synthetic_ground_truth – Simulator’s log dataframe with only positive responses

Returns:

Function value

sim4rec.modules.ks_test(df: DataFrame, predCol: str, labelCol: str) → float

Kolmogorov-Smirnov test on two dataframe columns

Parameters:

df – Dataframe with two target columns
predCol – Column name with values to test
labelCol – Column name with values to test against

Returns:

Result of KS test
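For reference, the two-sample KS statistic (the largest gap between two empirical CDFs) can be computed in a few lines of plain Python. This is an illustrative sketch on lists, assuming the statistic is the quantity of interest; ks_test itself operates on Spark dataframe columns and its internals are not shown here.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The supremum is attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


stat = ks_statistic([0.1, 0.2, 0.3, 0.4], [0.3, 0.4, 0.5, 0.6])
```

The statistic is 0 for identical samples and 1 for samples whose ranges do not overlap.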

sim4rec.modules.kl_divergence(df: DataFrame, predCol: str, labelCol: str) → float

Normalized Kullback–Leibler divergence on two dataframe columns. The normalization is as follows:

\[\frac{1}{1 + KL\_div}\]

Parameters:

df – Dataframe with two target columns
predCol – First column name
labelCol – Second column name

Returns:

Result of KL divergence
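The normalization 1 / (1 + KL) can be tried out on binned values in plain Python. This is a sketch only: the number of bins, the small smoothing constant eps and the direction KL(first || second) are assumptions, and the function normalized_kl is hypothetical rather than the library routine.

```python
import math


def normalized_kl(p_values, q_values, bins=10, eps=1e-9):
    """Bin both samples on a shared grid, compute KL(p || q), return 1 / (1 + KL)."""
    lo = min(min(p_values), min(q_values))
    hi = max(max(p_values), max(q_values))
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # eps smoothing keeps log() finite for empty bins
        total = len(values) + eps * bins
        return [(c + eps) / total for c in counts]

    p, q = histogram(p_values), histogram(q_values)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return 1.0 / (1.0 + kl)


same = normalized_kl([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
different = normalized_kl([0.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 1.0])
```

Because KL is non-negative, the normalized value lies in (0, 1], with 1 meaning the binned distributions are identical.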