Modules

Generators

Generators produce synthetic or real data for the simulation process. All generators derive from the GeneratorBase base class; to implement your own generator, inherit from it. In short, a generator is fitted on a provided dataset (a real-data generator just remembers it), then builds a population with a given number of rows by calling generate(), and finally draws from that population with sample(). Note that sample() takes a fraction of the population size.
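
The fit, generate, sample sequence described above can be sketched in plain Python. ToyRealGenerator is a hypothetical stand-in written for this illustration, not the sim4rec class; it works on lists of dicts instead of Spark dataframes, and drawing rows without replacement is an assumption.

```python
import random


class ToyRealGenerator:
    """Hypothetical stand-in for a real-data generator (not the sim4rec class)."""

    def __init__(self, label, seed=None):
        self.label = label
        self._rng = random.Random(seed)
        self._source = None      # rows remembered by fit()
        self._population = None  # rows produced by generate()

    def fit(self, rows):
        # A real-data generator only remembers the dataset.
        self._source = list(rows)

    def generate(self, num_samples):
        # Build the population to sample from (assumption: no replacement).
        if self._source is None:
            raise RuntimeError("call fit() before generate()")
        if num_samples > len(self._source):
            raise ValueError("not enough source rows")
        self._population = self._rng.sample(self._source, num_samples)
        return self._population

    def sample(self, sample_frac):
        # sample_frac is a fraction of the generated population.
        if self._population is None:
            raise RuntimeError("call generate() before sample()")
        n = round(len(self._population) * sample_frac)
        return self._rng.sample(self._population, n)


users = [{"user_idx": i, "age": 20 + i} for i in range(100)]
gen = ToyRealGenerator(label="users", seed=42)
gen.fit(users)
population = gen.generate(50)
batch = gen.sample(0.2)  # 20% of the 50 generated rows
```

Here sample(0.2) returns 10 rows because the fraction applies to the 50 generated rows, not to the 100 rows passed to fit().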

To use several generators at once (e.g. to model multiple groups of users or to mix results from different generative models), see CompositeGenerator. It handles a list of generators and has a weights parameter that controls the share of each generator at generation and sampling time.

class sim4rec.modules.GeneratorBase(label: str, seed: int | None = None)

Base class for data generators

Parameters:
  • label – Generator string label

  • seed – Fixes seed sequence to use during multiple generator calls, defaults to None

fit(df: DataFrame)

Fits generator on passed dataframe

Parameters:

df – Source dataframe to fit on

abstract generate(num_samples: int)

Generates num_samples from fitted model or saved dataframe

Parameters:

num_samples – Number of samples to generate

sample(sample_frac: float) DataFrame

Samples a fraction of rows from a dataframe, generated with generate() call

Parameters:

sample_frac – Fraction of rows

Returns:

Sampled dataframe

class sim4rec.modules.RealDataGenerator(label: str, seed: int | None = None)

Real data generator, which can sample from existing dataframe


Parameters:
  • label – Generator string label

  • seed – Fixes seed sequence to use during multiple generator calls, defaults to None

fit(df: DataFrame) None
Parameters:

df – Dataframe for generation and sampling

generate(num_samples: int) DataFrame

Generates a number of samples from fitted dataframe and keeps it for sampling

Parameters:

num_samples – Number of samples to generate

Returns:

Generated dataframe

class sim4rec.modules.SDVDataGenerator(label: str, id_column_name: str, model_name: str = 'gaussiancopula', parallelization_level: int = 1, device_name: str = 'cpu', seed: int | None = None)

Synthetic data generator with a bunch of models from SDV library

Parameters:
  • label – Generator string label

  • id_column_name – Column name for identifier

  • model_name – Name of a SDV model. Possible values are: [‘copulagan’, ‘ctgan’, ‘gaussiancopula’, ‘tvae’], defaults to ‘gaussiancopula’

  • parallelization_level – Parallelization level, defaults to 1

  • device_name – PyTorch device name, defaults to ‘cpu’

  • seed – Fixes seed sequence to use during multiple generator calls, defaults to None

fit(df: DataFrame) None

Fits a generation model on the passed dataframe. Pass only feature columns

Parameters:

df – Dataframe to fit on

generate(num_samples: int) DataFrame

Generates a number of samples from fitted dataframe and keeps it for sampling

Parameters:

num_samples – Number of samples to generate

Returns:

Generated dataframe

static load(filename: str)

Loads the generator model from the file

Parameters:

filename – Path to the file

Returns:

Generator instance with restored model

save_model(filename: str)

Saves the generator model to a file. Note that only the fitted model is saved, not the generated dataframe

Parameters:

filename – Path to the file

setDevice(device_name: str) None

Changes the current device. Note that the gaussiancopula model supports only cpu

Parameters:

device_name – PyTorch device name

class sim4rec.modules.CompositeGenerator(generators: List[GeneratorBase], label: str, weights: Iterable | None = None)

Wrapper for sampling from multiple generators. Use the weights parameter to control the sampling fraction of each generator

Parameters:
  • generators – List of generators

  • label – Generator string label

  • weights – Weights for each generator. Weights must be normalized (sum to 1), defaults to None

generate(num_samples: int) None

Calls generate() on each generator with a number of samples proportional to its weight, so that num_samples are generated in total. Calling this method removes the need to call generate() on each generator separately

Parameters:

num_samples – Total number of samples to generate
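
A minimal sketch of splitting num_samples across generators in proportion to normalized weights. The helper split_by_weights and its rounding rule (leftover rows go to the largest fractional parts) are assumptions made for illustration; how CompositeGenerator rounds internally is not stated here.

```python
def split_by_weights(num_samples, weights):
    """Split num_samples into per-generator counts proportional to weights."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    counts = [int(num_samples * w) for w in weights]
    # Hand out the rows lost to truncation, largest fractional part first.
    remainder = num_samples - sum(counts)
    order = sorted(
        range(len(weights)),
        key=lambda i: num_samples * weights[i] - counts[i],
        reverse=True,
    )
    for i in order[:remainder]:
        counts[i] += 1
    return counts


counts = split_by_weights(1000, [0.5, 0.3, 0.2])
thirds = split_by_weights(10, [1 / 3, 1 / 3, 1 / 3])
```

The counts always add up to num_samples, even when the weights do not divide it evenly.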

sample(sample_frac: float) DataFrame

Samples a fraction of rows from generators according to the weights.

Parameters:

sample_frac – Fraction of rows

Returns:

Sampled dataframe

Embeddings

Embeddings are useful for high-dimensional or highly sparse data and should be applied before running the main simulator pipeline. An autoencoder estimator and transformer are implemented here in case the existing Spark methods are not enough for your purposes. A usage example can be found in the notebooks directory.

class sim4rec.modules.EncoderEstimator(inputCols: List[str], outputCols: List[str], hidden_dim: int, lr: float, batch_size: int, num_loader_workers: int, max_iter: int = 100, device_name: str = 'cpu', seed: int | None = None)

Estimator for encoder part of the autoencoder pipeline. Trains the encoder to process incoming data into latent representation

Parameters:
  • inputCols – Column names to process

  • outputCols – List of output column names per latent coordinate. The length of outputCols will determine the embedding dimension size

  • hidden_dim – Size of hidden layers

  • lr – Learning rate

  • batch_size – Batch size during training process

  • num_loader_workers – Number of cpus to use for data loader

  • max_iter – Maximum number of iterations, defaults to 100

  • device_name – PyTorch device name, defaults to ‘cpu’

class sim4rec.modules.EncoderTransformer(inputCols: List[str], outputCols: List[str], encoder: Encoder, device_name: str = 'cpu')

Encoder transformer that transforms incoming columns into latent representation. Output data will be appended to dataframe and named according to outputCols parameter

Parameters:
  • inputCols – Column names to process

  • outputCols – List of output column names per latent coordinate. The length of outputCols must be equal to embedding dimension of a trained encoder

  • encoder – Trained encoder

  • device_name – PyTorch device name, defaults to ‘cpu’

Items selectors

These Spark transformers assign items to given users when forming the candidate pairs that the recommendation system scores. A selector is optional in your pipeline if the recommendation algorithm can recommend any item without restrictions. Implement your own selector when item selection for certain users needs custom logic. A selector can implement business rules (e.g. some items are not available to a user), be a simple recommendation model that generates candidates, or create user-specific items (e.g. price offers) online. To implement a custom selector, derive from the ItemSelectionEstimator base class to implement pre-calculation logic, and from ItemSelectionTransformer to create the pairs. Both classes inherit from Spark’s Estimator and Transformer classes, so to define fit() and transform() you only need to override _fit() and _transform().

class sim4rec.modules.ItemSelectionEstimator(userKeyColumn: str = None, itemKeyColumn: str = None)

Base class for item selection estimator

class sim4rec.modules.ItemSelectionTransformer(userKeyColumn: str = None, itemKeyColumn: str = None)

Base class for item selection transformer. transform() will be used to create user-item pairs

class sim4rec.modules.CrossJoinItemEstimator(k: int, userKeyColumn: str | None = None, itemKeyColumn: str | None = None, seed: int | None = None)

Assigns k items to every user from a random subsample of items

Parameters:
  • k – Number of items for every user

  • userKeyColumn – Users identifier column, defaults to None

  • itemKeyColumn – Items identifier column, defaults to None

  • seed – Random state seed, defaults to None

_fit(df: DataFrame)

Fits estimator with items dataframe

Parameters:

df – Items dataframe

Returns:

CrossJoinItemTransformer instance

class sim4rec.modules.CrossJoinItemTransformer(item_df: DataFrame, k: int, userKeyColumn: str | None = None, itemKeyColumn: str | None = None, seed: int | None = None)

Assigns k items to every user from a random subsample of items

_transform(df: DataFrame)

Takes a users dataframe and assigns the defined number of items

Parameters:

df – Users dataframe

Returns:

Users cross join on random items subsample
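
The cross join on a random items subsample can be sketched in plain Python as follows. The function cross_join_items is hypothetical and works on lists rather than Spark dataframes; reading the description as one shared subsample of k items for all users is an assumption.

```python
import random


def cross_join_items(user_ids, item_ids, k, seed=None):
    """Pair every user with the same random subsample of k items."""
    rng = random.Random(seed)
    chosen = rng.sample(list(item_ids), k)
    # Cross join: every user is paired with every chosen item.
    return [(u, i) for u in user_ids for i in chosen]


pairs = cross_join_items(user_ids=[0, 1, 2], item_ids=range(10), k=4, seed=7)
```

With 3 users and k = 4 the result holds 12 user-item pairs.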

Simulator

The simulator class provides a way to handle the simulation process by connecting different parts of the module, such as generators and response pipelines, and saving the results to a given directory. The simulation process consists of the following steps:

  • Sampling random real or synthetic users

  • Creation of candidate user-item pairs for a recommendation algorithm

  • Prediction by a recommendation system

  • Evaluating responses to the given recommendations

  • Updating the history log

  • Metrics evaluation

  • Refitting the recommendation model with the new data

Some of the steps can be skipped depending on the task you perform. For example, you don’t need the second step if your algorithm doesn’t take user-item pairs as input, and you don’t need to refit the model if you only want to evaluate it on some data. For more usage details, please refer to the examples.
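
The steps listed above can be sketched as a toy loop in plain Python. Everything here is a placeholder chosen for illustration (the recommend function, the 0.3 response probability, the click-through metric); it shows the order of the steps, not the Simulator API.

```python
import random

rng = random.Random(0)
users = list(range(20))
items = list(range(10))
log = []  # history log of (iteration, user, item, response)


def recommend(pairs, k=2):
    # Placeholder recommender: keep the first k candidate items per user.
    recs, seen = [], {}
    for u, i in pairs:
        if seen.get(u, 0) < k:
            recs.append((u, i))
            seen[u] = seen.get(u, 0) + 1
    return recs


for iteration in range(3):
    # 1. Sample users.
    batch = rng.sample(users, 5)
    # 2. Create candidate user-item pairs.
    pairs = [(u, i) for u in batch for i in items]
    # 3. Get recommendations.
    recs = recommend(pairs)
    # 4. Evaluate responses (placeholder: random clicks).
    responses = [(u, i, int(rng.random() < 0.3)) for u, i in recs]
    # 5. Update the history log.
    log.extend((iteration, u, i, r) for u, i, r in responses)
    # 6. Evaluate a metric (share of positive responses).
    ctr = sum(r for _, _, r in responses) / len(responses)
    # 7. A real pipeline would refit the recommender on the log here.
```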

class sim4rec.modules.Simulator(user_gen: GeneratorBase, item_gen: GeneratorBase, data_dir: str, log_df: DataFrame | None = None, user_key_col: str = 'user_idx', item_key_col: str = 'item_idx', spark_session: SparkSession | None = None)

Simulator for recommendation systems that uses the passed user and item data to simulate users’ responses to recommended items

Parameters:
  • user_gen – Users data generator instance

  • item_gen – Items data generator instance

  • log_df – The history log with user-item pairs with other necessary fields. During the simulation the results will be appended to this log on update_log() call, defaults to None

  • user_key_col – User identifier column name, defaults to ‘user_idx’

  • item_key_col – Item identifier column name, defaults to ‘item_idx’

  • data_dir – Directory name to save simulator data

  • spark_session – Spark session to use, defaults to None

clear_log() None

Clears the log

get_log(user_df: DataFrame) DataFrame

Returns log for users listed in passed users’ dataframe

Parameters:

user_df – Dataframe with user identifiers to get log for

Returns:

Users’ history log. Will return None, if there is no log data

get_user_items(user_df: DataFrame, selector: Transformer) Tuple[DataFrame, DataFrame]

Forms candidate pairs to pass to the recommendation algorithm based on the provided users

Parameters:
  • user_df – Users dataframe with features and identifiers

  • selector – Transformer to use for creating user-item pairs

Returns:

Tuple of user-item pairs and log dataframes which will be used by recommendation algorithm. Will return None as a log, if there is no log data

sample_items(frac_items: float) DataFrame

Samples a fraction of random items

Parameters:

frac_items – Fraction of items to sample from the item generator

Returns:

Sampled items dataframe

sample_responses(recs_df: DataFrame, user_features: DataFrame, item_features: DataFrame, action_models: PipelineModel) DataFrame

Simulates the actions users took on their recommended items

Parameters:
  • recs_df – Dataframe with recommendations. Must contain user’s and item’s identifier columns. Other columns will be ignored

  • user_features – Users dataframe with features and identifiers

  • item_features – Items dataframe with features and identifiers

  • action_models – Spark pipeline to evaluate responses

Returns:

DataFrame with user-item pairs and the respective actions

sample_users(frac_users: float) DataFrame

Samples a fraction of random users

Parameters:

frac_users – Fraction of users to sample from the user generator

Returns:

Sampled users dataframe

update_log(log: DataFrame, iteration: int | str) None

Appends the passed log to the existing one

Parameters:
  • log – The log with user-item pairs and their other necessary fields. If there was no log before, the schema of this log is remembered and future logs are compared against it. To reset the current log and the schema, see clear_log()

  • iteration – Iteration label or index
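
The schema behaviour described for update_log() can be illustrated with a small in-memory sketch. ToyLog is hypothetical: it stores dicts instead of Spark dataframes, treats the sorted column names as the schema, and the "iteration" key it adds is an assumed name.

```python
class ToyLog:
    """Hypothetical in-memory log that mimics the schema check."""

    def __init__(self):
        self.rows = []
        self.schema = None

    def update(self, rows, iteration):
        rows = list(rows)
        for row in rows:
            columns = tuple(sorted(row))
            if self.schema is None:
                self.schema = columns  # the first log fixes the schema
            elif columns != self.schema:
                raise ValueError(f"schema mismatch: {columns} != {self.schema}")
        # Only append once every row has passed the check.
        self.rows.extend({**row, "iteration": iteration} for row in rows)

    def clear(self):
        # Resets both the stored rows and the remembered schema.
        self.rows, self.schema = [], None


log = ToyLog()
log.update([{"user_idx": 1, "item_idx": 7, "response": 1}], iteration=0)
try:
    log.update([{"user_idx": 2, "item_idx": 3}], iteration=1)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```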

Evaluation

class sim4rec.modules.EvaluateMetrics(userKeyCol: str, itemKeyCol: str, predictionCol: str, labelCol: str, replay_label_filter: float = 1.0, replay_metrics: Dict[Metric, int | List[int]] | None = None, mllib_metrics: str | List[str] | None = None)

Metric evaluator for recommendation systems and response functions. The class evaluates the quality of a response function on historical data, or of a recommender system on historical data or on the results of a simulator experiment. Several metrics from the Spark MLlib and RePlay libraries can be calculated at once. A created instance is callable on a dataframe in the format (user_id, item_id, predicted relevance/response, true relevance/response), which you can usually obtain from the simulator’s sample_responses() or from log data with recommendation algorithm scores. When RePlay metrics are requested, it additionally filters the passed dataframe to keep only the necessary responses (e.g. those where the response equals 1).

Parameters:
  • userKeyCol – User identifier column name

  • itemKeyCol – Item identifier column name

  • predictionCol – Predicted scores column name

  • labelCol – True label column name

  • replay_label_filter – RePlay metrics assume that only positive responses are presented in ground truth data. All user-item pairs with col(labelCol) >= replay_label_filter condition are treated as positive responses during RePlay metrics calculation, defaults to 1.0

  • replay_metrics – Dictionary with RePlay metrics. See https://sb-ai-lab.github.io/RePlay/pages/modules/metrics.html for information about available metrics and their descriptions. The dictionary format is the same as in the Experiment class of the RePlay library, defaults to None

  • mllib_metrics – Metrics to calculate from spark’s mllib. See REGRESSION_METRICS, MULTICLASS_METRICS, BINARY_METRICS for available values, defaults to None

__call__(df: DataFrame) Dict[str, float]

Performs metrics calculations on passed dataframe

Parameters:

df – Spark dataframe with userKeyCol, itemKeyCol, predictionCol and labelCol columns

Returns:

Dictionary with metrics
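
To illustrate how a label filter turns responses into positives for a ranking metric, here is a plain-Python sketch of a textbook precision@k. The function precision_at_k and the sample rows are assumptions for this example; the RePlay implementation may differ in details such as tie handling.

```python
from collections import defaultdict


def precision_at_k(rows, k, label_filter=1.0):
    """rows: (user, item, predicted_score, label) tuples."""
    by_user = defaultdict(list)
    for user, item, score, label in rows:
        by_user[user].append((score, item, label))
    per_user = []
    for ranked in by_user.values():
        # Rank each user's items by predicted score, best first.
        top = sorted(ranked, key=lambda t: t[0], reverse=True)[:k]
        # A pair counts as positive when label >= label_filter.
        hits = sum(1 for _, _, label in top if label >= label_filter)
        per_user.append(hits / k)
    return sum(per_user) / len(per_user)


rows = [
    ("u1", "a", 0.9, 1.0), ("u1", "b", 0.8, 0.0), ("u1", "c", 0.1, 1.0),
    ("u2", "a", 0.7, 0.0), ("u2", "b", 0.6, 0.0), ("u2", "c", 0.5, 1.0),
]
p_at_2 = precision_at_k(rows, k=2)  # u1 scores 0.5, u2 scores 0.0
```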

sim4rec.modules.evaluate_synthetic(synth_df: DataFrame, real_df: DataFrame) dict

Evaluates the quality of synthetic data against real data. The following metrics will be calculated:

  • LogisticDetection: The metric evaluates how hard it is to distinguish the synthetic data from the real data by using a Logistic regression model

  • SVCDetection: The metric evaluates how hard it is to distinguish the synthetic data from the real data by using a C-Support Vector Classification model

  • KSTest: This metric uses the two-sample Kolmogorov-Smirnov test to compare the distributions of continuous columns using the empirical CDF

  • ContinuousKLDivergence: This approximates the KL divergence by binning the continuous values to turn them into categorical values and then computing the relative entropy

Parameters:
  • synth_df – Synthetic data without any identifiers

  • real_df – Real data without any identifiers

Returns:

Dictionary with metrics on synthetic data quality

class sim4rec.modules.QualityControlObjective(userKeyCol: str, itemKeyCol: str, predictionCol: str, labelCol: str, relevanceCol: str, response_function: Transformer, replay_metrics: Dict[Metric, int | List[int]] | None)

QualityControlObjective evaluates the quality of a response function by calculating the degree of similarity between the results of a model trained on real data and a model trained with the simulator. The calculated function is

\[ \begin{align}\begin{aligned}1 - KS(predictionCol, labelCol) + DKL_{norm}(predictionCol, labelCol)\\- \frac{1}{N} \sum_{i=1}^{N} |QM_{syn}^{i}(recs_{synthetic}, ground\_truth_{synthetic}) - QM_{real}^{i}(recs_{real}, ground\_truth_{real})|,\end{aligned}\end{align} \]

where

\[ \begin{align}\begin{aligned}KS = \sup_{x} |Q(x) - P(x)|\ \text{(i.e. the KS test statistic)}\\DKL_{norm} = \frac{1}{1 + DKL}\end{aligned}\end{align} \]

A greater value indicates more similarity between the models’ results, and a lower value indicates dissimilarity. For the KS test and KL divergence, the predicted value is the result of response_function on pairs from the real log, and the distribution of predicted responses is compared with that of the real responses. QM in the formula above is calculated with metrics from the RePlay library: for both models, they take the ground truth and predicted values, and the objective measures how close the resulting metric values are to each other.
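
As a worked example of the formula above, the following sketch combines already computed values into the objective. The function quality_objective and the input numbers are illustrative assumptions; in the class itself these inputs come from the response function and the RePlay metrics.

```python
def quality_objective(ks, dkl, qm_synthetic, qm_real):
    """1 - KS + 1 / (1 + DKL) - mean absolute gap between metric values."""
    dkl_norm = 1.0 / (1.0 + dkl)
    gaps = [abs(s - r) for s, r in zip(qm_synthetic, qm_real)]
    return 1.0 - ks + dkl_norm - sum(gaps) / len(gaps)


# 1 - 0.2 + 0.5 - 0.05 = 1.25
value = quality_objective(
    ks=0.2, dkl=1.0, qm_synthetic=[0.30, 0.50], qm_real=[0.20, 0.50]
)
```

With KS = 0, DKL = 0 and identical metric values the objective reaches its maximum of 2.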

Parameters:
  • userKeyCol – User identifier column name

  • itemKeyCol – Item identifier column name

  • predictionCol – Prediction column name, which response_function will create

  • labelCol – Column name with ground truth response values

  • relevanceCol – Relevance column name for RePlay metrics. For ground truth dataframe it should be response score and for dataframe with recommendations it should be the predicted relevance from recommendation algorithm

  • response_function – Spark’s transformer which predicts the response value

  • replay_metrics – Dictionary with RePlay metrics. See https://sb-ai-lab.github.io/RePlay/pages/modules/metrics.html for information about available metrics and their descriptions. The dictionary format is the same as in the Experiment class of the RePlay library. Those metrics will be used as QM in the objective above

__call__(test_log: DataFrame, user_features: DataFrame, item_features: DataFrame, real_recs: DataFrame, real_ground_truth: DataFrame, synthetic_recs: DataFrame, synthetic_ground_truth: DataFrame) float

Calculates the models’ similarity value. Note that the recommendation dataframes for both synthetic and real data must include only users from the ground truth dataframe

Parameters:
  • test_log – Real log dataframe with response values

  • user_features – Users features dataframe with identifier

  • item_features – Items features dataframe with identifier

  • real_recs – Recommendations dataframe from model trained on real dataset

  • real_ground_truth – Real log dataframe with only positive responses

  • synthetic_recs – Recommendations dataframe from model trained with simulator

  • synthetic_ground_truth – Simulator’s log dataframe with only positive responses

Returns:

Function value

sim4rec.modules.ks_test(df: DataFrame, predCol: str, labelCol: str) float

Kolmogorov-Smirnov test on two dataframe columns

Parameters:
  • df – Dataframe with two target columns

  • predCol – Column name with values to test

  • labelCol – Column name with values to test against

Returns:

Result of KS test
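
For reference, the two-sample KS statistic (the largest gap between two empirical CDFs) can be computed in plain Python as below. The function ks_statistic is a sketch on Python lists; whether ks_test returns this statistic or a p-value is not stated here, although the QualityControlObjective formula uses the statistic.

```python
from bisect import bisect_right


def ks_statistic(xs, ys):
    """Two-sample KS statistic: sup over x of |F_xs(x) - F_ys(x)|."""
    xs, ys = sorted(xs), sorted(ys)
    # The empirical CDFs only change at observed values.
    points = sorted(set(xs) | set(ys))
    return max(
        abs(bisect_right(xs, p) / len(xs) - bisect_right(ys, p) / len(ys))
        for p in points
    )


same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])     # non-overlapping samples
```

Identical samples give 0 and fully separated samples give 1, which are the two extremes of the statistic.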

sim4rec.modules.kl_divergence(df: DataFrame, predCol: str, labelCol: str) float

Normalized Kullback–Leibler divergence on two dataframe columns. The normalization is as follows:

\[\frac{1}{1 + KL\_div}\]
Parameters:
  • df – Dataframe with two target columns

  • predCol – First column name

  • labelCol – Second column name

Returns:

Result of KL divergence
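
A minimal sketch of the normalization above, using equal-width bins to estimate the two distributions. The function normalized_kl, the number of bins and the eps smoothing are assumptions made for illustration; the binning used by kl_divergence itself is not described here.

```python
import math


def normalized_kl(p_values, q_values, bins=10, eps=1e-9):
    """Return 1 / (1 + KL(P || Q)) for two samples, using shared bins."""
    lo = min(min(p_values), min(q_values))
    hi = max(max(p_values), max(q_values))
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    p, q = histogram(p_values), histogram(q_values)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return 1.0 / (1.0 + kl)


identical = normalized_kl([0.1, 0.4, 0.9], [0.1, 0.4, 0.9])
shifted = normalized_kl([0.0, 0.1, 0.2], [0.8, 0.9, 1.0])
```

Identical samples give a KL divergence of 0 and therefore a normalized value of 1; the more the distributions differ, the closer the value gets to 0.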