Modules
Generators
Generators produce synthetic or real data for the simulation process. All generators derive from the GeneratorBase base class, and to implement your own generator you must inherit from it. A generator is first fitted on a provided dataset (a real-data generator simply remembers it). Calling generate() then creates a population of the requested number of rows, and sample() draws rows from that population. Note that sample() takes a fraction of the population size, not a number of rows.
To use several generators at once (e.g. to model multiple groups of users or to mix the output of different generative models), see CompositeGenerator. It wraps a list of generators and has a weights parameter that controls the proportion each generator contributes at generation and sampling time.
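The fit / generate / sample workflow described above can be illustrated with a minimal, library-independent sketch. ToyRealGenerator is a hypothetical stand-in that works on plain Python lists rather than Spark dataframes. It is not sim4rec code and only mirrors the documented method names and the meaning of sample_frac; drawing without replacement is an assumption of this sketch.

```python
import random


class ToyRealGenerator:
    """Hypothetical stand-in for a real-data generator (not sim4rec code)."""

    def __init__(self, label, seed=None):
        self.label = label
        self._rng = random.Random(seed)
        self._source = []      # rows remembered by fit()
        self._population = []  # rows produced by generate()

    def fit(self, rows):
        # A real-data generator only remembers the dataset it was given.
        self._source = list(rows)

    def generate(self, num_samples):
        # Build the population to sample from: num_samples rows drawn
        # without replacement from the remembered dataset (an assumption).
        self._population = self._rng.sample(self._source, num_samples)
        return self._population

    def sample(self, sample_frac):
        # sample_frac is a fraction of the population size, not a row count.
        n = round(len(self._population) * sample_frac)
        return self._rng.sample(self._population, n)


gen = ToyRealGenerator(label="users", seed=0)
gen.fit([{"user_idx": i} for i in range(100)])
population = gen.generate(num_samples=40)
batch = gen.sample(sample_frac=0.25)  # a quarter of the 40 generated rows
```

The point of the sketch is the order of calls: sample() only makes sense after generate(), because the fraction refers to the generated population rather than to the fitted dataset.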
- class sim4rec.modules.GeneratorBase(label: str, seed: int | None = None)
Base class for data generators
- Parameters:
label – Generator string label
seed – Fixes seed sequence to use during multiple generator calls, defaults to None
- fit(df: DataFrame)
Fits generator on passed dataframe
- Parameters:
df – Source dataframe to fit on
- abstract generate(num_samples: int)
Generates num_samples from fitted model or saved dataframe
- Parameters:
num_samples – Number of samples to generate
- sample(sample_frac: float) DataFrame
Samples a fraction of rows from a dataframe, generated with generate() call
- Parameters:
sample_frac – Fraction of rows
- Returns:
Sampled dataframe
- class sim4rec.modules.RealDataGenerator(label: str, seed: int | None = None)
Real data generator, which can sample from existing dataframe
- Parameters:
label – Generator string label
seed – Fixes seed sequence to use during multiple generator calls, defaults to None
- fit(df: DataFrame) None
- Parameters:
df – Dataframe for generation and sampling
- generate(num_samples: int) DataFrame
Generates a number of samples from fitted dataframe and keeps it for sampling
- Parameters:
num_samples – Number of samples to generate
- Returns:
Generated dataframe
- class sim4rec.modules.SDVDataGenerator(label: str, id_column_name: str, model_name: str = 'gaussiancopula', parallelization_level: int = 1, device_name: str = 'cpu', seed: int | None = None)
Synthetic data generator that offers several models from the SDV library
- Parameters:
label – Generator string label
id_column_name – Column name for identifier
model_name – Name of a SDV model. Possible values are: [‘copulagan’, ‘ctgan’, ‘gaussiancopula’, ‘tvae’], defaults to ‘gaussiancopula’
parallelization_level – Parallelization level, defaults to 1
device_name – PyTorch device name, defaults to ‘cpu’
seed – Fixes seed sequence to use during multiple generator calls, defaults to None
- fit(df: DataFrame) None
Fits a generation model on the passed dataframe. Pass only feature columns
- Parameters:
df – Dataframe to fit on
- generate(num_samples: int) DataFrame
Generates a number of samples from the fitted model and keeps them for sampling
- Parameters:
num_samples – Number of samples to generate
- Returns:
Generated dataframe
- static load(filename: str)
Loads the generator model from the file
- Parameters:
filename – Path to the file
- Returns:
Generator instance with restored model
- save_model(filename: str)
Saves the generator model to a file. Note that only the fitted model is saved, not the generated dataframe
- Parameters:
filename – Path to the file
- setDevice(device_name: str) None
Changes the current device. Note that for the gaussiancopula model, only cpu is supported
- Parameters:
device_name – PyTorch device name
- class sim4rec.modules.CompositeGenerator(generators: List[GeneratorBase], label: str, weights: Iterable | None = None)
Wrapper for sampling from multiple generators. Use the weights parameter to control the sampling fraction for each of the generators
- Parameters:
generators – List of generators
label – Generator string label
weights – Weights for each of the generators. Weights must be normalized (sum to 1), defaults to None
- generate(num_samples: int) None
Calls generate() on each generator with a number of samples proportional to its weight, so that num_samples are generated in total. Calling this method saves you from calling generate() on each generator separately
- Parameters:
num_samples – Total number of samples to generate
- sample(sample_frac: float) DataFrame
Samples a fraction of rows from generators according to the weights.
- Parameters:
sample_frac – Fraction of rows
- Returns:
Sampled dataframe
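As an illustration of splitting a total proportionally to weights, the sketch below divides num_samples into per-generator counts. split_by_weights is a hypothetical helper, and its rounding rule (largest fractional part first) is an assumption; the library may round differently.

```python
def split_by_weights(num_samples, weights):
    """Split num_samples into integer counts proportional to weights."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Truncate each share, then hand out the rows lost to truncation.
    counts = [int(num_samples * w) for w in weights]
    remainders = [num_samples * w - c for w, c in zip(weights, counts)]
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    shortfall = max(num_samples - sum(counts), 0)
    for i in order[:shortfall]:
        counts[i] += 1
    return counts


counts = split_by_weights(10, [0.5, 0.3, 0.2])
```

Whatever rounding rule is used, the per-generator counts should add up to the requested total.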
Embeddings
Embeddings can be useful for high-dimensional or highly sparse data and should be applied before running the main simulator pipeline. An autoencoder estimator and transformer are implemented here in case the existing Spark methods are not sufficient for your purpose. A usage example can be found in the notebooks directory.
- class sim4rec.modules.EncoderEstimator(inputCols: List[str], outputCols: List[str], hidden_dim: int, lr: float, batch_size: int, num_loader_workers: int, max_iter: int = 100, device_name: str = 'cpu', seed: int | None = None)
Estimator for encoder part of the autoencoder pipeline. Trains the encoder to process incoming data into latent representation
- Parameters:
inputCols – Column names to process
outputCols – List of output column names per latent coordinate. The length of outputCols will determine the embedding dimension size
hidden_dim – Size of hidden layers
lr – Learning rate
batch_size – Batch size during training process
num_loader_workers – Number of CPUs to use for the data loader
max_iter – Maximum number of iterations, defaults to 100
device_name – PyTorch device name, defaults to ‘cpu’
- class sim4rec.modules.EncoderTransformer(inputCols: List[str], outputCols: List[str], encoder: Encoder, device_name: str = 'cpu')
Encoder transformer that transforms incoming columns into latent representation. Output data will be appended to dataframe and named according to outputCols parameter
- Parameters:
inputCols – Column names to process
outputCols – List of output column names per latent coordinate. The length of outputCols must be equal to embedding dimension of a trained encoder
encoder – Trained encoder
device_name – PyTorch device name, defaults to ‘cpu’
Items selectors
These Spark transformers assign items to given users when forming the candidate pairs that the recommendation system scores. A selector is optional if the recommendation algorithm can recommend any item without restrictions. Implement your own selector when items must be chosen for particular users by custom logic: a selector can encode business rules (e.g. some items are not available to a user), act as a simple recommendation model that generates candidates, or create user-specific items (e.g. price offers) online. To implement a custom selector, derive from the ItemSelectionEstimator base class to implement any pre-calculation logic, and from ItemSelectionTransformer to create the pairs. Both classes inherit from Spark’s Estimator and Transformer classes, so to define fit() and transform() you only need to override _fit() and _transform().
- class sim4rec.modules.ItemSelectionEstimator(userKeyColumn: str = None, itemKeyColumn: str = None)
Base class for item selection estimator
- class sim4rec.modules.ItemSelectionTransformer(userKeyColumn: str = None, itemKeyColumn: str = None)
Base class for item selection transformer. transform() will be used to create user-item pairs
- class sim4rec.modules.CrossJoinItemEstimator(k: int, userKeyColumn: str | None = None, itemKeyColumn: str | None = None, seed: int | None = None)
Assigns k items for every user from random items subsample
- Parameters:
k – Number of items for every user
userKeyColumn – Users identifier column, defaults to None
itemKeyColumn – Items identifier column, defaults to None
seed – Random state seed, defaults to None
- _fit(df: DataFrame)
Fits estimator with items dataframe
- Parameters:
df – Items dataframe
- Returns:
CrossJoinItemTransformer instance
- class sim4rec.modules.CrossJoinItemTransformer(item_df: DataFrame, k: int, userKeyColumn: str | None = None, itemKeyColumn: str | None = None, seed: int | None = None)
Assigns k items for every user from random items subsample
- _transform(df: DataFrame)
Takes a users dataframe and assigns the defined number of items
- Parameters:
df – Users dataframe
- Returns:
Users cross join on random items subsample
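A cross join of users with a random item subsample can be sketched in plain Python as follows. cross_join_random_items is a hypothetical helper, not the library's implementation, and drawing one subsample shared by all users is an assumption read from the description "Users cross join on random items subsample".

```python
import random
from itertools import product


def cross_join_random_items(user_ids, item_ids, k, seed=None):
    """Pair every user with the same random subsample of k items."""
    rng = random.Random(seed)
    item_subsample = rng.sample(list(item_ids), k)
    # Cross join: every user is paired with every item of the subsample.
    return list(product(user_ids, item_subsample))


pairs = cross_join_random_items(user_ids=[1, 2, 3], item_ids=range(100), k=5, seed=42)
```

With 3 users and k = 5 the result holds 15 candidate pairs, which is why k should stay small when the user sample is large.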
Simulator
The Simulator class drives the simulation process: it connects the different parts of the module, such as generators and response pipelines, and saves the results to a given directory. The simulation process consists of the following steps:
1. Sampling random real or synthetic users
2. Creating candidate user-item pairs for a recommendation algorithm
3. Prediction by a recommendation system
4. Evaluating responses to the given recommendations
5. Updating the history log
6. Metrics evaluation
7. Refitting the recommendation model with new data
Some of the steps can be skipped depending on the task you perform. For example, you don’t need the second step if your algorithm doesn’t take user-item pairs as input, and you don’t need to refit the model if you only want to evaluate it on some data. For more usage, please refer to the examples.
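The steps listed above can be sketched as a toy loop in plain Python. Everything here is a hypothetical stand-in: the recommend and respond functions, the tuple-based log and the hit-rate metric are illustrative assumptions and do not use the Simulator API.

```python
import random

rng = random.Random(0)
users = list(range(20))
items = list(range(10))
log = []  # history log of (user, item, response, iteration) records


def recommend(pairs, history):
    # Toy "model": score each candidate by how often the item was clicked before.
    clicks = {}
    for _, item, response, _ in history:
        clicks[item] = clicks.get(item, 0) + response
    best = {}
    for user, item in pairs:
        score = clicks.get(item, 0)
        if user not in best or score > best[user][1]:
            best[user] = (item, score)
    return [(user, item) for user, (item, _) in best.items()]


def respond(user, item):
    # Toy response function: a "click" when user and item ids share parity.
    return 1 if (user + item) % 2 == 0 else 0


hit_rates = []
for iteration in range(3):
    sampled_users = rng.sample(users, 5)                         # sample users
    candidates = rng.sample(items, 4)
    pairs = [(u, i) for u in sampled_users for i in candidates]  # candidate pairs
    recs = recommend(pairs, log)                                 # prediction
    responses = [(u, i, respond(u, i)) for u, i in recs]         # responses
    log.extend((u, i, r, iteration) for u, i, r in responses)    # update the log
    hit_rates.append(sum(r for _, _, r in responses) / len(responses))  # metrics
    # Refitting: the toy model re-reads the whole log on the next recommend() call.
```

In this sketch the refit step is implicit because the model is recomputed from the log each time; with a trained recommender it would be an explicit call inside the loop.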
- class sim4rec.modules.Simulator(user_gen: GeneratorBase, item_gen: GeneratorBase, data_dir: str, log_df: DataFrame | None = None, user_key_col: str = 'user_idx', item_key_col: str = 'item_idx', spark_session: SparkSession | None = None)
Simulator for recommendation systems. It uses the user and item data passed to it to simulate users’ responses to recommended items
- Parameters:
user_gen – Users data generator instance
item_gen – Items data generator instance
log_df – The history log with user-item pairs with other necessary fields. During the simulation the results will be appended to this log on update_log() call, defaults to None
user_key_col – User identifier column name, defaults to ‘user_idx’
item_key_col – Item identifier column name, defaults to ‘item_idx’
data_dir – Directory name to save simulator data
spark_session – Spark session to use, defaults to None
- clear_log() None
Clears the log
- get_log(user_df: DataFrame) DataFrame
Returns log for users listed in passed users’ dataframe
- Parameters:
user_df – Dataframe with user identifiers to get log for
- Returns:
Users’ history log. Will return None, if there is no log data
- get_user_items(user_df: DataFrame, selector: Transformer) Tuple[DataFrame, DataFrame]
Forms candidate pairs to pass to the recommendation algorithm based on the provided users
- Parameters:
user_df – Users dataframe with features and identifiers
selector – Transformer to use for creating user-item pairs
- Returns:
Tuple of user-item pairs and log dataframes which will be used by recommendation algorithm. Will return None as a log, if there is no log data
- sample_items(frac_items: float) DataFrame
Samples a fraction of random items
- Parameters:
frac_items – Fraction of items to sample from item generator
- Returns:
Sampled items dataframe
- sample_responses(recs_df: DataFrame, user_features: DataFrame, item_features: DataFrame, action_models: PipelineModel) DataFrame
Simulates the actions users took on their recommended items
- Parameters:
recs_df – Dataframe with recommendations. Must contain user’s and item’s identifier columns. Other columns will be ignored
user_features – Users dataframe with features and identifiers
item_features – Items dataframe with features and identifiers
action_models – Spark pipeline to evaluate responses
- Returns:
DataFrame with user-item pairs and the respective actions
- sample_users(frac_users: float) DataFrame
Samples a fraction of random users
- Parameters:
frac_users – Fraction of users to sample from user generator
- Returns:
Sampled users dataframe
- update_log(log: DataFrame, iteration: int | str) None
Appends the passed log to the existing one
- Parameters:
log – The log with user-item pairs and their respective necessary fields. If there was no log before this call, the log schema is remembered and future logs will be compared against it. To reset the current log and the schema, see clear_log()
iteration – Iteration label or index
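The log behaviour described for update_log(), get_log() and clear_log() can be sketched with an in-memory stand-in. ToyLog is hypothetical and is not the Simulator's storage; the "iter" column name and the exact schema comparison are assumptions made for illustration.

```python
class ToyLog:
    """Hypothetical in-memory log with a remembered schema (not sim4rec code)."""

    def __init__(self):
        self.rows = []
        self.schema = None  # column names remembered from the first log

    def update_log(self, rows, iteration):
        rows = list(rows)
        for row in rows:
            columns = tuple(sorted(row))
            if self.schema is None:
                # First log seen: remember its columns as the schema.
                self.schema = columns
            elif columns != self.schema:
                raise ValueError(f"log schema mismatch: {columns} != {self.schema}")
        # "iter" is a hypothetical column name for the iteration label.
        self.rows.extend({**row, "iter": iteration} for row in rows)

    def get_log(self, user_ids):
        # Mirrors the documented behaviour of returning None for an empty log.
        if not self.rows:
            return None
        wanted = set(user_ids)
        return [row for row in self.rows if row["user_idx"] in wanted]

    def clear_log(self):
        # Resets both the stored rows and the remembered schema.
        self.rows = []
        self.schema = None


toy_log = ToyLog()
toy_log.update_log([{"user_idx": 1, "item_idx": 7, "response": 1}], iteration=0)
toy_log.update_log([{"user_idx": 2, "item_idx": 3, "response": 0}], iteration="warmup")
```

The iteration argument accepts either an index or a string label, matching the documented `int | str` type.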
Evaluation
- class sim4rec.modules.EvaluateMetrics(userKeyCol: str, itemKeyCol: str, predictionCol: str, labelCol: str, replay_label_filter: float = 1.0, replay_metrics: Dict[Metric, int | List[int]] | None = None, mllib_metrics: str | List[str] | None = None)
Metric evaluator for recommendation systems and response functions. The class lets you evaluate the quality of a response function on historical data, or of a recommender system either on historical data or on the results of an experiment in the simulator. It calculates several metrics at once, using metrics from the Spark MLlib and RePlay libraries. A created instance is callable on a dataframe in the format
user_id, item_id, predicted relevance/response, true relevance/response
which you can usually obtain from the simulator’s sample_responses() or from log data with recommendation algorithm scores. When RePlay metrics are requested, it additionally filters the passed dataframe to keep only the necessary responses (e.g. where the response is equal to 1).
- Parameters:
userKeyCol – User identifier column name
itemKeyCol – Item identifier column name
predictionCol – Predicted scores column name
labelCol – True label column name
replay_label_filter – RePlay metrics assume that only positive responses are presented in ground truth data. All user-item pairs with col(labelCol) >= replay_label_filter condition are treated as positive responses during RePlay metrics calculation, defaults to 1.0
replay_metrics – Dictionary with RePlay metrics. See https://sb-ai-lab.github.io/RePlay/pages/modules/metrics.html for information about available metrics and their descriptions. The dictionary format is the same as in the Experiment class of the RePlay library, defaults to None
mllib_metrics – Metrics to calculate from spark’s mllib. See REGRESSION_METRICS, MULTICLASS_METRICS, BINARY_METRICS for available values, defaults to None
- __call__(df: DataFrame) Dict[str, float]
Performs metrics calculations on passed dataframe
- Parameters:
df – Spark dataframe with userKeyCol, itemKeyCol, predictionCol and labelCol columns
- Returns:
Dictionary with metrics
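The effect of replay_label_filter can be illustrated on plain Python rows. positive_responses is a hypothetical helper that applies the documented condition `labelCol >= replay_label_filter`; it is a sketch of the filtering rule only, not of the evaluator itself.

```python
def positive_responses(rows, label_col, replay_label_filter=1.0):
    """Keep only rows whose label counts as a positive response."""
    return [row for row in rows if row[label_col] >= replay_label_filter]


rows = [
    {"user_idx": 1, "item_idx": 10, "relevance": 0.9, "response": 1.0},
    {"user_idx": 1, "item_idx": 11, "relevance": 0.4, "response": 0.0},
    {"user_idx": 2, "item_idx": 10, "relevance": 0.7, "response": 1.0},
]
ground_truth = positive_responses(rows, label_col="response")
```

For graded responses, raising the threshold (for example to 4.0 on a 1 to 5 rating scale) changes which pairs are treated as ground truth.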
- sim4rec.modules.evaluate_synthetic(synth_df: DataFrame, real_df: DataFrame) dict
Evaluates the quality of synthetic data against real. The following metrics will be calculated:
LogisticDetection: The metric evaluates how hard it is to distinguish the synthetic data from the real data by using a Logistic regression model
SVCDetection: The metric evaluates how hard it is to distinguish the synthetic data from the real data by using a C-Support Vector Classification model
KSTest: This metric uses the two-sample Kolmogorov-Smirnov test to compare the distributions of continuous columns using the empirical CDF
ContinuousKLDivergence: This approximates the KL divergence by binning the continuous values to turn them into categorical values and then computing the relative entropy
- Parameters:
synth_df – Synthetic data without any identifiers
real_df – Real data without any identifiers
- Returns:
Dictionary with metrics on synthetic data quality
- class sim4rec.modules.QualityControlObjective(userKeyCol: str, itemKeyCol: str, predictionCol: str, labelCol: str, relevanceCol: str, response_function: Transformer, replay_metrics: Dict[Metric, int | List[int]] | None)
QualityControlObjective evaluates the quality of a response function by measuring how similar the results of a model trained on real data are to those of a model trained with the simulator. The calculated function is
\[ 1 - KS(predictionCol, labelCol) + DKL_{norm}(predictionCol, labelCol) - \frac{1}{N} \sum_{i=1}^{N} \left| QM_{syn}^{i}(recs_{synthetic}, ground\_truth_{synthetic}) - QM_{real}^{i}(recs_{real}, ground\_truth_{real}) \right|, \]
where
\[ KS = \sup_{x} |Q(x) - P(x)| \quad \text{(i.e. the KS test statistic)}, \qquad DKL_{norm} = \frac{1}{1 + DKL}. \]
A greater value indicates more similarity between the models’ results, and a lower value indicates dissimilarity. For the KS test and the KL divergence, the predicted value is the output of response_function on pairs from the real log, and its distribution is compared with that of the real responses. The QM terms in the formula above are metrics from the RePlay library: they are computed from the ground truth and predicted values for both models, and the objective measures how close the resulting metric values are to each other.
- Parameters:
userKeyCol – User identifier column name
itemKeyCol – Item identifier column name
predictionCol – Prediction column name, which response_function will create
labelCol – Column name with ground truth response values
relevanceCol – Relevance column name for RePlay metrics. For ground truth dataframe it should be response score and for dataframe with recommendations it should be the predicted relevance from recommendation algorithm
response_function – Spark’s transformer which predicts the response value
replay_metrics – Dictionary with RePlay metrics. See https://sb-ai-lab.github.io/RePlay/pages/modules/metrics.html for information about available metrics and their descriptions. The dictionary format is the same as in the Experiment class of the RePlay library. Those metrics will be used as QM in the objective above
- __call__(test_log: DataFrame, user_features: DataFrame, item_features: DataFrame, real_recs: DataFrame, real_ground_truth: DataFrame, synthetic_recs: DataFrame, synthetic_ground_truth: DataFrame) float
Calculates the models’ similarity value. Note that the recommendation dataframes for both synthetic and real data must include only users from the ground truth dataframe
- Parameters:
test_log – Real log dataframe with response values
user_features – Users features dataframe with identifier
item_features – Items features dataframe with identifier
real_recs – Recommendations dataframe from model trained on real dataset
real_ground_truth – Real log dataframe with only positive responses
synthetic_recs – Recommendations dataframe from model trained with simulator
synthetic_ground_truth – Simulator’s log dataframe with only positive responses
- Returns:
Function value
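Once its ingredients are known, the objective reduces to simple arithmetic. The sketch below assumes the KS statistic, the raw KL divergence and the per-metric values have already been computed; quality_control_objective is a hypothetical helper, not the class's implementation, and it assumes at least one quality metric is supplied.

```python
def quality_control_objective(ks, dkl, qm_synthetic, qm_real):
    """Combine KS, normalized KL and metric gaps as in the formula above."""
    if len(qm_synthetic) != len(qm_real) or not qm_real:
        raise ValueError("need the same non-zero number of metric values")
    dkl_norm = 1.0 / (1.0 + dkl)
    # Mean absolute gap between synthetic and real quality metrics.
    mean_gap = sum(abs(s - r) for s, r in zip(qm_synthetic, qm_real)) / len(qm_real)
    return 1.0 - ks + dkl_norm - mean_gap


best_case = quality_control_objective(
    ks=0.0, dkl=0.0, qm_synthetic=[0.5, 0.25], qm_real=[0.5, 0.25]
)
```

With identical distributions and identical metric values each term is at its best, so under this reading the value is bounded above by 2.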
- sim4rec.modules.ks_test(df: DataFrame, predCol: str, labelCol: str) float
Kolmogorov-Smirnov test on two dataframe columns
- Parameters:
df – Dataframe with two target columns
predCol – Column name with values to test
labelCol – Column name with values to test against
- Returns:
Result of KS test
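The two-sample KS statistic, the largest gap between two empirical CDFs, can be computed in a few lines of plain Python. ks_statistic is a hypothetical helper operating on lists; it illustrates the statistic only and makes no claim about how ks_test computes it on Spark columns.

```python
from bisect import bisect_right


def ks_statistic(sample_p, sample_q):
    """Largest absolute gap between the empirical CDFs of two samples."""
    p = sorted(sample_p)
    q = sorted(sample_q)

    def ecdf(sorted_values, x):
        # Share of values that are <= x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # Both ECDFs are step functions, so the supremum is reached at a data point.
    return max(abs(ecdf(p, x) - ecdf(q, x)) for x in p + q)


shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value of 0 means the empirical distributions coincide, and 1 means the samples do not overlap at all.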
- sim4rec.modules.kl_divergence(df: DataFrame, predCol: str, labelCol: str) float
Normalized Kullback–Leibler divergence on two dataframe columns. The normalization is as follows:
\[\frac{1}{1 + KL\_div}\]
- Parameters:
df – Dataframe with two target columns
predCol – First column name
labelCol – Second column name
- Returns:
Result of KL divergence
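The normalization can be illustrated on two already binned distributions. normalized_kl is a hypothetical helper: the use of histogram counts, the eps floor for empty bins and the direction of the divergence (first argument relative to second) are assumptions of this sketch, not documented behaviour of kl_divergence.

```python
import math


def normalized_kl(p_counts, q_counts, eps=1e-10):
    """Return 1 / (1 + KL(P || Q)) for two histograms over the same bins."""
    p_total = sum(p_counts)
    q_total = sum(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)  # floor avoids division by zero
        if p > 0:
            kl += p * math.log(p / q)
    return 1.0 / (1.0 + kl)


same = normalized_kl([5, 5], [5, 5])
different = normalized_kl([9, 1], [5, 5])
```

The normalization maps a divergence of 0 to 1 and larger divergences toward 0, so a higher value means more similar distributions.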