Utils
Time Smoothing
The time module provides functions to apply time smoothing to item or interaction relevance.
- replay.utils.time.smoothe_time(log, decay=30, limit=0.1, kind='exp')
Weighs the relevance column with a time-dependent weight.
- Parameters
log (Union[pandas.DataFrame, pyspark.sql.DataFrame]) – interactions log
decay (float) – number of days after which the weight is reduced by half, must be greater than 1
limit (float) – minimal value the weight can reach
kind (str) – type of smoothing, one of [power, exp, linear]. The corresponding weight functions are power: age^c, exp: c^age, linear: 1 - c*age
- Returns
modified DataFrame
>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import smoothe_time
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1
Power smoothing falls quickly in the beginning but decays slowly afterwards, as age^c.

>>> (
...     smoothe_time(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|    0.639|
|       1|2099-03-20 00:00:00|   0.6546|
|       2|2099-03-22 00:00:00|   0.6941|
|       3|2099-03-25 00:00:00|   0.7994|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
Exponential smoothing is the other way around: old objects decay more quickly, as c^age.

>>> (
...     smoothe_time(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8312|
|       1|2099-03-20 00:00:00|   0.8507|
|       2|2099-03-22 00:00:00|   0.8909|
|       3|2099-03-25 00:00:00|   0.9548|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
The last type is linear smoothing: 1 - c*age.

>>> (
...     smoothe_time(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8667|
|       1|2099-03-20 00:00:00|   0.8833|
|       2|2099-03-22 00:00:00|   0.9167|
|       3|2099-03-25 00:00:00|   0.9667|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
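The constant c in these formulas is not passed directly: it is derived from decay so that the weight reaches 0.5 when the age (in days from the newest interaction) is about decay. Below is a minimal NumPy sketch of one reconstruction consistent with the tables above; the weight function itself is illustrative and not part of the library API.

import numpy as np

def weight(age, decay=30, limit=0.1, kind="exp"):
    # Illustrative reconstruction of the time-dependent weight.
    if kind == "power":
        c = np.log(0.5) / np.log(decay)   # chosen so that decay**c == 0.5
        w = (age + 1) ** c
    elif kind == "exp":
        c = 0.5 ** (1 / decay)            # chosen so that c**decay == 0.5
        w = c ** age
    elif kind == "linear":
        c = 0.5 / decay                   # chosen so that 1 - c*decay == 0.5
        w = 1 - c * age
    return np.maximum(w, limit)           # the weight never drops below `limit`

weight(8, kind="power")   # ~0.639,  first row of the power table above
weight(8, kind="exp")     # ~0.8312, first row of the exp table above
weight(8, kind="linear")  # ~0.8667, first row of the linear table above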
These examples use a constant relevance of 1, so the resulting values equal the time-dependent weights themselves. In general, the result is the original relevance multiplied by the weight.
>>> d = {
...     "item_idx": [1, 2, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22"],
...     "relevance": [10, 3, 0.1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19       10.0
1         2  2099-03-20        3.0
2         3  2099-03-22        0.1
>>> (
...     smoothe_time(df)
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   9.3303|
|       2|2099-03-20 00:00:00|   2.8645|
|       3|2099-03-22 00:00:00|      0.1|
+--------+-------------------+---------+
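This output can be checked by hand for the default exp smoothing: ages relative to the newest interaction (2099-03-22) are 3, 2 and 0 days, and each relevance is multiplied by the corresponding weight.

c = 0.5 ** (1 / 30)   # default decay=30
10 * c ** 3           # ~9.3303 -> row for item 1
3 * c ** 2            # ~2.8645 -> row for item 2
0.1 * c ** 0          # 0.1     -> row for item 3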
- replay.utils.time.get_item_recency(log, decay=30, limit=0.1, kind='exp')
Calculate an item weight showing when the majority of interactions with the item happened.
- Parameters
log (Union[pandas.DataFrame, pyspark.sql.DataFrame]) – interactions log
decay (float) – number of days after which the weight is reduced by half, must be greater than 1
limit (float) – minimal value the weight can reach
kind (str) – type of smoothing, one of [power, exp, linear]. The corresponding weight functions are power: age^c, exp: c^age, linear: 1 - c*age
- Returns
DataFrame with item weights
>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import get_item_recency
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1
Age in days is calculated for every item and transformed into a weight by a smoothing function. Three types of smoothing are available: power, exp and linear. Each type derives a parameter c from the decay argument so that an item with age == decay has weight 0.5.

Power smoothing falls quickly in the beginning but decays slowly afterwards, as age^c.

>>> (
...     get_item_recency(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.6632|
|       2|2099-03-22 00:00:00|   0.7204|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
Exponential smoothing is the other way around: old objects decay more quickly, as c^age.

>>> (
...     get_item_recency(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8606|
|       2|2099-03-22 00:00:00|   0.9117|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
The last type is linear smoothing: 1 - c*age.

>>> (
...     get_item_recency(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8917|
|       2|2099-03-22 00:00:00|   0.9333|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
This function does not take relevance values of interactions into account. Only item age is used.
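Judging by the timestamps in the tables above (item 1 gets 2099-03-19 12:00:00, the midpoint of its two interactions), the per-item reference time appears to be the mean interaction timestamp, which is then weighted the same way as in smoothe_time. A small pandas sketch of that assumption:

import pandas as pd

interactions = pd.DataFrame({
    "item_idx": [1, 1, 2, 3, 3],
    "timestamp": pd.to_datetime(
        ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"]
    ),
})

# Mean timestamp per item reproduces the values shown above:
# item 1 -> 2099-03-19 12:00:00, item 2 -> 2099-03-22, item 3 -> 2099-03-26.
print(interactions.groupby("item_idx")["timestamp"].mean())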
Serializer
You can save trained models to disk and restore them later with the save and load functions.
- replay.utils.model_handler.save(model, path, overwrite=False)
Save fitted model to disk as a folder
- Parameters
model (BaseRecommender) – Trained recommender
path (Union[str, Path]) – destination where model files will be stored
overwrite (bool) – overwrite the files at path if it already exists
- replay.utils.model_handler.load(path, model_type=None)
Load saved model from disk
- Parameters
path (str) – path to model folder
model_type – model class of the saved model; if None, it is inferred from the saved files
- Returns
Restored trained model
- Return type
BaseRecommender
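A minimal usage sketch, assuming a Spark interactions DataFrame named log with user_idx, item_idx and relevance columns (a placeholder, not defined here):

from replay.models import PopRec
from replay.utils.model_handler import save, load

model = PopRec()
model.fit(log)  # `log` is the assumed interactions DataFrame

save(model, path="./checkpoints/pop_rec", overwrite=True)
restored = load("./checkpoints/pop_rec")  # returns the restored recommender
recs = restored.predict(log, k=10)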
Distributions
Item Distribution
Calculates item popularity in recommendations using 10 popularity bins.
- replay.utils.distributions.item_distribution(log, recommendations, k, allow_collect_to_master=False)
Calculate item distribution in log and recommendations.
- Parameters
log (Union[pandas.DataFrame, pyspark.sql.DataFrame]) – historical DataFrame used to calculate popularity
recommendations (Union[pandas.DataFrame, pyspark.sql.DataFrame]) – model recommendations
k (int) – length of a recommendation list
allow_collect_to_master (bool) – allow collecting data to the master (driver) node
- Returns
DataFrame with results
- Return type
DataFrame
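A usage sketch, assuming log and recs are interaction and recommendation DataFrames with user_idx, item_idx and relevance columns (both placeholders):

from replay.utils.distributions import item_distribution

# Distribution of items from `recs` across popularity bins computed on `log`.
dist = item_distribution(log, recs, k=10)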