Utils
Time Smoothing
The time module provides functions to apply time smoothing to item or interaction relevance.
- replay.utils.time.smoothe_time(log, decay=30, limit=0.1, kind='exp')
Weighs the relevance column with a time-dependent weight.

- Parameters
  - log (Union[DataFrame, DataFrame, DataFrame]) – interactions log
  - decay (float) – number of days after which the weight is reduced by half, must be greater than 1
  - limit (float) – minimal value the weight can reach
  - kind (str) – type of smoothing, one of [power, exp, linear]. Corresponding functions are power: age^c, exp: c^age, linear: 1 - c*age
- Returns
  - modified DataFrame
>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import smoothe_time
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1
Power smoothing falls quickly in the beginning but decays slowly afterwards as age^c.

>>> (
...     smoothe_time(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|    0.639|
|       1|2099-03-20 00:00:00|   0.6546|
|       2|2099-03-22 00:00:00|   0.6941|
|       3|2099-03-25 00:00:00|   0.7994|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
Exponential smoothing is the other way around: old objects decay more quickly, as c^age.

>>> (
...     smoothe_time(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8312|
|       1|2099-03-20 00:00:00|   0.8507|
|       2|2099-03-22 00:00:00|   0.8909|
|       3|2099-03-25 00:00:00|   0.9548|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
The last type is linear smoothing: 1 - c*age.

>>> (
...     smoothe_time(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8667|
|       1|2099-03-20 00:00:00|   0.8833|
|       2|2099-03-22 00:00:00|   0.9167|
|       3|2099-03-25 00:00:00|   0.9667|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
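The decay behavior can be checked by hand: each kind picks its coefficient c so that the weight at age == decay equals 0.5, and the result never drops below limit. Below is a plain-Python sketch that reproduces the 2099-03-19 rows above (age 8 days from the newest interaction). The one-day age offset in the power kind and the max(..., limit) clamp are inferred from the example output, not taken from the library source.

>>> from math import log
>>> def weight(age, kind="exp", decay=30, limit=0.1):
...     """Hedged re-implementation sketch of the three smoothing kinds."""
...     if kind == "exp":      # c^age, with c chosen so that c**decay == 0.5
...         w = (0.5 ** (1 / decay)) ** age
...     elif kind == "power":  # age^c, offset by one day so age 0 keeps weight 1
...         w = (age + 1) ** (log(0.5) / log(decay))
...     else:                  # linear: 1 - c*age, with c == 0.5 / decay
...         w = 1 - age * 0.5 / decay
...     return max(w, limit)   # weights appear to be clamped at `limit`
>>> [round(weight(8, k), 4) for k in ("power", "exp", "linear")]
[0.639, 0.8312, 0.8667]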
These examples use a constant relevance of 1, so the resulting value equals the time-dependent weight itself. In general, the output is an updated relevance: the original relevance multiplied by the weight.
>>> d = {
...     "item_idx": [1, 2, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22"],
...     "relevance": [10, 3, 0.1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19       10.0
1         2  2099-03-20        3.0
2         3  2099-03-22        0.1
>>> (
...     smoothe_time(df)
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   9.3303|
|       2|2099-03-20 00:00:00|   2.8645|
|       3|2099-03-22 00:00:00|      0.1|
+--------+-------------------+---------+
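With the defaults (kind="exp", decay=30) these numbers are just relevance times the exponential weight: the 2099-03-19 row is three days older than the newest one, so its weight is 0.5^(3/30) ≈ 0.933 and 10 · 0.933 ≈ 9.3303; likewise 3 · 0.5^(2/30) ≈ 2.8645, while the newest row keeps its original relevance 0.1.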
- replay.utils.time.get_item_recency(log, decay=30, limit=0.1, kind='exp')
Calculate a weight for each item showing when the majority of interactions with it happened.
- Parameters
  - log (Union[DataFrame, DataFrame, DataFrame]) – interactions log
  - decay (float) – number of days after which the weight is reduced by half, must be greater than 1
  - limit (float) – minimal value the weight can reach
  - kind (str) – type of smoothing, one of [power, exp, linear]. Corresponding functions are power: age^c, exp: c^age, linear: 1 - c*age
- Returns
  - DataFrame with item weights
>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import get_item_recency
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1
Age in days is calculated for every item and then transformed into a weight. There are three types of smoothing available: power, exp and linear. Each type calculates a parameter c based on the decay argument, so that an item with age == decay has weight 0.5.

Power smoothing falls quickly in the beginning but decays slowly afterwards as age^c.

>>> (
...     get_item_recency(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.6632|
|       2|2099-03-22 00:00:00|   0.7204|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
Exponential smoothing is the other way around: old objects decay more quickly, as c^age.

>>> (
...     get_item_recency(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8606|
|       2|2099-03-22 00:00:00|   0.9117|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
The last type is linear smoothing: 1 - c*age.

>>> (
...     get_item_recency(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8917|
|       2|2099-03-22 00:00:00|   0.9333|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
This function does not take relevance values of interactions into account. Only item age is used.
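The half-day timestamps in the output above (e.g. 2099-03-19 12:00:00 for item 1) indicate that each item's interactions are first aggregated into a single representative timestamp, apparently the mean, which is then smoothed as before. A pandas sketch that reproduces item 1's exp weight; the averaging step is inferred from the example output rather than from the library source:

>>> import pandas as pd
>>> ts = pd.to_datetime(["2099-03-19", "2099-03-20"]).mean()  # item 1's mean timestamp
>>> ts
Timestamp('2099-03-19 12:00:00')
>>> newest = pd.Timestamp("2099-03-26")          # the latest per-item mean (item 3)
>>> age = (newest - ts) / pd.Timedelta(days=1)   # 6.5 days
>>> round((0.5 ** (1 / 30)) ** age, 4)           # exp weight with decay=30
0.8606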
Serializer
You can save trained models to disk and restore them later with the save and load functions.
- replay.utils.model_handler.save(model, path, overwrite=False)
Save a fitted model to disk as a folder.

- Parameters
  - model (BaseRecommender) – trained recommender
  - path (Union[str, Path]) – destination where model files will be stored
- replay.utils.model_handler.load(path, model_type=None)
Load a saved model from disk.

- Parameters
  - path (str) – path to model folder
- Return type
  - BaseRecommender
- Returns
  - Restored trained model
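A minimal round-trip sketch. Here model stands for any already-fitted RePlay recommender, and the path is illustrative:

>>> from replay.utils.model_handler import save, load
>>> save(model, "/tmp/my_model", overwrite=True)  # persist the fitted model as a folder
>>> restored = load("/tmp/my_model")              # model_type defaults to None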
Distributions
Item Distribution
Calculates item popularity in recommendations using 10 popularity bins.
- replay.utils.distributions.item_distribution(log, recommendations, k, allow_collect_to_master=False)
Calculate item distribution in log and recommendations.

- Parameters
  - log (Union[DataFrame, DataFrame, DataFrame]) – historical DataFrame used to calculate popularity
  - recommendations (Union[DataFrame, DataFrame, DataFrame]) – model recommendations
  - k (int) – length of a recommendation list
- Return type
  - DataFrame
- Returns
  - DataFrame with results
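A hedged usage sketch. Here log and recs stand for pre-existing historical-interaction and recommendation DataFrames; the variable names and the use of .show() (which assumes a Spark result, as in the examples above) are illustrative:

>>> from replay.utils.distributions import item_distribution
>>> dist = item_distribution(log, recs, k=10)  # k is the recommendation list length
>>> dist.show()                                # inspect the resulting distribution DataFrame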