Utils
Time Smoothing
The time module provides functions to apply time smoothing to item or interaction relevance.
- replay.utils.time.smoothe_time(log, decay=30, limit=0.1, kind='exp')
Weighs the relevance column with a time-dependent weight.

- Parameters
  - log (Union[DataFrame, DataFrame, DataFrame]) – interactions log
  - decay (float) – number of days after which the weight is reduced by half, must be greater than 1
  - limit (float) – minimal value the weight can reach
  - kind (str) – type of smoothing, one of [power, exp, linear]. Corresponding functions are power: age^c, exp: c^age, linear: 1 - c*age
- Returns
  - modified DataFrame
>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import smoothe_time
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1
Power smoothing falls quickly in the beginning but decays slowly afterwards as age^c.

>>> (
...     smoothe_time(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|    0.639|
|       1|2099-03-20 00:00:00|   0.6546|
|       2|2099-03-22 00:00:00|   0.6941|
|       3|2099-03-25 00:00:00|   0.7994|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
Exponential smoothing is the other way around: old objects decay more quickly, as c^age.

>>> (
...     smoothe_time(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8312|
|       1|2099-03-20 00:00:00|   0.8507|
|       2|2099-03-22 00:00:00|   0.8909|
|       3|2099-03-25 00:00:00|   0.9548|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
The last type is linear smoothing: 1 - c*age.

>>> (
...     smoothe_time(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8667|
|       1|2099-03-20 00:00:00|   0.8833|
|       2|2099-03-22 00:00:00|   0.9167|
|       3|2099-03-25 00:00:00|   0.9667|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+
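The decay behavior can be checked by hand: each kind picks its coefficient c so that the weight at age == decay equals 0.5, and the result never drops below limit. Below is a plain-Python sketch that reproduces the 2099-03-19 rows above (age 8 days from the newest interaction). The one-day age offset in the power kind and the max(..., limit) clamp are inferred from the example output, not taken from the library source.

>>> from math import log
>>> def weight(age, kind="exp", decay=30, limit=0.1):
...     """Hedged re-implementation sketch of the three smoothing kinds."""
...     if kind == "exp":      # c^age, with c chosen so that c**decay == 0.5
...         w = (0.5 ** (1 / decay)) ** age
...     elif kind == "power":  # age^c, offset by one day so age 0 keeps weight 1
...         w = (age + 1) ** (log(0.5) / log(decay))
...     else:                  # linear: 1 - c*age, with c == 0.5 / decay
...         w = 1 - age * 0.5 / decay
...     return max(w, limit)   # weights appear to be clamped at `limit`
>>> [round(weight(8, k), 4) for k in ("power", "exp", "linear")]
[0.639, 0.8312, 0.8667]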
These examples use a constant relevance of 1, so the resulting value equals the time-dependent weight itself. In general, the output is an updated relevance: the original relevance multiplied by the weight.
>>> d = {
...     "item_idx": [1, 2, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22"],
...     "relevance": [10, 3, 0.1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19       10.0
1         2  2099-03-20        3.0
2         3  2099-03-22        0.1
>>> (
...     smoothe_time(df)
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   9.3303|
|       2|2099-03-20 00:00:00|   2.8645|
|       3|2099-03-22 00:00:00|      0.1|
+--------+-------------------+---------+
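With the defaults (kind="exp", decay=30) these numbers are just relevance times the exponential weight: the 2099-03-19 row is three days older than the newest one, so its weight is 0.5^(3/30) ≈ 0.933 and 10 · 0.933 ≈ 9.3303; likewise 3 · 0.5^(2/30) ≈ 2.8645, while the newest row keeps its original relevance 0.1.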
- replay.utils.time.get_item_recency(log, decay=30, limit=0.1, kind='exp')
Calculate a weight for each item showing when the majority of interactions with it happened.
- Parameters
  - log (Union[DataFrame, DataFrame, DataFrame]) – interactions log
  - decay (float) – number of days after which the weight is reduced by half, must be greater than 1
  - limit (float) – minimal value the weight can reach
  - kind (str) – type of smoothing, one of [power, exp, linear]. Corresponding functions are power: age^c, exp: c^age, linear: 1 - c*age
- Returns
  - DataFrame with item weights
>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import get_item_recency
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1
Age in days is calculated for every item and then transformed into a weight. There are three types of smoothing available: power, exp and linear. Each type calculates a parameter c based on the decay argument, so that an item with age == decay has weight 0.5.

Power smoothing falls quickly in the beginning but decays slowly afterwards as age^c.

>>> (
...     get_item_recency(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.6632|
|       2|2099-03-22 00:00:00|   0.7204|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
Exponential smoothing is the other way around: old objects decay more quickly, as c^age.

>>> (
...     get_item_recency(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8606|
|       2|2099-03-22 00:00:00|   0.9117|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
The last type is linear smoothing: 1 - c*age.

>>> (
...     get_item_recency(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8917|
|       2|2099-03-22 00:00:00|   0.9333|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+
This function does not take relevance values of interactions into account. Only item age is used.
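The half-day timestamps in the output above (e.g. 2099-03-19 12:00:00 for item 1) indicate that each item's interactions are first aggregated into a single representative timestamp, apparently the mean, which is then smoothed as before. A pandas sketch that reproduces item 1's exp weight; the averaging step is inferred from the example output rather than from the library source:

>>> import pandas as pd
>>> ts = pd.to_datetime(["2099-03-19", "2099-03-20"]).mean()  # item 1's mean timestamp
>>> ts
Timestamp('2099-03-19 12:00:00')
>>> newest = pd.Timestamp("2099-03-26")          # the latest per-item mean (item 3)
>>> age = (newest - ts) / pd.Timedelta(days=1)   # 6.5 days
>>> round((0.5 ** (1 / 30)) ** age, 4)           # exp weight with decay=30
0.8606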
Serializer
You can save trained models to disk and restore them later with the save and load functions.
- replay.utils.model_handler.save(model, path, overwrite=False)
Save a fitted model to disk as a folder.

- Parameters
  - model (BaseRecommender) – trained recommender
  - path (Union[str, Path]) – destination where model files will be stored
- replay.utils.model_handler.load(path, model_type=None)
Load a saved model from disk.

- Parameters
  - path (str) – path to model folder
- Return type
  - BaseRecommender
- Returns
  - Restored trained model
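A minimal round-trip sketch. Here model stands for any already-fitted RePlay recommender, and the path is illustrative:

>>> from replay.utils.model_handler import save, load
>>> save(model, "/tmp/my_model", overwrite=True)  # persist the fitted model as a folder
>>> restored = load("/tmp/my_model")              # model_type defaults to None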
Distributions
Item Distribution
Calculates item popularity in recommendations using 10 popularity bins.
- replay.utils.distributions.item_distribution(log, recommendations, k, allow_collect_to_master=False)
Calculate item distribution in log and recommendations.

- Parameters
  - log (Union[DataFrame, DataFrame, DataFrame]) – historical DataFrame used to calculate popularity
  - recommendations (Union[DataFrame, DataFrame, DataFrame]) – model recommendations
  - k (int) – length of a recommendation list
- Return type
  - DataFrame
- Returns
  - DataFrame with results
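A hedged usage sketch. Here log and recs stand for pre-existing historical-interaction and recommendation DataFrames; the variable names and the use of .show() (which assumes a Spark result, as in the examples above) are illustrative:

>>> from replay.utils.distributions import item_distribution
>>> dist = item_distribution(log, recs, k=10)  # k is the recommendation list length
>>> dist.show()                                # inspect the resulting distribution DataFrame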