Utils

Time Smoothing

The time module provides functions to apply time smoothing to item or interaction relevance.

replay.utils.time.smoothe_time(log, decay=30, limit=0.1, kind='exp')

Weights the relevance column with a time-dependent weight.

Parameters
  • log (Union[DataFrame, DataFrame, DataFrame]) – interactions log

  • decay (float) – number of days after which the weight is reduced by half, must be greater than 1

  • limit (float) – minimal value the weight can reach

  • kind (str) – type of smoothing, one of [power, exp, linear]. Corresponding functions are power: age^c, exp: c^age, linear: 1-c*age

Returns

modified DataFrame

>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import smoothe_time
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1

Power smoothing falls quickly in the beginning but decays slowly afterwards as age^c.

>>> (
...     smoothe_time(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|    0.639|
|       1|2099-03-20 00:00:00|   0.6546|
|       2|2099-03-22 00:00:00|   0.6941|
|       3|2099-03-25 00:00:00|   0.7994|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+

Exponential smoothing works the other way around: older objects decay more quickly, as c^age.

>>> (
...     smoothe_time(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8312|
|       1|2099-03-20 00:00:00|   0.8507|
|       2|2099-03-22 00:00:00|   0.8909|
|       3|2099-03-25 00:00:00|   0.9548|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+

The last type is linear smoothing: 1 - c*age.

>>> (
...     smoothe_time(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   0.8667|
|       1|2099-03-20 00:00:00|   0.8833|
|       2|2099-03-22 00:00:00|   0.9167|
|       3|2099-03-25 00:00:00|   0.9667|
|       3|2099-03-27 00:00:00|      1.0|
+--------+-------------------+---------+

These examples use a constant relevance of 1, so the resulting value equals the time-dependent weight itself. In general, the original relevance is multiplied by this weight to produce the updated relevance.

>>> d = {
...     "item_idx": [1, 2, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22"],
...     "relevance": [10, 3, 0.1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19       10.0
1         2  2099-03-20        3.0
2         3  2099-03-22        0.1
>>> (
...     smoothe_time(df)
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("timestamp")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 00:00:00|   9.3303|
|       2|2099-03-20 00:00:00|   2.8645|
|       3|2099-03-22 00:00:00|      0.1|
+--------+-------------------+---------+

replay.utils.time.get_item_recency(log, decay=30, limit=0.1, kind='exp')

Calculate an item weight that reflects when the majority of interactions with the item happened.

Parameters
  • log (Union[DataFrame, DataFrame, DataFrame]) – interactions log

  • decay (float) – number of days after which the weight is reduced by half, must be greater than 1

  • limit (float) – minimal value the weight can reach

  • kind (str) – type of smoothing, one of [power, exp, linear]. Corresponding functions are power: age^c, exp: c^age, linear: 1-c*age

Returns

DataFrame with item weights

>>> import pandas as pd
>>> from pyspark.sql.functions import round
>>> from replay.utils.time import get_item_recency
>>> d = {
...     "item_idx": [1, 1, 2, 3, 3],
...     "timestamp": ["2099-03-19", "2099-03-20", "2099-03-22", "2099-03-27", "2099-03-25"],
...     "relevance": [1, 1, 1, 1, 1],
... }
>>> df = pd.DataFrame(d)
>>> df
   item_idx   timestamp  relevance
0         1  2099-03-19          1
1         1  2099-03-20          1
2         2  2099-03-22          1
3         3  2099-03-27          1
4         3  2099-03-25          1

Age in days is calculated for every item and then transformed into a weight by the chosen function. Three smoothing types are available: power, exp and linear. Each type computes a parameter c from the decay argument so that an item with age == decay has a weight of 0.5.
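
For instance, with the default decay=30 the constants for the exp and linear kinds follow directly from this half-weight condition. The snippet below is an illustrative back-of-the-envelope check rather than library code; it assumes age is measured in days relative to the newest item timestamp, which matches the tables that follow (item 1's mean timestamp is 6.5 days older than item 3's).

>>> decay = 30
>>> c_exp = 0.5 ** (1 / decay)   # chosen so that c_exp ** decay == 0.5
>>> c_lin = 1 / (2 * decay)      # chosen so that 1 - c_lin * decay == 0.5
>>> age = 6.5                    # item 1 in the tables below
>>> round(c_exp ** age, 4), round(1 - c_lin * age, 4)
(0.8606, 0.8917)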

Power smoothing falls quickly in the beginning but decays slowly afterwards as age^c.

>>> (
...     get_item_recency(df, kind="power")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.6632|
|       2|2099-03-22 00:00:00|   0.7204|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+

Exponential smoothing works the other way around: older objects decay more quickly, as c^age.

>>> (
...     get_item_recency(df, kind="exp")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8606|
|       2|2099-03-22 00:00:00|   0.9117|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+

The last type is linear smoothing: 1 - c*age.

>>> (
...     get_item_recency(df, kind="linear")
...     .select("item_idx", "timestamp", round("relevance", 4).alias("relevance"))
...     .orderBy("item_idx")
...     .show()
... )
+--------+-------------------+---------+
|item_idx|          timestamp|relevance|
+--------+-------------------+---------+
|       1|2099-03-19 12:00:00|   0.8917|
|       2|2099-03-22 00:00:00|   0.9333|
|       3|2099-03-26 00:00:00|      1.0|
+--------+-------------------+---------+

This function does not take relevance values of interactions into account. Only item age is used.

Serializer

You can save trained models to disk and restore them later with the save and load functions.

replay.utils.model_handler.save(model, path, overwrite=False)

Save fitted model to disk as a folder

Parameters
  • model (BaseRecommender) – Trained recommender

  • path (Union[str, Path]) – destination where model files will be stored

replay.utils.model_handler.load(path, model_type=None)

Load saved model from disk

Parameters

path (str) – path to model folder

Return type

BaseRecommender

Returns

Restored trained model
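
A minimal usage sketch is shown below. PopRec and the train DataFrame are placeholders for illustration: any fitted recommender and any prepared interactions log can be used, assuming PopRec is importable from replay.models.

>>> from replay.models import PopRec
>>> from replay.utils.model_handler import save, load
>>> model = PopRec()
>>> model.fit(train)                     # train is an interactions log prepared elsewhere
>>> save(model, "./models/pop_rec", overwrite=True)
>>> restored = load("./models/pop_rec")  # returns the restored BaseRecommender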

Distributions

Item Distribution

Calculates item popularity in recommendations using 10 popularity bins.

replay.utils.distributions.item_distribution(log, recommendations, k, allow_collect_to_master=False)

Calculate item distribution in log and recommendations.

Parameters
  • log (Union[DataFrame, DataFrame, DataFrame]) – historical DataFrame used to calculate popularity

  • recommendations (Union[DataFrame, DataFrame, DataFrame]) – model recommendations

  • k (int) – length of a recommendation list

Return type

DataFrame

Returns

DataFrame with results
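
A hedged usage sketch follows. The names log and recs are placeholders for a historical interactions DataFrame and a recommendations DataFrame produced by a model; the result schema is not reproduced here, so no output is shown.

>>> from replay.utils.distributions import item_distribution
>>> dist = item_distribution(log, recs, k=10)  # popularity computed from log, recommendations truncated to k
>>> # dist is a DataFrame with item counts in the log and in the recommendations, split into popularity bins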