Metrics

Most metrics require a dataframe with recommendations and a dataframe with ground truth values, that is, the objects each user interacted with. All input dataframes must be of the same type.

  • recommendations (Union[pandas.DataFrame, spark.DataFrame, Dict]):

    If recommendations is a dict, each key is a user_id and each value is a list of (item_id, score) tuples. If recommendations is a Spark or Pandas dataframe, the names of the corresponding columns should be passed through the metric constructor.

  • ground_truth (Union[pandas.DataFrame, spark.DataFrame, Dict]):

    If ground_truth is a dict, each key is a user_id and each value is a list of item_ids. If ground_truth is a Spark or Pandas dataframe, its column names must match those of recommendations.
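For illustration, the dict form of both inputs might look like the sketch below (the user ids, item ids and scores are made up, following the format described above):

>>> recs_dict = {
...     1: [(3, 0.6), (7, 0.5), (10, 0.4)],   # user 1: (item_id, score) pairs, best first
...     2: [(5, 0.6), (8, 0.5)],              # user 2
... }
>>> groundtruth_dict = {
...     1: [7, 10, 5],   # items user 1 actually interacted with
...     2: [6, 7],
... }
>>> # such dicts can be passed directly to any metric, e.g. Precision(2)(recs_dict, groundtruth_dict)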

Metrics are calculated for all users present in ground_truth, so the result stays accurate even when the recommender system did not generate recommendations for every user. It is assumed that every user we calculate a metric for has positive interactions.

Every metric is calculated using the top K items for each user. It is also possible to calculate metrics for multiple values of K simultaneously.

Make sure your recommendations do not contain user-item duplicates, as duplicates can lead to incorrect results.
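If duplicates might be present, they can be dropped beforehand. For a Pandas input with the default column names, one possible way (a sketch using plain pandas, not part of the RePlay API):

>>> recommendations = (
...     recommendations
...     .sort_values("rating", ascending=False)                          # keep the highest-rated copy
...     .drop_duplicates(subset=["query_id", "item_id"], keep="first")
... )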

  • k (Union[Iterable[int], int]):

    a single number or a list specifying the truncation length of the recommendation list for each user

By default, metrics are averaged over users (replay.metrics.Mean), but you can alternatively use replay.metrics.Median. You can also get the half-width of the confidence interval for a given alpha with replay.metrics.ConfidenceInterval. To calculate the metric value for each user separately, most metrics accept the replay.metrics.PerUser mode.

To write your own aggregation kernel, inherit from replay.metrics.CalculationDescriptor and redefine two methods: spark and cpu.
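As an illustration only, a hypothetical kernel could look like the sketch below. The method names follow the description above, but the exact arguments that spark and cpu receive are assumptions and should be checked against the built-in descriptors (Mean, Median, ...):

>>> import numpy as np
>>> from replay.metrics import CalculationDescriptor
>>> class MaxAggregation(CalculationDescriptor):
...     """Hypothetical kernel: report the best per-user metric value."""
...     def cpu(self, values):
...         # assumption: `values` is an array-like of per-user metric values
...         return float(np.max(values))
...     def spark(self, distribution):
...         # assumption: `distribution` is a single-column Spark DataFrame of per-user values
...         column = distribution.columns[0]
...         return distribution.agg({column: "max"}).first()[0]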

For each metric, the calculation formula is given, because this is important for correct comparison of algorithms, as mentioned in our article.

You can also add new metrics.

Precision

class replay.metrics.Precision(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Mean percentage of relevant items among top K recommendations.

\[Precision@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{K}\]
\[Precision@K = \frac {\sum_{i=1}^{N}Precision@K(i)}{N}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> Precision(2)(recommendations, groundtruth)
{'Precision@2': 0.3333333333333333}
>>> Precision(2, mode=PerUser())(recommendations, groundtruth)
{'Precision-PerUser@2': {1: 0.5, 2: 0.0, 3: 0.5}}
>>> Precision(2, mode=Median())(recommendations, groundtruth)
{'Precision-Median@2': 0.5}
>>> Precision(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'Precision-ConfidenceInterval@2': 0.32666066409000905}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.
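If your dataframes use different column names, they can be passed to the constructor. A brief sketch (my_recs and my_groundtruth are hypothetical frames whose columns are named user_id, track_id and score):

>>> Precision(2, query_column="user_id", item_column="track_id", rating_column="score")(my_recs, my_groundtruth)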

Recall

class replay.metrics.Recall(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Recall measures the coverage of the recommended items and is defined as:

Mean percentage of relevant items that were shown among the top K recommendations.

\[Recall@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{|Rel_i|}\]
\[Recall@K = \frac {\sum_{i=1}^{N}Recall@K(i)}{N}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)

\(|Rel_i|\) – the number of relevant items for user \(i\)

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> Recall(2)(recommendations, groundtruth)
{'Recall@2': 0.12222222222222223}
>>> Recall(2, mode=PerUser())(recommendations, groundtruth)
{'Recall-PerUser@2': {1: 0.16666666666666666, 2: 0.0, 3: 0.2}}
>>> Recall(2, mode=Median())(recommendations, groundtruth)
{'Recall-Median@2': 0.16666666666666666}
>>> Recall(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'Recall-ConfidenceInterval@2': 0.12125130695058273}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

MAP

class replay.metrics.MAP(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Mean Average Precision – averages the Precision at relevant positions for each user and then takes the mean across all users.

\[ \begin{align}\begin{aligned}&AP@K(i) = \frac {1}{\min(K, |Rel_i|)} \sum_{j=1}^{K}\mathbb{1}_{r_{ij}}Precision@j(i)\\&MAP@K = \frac {\sum_{i=1}^{N}AP@K(i)}{N}\end{aligned}\end{align} \]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing if user \(i\) interacted with item \(j\)

\(|Rel_i|\) – the number of relevant items for user \(i\)

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> MAP(2)(recommendations, groundtruth)
{'MAP@2': 0.25}
>>> MAP(2, mode=PerUser())(recommendations, groundtruth)
{'MAP-PerUser@2': {1: 0.25, 2: 0.0, 3: 0.5}}
>>> MAP(2, mode=Median())(recommendations, groundtruth)
{'MAP-Median@2': 0.25}
>>> MAP(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'MAP-ConfidenceInterval@2': 0.282896433519043}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

MRR

class replay.metrics.MRR(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Mean Reciprocal Rank – Reciprocal Rank is the inverse position of the first relevant item among the top-k recommendations, \(\frac{1}{rank_i}\). This value is averaged over all users.

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> MRR(2)(recommendations, groundtruth)
{'MRR@2': 0.5}
>>> MRR(2, mode=PerUser())(recommendations, groundtruth)
{'MRR-PerUser@2': {1: 0.5, 2: 0.0, 3: 1.0}}
>>> MRR(2, mode=Median())(recommendations, groundtruth)
{'MRR-Median@2': 0.5}
>>> MRR(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'MRR-ConfidenceInterval@2': 0.565792867038086}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

NDCG

class replay.metrics.NDCG(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Normalized Discounted Cumulative Gain is a metric that takes into account positions of relevant items.

This is the binary version: it only takes into account whether the item was consumed or not; the relevance value is ignored.

\[DCG@K(i) = \sum_{j=1}^{K}\frac{\mathbb{1}_{r_{ij}}}{\log_2 (j+1)}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)

To get from \(DCG\) to \(nDCG\), we normalize it by the largest possible value of DCG for user \(i\) and recommendation length \(K\).

\[IDCG@K(i) = max(DCG@K(i)) = \sum_{j=1}^{K}\frac{\mathbb{1}_{j\le|Rel_i|}}{\log_2 (j+1)}\]
\[nDCG@K(i) = \frac {DCG@K(i)}{IDCG@K(i)}\]

\(|Rel_i|\) – number of relevant items for user \(i\)

The metric is averaged over users.

\[nDCG@K = \frac {\sum_{i=1}^{N}nDCG@K(i)}{N}\]
>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> NDCG(2)(recommendations, groundtruth)
{'NDCG@2': 0.3333333333333333}
>>> NDCG(2, mode=PerUser())(recommendations, groundtruth)
{'NDCG-PerUser@2': {1: 0.38685280723454163, 2: 0.0, 3: 0.6131471927654584}}
>>> NDCG(2, mode=Median())(recommendations, groundtruth)
{'NDCG-Median@2': 0.38685280723454163}
>>> NDCG(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'NDCG-ConfidenceInterval@2': 0.3508565839953337}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

HitRate

class replay.metrics.HitRate(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Percentage of users that have at least one correctly recommended item among top-k.

\[HitRate@K(i) = \max_{j \in [1..K]}\mathbb{1}_{r_{ij}}\]
\[HitRate@K = \frac {\sum_{i=1}^{N}HitRate@K(i)}{N}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function stating that user \(i\) interacted with item \(j\)

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> HitRate(2)(recommendations, groundtruth)
{'HitRate@2': 0.6666666666666666}
>>> HitRate(2, mode=PerUser())(recommendations, groundtruth)
{'HitRate-PerUser@2': {1: 1.0, 2: 0.0, 3: 1.0}}
>>> HitRate(2, mode=Median())(recommendations, groundtruth)
{'HitRate-Median@2': 1.0}
>>> HitRate(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'HitRate-ConfidenceInterval@2': 0.6533213281800181}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

RocAuc

class replay.metrics.RocAuc(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Receiver Operating Characteristic / Area Under the Curve is an aggregated performance measure that depends only on the order of recommended items. It can be interpreted as the fraction of object pairs (an object of class 1, an object of class 0) that the model ordered correctly. The bigger the AUC value, the better the classification model.

\[ROCAUC@K(i) = \frac {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{r_{si}<r_{ti}} \mathbb{1}_{gt_{si}<gt_{ti}}} {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{gt_{si}<gt_{ti}}}\]

\(\mathbb{1}_{r_{si}<r_{ti}}\) – indicator function showing that, for user \(i\), the recommendation score of item \(t\) is higher than that of item \(s\)

\(\mathbb{1}_{gt_{si}<gt_{ti}}\) – indicator function showing that user \(i\) values item \(t\) more than item \(s\).

The metric is averaged over all users.

\[ROCAUC@K = \frac {\sum_{i=1}^{N}ROCAUC@K(i)}{N}\]
>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> RocAuc(2)(recommendations, groundtruth)
{'RocAuc@2': 0.3333333333333333}
>>> RocAuc(2, mode=PerUser())(recommendations, groundtruth)
{'RocAuc-PerUser@2': {1: 0.0, 2: 0.0, 3: 1.0}}
>>> RocAuc(2, mode=Median())(recommendations, groundtruth)
{'RocAuc-Median@2': 0.0}
>>> RocAuc(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth)
{'RocAuc-ConfidenceInterval@2': 0.6533213281800181}
__call__(recommendations, ground_truth)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

Unexpectedness

class replay.metrics.Unexpectedness(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Fraction of recommended items that are not present in some baseline recommendations.

\[Unexpectedness@K(i) = 1 - \frac {\parallel R^{i}_{1..\min(K, \parallel R^{i} \parallel)} \cap BR^{i}_{1..\min(K, \parallel BR^{i} \parallel)} \parallel} {K}\]
\[Unexpectedness@K = \frac {1}{N}\sum_{i=1}^{N}Unexpectedness@K(i)\]

\(R_{1..j}^{i}\) – the first \(j\) recommendations for the \(i\)-th user.

\(BR_{1..j}^{i}\) – the first \(j\) base recommendations for the \(i\)-th user.

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> base_rec
   query_id  item_id  rating
0        1        3    0.5
1        1        7    0.5
2        1        2    0.7
3        2        5    0.6
4        2        8    0.6
5        2        3    0.3
6        3        4    1.0
7        3        9    0.5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> Unexpectedness([2, 4])(recommendations, base_rec)
{'Unexpectedness@2': 0.16666666666666666, 'Unexpectedness@4': 0.5}
>>> Unexpectedness([2, 4], mode=PerUser())(recommendations, base_rec)
{'Unexpectedness-PerUser@2': {1: 0.5, 2: 0.0, 3: 0.0},
 'Unexpectedness-PerUser@4': {1: 0.5, 2: 0.5, 3: 0.5}}
>>> Unexpectedness([2, 4], mode=Median())(recommendations, base_rec)
{'Unexpectedness-Median@2': 0.0, 'Unexpectedness-Median@4': 0.5}
>>> Unexpectedness([2, 4], mode=ConfidenceInterval(alpha=0.95))(recommendations, base_rec)
{'Unexpectedness-ConfidenceInterval@2': 0.32666066409000905,
 'Unexpectedness-ConfidenceInterval@4': 0.0}
__call__(recommendations, base_recommendations)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • base_recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): base model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

Coverage

class replay.metrics.Coverage(topk, query_column='query_id', item_column='item_id', rating_column='rating', allow_caching=True)

Metric calculation is as follows:

  • take the K recommendations with the highest scores for each user_id

  • count the number of distinct item_id values in these recommendations

  • divide it by the number of distinct items in the train dataset provided to the metric call

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> train
   query_id  item_id
0         1        5
1         1        6
2         1        8
3         1        9
4         1        2
5         2        5
6         2        8
7         2       11
8         2        1
9         2        3
10        3        4
11        3        9
12        3        2
>>> Coverage(2)(recommendations, train)
{'Coverage@2': 0.5555555555555556}
__call__(recommendations, train)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • train (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): train data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', allow_caching=True)
Parameters
  • topk (Union[List, int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • allow_caching (bool) – (bool): The flag for using caching to optimize calculations. Default: True.

CategoricalDiversity

class replay.metrics.CategoricalDiversity(topk, query_column='query_id', category_column='category_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Metric calculation is as follows:

  • take the K recommendations with the highest scores for each user_id

  • count the number of distinct category_id values in these recommendations and divide it by K

  • average this value over all users

>>> category_recommendations
   query_id  category_id  rating
0         1            3    0.6
1         1            7    0.5
2         1           10    0.4
3         1           11    0.3
4         1            2    0.2
5         2            5    0.6
6         2            8    0.5
7         2           11    0.4
8         2            1    0.3
9         2            3    0.2
10        3            4    1.0
11        3            9    0.5
12        3            2    0.1
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> CategoricalDiversity([3, 5])(category_recommendations)
{'CategoricalDiversity@3': 1.0, 'CategoricalDiversity@5': 0.8666666666666667}
>>> CategoricalDiversity([3, 5], mode=PerUser())(category_recommendations)
{'CategoricalDiversity-PerUser@3': {1: 1.0, 2: 1.0, 3: 1.0},
 'CategoricalDiversity-PerUser@5': {1: 1.0, 2: 1.0, 3: 0.6}}
>>> CategoricalDiversity([3, 5], mode=Median())(category_recommendations)
{'CategoricalDiversity-Median@3': 1.0,
 'CategoricalDiversity-Median@5': 1.0}
>>> CategoricalDiversity([3, 5], mode=ConfidenceInterval(alpha=0.95))(category_recommendations)
{'CategoricalDiversity-ConfidenceInterval@3': 0.0,
 'CategoricalDiversity-ConfidenceInterval@5': 0.2613285312720073}
__call__(recommendations)

Compute metric.

Parameters

recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, category and score columns. If a dict, each key is a user_id and each value is a list of (category, score) tuples.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', category_column='category_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List, int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • category_column (str) – (str): The name of the category column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

Novelty

class replay.metrics.Novelty(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Measures the fraction of items in the recommendation list that the user has not seen in the train dataset.

\[Novelty@K(i) = \frac {\parallel {R^{i}_{1..\min(K, \parallel R^{i} \parallel)} \setminus train^{i}} \parallel} {K}\]
\[Novelty@K = \frac {1}{N}\sum_{i=1}^{N}Novelty@K(i)\]

\(R^{i}\) – the recommendations for the \(i\)-th user.

\(R^{i}_{j}\) – the \(j\)-th recommended item for the \(i\)-th user.

\(R_{1..j}^{i}\) – the first \(j\) recommendations for the \(i\)-th user.

\(train^{i}\) – the train items of the \(i\)-th user.

\(N\) – the number of users.

Based on

P. Castells, S. Vargas, and J. Wang, Novelty and diversity metrics for recommender systems: choice, discovery and relevance, ECIR 2011. Link.

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> train
   query_id  item_id
0         1        5
1         1        6
2         1        8
3         1        9
4         1        2
5         2        5
6         2        8
7         2       11
8         2        1
9         2        3
10        3        4
11        3        9
12        3        2
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> Novelty(2)(recommendations, train)
{'Novelty@2': 0.3333333333333333}
>>> Novelty(2, mode=PerUser())(recommendations, train)
{'Novelty-PerUser@2': {1: 1.0, 2: 0.0, 3: 0.0}}
>>> Novelty(2, mode=Median())(recommendations, train)
{'Novelty-Median@2': 0.0}
>>> Novelty(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, train)
{'Novelty-ConfidenceInterval@2': 0.6533213281800181}
__call__(recommendations, train)

Compute metric.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, the items must be sorted in decreasing order of their scores.

  • train (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict, optional): train data. If a DataFrame, it must contain user and item columns.

Return type

Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]

Returns

metric values

__init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Parameters
  • topk (Union[List[int], int]) – (list or int): Consider the highest k scores in the ranking.

  • query_column (str) – (str): The name of the user column.

  • item_column (str) – (str): The name of the item column.

  • rating_column (str) – (str): The name of the score column.

  • mode (CalculationDescriptor) – (CalculationDescriptor): class for calculating aggregation metrics. Default: Mean.

Surprisal

class replay.metrics.Surprisal(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Measures how many surprising rare items are present in recommendations.

\[\textit{Self-Information}(j)= -\log_2 \frac {u_j}{N}\]

\(u_j\) – the number of users that interacted with item \(j\). Cold items are treated as if they were rated by a single user, so their appearance in recommendations is treated as completely unexpected.

Surprisal for item \(j\) is

\[Surprisal(j)= \frac {\textit{Self-Information}(j)}{\log_2 N}\]

Recommendation list surprisal is the average surprisal of items in it.

\[Surprisal@K(i) = \frac {\sum_{j=1}^{K}Surprisal(j)} {K}\]

The final metric is averaged over users.

\[Surprisal@K = \frac {\sum_{i=1}^{N}Surprisal@K(i)}{N}\]

\(N\) – the number of users.

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> train
   query_id  item_id
0         1        5
1         1        6
2         1        8
3         1        9
4         1        2
5         2        5
6         2        8
7         2       11
8         2        1
9         2        3
10        3        4
11        3        9
12        3        2
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> Surprisal(2)(recommendations, train)
{'Surprisal@2': 0.6845351232142715}
>>> Surprisal(2, mode=PerUser())(recommendations, train)
{'Surprisal-PerUser@2': {1: 1.0, 2: 0.3690702464285426, 3: 0.6845351232142713}}
>>> Surprisal(2, mode=Median())(recommendations, train)
{'Surprisal-Median@2': 0.6845351232142713}
>>> Surprisal(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, train)
{'Surprisal-ConfidenceInterval@2': 0.3569755541728279}

OfflineMetrics

class replay.metrics.OfflineMetrics(metrics, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id', allow_caching=True)

Designed for efficient calculation of the offline metrics provided by RePlay. If you need to calculate multiple metrics for the same input data, using this class is much more efficient than calculating the metrics individually.

For example, if you want to calculate several metrics with different parameters, the part of the computation that these metrics share is performed only once.

>>> from replay.metrics import *
>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> train
    query_id  item_id
0         1        5
1         1        6
2         1        8
3         1        9
4         1        2
5         2        5
6         2        8
7         2       11
8         2        1
9         2        3
10        3        4
11        3        9
12        3        2
>>> base_rec
   query_id  item_id  rating
0        1        3    0.5
1        1        7    0.5
2        1        2    0.7
3        2        5    0.6
4        2        8    0.6
5        2        3    0.3
6        3        4    1.0
7        3        9    0.5
>>> from replay.metrics import Median, ConfidenceInterval, PerUser
>>> metrics = [
...     Precision(2),
...     Precision(2, mode=PerUser()),
...     Precision(2, mode=Median()),
...     Precision(2, mode=ConfidenceInterval(alpha=0.95)),
...     Recall(2),
...     MAP(2),
...     MRR(2),
...     NDCG(2),
...     HitRate(2),
...     RocAuc(2),
...     Coverage(2),
...     Novelty(2),
...     Surprisal(2),
... ]
>>> OfflineMetrics(metrics)(recommendations, groundtruth, train)
{'Precision@2': 0.3333333333333333,
 'Precision-PerUser@2': {1: 0.5, 2: 0.0, 3: 0.5},
 'Precision-Median@2': 0.5,
 'Precision-ConfidenceInterval@2': 0.32666066409000905,
 'Recall@2': 0.12222222222222223,
 'MAP@2': 0.25,
 'MRR@2': 0.5,
 'NDCG@2': 0.3333333333333333,
 'HitRate@2': 0.6666666666666666,
 'RocAuc@2': 0.3333333333333333,
 'Coverage@2': 0.5555555555555556,
 'Novelty@2': 0.3333333333333333,
 'Surprisal@2': 0.6845351232142715}
>>> metrics = [
...     Precision(2),
...     Unexpectedness([1, 2]),
...     Unexpectedness([1, 2], mode=PerUser()),
... ]
>>> OfflineMetrics(metrics)(
...     recommendations,
...     groundtruth,
...     train,
...     base_recommendations={"ALS": base_rec, "KNN": recommendations}
... )
{'Precision@2': 0.3333333333333333,
 'Unexpectedness_ALS@1': 0.3333333333333333,
 'Unexpectedness_ALS@2': 0.16666666666666666,
 'Unexpectedness_KNN@1': 0.0,
 'Unexpectedness_KNN@2': 0.0,
 'Unexpectedness-PerUser_ALS@1': {1: 1.0, 2: 0.0, 3: 0.0},
 'Unexpectedness-PerUser_ALS@2': {1: 0.5, 2: 0.0, 3: 0.0},
 'Unexpectedness-PerUser_KNN@1': {1: 0.0, 2: 0.0, 3: 0.0},
 'Unexpectedness-PerUser_KNN@2': {1: 0.0, 2: 0.0, 3: 0.0}}
__call__(recommendations, ground_truth, train=None, base_recommendations=None)

Compute metrics.

Parameters
  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

  • train (Union[DataFrame, DataFrame, DataFrame, Dict, None]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict, optional): train data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids. Default: None.

  • base_recommendations (Union[DataFrame, DataFrame, DataFrame, Dict, Dict[str, Union[DataFrame, DataFrame, DataFrame, Dict]], None]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict or Dict[str, DataFrameLike]): predictions from a baseline model. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples. If Unexpectedness is not in the given metrics list, you can omit this parameter. If the metrics need to be calculated against several baseline dataframes, pass a dict whose keys are dataframe names and whose values are DataFrameLike objects. For a better understanding, check out the examples. Default: None.

Return type

Dict[str, float]

Returns

metric values

__init__(metrics, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id', allow_caching=True)
Parameters
  • metrics (List[Metric]) – (list of metrics): List of metrics to be calculated.

  • query_column (str) – (str): The name of the user column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

  • item_column (str) – (str): The name of the item column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

  • rating_column (str) – (str): The name of the score column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

  • category_column (str) –

    (str): The name of the category column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

    It is used only for calculating the CategoricalDiversity metric. If you don’t calculate this metric, you can omit this parameter.

  • allow_caching (bool) – (bool): The flag for using caching to optimize calculations. Default: True.
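The column names are passed once to the OfflineMetrics constructor and applied to every metric in the list. A brief sketch (my_recommendations, my_groundtruth and my_train are hypothetical dataframes whose columns are named user_id, track_id and score):

>>> metrics = OfflineMetrics(
...     [Precision(2), Recall(2), Coverage(2)],
...     query_column="user_id",    # applied to every metric in the list
...     item_column="track_id",
...     rating_column="score",
... )
>>> metrics(my_recommendations, my_groundtruth, my_train)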

Compare Results

class replay.metrics.Experiment(metrics, ground_truth, train=None, base_recommendations=None, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id')

The class is designed for calculating, storing and comparing metrics from different models in the Pandas DataFrame format.

The main difference from the OfflineMetrics class is that OfflineMetrics is only responsible for calculating metrics, while the Experiment class also stores metrics from different models and makes it convenient to compare them with each other.

Calculated metrics are available via the results attribute.

Example:

>>> recommendations
   query_id  item_id  rating
0         1        3    0.6
1         1        7    0.5
2         1       10    0.4
3         1       11    0.3
4         1        2    0.2
5         2        5    0.6
6         2        8    0.5
7         2       11    0.4
8         2        1    0.3
9         2        3    0.2
10        3        4    1.0
11        3        9    0.5
12        3        2    0.1
>>> groundtruth
   query_id  item_id
0         1        5
1         1        6
2         1        7
3         1        8
4         1        9
5         1       10
6         2        6
7         2        7
8         2        4
9         2       10
10        2       11
11        3        1
12        3        2
13        3        3
14        3        4
15        3        5
>>> train
   query_id  item_id
0         1        5
1         1        6
2         1        8
3         1        9
4         1        2
5         2        5
6         2        8
7         2       11
8         2        1
9         2        3
10        3        4
11        3        9
12        3        2
>>> base_rec
   query_id  item_id  rating
0        1        3    0.5
1        1        7    0.5
2        1        2    0.7
3        2        5    0.6
4        2        8    0.6
5        2        3    0.3
6        3        4    1.0
7        3        9    0.5
>>> from replay.metrics import NDCG, Surprisal, Precision, Coverage, Median, ConfidenceInterval
>>> ex = Experiment([NDCG([2, 3]), Surprisal(3)], groundtruth, train)
>>> ex.add_result("baseline", base_rec)
>>> ex.add_result("model", recommendations)
>>> ex.results
            NDCG@2    NDCG@3  Surprisal@3
baseline  0.204382  0.234639     0.608476
model     0.333333  0.489760     0.719587
>>> ex.compare("baseline")
          NDCG@2   NDCG@3 Surprisal@3
baseline       –        –           –
model     63.09%  108.73%      18.26%
>>> ex = Experiment([Precision(3, mode=Median()), Precision(3, mode=ConfidenceInterval(0.95))], groundtruth)
>>> ex.add_result("baseline", base_rec)
>>> ex.add_result("model", recommendations)
>>> ex.results
          Precision-Median@3  Precision-ConfidenceInterval@3
baseline            0.333333                        0.217774
model               0.666667                        0.217774
__init__(metrics, ground_truth, train=None, base_recommendations=None, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id')
Parameters
  • metrics (List[Metric]) – (list of metrics): List of metrics to be calculated.

  • ground_truth (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.

  • train (Union[DataFrame, DataFrame, DataFrame, Dict, None]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict, optional): train data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids. Default: None.

  • base_recommendations (Union[DataFrame, DataFrame, DataFrame, Dict, Dict[str, Union[DataFrame, DataFrame, DataFrame, Dict]], None]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict or Dict[str, DataFrameLike]): predictions from a baseline model. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples. If Unexpectedness is not in the given metrics list, you can omit this parameter. Default: None.

  • query_column (str) – (str): The name of the user column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

  • item_column (str) – (str): The name of the item column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

  • rating_column (str) – (str): The name of the score column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

  • category_column (str) –

    (str): The name of the category column. Note that you do not need to specify the value of this parameter for each metric separately. It is enough to specify the value of this parameter here once.

    It is used only for calculating the CategoricalDiversity metric. If you don’t calculate this metric, you can omit this parameter.

add_result(name, recommendations)

Calculate metrics for predictions

Parameters
  • name (str) – name of the run to store in the resulting DataFrame

  • recommendations (Union[DataFrame, DataFrame, DataFrame, Dict]) – (PySpark DataFrame or Polars DataFrame or Pandas DataFrame or dict): model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of (item_id, score) tuples.

Return type

None

compare(name)

Show results as a percentage difference relative to the record name.

Parameters

name (str) – name of the baseline record

Return type

DataFrame

Returns

results table in a percentage format


Custom Metric

Your metric should inherit from the Metric class and implement the following methods:

  • __init__

  • _get_metric_value_by_user

_get_metric_value_by_user is required for every metric because this is where the actual calculation happens. For a better understanding, see the already implemented metrics, for example Recall.
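A minimal sketch of such a metric is shown below. It assumes, by analogy with the built-in metrics, that each per-user call receives the list of cut-offs together with the user's ground-truth items and ranked predictions; check the actual argument layout against the implemented metrics before relying on it:

>>> from replay.metrics.base_metric import Metric
>>> class UserHit(Metric):
...     """Hypothetical metric: 1.0 if any of the first k recommendations is relevant."""
...     def __init__(self, topk, **kwargs):
...         # assumption: the base class accepts the same constructor arguments as the built-in metrics
...         super().__init__(topk, **kwargs)
...     @staticmethod
...     def _get_metric_value_by_user(ks, ground_truth, pred):
...         # assumption: `ground_truth` and `pred` are the per-user item lists, `ks` are the cut-offs
...         relevant = set(ground_truth)
...         return [float(any(item in relevant for item in pred[:k])) for k in ks]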

class replay.metrics.base_metric.Metric(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)

Base metric class

abstract static _get_metric_value_by_user(ks, *args)

Metric calculation for one user.

Parameters
  • ks – depth cut-offs

  • ground_truth – test data

  • pred – recommendations

Return type

List[float]

Returns

metric values for the current user (one per cut-off)