Metrics
Most metrics require a dataframe with recommendations and a dataframe with ground truth values, that is, the items each user interacted with. All input dataframes must be of the same type.
- recommendations (Union[pandas.DataFrame, spark.DataFrame, Dict]):
If recommendations is a dict, each key is a user_id and each value holds (item_id, score) pairs. If recommendations is a Spark or Pandas dataframe, the names of the corresponding columns should be passed through the metric constructor.
- ground_truth (Union[pandas.DataFrame, spark.DataFrame]):
If ground_truth is a dict, each key is a user_id and each value is an item_id. If ground_truth is a Spark or Pandas dataframe, the names of the corresponding columns must match the recommendations.
The metric is calculated for all users present in ground_truth, so the result stays accurate even when the recommender system did not generate recommendations for every user. It is assumed that every user we calculate the metric for has positive interactions. Every metric is calculated using the top K items for each user, and it is possible to calculate metrics for multiple values of K simultaneously. Make sure your recommendations do not contain user-item duplicates, as duplicates could lead to wrong calculation results (see the sketch below).
- k (Union[Iterable[int], int]):
a single number or a list specifying the truncation length of the recommendation list for each user
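The sketch below illustrates these conventions on toy data, assuming a Pandas backend: custom column names are passed through the metric constructor, user-item duplicates are dropped up front, and several values of K are evaluated in one call. The column names and the toy data are illustrative, not prescribed.

```python
# Minimal sketch (Pandas backend, toy data). The column names "user", "item"
# and "score" are arbitrary; they are wired to the metric via the constructor.
import pandas as pd

from replay.metrics import Precision

recommendations = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2],
    "item": [3, 7, 7, 5, 8, 1],              # contains a duplicate (1, 7) pair
    "score": [0.6, 0.5, 0.5, 0.6, 0.5, 0.3],
}).drop_duplicates(subset=["user", "item"])  # remove user-item duplicates

ground_truth = pd.DataFrame({
    "user": [1, 1, 2],
    "item": [7, 10, 4],
})

metric = Precision(
    topk=[1, 2],                 # several truncation lengths at once
    query_column="user",
    item_column="item",
    rating_column="score",
)
print(metric(recommendations, ground_truth))
# expected keys: 'Precision@1' and 'Precision@2'
```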
By default, metrics are averaged over users (replay.metrics.Mean), but you can alternatively use replay.metrics.Median. You can also get the confidence interval value (replay.metrics.ConfidenceInterval) for a given alpha. To calculate the metric value for each user separately, most metrics provide a per-user mode (replay.metrics.PerUser). To write your own aggregation kernel, inherit from replay.metrics.CalculationDescriptor and redefine two methods (spark and cpu); a hedged sketch is given below.
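As an illustration only, here is a sketch of such a kernel that aggregates per-user values with a minimum instead of a mean. The exact signatures of spark and cpu are not spelled out in this section, so the argument types assumed below (a single-column Spark DataFrame and a plain sequence of per-user values) may differ from the real CalculationDescriptor contract.

```python
# Hedged sketch of a custom aggregation kernel; argument types are assumptions.
from replay.metrics import CalculationDescriptor


class Minimum(CalculationDescriptor):
    """Aggregate per-user metric values by taking their minimum."""

    def spark(self, distribution):
        # Assumption: `distribution` is a Spark DataFrame whose single column
        # holds the per-user metric values.
        import pyspark.sql.functions as sf  # imported lazily; cpu path needs no Spark

        column = distribution.columns[0]
        return distribution.select(sf.min(column)).first()[0]

    def cpu(self, distribution):
        # Assumption: `distribution` is a sequence of per-user metric values.
        return min(distribution)
```

A metric constructed with mode=Minimum() would then aggregate per-user values with this kernel, mirroring how Median and ConfidenceInterval are used in the examples below.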
For each metric, a formula for its calculation is given, because this is important for the correct comparison of algorithms, as mentioned in our article.
You can also add new metrics.
Precision
- class replay.metrics.Precision(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Mean percentage of relevant items among top K recommendations.
\[Precision@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{K}\]
\[Precision@K = \frac {\sum_{i=1}^{N}Precision@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> Precision(2)(recommendations, groundtruth) {'Precision@2': 0.3333333333333333} >>> Precision(2, mode=PerUser())(recommendations, groundtruth) {'Precision-PerUser@2': {1: 0.5, 2: 0.0, 3: 0.5}} >>> Precision(2, mode=Median())(recommendations, groundtruth) {'Precision-Median@2': 0.5} >>> Precision(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'Precision-ConfidenceInterval@2': 0.32666066409000905}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
Recall
- class replay.metrics.Recall(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Recall measures the coverage of the recommended items and is defined as the mean percentage of relevant items shown among the top K recommendations.
\[Recall@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{|Rel_i|}\]
\[Recall@K = \frac {\sum_{i=1}^{N}Recall@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
\(|Rel_i|\) – the number of relevant items for user \(i\)
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> Recall(2)(recommendations, groundtruth) {'Recall@2': 0.12222222222222223} >>> Recall(2, mode=PerUser())(recommendations, groundtruth) {'Recall-PerUser@2': {1: 0.16666666666666666, 2: 0.0, 3: 0.2}} >>> Recall(2, mode=Median())(recommendations, groundtruth) {'Recall-Median@2': 0.16666666666666666} >>> Recall(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'Recall-ConfidenceInterval@2': 0.12125130695058273}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
MAP
- class replay.metrics.MAP(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Mean Average Precision – average the Precision at relevant positions for each user, and then calculate the mean across all users.
\[ \begin{align}\begin{aligned}&AP@K(i) = \frac {1}{\min(K, |Rel_i|)} \sum_{j=1}^{K}\mathbb{1}_{r_{ij}}Precision@j(i)\\&MAP@K = \frac {\sum_{i=1}^{N}AP@K(i)}{N}\end{aligned}\end{align} \]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing if user \(i\) interacted with item \(j\)
\(|Rel_i|\) – the number of relevant items for user \(i\)
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> MAP(2)(recommendations, groundtruth) {'MAP@2': 0.25} >>> MAP(2, mode=PerUser())(recommendations, groundtruth) {'MAP-PerUser@2': {1: 0.25, 2: 0.0, 3: 0.5}} >>> MAP(2, mode=Median())(recommendations, groundtruth) {'MAP-Median@2': 0.25} >>> MAP(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'MAP-ConfidenceInterval@2': 0.282896433519043}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union
[Mapping
[str
,float
],Mapping
[str
,Mapping
[Any
,float
]]]- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
MRR
- class replay.metrics.MRR(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Mean Reciprocal Rank – Reciprocal Rank is the inverse position of the first relevant item among top-k recommendations, \(\frac{1}{rank_i}\). This value is aggregated by all users.
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> MRR(2)(recommendations, groundtruth) {'MRR@2': 0.5} >>> MRR(2, mode=PerUser())(recommendations, groundtruth) {'MRR-PerUser@2': {1: 0.5, 2: 0.0, 3: 1.0}} >>> MRR(2, mode=Median())(recommendations, groundtruth) {'MRR-Median@2': 0.5} >>> MRR(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'MRR-ConfidenceInterval@2': 0.565792867038086}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
NDCG
- class replay.metrics.NDCG(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Normalized Discounted Cumulative Gain is a metric that takes into account positions of relevant items.
This is the binary version: it only takes into account whether the item was consumed or not; the relevance value is ignored.
\[DCG@K(i) = \sum_{j=1}^{K}\frac{\mathbb{1}_{r_{ij}}}{\log_2 (j+1)}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
To get from \(DCG\) to \(nDCG\) we calculate the biggest possible value of DCG for user \(i\) and recommendation length \(K\).
\[IDCG@K(i) = max(DCG@K(i)) = \sum_{j=1}^{K}\frac{\mathbb{1}_{j\le|Rel_i|}}{\log_2 (j+1)}\]
\[nDCG@K(i) = \frac {DCG@K(i)}{IDCG@K(i)}\]
\(|Rel_i|\) – number of relevant items for user \(i\)
The metric is averaged over users.
\[nDCG@K = \frac {\sum_{i=1}^{N}nDCG@K(i)}{N}\]>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> NDCG(2)(recommendations, groundtruth) {'NDCG@2': 0.3333333333333333} >>> NDCG(2, mode=PerUser())(recommendations, groundtruth) {'NDCG-PerUser@2': {1: 0.38685280723454163, 2: 0.0, 3: 0.6131471927654584}} >>> NDCG(2, mode=Median())(recommendations, groundtruth) {'NDCG-Median@2': 0.38685280723454163} >>> NDCG(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'NDCG-ConfidenceInterval@2': 0.3508565839953337}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
HitRate
- class replay.metrics.HitRate(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Percentage of users that have at least one correctly recommended item among top-k.
\[HitRate@K(i) = \max_{j \in [1..K]}\mathbb{1}_{r_{ij}}\]
\[HitRate@K = \frac {\sum_{i=1}^{N}HitRate@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function stating that user \(i\) interacted with item \(j\)
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> HitRate(2)(recommendations, groundtruth) {'HitRate@2': 0.6666666666666666} >>> HitRate(2, mode=PerUser())(recommendations, groundtruth) {'HitRate-PerUser@2': {1: 1.0, 2: 0.0, 3: 1.0}} >>> HitRate(2, mode=Median())(recommendations, groundtruth) {'HitRate-Median@2': 1.0} >>> HitRate(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'HitRate-ConfidenceInterval@2': 0.6533213281800181}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
RocAuc
- class replay.metrics.RocAuc(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Receiver Operating Characteristic/Area Under the Curve is an aggregated performance measure that depends only on the order of recommended items. It can be interpreted as the fraction of object pairs (object of class 1, object of class 0) that were correctly ordered by the model. The bigger the value of AUC, the better the classification model.
\[ROCAUC@K(i) = \frac {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{r_{si}<r_{ti}} \mathbb{1}_{gt_{si}<gt_{ti}}} {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{gt_{si}<gt_{ti}}}\]
\(\mathbb{1}_{r_{si}<r_{ti}}\) – indicator function showing that the recommendation score for user \(i\) for item \(s\) is bigger than for item \(t\)
\(\mathbb{1}_{gt_{si}<gt_{ti}}\) – indicator function showing that user \(i\) values item \(s\) more than item \(t\).
The metric is averaged over all users.
\[ROCAUC@K = \frac {\sum_{i=1}^{N}ROCAUC@K(i)}{N}\]>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> RocAuc(2)(recommendations, groundtruth) {'RocAuc@2': 0.3333333333333333} >>> RocAuc(2, mode=PerUser())(recommendations, groundtruth) {'RocAuc-PerUser@2': {1: 0.0, 2: 0.0, 3: 1.0}} >>> RocAuc(2, mode=Median())(recommendations, groundtruth) {'RocAuc-Median@2': 0.0} >>> RocAuc(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, groundtruth) {'RocAuc-ConfidenceInterval@2': 0.6533213281800181}
- __call__(recommendations, ground_truth)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
Unexpectedness
- class replay.metrics.Unexpectedness(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Fraction of recommended items that are not present in some baseline recommendations.
\[Unexpectedness@K(i) = 1 - \frac {\parallel R^{i}_{1..\min(K, \parallel R^{i} \parallel)} \cap BR^{i}_{1..\min(K, \parallel BR^{i} \parallel)} \parallel} {K}\]
\[Unexpectedness@K = \frac {1}{N}\sum_{i=1}^{N}Unexpectedness@K(i)\]
\(R_{1..j}^{i}\) – the first \(j\) recommendations for the \(i\)-th user.
\(BR_{1..j}^{i}\) – the first \(j\) base recommendations for the \(i\)-th user.
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> base_rec query_id item_id rating 0 1 3 0.5 1 1 7 0.5 2 1 2 0.7 3 2 5 0.6 4 2 8 0.6 5 2 3 0.3 6 3 4 1.0 7 3 9 0.5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> Unexpectedness([2, 4])(recommendations, base_rec) {'Unexpectedness@2': 0.16666666666666666, 'Unexpectedness@4': 0.5} >>> Unexpectedness([2, 4], mode=PerUser())(recommendations, base_rec) {'Unexpectedness-PerUser@2': {1: 0.5, 2: 0.0, 3: 0.0}, 'Unexpectedness-PerUser@4': {1: 0.5, 2: 0.5, 3: 0.5}} >>> Unexpectedness([2, 4], mode=Median())(recommendations, base_rec) {'Unexpectedness-Median@2': 0.0, 'Unexpectedness-Median@4': 0.5} >>> Unexpectedness([2, 4], mode=ConfidenceInterval(alpha=0.95))(recommendations, base_rec) {'Unexpectedness-ConfidenceInterval@2': 0.32666066409000905, 'Unexpectedness-ConfidenceInterval@4': 0.0}
- __call__(recommendations, base_recommendations)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- base_recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – base model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
Coverage
- class replay.metrics.Coverage(topk, query_column='query_id', item_column='item_id', rating_column='rating', allow_caching=True)
Metric calculation is as follows:
- take K recommendations with the biggest score for each user_id
- count the number of distinct item_id in these recommendations
- divide it by the number of distinct items in the train dataset provided to the metric call
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> train query_id item_id 0 1 5 1 1 6 2 1 8 3 1 9 4 1 2 5 2 5 6 2 8 7 2 11 8 2 1 9 2 3 10 3 4 11 3 9 12 3 2 >>> Coverage(2)(recommendations, train) {'Coverage@2': 0.5555555555555556}
- __call__(recommendations, train)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- train (PySpark, Polars, or Pandas DataFrame, or dict) – train data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', allow_caching=True)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- allow_caching (bool) – The flag for using caching to optimize calculations. Default: True.
CategoricalDiversity
- class replay.metrics.CategoricalDiversity(topk, query_column='query_id', category_column='category_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Metric calculation is as follows:
- take K recommendations with the biggest score for each user_id
- count the number of distinct category_id in these recommendations and divide it by K
- average this number over all users
>>> category_recommendations query_id category_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> CategoricalDiversity([3, 5])(category_recommendations) {'CategoricalDiversity@3': 1.0, 'CategoricalDiversity@5': 0.8666666666666667} >>> CategoricalDiversity([3, 5], mode=PerUser())(category_recommendations) {'CategoricalDiversity-PerUser@3': {1: 1.0, 2: 1.0, 3: 1.0}, 'CategoricalDiversity-PerUser@5': {1: 1.0, 2: 1.0, 3: 0.6}} >>> CategoricalDiversity([3, 5], mode=Median())(category_recommendations) {'CategoricalDiversity-Median@3': 1.0, 'CategoricalDiversity-Median@5': 1.0} >>> CategoricalDiversity([3, 5], mode=ConfidenceInterval(alpha=0.95))(category_recommendations) {'CategoricalDiversity-ConfidenceInterval@3': 0.0, 'CategoricalDiversity-ConfidenceInterval@5': 0.2613285312720073}
- __call__(recommendations)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, category and score columns. If a dict, each key is a user_id and each value is a list of tuple(category, score).
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', category_column='category_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- category_column (str) – The name of the category column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
Novelty
- class replay.metrics.Novelty(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Measures the fraction of recommended items that users have not seen in the train dataset.
\[Novelty@K(i) = \frac {\parallel {R^{i}_{1..\min(K, \parallel R^{i} \parallel)} \setminus train^{i}} \parallel} {K}\]
\[Novelty@K = \frac {1}{N}\sum_{i=1}^{N}Novelty@K(i)\]
\(R^{i}\) – the recommendations for the \(i\)-th user.
\(R^{i}_{j}\) – the \(j\)-th recommended item for the \(i\)-th user.
\(R_{1..j}^{i}\) – the first \(j\) recommendations for the \(i\)-th user.
\(train^{i}\) – the train items of the \(i\)-th user.
\(N\) – the number of users.
- Based on
P. Castells, S. Vargas, and J. Wang, Novelty and diversity metrics for recommender systems: choice, discovery and relevance, ECIR 2011.
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> train query_id item_id 0 1 5 1 1 6 2 1 8 3 1 9 4 1 2 5 2 5 6 2 8 7 2 11 8 2 1 9 2 3 10 3 4 11 3 9 12 3 2 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> Novelty(2)(recommendations, train) {'Novelty@2': 0.3333333333333333} >>> Novelty(2, mode=PerUser())(recommendations, train) {'Novelty-PerUser@2': {1: 1.0, 2: 0.0, 3: 0.0}} >>> Novelty(2, mode=Median())(recommendations, train) {'Novelty-Median@2': 0.0} >>> Novelty(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, train) {'Novelty-ConfidenceInterval@2': 0.6533213281800181}
- __call__(recommendations, train)
Compute metric.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, the items must be sorted in decreasing order of their scores.
- train (PySpark, Polars, or Pandas DataFrame, or dict, optional) – train data. If a DataFrame, it must contain user and item columns.
- Return type
Union[Mapping[str, float], Mapping[str, Mapping[Any, float]]]
- Returns
metric values
- __init__(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
- Parameters
- topk (list or int) – Consider the highest k scores in the ranking.
- query_column (str) – The name of the user column.
- item_column (str) – The name of the item column.
- rating_column (str) – The name of the score column.
- mode (CalculationDescriptor) – Class for calculating aggregation metrics. Default: Mean.
Surprisal
- class replay.metrics.Surprisal(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Measures how many surprising rare items are present in recommendations.
\[\textit{Self-Information}(j)= -\log_2 \frac {u_j}{N}\]
\(u_j\) – number of users that interacted with item \(j\). Cold items are treated as if they were rated by 1 user, so if they appear in recommendations they are considered completely unexpected.
Surprisal for item \(j\) is
\[Surprisal(j)= \frac {\textit{Self-Information}(j)}{log_2 N}\]
Recommendation list surprisal is the average surprisal of its items.
\[Surprisal@K(i) = \frac {\sum_{j=1}^{K}Surprisal(j)} {K}\]
The final metric is averaged over users.
\[Surprisal@K = \frac {\sum_{i=1}^{N}Surprisal@K(i)}{N}\]
\(N\) – the number of users.
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> train query_id item_id 0 1 5 1 1 6 2 1 8 3 1 9 4 1 2 5 2 5 6 2 8 7 2 11 8 2 1 9 2 3 10 3 4 11 3 9 12 3 2 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> Surprisal(2)(recommendations, train) {'Surprisal@2': 0.6845351232142715} >>> Surprisal(2, mode=PerUser())(recommendations, train) {'Surprisal-PerUser@2': {1: 1.0, 2: 0.3690702464285426, 3: 0.6845351232142713}} >>> Surprisal(2, mode=Median())(recommendations, train) {'Surprisal-Median@2': 0.6845351232142713} >>> Surprisal(2, mode=ConfidenceInterval(alpha=0.95))(recommendations, train) {'Surprisal-ConfidenceInterval@2': 0.3569755541728279}
OfflineMetrics
- class replay.metrics.OfflineMetrics(metrics, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id', allow_caching=True)
Designed for efficient calculation of the offline metrics provided by RePlay. If you need to calculate multiple metrics for the same input data, using this class is much more efficient than calculating the metrics individually.
For example, if you want to calculate several metrics with different parameters, the common part of their computation is performed only once.
>>> from replay.metrics import * >>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> train query_id item_id 0 1 5 1 1 6 2 1 8 3 1 9 4 1 2 5 2 5 6 2 8 7 2 11 8 2 1 9 2 3 10 3 4 11 3 9 12 3 2 >>> base_rec query_id item_id rating 0 1 3 0.5 1 1 7 0.5 2 1 2 0.7 3 2 5 0.6 4 2 8 0.6 5 2 3 0.3 6 3 4 1.0 7 3 9 0.5 >>> from replay.metrics import Median, ConfidenceInterval, PerUser >>> metrics = [ ... Precision(2), ... Precision(2, mode=PerUser()), ... Precision(2, mode=Median()), ... Precision(2, mode=ConfidenceInterval(alpha=0.95)), ... Recall(2), ... MAP(2), ... MRR(2), ... NDCG(2), ... HitRate(2), ... RocAuc(2), ... Coverage(2), ... Novelty(2), ... Surprisal(2), ... ] >>> OfflineMetrics(metrics)(recommendations, groundtruth, train) {'Precision@2': 0.3333333333333333, 'Precision-PerUser@2': {1: 0.5, 2: 0.0, 3: 0.5}, 'Precision-Median@2': 0.5, 'Precision-ConfidenceInterval@2': 0.32666066409000905, 'Recall@2': 0.12222222222222223, 'MAP@2': 0.25, 'MRR@2': 0.5, 'NDCG@2': 0.3333333333333333, 'HitRate@2': 0.6666666666666666, 'RocAuc@2': 0.3333333333333333, 'Coverage@2': 0.5555555555555556, 'Novelty@2': 0.3333333333333333, 'Surprisal@2': 0.6845351232142715} >>> metrics = [ ... Precision(2), ... Unexpectedness([1, 2]), ... Unexpectedness([1, 2], mode=PerUser()), ... ] >>> OfflineMetrics(metrics)( ... recommendations, ... groundtruth, ... train, ... base_recommendations={"ALS": base_rec, "KNN": recommendations} ... ) {'Precision@2': 0.3333333333333333, 'Unexpectedness_ALS@1': 0.3333333333333333, 'Unexpectedness_ALS@2': 0.16666666666666666, 'Unexpectedness_KNN@1': 0.0, 'Unexpectedness_KNN@2': 0.0, 'Unexpectedness-PerUser_ALS@1': {1: 1.0, 2: 0.0, 3: 0.0}, 'Unexpectedness-PerUser_ALS@2': {1: 0.5, 2: 0.0, 3: 0.0}, 'Unexpectedness-PerUser_KNN@1': {1: 0.0, 2: 0.0, 3: 0.0}, 'Unexpectedness-PerUser_KNN@2': {1: 0.0, 2: 0.0, 3: 0.0}}
- __call__(recommendations, ground_truth, train=None, base_recommendations=None)
Compute metrics.
- Parameters
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- train (PySpark, Polars, or Pandas DataFrame, or dict, optional) – train data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids. Default: None.
- base_recommendations (PySpark, Polars, or Pandas DataFrame, or dict, or Dict[str, DataFrameLike], optional) – predictions from a baseline model. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score). If Unexpectedness is not in the given metrics list, you can omit this parameter. If you need to calculate metrics on several dataframes, pass a dict (key – name of the dataframe, value – DataFrameLike). For a better understanding, check out the examples. Default: None.
- Return type
Dict[str, float]
- Returns
metric values
- __init__(metrics, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id', allow_caching=True)
- Parameters
- metrics (List[Metric]) – List of metrics to be calculated.
- query_column (str) – The name of the user column. You do not need to specify this parameter for each metric separately; it is enough to specify it here once.
- item_column (str) – The name of the item column. It is enough to specify it here once.
- rating_column (str) – The name of the score column. It is enough to specify it here once.
- category_column (str) – The name of the category column. It is enough to specify it here once. It is used only for calculating the CategoricalDiversity metric; if you don't calculate this metric, you can omit this parameter.
- allow_caching (bool) – The flag for using caching to optimize calculations. Default: True.
Compare Results
- class replay.metrics.Experiment(metrics, ground_truth, train=None, base_recommendations=None, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id')
The class is designed for calculating, storing and comparing metrics from different models in the Pandas DataFrame format.
The main difference from the OfflineMetrics class is that OfflineMetrics is only responsible for calculating metrics, while the Experiment class stores the metrics of different models and provides a convenient way to compare them with each other.
Calculated metrics are available via the results attribute.
Example:
>>> recommendations query_id item_id rating 0 1 3 0.6 1 1 7 0.5 2 1 10 0.4 3 1 11 0.3 4 1 2 0.2 5 2 5 0.6 6 2 8 0.5 7 2 11 0.4 8 2 1 0.3 9 2 3 0.2 10 3 4 1.0 11 3 9 0.5 12 3 2 0.1 >>> groundtruth query_id item_id 0 1 5 1 1 6 2 1 7 3 1 8 4 1 9 5 1 10 6 2 6 7 2 7 8 2 4 9 2 10 10 2 11 11 3 1 12 3 2 13 3 3 14 3 4 15 3 5 >>> train query_id item_id 0 1 5 1 1 6 2 1 8 3 1 9 4 1 2 5 2 5 6 2 8 7 2 11 8 2 1 9 2 3 10 3 4 11 3 9 12 3 2 >>> base_rec query_id item_id rating 0 1 3 0.5 1 1 7 0.5 2 1 2 0.7 3 2 5 0.6 4 2 8 0.6 5 2 3 0.3 6 3 4 1.0 7 3 9 0.5 >>> from replay.metrics import NDCG, Surprisal, Precision, Coverage, Median, ConfidenceInterval >>> ex = Experiment([NDCG([2, 3]), Surprisal(3)], groundtruth, train) >>> ex.add_result("baseline", base_rec) >>> ex.add_result("model", recommendations) >>> ex.results NDCG@2 NDCG@3 Surprisal@3 baseline 0.204382 0.234639 0.608476 model 0.333333 0.489760 0.719587 >>> ex.compare("baseline") NDCG@2 NDCG@3 Surprisal@3 baseline – – – model 63.09% 108.73% 18.26% >>> ex = Experiment([Precision(3, mode=Median()), Precision(3, mode=ConfidenceInterval(0.95))], groundtruth) >>> ex.add_result("baseline", base_rec) >>> ex.add_result("model", recommendations) >>> ex.results Precision-Median@3 Precision-ConfidenceInterval@3 baseline 0.333333 0.217774 model 0.666667 0.217774
- __init__(metrics, ground_truth, train=None, base_recommendations=None, query_column='query_id', item_column='item_id', rating_column='rating', category_column='category_id')
- Parameters
- metrics (List[Metric]) – List of metrics to be calculated.
- ground_truth (PySpark, Polars, or Pandas DataFrame, or dict) – test data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids.
- train (PySpark, Polars, or Pandas DataFrame, or dict, optional) – train data. If a DataFrame, it must contain user and item columns. If a dict, each key is a user_id and each value is a list of item_ids. Default: None.
- base_recommendations (PySpark, Polars, or Pandas DataFrame, or dict, or Dict[str, DataFrameLike], optional) – predictions from a baseline model. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score). If Unexpectedness is not in the given metrics list, you can omit this parameter. Default: None.
- query_column (str) – The name of the user column. You do not need to specify this parameter for each metric separately; it is enough to specify it here once.
- item_column (str) – The name of the item column. It is enough to specify it here once.
- rating_column (str) – The name of the score column. It is enough to specify it here once.
- category_column (str) – The name of the category column. It is enough to specify it here once. It is used only for calculating the CategoricalDiversity metric; if you don't calculate this metric, you can omit this parameter.
- add_result(name, recommendations)
Calculate metrics for predictions
- Parameters
- name (str) – name of the run to store in the resulting DataFrame
- recommendations (PySpark, Polars, or Pandas DataFrame, or dict) – model predictions. If a DataFrame, it must contain user, item and score columns. If a dict, each key is a user_id and each value is a list of tuple(item_id, score).
- Return type
None
- compare(name)
Show results as a percentage difference relative to the record name.
- Parameters
- name (str) – name of the baseline record
- Return type
DataFrame
- Returns
results table in a percentage format
Custom Metric
Your metric should inherit from the Metric class and implement the following methods:
- __init__
- _get_metric_value_by_user
_get_metric_value_by_user is required for every metric because this is where the actual calculations happen.
For a better understanding, see the already implemented metrics, for example Recall. A hedged sketch of a custom metric is given after the Metric class reference below.
- class replay.metrics.base_metric.Metric(topk, query_column='query_id', item_column='item_id', rating_column='rating', mode=<replay.metrics.descriptors.Mean object>)
Base metric class
- abstract static _get_metric_value_by_user(ks, *args)
Metric calculation for one user.
- Parameters
- k – depth cut-off
- ground_truth – test data
- pred – recommendations
- Return type
List[float]
- Returns
metric value for the current user
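To make this contract concrete, here is a hedged sketch of a custom metric. The base __init__ shown above is reused unchanged, and the unpacking of *args (ground truth first, then recommendations, both assumed to be per-user lists of item ids sorted by score) is an assumption based on the parameter listing, not a documented guarantee.

```python
# Hedged sketch of a custom metric; the internal types and the order of *args
# are assumptions, so check an implemented metric such as Recall before use.
from typing import List

from replay.metrics.base_metric import Metric


class SimpleHitRate(Metric):
    """1.0 if at least one of the top-k recommended items is relevant, else 0.0."""

    @staticmethod
    def _get_metric_value_by_user(ks, *args) -> List[float]:
        ground_truth, pred = args  # assumed order, matching the listing above
        relevant = set(ground_truth)
        values = []
        for k in ks:
            hit = any(item in relevant for item in pred[:k])
            values.append(1.0 if hit else 0.0)
        return values
```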