Splitters

Splitters divide interaction data into train and test sets.

Each splitter returns the resulting split from its split method.

replay.splitters.base_splitter.Splitter.split(self, interactions)

Splits input DataFrame into train and test

Parameters

interactions (Union[DataFrame, DataFrame, DataFrame]) – input DataFrame [timestamp, user_id, item_id, relevance]

Return type

Tuple[Union[DataFrame, DataFrame, DataFrame], Union[DataFrame, DataFrame, DataFrame]]

Returns

Tuple of split DataFrames (train, test)

TwoStageSplitter

class replay.splitters.two_stage_splitter.TwoStageSplitter(first_divide_size, second_divide_size, first_divide_column='query_id', second_divide_column='item_id', shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id', timestamp_column='timestamp')

Splits data by two columns in two steps. First step: first_divide_size distinct values of first_divide_column are selected for the test split. Second step: among the data selected in the first step, second_divide_size values of second_divide_column are taken into the test split.

Example:

>>> from replay.utils.session_handler import get_spark_session, State
>>> spark = get_spark_session(1, 1)
>>> state = State(spark)
>>> from replay.splitters import TwoStageSplitter
>>> import pandas as pd
>>> data_frame = pd.DataFrame({"query_id": [1,1,1,2,2,2],
...    "item_id": [1,2,3,1,2,3],
...    "relevance": [1,2,3,4,5,6],
...    "timestamp": [1,2,3,3,2,1]})
>>> data_frame
   query_id  item_id  relevance  timestamp
0         1         1          1          1
1         1         2          2          2
2         1         3          3          3
3         2         1          4          3
4         2         2          5          2
5         2         3          6          1
>>> train, test = TwoStageSplitter(first_divide_size=1, second_divide_size=2, seed=42).split(data_frame)
>>> test
   query_id  item_id  relevance  timestamp
3         2         1          4          3
4         2         2          5          2
>>> train, test = TwoStageSplitter(first_divide_size=0.5, second_divide_size=2, seed=42).split(data_frame)
>>> test
   query_id  item_id  relevance  timestamp
3         2         1          4          3
4         2         2          5          2
>>> train, test = TwoStageSplitter(first_divide_size=0.5, second_divide_size=0.7, seed=42).split(data_frame)
>>> test
   query_id  item_id  relevance  timestamp
3         2         1          4          3
4         2         2          5          2

__init__(first_divide_size, second_divide_size, first_divide_column='query_id', second_divide_column='item_id', shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id', timestamp_column='timestamp')
Parameters
  • second_divide_size (Union[float, int]) – fraction or number of second_divide_column values (items by default) to take per selected first_divide_column value

  • first_divide_size (Union[float, int]) – fraction or number of distinct first_divide_column values (users by default) to take into test. None means all available users.

  • shuffle – take random items instead of the last ones by timestamp.

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users (bool) – flag to drop cold users from test

  • seed (Optional[int]) – random seed

  • query_column (str) – query id column name

  • item_column (Optional[str]) – item id column name

  • timestamp_column (Optional[str]) – timestamp column name
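
The two steps can be sketched in plain Python. This is a simplified illustration and not RePlay's implementation: two_stage_split and the dict-based rows are hypothetical, only integer sizes are handled, and choosing the queries at random in the first step is an assumption.

```python
import random
from collections import defaultdict


def two_stage_split(rows, n_queries, n_items, seed=None):
    """Hypothetical helper: pick n_queries query ids, then move the last
    n_items interactions (by timestamp) of each picked query to test."""
    rng = random.Random(seed)
    query_ids = sorted({row["query_id"] for row in rows})
    # Step 1: choose which queries contribute to test (random choice is an assumption).
    test_queries = set(rng.sample(query_ids, n_queries))

    by_query = defaultdict(list)
    for index, row in enumerate(rows):
        by_query[row["query_id"]].append(index)

    test_indices = set()
    for query_id in test_queries:
        # Step 2: the latest interactions of a chosen query go to test.
        latest_first = sorted(
            by_query[query_id], key=lambda i: rows[i]["timestamp"], reverse=True
        )
        test_indices.update(latest_first[:n_items])

    train = [row for i, row in enumerate(rows) if i not in test_indices]
    test = [row for i, row in enumerate(rows) if i in test_indices]
    return train, test


rows = [
    {"query_id": q, "item_id": i, "timestamp": t}
    for q, i, t in [(1, 1, 1), (1, 2, 2), (1, 3, 3), (2, 1, 3), (2, 2, 2), (2, 3, 1)]
]
train, test = two_stage_split(rows, n_queries=1, n_items=2, seed=42)
```

Whichever query is picked, test holds its two most recent interactions and the other four rows stay in train.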

KFolds

class replay.splitters.k_folds.KFolds(n_folds=5, strategy='query', drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')

Splits interactions inside each query into folds at random.
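
A minimal sketch of the idea in plain Python, assuming each query's interactions are shuffled and then dealt round-robin into folds. The helper k_folds_per_query and its generator interface are hypothetical and do not describe the KFolds API.

```python
import random
from collections import defaultdict


def k_folds_per_query(rows, n_folds=5, seed=None):
    """Hypothetical helper: yield one (train, test) pair per fold."""
    rng = random.Random(seed)
    by_query = defaultdict(list)
    for index, row in enumerate(rows):
        by_query[row["query_id"]].append(index)

    fold_of = {}
    for indices in by_query.values():
        rng.shuffle(indices)
        # Deal the shuffled interactions of one query round-robin into folds.
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds

    for fold in range(n_folds):
        train = [row for i, row in enumerate(rows) if fold_of[i] != fold]
        test = [row for i, row in enumerate(rows) if fold_of[i] == fold]
        yield train, test


rows = [{"query_id": q, "item_id": i} for q in (1, 2) for i in range(6)]
folds = list(k_folds_per_query(rows, n_folds=3, seed=0))
```

With 6 interactions per query and 3 folds, every fold's test part holds 2 interactions of each query, and each interaction appears in test exactly once across the folds.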

TimeSplitter

class replay.splitters.time_splitter.TimeSplitter(time_threshold, query_column='query_id', drop_cold_users=False, drop_cold_items=False, item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test', time_column_format='%Y-%m-%d %H:%M:%S')

Split interactions by time.

>>> from datetime import datetime
>>> import pandas as pd
>>> columns = ["query_id", "item_id", "timestamp"]
>>> data = [
...     (1, 1, "01-01-2020"),
...     (1, 2, "02-01-2020"),
...     (1, 3, "03-01-2020"),
...     (1, 4, "04-01-2020"),
...     (1, 5, "05-01-2020"),
...     (2, 1, "06-01-2020"),
...     (2, 2, "07-01-2020"),
...     (2, 3, "08-01-2020"),
...     (2, 9, "09-01-2020"),
...     (2, 10, "10-01-2020"),
...     (3, 1, "01-01-2020"),
...     (3, 5, "02-01-2020"),
...     (3, 3, "03-01-2020"),
...     (3, 1, "04-01-2020"),
...     (3, 2, "05-01-2020"),
... ]
>>> dataset = pd.DataFrame(data, columns=columns)
>>> dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], format="%d-%m-%Y")
>>> dataset
   query_id  item_id  timestamp
0         1        1 2020-01-01
1         1        2 2020-01-02
2         1        3 2020-01-03
3         1        4 2020-01-04
4         1        5 2020-01-05
5         2        1 2020-01-06
6         2        2 2020-01-07
7         2        3 2020-01-08
8         2        9 2020-01-09
9         2       10 2020-01-10
10        3        1 2020-01-01
11        3        5 2020-01-02
12        3        3 2020-01-03
13        3        1 2020-01-04
14        3        2 2020-01-05
>>> train, test = TimeSplitter(
...     time_threshold=datetime.strptime("2020-01-04", "%Y-%M-%d")
... ).split(dataset)
>>> train
   query_id  item_id  timestamp
0         1        1 2020-01-01
1         1        2 2020-01-02
2         1        3 2020-01-03
3         1        4 2020-01-04
10        3        1 2020-01-01
11        3        5 2020-01-02
12        3        3 2020-01-03
13        3        1 2020-01-04
>>> test
   query_id  item_id  timestamp
4         1        5 2020-01-05
5         2        1 2020-01-06
6         2        2 2020-01-07
7         2        3 2020-01-08
8         2        9 2020-01-09
9         2       10 2020-01-10
14        3        2 2020-01-05

__init__(time_threshold, query_column='query_id', drop_cold_users=False, drop_cold_items=False, item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test', time_column_format='%Y-%m-%d %H:%M:%S')
Parameters
  • time_threshold (Union[datetime, str, int, float]) – Test threshold. A datetime splits by that moment; an int is treated as a datetime in Unix format; a str is converted to datetime using time_column_format; a float splits by ratio and must be between 0 and 1.

  • query_column (str) – Name of user interaction column.

  • drop_cold_users (bool) – Drop users from test DataFrame that are not in train DataFrame, default: False.

  • drop_cold_items (bool) – Drop items from test DataFrame which are not in train DataFrame, default: False.

  • item_column (str) – Name of item interaction column. If drop_cold_items is False, then you can omit this parameter. Default: item_id.

  • timestamp_column (str) – Name of time column, Default: timestamp.

  • session_id_column (Optional[str]) – Name of session id column whose values cannot be split, default: None.

  • session_id_processing_strategy (str) – How to process a session that the split would divide. Values: train, test. train: the whole session goes to train; test: the whole session goes to test. Default: test.
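
The accepted time_threshold types can be normalized to a single datetime before any comparison. The sketch below is illustrative only and not RePlay's implementation: resolve_threshold is a hypothetical helper, an int is read as Unix seconds in UTC, and a float is read as the share of the most recent records that go to test. Both readings are assumptions.

```python
from datetime import datetime, timezone


def resolve_threshold(timestamps, time_threshold,
                      time_column_format="%Y-%m-%d %H:%M:%S"):
    """Hypothetical helper: turn any supported threshold type into a datetime."""
    if isinstance(time_threshold, datetime):
        return time_threshold
    if isinstance(time_threshold, str):
        return datetime.strptime(time_threshold, time_column_format)
    if isinstance(time_threshold, float):
        # Assumption: the ratio is the share of most recent records for test,
        # so the threshold is the timestamp where that tail begins.
        ordered = sorted(timestamps)
        return ordered[int(len(ordered) * (1 - time_threshold))]
    # Assumption: an int is a Unix timestamp in seconds, interpreted as UTC.
    return datetime.fromtimestamp(time_threshold, tz=timezone.utc).replace(tzinfo=None)


timestamps = [datetime(2020, 1, day) for day in range(1, 11)]
from_string = resolve_threshold(timestamps, "2020-01-04", "%Y-%m-%d")
from_ratio = resolve_threshold(timestamps, 0.5)
```

Here from_string is midnight on 2020-01-04, and from_ratio is the sixth of the ten timestamps, 2020-01-06.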

LastNSplitter

class replay.splitters.last_n_splitter.LastNSplitter(N, divide_column='query_id', time_column_format='yyyy-MM-dd HH:mm:ss', strategy='interactions', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')

Split interactions by last N interactions/timedelta per user. Type of splitting depends on the strategy parameter.

>>> from datetime import datetime
>>> import pandas as pd
>>> columns = ["query_id", "item_id", "timestamp"]
>>> data = [
...     (1, 1, "01-01-2020"),
...     (1, 2, "02-01-2020"),
...     (1, 3, "03-01-2020"),
...     (1, 4, "04-01-2020"),
...     (1, 5, "05-01-2020"),
...     (2, 1, "06-01-2020"),
...     (2, 2, "07-01-2020"),
...     (2, 3, "08-01-2020"),
...     (2, 9, "09-01-2020"),
...     (2, 10, "10-01-2020"),
...     (3, 1, "01-01-2020"),
...     (3, 5, "02-01-2020"),
...     (3, 3, "03-01-2020"),
...     (3, 1, "04-01-2020"),
...     (3, 2, "05-01-2020"),
... ]
>>> dataset = pd.DataFrame(data, columns=columns)
>>> dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], format="%d-%m-%Y")
>>> dataset
   query_id  item_id  timestamp
0         1        1 2020-01-01
1         1        2 2020-01-02
2         1        3 2020-01-03
3         1        4 2020-01-04
4         1        5 2020-01-05
5         2        1 2020-01-06
6         2        2 2020-01-07
7         2        3 2020-01-08
8         2        9 2020-01-09
9         2       10 2020-01-10
10        3        1 2020-01-01
11        3        5 2020-01-02
12        3        3 2020-01-03
13        3        1 2020-01-04
14        3        2 2020-01-05
>>> splitter = LastNSplitter(
...     N=2,
...     divide_column="query_id",
...     time_column_format="yyyy-MM-dd",
...     query_column="query_id",
...     item_column="item_id"
... )
>>> train, test = splitter.split(dataset)
>>> train
   query_id  item_id  timestamp
0         1        1 2020-01-01
1         1        2 2020-01-02
2         1        3 2020-01-03
5         2        1 2020-01-06
6         2        2 2020-01-07
7         2        3 2020-01-08
10        3        1 2020-01-01
11        3        5 2020-01-02
12        3        3 2020-01-03
>>> test
   query_id  item_id  timestamp
3         1        4 2020-01-04
4         1        5 2020-01-05
8         2        9 2020-01-09
9         2       10 2020-01-10
13        3        1 2020-01-04
14        3        2 2020-01-05

__init__(N, divide_column='query_id', time_column_format='yyyy-MM-dd HH:mm:ss', strategy='interactions', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
Parameters
  • N (int) – Number of interactions or timedelta to split off, depending on strategy.

  • divide_column (str) – Name of column for dividing in dataframe, default: query_id.

  • time_column_format (str) – Format of the time column, needed to convert it to unix_timestamp type. You can omit this parameter if strategy is set to ‘interactions’ or if the time column has already been converted to unix_timestamp type. Default: yyyy-MM-dd HH:mm:ss

  • strategy (Literal['interactions', 'timedelta']) – Defines the type of data splitting. Must be interactions or timedelta. default: interactions.

  • query_column (str) – Name of query interaction column.

  • drop_cold_users (bool) – Drop users from test DataFrame that are not in train DataFrame, default: False.

  • drop_cold_items (bool) – Drop items from test DataFrame which are not in train DataFrame, default: False.

  • item_column (str) – Name of item interaction column. If drop_cold_items is False, then you can omit this parameter. Default: item_id.

  • timestamp_column (str) – Name of time column, Default: timestamp.

  • session_id_column (Optional[str]) – Name of session id column whose values cannot be split, default: None.

  • session_id_processing_strategy (str) – How to process a session that the split would divide. Values: train, test. train: the whole session goes to train; test: the whole session goes to test. Default: test.
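
The interplay between the 'interactions' strategy and session_id_processing_strategy can be sketched in plain Python. This is an illustration under stated assumptions, not RePlay's code: last_n_split, the dict-based rows, and the session_id key are hypothetical, and a session counts as split when its interactions land on both sides.

```python
from collections import defaultdict


def last_n_split(rows, n, session_strategy=None):
    """Hypothetical helper: the last n interactions of each query go to test.

    If session_strategy is "train" or "test", a session that ends up on both
    sides of the split is moved to that side as a whole.
    """
    by_query = defaultdict(list)
    for index, row in enumerate(rows):
        by_query[row["query_id"]].append(index)

    is_test = {}
    for indices in by_query.values():
        ordered = sorted(indices, key=lambda i: rows[i]["timestamp"])
        for position, index in enumerate(ordered):
            is_test[index] = position >= len(ordered) - n

    if session_strategy is not None:
        sides = defaultdict(set)
        for index, row in enumerate(rows):
            sides[(row["query_id"], row["session_id"])].add(is_test[index])
        for index, row in enumerate(rows):
            # A session seen on both sides was cut by the split.
            if len(sides[(row["query_id"], row["session_id"])]) == 2:
                is_test[index] = session_strategy == "test"

    train = [row for i, row in enumerate(rows) if not is_test[i]]
    test = [row for i, row in enumerate(rows) if is_test[i]]
    return train, test


rows = [
    {"query_id": 1, "timestamp": t, "session_id": s}
    for t, s in [(1, "a"), (2, "a"), (3, "b"), (4, "b"), (5, "c")]
]
plain = last_n_split(rows, n=2)
to_test = last_n_split(rows, n=2, session_strategy="test")
to_train = last_n_split(rows, n=2, session_strategy="train")
```

Without session handling, session "b" is cut between timestamps 3 and 4. With the "test" strategy both of its interactions go to test, and with "train" both stay in train.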

RatioSplitter

class replay.splitters.ratio_splitter.RatioSplitter(test_size, divide_column='query_id', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', min_interactions_per_group=None, split_by_fractions=True, session_id_column=None, session_id_processing_strategy='test')

Split interactions into train and test by ratio. Split is made for each user separately.

>>> from datetime import datetime
>>> import pandas as pd
>>> columns = ["query_id", "item_id", "timestamp"]
>>> data = [
...     (1, 1, "01-01-2020"),
...     (1, 2, "02-01-2020"),
...     (1, 3, "03-01-2020"),
...     (1, 4, "04-01-2020"),
...     (1, 5, "05-01-2020"),
...     (2, 1, "06-01-2020"),
...     (2, 2, "07-01-2020"),
...     (2, 3, "08-01-2020"),
...     (2, 9, "09-01-2020"),
...     (2, 10, "10-01-2020"),
...     (3, 1, "01-01-2020"),
...     (3, 5, "02-01-2020"),
...     (3, 3, "03-01-2020"),
...     (3, 1, "04-01-2020"),
...     (3, 2, "05-01-2020"),
... ]
>>> dataset = pd.DataFrame(data, columns=columns)
>>> dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], format="%d-%m-%Y")
>>> dataset
   query_id  item_id  timestamp
0         1        1 2020-01-01
1         1        2 2020-01-02
2         1        3 2020-01-03
3         1        4 2020-01-04
4         1        5 2020-01-05
5         2        1 2020-01-06
6         2        2 2020-01-07
7         2        3 2020-01-08
8         2        9 2020-01-09
9         2       10 2020-01-10
10        3        1 2020-01-01
11        3        5 2020-01-02
12        3        3 2020-01-03
13        3        1 2020-01-04
14        3        2 2020-01-05
>>> splitter = RatioSplitter(
...     test_size=0.5,
...     divide_column="query_id",
...     query_column="query_id",
...     item_column="item_id"
... )
>>> train, test = splitter.split(dataset)
>>> train
   query_id  item_id  timestamp
0         1        1 2020-01-01
1         1        2 2020-01-02
5         2        1 2020-01-06
6         2        2 2020-01-07
10        3        1 2020-01-01
11        3        5 2020-01-02
>>> test
   query_id  item_id  timestamp
2         1        3 2020-01-03
3         1        4 2020-01-04
4         1        5 2020-01-05
7         2        3 2020-01-08
8         2        9 2020-01-09
9         2       10 2020-01-10
12        3        3 2020-01-03
13        3        1 2020-01-04
14        3        2 2020-01-05

__init__(test_size, divide_column='query_id', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', min_interactions_per_group=None, split_by_fractions=True, session_id_column=None, session_id_processing_strategy='test')
Parameters
  • test_size – test size, must be in (0, 1).

  • divide_column (str) – Name of column for dividing in dataframe, default: query_id.

  • drop_cold_users (bool) – Drop users from test DataFrame that are not in train DataFrame, default: False.

  • drop_cold_items (bool) – Drop items from test DataFrame which are not in train DataFrame, default: False.

  • query_column (str) – Name of query interaction column. If drop_cold_users is False, then you can omit this parameter. Default: query_id.

  • item_column (str) – Name of item interaction column. If drop_cold_items is False, then you can omit this parameter. Default: item_id.

  • timestamp_column (str) – Name of time column, Default: timestamp.

  • min_interactions_per_group (Optional[int]) – Minimal number of interactions a group needs in order to be split. A group with fewer interactions than min_interactions_per_group goes to train as a whole. If not set, groups of any size are split. Default: None.

  • split_by_fractions (bool) – Whether to split by fractions. When splitting by fractions, each row is marked with its frac (row number / number of rows), and only rows with frac > test_ratio go to test. Otherwise, the number of train rows is calculated by rounding the total number of rows minus the number of rows multiplied by the test ratio. Because of this rounding, the second method puts 1 more interaction of each group (1 item for each user) into train; when splitting by fractions, these items go to test. Default: True.

  • session_id_column (Optional[str]) – Name of session id column whose values cannot be split, default: None.

  • session_id_processing_strategy (str) – How to process a session that the split would divide. Values: train, test. train: the whole session goes to train; test: the whole session goes to test. Default: test.
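
The difference between the two split_by_fractions modes is easiest to see on a single group. The sketch below is illustrative and not RePlay's implementation: ratio_split_group is a hypothetical helper, the fraction is compared with 1 - test_size (so that a small test_size sends only the most recent rows to test), and the count-based mode rounds half up. Both choices are assumptions picked to agree with the description and with the 5-interaction example above.

```python
import math


def ratio_split_group(rows, test_size, split_by_fractions=True):
    """Hypothetical helper: split one group's rows, already ordered by time."""
    count = len(rows)
    if split_by_fractions:
        # Assumption: a row goes to test when rank / count exceeds 1 - test_size.
        train_size = sum(
            rank / count <= 1 - test_size for rank in range(1, count + 1)
        )
    else:
        # Assumption: half-up rounding of count - count * test_size.
        train_size = math.floor(count - count * test_size + 0.5)
    return rows[:train_size], rows[train_size:]


group = [1, 2, 3, 4, 5]  # five interactions of one user, oldest first
by_fractions = ratio_split_group(group, test_size=0.5)
by_counts = ratio_split_group(group, test_size=0.5, split_by_fractions=False)
```

For five interactions and test_size=0.5, the fraction mode keeps 2 rows in train and sends 3 to test, while the count mode keeps 3 in train and sends 2 to test.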

RandomSplitter

class replay.splitters.random_splitter.RandomSplitter(test_size, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id')

Assign records into train and test at random.

__init__(test_size, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id')
Parameters
  • test_size (float) – test size 0 to 1

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users (bool) – flag to drop cold users from test

  • seed (Optional[int]) – random seed

  • query_column (str) – Name of query interaction column

  • item_column (str) – Name of item interaction column
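
A minimal sketch of random assignment in plain Python. The helper random_split is hypothetical, and sending each record to test independently with probability test_size is an assumption: under it the test share is only approximately test_size, and RePlay may sample differently.

```python
import random


def random_split(rows, test_size, seed=None):
    """Hypothetical helper: send each record to test with probability test_size."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # rng.random() is uniform on [0, 1), so this branch fires with
        # probability test_size.
        (test if rng.random() < test_size else train).append(row)
    return train, test


records = list(range(100))
train, test = random_split(records, test_size=0.3, seed=1)
```

Passing the same seed reproduces the same split, which is what the seed parameter is for.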

NewUsersSplitter

class replay.splitters.new_users_splitter.NewUsersSplitter(test_size, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')

Only new users will be assigned to test set. Splits interactions by timestamp so that test has test_size fraction of most recent users.

>>> from replay.splitters import NewUsersSplitter
>>> import pandas as pd
>>> data_frame = pd.DataFrame({"query_id": [1,1,2,2,3,4],
...    "item_id": [1,2,3,1,2,3],
...    "relevance": [1,2,3,4,5,6],
...    "timestamp": [20,40,20,30,10,40]})
>>> data_frame
   query_id   item_id  relevance  timestamp
0         1         1          1         20
1         1         2          2         40
2         2         3          3         20
3         2         1          4         30
4         3         2          5         10
5         4         3          6         40
>>> train, test = NewUsersSplitter(test_size=0.1).split(data_frame)
>>> train
  query_id  item_id  relevance  timestamp
0        1        1          1         20
2        2        3          3         20
3        2        1          4         30
4        3        2          5         10

>>> test
  query_id  item_id  relevance  timestamp
0        4        3          6         40

Train DataFrame can be drastically reduced even with moderate test_size if the amount of new users is small.

>>> train, test = NewUsersSplitter(test_size=0.3).split(data_frame)
>>> train
  query_id  item_id  relevance  timestamp
4        3        2          5         10

__init__(test_size, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
Parameters
  • test_size (float) – test size 0 to 1

  • drop_cold_items (bool) – flag to drop cold items from test

  • query_column (str) – query id column name

  • item_column (Optional[str]) – item id column name

  • timestamp_column (Optional[str]) – timestamp column name

  • session_id_column (Optional[str]) – Name of session id column whose values cannot be split.

  • session_id_processing_strategy (str) – How to process a session that the split would divide. Values: train, test. train: the whole session goes to train; test: the whole session goes to test. Default: test.
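
The behaviour shown in the example above can be reproduced with a short plain-Python sketch. It is an illustration, not RePlay's code: new_users_split is a hypothetical helper, and the rule for picking the time threshold (walk user start times from newest to oldest until at least test_size of the users is covered) is an assumption chosen because it matches the outputs shown.

```python
from collections import Counter


def new_users_split(rows, test_size):
    """Hypothetical helper mirroring the documented example."""
    # First interaction time of every user.
    start = {}
    for row in rows:
        user, ts = row["query_id"], row["timestamp"]
        start[user] = min(ts, start.get(user, ts))

    # Walk start times from newest to oldest until enough users are covered.
    users_per_start = Counter(start.values())
    covered = 0
    threshold = None
    for start_time in sorted(users_per_start, reverse=True):
        covered += users_per_start[start_time]
        threshold = start_time
        if covered / len(start) >= test_size:
            break

    # Train keeps only interactions before the threshold; test keeps all
    # interactions of users who first appeared at or after it.
    train = [row for row in rows if row["timestamp"] < threshold]
    test = [row for row in rows if start[row["query_id"]] >= threshold]
    return train, test


rows = [
    {"query_id": q, "item_id": i, "timestamp": t}
    for q, i, t in [(1, 1, 20), (1, 2, 40), (2, 3, 20), (2, 1, 30), (3, 2, 10), (4, 3, 40)]
]
train_small, test_small = new_users_split(rows, test_size=0.1)
train_large, test_large = new_users_split(rows, test_size=0.3)
```

With test_size=0.1 the threshold lands at timestamp 40, so only user 4 is in test. With test_size=0.3 it drops to 20, and train shrinks to the single interaction of user 3.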

ColdUserRandomSplitter

class replay.splitters.cold_user_random_splitter.ColdUserRandomSplitter(test_size, drop_cold_items=False, seed=None, query_column='query_id', item_column='item_id')

Test set consists of all actions of randomly chosen users.

__init__(test_size, drop_cold_items=False, seed=None, query_column='query_id', item_column='item_id')
Parameters
  • test_size (float) – fraction of users to be in test

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users – flag to drop cold users from test

  • seed (Optional[int]) – random seed

  • query_column (str) – query id column name

  • item_column (Optional[str]) – item id column name
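
A minimal plain-Python sketch of a cold-user split. The helper cold_user_split is hypothetical, and drawing exactly round(number of users * test_size) users is an assumption; RePlay may select users differently.

```python
import random


def cold_user_split(rows, test_size, seed=None):
    """Hypothetical helper: all actions of randomly chosen users go to test."""
    rng = random.Random(seed)
    users = sorted({row["query_id"] for row in rows})
    # Assumption: the number of test users is the rounded share of all users.
    n_test_users = round(len(users) * test_size)
    test_users = set(rng.sample(users, n_test_users))

    train = [row for row in rows if row["query_id"] not in test_users]
    test = [row for row in rows if row["query_id"] in test_users]
    return train, test


rows = [{"query_id": u, "item_id": i} for u in range(10) for i in range(3)]
train, test = cold_user_split(rows, test_size=0.3, seed=7)
```

No user appears on both sides, so every test user is cold with respect to train.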