Splitters
Splits data into train and test
Splits are returned with split
method.
- replay.splitters.base_splitter.Splitter.split(self, interactions)
Splits input DataFrame into train and test
- Parameters
interactions (
Union
[DataFrame
,DataFrame
,DataFrame
]) – input DataFrame[timestamp, user_id, item_id, relevance]
- Return type
Tuple
[Union
[DataFrame
,DataFrame
,DataFrame
],Union
[DataFrame
,DataFrame
,DataFrame
]]- Returns
List of splitted DataFrames
TwoStageSplitter
- class replay.splitters.two_stage_splitter.TwoStageSplitter(first_divide_size, second_divide_size, first_divide_column='query_id', second_divide_column='item_id', shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id', timestamp_column='timestamp')
Split data by two columns. First step: takes first_divide_size distinct values of first_divide_column to the test split. Second step: takes second_divide_size of second_divide_column among the data provided after first step to the test split.
Example:
>>> from replay.utils.session_handler import get_spark_session, State >>> spark = get_spark_session(1, 1) >>> state = State(spark)
>>> from replay.splitters import TwoStageSplitter >>> import pandas as pd >>> data_frame = pd.DataFrame({"query_id": [1,1,1,2,2,2], ... "item_id": [1,2,3,1,2,3], ... "relevance": [1,2,3,4,5,6], ... "timestamp": [1,2,3,3,2,1]}) >>> data_frame query_id item_id relevance timestamp 0 1 1 1 1 1 1 2 2 2 2 1 3 3 3 3 2 1 4 3 4 2 2 5 2 5 2 3 6 1 >>> train, test = TwoStageSplitter(first_divide_size=1, second_divide_size=2, seed=42).split(data_frame) >>> test query_id item_id relevance timestamp 3 2 1 4 3 4 2 2 5 2
>>> train, test = TwoStageSplitter(first_divide_size=0.5, second_divide_size=2, seed=42).split(data_frame) >>> test query_id item_id relevance timestamp 3 2 1 4 3 4 2 2 5 2
>>> train, test = TwoStageSplitter(first_divide_size=0.5, second_divide_size=0.7, seed=42).split(data_frame) >>> test query_id item_id relevance timestamp 3 2 1 4 3 4 2 2 5 2
- __init__(first_divide_size, second_divide_size, first_divide_column='query_id', second_divide_column='item_id', shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id', timestamp_column='timestamp')
- Parameters
second_divide_size (
float
) – fraction or a number of items per userfirst_divide_size (
float
) – similar toitem_test_size
, but corresponds to the number of users.None
is all available users.shuffle – take random items and not last based on
timestamp
.drop_cold_items (
bool
) – flag to drop cold items from testdrop_cold_users (
bool
) – flag to drop cold users from testseed (
Optional
[int
]) – random seedquery_column (
str
) – query id column nameitem_column (
Optional
[str
]) – item id column nametimestamp_column (
Optional
[str
]) – timestamp column name
KFolds
- replay.splitters.k_folds.KFolds(n_folds=5, strategy='query', drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
Splits interactions inside each query into folds at random.
TimeSplitter
- class replay.splitters.time_splitter.TimeSplitter(time_threshold, query_column='query_id', drop_cold_users=False, drop_cold_items=False, item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test', time_column_format='%Y-%m-%d %H:%M:%S')
Split interactions by time.
>>> from datetime import datetime >>> import pandas as pd >>> columns = ["query_id", "item_id", "timestamp"] >>> data = [ ... (1, 1, "01-01-2020"), ... (1, 2, "02-01-2020"), ... (1, 3, "03-01-2020"), ... (1, 4, "04-01-2020"), ... (1, 5, "05-01-2020"), ... (2, 1, "06-01-2020"), ... (2, 2, "07-01-2020"), ... (2, 3, "08-01-2020"), ... (2, 9, "09-01-2020"), ... (2, 10, "10-01-2020"), ... (3, 1, "01-01-2020"), ... (3, 5, "02-01-2020"), ... (3, 3, "03-01-2020"), ... (3, 1, "04-01-2020"), ... (3, 2, "05-01-2020"), ... ] >>> dataset = pd.DataFrame(data, columns=columns) >>> dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], format="%d-%m-%Y") >>> dataset query_id item_id timestamp 0 1 1 2020-01-01 1 1 2 2020-01-02 2 1 3 2020-01-03 3 1 4 2020-01-04 4 1 5 2020-01-05 5 2 1 2020-01-06 6 2 2 2020-01-07 7 2 3 2020-01-08 8 2 9 2020-01-09 9 2 10 2020-01-10 10 3 1 2020-01-01 11 3 5 2020-01-02 12 3 3 2020-01-03 13 3 1 2020-01-04 14 3 2 2020-01-05 >>> train, test = TimeSplitter( ... time_threshold=datetime.strptime("2020-01-04", "%Y-%M-%d") ... ).split(dataset) >>> train query_id item_id timestamp 0 1 1 2020-01-01 1 1 2 2020-01-02 2 1 3 2020-01-03 3 1 4 2020-01-04 10 3 1 2020-01-01 11 3 5 2020-01-02 12 3 3 2020-01-03 13 3 1 2020-01-04 >>> test query_id item_id timestamp 4 1 5 2020-01-05 5 2 1 2020-01-06 6 2 2 2020-01-07 7 2 3 2020-01-08 8 2 9 2020-01-09 9 2 10 2020-01-10 14 3 2 2020-01-05
- __init__(time_threshold, query_column='query_id', drop_cold_users=False, drop_cold_items=False, item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test', time_column_format='%Y-%m-%d %H:%M:%S')
- Parameters
time_threshold (
Union
[datetime
,str
,float
]) – Test threshold, can be datetime, string, int or float. datetime is in case of splitting by datetime, int is in case of splitting by datetime (Unix format), string will be converted to datetime usingtime_column_format
, float is in case of splitting by ratio, the value must be between 0 and 1.query_column (
str
) – Name of user interaction column.drop_cold_users (
bool
) – Drop users from test DataFrame. which are not in train DataFrame, default: False.drop_cold_items (
bool
) – Drop items from test DataFrame which are not in train DataFrame, default: False.item_column (
str
) – Name of item interaction column. Ifdrop_cold_items
isFalse
, then you can omit this parameter. Default:item_id
.timestamp_column (
str
) – Name of time column, Default:timestamp
.session_id_column (
Optional
[str
]) – Name of session id column, which values can not be split, default:None
.session_id_processing_strategy (
str
) – strategy of processing session if it is split, values:train, test
, train: whole split session goes to train. test: same but to test. default:test
.
LastNSplitter
- class replay.splitters.last_n_splitter.LastNSplitter(N, divide_column='query_id', time_column_format='yyyy-MM-dd HH:mm:ss', strategy='interactions', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
Split interactions by last N interactions/timedelta per user. Type of splitting depends on the
strategy
parameter.>>> from datetime import datetime >>> import pandas as pd >>> columns = ["query_id", "item_id", "timestamp"] >>> data = [ ... (1, 1, "01-01-2020"), ... (1, 2, "02-01-2020"), ... (1, 3, "03-01-2020"), ... (1, 4, "04-01-2020"), ... (1, 5, "05-01-2020"), ... (2, 1, "06-01-2020"), ... (2, 2, "07-01-2020"), ... (2, 3, "08-01-2020"), ... (2, 9, "09-01-2020"), ... (2, 10, "10-01-2020"), ... (3, 1, "01-01-2020"), ... (3, 5, "02-01-2020"), ... (3, 3, "03-01-2020"), ... (3, 1, "04-01-2020"), ... (3, 2, "05-01-2020"), ... ] >>> dataset = pd.DataFrame(data, columns=columns) >>> dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], format="%d-%m-%Y") >>> dataset query_id item_id timestamp 0 1 1 2020-01-01 1 1 2 2020-01-02 2 1 3 2020-01-03 3 1 4 2020-01-04 4 1 5 2020-01-05 5 2 1 2020-01-06 6 2 2 2020-01-07 7 2 3 2020-01-08 8 2 9 2020-01-09 9 2 10 2020-01-10 10 3 1 2020-01-01 11 3 5 2020-01-02 12 3 3 2020-01-03 13 3 1 2020-01-04 14 3 2 2020-01-05 >>> splitter = LastNSplitter( ... N=2, ... divide_column="query_id", ... time_column_format="yyyy-MM-dd", ... query_column="query_id", ... item_column="item_id" ... ) >>> train, test = splitter.split(dataset) >>> train query_id item_id timestamp 0 1 1 2020-01-01 1 1 2 2020-01-02 2 1 3 2020-01-03 5 2 1 2020-01-06 6 2 2 2020-01-07 7 2 3 2020-01-08 10 3 1 2020-01-01 11 3 5 2020-01-02 12 3 3 2020-01-03 >>> test query_id item_id timestamp 3 1 4 2020-01-04 4 1 5 2020-01-05 8 2 9 2020-01-09 9 2 10 2020-01-10 13 3 1 2020-01-04 14 3 2 2020-01-05
- __init__(N, divide_column='query_id', time_column_format='yyyy-MM-dd HH:mm:ss', strategy='interactions', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
- Parameters
N (
int
) – Array of interactions/timedelta to split.divide_column (
str
) – Name of column for dividing in dataframe, default:query_id
.time_column_format (
str
) – Format of time_column, needs for convert time_column into unix_timestamp type. If strategy is set to ‘interactions’, then you can omit this parameter. If time_column has already transformed into unix_timestamp type, then you can omit this parameter. default:yyyy-MM-dd HH:mm:ss
strategy (
Literal
['interactions'
,'timedelta'
]) – Defines the type of data splitting. Must beinteractions
ortimedelta
. default:interactions
.query_column (
str
) – Name of query interaction column.drop_cold_users (
bool
) – Drop users from test DataFrame. which are not in train DataFrame, default: False.drop_cold_items (
bool
) – Drop items from test DataFrame which are not in train DataFrame, default: False.item_column (
str
) – Name of item interaction column. Ifdrop_cold_items
isFalse
, then you can omit this parameter. Default:item_id
.timestamp_column (
str
) – Name of time column, Default:timestamp
.session_id_column (
Optional
[str
]) – Name of session id column, which values can not be split, default:None
.session_id_processing_strategy (
str
) – strategy of processing session if it is split, values:train, test
, train: whole split session goes to train. test: same but to test. default:test
.
RatioSplitter
- class replay.splitters.ratio_splitter.RatioSplitter(test_size, divide_column='query_id', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', min_interactions_per_group=None, split_by_fractions=True, session_id_column=None, session_id_processing_strategy='test')
Split interactions into train and test by ratio. Split is made for each user separately.
>>> from datetime import datetime >>> import pandas as pd >>> columns = ["query_id", "item_id", "timestamp"] >>> data = [ ... (1, 1, "01-01-2020"), ... (1, 2, "02-01-2020"), ... (1, 3, "03-01-2020"), ... (1, 4, "04-01-2020"), ... (1, 5, "05-01-2020"), ... (2, 1, "06-01-2020"), ... (2, 2, "07-01-2020"), ... (2, 3, "08-01-2020"), ... (2, 9, "09-01-2020"), ... (2, 10, "10-01-2020"), ... (3, 1, "01-01-2020"), ... (3, 5, "02-01-2020"), ... (3, 3, "03-01-2020"), ... (3, 1, "04-01-2020"), ... (3, 2, "05-01-2020"), ... ] >>> dataset = pd.DataFrame(data, columns=columns) >>> dataset["timestamp"] = pd.to_datetime(dataset["timestamp"], format="%d-%m-%Y") >>> dataset query_id item_id timestamp 0 1 1 2020-01-01 1 1 2 2020-01-02 2 1 3 2020-01-03 3 1 4 2020-01-04 4 1 5 2020-01-05 5 2 1 2020-01-06 6 2 2 2020-01-07 7 2 3 2020-01-08 8 2 9 2020-01-09 9 2 10 2020-01-10 10 3 1 2020-01-01 11 3 5 2020-01-02 12 3 3 2020-01-03 13 3 1 2020-01-04 14 3 2 2020-01-05 >>> splitter = RatioSplitter( ... test_size=0.5, ... divide_column="query_id", ... query_column="query_id", ... item_column="item_id" ... ) >>> train, test = splitter.split(dataset) >>> train query_id item_id timestamp 0 1 1 2020-01-01 1 1 2 2020-01-02 5 2 1 2020-01-06 6 2 2 2020-01-07 10 3 1 2020-01-01 11 3 5 2020-01-02 >>> test query_id item_id timestamp 2 1 3 2020-01-03 3 1 4 2020-01-04 4 1 5 2020-01-05 7 2 3 2020-01-08 8 2 9 2020-01-09 9 2 10 2020-01-10 12 3 3 2020-01-03 13 3 1 2020-01-04 14 3 2 2020-01-05
- __init__(test_size, divide_column='query_id', drop_cold_users=False, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', min_interactions_per_group=None, split_by_fractions=True, session_id_column=None, session_id_processing_strategy='test')
- Parameters
ratio – test size, must be in \((0, 1)\).
divide_column (
str
) – Name of column for dividing in dataframe, default:query_id
.drop_cold_users (
bool
) – Drop users from test DataFrame. which are not in train DataFrame, default: False.drop_cold_items (
bool
) – Drop items from test DataFrame which are not in train DataFrame, default: False.query_column (
str
) – Name of query interaction column. Ifdrop_cold_users
isFalse
, then you can omit this parameter. Default:query_id
.item_column (
str
) – Name of item interaction column. Ifdrop_cold_items
isFalse
, then you can omit this parameter. Default:item_id
.timestamp_column (
str
) – Name of time column, Default:timestamp
.min_interactions_per_group (
Optional
[int
]) – minimal required interactions per group to make first split. if value is less than min_interactions_per_group, than whole group goes to train. If not set, than any amount of interactions will be split. default:None
.split_by_fractions (
bool
) – the variable that is responsible for using the split by fractions. Split by fractions means that each line is marked with its fraq (line number / number of lines) and only those lines with a fraq > test_ratio get into the test. Split not by fractions means that the number of rows in the train is calculated by rounding the formula: the total number of rows minus the number of rows multiplied by the test ratio. The difference between these two methods is that due to rounding in the second method, 1 more interaction in each group (1 item for each user) falls into the train. When split by fractions, these items fall into the test. default:True
.session_id_column (
Optional
[str
]) – Name of session id column, which values can not be split, default:None
.session_id_processing_strategy (
str
) – strategy of processing session if it is split, values:train, test
, train: whole split session goes to train. test: same but to test. default:test
.
RandomSplitter
- class replay.splitters.random_splitter.RandomSplitter(test_size, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id')
Assign records into train and test at random.
- __init__(test_size, drop_cold_items=False, drop_cold_users=False, seed=None, query_column='query_id', item_column='item_id')
- Parameters
test_size (
float
) – test size 0 to 1drop_cold_items (
bool
) – flag to drop cold items from testdrop_cold_users (
bool
) – flag to drop cold users from testseed (
Optional
[int
]) – random seedquery_column (
str
) – Name of query interaction columnitem_column (
str
) – Name of item interaction column
NewUsersSplitter
- class replay.splitters.new_users_splitter.NewUsersSplitter(test_size, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
Only new users will be assigned to test set. Splits interactions by timestamp so that test has test_size fraction of most recent users.
>>> from replay.splitters import NewUsersSplitter >>> import pandas as pd >>> data_frame = pd.DataFrame({"query_id": [1,1,2,2,3,4], ... "item_id": [1,2,3,1,2,3], ... "relevance": [1,2,3,4,5,6], ... "timestamp": [20,40,20,30,10,40]}) >>> data_frame query_id item_id relevance timestamp 0 1 1 1 20 1 1 2 2 40 2 2 3 3 20 3 2 1 4 30 4 3 2 5 10 5 4 3 6 40 >>> train, test = NewUsersSplitter(test_size=0.1).split(data_frame) >>> train query_id item_id relevance timestamp 0 1 1 1 20 2 2 3 3 20 3 2 1 4 30 4 3 2 5 10 >>> test query_id item_id relevance timestamp 0 4 3 6 40
Train DataFrame can be drastically reduced even with moderate test_size if the amount of new users is small.
>>> train, test = NewUsersSplitter(test_size=0.3).split(data_frame) >>> train query_id item_id relevance timestamp 4 3 2 5 10
- __init__(test_size, drop_cold_items=False, query_column='query_id', item_column='item_id', timestamp_column='timestamp', session_id_column=None, session_id_processing_strategy='test')
- Parameters
test_size (
float
) – test size 0 to 1drop_cold_items (
bool
) – flag to drop cold items from testquery_column (
str
) – query id column nameitem_column (
Optional
[str
]) – item id column nametimestamp_column (
Optional
[str
]) – timestamp column namesession_id_column (
Optional
[str
]) – name of session id column, which values can not be split.session_id_processing_strategy (
str
) – strategy of processing session if it is split, values:train, test
, train: whole split session goes to train. test: same but to test. default:test
.
ColdUserRandomSplitter
- class replay.splitters.cold_user_random_splitter.ColdUserRandomSplitter(test_size, drop_cold_items=False, seed=None, query_column='query_id', item_column='item_id')
Test set consists of all actions of randomly chosen users.
- __init__(test_size, drop_cold_items=False, seed=None, query_column='query_id', item_column='item_id')
- Parameters
test_size (
float
) – fraction of users to be in testdrop_cold_items (
bool
) – flag to drop cold items from testdrop_cold_users – flag to drop cold users from test
seed (
Optional
[int
]) – random seedquery_column (
str
) – query id column nameitem_column (
Optional
[str
]) – item id column name