Settings

Spark session

This library uses session_handler.State to provide universal access to the same Spark session for all modules. A default session is created automatically and can be accessed via the session attribute.

from replay.utils.session_handler import State
State().session  # the session shared by all RePlay modules

There is also a helper function that applies basic settings when creating a Spark session:

replay.utils.session_handler.get_spark_session(spark_memory=None, shuffle_partitions=None, core_count=None)

Get default SparkSession

Parameters
  • spark_memory (Optional[int]) – GB of memory allocated for Spark; 70% of RAM by default.

  • shuffle_partitions (Optional[int]) – number of shuffle partitions for Spark; three times the CPU count by default.

  • core_count (Optional[int]) – number of cores to use; -1 means all available cores. If None, the REPLAY_SPARK_CORE_COUNT environment variable is checked; if it is not set, -1 is used. Default: None.

Return type

SparkSession
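
For example, all three parameters can be set explicitly; the values below are illustrative only, not recommended settings:

from replay.utils.session_handler import get_spark_session
session = get_spark_session(spark_memory=4, shuffle_partitions=12, core_count=-1)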

You can pass any Spark session to State to make it available throughout the library.

from replay.utils.session_handler import get_spark_session
session = get_spark_session(2)  # 2 GB of memory for Spark
State(session)  # register the session for all RePlay modules

class replay.utils.session_handler.State(session=None)

All modules look for the Spark session via this class. You can set your own session here.
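
For instance, once a session has been registered, later State() calls return it. A minimal sketch, assuming the singleton behaviour described above:

from replay.utils.session_handler import State, get_spark_session
custom_session = get_spark_session(2)
State(custom_session)  # register the session once
assert State().session is custom_session  # any later lookup sees the same session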

Logging

The logger name is replay. The default level is logging.INFO.

import logging
logger = logging.getLogger("replay")
logger.setLevel(logging.DEBUG)
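
Because replay uses the standard logging module, the usual handlers apply. As a sketch, you can send the library's messages to a file (the file name below is just an example):

import logging
logger = logging.getLogger("replay")
handler = logging.FileHandler("replay.log")  # example destination
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)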