Data Exploration and Profiling
- class luminaire.exploration.data_exploration.DataExploration(freq='D', min_ts_mean=None, fill_rate=None, sig_level=0.05, min_ts_length=None, max_ts_length=None, is_log_transformed=None, data_shift_truncate=True, min_changepoint_padding_length=None, change_point_threshold=2, window_length=None, *args, **kwargs)
This is a general class for time series data exploration and pre-processing.
- Parameters:
freq (str) – The frequency of the time-series. A Pandas offset such as ‘D’, ‘H’, or ‘M’. Luminaire currently supports the following pandas frequency types: ‘H’, ‘D’, ‘W’, ‘W-SUN’, ‘W-MON’, ‘W-TUE’, ‘W-WED’, ‘W-THU’, ‘W-FRI’, ‘W-SAT’.
sig_level (float) – The significance level to use for any statistical test within data profiling. This should be a number between 0 and 1.
min_ts_mean (float) – The minimum mean value of the time series required for the model to run. For data that originated as integers (such as counts), the ARIMA model can behave erratically when the numbers are small. When this parameter is set, any time series whose mean value is less than this will automatically result in a model failure, rather than a mostly bogus anomaly.
fill_rate (float) – Minimum proportion of data availability in the recent data window. Should be a fraction between 0 and 1.
max_window_length (int) – The maximum size of the sub windows for input data segmentation.
window_length (int) – The size of the sub windows for input data segmentation.
min_ts_length (int) – The minimum required length of the time series for training.
max_ts_length (int) – The maximum allowed length of the time series for training.
is_log_transformed (bool) – A flag to specify whether to take a log transform of the input data. If the data contain negative values, is_log_transformed is ignored even if it is set to True.
data_shift_truncate (bool) – A flag to specify whether the data to the left of (i.e. before) the most recent change point should be truncated from the training data.
min_changepoint_padding_length (int) – A padding length between two change points. This parameter makes sure that two consecutive change points are not close to each other.
change_point_threshold (float) – Minimum threshold (a value > 0) to flag change points based on KL divergence. This parameter can be used to tune the sensitivity of the change point detection method.
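To give a feel for how change_point_threshold relates to KL divergence, here is a minimal sketch that compares the windows on either side of a candidate change point. It assumes each window is summarised by a normal distribution and uses the closed-form Gaussian KL divergence. The helper name `gaussian_kl` and the choice of windows are illustrative assumptions, not Luminaire's internal implementation.

```python
import math
import statistics


def gaussian_kl(before, after):
    """KL(P || Q) between normal fits of two windows (hypothetical helper).

    P is fitted to `before`, Q to `after`. Both windows need non-zero variance.
    """
    mu_p, mu_q = statistics.fmean(before), statistics.fmean(after)
    var_p, var_q = statistics.pvariance(before), statistics.pvariance(after)
    return (
        0.5 * math.log(var_q / var_p)
        + (var_p + (mu_p - mu_q) ** 2) / (2 * var_q)
        - 0.5
    )


# Two windows with the same spread but a clear level shift.
before = [10, 11, 9, 10, 10, 11, 9, 10]
after = [20, 21, 19, 20, 20, 21, 19, 20]

change_point_threshold = 2  # default value from the class signature
kl = gaussian_kl(before, after)
is_change_point = kl > change_point_threshold
```

Under these assumptions, raising the threshold makes the check less sensitive, since only larger distributional shifts exceed it.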
- add_missing_index(df=None, freq=None)
This function reindexes a pandas dataframe with missing dates for a given time series frequency.
Note: If duplicate dates are present in the dataframe, this function averages the values for each duplicated date and merges them into a single row.
- Parameters:
df (pandas.DataFrame) – Input pandas dataframe containing the time series
freq (str) – The frequency of the time-series. A Pandas offset such as ‘D’, ‘H’, or ‘M’
- Returns:
pandas dataframe after reindexing missing data dates
- Return type:
pandas.DataFrame
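The behaviour described above can be approximated in plain pandas. The sketch below is not Luminaire's code: the helper name `reindex_missing_dates` is hypothetical, and leaving the newly inserted dates as NaN is an assumption about what "reindexing" means here.

```python
import pandas as pd


def reindex_missing_dates(df, freq="D"):
    """Average duplicate timestamps, then add rows for missing ones (hypothetical helper)."""
    # Collapse duplicate index entries into a single row holding their mean.
    deduped = df.groupby(level=0).mean()
    # Build the complete index at the requested frequency and align to it;
    # timestamps absent from the input come back as NaN (an assumption).
    full_index = pd.date_range(deduped.index.min(), deduped.index.max(), freq=freq)
    return deduped.reindex(full_index)


df = pd.DataFrame(
    {"raw": [10.0, 20.0, 30.0]},
    index=pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-03"]),
)
result = reindex_missing_dates(df, freq="D")
```

Here the two rows for 2020-01-01 are merged into their mean, and 2020-01-02 appears as a new row with a missing value.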
- kf_naive_outlier_detection(input_series, idx_position)
This function detects whether the value at the specified index position of the series is an outlier.
- Parameters:
input_series (numpy.array) – Input time series
idx_position (int) – Target index position
- Returns:
Anomaly flag
- Return type:
bool
>>> input_series = [110, 119, 316, 248, 451, 324, 241, 275, 381]
>>> self.kf_naive_outlier_detection(input_series, 6)
False
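The reference above does not describe the algorithm behind this method. Assuming the `kf` prefix refers to a Kalman filter, one simple way to flag a point is to run a one-dimensional local-level filter up to the target position and compare the one-step-ahead residual with its predicted standard deviation. Everything in the sketch below (the helper name `local_level_outlier`, the fixed variances, the cutoff `z`) is an illustrative assumption and will not necessarily reproduce Luminaire's results.

```python
import math


def local_level_outlier(series, idx_position, process_var=1.0, obs_var=1.0, z=3.0):
    """Flag series[idx_position] when its one-step-ahead residual is large.

    Hypothetical helper: a local-level Kalman filter with fixed, hand-picked
    variances. Expects a non-negative idx_position.
    """
    state, state_var = float(series[0]), obs_var
    for t in range(1, idx_position + 1):
        # Predict: the level is assumed to follow a random walk.
        pred_var = state_var + process_var
        innovation = series[t] - state
        innovation_var = pred_var + obs_var
        if t == idx_position:
            return abs(innovation) > z * math.sqrt(innovation_var)
        # Update the level estimate with the new observation.
        gain = pred_var / innovation_var
        state += gain * innovation
        state_var = (1.0 - gain) * pred_var
    return False  # idx_position == 0: no earlier data to compare against


series = [10, 10, 10, 10, 10, 10, 100, 10]
spike_flag = local_level_outlier(series, 6)
flat_flag = local_level_outlier(series, 3)
```

In this toy series the spike at position 6 is flagged, while position 3, which matches the preceding level, is not.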
- profile(df, impute_only=False, **kwargs)
This function performs required data profiling and pre-processing before hyperparameter optimization or time series model training.
- Parameters:
df (list/pandas.DataFrame) – Input time series.
impute_only (bool) – If True, preprocessing stops after imputation; if False, the full preprocessing is performed.
- Returns:
Preprocessed dataframe with batch data summary.
- Return type:
tuple[pandas.DataFrame, dict]
>>> de_obj = DataExploration(freq='D', data_shift_truncate=1, is_log_transformed=0, fill_rate=0.9)
>>> data
               raw
index
2020-01-01  1326.0
2020-01-02  1552.0
2020-01-03  1432.0
2020-01-04  1470.0
2020-01-05  1565.0
...            ...
2020-06-03  1934.0
2020-06-04  1873.0
2020-06-05  1674.0
2020-06-06  1747.0
2020-06-07  1782.0
>>> data, summary = de_obj.profile(data)
>>> data, summary
(               raw  interpolated
 2020-03-16  1371.0        1371.0
 2020-03-17  1325.0        1325.0
 2020-03-18  1318.0        1318.0
 2020-03-19  1270.0        1270.0
 2020-03-20  1116.0        1116.0
 ...            ...           ...
 2020-06-03  1934.0        1934.0
 2020-06-04  1873.0        1873.0
 2020-06-05  1674.0        1674.0
 2020-06-06  1747.0        1747.0
 2020-06-07  1782.0        1782.0
 [84 rows x 2 columns],
 {'success': True,
  'trend_change_list': ['2020-04-01 00:00:00'],
  'change_point_list': ['2020-03-16 00:00:00'],
  'is_log_transformed': 0,
  'min_ts_mean': None,
  'ts_start': '2020-01-01 00:00:00',
  'ts_end': '2020-06-07 00:00:00'})
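As a small follow-up to the example above, the sketch below shows one way to consume the returned summary without importing Luminaire. The dictionary is copied from the example output. The helper `parse_ts` and the timestamp format string are assumptions based on how the values are printed there.

```python
from datetime import datetime

# Summary dictionary as printed in the profile() example.
summary = {
    'success': True,
    'trend_change_list': ['2020-04-01 00:00:00'],
    'change_point_list': ['2020-03-16 00:00:00'],
    'is_log_transformed': 0,
    'min_ts_mean': None,
    'ts_start': '2020-01-01 00:00:00',
    'ts_end': '2020-06-07 00:00:00',
}


def parse_ts(value):
    """Parse a timestamp string in the format shown in the summary (hypothetical helper)."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


change_points = []
rows_after_truncation = None
if summary['success']:
    change_points = [parse_ts(v) for v in summary['change_point_list']]
    ts_end = parse_ts(summary['ts_end'])
    span_days = (ts_end - parse_ts(summary['ts_start'])).days
    if change_points:
        # Daily rows from the most recent change point through ts_end, inclusive.
        rows_after_truncation = (ts_end - change_points[-1]).days + 1
```

With daily data and data_shift_truncate enabled, the count from the most recent change point to ts_end comes to 84, which is consistent with the 84 rows in the profiled frame shown above.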
- stream_profile(df, impute_only=False, **kwargs)
This function performs data preparation for streaming data.
- Parameters:
df (list/pandas.DataFrame) – Input time series.
impute_only (bool) – If True, preprocessing stops after imputation; if False, the full preprocessing is performed.
kwargs – Other input parameters.
- Returns:
Prepared pandas dataframe with profile information.
- Return type:
tuple[pandas.DataFrame, dict]