Using Tableau to detect outlying trends and ruptures

Tableau can detect outlying trends by delegating the computation to TabPy.

To do that, we are going to compute the outliers of a dataset, its change points, and a piecewise linear approximation.

For detailed information about the source code, you can look at this article:

Outlier and Change Point Detection in Data Integration Performance Metrics

The steps to adapt the code for Tableau and tabpy are the following:

  • Prepare the data to fit the format wanted,
  • Create new functions to return lists with the right dimension for Tableau,
  • Deploy the new functions,
  • Call them in Tableau.

Below, in order, are the results for the outliers, the segmentation, and the approximation of one series (JOB_DA_AUDIT).

The data points represent batch executions.

We are working in an Anaconda environment with a Jupyter notebook.

You can find the notebook and the Tableau workbook on our GitHub: https://github.com/aetperf/aetp-tableau-tabpy.git

Prepare data for TabPy

TabPy expects lists as input and returns lists as output. Here, two lists are expected as arguments: one with the dates for the index, and one with the elapsed-time data.

If we look at the detect_outliers function (see the article linked above for the code), it needs a pandas Series with a datetime index and a daily frequency.

The first step is therefore to transform the input lists into a pandas Series with a datetime index. This is done in the prepare_data function below.

import pandas as pd


def prepare_data(data, index):
    # format the index to get a datetime index (dates only)
    t1 = pd.to_datetime(index)
    t2 = t1.to_frame()[0].dt.date.values
    index_date = pd.to_datetime(t2)

    # create the dataframe
    df = pd.DataFrame(data, index=index_date)
    df_notnull = df[pd.notnull(df.index)]

    # resample daily and transform to a Series
    df_notnull = df_notnull.resample("D").mean()
    ts = df_notnull.squeeze()
    return ts, df
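To illustrate the transformation, here is a small example with hypothetical input lists (the dates and values are made up): two executions on the same day are averaged by the daily resampling, and days without any execution show up as null values.

```python
import pandas as pd

# Hypothetical input, in the shape Tableau passes to TabPy: two plain lists
data = [12.0, 14.5, 13.0, 90.0]
index = ["2023-01-01 08:00", "2023-01-01 20:00", "2023-01-02 08:00", "2023-01-04 08:00"]

# Same transformation as in prepare_data: date-only datetime index, daily resampling
idx = pd.to_datetime(pd.to_datetime(index).to_frame()[0].dt.date.values)
ts = pd.DataFrame(data, index=idx).resample("D").mean().squeeze()

print(ts)
# The two runs on 2023-01-01 are averaged to 13.25;
# 2023-01-03 has no execution and appears as NaN
```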

Create the functions with Python

After preparing the data, we can now find the outliers.

« detect_outliers » returns a dataframe with a column « outlier » that contains True if the point is an outlier.
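The actual implementation of detect_outliers is in the article linked above. As a rough stand-in to keep the examples here self-contained, a simple rolling-median rule produces the same interface; the method, window, and threshold below are illustrative assumptions, not the article's algorithm.

```python
import pandas as pd

def detect_outliers(ts, window=7, k=3.0):
    # Toy stand-in for the detect_outliers of the linked article:
    # flag points whose deviation from a rolling median exceeds
    # k times the median absolute deviation.
    ts = ts.dropna()  # null values are dropped, as in the real function
    med = ts.rolling(window, center=True, min_periods=1).median()
    resid = (ts - med).abs()
    mad = resid.median()
    return pd.DataFrame({"outlier": resid > k * mad}, index=ts.index)
```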

Tableau expects the output size to be equal to the input size. In « detect_outliers », the null values are dropped, so the input and output lengths would differ if there are null values. Before returning the result, we merge the data frames to restore the right length.
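A minimal illustration of this length-restoring merge, with a toy result frame standing in for detect_outliers' output:

```python
import pandas as pd

idx = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"])
df1 = pd.DataFrame([10.0, None, 30.0], index=idx)  # original input: 3 rows

# Toy result frame standing in for detect_outliers' output:
# one row was dropped, so it only has 2 rows
outliers = pd.DataFrame({"outlier": [False, True]}, index=idx[[0, 2]])

# A left merge on the index restores the original length;
# dropped rows come back as NaN in the "outlier" column
concat = pd.merge(df1, outliers, left_index=True, right_index=True, how="left")
print(concat["outlier"].values.tolist())
```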

The « error » function writes the error to the log with the message given as its first argument. It creates a list of the right size (here len(data)) filled with the value given as its second argument. This avoids having Tableau's alert pop up: the error is written to the console and to the log, and a constant list is sent as output.
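The « error » helper is not shown in the article; here is a minimal sketch matching its description (the exact logging setup is an assumption):

```python
import logging

def error(message, value, size):
    # Write the error to the console and to the log, then return a
    # constant list of the expected length so that Tableau receives
    # an output matching the input size.
    print(message)
    logging.error(message)
    return [value] * size
```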

def detect_outliers_list(data, index):
    ts, df1 = prepare_data(data, index)

    if not isinstance(ts, pd.core.series.Series):
        return error("ts must be a Pandas Series", True, len(data))
    if not isinstance(ts.index, pd.DatetimeIndex):
        return error("Index must be a DatetimeIndex", True, len(data))
    if ts.index.freq != "D":
        return error("Day frequency needed", True, len(data))

    outliers = detect_outliers(ts)
    concat = pd.merge(df1, outliers, left_index=True, right_index=True, how="left")
    return concat["outlier"].values.tolist()

« extract_features » returns a dataframe with the slope, the segment number, and the value of the trend.
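Like detect_outliers, the real extract_features lives in the linked article. Below is a toy stand-in with the same output columns: it places a single change point at the largest jump and fits one least-squares line per segment. The segmentation method and the meaning of the second argument are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def extract_features(ts, drop_na=True):
    # Toy stand-in for the extract_features of the linked article:
    # one change point at the largest jump, one fitted line per segment.
    if drop_na:  # second argument assumed to control dropping nulls
        ts = ts.dropna()
    x = np.arange(len(ts))
    # single change point at the largest jump between consecutive points
    cp = int(np.argmax(np.abs(np.diff(ts.values)))) + 1
    seg = np.where(x < cp, 0, 1)
    piecelin = np.empty(len(ts))
    slope = np.empty(len(ts))
    for s in (0, 1):
        m = seg == s
        a, b = np.polyfit(x[m], ts.values[m], 1)  # slope, intercept
        piecelin[m] = a * x[m] + b
        slope[m] = a
    return pd.DataFrame(
        {"segment": seg, "piecelin": piecelin, "slope": slope}, index=ts.index
    )
```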

In « extract_feature_list », an extra argument allows choosing whether to return the trend values, the segments, or the slope.

def extract_feature_list(data, index, element):
    ts, df1 = prepare_data(data, index)

    if not isinstance(ts, pd.core.series.Series):
        return error("ts must be a Pandas Series", 0, len(data))
    if not isinstance(ts.index, pd.DatetimeIndex):
        return error("Index must be a DatetimeIndex", 0, len(data))
    if ts.index.freq != "D":
        return error("Day frequency needed", 0, len(data))

    res = extract_features(ts, True)
    concat = pd.merge(df1, res, left_index=True, right_index=True, how="left")
    return concat[element].values.tolist()

« get_segment » returns a list of integers: every point that belongs to the same segment gets the same number.

« get_approximation » returns the value of the trend for each date.

« get_slope » returns the slope of each piecewise approximation. Each data point has the slope of the corresponding segment.

def get_segment(data, index):
    return extract_feature_list(data, index, "segment")


def get_approximation(data, index):
    return extract_feature_list(data, index, "piecelin")


def get_slope(data, index):
    return extract_feature_list(data, index, "slope")

Deploy the functions in TabPy

TabPy needs to be started. We use the default command « tabpy » in the Anaconda prompt.

The functions need to be deployed on the TabPy server to be used in Tableau, using the code below.

from tabpy.tabpy_tools.client import Client

connection = Client("http://localhost:9004/")
connection.deploy(
    "detect_outliers_list",
    detect_outliers_list,
    "return true if outlier",
    override=True,
)
connection.deploy(
    "get_segment", get_segment, 
    "return which segment it belongs to", override=True
)
connection.deploy(
    "get_approximation",
    get_approximation,
    "return piecewise approximation",
    override=True,
)
connection.deploy(
    "get_slope", get_slope, 
    "return slope of the piecewise approximation", override=True
)


Using TabPy in Tableau

Then, in Tableau, we call the functions by using « tabpy.query » in calculated fields. They call the appropriate function to get the outliers, the segments, the approximation, and the slope. It is important to compute the field along the date [Dt Deb Donnees] to have the right format.

« detect_outliers_list » returns a list of booleans. Therefore, « SCRIPT_BOOL » must be used to tell Tableau that the output type will be boolean.

The elapsed time is plotted against the date. By adding the outlier calculated field to the color mark of the plot, we get the following result (outliers are in orange).

//Outliers
SCRIPT_BOOL("return tabpy.query('detect_outliers_list',_arg1, _arg2 )['response']", SUM([elapsed time]), MIN([Dt Deb Donnees]))

« get_segment » returns a list of integers, so « SCRIPT_INT » must be used to tell Tableau that the output type will be integer.

In the chart, the different colors show the different segments. We can see there are 8 segments.

//Segment
SCRIPT_INT("return tabpy.query('get_segment',_arg1, _arg2 )['response']", SUM([elapsed time]), MIN([Dt Deb Donnees]))

« get_approximation » and « get_slope » both return lists of floats. Therefore, « SCRIPT_REAL » is used to tell Tableau that the output type will be real.

The piecewise approximation curve is shown in red with the data points in blue. The tooltip shows the slope of the curve.

//Curve 
SCRIPT_REAL("return tabpy.query('get_approximation',_arg1, _arg2 )['response']", SUM([elapsed time]), MIN([Dt Deb Donnees])) 
//Slope 
SCRIPT_REAL("return tabpy.query('get_slope',_arg1, _arg2 )['response']", SUM([elapsed time]),MIN([Dt Deb Donnees]))

To visualize the outlying slope, we create a calculated field « alert » with a parameter that is used as a threshold above which the slope is considered as abnormal. The calculated field is applied to the color mark of the curve.

//alert 
IF ABS([Slope]) > [Slope alert (>0)] THEN "abnormal" ELSE "ok" END

It is possible to visualize the trends on multiple series on the same graph. The functions will be called for each series.

We can visualize the trends of multiple series on a dashboard by choosing a series with a selector.

Conclusion

We have seen that Tableau can detect outlying trends using machine learning techniques thanks to TabPy. In this article, the data came from batch job executions, but the indicators could easily be switched to supply-chain, sales, or website analytics.

The execution of Python code is a great extension that opens up many new possibilities, because it enriches Tableau with machine learning insights.

However, the execution is not as fast as using Tableau alone.