Determining the most highly correlated pairs

You can calculate the correlation coefficient of two items using g_cor(G;S;X;Y), but sometimes it is necessary to compute the correlation coefficient for multiple pairs and determine which items are the most highly correlated.



You want to identify the relationship between multiple pairs of items determine which items are most closely related. For example, you want to know how a day's weather condition correlates to the products sold that day. Then you want to select the most correlated product for each weather condition.


  <block name="correlate_weather_condition" condition="">
    <base table=""/>
    <sel value="year(datefor)=2016"/>
    <sel value="period<>1"/>
    <willbe name="conditions2" value="if((contains(conditions;'Snow'));
    <link table2="pub.doc.retail.altseg.sales_detail_transid" col="zipcode,
    dateof" col2="zip_txt,trans_date" type="select" expand="1">
      <sel value="year(trans_date)=2016"/>
      <link table2="pub.doc.retail.altseg.stores" col="store" col2="store" 
        <sel value="state='CA'"/>
      <willbe name="zip_txt" value="strsubst(padleft(string(zip);5)
      ;' ';0;'0')"/>
      <link table2="" 
       col="trans_date,zip_txt" col2="dateof,zipcode" type="include"/>
      <tabu label="Tabulation on Sales Detail" breaks="zip_txt,
        <tcol fun="sum" name="vol" source="qty" label="Volume"/>
      <willbe name="relative_vol" value="vol/g_avg(zip_txt;;vol)" 
       label="Volume Relative to`Avg for Zip"/>
      <colord cols="zip_txt,trans_date,sku,relative_vol"/>
    <colord cols="zipcode,dateof,period,datefor,conditions,sku,
    <willbe name="target_condition" value="'{@condition}'"/>
    <willbe name="is_target_condition" value="conditions=target_condition"/>
    <willbe name="target_condition_cumcnt" 
     value="g_cumsum(dateof zipcode sku;;period;is_target_condition)"/>
    <willbe name="pct_target_condition_in_forecast_horizon" 
    <willbe name="sku_horizon_corr" 
    value="g_cor(sku period;;relative_vol;
     pct_target_condition_in_forecast_horizon)" format="dec:9"/>
    <willbe name="abs_sku_horizon_corr" value="abs(sku_horizon_corr)" 
    <willbe name="num_obs" value="g_cnt(sku period;)"/>
    <sel value="num_obs>=100"/>
    <sel value="g_first1(sku period;;)"/>
    <sel value="g_hi(sku;;abs_sku_horizon_corr)=abs_sku_horizon_corr"/>
    <sel value="g_last1(sku;;period)"/>
    <sel value="(abs_sku_horizon_corr<>NA)"/>
    <willbe name="rank_corr" value="g_rank(;;;abs_sku_horizon_corr)"/>
    <sel value="rank_corr<=10"/>
    <link table2="pub.doc.retail.altseg.products" col="sku" col2="sku" 
    <colord cols="period,sku,description,target_condition,
<block name="top_correlated">
  <call block="correlate_weather_condition" condition="Rain"/>
  <foreach target="Clear,Fog,Thunderstorm,Overcast,Snow,Ice Pellets">
    <merge table2="">
      <call block="correlate_weather_condition" condition="{@target}"/>
  <sort col="abs_sku_horizon_corr" dir="down"/>
  <sel value="g_first1(target_condition;;)"/>


The use of blocks is helpful when you want to perform the same set of operations multiple times. There are two ways to reuse a query contained in a <block>. One way is to put the block inside a <library>. The other way is to use a <defblock>. With both methods, when you want to use the block, you need to use either the <insert> or <call>operation. Additionally, when using <library>, your library either needs to be present in your current query, or if it is saved elsewhere, it needs to be imported using the <import> operation.

For this recipe, the <block> is contained in a <library>, which is present at the beginning of the query. The block is structured so that you can correlate to any condition. This is done by substituting a variable, condition, which is initialized in the block, for every instance in the query that requires a named condition. Therefore, when you insert the <block>, you also state which condition you want to correlate to.

Before starting the analysis, the Forecast table is reduced by selecting only future weather forecasts, i.e. where the dateof is not equal to the datefor, or the period is not 1, and also by selecting forecasts for the year 2016. Then, you want to compare every forecast to every item bought on that day. A <link> is performed to combine the current Forecast worksheet to the Sales Detail table. However, in order to compare all combinations, expand="1" needs to be specified. This attribute will duplicate each entry in the Forecast worksheet to allow for multiple matches in the foreign table.

Within the link to Sales Detail, multiple operations are used to reduce the table. Similar to the reductions on the Forecast table, only transactions occurring in 2016 are selected. A link and select to the Store Master table, causes only the transactions from stores in California to be retained. In order to make a further reduction, and strictly include transactions from stores in locations that there is forecast data for, a link and include is performed with the Forecast table. Now that the table is reduced, a tabulation is performed. This tabulation finds the total amount of each SKU sold on each date for each store location.

After the worksheets are linked, you are left with a record of the relative volume sold for each product, location, date combination. To start calculating the correlation coefficients, you need to first flag the weather condition you want to correlate to. Then you count how many times that condition will appear within the forecast horizon and determine the percentage.

The correlation between the relative volume sold and the percentage of days the specified weather condition appears in the forecast horizon is determined with g_cor(G;S;X;Y). However, this function requires the G argument to contain the column(s) on which the table is segmented, and currently the worksheet is segmented by zipcode. Therefore, an empty merge, <merge/>, is performed to condense the table to a single segment.
Note: You should only perform an empty merge on tables smaller than ten million rows, or the system will have difficulty completing the query.
After the table is condensed into a single segment, you can either leave the G argument blank, or use your desired grouping columns.

To make sure you are including results calculated from a suitable amount of data points, a selection is done to retain sku/period combinations that had greater than 100 observations. Additional selections are done to select rows where the correlation coefficient isn't NA, where the sku/period combination is the first occurrence, and where the correlation coefficient is the highest for that SKU. After these selections, if there is still more than one occurrence of each SKU, then the occurrence with the longest horizon, or highest period is selected.

These correlations are ranked using g_rank(G;S;O;X), and the top ten correlated products are selected. To obtain more information about the highest correlated products, the Product Master table is linked in and the description column is added to the worksheet.

Within a second block, an <insert> statement with the condition variable initialized to rain will give you the top ten correlated products for rainy weather. Then, with the addition of the <foreach> loop, you can obtain the top ten correlated products for every weather condition by inserting the block for each condition and merging the results together. Lastly, a <sort> combined with a <sel> statement will create a table that includes the most highly correlated product for each weather condition.

Further reading

If you would like to learn more about the functions and operations discussed in this recipe, click on the links below: