Determining the most highly correlated pairs
You can calculate the correlation coefficient of two items using
g_cor(G;S;X;Y)
, but sometimes it is necessary to compute the correlation
coefficient for multiple pairs and determine which items are the most highly
correlated.
Difficulty
Objective
You want to identify the relationship between multiple pairs of items determine which items are most closely related. For example, you want to know how a day's weather condition correlates to the products sold that day. Then you want to select the most correlated product for each weather condition.
Solution
<library> <block name="correlate_weather_condition" condition=""> <base table="pub.demo.weather.wunderground.forecast"/> <sel value="year(datefor)=2016"/> <sel value="period<>1"/> <willbe name="conditions2" value="if((contains(conditions;'Snow')); 'Snow';(contains(conditions;'Rain'));'Rain';(contains(conditions; 'Cloudy'));'Overcast';(contains(conditions;'Thunderstorm')); 'Thunderstorm';conditions)"/> <link table2="pub.doc.retail.altseg.sales_detail_transid" col="zipcode, dateof" col2="zip_txt,trans_date" type="select" expand="1"> <sel value="year(trans_date)=2016"/> <link table2="pub.doc.retail.altseg.stores" col="store" col2="store" type="select"> <sel value="state='CA'"/> </link> <willbe name="zip_txt" value="strsubst(padleft(string(zip);5) ;' ';0;'0')"/> <link table2="pub.demo.weather.wunderground.forecast" col="trans_date,zip_txt" col2="dateof,zipcode" type="include"/> <tabu label="Tabulation on Sales Detail" breaks="zip_txt, sku,trans_date"> <tcol fun="sum" name="vol" source="qty" label="Volume"/> </tabu> <willbe name="relative_vol" value="vol/g_avg(zip_txt;;vol)" label="Volume Relative to`Avg for Zip"/> <colord cols="zip_txt,trans_date,sku,relative_vol"/> </link> <colord cols="zipcode,dateof,period,datefor,conditions,sku, relative_vol"/> <willbe name="target_condition" value="'{@condition}'"/> <willbe name="is_target_condition" value="conditions=target_condition"/> <willbe name="target_condition_cumcnt" value="g_cumsum(dateof zipcode sku;;period;is_target_condition)"/> <willbe name="pct_target_condition_in_forecast_horizon" value="100*target_condition_cumcnt/(period-1)"/> <merge/> <willbe name="sku_horizon_corr" value="g_cor(sku period;;relative_vol; pct_target_condition_in_forecast_horizon)" format="dec:9"/> <willbe name="abs_sku_horizon_corr" value="abs(sku_horizon_corr)" format="dec:9"/> <willbe name="num_obs" value="g_cnt(sku period;)"/> <sel value="num_obs>=100"/> <sel value="g_first1(sku period;;)"/> <sel value="g_hi(sku;;abs_sku_horizon_corr)=abs_sku_horizon_corr"/> <sel value="g_last1(sku;;period)"/> <sel value="(abs_sku_horizon_corr<>NA)"/> <willbe name="rank_corr" value="g_rank(;;;abs_sku_horizon_corr)"/> <sel value="rank_corr<=10"/> <link table2="pub.doc.retail.altseg.products" col="sku" col2="sku" cols="description"/> <colord cols="period,sku,description,target_condition, sku_horizon_corr,abs_sku_horizon_corr"/> </block> </library> <block name="top_correlated"> <call block="correlate_weather_condition" condition="Rain"/> <foreach target="Clear,Fog,Thunderstorm,Overcast,Snow,Ice Pellets"> <merge table2="pub.demo.weather.wunderground.forecast"> <call block="correlate_weather_condition" condition="{@target}"/> </merge> </foreach> <sort col="abs_sku_horizon_corr" dir="down"/> <sel value="g_first1(target_condition;;)"/> </block>
Discussion
The use of blocks is helpful when you want to perform the same set of operations multiple
times. There are two ways to reuse a query contained in a <block>
. One
way is to put the block inside a <library>
. The other way is to use a
<defblock>
. With both methods, when you want to use the block, you
need to use either the <insert>
or <call>
operation.
Additionally, when using <library>
, your library either needs to be
present in your current query, or if it is saved elsewhere, it needs to be imported using
the <import>
operation.
For this recipe, the <block>
is contained in a
<library>
, which is present at the beginning of the query. The block
is structured so that you can correlate to any condition. This is done by substituting a
variable, condition
, which is initialized in the block, for every instance
in the query that requires a named condition. Therefore, when you insert the
<block>
, you also state which condition you want to correlate to.
Before starting the analysis, the Forecast table is reduced by
selecting only future weather forecasts, i.e. where the dateof
is not equal
to the datefor
, or the period
is not 1
,
and also by selecting forecasts for the year 2016. Then, you want to compare every forecast
to every item bought on that day. A <link>
is performed to combine the
current Forecast worksheet to the Sales Detail
table. However, in order to compare all combinations, expand="1"
needs to
be specified. This attribute will duplicate each entry in the
Forecast worksheet to allow for multiple matches in the foreign
table.
Within the link to Sales Detail, multiple operations are used to reduce the table. Similar to the reductions on the Forecast table, only transactions occurring in 2016 are selected. A link and select to the Store Master table, causes only the transactions from stores in California to be retained. In order to make a further reduction, and strictly include transactions from stores in locations that there is forecast data for, a link and include is performed with the Forecast table. Now that the table is reduced, a tabulation is performed. This tabulation finds the total amount of each SKU sold on each date for each store location.
After the worksheets are linked, you are left with a record of the relative volume sold for each product, location, date combination. To start calculating the correlation coefficients, you need to first flag the weather condition you want to correlate to. Then you count how many times that condition will appear within the forecast horizon and determine the percentage.
g_cor(G;S;X;Y)
. However, this function requires the G
argument to contain the column(s) on which the table is segmented, and currently the
worksheet is segmented by zipcode
. Therefore, an empty merge,
<merge/>
, is performed to condense the table to a single segment.
G
argument blank, or use
your desired grouping columns.To make sure you are including results calculated from a suitable amount of data points, a
selection is done to retain sku
/period
combinations that
had greater than 100 observations. Additional selections are done to select rows where the
correlation coefficient isn't NA
, where the
sku
/period
combination is the first occurrence, and
where the correlation coefficient is the highest for that SKU. After these selections, if
there is still more than one occurrence of each SKU, then the occurrence with the longest
horizon, or highest period is selected.
These correlations are ranked using g_rank(G;S;O;X)
, and the top ten
correlated products are selected. To obtain more information about the highest correlated
products, the Product Master table is linked in and the
description
column is added to the worksheet.
Within a second block, an <insert>
statement with the
condition
variable initialized to rain
will give you the
top ten correlated products for rainy weather. Then, with the addition of the
<foreach>
loop, you can obtain the top ten correlated products for
every weather condition by inserting the block
for each condition and
merging the results together. Lastly, a <sort>
combined with a
<sel>
statement will create a table that includes the most highly
correlated product for each weather condition.
Further reading
If you would like to learn more about the functions and operations discussed in this recipe, click on the links below: