Removing values n standard deviations from the mean
You can determine outliers by identifying values in specific columns that fall n standard deviations outside of the mean for that column in a given data set. Rows that fall outside the desired range can then be eliminated.
Difficulty
Objective
You are working with a table that contains outlier values which could skew your analysis. You want to identify all of the values that fall outside n standard deviations from the mean and eliminate them from your data set to get a more accurate picture.
Solution
<base table="pub.demo.weather.wunderground.observed_hourly"/> <willbe name="avg_tempi" value="g_avg(zipcode;;tempi)"/> <willbe name="std_tempi" value="g_std(zipcode;;tempi)"/> <willbe name="lower_limit" value="avg_tempi-(1*std_tempi)"/> <willbe name="upper_limit" value="avg_tempi+(1*std_tempi)"/> <sel value="between(tempi;lower_limit;upper_limit)"/>
Discussion
Having outlier values in your data set can cause your analyses and aggregations to be skewed. Creating a range of acceptable values to select on will help improve results. This recipe selects values lying within one standard deviation of the mean, but you can change this number to create a different selection range.
This solution uses two g_functions: one to determine the average of the values in the data,
g_avg(G;S;X)
, and one to determine the
standard deviation, g_std(G;S;X)
. Upper and lower
limits are created by adding and subtracting n
(in this case 1) multiplied by the standard deviation from the
average temperature. Then, to eliminate the values further than one
standard deviation away, only the values within this range are
selected.
Further reading
If you would like to learn more about the functions and operations discussed in this recipe, click on the links below: