g_outlier(G;S;X;Z)
Returns a numeric value indicating outliers or extreme values. (Available as of version 10.44)
Function type
Vector only
Syntax
g_outlier(G;S;X;Z)
Input
Argument | Type | Description |
---|---|---|
G |
any | A space- or comma-separated list of column names. Rows are in the same group
if their values for all of the columns listed in If If any of the columns listed in |
S |
integer | The name of a column in which every row evaluates to a 1 or 0, which determines
whether or not that row is selected to be included in the calculation. If
If any of the values in
|
X |
any numeric type | A column name. The column on which to apply the outlier algorithm. |
Z |
text or package | A string of pairs of option names and option values, or a package that
specifies these values (optional). The options you may specify for the
The following option can also be specified when using the
'tukeys' method:
The following options can also be specified when using the
'gesd' method:
For example, since Z can either consist of a string of
name/value pairs or a package, the same values can be specified for
Z in either of the following forms:
If |
Return Value
For every row in each group defined by G (and for those rows where S=1, if specified), g_outlier(G;S;X;Z) applies an outlier algorithm specified in Z to the data in X.
- If Tukey's method is selected, it returns a boolean value labeling outliers with a 1 and non-outliers with a 0.
  Tukey’s method (Tukey 1977), which does not assume a distribution and is sometimes called the boxplot method, computes upper and lower fences to determine outliers. If an X value is less than the lower fence or greater than the upper fence, the return value is 1; otherwise, it is 0. As the option fence_mult increases, the distance between the lower fence and upper fence increases.
- If the GESD method is selected, it returns 0 for each row that is not an outlier and a ranking of the outliers otherwise (i.e., 1 is the most extreme outlier, 2 is the second most extreme outlier, etc.).
  GESD, which assumes approximate normality of the data, iteratively computes max_num separate tests for outliers based on Rosner’s (Rosner 1983) generalization of Grubbs’s test (Grubbs 1950). The return value for GESD is therefore a rank for outliers, 0 for non-outliers, and NA for rows with NA values or rows for which S=0.
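The fence logic described for Tukey's method can be sketched in Python. This is an illustration only, not the implementation behind g_outlier(G;S;X;Z): the function name tukey_flags, the sample values, and the quartile convention (Python's "inclusive" method) are assumptions, and the fences are taken to be the textbook Q1 - k*IQR and Q3 + k*IQR, where k plays the role of fence_mult.

```python
from statistics import quantiles


def tukey_flags(values, fence_mult=1.5):
    """Return 1 for values outside the Tukey fences and 0 otherwise.

    Hypothetical helper: the quartile convention used here ("inclusive")
    is an assumption and may differ from the one g_outlier uses.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - fence_mult * iqr  # lower fence
    upper = q3 + fence_mult * iqr  # upper fence
    return [1 if (v < lower or v > upper) else 0 for v in values]


# Made-up sample data, not taken from any real table.
durations = [10, 12, 13, 15, 16, 18, 20, 35, 200]

default_flags = tukey_flags(durations, fence_mult=1.5)
extreme_flags = tukey_flags(durations, fence_mult=3.0)

print(default_flags)
print(extreme_flags)
```

With this sample, the wider fences produced by a multiplier of 3.0 flag fewer values than the fences produced by 1.5, which matches the statement that the distance between the fences grows with fence_mult.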
Example
The following example uses g_outlier(G;S;X;Z) to determine the outliers in the duration column of the pub.demo.mleg.uci.bankmarketing table, grouped by the values in the contact column. The first instance of g_outlier(G;S;X;Z) uses Tukey's method with a fence multiplier of 1.5, as specified by the Z parameter. This determines the outliers when the default fence multiplier is used. The second instance of g_outlier(G;S;X;Z) sets the fence multiplier to 3.0, which determines the more extreme outliers. A tabulation is then performed, which produces the total number of both the default outliers and extreme outliers for each group in the contact column.
<base table="pub.demo.mleg.uci.bankmarketing"/>
<willbe name="outs" value="g_outlier(contact;;duration;pkg('method' 'fence_mult';'tukeys' 1.5))"/>
<willbe name="outs_extreme" value="g_outlier(contact;;duration;pkg('method' 'fence_mult';'tukeys' 3.0))"/>
<tabu breaks="contact" label="Tabulation">
  <tcol fun="sum" name="outliers" source="outs" label="Number of`Outliers"/>
  <tcol fun="sum" name="outliers_extreme" source="outs_extreme" label="Number of Extreme`Outliers"/>
</tabu>
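For readers who want to trace the query's logic outside the platform, the following Python sketch mimics its two steps: per-group Tukey flags at two fence multipliers, then a per-group sum. Everything here is hypothetical: the helper names grouped_tukey_flags and tabulate_sum, the "inclusive" quartile convention, and the small made-up dataset, which stands in for the real table and does not reproduce its values.

```python
from collections import defaultdict
from statistics import quantiles


def grouped_tukey_flags(groups, values, fence_mult=1.5):
    """Flag each value as 1 (outlier) or 0, using fences computed per group.

    Hypothetical helper; the quartile convention is an assumption.
    """
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)

    fences = {}
    for g, vals in by_group.items():
        q1, _, q3 = quantiles(vals, n=4, method="inclusive")
        iqr = q3 - q1
        fences[g] = (q1 - fence_mult * iqr, q3 + fence_mult * iqr)

    return [
        1 if (v < fences[g][0] or v > fences[g][1]) else 0
        for g, v in zip(groups, values)
    ]


def tabulate_sum(groups, flags):
    """Sum the flags within each group, like a tabulation with fun="sum"."""
    totals = defaultdict(int)
    for g, f in zip(groups, flags):
        totals[g] += f
    return dict(totals)


# Made-up rows standing in for the contact and duration columns.
contact = ["cellular"] * 9 + ["telephone"] * 9
duration = [10, 12, 13, 15, 16, 18, 20, 35, 200,
            5, 6, 7, 8, 9, 10, 11, 12, 100]

outs = grouped_tukey_flags(contact, duration, fence_mult=1.5)
outs_extreme = grouped_tukey_flags(contact, duration, fence_mult=3.0)

outliers = tabulate_sum(contact, outs)
outliers_extreme = tabulate_sum(contact, outs_extreme)

print(outliers)
print(outliers_extreme)
```

Because the fences are computed separately for each group, a value that is extreme in one group may be unremarkable in another, which is why the grouping column matters in the query.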
References
- Tukey, John. 1977. Exploratory Data Analysis. Addison-Wesley.
- Rosner, Bernard. 1983. "Percentage Points for a Generalized ESD Many-Outlier Procedure." Technometrics 25(2):165-172.
- Grubbs, Frank. 1950. "Sample Criteria for Testing Outlying Observations." Annals of Mathematical Statistics 21(1):27-58.