g_outlier(G;S;X;Z)

Returns a numeric value indicating outliers or extreme values. (Available as of version 10.44)

Function type

Vector only

Syntax

g_outlier(G;S;X;Z)

Input

Argument Type Description
G any A space- or comma-separated list of column names

Rows are in the same group if their values for all of the columns listed in G are the same.

If G is omitted, all rows are considered to be in the same group.

If any of the columns listed in G contain N/A, the N/A value is considered a valid grouping value.

S integer The name of a column in which every row evaluates to 1 or 0, indicating whether that row is included in the calculation

If S is omitted, all rows will be considered by the function (subject to any prior row selections).

If any of the values in S are neither 1 nor 0, an error is returned.

X any numeric type A column name

The column on which to apply the outlier algorithm.

Z text or package A string of pairs of option names and option values, or a package that specifies these values (optional)

The options you may specify for the Z parameter are:

'method' value

Valid values for value are:

'tukeys'
Specifies that the function use Tukey’s method, which returns values indicating outliers.
'gesd'
Specifies that the function use the Generalized Extreme Studentized Deviate (GESD) test, which returns a ranking of outliers in the data.

The default value is 'tukeys'.

The following option can also be specified when using the 'tukeys' method:
'fence_mult' value
Specifies the fence multiplier, which scales the distance between the lower and upper fences used to determine outliers. A larger multiplier widens the fences, so fewer values are flagged as outliers.

The default value is 1.5.

The following options can also be specified when using the 'gesd' method:
'alpha' value
Specifies the significance level.

The default value is 0.05.

'max_num' value
Specifies the suspected maximum number of outliers.

The default value is 10.

Because Z can be either a string of name/value pairs or a package, the same options can be specified in either of the following forms:
  • 'method' 'gesd' 'alpha' 0.075 'max_num' 22
  • pkg('method' 'alpha' 'max_num';'gesd' 0.075 22)

If Z is omitted, Tukey's method is used by default with a fence multiplier of 1.5.
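
The equivalence of the two forms, and the way omitted options fall back to their defaults, can be modeled with a short Python sketch. This is only an illustration of the documented behavior, not how the platform parses Z; the helper names from_pairs, from_pkg, and resolve are hypothetical.

```python
# Documented defaults for each method (taken from the option descriptions above).
DEFAULTS = {
    "tukeys": {"fence_mult": 1.5},
    "gesd": {"alpha": 0.05, "max_num": 10},
}

def from_pairs(*items):
    """Model the string form: alternating option names and option values."""
    return dict(zip(items[0::2], items[1::2]))

def from_pkg(names, values):
    """Model the package form: a list of names and a parallel list of values."""
    return dict(zip(names, values))

def resolve(options=None):
    """Fill in the documented defaults for whichever method is selected."""
    options = dict(options or {})
    method = options.pop("method", "tukeys")  # 'tukeys' is the default method
    return {"method": method, **DEFAULTS[method], **options}

pairs_form = from_pairs("method", "gesd", "alpha", 0.075, "max_num", 22)
pkg_form = from_pkg(["method", "alpha", "max_num"], ["gesd", 0.075, 22])
```

Both pairs_form and pkg_form hold the same three options, and resolve() with no argument yields Tukey's method with a fence multiplier of 1.5.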

Return Value

For every row in each group defined by G (and for those rows where S=1, if specified), g_outlier(G;S;X;Z) applies the outlier algorithm specified in Z to the data in X.

  • If Tukey's method is selected, it returns 1 for outliers and 0 for non-outliers.

    Tukey’s method (Tukey 1977), which does not assume a distribution and is sometimes called the boxplot method, computes lower and upper fences to determine outliers. If an X value is less than the lower fence or greater than the upper fence, the return value is 1; otherwise, it is 0. As the fence_mult option increases, the distance between the lower fence and the upper fence increases.

  • If the GESD method is selected, it returns 0 for each row that is not an outlier and a ranking of the outliers otherwise (i.e., 1 is the most extreme outlier, 2 is the second most extreme outlier, etc.).

    GESD, which assumes approximate normality of the data, iteratively performs max_num separate tests for outliers, based on Rosner’s (Rosner 1983) generalization of Grubbs’s test (Grubbs 1950). The return value for GESD is the rank for outliers, 0 for non-outliers, and N/A for rows in which X is N/A or for which S=0.
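
The GESD procedure described above can be sketched in Python for a single group. This is an illustrative sketch under stated assumptions, not the platform's implementation: the helper names (t_pdf, t_cdf, t_ppf, gesd_ranks) are hypothetical, the Student's t quantile is approximated numerically so that only the standard library is needed, the critical values follow the formula in Rosner (1983), outliers are assumed to be ranked in the order in which they are removed, and neither N/A handling nor the S selection is modeled.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    return c / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=1000):
    """CDF for x >= 0, integrating the density on [0, x] with Simpson's rule."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

def t_ppf(p, df):
    """Quantile for p > 0.5, found by bisection (a crude approximation)."""
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def gesd_ranks(values, alpha=0.05, max_num=10):
    """Return 0 for non-outliers and 1, 2, ... for outliers (1 = most extreme)."""
    n = len(values)
    remaining = list(range(n))  # indices still in the sample
    removed = []                # indices in order of removal
    n_outliers = 0
    for i in range(1, min(max_num, n - 2) + 1):
        sample = [values[j] for j in remaining]
        mean = statistics.fmean(sample)
        sd = statistics.stdev(sample)
        if sd == 0:
            break
        # Test statistic: the largest studentized deviation in the sample.
        worst = max(remaining, key=lambda j: abs(values[j] - mean))
        r_stat = abs(values[worst] - mean) / sd
        # Critical value for the current sample size m = n - i + 1.
        m = n - i + 1
        t = t_ppf(1 - alpha / (2 * m), m - 2)
        critical = (m - 1) * t / math.sqrt((m - 2 + t * t) * m)
        remaining.remove(worst)
        removed.append(worst)
        if r_stat > critical:
            n_outliers = i  # the outlier count is the largest i that passes
    ranks = [0] * n
    for rank, j in enumerate(removed[:n_outliers], start=1):
        ranks[j] = rank
    return ranks

# Made-up sample: fourteen values near 10 plus two obvious outliers.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9,
        10.0, 10.2, 9.8, 10.1, 25.0, 40.0]
ranks = gesd_ranks(data, alpha=0.05, max_num=3)
# 40.0 is ranked 1, 25.0 is ranked 2, and every other value gets 0.
```

Because the outlier count is the largest test index whose statistic exceeds its critical value, a less extreme point removed early is still ranked as an outlier when a later test passes, which is how the procedure guards against masking.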

Example

The following example uses g_outlier(G;S;X;Z) to find the outliers in the duration column of the pub.demo.mleg.uci.bankmarketing table, grouped by the values in the contact column. The first call uses Tukey's method with a fence multiplier of 1.5, specified explicitly in the Z parameter even though it is the default value. The second call sets the fence multiplier to 3.0, which flags only the more extreme outliers. A tabulation then totals the default outliers and the extreme outliers for each group in the contact column.

<base table="pub.demo.mleg.uci.bankmarketing"/>
<willbe name="outs" 
 value="g_outlier(contact;;duration;pkg('method' 'fence_mult';'tukeys' 1.5))"/>
<willbe name="outs_extreme" 
 value="g_outlier(contact;;duration;pkg('method' 'fence_mult';'tukeys' 3.0))"/>
<tabu breaks="contact" label="Tabulation">
  <tcol fun="sum" name="outliers" source="outs" label="Number of`Outliers"/>
  <tcol fun="sum" name="outliers_extreme" source="outs_extreme" 
   label="Number of Extreme`Outliers"/>
</tabu>
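
The effect of the two fence multipliers in the tabulation above can be sketched in Python on a small made-up data set. This is a sketch under assumptions rather than the platform's algorithm: the function names tukey_flags and count_outliers_by_group are hypothetical, the fences are taken as the first quartile minus fence_mult times the interquartile range and the third quartile plus fence_mult times the interquartile range, and the quartiles use linear interpolation, which may differ from the quartile rule that g_outlier applies.

```python
from collections import defaultdict
from statistics import quantiles

def tukey_flags(values, fence_mult=1.5):
    """Return 1 for values outside the Tukey fences, otherwise 0."""
    # Assumption: quartiles use the "inclusive" linear-interpolation rule.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - fence_mult * iqr
    upper = q3 + fence_mult * iqr
    return [1 if (v < lower or v > upper) else 0 for v in values]

def count_outliers_by_group(rows, fence_mult):
    """rows is a list of (group, value) pairs; returns {group: outlier count}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: sum(tukey_flags(vals, fence_mult)) for g, vals in groups.items()}

# Made-up durations, not taken from the bankmarketing table.
rows = [("cellular", v) for v in [10, 12, 13, 14, 15, 16, 18, 30, 60]]
rows += [("telephone", v) for v in [5, 6, 7, 8, 9, 40]]

default_counts = count_outliers_by_group(rows, 1.5)
extreme_counts = count_outliers_by_group(rows, 3.0)
# default_counts -> {'cellular': 2, 'telephone': 1}
# extreme_counts -> {'cellular': 1, 'telephone': 1}
```

In the cellular group the value 30 lies outside the fences at a multiplier of 1.5 but inside them at 3.0, so only the value 60 is counted as an extreme outlier.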

References

  • Tukey, John. 1977. Exploratory Data Analysis. Addison-Wesley.
  • Rosner, Bernard. 1983. "Percentage Points for a Generalized ESD Many-Outlier Procedure." Technometrics 25(2):165-172.
  • Grubbs, Frank. 1950. "Sample Criteria for Testing Outlying Observations." Annals of Mathematical Statistics 21(1):27-58.