Selections

The order in which selections are made and the method in which selections are made can affect the efficiency of your query.

Key concepts

  • In most cases, make selections in order of those that make the greatest reduction in table size.
  • Avoid creating selection columns by making direct selections instead.
  • Making selections on indexed columns provides increased performance.

Discussion

When making selections on a table using <sel>, you normally want to start with the selection that will make the greatest reduction in the number of rows. This will produce a smaller worksheet for subsequent operations to perform. Because an operation has to consider every row of the table, retaining the smallest number of rows necessary to complete the analysis will decrease computation time.

An exception to this general rule occurs when selecting on computed columns rather than native columns. If the selection that reduces the table the most is based on a computed column where the value expression is expensive to compute, it could be better to make smaller selections first. For example, you need to make a selection based on a column whose value expression depends on multiple g_functions and a link to another worksheet. This selection will produce a worksheet that is 1% of the original table size. You also need to make a selection on a native column that will reduce the table to 10% of it's original size. Although the selection on the computed will produce a greater reduction, due to the cost in evaluating this column, it's better to evaluate it on less of the table. Therefore, you should make the selection on the native column first.

Additionally, unless necessary try to avoid creating computed columns for the purpose of making selections. Instead, use the value expression of the <willbe> as the value expression of the <sel> operation. For example, instead of creating a boolean column using g_first1(G;S;X) and then using that boolean column as the value expression of a selection, use the g_function directly in the selection.

Furthermore, making selections on indexed columns is faster than making selections on non-indexed columns. You can determine if a column is indexed by viewing the information about the table. If there are any indexed columns, they will be listed there.

Example

Say you want to find the sum of sales per group for a specific date, store, and department. The following example explores the impact of the order in which you make your selections before completing your analysis.

For this example, the path of the base table is pub.doc.retail.altseg.sales_detail_sku, and it has 28,888,576 rows.

Consider the following order of selections:

<sel value="store=25"/>
<sel value="trans_date=20160621"/>
<sel value="dept=13"/>

Selecting the store reduces the worksheet to 3,656,764 rows. Then selecting the date reduces it further to 4,970 rows, until ultimately you're left with 182 rows after selecting the department.

Now, consider what happens when the selections are ordered differently:

<sel value="dept=13"/>
<sel value="store=25"/>
<sel value="trans_date=20160621"/>

With this order of selections, you go from 28,888,576 rows to 978,850 to 177,463, and then finally to 182 rows.

Lastly, consider this ordering:

<sel value="trans_date=20160621"/>
<sel value="dept=13"/>
<sel value="store=25"/>

When you select the date first, you reduce the worksheet to 17,693 rows. From there, you go down to 449 rows after the department selection and then to 182 rows after selecting the store.

As you can see, there is one selection that reduces the worksheet significantly more than the other two. By selecting the date first, you reduce the worksheet by about 99.9% right away. This leaves a very small percentage of rows for the next <sel> to operate on. However, if you select the store first, you still have 13% of the rows on which the second selection will operate.