Writing Efficient Queries

When using large tables with more than 50 million rows, the efficiency of a query can have a significant impact on run time and memory usage.

Here are some suggestions to help the 1010data Insights Platform perform optimally.

General

Do it step-by-step
Writing a query in its entirety and executing it all at once can be risky on large tables. It may take a long time or may not work at all, and you won't necessarily know what operations are causing the problem. Instead, run the instructions line by line to catch and pinpoint any inefficiencies.
Use sample data
While developing a query, use a small subset of the entire database. For example, select the first 1000 rows of the table with the selection expression: i_<=1000. If you are working with a multi-segment table (>5M rows), try to capture a subset of 2-3 segments.
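For instance, a minimal Macro Language sketch of that sampling selection, assuming the standard <sel> operation (note that the < character must be written as &lt; inside an XML attribute value):

    <sel value="i_&lt;=1000"/>   <!-- sample: keep rows whose row index is at most 1,000 -->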
Cancel queries completely
If a query is running longer than expected and does not seem to be finishing, you cannot stop the process by clicking the browser's stop button. Clicking the stop button frees up the browser window, but the query continues to be processed, using system resources and potentially impacting other users. Even closing the browser window will not stop the process on the 1010data servers. Instead, log in to 1010data again and choose the option End existing session when prompted. Logging in this way kills the previous query process and cleans things up.
Consider using TenDo or the 1010data Excel Add-in
In the case of an extremely large query, the Insights Platform may not finish processing it before reaching a set time-out period. When this occurs, the query is automatically resubmitted. Fortunately, because each query operation is cached after it is run, the Insights Platform continues to process the query from the point at which it timed out instead of from the beginning. In nearly all cases, this allows the query to finish processing.

If the query neither completes nor fails after being resubmitted 20 times, an error message is displayed. Should this occur, 1010data recommends canceling the transaction and submitting it using TenDo or the 1010data Excel Add-in. For instructions, see the TenDo User's Guide or the 1010data Excel Add-in User's Guide.

Selections

Select using native columns first
Where possible, use native columns to perform selection statements.  
  • Select on computed columns or perform a link-and-select only when the above is not possible.
  • A link with type="select" is more efficient than a link followed by a select when trying to find only those rows in the base table that are in the foreign table (see the sketch after this list).
  • When batching together queries via TenDo or the Excel Add-in, use parameter substitution where the parameters are the results of a query whenever possible. This technique is much faster than the alternatives.
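For example, a hedged sketch of the single-step form (the table path somedir.accounts and the column name acct_id are hypothetical):

    <!-- Keep only base-table rows whose acct_id appears in the foreign table;        -->
    <!-- this one operation replaces a <link> followed by a <sel> on a linked column. -->
    <link table2="somedir.accounts" col="acct_id" col2="acct_id" type="select"/>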
Use the right order
When performing multiple selections, apply the most selective statements first and the least selective last. For example, if you wanted to see the transaction data of a single store for a given date range, you should perform the store selection before the date range selection.
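A sketch of that ordering (store and transdate are hypothetical column names; the last comparison is written with >= reversed so that no < needs to be escaped in the XML):

    <sel value="store=1234"/>              <!-- most selective: a single store -->
    <sel value="transdate>=20170101"/>     <!-- then the wider date-range conditions -->
    <sel value="20170131>=transdate"/>     <!-- i.e., transdate at or before 20170131 -->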
Don't select too many rows
Make your selections count! Selections that keep too many rows (e.g., more than 50 million) use a significant amount of memory. For large tables, especially try to avoid selection statements that select almost the entire table. If you must select a large number of records, try an alternative approach. For example, if you would otherwise select rows and then tabulate (see the sketch after these steps):
  1. Define a computed column named include, whose value is the selection expression. This column will be 1 for rows you want to keep and 0 for rows you want to omit.
  2. Make include one of the break columns in the tabulation.
  3. After the tabulation, select rows where include is 1.
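Sketched in Macro Language (the column names, the flag expression, and the aggregation are all hypothetical):

    <willbe name="include" value="sales>0"/>               <!-- 1 for rows to keep, 0 for rows to omit -->
    <tabu label="Sales by store" breaks="include,store">
      <tcol source="sales" fun="sum" name="total_sales"/>
    </tabu>
    <sel value="include=1"/>                                <!-- selection now runs on the small tabulation result -->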
Learn what columns are indexed
Indexing is not explicitly revealed in the user interface, but selections on indexed columns of large tables can be much more memory efficient.

Defining Columns and Writing Selection Expressions

Choose the appropriate function
Different types of functions use different amounts of system resources. When you are defining a column, prefer them in this order for best performance: regular functions first, then G_Functions, and tabulations only as a last resort.
Use G_Functions on segmented tables before resorting to tabulation
Computed columns are only calculated as needed, whereas tabulations calculate all results. Remember, time series and group summarization functions (beginning with g_ or t_) use significantly more time and memory than other functions.
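For example, a per-group total can often be attached as a computed column with a G_Function instead of produced by a separate tabulation; a sketch assuming the usual g_sum(group columns; selection; value) form and hypothetical column names:

    <!-- Per-store total attached to each row; evaluated only where it is actually needed -->
    <willbe name="store_sales" value="g_sum(store;;sales)"/>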
Tabulate and filter before creating computed columns
Where possible, define computed columns after tabulations. Computed columns add overhead; the fewer rows they are applied to, the better.
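For instance, compute a derived value on the (much smaller) tabulation result rather than on every row of the base table; a sketch with hypothetical names:

    <tabu label="Totals by store" breaks="store">
      <tcol source="sales" fun="sum" name="total_sales"/>
      <tcol source="units" fun="sum" name="total_units"/>
    </tabu>
    <willbe name="avg_price" value="total_sales/total_units"/>   <!-- applied to one row per store -->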
Create a reference column
If the same computation is used multiple times, define it once as a computed column and reference that column thereafter.
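For example (hypothetical column names):

    <willbe name="margin" value="price-cost"/>        <!-- computed once -->
    <willbe name="margin_pct" value="margin/price"/>  <!-- later expressions reference the column... -->
    <sel value="margin>0"/>                           <!-- ...instead of repeating price-cost -->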
Use built-in functions
Use 1010data functions wherever possible instead of complex expressions.
Eliminate unnecessary if's or if-then-else's
When possible, use logical arithmetic or another categorization function. Logical arithmetic applies when the desired result is 0 if a condition is false and 1 if it is true.
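For example, a 0/1 flag can usually be written as the comparison itself (hypothetical column names):

    <!-- Instead of: <willbe name="big_sale" value="if(sales>100;1;0)"/> -->
    <willbe name="big_sale" value="sales>100"/>   <!-- the comparison already evaluates to 1 or 0 -->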
Avoid complicated text searches
Manipulating or searching on text values is more expensive than similar operations on numeric values. In particular, don't use the function contains(x;y) if you can use equals (=).
Handle divide-by-zero's
Select the Error and Infinity Handling check box in the Advanced Settings dialog. Alternatively, check if a denominator is zero before dividing and adjust your computation accordingly.
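For example, a guarded division might look like the following sketch (hypothetical column names; 0 is an arbitrary fallback value, and the if(condition;then;else) form is assumed):

    <willbe name="unit_price" value="if(units=0;0;sales/units)"/>   <!-- avoid dividing when units is 0 -->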
System variables
Using the system variables i_ or zi_ in an expression is equivalent to creating a computed column. Consider this cost when using them in selections or tabulations.
Be aware of data types
Integer arithmetic is faster than decimal arithmetic, which is faster than string manipulation.

Sorting

Sort after tabulating
Sort the table during or after a tabulation rather than before, especially if there are more than a few million rows.
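For example, sort the tabulation result rather than the multi-million-row input (hypothetical names):

    <tabu label="Totals by store" breaks="store">
      <tcol source="sales" fun="sum" name="total_sales"/>
    </tabu>
    <sort col="total_sales" dir="down"/>   <!-- sorts a few thousand result rows, not the raw table -->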

Linking

Avoid large worksheets
Linking to large worksheets (more than a few million rows or many columns) can use a significant amount of memory.  
  1. If possible, avoid linking to worksheets entirely, except in cases where you can significantly pare down the number of rows.
  2. To reduce the size of the foreign worksheet, select rows before linking.
  3. Use a colord to link in only the needed columns (see the sketch after this list).
  4. Try to do the analysis within the same worksheet. If necessary, save the worksheet as a table, then link it in.
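As a sketch of points 2 through 4 (all names are hypothetical): pare the foreign data down in its own query, save that result as a table, then link to the smaller table:

    <!-- In the query that builds the foreign data: keep only the needed rows and columns -->
    <sel value="region=5"/>
    <colord cols="acct_id,acct_name"/>
    <!-- ...save that result as a table, then in the main query link to the smaller table... -->
    <link table2="myfolder.accounts_small" col="acct_id" col2="acct_id"/>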
Suffix and labels
Consider using a suffix and a label when you link together two data sets that share common column names. That way, you can avoid redefining a column so it has a meaningful name (not c1 or t1). Plus, any macro code that refers to a renamed column will always use more resources than code that directly references the original column.
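A sketch, assuming the <link> operation accepts a suffix attribute that is appended to the incoming column names (table and column names are hypothetical):

    <!-- Linked columns arrive as sales_ly, units_ly, etc., so no renaming columns are needed -->
    <link table2="myfolder.lastyear" col="store" col2="store" suffix="_ly"/>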

Tabulating

Understand the costs
Memory usage and run time are a function of the number of breaks, the type of breaks (original column vs. computed column), the number of result columns, the complexity of the break columns (how many characters define a unique break value), and the type of aggregation functions. Tabulate on original columns whenever possible.
Limit the size of the result
Do not break on a column (or set of columns) that has more than a few million different values.
Only create tabulation groups needed to give unique results
Unnecessary breaks still create overhead.
Do things piecewise
If your data set is very large and you are running into memory problems, apply the tabulation to subsets of the data and combine the results afterwards.
Round real numbers
Before using a real (floating point) column as a break column, round the values. The system may have trouble distinguishing between real numbers that are very close in value; rounding helps ensure that the break values are well separated.
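For example, break on a rounded copy of the column; the floor-based arithmetic below is just one way to round to two decimals (the floor function and the cnt aggregation are assumptions, as are the column names):

    <willbe name="price_r" value="floor(price*100+0.5)/100"/>   <!-- price rounded to 2 decimal places -->
    <tabu label="Rows per price" breaks="price_r">
      <tcol source="price_r" fun="cnt" name="n"/>
    </tabu>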
Adjust your advanced settings
Change your advanced settings for very large data sets and complex queries that require a lot of memory. In the Advanced Settings dialog, select a lower Blocking Level or select Do step-wise aggregation, as these two settings work together to use less memory. Note: This may increase execution time.
Supply column names in tabulations
It is better to name the result columns produced in tabulations instead of relying on the default names (e.g., t0, t1, t2). If you supply column names, you don't have to create an additional computed column to rename a default tabulation column. Plus, any later code that refers to such a renaming column will always use more resources than code that references the tabulation column directly.
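For example (hypothetical names), name the result columns directly in the tabulation:

    <tabu label="Sales by store" breaks="store">
      <!-- Named here, so no later renaming columns and no references to t0, t1, ... -->
      <tcol source="sales" fun="sum" name="total_sales"/>
      <tcol source="units" fun="avg" name="avg_units"/>
    </tabu>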

Merging

Row limitations of merged tables
Merged tables are limited to ~16 million cells (number of rows * number of columns).
Unrestricted Merges
If the <merge> operation appears before all other operations in a macro (except <note>), the default attributes type="all" and match="names" are used, and no other tags are contained within the <merge> operation, then the merged table size is unlimited.
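A minimal sketch of a merge that meets these conditions (the tables attribute, its comma-separated value, and the table paths are assumptions; check the Macro Language reference for the exact form your version accepts):

    <note>Append two transaction tables with identical layouts</note>
    <merge tables="myfolder.trans_2016,myfolder.trans_2017"/>
    <!-- No tags nested inside <merge>, and the default type="all" and match="names" apply, so the cell limit does not. -->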

Saving and Downloading Tables

Understand the costs
Saving large tables can take a long time, just like running large tabulations. When you save a table, the system must compute all of its rows and columns, which can tax system resources unnecessarily. In some cases, it is actually faster to run the query again instead of saving the results.