When using large tables with more than 50 million rows, the efficiency of a query can
have a significant impact on run time and memory usage.
Here are some suggestions that will make the 1010data Insights Platform perform more efficiently.
- Do it step-by-step
- Writing a query in its entirety and executing it all at once can be risky on large
tables. It may take a long time or may not work at all, and you won't necessarily know
what operations are causing the problem. Instead, run the instructions line by line to
catch and pinpoint any inefficiencies.
- Use sample data
- While developing a query, use a small subset of the entire database. For example,
select the first 1000 rows of the table with the selection expression:
i_<=1000. If you are working with a multi-segment table
(>5M rows), try to capture a subset of 2-3 segments.
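For example, a minimal sketch in the Macro Language (the base table is whatever you are querying; note that < must be written as &lt; inside an XML attribute):

```xml
<!-- Develop against a small sample: keep only the first 1,000 rows -->
<sel value="i_&lt;=1000"/>
```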
- Cancel queries completely
- If a query is running longer than expected and does not seem to be finishing, you
cannot stop the process by clicking the browser's stop button. Clicking the stop
button frees up the browser window, but the query continues to be processed, using
system resources and potentially impacting other users. Even closing the browser
window will not stop the process on the 1010data servers. Instead, log in to
1010data and choose the option End existing session when
prompted. Logging in this way kills the previous query process and cleans things up.
- Consider using TenDo or the 1010data Excel Add-in
- In the case of an extremely large query, the Insights Platform may not finish
processing it before reaching a set time-out period. When this occurs, the query is
automatically resubmitted. Fortunately, because each query operation is cached after
it is run, the Insights Platform continues to process the query from the point at
which it timed out instead of from the beginning. In nearly all cases, this allows the
query to finish processing.
If the query neither completes nor fails after being
resubmitted 20 times, an error message is displayed. Should this occur, 1010data
recommends canceling the transaction and submitting it using TenDo or the 1010data
Excel Add-in. For instructions, see the TenDo User's Guide or the 1010data Excel Add-in User's Guide.
- Select using native columns first
- Where possible, use native columns to perform selection statements.
- Select on computed columns or perform a link-and-select only when the
above is not possible.
- A link with type="select" is more efficient than a link followed by a select when trying to find only those rows in the base table that are in the foreign table (see the sketch after this list).
- When batching together queries via TenDo or the Excel Add-in, use parameter substitution where the parameters are the results of a query whenever possible. This technique is much faster than submitting a separate query for each parameter value.
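A minimal sketch of the link-and-select pattern described above (table path and column names are hypothetical); the single link with type="select" keeps only matching base-table rows without materializing any foreign columns:

```xml
<!-- Keep only base-table rows whose store_id also appears in the foreign table,
     rather than linking in foreign columns and selecting on them afterwards -->
<link table2="myschema.stores" col="store_id" col2="store_id" type="select"/>
```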
- Use the right order
- When performing multiple selections, always start with the most highly selective
statements first and the least selective statements last. For example, if you wanted
to see the transaction data of a single store for a given date range, you should
perform the store selection before the date range selection.
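For instance, a sketch with hypothetical column names, assuming an integer YYYYMMDD date column and a between(x;y;z) comparison function:

```xml
<!-- One store out of thousands: highly selective, so apply it first -->
<sel value="store_id='NY001'"/>
<!-- Then apply the broader date-range selection -->
<sel value="between(trans_date;20240101;20240331)"/>
```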
- Don't select too many rows
- Make your selections count! Selections that keep too many rows (e.g., more than 50 million) use a significant amount of memory. For large tables, especially try to avoid selection statements that select almost the entire table. If you must select a large number of records, try an alternative approach. For example, if you select rows and then perform a tabulation, do the following instead:
- Define a computed column named include, whose value is
the selection expression. This column will be 1 for rows you want to keep
and 0 for rows you want to omit.
- Make include one of the break columns in the tabulation.
- After the tabulation, select rows
where include is 1.
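A minimal sketch of that pattern, assuming the tabu/tcol form of the tabulation operation (column names are hypothetical):

```xml
<!-- 1. Flag the rows of interest instead of selecting them (1 = keep, 0 = omit) -->
<willbe name="include" value="sales>100"/>
<!-- 2. Tabulate with include as one of the break columns -->
<tabu label="Sales by store" breaks="store_id,include">
  <tcol source="sales" fun="sum" name="total_sales"/>
</tabu>
<!-- 3. Keep only the result rows that came from the flagged detail rows -->
<sel value="include=1"/>
```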
- Learn what columns are indexed
- Indexing is not explicitly revealed in the user interface, but can make selections
on large tables much more memory efficient.
Defining Columns and Writing Selection Expressions
- Choose the appropriate function
- Different types of functions use different amounts of system resources. When you are
defining a column, always use functions in this order for best performance: regular
functions, G_Functions, tabulations.
- Use G_Functions on segmented tables before resorting to tabulation
- Computed columns are only calculated as needed, whereas tabulations calculate all
results. Remember, time series and group summarization functions (beginning with
g_ or t_) use significantly more time and memory
than other functions.
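For example, a group function can attach a per-group total as a computed column without producing a full tabulation (a sketch; column names are hypothetical):

```xml
<!-- Per-store total sales; g_sum(group;selection;value), with the selection
     argument left empty to include all rows -->
<willbe name="store_total" value="g_sum(store_id;;sales)"/>
```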
- Tabulate and filter before creating computed columns
- Where possible, define computed columns after tabulations. Computed columns add
overhead; the fewer rows they are applied to, the better.
- Create a reference column
- If the same computation is referenced multiple times, create a computed column for it and reference that column thereafter.
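For instance, compute a margin once and reuse the column (a sketch with hypothetical column names):

```xml
<!-- Compute the expression once... -->
<willbe name="margin" value="(price-cost)/price"/>
<!-- ...then reference the column instead of repeating the expression -->
<willbe name="high_margin" value="margin>0.4"/>
<sel value="margin>0.25"/>
```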
- Use built-in functions
- Use 1010data functions wherever possible instead of complex expressions.
- Eliminate unnecessary if's or if-then-else's
- When possible, use logical arithmetic or another categorization function. Logical arithmetic applies when the desired result is 0 for false and 1 for true.
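For example, the comparison itself already evaluates to 1 or 0, so the if is unnecessary (a sketch):

```xml
<!-- Instead of: if(sales>100;1;0) -->
<willbe name="big_sale" value="sales>100"/>
```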
- Avoid complicated text searches
- Manipulating or searching on text values is more expensive than similar operations on numeric values. In particular, don't use the function contains(x;y) if you can use equals (=).
- Handle divide-by-zero's
- Select the Error and Infinity Handling check box in
the Advanced Settings dialog. Alternatively, check if a
denominator is zero before dividing and adjust your computation accordingly.
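For example, a guarded division (a sketch; column names are hypothetical):

```xml
<!-- Check the denominator before dividing; return 0 when there are no units -->
<willbe name="price_per_unit" value="if(units=0;0;sales/units)"/>
```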
- System variables
- The system variables i_ or zi_ are equivalent to creating computed columns. Consider this cost when using them in selections or other expressions.
- Be aware of data types
- Integer arithmetic is faster than decimal arithmetic, which is faster than string operations.
- Sort after tabulating
- Sort the table during or after a tabulation rather than before, especially if
there are more than a few million rows.
- Avoid large worksheets
- Linking to large worksheets (more than a few million rows or many columns) can use a
significant amount of memory.
- If possible, avoid linking to worksheets entirely, except in cases where you
can significantly pare down the number of rows.
- To reduce the size of the foreign worksheet, select rows before linking.
- Use a colord to link in only the needed columns (see the sketch after this list).
- Try to do the analysis within the same worksheet. If necessary, save the
worksheet as a table, then link it in.
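A sketch of paring down the foreign side before linking, assuming (as I believe the Macro Language allows) that operations nested inside <link> are applied to the foreign table before the link; table and column names are hypothetical:

```xml
<link table2="myuser.stores_ws" col="store_id" col2="store_id" suffix="_s">
  <!-- Keep only the rows you need from the foreign worksheet -->
  <sel value="region='Northeast'"/>
  <!-- Link in only the needed columns -->
  <colord cols="store_id,region,sq_ft"/>
</link>
```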
- Suffix and labels
- Consider using a suffix and a label when you link together two data sets that share
common column names. That way, you can avoid redefining a column so it has a
meaningful name (not c1 or t1). Plus, any macro code that refers to a renamed column will always use more resources than code that directly references the original column.
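A sketch (hypothetical tables and columns), assuming the link operation accepts suffix and label attributes as described above:

```xml
<!-- Both tables have columns named sales and date; the suffix keeps the
     linked-in names distinct without a round of renaming columns afterwards -->
<link table2="myschema.online_orders" col="cust_id" col2="cust_id"
      suffix="_online" label="Online"/>
```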
- Understand the costs
- Memory usage and run time are a function of the number of breaks, type of breaks
(original column vs. computed column), number of result columns, complexity of the
break columns (how many characters define a unique break value), and type of
aggregation functions. Tabulate on original columns whenever possible.
- Limit the size of the result
- Do not break on a column (or set of columns) that has more than a few million unique values.
- Only create tabulation groups needed to give unique results
- Unnecessary breaks still create overhead.
- Do things piecewise
- If your data set is very large and you are running into memory problems, apply the
tabulation to subsets of the data and combine the results afterwards.
- Round real numbers
- Before using a real (floating point) column as a break column, round the values. The
system may have trouble distinguishing between real numbers that are very close in
value, and rounding will help ensure that the values are not that close to one another.
- Adjust your advanced settings
- Change your advanced settings for very large data sets and complex queries that
require a lot of memory. In the Advanced Settings dialog, select
a lower Blocking Level or select Do step-wise
aggregation, as these two settings work together to use less memory.
Note: This may slow down execution time.
- Supply column names in tabulations
- It is better to name the variables produced in tabulations instead of relying on the
default names given in the tabulation (e.g., t0, t1,
t2). If you supply column names, you don't have to create an
additional computed column to rename a default tabulation name. Plus, any later code that refers to such a renamed computed column will always use more resources than code that references the tabulation column directly.
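A sketch (hypothetical columns), assuming the tabu/tcol form of the tabulation operation:

```xml
<tabu label="Sales by store" breaks="store_id">
  <!-- Naming the result here avoids a later computed column that just renames t0 -->
  <tcol source="sales" fun="sum" name="total_sales" label="Total Sales"/>
</tabu>
```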
- Row limitations of merged tables
- Merged tables are limited to ~16 million cells (number of rows * number of columns).
- Unrestricted Merges
- If the <merge> operation appears before all other operations in
a macro (except <note>) and the default
attributes type="all" and match="names" are used, and
no other tags are contained within the <merge> operation, then the
merged table size is unlimited.
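A sketch of a merge that meets those conditions; the attribute used here to list the tables (tables) is an assumption, while the leading position, default type="all" and match="names", and absence of nested tags follow the description above:

```xml
<note>Combine this year's and last year's transaction tables</note>
<!-- First real operation in the macro, defaults only, no nested tags -->
<merge tables="myschema.trans_2023,myschema.trans_2024"/>
```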
Saving and Downloading Tables
- Understand the costs
- Saving large tables can take a long time, just like running large tabulations. When you save a table, the system must compute all of its rows and columns, which can tax system resources unnecessarily. In some cases, it is actually faster to run
the query again instead of saving the results.