Segmentation options

The following is an overview of the segmentation options available in TenUp and TenDo.

TenUp and TenDo offer two different segmentation options, segby and sortseg, for segmenting your data. When data is organized in the right way, you can use much more powerful algorithms to analyze that data and thus improve the speed of your calculations.

Segby

Segments are how data is broken up into files on the 1010data servers. Segmenting a table by a column (segby) ensures that all like values in that column are stored together in the same segment. 1010data servers can handle segments of more than 8 million rows, but segments of 3 to 5 million rows are the most common.

For example, suppose you have a column with 40 million values, all of them integers between 1 and 1,000,000. The column is broken up into 10 segments of 4 million rows each, but you don't know how the data is sorted. You want to locate all values of 12345 as quickly as possible. If the table is segby this column, 12345 can appear in only one segment. Once the system finds the segment containing 12345, it can ignore the other nine segments. Thus, segby speeds up searches and allows for quick computation of aggregate and grouping functions such as sums and averages.
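
The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how the 1010data servers place rows: the modulo placement rule and the function names are assumptions made for the example.

```python
def build_segby_segments(values, num_segments):
    """Group values so that every occurrence of a value lands in one segment."""
    segments = [[] for _ in range(num_segments)]
    for v in values:
        # Toy placement rule (an assumption): the value alone picks the segment,
        # so equal values can never end up in different segments.
        segments[v % num_segments].append(v)
    return segments


def segments_containing(segments, target):
    """Return the indexes of all segments in which target appears."""
    return [i for i, seg in enumerate(segments) if target in seg]


values = [12345, 7, 999_999, 12345, 42, 7, 12345, 500_000]
segments = build_segby_segments(values, num_segments=4)

hits = segments_containing(segments, 12345)   # exactly one segment index
count = segments[hits[0]].count(12345)        # aggregate within that segment only
```

Because the data is segby, `hits` always holds a single index, so the count (or a sum or average) can be computed from that one segment without reading the others.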

To use segby in TenUp or TenDo, add the switch --segby, followed by the column name or names, to your command.

To specify the segment size of a table in TenUp, use the --segment-size switch. See -b, --segment-size.

The default segment size for extract and load jobs that do not specify a segment size is 8388608 rows. Note that some segmentation strategies may cause the final segment size to differ from the size specified with the -b (--segment-size) option.
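
As a rough guide to what a given segment size implies, the following Python sketch estimates how many segments a table would need. It assumes every segment is filled to the requested size, which, as noted above, a segmentation strategy may not honor; the function name is hypothetical.

```python
DEFAULT_SEGMENT_SIZE = 8388608  # default stated above; equal to 2**23


def estimated_segments(total_rows, segment_size=DEFAULT_SEGMENT_SIZE):
    """Estimate the segment count, assuming each segment is completely filled."""
    # Integer ceiling division: a partly filled last segment still counts.
    return (total_rows + segment_size - 1) // segment_size


default_count = estimated_segments(40_000_000)            # default segment size
custom_count = estimated_segments(40_000_000, 4_000_000)  # 4 million rows each
```

With the 40 million row column from the earlier example, the default size gives an estimate of 5 segments, while 4 million row segments give the 10 segments used in that example.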

Sortseg

Segby allows for more powerful operations on bigger datasets, but it does not guarantee any kind of order in the table. Sortseg is a stronger form of segmentation that also sorts the data in the table. Within each segment, the values of the sortseg column run from lowest to highest, so each group of like values forms one contiguous run. Sortseg improves performance for some kinds of analysis because the system can rely on known start and end points and skip past entire groups of like values or even entire segments.

A table is sortseg by a column or columns if the following conditions are met:

  • The table is segby this column.
  • The sortseg column in each segment is sorted within the segment. The segments themselves do not have to be in order.
  • The minimum and maximum values of the sortseg column in each segment do not overlap with any other segment.
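
The three conditions above can be checked mechanically. The Python sketch below tests them on a single column represented as a list of segments; the function names are hypothetical and the code is meant only to make the definition concrete.

```python
def is_segby(segments):
    """True if no value appears in more than one segment."""
    first_seen = {}
    for index, segment in enumerate(segments):
        for value in segment:
            if first_seen.setdefault(value, index) != index:
                return False
    return True


def is_sortseg(segments):
    """True if the segments meet all three sortseg conditions."""
    non_empty = [list(s) for s in segments if s]
    if not is_segby(non_empty):                      # condition 1
        return False
    if any(s != sorted(s) for s in non_empty):       # condition 2
        return False
    # Condition 3: order the (min, max) ranges and compare neighbors.
    ranges = sorted((s[0], s[-1]) for s in non_empty)
    return all(prev[1] < cur[0] for prev, cur in zip(ranges, ranges[1:]))


out_of_order_segments = [[5, 5, 6], [1, 2, 2], [8, 9]]
overlapping_ranges = [[1, 3, 5], [2, 4, 6]]
unsorted_segment = [[2, 1], [3, 4]]
```

Here `out_of_order_segments` passes, because the segments themselves may appear in any order. `overlapping_ranges` is segby and sorted but fails the third condition, and `unsorted_segment` fails the second.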

To use sortseg in TenUp or TenDo, add the --sortseg switch, followed by the column name or names, to your command.

Segby advise

Segby is even more powerful if you have information about how different columns are related. For example, if you have a table of data about grocery stores with the column names store_id, city, and state, you can be certain that every row with a given value for store_id has the same values for city and state (since the store can exist in only one city and state). Therefore, if a table is segby state, it is also segby store_id: all rows for a state are in one segment, so all rows for any store in that state are in that segment too.

In addition, you can use segby advise alongside a sortseg column when you know that another column is segmented the same way but not necessarily sorted the same way. In the store example above, the table can be sortseg by state while store_id is only segby: the states are in order, but within each state the stores might appear in any order.
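
The store example can be verified on a small scale. The Python sketch below uses made-up rows and hypothetical helper names to confirm that store_id determines state and that a table segmented by state is therefore also segby store_id, even though store_id is not sorted.

```python
def determines(rows, key, dependent):
    """True if each value of key maps to exactly one value of dependent."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[key], row[dependent]) != row[dependent]:
            return False
    return True


def is_segby(segments, column):
    """True if no value of column appears in more than one segment."""
    first_seen = {}
    for index, segment in enumerate(segments):
        for row in segment:
            if first_seen.setdefault(row[column], index) != index:
                return False
    return True


# Illustrative data only: one segment per state, stores in arbitrary order.
segments = [
    [
        {"store_id": 3, "city": "Albany", "state": "NY"},
        {"store_id": 1, "city": "Buffalo", "state": "NY"},
        {"store_id": 3, "city": "Albany", "state": "NY"},
    ],
    [
        {"store_id": 9, "city": "Austin", "state": "TX"},
        {"store_id": 4, "city": "Dallas", "state": "TX"},
    ],
]
all_rows = [row for segment in segments for row in segment]

store_fixes_state = determines(all_rows, "store_id", "state")
segby_state = is_segby(segments, "state")
segby_store = is_segby(segments, "store_id")
```

All three checks hold for this data. The reverse dependency does not: a state contains several stores, so `determines(all_rows, "state", "store_id")` is false.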

To use segby advise in TenUp or TenDo, add the --segby-advise switch, followed by the column name or names, to your command.

Sortseg advise

Just as with segby advise, you may know that a table that is sortseg by one column is necessarily sortseg by another column. In that case, you can use sortseg advise to improve performance.

To use sortseg advise in TenUp or TenDo, add the --sortseg-advise switch, followed by the column name or names, to your command.
