Which Columns to Use for Grouping
After looking at the last few examples, it should be fairly clear that the G in G_Functions stands for Group.
Whether you're aggregating sales data by store or temperature ranges by city, groups provide the center around which a data summary aggregates. However, for very large data tables not any column can be used as the Group argument in a G_Function.
One of the reasons G_Functions are both fast and efficient is that there is an established level of trust between a G_Function and the data table you are working with. By "trust" we simply mean that the G_Function is expecting the data is already arranged in a certain way, and we must make sure that it is before we pass a column to it.
The full explanation of why this is the case would require we delve into a lot of 1010data's
system architecture, which is slightly outside the scope of this guide. However, we can
address the basic reasons. In order to perform calculations on large datasets, very large
tables (over about 4 million rows, give or take) must be broken up, or, segmented. This
is true of all tables in 1010data that exceed the 4 million row limit. G_Functions must know
how a table is segmented in order to operate on the data correctly. So, in order for a column
to be usable as the G
argument in a G_Function, the table must be segmented
on that column. This means that all the data for a given group is contained in the same
segment. So, for our sales by store example above, if our table was over 4 million rows, we
would need to be sure that all records for store 1 are in the same segment. All of store 2
must be in the same segment, and all of store 3 must be in the same segment as well. Stores 1,
2, and 3 can all be in separate segments from one another, but we can't have some records for
store 1 in one segment and additional records from the same store in a different segment.
This works because if the G_Function can assume all the records for the G
parameter are all in the same segment it doesn't need to check anything. It can simply operate
across the entire segment and know it hasn't missed any records for the given group.
Your next question should be, "How do I know which column or columns a table is segmented by?" Great question! The good news is the system will tell you exactly which column or columns can be used as your Group. Simply go into any of the dialogs we've already used, most notably the Create Computed Column dialog, where you will see the following description: