TenFrame for the Pandas User#

The tenFrame library is an attempt to present 1010data queries in a form that’s closer to what Pandas-users (pandits) expect, so they can program in a familiar manner, instead of having to translate their ideas into 1010data XML macro code.

This guide is meant as an aid for users coming with some understanding of pandas (not necessarily pandas experts!), in order to learn a little about the differences involved in using tenFrame. Be sure to refer to the tenFrame reference documentation for the most detailed and up-to-date information. (This documentation is derived from the docstrings of the library, and so it is also available within python itself, using python’s help() feature.) As you learn more about tenFrame, it will be helpful to understand at least a little of how the 1010data XML macro language looks and works, so you might want to look at its documentation a little as you go along. Still, the aim of tenFrame is to shield you as much as possible from having to know our XML language.

Why Not Just Use Pandas?#

Because tenFrame tries to mimic pandas but is not a perfect copy and is not a drop-in replacement, the user can start to feel that it should be the same as pandas and the little differences seem to be a nuisance. Why aren’t we just using pandas, anyway?

The answer lies in our industry, the “Big Data” industry. 1010data helps its clients analyze and manage Big data. That is, datasets with millions or billions of rows (or more!), far more than any ordinary business computer could hold in its memory at once. If you could just load your data into pandas, you wouldn’t need to use 1010data on it. Conversely, when you crunch a bunch of numbers using 1010data, at the end you’re looking for a much smaller extract of the data that you can make sense out of, and from which you can draw useful conclusions and insights (which, considered as data, are even smaller.) So 1010data’s services help you “whittle down” your mountain of data, and as soon as it’s small enough for you, you can just download the smaller dataset you now have, and perhaps process it using pandas on your local computer.

Another reason for lack of feature parity with Pandas, of course, is that Pandas is an old established library many years in the making (v1.0.0 came out at the end of January 2020, but the library’s history goes back to 2011) with dozens of authors (at least), while tenFrame is much newer and has a much smaller developer community!

TenFrames Aren’t DataFrames – But They Pretend to Be#

The “Dataframe” object in the tenFrame library is the TenFrame. Sure, they try to act like DataFrames, as we’ll see, but it’s important to remember that they aren’t really and truly DataFrames or even tables of any kind. What a TenFrame is is a wrapper around a 1010data query, which is a bunch of instructions on how to look up and/or compute a table full of data (the result set). It’s important to keep track of when things are done, or in tenFrame’s case, when they are not done. For example, if you add a column to a DataFrame in pandas, like df['new'] = df.col1 * df.col2, say, then at the time that expression is evaluated, pandas builds a new column, allocates memory for it (which might be a lot, if there are many rows), and goes through all the rows in the table, multiplying col1 by col2 and putting the result in the new column. All that has to happen before control even passes to the next statement (or if you’re typing in the REPL, before your prompt can come back.)

But “Big Data” is, well, big. Too big to fit into your computer, and too big for you to wait for all the calculations to complete. Even if your computer could hold trillion-row columns, doing a trillion calculations is going to take too long. So when you do things like that with a tenFrame, with exactly the same syntax, frame['new'] = frame.col1 * frame.col2, nothing is computed, no columns are multiplied, and there might not even be any access of 1010data’s servers. All that happens is that the query inside the TenFrame has an instruction added to it that says to multiply those two columns and present the results as a new column with a particular name. And even when tenFrame actually runs the query and sends it to 1010data’s servers, even then no computations are done! The server just understands that there is to be a column with that name whose value is computed in a certain way. It’s only when you actually try to access some data that it is actually computed (which may cause other data which it depends on to be computed as well), and even then, the 1010data servers might compute only part of it, if that’s enough to give you the correct answer. If all my program is asking for is frame['new'][200], there’s no reason to compute all trillion values of the column.

TenFrames and TenSeries are python objects that translate python’s operations on them, like multiplication and assignment, above, and also pandas-like methods on them, into 1010data query instructions. In so doing, they try to act like tables and columns, but that’s not really what they are.

How 1010data Handles Big Data#

In order to understand what can be done in a 1010data query (and thus what can be done in a TenFrame), helps to understand at least a little of how 1010data deals with these huge amounts of data.

Processing in 1010data, roughly speaking, involves a single “main process,” running on a single machine, which in 1010data jargon is called the “accum” (short for accumulator.) Generally, when you’re talking to “the server” in 1010data, it’s the accum you’re talking to. The raw data in the tables, the columns containing trillions of entries, are stored columnwise on one or more machines (“subprocs”), each dealing with one “segment” of the data at a time. The accum sends instructions to the subprocs, which do the computation on all the segments in parallel and send the results back to the accum as needed.

Sometimes, computations result in a table that’s no longer distributed, but is all stored in memory on the accum. Tabulations (aggregations) are the best example. The accum receives results from all the subprocs and combines them all into a table, but the results are not part of the data, and have a completely different number of rows, so the results of the tabulation are all stored in memory on the accum. This can put limits on the size of the results of these sorts of computations.

The 1010data system also aggressively caches queries and partial queries, so building up a query step-by-step is reasonably efficient even if you run it between steps, and running the same query twice involves comparatively little overhead (tenFrame avoids running the same TenFrame twice anyway, but even if you wound up with the same query in another way it would run quickly).

1010data, py1010, and tenFrame#

The tenFrame library is built using another 1010data library called py1010, which provides access to the underlying 1010data tables. Py1010 performs transactions with the 1010data Edge servers, including logging in, running queries, and fetching data and metadata about the results. TenFrame does not require py1010, but without it you can only use it in “offline mode,” which we’ll get to later.

TenFrame attempts to reduce the need for knowing 1010data’s query-language, but knowledge of it is certainly helpful. Some py1010 objects are also exposed and used in tenFrame, so it’s a good idea to familiarize yourself with py1010 a little, if possible.

So a TenFrame is a presentation layer wrapped around a 1010data query, and it uses a py1010 Query object to send requests over to the 1010data server to run the query, when needed. That “when needed” is important to remember: you might develop a huge, complicated query and never have the slightest delay as you type it into an interactive session, because you aren’t necessarily doing anything except fiddling around with little operations on your local machine. Recall what was said above in the introduction. TenFrame will only actually send the query to be run when you ask for something it feels requires it. That might be when you ask for data from the results, but it also might just be when something you do needs to know the names of the column in the results. So don’t be surprised if things pause while the query is running and you didn’t think it needed to.

Getting Started#

So, let’s look at how to begin using tenFrame a little.

In [1]: import py1010
In [2]: import tenFrame as tf
In [3]: session = py1010.Session("https://www2.1010data.com/prime-latest/gw", USERNAME, PASSWORD, py1010.POSSESS)
In [4]: stations = tf.TenFrame(session, "pub.demo.weather.stations")
In [5]: stations
Out[5]:
     id        name state  tz     lat      lon  elev  anht
0  3103   FLAGSTAFF    AZ  -7  35.133  111.667  2135     6
1  3812   ASHEVILLE    NC  -5  35.433   82.533   661     6
2  3813       MACON    GA  -5  32.700   83.650   110     7
3  3816     PADUCAH    KY  -6  37.067   88.767   125    10
4  3820     AUGUSTA    GA  -5  33.367   81.967    45     6
5  3822    SAVANNAH    GA  -5  32.133   81.200    16     9
6  3856  HUNTSVILLE    AL  -6  34.650   86.767   190     8
7  3860  HUNTINGTON    WV  -5  38.367   82.550   255     6
8  3870  GREENVILLE    SC  -5  34.900   82.217   296     7
9  3872     BECKLEY    WV  -5  37.783   81.117   765    10
In [6]: len(stations)
Out[6]: 262
In [7]:

As you see, you import py1010 and the TenFrame object (you should import py1010 first), and then start a session using py1010’s Session object constructor. You give it the gateway URL, username, password and login type (KILL, NOKILL, or POSSESS. See the py1010 or 1010data documentation.) Then you can create a TenFrame based on that session and the table “pub.demo.weather.stations”. Note how much the TenFrame looks like a Pandas DataFrame when it’s presented! It’s nice, but gets there by “cheating” a little: it really just makes a little DataFrame our of the first ten rows (see the box below if you’re curious). Pandas shows you first and last few rows of a long frame, and also tells you how many there are, but here you’re only seeing a complete DataFrame made out of just the top 10 rows. Still, column-wise, it’s basically accurate: It’s a table (frame) with eight columns: id, name, state, tz, lat, lon, elev, and anht. This table represents weather stations, giving each one’s ID and name (generally a city), state, time-zone, latitude, longitude, elevation and the height of its anemometer (that’s something you need to know when you’re doing weather.)

Even with just this one table, we can do some simple operations:

In [7]: stations['coord_sum'] = stations.lat + stations.lon
In [8]: stations
Out[8]:
     id        name state  tz     lat      lon  elev  anht  coord_sum
0  3103   FLAGSTAFF    AZ  -7  35.133  111.667  2135     6    146.800
1  3812   ASHEVILLE    NC  -5  35.433   82.533   661     6    117.966
2  3813       MACON    GA  -5  32.700   83.650   110     7    116.350
3  3816     PADUCAH    KY  -6  37.067   88.767   125    10    125.834
4  3820     AUGUSTA    GA  -5  33.367   81.967    45     6    115.334
5  3822    SAVANNAH    GA  -5  32.133   81.200    16     9    113.333
6  3856  HUNTSVILLE    AL  -6  34.650   86.767   190     8    121.417
7  3860  HUNTINGTON    WV  -5  38.367   82.550   255     6    120.917
8  3870  GREENVILLE    SC  -5  34.900   82.217   296     7    117.117
9  3872     BECKLEY    WV  -5  37.783   81.117   765    10    118.900
In [9]:

This works just like you would expect it to: adds a new column called coord_sum to the table, and its value is the sum of the latitude and the longitude for each row. (Why you would ever want to add the latitude and longitude together is another question.)

You can also create an “in-memory” table in 1010data directly from a DataFrame by using the data= keyword. This is a table that’s actually baked into your query as a table “literal”, so to speak, so you can operate on it within 1010data. This isn’t generally useful by itself (you already have the data locally, you can operate on it with Pandas), but sometimes it might be needed for combining with 1010data tables – or for examples:

In [1]: pdiris = pd.read_csv("iris.csv")
In [2]: iris = tf.TenFrame(session, data=pdiris)
In [3]: iris
Out[3]:
   SepalLength  SepalWidth  PetalLength  PetalWidth Species
0          5.1         3.5          1.4         0.2  setosa
1          4.9         3.0          1.4         0.2  setosa
2          4.7         3.2          1.3         0.2  setosa
3          4.6         3.1          1.5         0.2  setosa
4          5.0         3.6          1.4         0.2  setosa
5          5.4         3.9          1.7         0.4  setosa
6          4.6         3.4          1.4         0.3  setosa
7          5.0         3.4          1.5         0.2  setosa
8          4.4         2.9          1.4         0.2  setosa
9          4.9         3.1          1.5         0.1  setosa
In [4]:

Here are some other simple things you can do:

In [9]: stations[stations.elev > 1000]
Out[9]:
      id                    name state  tz  ...      lon  elev  anht  coord_sum
0   3103               FLAGSTAFF    AZ  -7  ...  111.667  2135     6    146.800
1  23009                 ROSWELL    NM  -7  ...  104.533  1112     6    137.833
2  23044                 EL PASO    TX  -7  ...  106.400  1194    10    138.200
3  23047                AMARILLO    TX  -6  ...  101.700  1098     7    136.933
4  23050             ALBUQUERQUE    NM  -7  ...  106.617  1619     7    141.667
5  23061                 ALAMOSA    CO  -7  ...  105.867  2297     9    143.317
6  23062  DENVER-STAPLETON INT'L    CO  -7  ...  104.867  1610    10    144.634
7  23065                GOODLAND    KS  -7  ...  101.700  1124    10    141.067
8  23066          GRAND JUNCTION    CO  -7  ...  108.533  1475     7    147.650
9  23153                 TONOPAH    NV  -8  ...  117.133  1653     8    155.200
In [10]: len(stations[stations.elev > 1000])
Out[10]: 31
In [11]: stations2 = stations.groupby("state").agg({'lat':'mean', 'lon':'mean', 'elev':'mean'})
In [12]: stations2
Out[12]:
  state   lat_mean    lon_mean    elev_mean
0    AZ  33.833250  111.762500  1196.000000
1    NC  35.355667   79.274833   218.333333
2    GA  33.052833   83.252833   144.333333
3    KY  37.990000   85.416800   251.800000
4    AL  32.800000   87.041750   127.750000
5    WV  38.350000   81.279250   476.000000
6    SC  33.916667   81.122333   125.666667
7    TX  31.079353   98.658882   383.411765
8    KS  38.680200   98.473400   607.400000
9    LA  30.775000   92.108500    27.000000
In [13]:

In Pandas, the same queries would be

>>> df[df.elev > 1000]
...
>>> df2 = df.groupby("state").agg({'lat':'mean', 'lon':'mean', 'elev':'mean'})

that is, exactly the same as with tenFrame, as you can see if you try them out. But you will notice a difference in the output of the second one, though. In Pandas, the results are automatically re-sorted by state, which has become the index of the new frame, whereas tenFrame doesn’t have a concept of a true “index,” so the order is unchanged. You can always apply the .sort_values() method, like you might do in Pandas.

But more generally, note that the .agg() function works like it does in pandas. It’s probably best to provide the full aggregation data, which columns aggregated by which functions, but tenFrame will generally interpret the parameters of .agg() about the same way as Pandas:

In [13]: stations.agg(['sum'])
Out[13]:
      id_sum  tz_sum    lat_sum    lon_sum  elev_sum  anht_sum
0  7465955.0 -1661.0  10511.606  26105.871   98055.0    1944.0
In [14]:

The TenFrame#

TenFrame is the DataFrame-like class of tenFrame. As we saw earlier, it really only pretends to be DataFrame-like: it’s really a wrapper around a 1010data query.

Constructing#

The __init__ method of the TenFrame class has the following signature:

  • __init__(self, session, table=None, ops=None, data=None, meta=None, load=None, imports=None, imports_once=None)

Although it appears that the only mandatory parameter is the session, in fact, you must supply one of table, ops, data, or load. If the session is None, you can only work in “offline mode” (q.v.) Normally, you’ll just specify the table, a string (pathname) of the name of the table in the 1010data object tree. If you want to work from a pandas DataFrame you already have on your local machine, uploading it into 1010data as an in-memory table, you can use data=dataframe instead. (The ops parameter is only if you have some specific set of ops with which to initialize the query, and should be considered deprecated.)

If the load parameter is given, then the path named should be a saved 1010data QuickQuery, and instead of being treated like a table (i.e., being run and its output being used as the base dataset of this query), the query is actually read in as a query, and its operations parsed to become the operations of the new TenFrame. This feature is used to implement the Library class, and probably is not very useful to most users directly on TenFrames.

Warning

This functionality is intended to work mainly with queries that were saved with tenFrame in the first place. Although most pre-existing queries can be loaded in, the resulting code query might not conform to what tenFrame expects, so the resulting TenFrame may behave in unexpected ways. Also, some existing queries may have quirks in their XML code that can cause the load functionality to fail and raise an exception.

The imports and imports_once parameters each can take the pathname of a saved library in the 1010data object tree (or a list of pathnames), which will be inserted into <import> ops (with once="1" for import_once). See the documentation on the import op.

Subsetting#

Subsetting (selecting a subset of a frame or series using square brackets) is used in many different ways by Pandas DataFrames, with different types of subset entailing various different behaviors. TenFrames attempt to emulate those where possible.

Most straightforward is subsetting by a string. In Pandas and tenFrame alike, this will return the column of that name which is in the frame (a Series object in Pandas, a TenSeries object in tenFrame). This kind of subsetting can also be done as an attribute, like in pandas, i.e., frame.price is the same as frame['price'], for any name that isn’t already used for some other attribute of TenFrame. (Also like in pandas, you cannot set a new column this way, but must use brackets, i.e., frame['new'] = ..., not frame.new = ....)

In [14]: stations['state']
Out[14]:
  state
0    AZ
1    NC
2    GA
3    KY
4    GA
5    GA
6    AL
7    WV
8    SC
9    WV
In [15]: stations.lat
Out[15]:
      lat
0  35.133
1  35.433
2  32.700
3  37.067
4  33.367
5  32.133
6  34.650
7  38.367
8  34.900
9  37.783
In[16]:

Warning

One important quirk to keep in mind is that although subsetting with the name of a non-existent column will raise a KeyError in pandas, subsetting a TenFrame by a string (or specifying an attribute) will always succeed (though it may cause an exception afterward, see below). A TenSeries is little more than a column name and a reference to a TenFrame, so it can always be created. We need to do this because there is no way to know what the columns in a 1010data query are without running the query, and that might take a long time, so we want to avoid doing that unless we really need data. This can potentially lead to some puzzling errors or lack of errors, so watch out for it. One way this situation can arise is if you try to access a member of the TenFrame object that you expect to be there because it’s in Pandas, but isn’t in actually in tenFrame. (In order to try to make this happen a little less often and a little less confusingly, trying to look at a TenSeries by just typing its name in the REPL and if it isn’t really there in the TenFrame, an exception will be raised.)

In [16]: stations.nosuchcolumn       #  The REPL knows
Out[16]:
AttributeError
....
AttributeError: nosuchcolumn not found in this TenFrame.
In [17]: stations['xyz'] = stations.nosuchcolumn - stations['anothercolumn']   # But in general, no error.
In [18]: #  No error!!!
In [18]: stations   # But if you ever try to run it or get data from it...
Out[18]:
TentenException
...
TentenException: ('function run failed, returned 1 (Computed column xyz cannot be parsed: invalid column or function: nosuchcolumn)', 1)
In [19]:

Also unlike pandas, you can also (always) subset by an integer. Pandas does not have a concept of the order of columns, but TenFrame does, so you can select a column by its number.

As in pandas, you can subset with a list of column-names to restrict the TenFrame to just those columns, or reorder them:

In [19]: stations[["name", "tz", "state]]
Out[19]:
         name  tz state
0   FLAGSTAFF  -7    AZ
1   ASHEVILLE  -5    NC
2       MACON  -5    GA
3     PADUCAH  -6    KY
4     AUGUSTA  -5    GA
5    SAVANNAH  -5    GA
6  HUNTSVILLE  -6    AL
7  HUNTINGTON  -5    WV
8  GREENVILLE  -5    SC
9     BECKLEY  -5    WV
In [20]:

However, in 1010data code, restricting the columns this way doesn’t actually get rid of the old columns. They’re still there and can still be accessed and referenced by subsequent operations:

In [20]: somecols = stations[["name", "tz", "state"]]
In [21]: somecols['coordprod'] = somecols.lat * somecols.lon   # would not work in pandas!
In [22]: somecols
Out[22]:
         name  tz state    coordprod
0   FLAGSTAFF  -7    AZ  3923.196711
1   ASHEVILLE  -5    NC  2924.391789
2       MACON  -5    GA  2735.355000
3     PADUCAH  -6    KY  3290.326389
4     AUGUSTA  -5    GA  2734.992889
5    SAVANNAH  -5    GA  2609.199600
6  HUNTSVILLE  -6    AL  3006.476550
7  HUNTINGTON  -5    WV  3167.195850
8  GREENVILLE  -5    SC  2869.373300
9     BECKLEY  -5    WV  3064.843611
​In [23]:

Subsetting can also select rows instead of columns. Saying df[df['price'] > 10.0] in pandas is really creating a boolean Series with df['price'] > 10.0] and then selecting on it by putting it in the brackets. It works exactly the same in tenFrame. frame['price'] > 10.0 evaluates to a boolean TenSeries, analogous to how it works in pandas, so the same syntax (frame[frame['price'] > 10.0]) with TenFrames will produce the same result: a TenFrame which selects just those rows where the price column is greater than 10. We saw this in the first examples.

Subsetting on a slice also slices along the rows and not the columns, and produces a “server-side slice,” which is a query with operations that select a specific sequence of rows based on their row number.

Iterating over a TenFrame produces a list of the column names, as it does in pandas. As mentioned above, anything that requires actually knowing what columns really are there will entail running the query.

loc and iloc#

Currently, tenFrame does not support the .loc member of pandas DataFrames for selecting subsets. .loc is mostly used for selection by index, and tenFrame does not have the same concept of indexing that pandas has. The .iloc member of the TenFrame object returns the .rows member of the underlying query, which is not quite the same, but seems to work for most cases.

Setting Values#

As we saw above, just like in pandas, you can add a new column by setting a new item, usually to a new (Ten)Series. Like a Series in Pandas, a TenSeries represents a single column that’s part of a TenFrame. (Series have other uses in Pandas as well, but TenSeries are more restricted in meaning.)

In [1]: import tenFrame as tf
In [2]: sales = tf.TenFrame(session, 'retaildemo.retail.sales_detail')
In [3]: sales['total'] = sales.price * sales.qty
In [4]: sales['aconst'] = 7
In [5]: sales
Out[5]:
   transid        date       tme     sku           upc  store  qty  ...  xsales  price   cost  customer_key  promo_flag  total  aconst
0   100001  2012-12-31  03:52:07  222218    7084781260    728  1.0  ...    1.97   1.97   1.53          <NA>           0   1.97       7
1   100002  2012-12-31  04:07:58  260548   81809400001    728  1.0  ...    1.39   1.39   0.99          <NA>           0   1.39       7
2   100003  2012-12-31  05:10:44  224423    7116603129    728  1.0  ...    2.77   2.77   2.12          <NA>           0   2.77       7
3   100004  2012-12-31  05:38:59  306141  999999907314    728  1.0  ...    0.49   0.49   0.32          <NA>           0   0.49       7
4   100004  2012-12-31  05:38:59  101528    1200000452    728  1.0  ...    2.57   2.57   1.87          <NA>           0   2.57       7
5   100004  2012-12-31  05:38:59  110966    2200000898    728  1.0  ...    1.27   1.27   0.67          <NA>           0   1.27       7
6   100005  2012-12-31  06:04:51  161510    6150000594    728  2.0  ...    1.42   0.71   0.89          <NA>           0   1.42       7
7   100006  2012-12-31  06:31:45  105159    1590000209    728  1.0  ...    1.29   1.29   0.95          <NA>           0   1.29       7
8   100006  2012-12-31  06:31:45  102143    1230000055    728  2.0  ...   13.32   6.66  12.33          <NA>           0  13.32       7
9   100006  2012-12-31  06:31:45  298945  999999900118    728  1.0  ...    0.82   0.82   0.80          <NA>           0   0.82       7
>>>

Compatibility#

An error you might see popping up when you do an assignment, as well as in other circumstances, is a CompatibilityException, saying that the frames are not “compatible.” This comes down to the nature of TenFrames once again, that they are really queries and not tables of numbers like Pandas DataFrames. In the general case, it doesn’t make sense to assign a column (TenSeries) from one TenFrame into another. The could come from different tables, or radically different queries. Maybe the column is defined in terms of other columns in its original TenFrame which don’t exist in the one you’re assigning to. When you realize that each column is defined by some query that takes data from some arbitrary table and manipulates it in arbitrary ways, you can see that lifting a column from one TenFrame and plunking it down, without its context, into another cannot be expected to make sense, in general. When two TenFrames are not compatible and you try to combine them or use them with each other in ways that tenFrame can’t make sense of, you’ll get a CompatibilityException.

But there are certain special cases in which this can make sense and in which you can assign a TenSeries into another TenFrame, and many of the cases we care about fall into this category. This happens when the two TenFrames in question are mostly the same, and so a column being brought from one to another wouldn’t be losing its context. Here’s an example from just above:

sales['total'] = sales.price * sales.qty

(Keep in mind that we are considering the general case, so we can’t assume that sales simply references a table; it might have all kinds of other operations on it as well.) Now, the product of two TenSeries (the right-hand side of the assignment) is another TenSeries, and every TenSeries is part of a TenFrame. That TenFrame is not the same one as sales, because sales, as yet, doesn’t have a column corresponding to that product. It’s a new, nameless TenFrame, with a brief and fleeting existence. But what’s important about this new TenFrame is that it started as a copy of sales, so the query it contains is the same as what’s in sales – with the addition of the new column. It has all the same operations as sales, plus one more, the one at the end, which defines the new product column. So it’s an “extension” of sales, and the new column will have the same “context” when moved over to sales. In fact, since this new TenFrame has everything sales has possibly more, we can simply replace the operations in sales with the operations from the new TenFrame and not lose anything, and indeed gain the definition of the new column.

For all that that’s a “special case,” it’s also the situation you’re usually dealing with, since you don’t usually want to combine a table with an unrelated column this way. So the bottom line is that things should usually work out the way you’d want and expect; just know about compatibility for those times when problems do arise.

Properties#

Here are some key properties (not methods) on TenFrames. Many of the pandas DataFrame properties are represented here, with the same names meaning more or less the same things.

  • shape: A tuple containing the number of rows and columns, as (rows, columns). All 1010data tables and queries, and thus all TenFrames, are two-dimensional.

    In [1]: frame.shape
    Out[1]: (5997373902, 13)
    
  • table: the name of the table with which this TenFrame was created. If none was given (e.g. if you used the data= parameter in the constructor), this property’s value is None.

    In [1]: frame1 = tf.TenFrame(session, "retail.sales_master")
    In [2]: frame1.table
    Out[2]: 'retail.sales_master'
    In [3]: df = pd.read_csv("dat.csv")   # Load in a pandas DataFrame
    In [4]: frame2 = tf.TenFrame(session, data=df)
    In [5]: frame2.table
    In [6]:
    
  • size: The number of elements in the TenFrame, rows × columns.

    In [1]: frame.shape
    Out[1]: (5997373902, 13)
    In [2]: frame.size
    Out[2]: 77965860726
    
  • columns: The names of the columns in this TenFrame. This is a list in tenFrame, not an Index object like it is in pandas.

    In [1]: frame.columns
    Out[1]:
    ['transid',
     'date',
     'tme',
     'sku',
     'upc',
     'store',
     'qty',
     'weight',
     'xsales',
     'price',
     'cost',
     'customer_key',
     'promo_flag']
    
  • cols: A tuple containing the py1010 Column objects for each of the columns. This allows access at a slightly lower level. Please see the py1010 documentation for more information if you need to work with these.

    In [1]: frame.cols
    Out[1]:
    (Column(i: transid),
     Column(i: date),
     Column(i: tme),
     Column(i: sku),
     Column(a: upc),
     Column(i: store),
     Column(f: qty),
     Column(f: weight),
     Column(f: xsales),
     Column(f: price),
     Column(f: cost),
     Column(i: customer_key),
     Column(i: promo_flag))
     In [2]:
    
  • iloc: The py1010 row iterator for this TenFrame (equivalent to query.rows on the underlying py1010 query). This may not be exactly the same as pandas iloc, but it does seem to be pretty close. Basically, a thing you can iterate over, subset to access a specific row, slice, etc. The rows it gives back are tuples containing the data in order of the columns.

    In [1]: frame.iloc[192]
    Out[1]:
    (100049,
     datetime.date(2012, 12, 31),
     datetime.time(9, 25, 2),
     138544,
     '4012',
     728,
     1.0,
     1.25,
     1.71,
     1.37,
     0.66,
     nan,
     0)
    
  • ndim: Number of dimensions, as in pandas. But note that it always returns 2, since all TenFrames are two-dimensional.

  • session: The py1010 Session object this TenFrame uses to access the 1010data servers. Supplied in the constructor. Might be None if the TenFrame is working offline.

  • Everything else: Accessing any attribute of a TenFrame that isn’t already the name of some existing attribute or property or method returns the TenSeries for that column of the TenFrame, like in pandas. As mentioned above, though, unlike pandas, this access will always succeed, whether or not there really is such a column in the table, since at the time of access, the TenFrame doesn’t necessarily know what columns are in a table. But it will raise an exception if you try looking at it in the REPL, to avoid mystifying messages about empty dataframes.

    In [1]: frame.transid
    Out[1]:
       transid
    0   100001
    1   100002
    2   100003
    3   100004
    4   100004
    5   100004
    6   100005
    7   100006
    8   100006
    9   100006
    
    In [2]: xyz = frame.bloop     # nonexistent column!  But it succeeds anyway
    In [3]: frame.bloop           # but looking at it this way doesn't.
    Traceback (most recent call last):
      ....
      File ..., line ..., in __repr__
        raise AttributeError(f"{self.columnName} not found "
    AttributeError: bloop not found in this TenFrame.
    In [4]:
    

Methods#

Some methods (mostly Pandas-emulating) that you can find on TenFrames.

  • assign(self, **kwargs): Assign new columns to a TenFrame. Equivalent to frame[k] = v for the k, v pairs in kwargs (except that it returns a new object instead of modifying the one it’s called on, same as in pandas). As in pandas, later assignments can refer to earlier ones in the same statement.

    This method can be used with callables (functions) as values, in which case the function is made part of the query and executed on each subproc on the server side (if supported), and passed a single argument, which will be a Pandas DataFrame containing one segment of the current table.

    In [1]: def strangeproc(df):
       ...:     # df.vol.std() would be the stddev only of this segment, if there were more than one!
       ...:     return df.rkm**2 + df.mass - df.vol.std()
    In [2]: sol2 = sol.assign(massdist=sol.mass + sol.dist, densdist=sol.massdist/sol.vol, strange=strangeproc)
    In [3]: sol2
    Out[3]:
           name    type   orbits       rkm           vol          mass  \
    0       Sun    star        -  696324.0  1.414300e+09  1.988550e+09
    1   Jupiter  planet      Sun   69911.0  1.431280e+06  1.898600e+06
    2    Saturn  planet      Sun   58232.0  8.271300e+05  5.684600e+05
    3    Uranus  planet      Sun   25362.0  6.834000e+04  8.683200e+04
    4   Neptune  planet      Sun   24622.0  6.254000e+04  1.024300e+05
    5     Earth  planet      Sun    6371.0  1.083000e+03  5.973600e+04
    6     Venus  planet      Sun    6051.0  9.280000e+02  4.868000e+03
    7      Mars  planet      Sun    3389.0  1.630000e+02  6.417000e+03
    8  Ganymede    moon  Jupiter    2634.0  7.600000e+01  1.480000e+02
    9     Titan    moon   Saturn    2576.0  7.100000e+01  1.340000e+02
    
               dist      massdist      densdist       strange
    0  0.000000e+00  1.988550e+09  1.406031e+00  4.865312e+11
    1  7.783000e+08  7.801986e+08  5.451055e+02  4.565014e+09
    2  1.427000e+09  1.427568e+09  1.725930e+03  3.067102e+09
    3  2.871000e+09  2.871087e+09  4.201181e+04  3.188855e+08
    4  4.497100e+09  4.497202e+09  7.190922e+04  2.819129e+08
    5  1.496000e+08  1.496597e+08  1.381900e+05 -2.837830e+08
    6  1.082000e+08  1.082049e+08  1.166001e+05 -2.878129e+08
    7  2.279000e+08  2.279064e+08  1.398199e+06 -3.129407e+08
    8  1.070000e+06  1.070148e+06  1.408089e+04 -3.174943e+08
    9  1.221850e+06  1.221984e+06  1.721104e+04 -3.177965e+08
    
    In [4]:
    
  • apply(self, func, axis=0, args=(), **kwargs): Acts on the server side. Moves all the data in the current query to into the accum, converts it into a Pandas DataFrame, and runs df.apply(func, axis, args) on it, and the result (a DataFrame) becomes the new result-set of this query. So you should only use this when you know that the current data-set is small enough to fit into a single computer. The accums that 1010data uses are pretty beefy and can handle fairly large datasets, but that doesn’t compare to the amount of data that the 1010data system is designed to work with, as mentioned above. Using apply() when the data is still too big may result in a server-side error.

  • isnull(self), notnull(self): As in Pandas.

    In [1]: tab
    Out[1]:
          i    f     a
    0     1  3.0     x
    1  <NA>  5.0  <NA>
    2     3  NaN     z
    
    In [2]: tab.isnull()
    Out[2]:
       i_isna  f_isna  a_isna
    0       0       0       0
    1       1       0       1
    2       0       1       0
    
    In [3]:
    

    (Note that 1010data does not have a reserved boolean type, so True/False values are simply 1 and 0.)

  • dropna(self, axis=0, how=”any”, thresh=None, subset=None, inplace=False): As in Pandas. Currently only axis=0 is supported.

    In [1]: tab.dropna()
    Out[1]:
       i    f  a
    0  1  3.0  x
    
    In [2]: tab.dropna(subset=["i", "a"])
    Out[2]:
       i    f  a
    0  1  3.0  x
    1  3  NaN  z
    
    In[3]:
    
  • fillna(self, value=None, *, cols=None, inplace=False): Returns a TenFrame (this one or a copy, depending on inplace) wherein the NA values in the specified columns (or all of them) are replaced by the given value. At this point, the value must either be a scalar or a TenSeries.

    In [1]: tab.fillna(-600, cols=["i", "f"])
    Out[1]:
         i      f     a
    0    1    3.0     x
    1 -600    5.0  <NA>
    2    3 -600.0     z
    
    In [2]: tab.fillna(-600)           # But if you try this...
    ...
    TentenException: ("Tenten error: 1 (b'amended column has changed type: a')", 1)
    # 1010data can't handle the type-juggling like pandas!
    
  • transpose(self, *, promote=True): Returns a transposed copy of this TenFrame (using the 1010data <transpose> operator). 1010data restricts transposition to tables wherein all the columns are of the same type or types that can be “promoted” to be of the same type (when promote=True, which is the default.) As in Pandas, this method can also be accessed as the .T property on a TenFrame.

    In [1]: tab.T
    Out[1]:
      m0    m1    m2
    0  1  <NA>     3
    1  3     5  <NA>
    2  x  <NA>     z
    
    In [2]:
    

    Note the column names; 1010data column names can’t be just numbers, like they can in pandas, names m0, m1, etc. are made up for them.

  • sort_values(self, by, ascending=True, inplace=False, na_position=”last”, **kwargs): As in Pandas, though restricted (e.g., no “key” parameter). by can be a single column name or a list, and if it is a list of more than one name, the dir is ignored.

    In [1]: sol = tf.TenFrame(session, "default.test.solar")
    In [2]: sol.sort_values("dist")
    Out[2]:
           name    type   orbits       rkm           vol          mass        dist
    0       Sun    star        -  696324.0  1.414300e+09  1.988550e+09         0.0
    1    Triton    moon  Neptune    1353.0  1.000000e+01  2.100000e+01    354759.0
    2      Moon    moon    Earth    1737.0  2.100000e+01  7.400000e+01    385000.0
    3        Io    moon  Jupiter    1821.0  2.500000e+01  8.900000e+01    422000.0
    4   Titania    moon   Uranus     788.0  2.000000e+00  3.000000e+00    436000.0
    5    Europa    moon  Jupiter    1560.0  1.600000e+01  4.800000e+01    670900.0
    6  Ganymede    moon  Jupiter    2634.0  7.600000e+01  1.480000e+02   1070000.0
    7     Titan    moon   Saturn    2576.0  7.100000e+01  1.340000e+02   1221850.0
    8  Callisto    moon  Jupiter    2410.0  5.800000e+01  1.080000e+02   1880000.0
    9   Mercury  planet      Sun    2439.0  6.000000e+01  3.300000e+02  57900000.0
    
    In [3]:
    
  • merge(self, right, on, right_on=None, how=”left”, cols=None, **kwargs): Also can be called as a top-level function, as in pandas: tf.merge(frame1, frame2, right, on, ...). More or less as in Pandas, with some restrictions. Pandas merge is very similar to the operation known as “link” in 1010data, but there are some aspects of each that do not line up. The right parameter can be another TenFrame, or it can simply be a string giving the name (pathname) of a table in the 1010data object tree. Of the types of merge allowed in pandas (how=), only "left" and "inner" type joins are currently supported, but there are a few 1010data-specific types that are also permitted: "include", "exclude", and "asof". See the 1010data documentation for the link op for details.

  • concat(self, other, axis=0): Also available pandas-style, as tf.concat([frame1, frame2, ...]). As in pandas, though perhaps a bit limited. Concatenate two tables, either appending the rows (axis=0) or the columns (axis=1). If appending rows, the two TenFrames should have the same number of columns. Beware of some limitations, though. Appending rows involves consolidating all the data into the accum, whose capacity is limited (i.e., as a single machine, it can’t handle truly “big data”), so it might hit its limits and signal an error. Appending columns does not run this risk, but you can only append columns in a table, not a worksheet (i.e., you have to use either a string naming a table or a TenFrame which has no operations in it apart from just opening a table.)

    In [1]: base.shape
    Out[1]: (15350, 9)
    In [2]: base.concat(base).shape
    Out[2]: (30700, 9)      # Twice the length
    In [3]: base.concat("pub.demo.baseball.master", axis=1).shape
    Out[3]: (15350, 18)     # Twice the width
    
  • agg(self, agg_info, …): As in Pandas, more or less. Some parameters omitted from the signature here, as they are mostly for use by people more familiar with 1010data macro code and would only be confusing if listed. See the reference documentation for details.

    In [6]: sol.agg({"mass":"sum"})
    Out[6]:
           mass_sum
    0  1.991278e+09
    
    In [7]: sol.groupby("type").agg({"mass":"sum"})
    Out[7]:
               type      mass_sum
    0          star  1.988550e+09
    1        planet  2.727673e+06
    2          moon  6.250000e+02
    3  dwarf planet  3.000000e+01
    
    In [8]:
    
  • head(self, n=20): Equivalent to self.slice(0, n). The first n rows of the TenFrame.

  • tail(self, n=20): Equivalent to self.slice(-n). The last n rows of the TenFrame

  • pipe(self, func, *args, **kwargs): Similar to Pandas .pipe(), but note that it is executed on the server side. The source code of the func is incorporated in the query and passed to the server, where the main process (the accum) uses the Pandas .pipe() method to run it on the table computed so far. So the function you pass it needs to expect a Pandas DataFrame as its input, and return a Pandas DataFrame as its output (but the result you get back on the client side is still a TenFrame). But see also the warning given above about the apply() method.

  • replace(self, to_replace, value, inplace=False, *, cols=None, negate=False): As in Pandas, replace values with value in places where condition to_replace is True.

  • where(self, cond, other, inplace=False, *, cols=None): As in Pandas, replace values with other in places where the condition is False (tenFrame perpetuates Pandas’ strange naming.) Just calls self.replace(cond, other, inplace=inplace, cols=cols, negate=True).

  • melt(self, id_vars=None, value_vars=None, var_name=”variable”, val_name=”value”): Works like Pandas.DataFrame.melt(), except without the col_level and ignore_index parameters, since TenFrames have no true index. Be careful that all the “value_vars” are of the same type, or you will get an error from the server complaining about inappropriate arguments to “case.” The resulting TenFrame tends to be ordered differently from what you would see in Pandas, as like “variable” rows are not grouped together.

    In [1]: sol.melt(["type"], ["rkm", "mass", "vol"])
    Out[1]:
         type variable         value
    0    star     mass  1.988550e+09
    1    star      vol  1.414300e+09
    2    star      rkm  6.963240e+05
    3  planet     mass  1.898600e+06
    4  planet      vol  1.431280e+06
    5  planet      rkm  6.991100e+04
    6  planet     mass  5.684600e+05
    7  planet      vol  8.271300e+05
    8  planet      rkm  5.823200e+04
    9  planet     mass  8.683200e+04
    
    # Pandas would have had all the "mass" rows and then all the "vol" rows, etc.
    
    In [2]:
    

Some Non-Pandas Methods#

TenFrame is not Pandas, and has things that matter to it which don’t make sense in pandas. Here are a few of the methods of TenFrames which don’t necessarily have counterparts in Pandas DataFrames.

  • extractXML(), prettyXML(), printprettyXML(): Get or print the 1010data XML macro-code of the query this TenFrame represents. Discussed above.

  • to_df(self, maxlen=10000, start=0): Returns a pandas DataFrame containing the data from this TenFrame, starting at row start with a maximum of maxlen rows total. This entails running the query and also fetching all the data, so be aware of how much you’re trying to get. To get the whole of the query results and put all the rows into a DataFrame, just use a value for maxlen that’s higher than the number of rows you expect the table to be. If you don’t have an upper bound on the number of rows you expect, you shouldn’t be using this method! It might wind up trying to download more data than your computer can handle. See above about why you’re using 1010data in the first place.

    frame.to_df(100000)     # The top 100000 rows, if there are that many
    frame.head(10).to_df(10)   # The top 10 rows, converted into a DataFrame.
    frame.tail(10).to_df(10)   # The last 10 rows, converted into a DataFrame
    frame.to_df(10000, start=20000)  # Rows 20000-29999, converted to a DataFrame.
    
  • slice(self, start=None, stop=None, step=None): Returns a TenFrame for the region of this TenFrame between the given start and stop. The slice is done on the server side, i.e., the new frame’s query specifically filters the results to the specified rows. Currently equivalent to frame[start:stop:step]. Both forms respect negative numbers, treating them the same way python does. ⚠️ NOTE: the step value is currently ignored!

  • online(self): Returns True when this TenFrame is in “online” mode, i.e., it was created with a real py1010 Session object. Note that this doesn’t mean the session it is using is still “live.” Sessions can time out or be killed by other sessions, etc.

    In [1]: import py1010
    In [2]: import tenFrame as tf
    In [3]: gateway = "https://...."
    In [4]: user = "jjjschmidt"  # (his name is my name too)
    In [5]: password = "XXXXXXX"
    In [6]: session = py1010.Session(gateway, user, password, py1010.POSSESS)
    In [7]: frame1 = tf.TenFrame(session, "retail.sales_detail")
    In [8]: frame1.online()    # frame1 has a session associated with it
    Out[8]: True
    In [9]: frame2 = tf.TenFrame(None, "retail.sales")    # An offline frame
    In [10]: frame2.online()
    Out[10]: False
    
  • copy(self): Pandas DataFrames have this too, but distinctions are important to remember. The DataFrame copy() makes a deep copy (by default) of all the indices and all the data, so for a large DataFrame this can take up a lot of space and even cost noticeable time. But copying TenFrames is cheap. Copying a TenFrame just makes a shallow copy of the operations which define its query and a few other pieces of data. Since a TenFrame doesn’t actually contain any data inside itself, there isn’t a lot of data to copy around, so a lot of operations on TenFrames and TenSeries actually make copies of the TenFrame.

  • segby(self): Returns a list of column-names on which the base table of this TenFrame is segmented. TenFrames without a base table (i.e., which were initialized with the data= parameter, instead of the table= parameter) have no segmentation.

Aggregating data#

Basic aggregation on columns can be done almost like in pandas:

In [1]: sales tf.TenFrame(session, 'retaildemo.retail.sales_detail')
In [2]: len(sales)
Out[2]: 5997373902

# Using IPython's "%time" magic to see how long it takes...
In [3]: %time sales['xsales'].sum()[0][0]   # It's a one-row, one-column TenSeries
CPU times: user 31 ms, sys: 67.4 ms, total: 98.3 ms
Wall time: 35.1 s
Out[3]: 23361892611.108414          # 35sec!  Not bad for 6 billion rows.

In [4]: sales[['xsales','qty']].sum()
Out[4]:
     xsales_sum       qty_sum
0  2.336189e+10  8.136801e+09

In [5]:

Aggregating functions on TenSeries (like sales['xsales'].sum()) return TenFrames in tenFrame, unlike in Pandas where they return Series.

To get basic statistics, the .describe() method can be used, like in pandas (we’ll use a subset of the data so the examples don’t take too long to run):

In [1]: rsales = sales[:100000]      # Restricted subset of data
In [1]: rsales.describe()   # Returns a DataFrame, not a TenFrame!
Out[1]:
              transid          date            tme            sku  \
count   100000.000000  1.000000e+05  100000.000000  100000.000000
mean    107880.093040  2.012123e+07  149215.946800  195314.896760
std_s     4466.318854  2.756710e+01   31280.583466   74463.274328
min     100001.000000  2.012123e+07    1904.000000  100003.000000
lquart  103975.000000  2.012123e+07  124725.000000  132341.000000
median  107959.000000  2.012123e+07  151839.000000  152789.000000
uquart  111766.000000  2.012123e+07  172707.000000  273072.500000
max     115478.000000  2.012123e+07  231333.000000  306491.000000

                store            qty         weight         xsales  \
count   100000.000000  100000.000000  100000.000000  100000.000000
mean      1091.846590       1.272320       1.110387       4.142385
std_s      533.951117       0.959464       0.759946       4.762913
min        244.000000      -6.000000      -9.390000     -35.990000
lquart     684.000000       1.000000       1.000000       1.860000
median     937.000000       1.000000       1.000000       2.970000
uquart    1492.000000       1.000000       1.000000       4.670000
max       1998.000000      70.000000      59.960000     261.780000

                price           cost customer_key  promo_flag
count   100000.000000  100000.000000            0    100000.0
mean         3.417335       2.837458          NaN         0.0
std_s        3.645787       3.758547          NaN         0.0
min          0.090000     -22.780000         <NA>         0.0
lquart       1.570000       1.080000         <NA>         0.0
median       2.570000       1.870000         <NA>         0.0
uquart       3.870000       3.130000         <NA>         0.0
max        148.000000     147.640000         <NA>         0.0
In [2]: rsales[['xsales','qty']].describe()
Out[2]:
               xsales            qty
count   100000.000000  100000.000000
mean         4.142385       1.272320
std_s        4.762913       0.959464
min        -35.990000      -6.000000
lquart       1.860000       1.000000
median       2.970000       1.000000
uquart       4.670000       1.000000
max        261.780000      70.000000

You can also use include="all", like in pandas (but you need some string columns in the data in order for it to make sense)

In [1]: stations[['id','name','state','lat','lon']].describe(include="all")
Out[1]:
              name state            id         lat         lon
count          262   262    262.000000  262.000000  262.000000
ucnt           253    52           NaN         NaN         NaN
mode    CHARLESTON    AK           NaN         NaN         NaN
modef            2    20           NaN         NaN         NaN
mean           NaN   NaN  28496.011450   40.120634   99.640729
std_s          NaN   NaN  27768.083922    8.604403   22.598400
min            NaN   NaN   3103.000000   13.550000   66.000000
lquart         NaN   NaN  13957.000000   34.900000   82.533000
median         NaN   NaN  14942.500000   40.100000   93.733500
uquart         NaN   NaN  24230.000000   43.567000  112.433000
max            NaN   NaN  94910.000000   71.300000  170.217000
In [2]:

For a more complex set of aggregations, you can use .agg() and some other aggregation functions just like in pandas:

In [1]: sales.agg({"transid":"nunique", "xsales":"sum"})
Out[1]:
   transid_nunique    xsales_sum
0      812962531.0  2.336189e+10
In [2]: rsales.transid.value_counts()
Out[2]:
   transid    c0
0   111357  97.0
1   114142  76.0
2   115120  74.0
3   110972  73.0
4   111161  70.0
5   108341  69.0
6   107850  66.0
7   109714  64.0
8   111952  64.0
9   112615  64.0

You can even do things pandas can’t do, since .agg() accepts the “adjoin=” parameter, like aggregating functions do (see below).

Aggregations may cause the resulting data to be an in-memory table on the accum.

Groupby#

As in pandas, you use the .groupby() method to indicate the groups by which to aggregate. Groupby works with the various aggregation functions just mentioned.

In [1]: sales = tf.TenFrame(session, "retaildemo.retail.sales_detail")
In [2]: prods = tf.TenFrame(session, "retaildemo.retail.products")
In [3]: sales2019 = sales[sales.date.year() == 2019]
In [4]: both2019 = sales2019.merge(prods, on="sku")
In [5]: sales2019.groupby('store')[['xsales','qty']].sum()
Out[5]:
   store   xsales_sum    qty_sum
0    728   6306913.33  2032782.0
1   1596  12908921.46  4466709.0
2   1440   9515432.37  3222894.0
3    937  15657491.69  5282039.0
4    244  16939703.38  5413033.0
5   1492  16179375.92  5041127.0
6   1998   8511838.72  2948421.0
7   1588   5394074.24  1965080.0
8    115  17105819.02  5199226.0
9    126  21463133.90  6854166.0

In [6]: rsales.groupby('store')[['xsales','qty']].describe()
Out[6]:
   store  xsales_count  xsales_mean  xsales_std_s  ...  qty_lquart  qty_median  qty_uquart  qty_max
0    728        6853.0     3.974398      4.638158  ...         1.0         1.0         1.0     18.0
1   1596       13570.0     4.022160      4.124360  ...         1.0         1.0         1.0     30.0
2   1843       10043.0     4.058331      4.229268  ...         1.0         1.0         1.0     20.0
3   1440       11378.0     3.956890      4.124843  ...         1.0         1.0         1.0     28.0
4    937       15444.0     4.271222      5.086939  ...         1.0         1.0         1.0     70.0
5    684       10729.0     3.978461      3.993783  ...         1.0         1.0         1.0     28.0
6    244       17551.0     4.379790      5.354245  ...         1.0         1.0         1.0     36.0
7   1492       14393.0     4.239222      5.532240  ...         1.0         1.0         1.0     40.0
8   1998          39.0     2.755897      2.156277  ...         1.0         1.0         1.0      3.0

[9 rows x 17 columns]

# etc...

Used this way, groupby works pretty much the same as it does in pandas: you can group by multiple columns by using a list of names, etc.

Note

As of version 6.0.0, tenFrame has actual “GroupBy” objects created by the .groupby() method, so therefore should behave about the same as pandas (as opposed to earlier behavior, in which the groups were stored as an instance variable on the TenFrame and .groupby() created a copy of the frame with the variable set.) Ideally, the only thing the user should notice is that .groupby() creates objects belonging to classes with odd names, like Mframe_8768892854914x1dz2j5w. These objects will still test as being instances of TenFrame, and will behave as TenFrames in all respects. You can tell if an object is one of these wrappers by testing isinstance(groupobj, tf.GroupByMarker).

Pivot tables#

pvt = frame.pivot_table(values, index, column, aggfunc="mean", rename=True)

The TenFrame method .pivot_table() works approximately like the corresponding Pandas method, but there are currently some differences and limitations.

Pandas column names are unrestricted: pretty much any string of characters is a valid name for a column in Pandas. This is not so in 1010data. 1010data column names must begin with a letter and consist of only letters, numbers, and underscores, and what’s more, the letters must be lowercase. It’s one thing to keep to these restrictions when you are setting the column names yourself (assigning TenSeries into a TenFrame, etc), but it’s another matter when dealing with a pivot table, whose columns are supposed to reflect the values of a column (the index, in Pandas).

There’s no simple way to make the 1010data restrictions truly handle all the possibilities the way Pandas does; TenFrame does its level best. All the values which are to become column names are lowercased, and all strings of non-letter (or underscore) characters are replaced with a single underscore each. Numerical values are prefixed with an alphabetic string (separated by an underscore) to make them acceptable. You can also use rename=False in the method invocation (keyword-only) to disable the renaming, and leave the columns as they come from 1010data: m0, m1, m2, …

Note

Fair Warning

It is possible that this processing may cause some values to collapse to the same result, which could cause odd results (1010data does not permit two columns to have the same name, even if one is hidden). The process of finding the values to use is also subject to error with certain characters in the values, etc. Usually, things will work just fine, but in oddball cases you might get some puzzling errors and have to look deeper. Trying to use multiple cbreaks (i.e., a multi-level index, in Pandas terminology) will definitely cause things to break, and at best you’ll wind up with some unhelpfully-named columns.

Pivot tables use the .agg() method, with the addition of a parameter for the column whose values are to be the new columns, known in 1010data as the “cbreaks.” cbreaks= is an available keyword-only parameter of the .groupby() method.

The TenSeries#

The TenSeries object is the tenFrame analogue of a pandas Series object. Unlike a Series object, it doesn’t really contain any data on its own (neither do TenFrames, for that matter: they fetch data at need), and are not much more than a column name and a reference to a TenFrame. So there is no such thing as a TenSeries which is not associated with some TenFrame, unlike pandas Series which can have an independent existence as data holders.

You generally get TenSeries by subsetting a TenFrame (see above); you shouldn’t have a need to call the TenSeries constructor directly. Many methods and operations on TenSeries also return TenSeries.

Subsetting#

A TenSeries may be subsetted by an integer or a slice, which returns the value at that row-number or an iterator over the values specified by that slice, respectively.

If the subset is not an integer or a slice, but is, say, a string, TenSeries attempts to provide functionality similar to that provided by Pandas. Working on a copy of the TenFrame to which this TenSeries belongs, an operator is added which selects on the first element of list of columns on which the TenFrame is grouped by being equal to the specified value (and then the corresponding TenSeries from that frame is returned). To put that in terms that can be understood without reading it through five times, saying frame.groupby("month").cost["January"] means the same as frame[frame.month == "January"])['cost'].

Properties#

Some properties on TenSeries include:

  • column: The underlying py1010 Column object which this TenSeries represents.

  • frame: The TenFrame which this TenSeries is part of.

  • size: The length of the TenSeries, i.e., the number of rows. Same as the len() of the TenFrame it belongs to.

  • ndim: Always returns 1.

  • str.: TenSeries have a bunch of string-related functions on a special .str. member, for compatibility with Pandas.

    • str.contains(self, substr, case=True, regex=True): Like pandas, frame.col.str.contains(REGEXP) returns a boolean TenSeries that is True just where the value of the frame.col column contains the regular expression REGEXP. The case parameter controls case-sensitivity (default case-sensitive, True), and if the regex parameter is False, then the comparison is made as a simple string and not as a regular expression, with case-sensitivity controlled by the case parameter.

    • str.startswith(self, pfx): A boolean TenSeries that is True just where the value of this column starts with the string given as pfx (1010data beginswith()).

    • str.endswith(self, pfx): A boolean TenSeries that is True just where the value of this column ends with the string given as pfx (1010data endswith())

    • str.lower(self): This TenSeries, lowercased (1010data strdowncase()).

    • str.upper(self): This TenSeries, uppercased (1010data strupcase()).

    • str.count(self, reg): Number of times the given regular expression reg appears in the each value in this TenSeries (1010data regex_count()).

    • str.len(self): The string length of each string in this TenSeries (1010data strlen()).

    • str.find(self, sub): Find the first occurrence of the substring within each value. This uses the 1010data strfind() function, which is a little different than Pandas’ .str.find(). It does not have start and end parameters, and it returns NA on failure (where Pandas’ find() returns -1).

    • str.get(self, ind): Select a substring out of a delimited string by index (1010data strpick(), index adjusted).

    • str.lstrip(self, to_strip=None): (also str.rstrip() and str.strip(), as in Pandas.) Strip off whitespace (only space characters!) from the left side of the string (or right side, for str.rstrip(), or both sides, for str.strip()). to_strip should be a string of characters to be stripped out, or None for just spaces. (1010data strtrim())

    • str.match(self, pat, case=True, flags=0): Boolean TenSeries showing which values start with a match for the given regular expression. If case=True, the check is case-sensitive. flags are an integer made up of flags from the “re” library. (1010data regex_beg())

    • str.pad(self, width, side=’left’): Pad the string to the given length. Unlike Pandas, this function can only pad with space characters. (1010data padleft() or padright())

    • str.capitalize(self): Capitalize string: all letters lowercased except first letter and any letter following a non-letter, which are uppercased. (1010data propercase())

    • str.removeprefix(self, prefix): Remove the given prefix from strings, if it is found at the beginning.

    • str.removesuffix(self, suffix): Remove the given suffix from strings, if it is found at the end.

    • str.repeat(self, repeats): Repeat a string a given number of times. Unlike pandas, the repeats parameter cannot be a sequence of integers, but it can be an integer-valued TenSeries. (1010data strrepeat())

  • dt.: TenSeries have properties (and a few methods) on a special .dt. member for use on datetime-valued TenSeries, for compatibility with Pandas.

    • dt.date: Returns a “date”-type column containing the date portion of the datetime. Uses the 1010data date() function.

    • dt.time: Returns a “time”-type column containing the time portion of the datetime. Uses the 1010data time() function.

    • dt.year: Returns an integer column containing the year portion of the datetime. Uses the 1010data year() function.

    • dt.month: Returns an integer column containing the month portion of the datetime. Uses the 1010data month() function.

    • dt.day: Returns an integer column containing the day (of month) portion of the datetime. Uses the 1010data day() function.

    • dt.hour: Returns an integer column containing the hour portion of the datetime. Uses the 1010data hour() function.

    • dt.minute: Returns an integer column containing the minute portion of the datetime. Uses the 1010data minute() function.

    • dt.second: Returns an integer column containing the seconds portion of the datetime. Uses the 1010data second() function.

    • dt.dayofweek: Returns an integer column containing the day of the week as an integer, according to the Pandas/ISO convention: Monday=0, Tuesday=1, … Sunday=6. Not the same as the 1010data dayofwk() function, from which it is converted! Also available as dt.day_of_week.

    • dt.dayofyear: Returns an integer column containing the day of the year as an integer. Also available as dt.day_of_year.

    • dt.quarter: Returns an integer column containing the quarter (of the year) which contains this datetime. Uses the 1010data quarter() function.

    • dt.is_month_start: A boolean TenSeries, equal to 1 just when the datetime occurs on the first of a month.

    • dt.is_month_end: A boolean TenSeries, equal to 1 just when the datetime occurs on the last day of a month.

    • dt.daysinmonth: An integer TenSeries, showing how many days are in the month in which the datetime occurs.

    • dt.is_quarter_start: A boolean TenSeries, equal to 1 just when the datetime occurs on the first day of a quarter.

    • dt.is_quarter_end: A boolean TenSeries, equal to 1 just when the datetime occurs on the last day of a quarter.

    • dt.is_year_start: A boolean TenSeries, equal to 1 just when the datetime occurs on the first day of a year.

    • dt.is_year_end: A boolean TenSeries, equal to 1 just when the datetime occurs on the last day of a year.

    • dt.is_leap_year: A boolean TenSeries, equal to 1 just when the datetime occurs in a leap year (in the Gregorian calendar. Not valid for pre-Gregorian dates.)

    • dt.month_name(self, *args): The name of the month in which the datetime occurs. See the documentation for the 1010data function tomonthname() for possible valid args.

    • dt.day_name(self, *args): The name of the weekday on which the datetime occurs. See the documentation for the 1010data function todayname() for possible valid args.

    • dt.weekofyear: The ISO week number of the year (deprecated in Pandas, but easier to work with in tenFrame than the dt.isocalendar() method (see below) which is supposed to replace it.) Uses the 1010data isoweek() function.

    • dt.isocalendar(self): Returns a TenFrame extending the parent TenFrame by adding columns named year_*, week_*, and day_* columns for the year, week number, and day number in the ISO calendar. The column names are actually the strings "year_", "week_", and "day_" suffixed with a random string to ensure uniqueness.

    • dt.normalize(self): Normalize datetimes to midnight.

  • Everything Else: Accessing any member of a TenSeries that isn’t otherwise defined results in a method which you call to apply some function to the TenSeries, see below.

Methods#

Many of the “methods” on TenSeries are really calls to 1010data functions, but there are also some that involve more work on the client side, or otherwise are not simply an interface to server-side functions. Some are mainly proxies to similar methods on the underlying TenFrame, like these:

  • copy(self): Makes a copy of the underlying TenFrame, and returns the “same” TenSeries as this one from that copy (i.e., the TenSeries with the same name.)

  • online(self): Calls the online() method of the underlying TenFrame.

  • groupby(self, *args): Like running groupby() on the underlying TenFrame.

  • head(self, n=20): Returns the first n values of this TenSeries as a one-column TenFrame (same as self.frame[[self.colname()]].head(n))

  • sort_values(self, ascending=True, inplace=False, **kwargs): Performs sort_values() on the underlying TenFrame and returns the TenSeries from it corresponding to this one (i.e., with the same name).

  • isnull(self), isna(self): Returns a boolean TenSeries showing where this TenSeries is NA.

  • notnull(self), notna(self): Returns a boolean TenSeries showing where this TenSeries is not NA.

  • fillna(self, *args, **kwargs): Same as the fillna() method on the underlying TenFrame acting only on this column.

  • replace(self, *args, **kwargs), where(self, *args, **kwargs): Replace values where the condition is/is not True; runs the same-named method on the underlying TenFrame, restricting to this column.

Pandas-like methods that are specific to TenSeries and aren’t just calls to 1010data functions include:

  • isin(self, values): Returns a boolean TenSeries on whether or not the value of this TenSeries is to be found in the values (a set, list, tuple, etc.)

    In [1]: sol[sol.type.isin(['star', 'moon'])]
    Out[1]:
           name  type   orbits       rkm           vol          mass       dist
    0       Sun  star        -  696324.0  1.414300e+09  1.988550e+09        0.0
    1  Ganymede  moon  Jupiter    2634.0  7.600000e+01  1.480000e+02  1070000.0
    2     Titan  moon   Saturn    2576.0  7.100000e+01  1.340000e+02  1221850.0
    3  Callisto  moon  Jupiter    2410.0  5.800000e+01  1.080000e+02  1880000.0
    4        Io  moon  Jupiter    1821.0  2.500000e+01  8.900000e+01   422000.0
    5      Moon  moon    Earth    1737.0  2.100000e+01  7.400000e+01   385000.0
    6    Europa  moon  Jupiter    1560.0  1.600000e+01  4.800000e+01   670900.0
    7    Triton  moon  Neptune    1353.0  1.000000e+01  2.100000e+01   354759.0
    8   Titania  moon   Uranus     788.0  2.000000e+00  3.000000e+00   436000.0
    
    In [2]:
    
  • unique(self, sort=True): As in Pandas, returns the unique values in this TenSeries. Unlike Pandas, which returns a list, this returns a TenFrame, since after all the number of uniques might still be too big to fit into one computer. So a TenFrame is constructed, a copy of the one underlying this TenSeries, with operations added to compute the unique values. The .unique() method takes “groupby” into account, so you get the unique values for each group. By default, the TenFrame is sorted so that you can see all the unique values for a given block together, but since there are limits on the size of a table when sorting it, you can pass sort=False to disable the sorting and avoid the server-side error.

    In [1]: states = stations.state.unique()
    In [2]: states
    Out[2]:
      state
    0    AK
    1    AL
    2    AR
    3    AZ
    4    CA
    5    CO
    6    CT
    7    DE
    8    FL
    9    GA
    
    In [3]: len(states)
    Out[3]: 52      # Because of PR (Puerto Rico) and PI (Pacific Islands, i.e., Guam in this case.)
    
    In [4]: sales.transid.unique()  # too many to sort!
    TentenException...
    TentenException: ('function run failed, returned 1 (Sort is limited to 100000000 rows.  Limit table size via selection or tabulation if practical.)', 1)
    
    In [5]: sales.transid.unique(sort=False)
    Out[5]:
       transid
    0   100001
    1   100002
    2   100003
    3   100004
    4   100005
    5   100006
    6   100007
    7   100008
    8   100009
    9   100010
    
    # They happened to be sorted anyway.
    In [6]: len(sales.transid.unique(sort=False))
    Out[6]: 812962531      # Good thing we didn't try to download it all!
    
    In [7]: sales.groupby(["store", "date"]).customer_key.unique().dropna()
    Out[7]:
       store      date  customer_key
    0      2  20150221           117
    1      2  20150222           577
    2      2  20150223           595
    3      2  20150223          8489
    4      2  20150224           577
    5      2  20150224           595
    6      2  20150224           902
    7      2  20150224           903
    8      2  20150225           902
    9      2  20150225          1165
    
    In [8]:
    
  • nlargest(self, n), nsmallest(self, n): As in pandas, returns a TenSeries with the n largest (resp. smallest) values.

Functions#

Just like accessing a member of a TenFrame that isn’t otherwise defined always succeeds and gives you a TenSeries, accessing a member of a TenSeries that isn’t otherwise defined also will always succeed and will give you a method which, when called, results in a TenSeries which applies the named function to the original TenSeries. Fortunately, this is easier done than said:

In [1]: points['x'] = points['theta'].sin()

will set a new x column in the TenFrame points to be equal to the 1010data function sin() applied to the column points['theta']. So internally, points['theta'].sin() is translated into a column in the TenFrame whose value is defined as sin(theta). By making this happen for any function name you use on a TenSeries, tenFrame makes it possible to access all the 1010data functions without having to handle each one individually, or needing to update tenFrame when new ones are added. You can see a list of all the functions that are available in the documentation.

As shown here, points.theta.sin() becomes sin(theta) internally, i.e., the name of the column becomes the parameter of the function in the underlying query. If the function has more than one parameter, you can supply the others in the parentheses. So if you’re using the 1010data builtin amort() function, which computes mortgage amortization and takes three parameters (rate, remaining term, and number of months of the amortization), you can say

In [1]: morts['amort'] = morts.rate.amort(morts.remterm, morts.totmonths)

which will become a column in your table whose value is computed (on the server) by amort(rate;remterm;totmonths) (the underlying 1010data code uses semicolons to separate parameters, not commas), so the rate column on which you called the method becomes the first parameter. If you need to use a method like this on this column but want to have a different column or a constant first, that has to be done in other ways, such as using the tf.func object described below.

Note: The functions listed on this page in the 1010data documentation, i.e., all functions whose name starts with g_ (and a few special cases besides) are group-based functions and work differently; see below. So do the tabulation functions listed here, and their pandas equivalents (e.g. mean, where 1010data uses avg, etc.)

Aggregating Functions#

Some functions (like .sum() or .mean()) have the effect of calculating an aggregated statistic and not a simple function on each element of a column. These will return a TenFrame, unlike functions like .sin() which will return a TenSeries. This is a little different from Pandas, once again going back to the lack of an index in tenFrame. Pandas will return a Series in such cases, and the index of the series will show the values of each group corresponding to each sum or average, etc. If tenFrame returned just a TenSeries, you wouldn’t see the values, which would be quite confusing. You could still get at them by looking at the .frame of the TenSeries, but that might be counterintuitive. You can access the “results” column without much trouble if you remember that it is the last column, no matter how many variables you’re grouping by. You can use frame.lastCol as a quick accessor to this column.

Calling Aggregating Functions#

You call these functions much like you call ordinary functions on a column, just using their names as methods on the column you want to aggregate. You just say things like sales.groupby('date').xsales.sum() or sales.groupby('date').xsales.cumsum().

In [1]: rsales.groupby('store')['xsales'].sum()
Out[1]:
   store  xsales_sum
0    728    27236.55
1   1596    54580.71
2   1843    40757.82
3   1440    45021.49
4    937    65964.75
5    684    42684.91
6    244    76869.70
7   1492    61015.12
8   1998      107.48
In [2]:

It’s important to realize that though you have performed an aggregation (.sum()) on a TenSeries, the result is actually a TenFrame with two columns (where Pandas would have a Series, i.e., a single column, labeled with the index.)

All 1010data tabulation functions (and g_funs) respect a “group” parameter which specifies how the data should be grouped. This comes from the .groupby() part of the expression above, just as in Pandas, or you can provide it explicitly, as sales.xsales.sum(group='date').

Some 1010data g_funs have another special parameter, the “order”. This is important in functions like cumsum, rank (for ranking within a group), etc. You can supply it the same way, with the order= keyword. If omitted, the order is simply the order in which the rows appear in the database. If supplied, it should be a TenSeries (or list of TenSeries) which determine the row order in which values will be considered. Pandas uses the index to provide this, but tenFrame lacks index information.

There’s an important distinction between the “ordered” g_funs and the others. Ordinary g_funs are like tabulations or aggregations, in that they generally return one value per group, i.e., fewer values than the number of rows they operate on. For example, computing the sum of xsales grouped by date results in only one value per date (whether done by g_fun or tabulation function). But sales.groupby('date').xsales.cumsum() or sales.groupby('date').rank() have different values for every row within the group, so they return columns that are the same length as their input. You generally can’t run an “ordering” aggregation like this unless the table is segmented appropriately for the computation you’re doing.

g_funs#

1010data “g_funs” are a special kind of grouping or aggregation functions. Take a look at the documentation about them for some more understanding. g_funs have an important limitation, and it is one that applies to server-side Python coding in general: you can only use g_funs if your table is “segmented” in some way appropriate to how your aggregation works (or if the table is a single segment, less than about 8 million rows long).

Segmentation#

Segmentation is a topic for another document, but the basic notion is that the huge tables in 1010data are stored in smaller sections, and a table can be “segmented” on one or more columns such that a block of rows sharing the same values in those columns is never split up among multiple segments. So if your table is segmented on storeid, then when you’re processing a segment (in a subproc), you know that for each value of storeid in that segment, all the rows for the value are also there. Segmentation is also important for server-side coding because when running a server-side function, the function runs on one segment of the table (possibly on multiple machines processing different segments at once, but each operating on one segment at a time.) You don’t get the whole table, because once again, if you could hold the whole table in memory at once, even on a more powerful desktop machine than you use, you wouldn’t need the services of a Big Data company.

g_funs vs Tabulations#

Generally, g_funs and ordinary tabulations/aggregations have a lot of overlap; they do many of the same things. But if a table is segmented such that a g_fun can be used, a g_fun will be significantly faster than a tabulation. (Note, however, that the overlap of function is not true of certain g_funs, mainly the “ordered” g_funs mentioned above. 1010data tabulations cannot perform the same functions as “ordered” g_funs. Also, there remain some subtle differences between performing a tabulation vs computing a g_fun and selection, regarding how and if the data remains distributed among multiple machines, etc.)

You don’t really have to worry much about g_funs vs tabulations (unless you start writing your own server-side g_funs), because if your table’s segmentation supports the grouping you are doing (and you’re in online mode, so we can tell), tenFrame will always try to use a g_fun if one is available. If the table’s segmentation is not in your favor, you’ll get a tabulation instead. Basically, you write what you want and tenFrame will work out what’s best to use. You can demand a g_fun by explicitly using the g_ prefix in the function name, as sales.groupby("date").xsales.g_sum(). If you use the g_ prefix, tenFrame assumes there is a g_fun by that name and that you know what you’re doing by calling it. If the table isn’t segmented in a way that accommodates your groups, the server side will throw an error. You can use the frame.segby() method to see how a table is segmented.

The “select” parameter is a boolean column specifying which rows are to participate in the aggregation. If the column’s value is False (or 0) on a particular row, the value in that row is not used in the aggregation. sales.xsales.sum(group='date', select='promo_flag') is effectively the same as sales[sales.promo_flag].xsales.sum(group='date'), though it might be done more efficiently on the server side. (For g_funs, this is accomplished by means of special “select” parameter that g_funs have; if you are doing a tabulation, a selection operation is inserted before the tabulation.)

The actual decision process regarding what to do when faced with frame.column.function() is really complicated, perhaps unnecessarily so, precisely because it’s trying to look “simple” and guess at what a user will typically want, and like all attempts to guess what the user wants, the results can be quite mystifying if the guess is wrong.

You probably don’t want to read this.

  1. Translate the function name from Pandas to 1010data, if needed (e.g. pandas mean corresponds to 1010data avg).

  2. Look for g_NAME in the table of 1010data g_funs (i.e., add the g_ prefix and look). If it’s there:

    1. If it’s an “ordered” function (has “order”/”O” as one of its parameters), build a column defined by a call to g_NAME(groups;select;order;X[;Y...]). Tabulations don’t do ordered functions (except for the first and last functions, which are checked for separately, and a tabulation will be done for them if the segmentation is not appropriate for g_funs). If the segmentation is unfriendly, the server side will complain.

    2. Otherwise,

      1. If the segmentation of the table is appropriate for these groups (i.e., at least one group is in the set of things the table is segmented by, or the table is only one segment long anyway, etc.), build a column defined by a call to g_NAME(groups;select;X[;Y...]), followed by a selection on g_first1() to select only the first value, so as to make this behave like a tabulation, which changes the rank of the column.

      2. Otherwise, do a tabulation, using fun='NAME'

  3. Otherwise (if g_NAME is not listed as a known 1010data g_fun),

    1. If the name actually starts with g_ already, then we assume the user knows what they’re doing and build a column with NAME(groups;select;X[;Y...]), or NAME(groups;order;select;...) if the order= parameter was explicitly given, followed by a selection as before.

    2. Otherwise,

      1. If NAME is known to be a function used with the tabulation operator, perform a tabulation with it.

      2. If there is some grouping specified, either by groupby() on the table or column being operated on or by the group= parameter explicitly, we assume the user must want some kind of aggregation happening, so do a tabulation or g_fun column (without adding g_ to the function name), accordingly as the table’s segmentation includes something we’re grouping on.

      3. Finally, if all else fails and there are no groups specified, assume the user is after a non-aggregating ordinary function, any of the functions in the function reference that aren’t listed as g_funs or some user-defined u_fun (as with @frame.def_ufun).

In a call to .agg() where multiple functions are used together, either all of them become tabulations or all of them become g_funs. If some must be tabulations (because there is no corresponding g_fun) and some must be g_funs (because the function starts with g_ or other reasons), an exception is raised.

Adjoining Aggregations - computing new aggregated columns without joins#

Aggregating functions (as well as the .agg() method on TenFrames) can optionally be called with the adjoin = True parameter to obtain some very useful results not easily achievable with pandas. Most aggregations result in a table which is shorter than the original data. For example, frame.groupby("month").sales.sum() can be expected to have only 12 rows, no matter how long the original data is. But sometimes it’s useful to retain the granularity of the original data along with adding the aggregated information, so that each row also has information on the sum of sales that happened in the same month as the transaction listed in that row. In Pandas, you can do something similar with

frame.merge(frame.groupby("month").sales.sum(), on="month")

The resulting column names wind up a bit confusing, but what’s worse is that you have to perform a potentially time-consuming join (the .merge() method). The adjoin parameter in aggregations lets you do this in a single expression, but more importantly, if your table is segmented by the groupby column(s), no join is required, so the calculation is extremely fast.

Adjoining returns a new TenFrame, not a TenSeries to assign into an existing one, so you can just say

frame_sums = frame.groupby("month").sales.sum(adjoin=True, name="monthly_sales")

using the name= parameter to set the name for the new column (if not set, it would default to "sales_sum").

There is even some special-case detection done so that you can do it automatically during an assignment and simply say

frame['monthly_sales'] = frame.groupby("month").sales.sum()

Note

Arguably, this shouldn’t make sense, since the frames aren’t “compatible” at first glance: the right-hand frame is an aggregation and has fewer rows than the left-hand frame. Besides, the right-hand expression actually returns a TenFrame, and you can’t assign a TenFrame as an element (a column) of another TenFrame! However, the meaning of such an assignment is obvious and it’s so useful that tenFrame makes a special exception for this sort of thing and does what you want.

An example using the weather stations data mentioned above:

stations = tf.TenFrame(session, "pub.demo.weather.stations")
stations.groupby("state").elev.mean(adjoin=True)
stations

results in

     id        name state  tz     lat      lon  elev  anht    elev_mean
0  3103   FLAGSTAFF    AZ  -7  35.133  111.667  2135     6  1196.000000
1  3812   ASHEVILLE    NC  -5  35.433   82.533   661     6   218.333333
2  3813       MACON    GA  -5  32.700   83.650   110     7   144.333333
3  3816     PADUCAH    KY  -6  37.067   88.767   125    10   251.800000
4  3820     AUGUSTA    GA  -5  33.367   81.967    45     6   144.333333
5  3822    SAVANNAH    GA  -5  32.133   81.200    16     9   144.333333
6  3856  HUNTSVILLE    AL  -6  34.650   86.767   190     8   127.750000
7  3860  HUNTINGTON    WV  -5  38.367   82.550   255     6   476.000000
8  3870  GREENVILLE    SC  -5  34.900   82.217   296     7   125.666667
9  3872     BECKLEY    WV  -5  37.783   81.117   765    10   476.000000

Note that all the rows are still there, and the mean elevation for each state is on all the rows for stations in that state (look at the three GA rows in the example above, or the two WV rows.) Each record automatically has the correct value for elev_mean , that is no join/merge was required as in pandas.

that and func#

TenFrame provides some special values which can be very useful in making code more concise and simpler to write and to read. There are things that they make possible that would be extremely difficult to express in other ways.

that#

that is a sort of generic TenFrame. In a sense, it can be used to be referring to whatever TenFrame it’s being used with. It’s probably easier to understand from examples:

In [1]: s2019 = tf.TenFrame(session, "retaildemo.retail.sales_detail")[tf.that.date.year() == 2019]
In [2]: bigdays = s2019.groupby('date').agg({'price':'sum', 'cost':'sum'})[tf.that.price_sum > 7.0e6]
In [3]: bigdays
Out[3]:
       date     price_sum      cost_sum
0  20190201  7.288699e+06  6.376741e+06
1  20190216  7.055807e+06  5.846130e+06
2  20190303  8.405907e+06  7.064833e+06
3  20190420  9.655613e+06  8.208903e+06
4  20190505  8.504962e+06  7.208672e+06
5  20190622  7.200566e+06  5.986890e+06
6  20190623  7.708646e+06  6.389132e+06
7  20190810  7.641546e+06  6.518093e+06
8  20190825  8.023629e+06  6.682147e+06
9  20191012  7.211072e+06  6.209403e+06

In [4]:

As you see, we are selecting from a frame on a condition involving its columns, without needing to assign the frame to a variable in order to access its columns. tf.that already has all its columns, so to speak. That’s where the name comes from: you’re working with some TenFrame or another, possibly a complicated expression for one, and you just want to refer to something in “that.”

Or you can use columns from that in expressions with another TenFrame, as a way of referring to that frame’s columns:

In [1]: s2019 = sales[sales.date.year() == 2019]
In [2]: sumcosts = s2019.groupby('date').agg({'price': 'sum', 'cost': 'sum'})['price_sum'] - tf.that['cost_sum']
In [3]: sumcosts
Out[3]:
       xvwm83p5
0  8.707065e+05
1  1.012163e+06
2  9.873511e+05
3  9.119584e+05
4  9.840235e+05
5  1.209677e+06
6  1.341073e+06
7  1.228974e+06
8  1.027738e+06
9  9.813348e+05

Note that sumcosts is a TenSeries (not a TenFrame), which makes sense as it is the difference of two TenSeries: the price_sum of the aggregation done on s2019, and the cost_sum of… that same table, the aggregated table you got from running agg()! Using tf.that saves you from having to type the whole thing in all over again.

But I’m sure you’re wondering: what kind of name is xvwm83p5?? It’s an artifact of how tenFrame has to operate. Usually you don’t see it, and we’ll look at ways to avoid it in a moment, but in this simplest case, without any embellishments to fix it, unfortunately it is exposed for you to see. If all you need is the values from the column (which is likely, if you’re not bothering to assign the aggregate TenFrame to a variable), it won’t matter to you what it’s called, so that’s okay. But this kind of thing will turn up now and then as you experiment with tenFrame, so you should at least know that it happens and that it’s normal.

that can also be helpful in subscripting. If you try to do this

In [1]: base = tf.TenFrame(session, "pub.demo.baseball.master")
In [2]: fifties = base[base.birthyear >= 1950][base.birthyear < 1960]

you’ll get an error that the frames are not “compatible.” Why that is so is another matter, but you can easily make it work by saying

fifties = base[base.birthyear >= 1950][tf.that.birthyear < 1960]

You can even use tf.that for both subscripts. Of course, in this case syntax like

fifties = base[(base.birthyear >= 1950) & (base.birthyear < 1960)]

is also available and probably preferable.

func#

The tf.func object uses some of the machinery from tf.that and makes it a little easier to write function expressions on columns without being subject to the method-calling syntax.

As mentioned above, something like frame.col1.fun(frame.col2, frame.col3, frame.col4) becomes an expression in the 1010data code that looks like fun(col1; col2; col3; col4), i.e., the object on which the function is called becomes the first parameter. But maybe that isn’t what you need, and it isn’t convenient to reorder things so you have a different column at the front. Or maybe the column is a complicated expression, like below. You can build a function expression with tf.func to specify exactly what you mean to call. So using the example above and the 1010data amort function,

In [1]: morts['amort'] = morts.rate.amort(morts.remterm, morts.totmonths)

You can say the same thing without worrying about what column you’re calling a method on like this:

In [1]: morts['amort'] = tf.func.amort(morts.rate, morts.remterm, morts.totmonths)

or even

In [1]: morts['amort'] = tf.func.amort(tf.that.rate, tf.that.remterm, tf.that.totmonths)

using tf.that to provide you with a generic way to refer to a column. If you had used strings, like "rate" and "remterm", they would have been taken to be strings and quoted when translated into XML, and you would have gotten something like amort('rate';'remterm';'totmonths'), which is not what you want.

Module-Level Functions#

There are a few module-level functions in tenFrame, often there to emulate a similar usage in pandas. These include:

  • tf.islive(session): Returns True if the given py1010 Session object is not only logged in, but is currently “live” and responsive to queries. It actually creates and runs a dummy query on the session to confirm that the session really is running queries and not timed out, etc.

  • tf.directory(session, dirname, to_df=True, **kwargs): Returns a Pandas DataFrame containing the names and metadata about all the tables and subdirectories in the directory named in the 1010data object tree. If to_df is False, returns a TenFrame instead.

  • tf.merge(frame, *args, **kwargs): The same as frame.merge(*args, **kwargs). Even though merge() makes sense as an instance method on frames, pandas chooses to define it as a module-level method, so TenFrame supports the same syntax.

  • tf.melt(frame, *args, **kwargs): The same as frame.melt(*args, **kwargs), for the same reason as merge(). In this case, pandas supports the latter syntax as well.

  • tf.wide_to_long(frame, *args, **kwargs): The same as frame.wide_to_long(*args, **kwargs), for the same reason as merge().

  • tf.concat(frames, *args, **kwargs): Takes a sequence of frames and runs the .concat() method on them sequentially to concatenate them all together. Equivalent to reduce(lambda x, y: x.concat(y, *args, **kwargs), frames). Again, provided for compatibility with how Pandas does it.

  • tf.select(condlist, choicelist, default=0): Returns a string which contains a 1010data expression equivalent to the numpy np.select() method. This string is suitable to be set as a new column of a TenFrame, and will not be quoted. So

    frame['success'] = tf.select([frame.totsales > 10000, frame.totsales > 1000, frame.sales > 10],
                                 ["Great", "So-so", "Lousy"], "Rotten")
    

    will create a column success which will contain one of the strings listed above, depending on the value of the sales column.

More Examples: Merging Tables#

The tenFrame .merge() method works pretty much the same as the pandas one, subject to what the 1010data engine can do. It uses the 1010data “link” operator, whose capabilities overlap those of pandas .merge(), but neither is a subset of the other.

The merge() signature looks like this:

merge(left, right, on, right_on=None, how=”left”, cols=None, **kwargs)

TenFrame merge() works about the same as pandas:

sales = tf.TenFrame(session, "retaildemo.retail.sales_detail")
salesprod = tf.merge(sales, "retaildemo.retail.products", on="sku")

yields a frame that starts with:

transid     date    tme     sku     upc     store   qty     weight  xsales  price   ...     group   groupdesc       category        categorydesc    pl      size    is_random_wgt   brand   brand_name      manuf
0   100001  2012-12-31      03:52:07        222218  7084781260      728     1.0     1.0     1.97    1.97    ...     2045    ENERGY DRINKS   2045010 ENERGY-SINGLE SERVE     0       15 OZ   0       MONS    Monster Monster Beverage
1   100002  2012-12-31      04:07:58        260548  81809400001     728     1.0     1.0     1.39    1.39    ...     2045    ENERGY DRINKS   2045010 ENERGY-SINGLE SERVE     0       16 OZ   0       ROCK    Rockstar        Rockstar
2   100003  2012-12-31      05:10:44        224423  7116603129      728     1.0     1.0     2.77    2.77    ...     1520    BREAD/BAKED GDS 1520100 BRD-BRKFST CAKES/SWEET ROLLS    0       5 OZ    0       SVNH    Svnh    <NA>
3   100004  2012-12-31      05:38:59        306141  999999907314    728     1.0     1.0     0.49    0.49    ...     1460    SNACK -CHIPS/PRETZELS   1460030 SNACK-POTATO CHIPS      1       1.5 OZ  0       PRV     Priv Lbl        PL Mfg
4   100004  2012-12-31      05:38:59        101528  1200000452      728     1.0     1.0     2.57    2.57    ...     2040    NEW AGE BEVERAGE        2040060 NAB COFFEE      0       13.7 OZ 0       STRBK   Starbucks       Pepsi/Starbucks
5   100004  2012-12-31      05:38:59        110966  2200000898      728     1.0     1.0     1.27    1.27    ...     1390    CONFECTION      1390050 CONF GUMS-MINTS 0       15 CT   0       WRIGL   Wrigleys        Mars
6   100005  2012-12-31      06:04:51        161510  6150000594      728     2.0     1.0     1.42    0.71    ...     2045    ENERGY DRINKS   2045010 ENERGY-SINGLE SERVE     0       16 OZ   0       RRAIN   Rrain   <NA>
7   100006  2012-12-31      06:31:45        105159  1590000209      728     1.0     1.0     1.29    1.29    ...     2620    MEAT DEPT. DELI 2620115 FRANKS  0       16OZ    0       BAR-S   Bar-S   Alfa
8   100006  2012-12-31      06:31:45        102143  1230000055      728     2.0     1.0     13.32   6.66    ...     2160    CIG SINGLE PACKS        2160010 CIG-SINGLE PACKS        0       EACH    0       MORE    More    <NA>
9   100006  2012-12-31      06:31:45        298945  999999900118    728     1.0     1.0     0.82    0.82    ...     1520    BREAD/BAKED GDS 1520020 BRD-BUNS-FRESH  1       8 CT    0       PRV     Priv Lbl        PL Mfg

You have a left TenFrame and a right TenFrame (which may be just a string naming a table), you name the column to join on with on=. If the column has a different name in the right TenFrame you use on= and right_on= (again, like in Pandas). Note, however, that TenFrame joins default to how="left", and only how="left" and how="inner" are supported. Values of "right", "outer", and "cross" are not supported in TenFrame merging. However, the 1010data-specific values "include", "exclude", and "asof" are supported; see the documentation for the link op.

The cols= parameter lets you specify exactly which columns from the right-hand TenFrame are to be adjoined in the merge. The kwargs can be used to set other attributes on the XML code being generated, for users who are familiar with our macrocode. One thing you can add this way is suffix="_y", to specify a suffix to add to the columns from the right-hand table to avoid conflict. (There is no way to specify a suffix for the left-hand table.)

Concatenating#

There is also a concat() function, like in pandas:

newframe = tf.concat([frame1, frame2], axis=0)

The axis can be 0 (rows, the default) or 1 (columns). (The rest of the arguments for pandas concat() function mainly deal with indexes and other things that are not relevant to tenFrame.) The two types of concatenation are actually handled rather differently in 1010data terms. Note that if you are joining by columns, the right-hand frame cannot be a “worksheet,” meaning a 1010data query whose data has all been brought into the accum (by doing a tabulation, for example.)

Both merge() and concat() are also available as TenFrame methods, which may be more intuitive.

salesprod = sales.merge("retaildemo.retail.products", on="sku")
newframe = frame1.concat(frame2, axis=0)

Offline Mode#

TenFrame actually does two distinct cool things: (1) it constructs a 1010data macrocode query mimicking the effects of various pandas-like manipulations, and (2) it runs that query and makes the results available to the programmer, in a more or less pythonic/pandas-like way. The first of these is actually independent of the second, and it may be useful to develop a query “offline,” not dependent on a connection to 1010data’s servers, and then use the macrocode derived from the TenFrame, perhaps modifying it further, etc. To that end, tenFrame does not require py1010 to run (it doesn’t even need to be installed), and you can create a TenFrame without a py1010 Session. To do so, supply None in the TenFrame constructor in place of the session:

frame = tf.TenFrame(None, "pub.cool.table")

The resulting TenFrame will do its level best to perform all the functions of any other TenFrame without ever trying to contact the 1010data servers. Then you can get the XML of the macrocode query using the .extractXML(), .prettyXML(), or .printprettyXML() methods as mentioned above.

In offline mode, there are some things a TenFrame can’t do that it could if it were online. Obviously, it can’t retrieve data from the results, since there are no results because the query can’t be run. If you try to get data from the results, you’ll get back the value NotImplemented. Similarly, the .size and .shape properties are unavailable (and will return NotImplemented); this also applies to using a negative index when subsetting the rows (because the number of rows is not known), and the len() of a TenFrame in offline mode will always be 0. But there are some other limitations too.

In particular, if you do an aggregation on a TenFrame and don’t supply a dictionary ({'column':'function',...} or {'column':['function',...],...}) but only a function name or a list of function names, those functions are supposed to be applied to all applicable columns. Not all functions can be applied to all function types. Specifically, string columns cannot be tabulated by numeric functions like sum or average, but only by minimum, maximum, count, nunique, and first (a function not found in pandas.) So in order to know how to tabulate an underspecified aggregation like this, tenFrame needs to know all the columns and their types – which it cannot know in offline mode, so this results in an error. You’d do best simply to provide a fully-specified aggregation dictionary to .agg(). Relatedly, when offline tenFrame can’t tell what a table’s segmentation is like, so it can’t determine when it can use g_funs and defaults to tabulations always.

Important Differences from Pandas#

We’ve already touched on some of the differences between tenFrame and pandas. It’s unlikely that this section will ever be an exhaustive list of all the differences, but some notable ones bear mentioning (or repeating.)

Nonexistent Columns#

One striking difference between a TenFrame and a DataFrame is how they handle trying to access a column that isn’t in the table. A DataFrame knows exactly what’s in it, what all the columns (Series) are and even what’s inside them, and if you try to access a column that isn’t there, either with df.wrongcolumn syntax or df['wrongcolumn'], you’ll get an AttributeError or a KeyError respectively. But there is no way to know what columns are in a 1010data query without actually running the query (example: cross-tabulation), and we’d like not to have to do that unless we really need to. (It’s true that 1010data queries are cached on the server side and so repeated runs will be fast, but there’s still the network latency, which adds up when there are many calls, and also for various reasons there may be slight differences in the queries that defeat the cache.) And besides, tenFrame should also work in “offline mode,” where it has no access to running the queries at all. So it turns out to be simplest to make it so that trying to access a column (TenSeries) in a TenFrame always succeeds. So instead of raising exceptions, you’ll find yourself getting mysterious empty TenFrames, as tenFrame runs the query and then removes all the columns except for one – and that one isn’t there. There is some attempt to avoid this issue when you’re working from the command-line, in the REPL (or jupyter, etc.) When a TenSeries is being converted for display in the REPL, there is a check to see if the column really exists (the query has to be run then anyway, after all), and if not, an exception is raised.

Nonexistent Functions#

Likewise, you can call a function on a TenSeries and discover later that it doesn’t exist. In order to make the whole gamut of 1010data functions available automatically, tenFrame takes any attribute on a TenSeries (which isn’t already defined to mean something else) to mean a function that you intend to call on that TenSeries as part of the query. So if you try to call a function that doesn’t exist on a TenSeries, there will be no exception until the query is actually run. This is probably less jarring than non-existent columns, since after all, you do at least get an error, and query run-time is when you should expect to get it. Just that in tenFrame, query run-time isn’t necessarily immediately, like it is with pandas.

Selection by Callable#

Here’s an important thing from Pandas which will not always work in tenFrame, and it’s important to know why.

# Using the "pub.demo.weather.stations" dataset again
>>> df[lambda d: d.lon > 100]
        id        name state  tz     lat      lon  elev  anht
0     3103   FLAGSTAFF    AZ  -7  35.133  111.667  2135     6
134  21504        HILO    HI -10  19.717  155.067    11     6
135  22516     KAHULUI    HI -10  20.900  156.433    15     6
136  22521    HONOLULU    HI -11  21.333  157.917     5     8
137  22536       LIHUE    HI -11  21.983  159.350    45     6
..     ...         ...   ...  ..     ...      ...   ...   ...
243  94008     GLASGOW    MT  -7  48.217  106.617   700     6
244  94014   WILLISTON    ND  -6  48.183  103.633   579     6
245  94185       BURNS    OR  -8  43.583  119.050  1271     5
246  94224     ASTORIA    OR  -8  46.150  123.883     7     6
247  94240  QUILLAYUTE    WA  -8  47.950  124.550    55     7

[100 rows x 8 columns]
>>>

This will work in tenFrame only in environments where server-side python is available, and a little reflection on how tenFrame works can tell you why. Remember that a TenFrame really isn’t a Dataframe; it doesn’t really have its data in memory to apply functions to. It’s just a set of instructions being assembled into a query to ask the 1010data Edge server. In Pandas, you’re defining a function (with lambda) and Pandas applies it to to the DataFrame and does the selection. But in tenFrame, the selection is being done on the server side (the whole point of 1010data being to be able to select and operate on such big tables), and your lambda needs to be defined there, not on your local side. So tenFrame will take the source code of your function and make it part of the query that is sent up to the server side, for it to be executed there – assuming that server-side python is available. This also has some other quirks that may raise their heads from time to time.

Dates, Times, and Datetimes#

1010data tables and queries support values representing dates and times, like Pandas does, but they are stored differently and may behave differently.

Normal data in a 1010data table has only 4 types: integer (32-bit), long integer (64-bit), float, and string. There is no real “date” type or “time” type, etc. But each column has “format” information as part of its metadata, which is used by some other 1010data products to specify just how to format the data for output (how many decimal places, commas, etc.) and the format data may also say that a column’s values are to be interpreted as dates or times or datetimes (a timestamp of a date and a time) or even things like a year and a fiscal quarter, etc. These are stored rather differently from one another, and not necessarily as consistently as you might thing (e.g. dates are stored as integers, giving the year, month, and day simply by their digits, as 20220325, while datetimes are stored as floating-point numbers counting the number of days and fractions of a day since a known fixed date.) The lower-level py1010 library on which tenFrame is based automatically detects these formats and converts date, times, and datetimes into Python date objects, time objects, and datetime objects respectively.

As with Pandas Series, TenSeries objects have a .dt member which gives access to various properties of datetime columns (and a few methods); see above.