TenFrame for the Pandas User#
The tenFrame library is an attempt to present 1010data queries in a form that’s closer to what Pandas-users (pandits) expect, so they can program in a familiar manner, instead of having to translate their ideas into 1010data XML macro code.
This guide is meant as an aid for users coming with some understanding of
pandas (not necessarily pandas experts!), in order to learn a little about
the differences involved in using tenFrame. Be sure to refer to the
tenFrame reference documentation for the most detailed and up-to-date
information. (This documentation is derived from the docstrings of the
library, and so it is also available within python itself, using python’s
help()
feature.) As you learn more about tenFrame, it will be helpful
to understand at least a little of how the 1010data XML macro language
looks and works, so you might want to look at its documentation a
little as you go along. Still, the aim of tenFrame is to shield you as
much as possible from having to know our XML language.
Why Not Just Use Pandas?#
Because tenFrame tries to mimic pandas but is not a perfect copy and is not a drop-in replacement, the user can start to feel that it should be the same as pandas and the little differences seem to be a nuisance. Why aren’t we just using pandas, anyway?
The answer lies in our industry, the “Big Data” industry. 1010data helps its clients analyze and manage Big data. That is, datasets with millions or billions of rows (or more!), far more than any ordinary business computer could hold in its memory at once. If you could just load your data into pandas, you wouldn’t need to use 1010data on it. Conversely, when you crunch a bunch of numbers using 1010data, at the end you’re looking for a much smaller extract of the data that you can make sense out of, and from which you can draw useful conclusions and insights (which, considered as data, are even smaller.) So 1010data’s services help you “whittle down” your mountain of data, and as soon as it’s small enough for you, you can just download the smaller dataset you now have, and perhaps process it using pandas on your local computer.
Another reason for lack of feature parity with Pandas, of course, is that Pandas is an old established library many years in the making (v1.0.0 came out at the end of January 2020, but the library’s history goes back to 2011) with dozens of authors (at least), while tenFrame is much newer and has a much smaller developer community!
TenFrames Aren’t DataFrames – But They Pretend to Be#
The “Dataframe” object in the tenFrame library is the TenFrame. Sure, they try to act like DataFrames, as we’ll see, but it’s important to remember that they aren’t really and truly DataFrames or even tables of any kind. What a TenFrame is is a wrapper around a 1010data query, which is a bunch of instructions on how to look up and/or compute a table full of data (the result set). It’s important to keep track of when things are done, or in tenFrame’s case, when they are not done. For example, if you add a column to a DataFrame in pandas, like df['new'] = df.col1 * df.col2
, say, then at the time that expression is evaluated, pandas builds a new column, allocates memory for it (which might be a lot, if there are many rows), and goes through all the rows in the table, multiplying col1
by col2
and putting the result in the new column. All that has to happen before control even passes to the next statement (or if you’re typing in the REPL, before your prompt can come back.)
But “Big Data” is, well, big. Too big to fit into your computer, and too big for you to wait for all the calculations to complete. Even if your computer could hold trillion-row columns, doing a trillion calculations is going to take too long. So when you do things like that with a tenFrame, with exactly the same syntax, frame['new'] = frame.col1 * frame.col2
, nothing is computed, no columns are multiplied, and there might not even be any access of 1010data’s servers. All that happens is that the query inside the TenFrame has an instruction added to it that says to multiply those two columns and present the results as a new column with a particular name. And even when tenFrame actually runs the query and sends it to 1010data’s servers, even then no computations are done! The server just understands that there is to be a column with that name whose value is computed in a certain way. It’s only when you actually try to access some data that it is actually computed (which may cause other data which it depends on to be computed as well), and even then, the 1010data servers might compute only part of it, if that’s enough to give you the correct answer. If all my program is asking for is frame['new'][200]
, there’s no reason to compute all trillion values of the column.
TenFrames and TenSeries are python objects that translate python’s operations on them, like multiplication and assignment, above, and also pandas-like methods on them, into 1010data query instructions. In so doing, they try to act like tables and columns, but that’s not really what they are.
How 1010data Handles Big Data#
In order to understand what can be done in a 1010data query (and thus what can be done in a TenFrame), helps to understand at least a little of how 1010data deals with these huge amounts of data.
Processing in 1010data, roughly speaking, involves a single “main process,” running on a single machine, which in 1010data jargon is called the “accum” (short for accumulator.) Generally, when you’re talking to “the server” in 1010data, it’s the accum you’re talking to. The raw data in the tables, the columns containing trillions of entries, are stored columnwise on one or more machines (“subprocs”), each dealing with one “segment” of the data at a time. The accum sends instructions to the subprocs, which do the computation on all the segments in parallel and send the results back to the accum as needed.
Sometimes, computations result in a table that’s no longer distributed, but is all stored in memory on the accum. Tabulations (aggregations) are the best example. The accum receives results from all the subprocs and combines them all into a table, but the results are not part of the data, and have a completely different number of rows, so the results of the tabulation are all stored in memory on the accum. This can put limits on the size of the results of these sorts of computations.
The 1010data system also aggressively caches queries and partial queries, so building up a query step-by-step is reasonably efficient even if you run it between steps, and running the same query twice involves comparatively little overhead (tenFrame avoids running the same TenFrame twice anyway, but even if you wound up with the same query in another way it would run quickly).
1010data, py1010, and tenFrame#
The tenFrame library is built using another 1010data library called py1010, which provides access to the underlying 1010data tables. Py1010 performs transactions with the 1010data Edge servers, including logging in, running queries, and fetching data and metadata about the results. TenFrame does not require py1010, but without it you can only use it in “offline mode,” which we’ll get to later.
TenFrame attempts to reduce the need for knowing 1010data’s query-language, but knowledge of it is certainly helpful. Some py1010 objects are also exposed and used in tenFrame, so it’s a good idea to familiarize yourself with py1010 a little, if possible.
So a TenFrame is a presentation layer wrapped around a 1010data query, and it uses a py1010 Query
object to send requests over to the 1010data server to run the query, when needed. That “when needed” is important to remember: you might develop a huge, complicated query and never have the slightest delay as you type it into an interactive session, because you aren’t necessarily doing anything except fiddling around with little operations on your local machine. Recall what was said above in the introduction. TenFrame will only actually send the query to be run when you ask for something it feels requires it. That might be when you ask for data from the results, but it also might just be when something you do needs to know the names of the column in the results. So don’t be surprised if things pause while the query is running and you didn’t think it needed to.
Getting Started#
So, let’s look at how to begin using tenFrame a little.
In [1]: import py1010
In [2]: import tenFrame as tf
In [3]: session = py1010.Session("https://www2.1010data.com/prime-latest/gw", USERNAME, PASSWORD, py1010.POSSESS)
In [4]: stations = tf.TenFrame(session, "pub.demo.weather.stations")
In [5]: stations
Out[5]:
id name state tz lat lon elev anht
0 3103 FLAGSTAFF AZ -7 35.133 111.667 2135 6
1 3812 ASHEVILLE NC -5 35.433 82.533 661 6
2 3813 MACON GA -5 32.700 83.650 110 7
3 3816 PADUCAH KY -6 37.067 88.767 125 10
4 3820 AUGUSTA GA -5 33.367 81.967 45 6
5 3822 SAVANNAH GA -5 32.133 81.200 16 9
6 3856 HUNTSVILLE AL -6 34.650 86.767 190 8
7 3860 HUNTINGTON WV -5 38.367 82.550 255 6
8 3870 GREENVILLE SC -5 34.900 82.217 296 7
9 3872 BECKLEY WV -5 37.783 81.117 765 10
In [6]: len(stations)
Out[6]: 262
In [7]:
As you see, you import py1010 and the TenFrame object (you should import py1010 first), and then start a session using py1010’s Session object constructor. You give it the gateway URL, username, password and login type (KILL, NOKILL, or POSSESS. See the py1010 or 1010data documentation.) Then you can create a TenFrame based on that session and the table “pub.demo.weather.stations”. Note how much the TenFrame looks like a Pandas DataFrame when it’s presented! It’s nice, but gets there by “cheating” a little: it really just makes a little DataFrame our of the first ten rows (see the box below if you’re curious). Pandas shows you first and last few rows of a long frame, and also tells you how many there are, but here you’re only seeing a complete DataFrame made out of just the top 10 rows. Still, column-wise, it’s basically accurate: It’s a table (frame) with eight columns: id, name, state, tz, lat, lon, elev, and anht. This table represents weather stations, giving each one’s ID and name (generally a city), state, time-zone, latitude, longitude, elevation and the height of its anemometer (that’s something you need to know when you’re doing weather.)
Even with just this one table, we can do some simple operations:
In [7]: stations['coord_sum'] = stations.lat + stations.lon
In [8]: stations
Out[8]:
id name state tz lat lon elev anht coord_sum
0 3103 FLAGSTAFF AZ -7 35.133 111.667 2135 6 146.800
1 3812 ASHEVILLE NC -5 35.433 82.533 661 6 117.966
2 3813 MACON GA -5 32.700 83.650 110 7 116.350
3 3816 PADUCAH KY -6 37.067 88.767 125 10 125.834
4 3820 AUGUSTA GA -5 33.367 81.967 45 6 115.334
5 3822 SAVANNAH GA -5 32.133 81.200 16 9 113.333
6 3856 HUNTSVILLE AL -6 34.650 86.767 190 8 121.417
7 3860 HUNTINGTON WV -5 38.367 82.550 255 6 120.917
8 3870 GREENVILLE SC -5 34.900 82.217 296 7 117.117
9 3872 BECKLEY WV -5 37.783 81.117 765 10 118.900
In [9]:
This works just like you would expect it to: adds a new column called coord_sum
to the table, and its value is the sum of the latitude and the longitude for each row. (Why you would ever want to add the latitude and longitude together is another question.)
You can also create an “in-memory” table in 1010data directly from a DataFrame by using the data=
keyword. This is a table that’s actually baked into your query as a table “literal”, so to speak, so you can operate on it within 1010data. This isn’t generally useful by itself (you already have the data locally, you can operate on it with Pandas), but sometimes it might be needed for combining with 1010data tables – or for examples:
In [1]: pdiris = pd.read_csv("iris.csv")
In [2]: iris = tf.TenFrame(session, data=pdiris)
In [3]: iris
Out[3]:
SepalLength SepalWidth PetalLength PetalWidth Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
In [4]:
Here are some other simple things you can do:
In [9]: stations[stations.elev > 1000]
Out[9]:
id name state tz ... lon elev anht coord_sum
0 3103 FLAGSTAFF AZ -7 ... 111.667 2135 6 146.800
1 23009 ROSWELL NM -7 ... 104.533 1112 6 137.833
2 23044 EL PASO TX -7 ... 106.400 1194 10 138.200
3 23047 AMARILLO TX -6 ... 101.700 1098 7 136.933
4 23050 ALBUQUERQUE NM -7 ... 106.617 1619 7 141.667
5 23061 ALAMOSA CO -7 ... 105.867 2297 9 143.317
6 23062 DENVER-STAPLETON INT'L CO -7 ... 104.867 1610 10 144.634
7 23065 GOODLAND KS -7 ... 101.700 1124 10 141.067
8 23066 GRAND JUNCTION CO -7 ... 108.533 1475 7 147.650
9 23153 TONOPAH NV -8 ... 117.133 1653 8 155.200
In [10]: len(stations[stations.elev > 1000])
Out[10]: 31
In [11]: stations2 = stations.groupby("state").agg({'lat':'mean', 'lon':'mean', 'elev':'mean'})
In [12]: stations2
Out[12]:
state lat_mean lon_mean elev_mean
0 AZ 33.833250 111.762500 1196.000000
1 NC 35.355667 79.274833 218.333333
2 GA 33.052833 83.252833 144.333333
3 KY 37.990000 85.416800 251.800000
4 AL 32.800000 87.041750 127.750000
5 WV 38.350000 81.279250 476.000000
6 SC 33.916667 81.122333 125.666667
7 TX 31.079353 98.658882 383.411765
8 KS 38.680200 98.473400 607.400000
9 LA 30.775000 92.108500 27.000000
In [13]:
In Pandas, the same queries would be
>>> df[df.elev > 1000]
...
>>> df2 = df.groupby("state").agg({'lat':'mean', 'lon':'mean', 'elev':'mean'})
that is, exactly the same as with tenFrame, as you can see if you try them out. But you will notice a difference in the output of the second one, though. In Pandas, the results are automatically re-sorted by state, which has become the index of the new frame, whereas tenFrame doesn’t have a concept of a true “index,” so the order is unchanged. You can always apply the .sort_values()
method, like you might do in Pandas.
But more generally, note that the .agg()
function works like it does in pandas. It’s probably best to provide the full aggregation data, which columns aggregated by which functions, but tenFrame will generally interpret the parameters of .agg()
about the same way as Pandas:
In [13]: stations.agg(['sum'])
Out[13]:
id_sum tz_sum lat_sum lon_sum elev_sum anht_sum
0 7465955.0 -1661.0 10511.606 26105.871 98055.0 1944.0
In [14]:
The TenFrame#
TenFrame
is the DataFrame-like class of tenFrame. As we saw earlier, it really only pretends to be DataFrame-like: it’s really a wrapper around a 1010data query.
Constructing#
The __init__
method of the TenFrame class has the following signature:
__init__(self, session, table=None, ops=None, data=None, meta=None, load=None, imports=None, imports_once=None)
Although it appears that the only mandatory parameter is the session
, in fact, you must supply one of table
, ops
, data
, or load
. If the session is None
, you can only work in “offline mode” (q.v.) Normally, you’ll just specify the table
, a string (pathname) of the name of the table in the 1010data object tree. If you want to work from a pandas DataFrame you already have on your local machine, uploading it into 1010data as an in-memory table, you can use data=dataframe
instead. (The ops
parameter is only if you have some specific set of ops with which to initialize the query, and should be considered deprecated.)
If the load
parameter is given, then the path named should be a saved 1010data QuickQuery, and instead of being treated like a table (i.e., being run and its output being used as the base dataset of this query), the query is actually read in as a query, and its operations parsed to become the operations of the new TenFrame. This feature is used to implement the Library class, and probably is not very useful to most users directly on TenFrames.
Warning
This functionality is intended to work mainly with queries that were
saved with tenFrame in the first place. Although most pre-existing
queries can be loaded in, the resulting code query might not conform to
what tenFrame expects, so the resulting TenFrame may behave in
unexpected ways. Also, some existing queries may have quirks in their
XML code that can cause the load
functionality to fail and raise an
exception.
The imports
and imports_once
parameters each can take the pathname of a saved library in the 1010data object tree (or a list of pathnames), which will be inserted into <import>
ops (with once="1"
for import_once
). See the documentation on the import op.
Subsetting#
Subsetting (selecting a subset of a frame or series using square brackets) is used in many different ways by Pandas DataFrames, with different types of subset entailing various different behaviors. TenFrames attempt to emulate those where possible.
Most straightforward is subsetting by a string. In Pandas and tenFrame alike, this will return the column of that name which is in the frame (a Series object in Pandas, a TenSeries object in tenFrame). This kind of subsetting can also be done as an attribute, like in pandas, i.e., frame.price
is the same as frame['price']
, for any name that isn’t already used for some other attribute of TenFrame. (Also like in pandas, you cannot set a new column this way, but must use brackets, i.e., frame['new'] = ...
, not frame.new = ...
.)
In [14]: stations['state']
Out[14]:
state
0 AZ
1 NC
2 GA
3 KY
4 GA
5 GA
6 AL
7 WV
8 SC
9 WV
In [15]: stations.lat
Out[15]:
lat
0 35.133
1 35.433
2 32.700
3 37.067
4 33.367
5 32.133
6 34.650
7 38.367
8 34.900
9 37.783
In[16]:
Warning
One important quirk to keep in mind is that although subsetting with the name of a non-existent column will raise a KeyError in pandas, subsetting a TenFrame by a string (or specifying an attribute) will always succeed (though it may cause an exception afterward, see below). A TenSeries is little more than a column name and a reference to a TenFrame, so it can always be created. We need to do this because there is no way to know what the columns in a 1010data query are without running the query, and that might take a long time, so we want to avoid doing that unless we really need data. This can potentially lead to some puzzling errors or lack of errors, so watch out for it. One way this situation can arise is if you try to access a member of the TenFrame object that you expect to be there because it’s in Pandas, but isn’t in actually in tenFrame. (In order to try to make this happen a little less often and a little less confusingly, trying to look at a TenSeries by just typing its name in the REPL and if it isn’t really there in the TenFrame, an exception will be raised.)
In [16]: stations.nosuchcolumn # The REPL knows
Out[16]:
AttributeError
....
AttributeError: nosuchcolumn not found in this TenFrame.
In [17]: stations['xyz'] = stations.nosuchcolumn - stations['anothercolumn'] # But in general, no error.
In [18]: # No error!!!
In [18]: stations # But if you ever try to run it or get data from it...
Out[18]:
TentenException
...
TentenException: ('function run failed, returned 1 (Computed column xyz cannot be parsed: invalid column or function: nosuchcolumn)', 1)
In [19]:
Also unlike pandas, you can also (always) subset by an integer. Pandas does not have a concept of the order of columns, but TenFrame does, so you can select a column by its number.
As in pandas, you can subset with a list of column-names to restrict the TenFrame to just those columns, or reorder them:
In [19]: stations[["name", "tz", "state]]
Out[19]:
name tz state
0 FLAGSTAFF -7 AZ
1 ASHEVILLE -5 NC
2 MACON -5 GA
3 PADUCAH -6 KY
4 AUGUSTA -5 GA
5 SAVANNAH -5 GA
6 HUNTSVILLE -6 AL
7 HUNTINGTON -5 WV
8 GREENVILLE -5 SC
9 BECKLEY -5 WV
In [20]:
However, in 1010data code, restricting the columns this way doesn’t actually get rid of the old columns. They’re still there and can still be accessed and referenced by subsequent operations:
In [20]: somecols = stations[["name", "tz", "state"]]
In [21]: somecols['coordprod'] = somecols.lat * somecols.lon # would not work in pandas!
In [22]: somecols
Out[22]:
name tz state coordprod
0 FLAGSTAFF -7 AZ 3923.196711
1 ASHEVILLE -5 NC 2924.391789
2 MACON -5 GA 2735.355000
3 PADUCAH -6 KY 3290.326389
4 AUGUSTA -5 GA 2734.992889
5 SAVANNAH -5 GA 2609.199600
6 HUNTSVILLE -6 AL 3006.476550
7 HUNTINGTON -5 WV 3167.195850
8 GREENVILLE -5 SC 2869.373300
9 BECKLEY -5 WV 3064.843611
In [23]:
Subsetting can also select rows instead of columns. Saying df[df['price'] > 10.0]
in pandas is really creating a boolean Series with df['price'] > 10.0]
and then selecting on it by putting it in the brackets. It works exactly the same in tenFrame. frame['price'] > 10.0
evaluates to a boolean TenSeries, analogous to how it works in pandas, so the same syntax (frame[frame['price'] > 10.0]
) with TenFrames will produce the same result: a TenFrame which selects just those rows where the price
column is greater than 10. We saw this in the first examples.
Subsetting on a slice also slices along the rows and not the columns, and produces a “server-side slice,” which is a query with operations that select a specific sequence of rows based on their row number.
Iterating over a TenFrame produces a list of the column names, as it does in pandas. As mentioned above, anything that requires actually knowing what columns really are there will entail running the query.
loc
and iloc
#
Currently, tenFrame does not support the .loc
member of pandas
DataFrames for selecting subsets. .loc
is mostly used for selection
by index, and tenFrame does not have the same concept of indexing that
pandas has. The .iloc
member of the TenFrame object returns the
.rows
member of the underlying query, which is not quite the same, but
seems to work for most cases.
Setting Values#
As we saw above, just like in pandas, you can add a new column by setting a new item, usually to a new (Ten)Series. Like a Series in Pandas, a TenSeries represents a single column that’s part of a TenFrame. (Series have other uses in Pandas as well, but TenSeries are more restricted in meaning.)
In [1]: import tenFrame as tf
In [2]: sales = tf.TenFrame(session, 'retaildemo.retail.sales_detail')
In [3]: sales['total'] = sales.price * sales.qty
In [4]: sales['aconst'] = 7
In [5]: sales
Out[5]:
transid date tme sku upc store qty ... xsales price cost customer_key promo_flag total aconst
0 100001 2012-12-31 03:52:07 222218 7084781260 728 1.0 ... 1.97 1.97 1.53 <NA> 0 1.97 7
1 100002 2012-12-31 04:07:58 260548 81809400001 728 1.0 ... 1.39 1.39 0.99 <NA> 0 1.39 7
2 100003 2012-12-31 05:10:44 224423 7116603129 728 1.0 ... 2.77 2.77 2.12 <NA> 0 2.77 7
3 100004 2012-12-31 05:38:59 306141 999999907314 728 1.0 ... 0.49 0.49 0.32 <NA> 0 0.49 7
4 100004 2012-12-31 05:38:59 101528 1200000452 728 1.0 ... 2.57 2.57 1.87 <NA> 0 2.57 7
5 100004 2012-12-31 05:38:59 110966 2200000898 728 1.0 ... 1.27 1.27 0.67 <NA> 0 1.27 7
6 100005 2012-12-31 06:04:51 161510 6150000594 728 2.0 ... 1.42 0.71 0.89 <NA> 0 1.42 7
7 100006 2012-12-31 06:31:45 105159 1590000209 728 1.0 ... 1.29 1.29 0.95 <NA> 0 1.29 7
8 100006 2012-12-31 06:31:45 102143 1230000055 728 2.0 ... 13.32 6.66 12.33 <NA> 0 13.32 7
9 100006 2012-12-31 06:31:45 298945 999999900118 728 1.0 ... 0.82 0.82 0.80 <NA> 0 0.82 7
>>>
Compatibility#
An error you might see popping up when you do an assignment, as well as in other circumstances, is a CompatibilityException
, saying that the frames are not “compatible.” This comes down to the nature of TenFrames once again, that they are really queries and not tables of numbers like Pandas DataFrames. In the general case, it doesn’t make sense to assign a column (TenSeries) from one TenFrame into another. The could come from different tables, or radically different queries. Maybe the column is defined in terms of other columns in its original TenFrame which don’t exist in the one you’re assigning to. When you realize that each column is defined by some query that takes data from some arbitrary table and manipulates it in arbitrary ways, you can see that lifting a column from one TenFrame and plunking it down, without its context, into another cannot be expected to make sense, in general. When two TenFrames are not compatible and you try to combine them or use them with each other in ways that tenFrame can’t make sense of, you’ll get a CompatibilityException
.
But there are certain special cases in which this can make sense and in which you can assign a TenSeries into another TenFrame, and many of the cases we care about fall into this category. This happens when the two TenFrames in question are mostly the same, and so a column being brought from one to another wouldn’t be losing its context. Here’s an example from just above:
sales['total'] = sales.price * sales.qty
(Keep in mind that we are considering the general case, so we can’t assume that sales
simply references a table; it might have all kinds of other operations on it as well.) Now, the product of two TenSeries (the right-hand side of the assignment) is another TenSeries, and every TenSeries is part of a TenFrame. That TenFrame is not the same one as sales
, because sales
, as yet, doesn’t have a column corresponding to that product. It’s a new, nameless TenFrame, with a brief and fleeting existence. But what’s important about this new TenFrame is that it started as a copy of sales
, so the query it contains is the same as what’s in sales – with the addition of the new column. It has all the same operations as sales
, plus one more, the one at the end, which defines the new product column. So it’s an “extension” of sales, and the new column will have the same “context” when moved over to sales. In fact, since this new TenFrame has everything sales
has possibly more, we can simply replace the operations in sales
with the operations from the new TenFrame and not lose anything, and indeed gain the definition of the new column.
For all that that’s a “special case,” it’s also the situation you’re usually dealing with, since you don’t usually want to combine a table with an unrelated column this way. So the bottom line is that things should usually work out the way you’d want and expect; just know about compatibility for those times when problems do arise.
Properties#
Here are some key properties (not methods) on TenFrames. Many of the pandas DataFrame properties are represented here, with the same names meaning more or less the same things.
shape: A tuple containing the number of rows and columns, as
(rows, columns)
. All 1010data tables and queries, and thus all TenFrames, are two-dimensional.In [1]: frame.shape Out[1]: (5997373902, 13)
table: the name of the table with which this TenFrame was created. If none was given (e.g. if you used the
data=
parameter in the constructor), this property’s value isNone
.In [1]: frame1 = tf.TenFrame(session, "retail.sales_master") In [2]: frame1.table Out[2]: 'retail.sales_master' In [3]: df = pd.read_csv("dat.csv") # Load in a pandas DataFrame In [4]: frame2 = tf.TenFrame(session, data=df) In [5]: frame2.table In [6]:
size: The number of elements in the TenFrame, rows × columns.
In [1]: frame.shape Out[1]: (5997373902, 13) In [2]: frame.size Out[2]: 77965860726
columns: The names of the columns in this TenFrame. This is a list in tenFrame, not an Index object like it is in pandas.
In [1]: frame.columns Out[1]: ['transid', 'date', 'tme', 'sku', 'upc', 'store', 'qty', 'weight', 'xsales', 'price', 'cost', 'customer_key', 'promo_flag']
cols: A tuple containing the py1010 Column objects for each of the columns. This allows access at a slightly lower level. Please see the py1010 documentation for more information if you need to work with these.
In [1]: frame.cols Out[1]: (Column(i: transid), Column(i: date), Column(i: tme), Column(i: sku), Column(a: upc), Column(i: store), Column(f: qty), Column(f: weight), Column(f: xsales), Column(f: price), Column(f: cost), Column(i: customer_key), Column(i: promo_flag)) In [2]:
iloc: The py1010 row iterator for this TenFrame (equivalent to
query.rows
on the underlying py1010 query). This may not be exactly the same as pandas iloc, but it does seem to be pretty close. Basically, a thing you can iterate over, subset to access a specific row, slice, etc. The rows it gives back are tuples containing the data in order of the columns.In [1]: frame.iloc[192] Out[1]: (100049, datetime.date(2012, 12, 31), datetime.time(9, 25, 2), 138544, '4012', 728, 1.0, 1.25, 1.71, 1.37, 0.66, nan, 0)
ndim: Number of dimensions, as in pandas. But note that it always returns 2, since all TenFrames are two-dimensional.
session: The py1010 Session object this TenFrame uses to access the 1010data servers. Supplied in the constructor. Might be
None
if the TenFrame is working offline.Everything else: Accessing any attribute of a TenFrame that isn’t already the name of some existing attribute or property or method returns the TenSeries for that column of the TenFrame, like in pandas. As mentioned above, though, unlike pandas, this access will always succeed, whether or not there really is such a column in the table, since at the time of access, the TenFrame doesn’t necessarily know what columns are in a table. But it will raise an exception if you try looking at it in the REPL, to avoid mystifying messages about empty dataframes.
In [1]: frame.transid Out[1]: transid 0 100001 1 100002 2 100003 3 100004 4 100004 5 100004 6 100005 7 100006 8 100006 9 100006 In [2]: xyz = frame.bloop # nonexistent column! But it succeeds anyway In [3]: frame.bloop # but looking at it this way doesn't. Traceback (most recent call last): .... File ..., line ..., in __repr__ raise AttributeError(f"{self.columnName} not found " AttributeError: bloop not found in this TenFrame. In [4]:
Methods#
Some methods (mostly Pandas-emulating) that you can find on TenFrames.
assign(self, **kwargs): Assign new columns to a TenFrame. Equivalent to
frame[k] = v
for thek, v
pairs in kwargs (except that it returns a new object instead of modifying the one it’s called on, same as in pandas). As in pandas, later assignments can refer to earlier ones in the same statement.This method can be used with callables (functions) as values, in which case the function is made part of the query and executed on each subproc on the server side (if supported), and passed a single argument, which will be a Pandas DataFrame containing one segment of the current table.
In [1]: def strangeproc(df): ...: # df.vol.std() would be the stddev only of this segment, if there were more than one! ...: return df.rkm**2 + df.mass - df.vol.std() In [2]: sol2 = sol.assign(massdist=sol.mass + sol.dist, densdist=sol.massdist/sol.vol, strange=strangeproc) In [3]: sol2 Out[3]: name type orbits rkm vol mass \ 0 Sun star - 696324.0 1.414300e+09 1.988550e+09 1 Jupiter planet Sun 69911.0 1.431280e+06 1.898600e+06 2 Saturn planet Sun 58232.0 8.271300e+05 5.684600e+05 3 Uranus planet Sun 25362.0 6.834000e+04 8.683200e+04 4 Neptune planet Sun 24622.0 6.254000e+04 1.024300e+05 5 Earth planet Sun 6371.0 1.083000e+03 5.973600e+04 6 Venus planet Sun 6051.0 9.280000e+02 4.868000e+03 7 Mars planet Sun 3389.0 1.630000e+02 6.417000e+03 8 Ganymede moon Jupiter 2634.0 7.600000e+01 1.480000e+02 9 Titan moon Saturn 2576.0 7.100000e+01 1.340000e+02 dist massdist densdist strange 0 0.000000e+00 1.988550e+09 1.406031e+00 4.865312e+11 1 7.783000e+08 7.801986e+08 5.451055e+02 4.565014e+09 2 1.427000e+09 1.427568e+09 1.725930e+03 3.067102e+09 3 2.871000e+09 2.871087e+09 4.201181e+04 3.188855e+08 4 4.497100e+09 4.497202e+09 7.190922e+04 2.819129e+08 5 1.496000e+08 1.496597e+08 1.381900e+05 -2.837830e+08 6 1.082000e+08 1.082049e+08 1.166001e+05 -2.878129e+08 7 2.279000e+08 2.279064e+08 1.398199e+06 -3.129407e+08 8 1.070000e+06 1.070148e+06 1.408089e+04 -3.174943e+08 9 1.221850e+06 1.221984e+06 1.721104e+04 -3.177965e+08 In [4]:
apply(self, func, axis=0, args=(), **kwargs): Acts on the server side. Moves all the data in the current query to into the accum, converts it into a Pandas DataFrame, and runs
df.apply(func, axis, args)
on it, and the result (a DataFrame) becomes the new result-set of this query. So you should only use this when you know that the current data-set is small enough to fit into a single computer. The accums that 1010data uses are pretty beefy and can handle fairly large datasets, but that doesn’t compare to the amount of data that the 1010data system is designed to work with, as mentioned above. Usingapply()
when the data is still too big may result in a server-side error.isnull(self), notnull(self): As in Pandas.
In [1]: tab Out[1]: i f a 0 1 3.0 x 1 <NA> 5.0 <NA> 2 3 NaN z In [2]: tab.isnull() Out[2]: i_isna f_isna a_isna 0 0 0 0 1 1 0 1 2 0 1 0 In [3]:
(Note that 1010data does not have a reserved boolean type, so True/False values are simply 1 and 0.)
dropna(self, axis=0, how=”any”, thresh=None, subset=None, inplace=False): As in Pandas. Currently only
axis=0
is supported.In [1]: tab.dropna() Out[1]: i f a 0 1 3.0 x In [2]: tab.dropna(subset=["i", "a"]) Out[2]: i f a 0 1 3.0 x 1 3 NaN z In[3]:
fillna(self, value=None, *, cols=None, inplace=False): Returns a TenFrame (this one or a copy, depending on
inplace
) wherein the NA values in the specified columns (or all of them) are replaced by the givenvalue
. At this point, the value must either be a scalar or a TenSeries.In [1]: tab.fillna(-600, cols=["i", "f"]) Out[1]: i f a 0 1 3.0 x 1 -600 5.0 <NA> 2 3 -600.0 z In [2]: tab.fillna(-600) # But if you try this... ... TentenException: ("Tenten error: 1 (b'amended column has changed type: a')", 1) # 1010data can't handle the type-juggling like pandas!
transpose(self, *, promote=True): Returns a transposed copy of this TenFrame (using the 1010data
<transpose>
operator). 1010data restricts transposition to tables wherein all the columns are of the same type or types that can be “promoted” to be of the same type (whenpromote=True
, which is the default.) As in Pandas, this method can also be accessed as the.T
property on a TenFrame.In [1]: tab.T Out[1]: m0 m1 m2 0 1 <NA> 3 1 3 5 <NA> 2 x <NA> z In [2]:
Note the column names; 1010data column names can’t be just numbers, like they can in pandas, names
m0
,m1
, etc. are made up for them.sort_values(self, by, ascending=True, inplace=False, na_position=”last”, **kwargs): As in Pandas, though restricted (e.g., no “key” parameter).
by
can be a single column name or a list, and if it is a list of more than one name, thedir
is ignored.In [1]: sol = tf.TenFrame(session, "default.test.solar") In [2]: sol.sort_values("dist") Out[2]: name type orbits rkm vol mass dist 0 Sun star - 696324.0 1.414300e+09 1.988550e+09 0.0 1 Triton moon Neptune 1353.0 1.000000e+01 2.100000e+01 354759.0 2 Moon moon Earth 1737.0 2.100000e+01 7.400000e+01 385000.0 3 Io moon Jupiter 1821.0 2.500000e+01 8.900000e+01 422000.0 4 Titania moon Uranus 788.0 2.000000e+00 3.000000e+00 436000.0 5 Europa moon Jupiter 1560.0 1.600000e+01 4.800000e+01 670900.0 6 Ganymede moon Jupiter 2634.0 7.600000e+01 1.480000e+02 1070000.0 7 Titan moon Saturn 2576.0 7.100000e+01 1.340000e+02 1221850.0 8 Callisto moon Jupiter 2410.0 5.800000e+01 1.080000e+02 1880000.0 9 Mercury planet Sun 2439.0 6.000000e+01 3.300000e+02 57900000.0 In [3]:
merge(self, right, on, right_on=None, how=”left”, cols=None, **kwargs): Also can be called as a top-level function, as in pandas:
tf.merge(frame1, frame2, right, on, ...)
. More or less as in Pandas, with some restrictions. Pandas merge is very similar to the operation known as “link” in 1010data, but there are some aspects of each that do not line up. Theright
parameter can be another TenFrame, or it can simply be a string giving the name (pathname) of a table in the 1010data object tree. Of the types of merge allowed in pandas (how=
), only"left"
and"inner"
type joins are currently supported, but there are a few 1010data-specific types that are also permitted:"include"
,"exclude"
, and"asof"
. See the 1010data documentation for the link op for details.concat(self, other, axis=0): Also available pandas-style, as
tf.concat([frame1, frame2, ...])
. As in pandas, though perhaps a bit limited. Concatenate two tables, either appending the rows (axis=0
) or the columns (axis=1
). If appending rows, the two TenFrames should have the same number of columns. Beware of some limitations, though. Appending rows involves consolidating all the data into the accum, whose capacity is limited (i.e., as a single machine, it can’t handle truly “big data”), so it might hit its limits and signal an error. Appending columns does not run this risk, but you can only append columns in a table, not a worksheet (i.e., you have to use either a string naming a table or a TenFrame which has no operations in it apart from just opening a table.)In [1]: base.shape Out[1]: (15350, 9) In [2]: base.concat(base).shape Out[2]: (30700, 9) # Twice the length In [3]: base.concat("pub.demo.baseball.master", axis=1).shape Out[3]: (15350, 18) # Twice the width
agg(self, agg_info, …): As in Pandas, more or less. Some parameters omitted from the signature here, as they are mostly for use by people more familiar with 1010data macro code and would only be confusing if listed. See the reference documentation for details.
In [6]: sol.agg({"mass":"sum"}) Out[6]: mass_sum 0 1.991278e+09 In [7]: sol.groupby("type").agg({"mass":"sum"}) Out[7]: type mass_sum 0 star 1.988550e+09 1 planet 2.727673e+06 2 moon 6.250000e+02 3 dwarf planet 3.000000e+01 In [8]:
head(self, n=20): Equivalent to
self.slice(0, n)
. The firstn
rows of the TenFrame.tail(self, n=20): Equivalent to
self.slice(-n)
. The lastn
rows of the TenFramepipe(self, func, *args, **kwargs): Similar to Pandas
.pipe()
, but note that it is executed on the server side. The source code of the func is incorporated in the query and passed to the server, where the main process (the accum) uses the Pandas.pipe()
method to run it on the table computed so far. So the function you pass it needs to expect a Pandas DataFrame as its input, and return a Pandas DataFrame as its output (but the result you get back on the client side is still a TenFrame). But see also the warning given above about theapply()
method.replace(self, to_replace, value, inplace=False, *, cols=None, negate=False): As in Pandas, replace values with
value
in places where conditionto_replace
is True.where(self, cond, other, inplace=False, *, cols=None): As in Pandas, replace values with
other
in places where the condition is False (tenFrame perpetuates Pandas’ strange naming.) Just callsself.replace(cond, other, inplace=inplace, cols=cols, negate=True).
melt(self, id_vars=None, value_vars=None, var_name=”variable”, val_name=”value”): Works like
Pandas.DataFrame.melt()
, except without thecol_level
andignore_index
parameters, since TenFrames have no true index. Be careful that all the “value_vars” are of the same type, or you will get an error from the server complaining about inappropriate arguments to “case.” The resulting TenFrame tends to be ordered differently from what you would see in Pandas, as like “variable” rows are not grouped together.In [1]: sol.melt(["type"], ["rkm", "mass", "vol"]) Out[1]: type variable value 0 star mass 1.988550e+09 1 star vol 1.414300e+09 2 star rkm 6.963240e+05 3 planet mass 1.898600e+06 4 planet vol 1.431280e+06 5 planet rkm 6.991100e+04 6 planet mass 5.684600e+05 7 planet vol 8.271300e+05 8 planet rkm 5.823200e+04 9 planet mass 8.683200e+04 # Pandas would have had all the "mass" rows and then all the "vol" rows, etc. In [2]:
Some Non-Pandas Methods#
TenFrame is not Pandas, and has things that matter to it which don’t make sense in pandas. Here are a few of the methods of TenFrames which don’t necessarily have counterparts in Pandas DataFrames.
extractXML(), prettyXML(), printprettyXML(): Get or print the 1010data XML macro-code of the query this TenFrame represents. Discussed above.
to_df(self, maxlen=10000, start=0): Returns a pandas DataFrame containing the data from this TenFrame, starting at row
start
with a maximum ofmaxlen
rows total. This entails running the query and also fetching all the data, so be aware of how much you’re trying to get. To get the whole of the query results and put all the rows into a DataFrame, just use a value formaxlen
that’s higher than the number of rows you expect the table to be. If you don’t have an upper bound on the number of rows you expect, you shouldn’t be using this method! It might wind up trying to download more data than your computer can handle. See above about why you’re using 1010data in the first place.frame.to_df(100000) # The top 100000 rows, if there are that many frame.head(10).to_df(10) # The top 10 rows, converted into a DataFrame. frame.tail(10).to_df(10) # The last 10 rows, converted into a DataFrame frame.to_df(10000, start=20000) # Rows 20000-29999, converted to a DataFrame.
slice(self, start=None, stop=None, step=None): Returns a TenFrame for the region of this TenFrame between the given
start
andstop
. The slice is done on the server side, i.e., the new frame’s query specifically filters the results to the specified rows. Currently equivalent toframe[start:stop:step]
. Both forms respect negative numbers, treating them the same way python does. ⚠️ NOTE: thestep
value is currently ignored!online(self): Returns True when this TenFrame is in “online” mode, i.e., it was created with a real py1010 Session object. Note that this doesn’t mean the session it is using is still “live.” Sessions can time out or be killed by other sessions, etc.
In [1]: import py1010 In [2]: import tenFrame as tf In [3]: gateway = "https://...." In [4]: user = "jjjschmidt" # (his name is my name too) In [5]: password = "XXXXXXX" In [6]: session = py1010.Session(gateway, user, password, py1010.POSSESS) In [7]: frame1 = tf.TenFrame(session, "retail.sales_detail") In [8]: frame1.online() # frame1 has a session associated with it Out[8]: True In [9]: frame2 = tf.TenFrame(None, "retail.sales") # An offline frame In [10]: frame2.online() Out[10]: False
copy(self): Pandas DataFrames have this too, but distinctions are important to remember. The DataFrame copy() makes a deep copy (by default) of all the indices and all the data, so for a large DataFrame this can take up a lot of space and even cost noticeable time. But copying TenFrames is cheap. Copying a TenFrame just makes a shallow copy of the operations which define its query and a few other pieces of data. Since a TenFrame doesn’t actually contain any data inside itself, there isn’t a lot of data to copy around, so a lot of operations on TenFrames and TenSeries actually make copies of the TenFrame.
segby(self): Returns a list of column-names on which the base table of this TenFrame is segmented. TenFrames without a base table (i.e., which were initialized with the
data=
parameter, instead of thetable=
parameter) have no segmentation.
Aggregating data#
Basic aggregation on columns can be done almost like in pandas:
In [1]: sales tf.TenFrame(session, 'retaildemo.retail.sales_detail')
In [2]: len(sales)
Out[2]: 5997373902
# Using IPython's "%time" magic to see how long it takes...
In [3]: %time sales['xsales'].sum()[0][0] # It's a one-row, one-column TenSeries
CPU times: user 31 ms, sys: 67.4 ms, total: 98.3 ms
Wall time: 35.1 s
Out[3]: 23361892611.108414 # 35sec! Not bad for 6 billion rows.
In [4]: sales[['xsales','qty']].sum()
Out[4]:
xsales_sum qty_sum
0 2.336189e+10 8.136801e+09
In [5]:
Aggregating functions on TenSeries (like sales['xsales'].sum()
) return
TenFrames in tenFrame, unlike in Pandas where they return Series.
To get basic statistics, the .describe()
method can be used, like in
pandas (we’ll use a subset of the data so the examples don’t take too long
to run):
In [1]: rsales = sales[:100000] # Restricted subset of data
In [1]: rsales.describe() # Returns a DataFrame, not a TenFrame!
Out[1]:
transid date tme sku \
count 100000.000000 1.000000e+05 100000.000000 100000.000000
mean 107880.093040 2.012123e+07 149215.946800 195314.896760
std_s 4466.318854 2.756710e+01 31280.583466 74463.274328
min 100001.000000 2.012123e+07 1904.000000 100003.000000
lquart 103975.000000 2.012123e+07 124725.000000 132341.000000
median 107959.000000 2.012123e+07 151839.000000 152789.000000
uquart 111766.000000 2.012123e+07 172707.000000 273072.500000
max 115478.000000 2.012123e+07 231333.000000 306491.000000
store qty weight xsales \
count 100000.000000 100000.000000 100000.000000 100000.000000
mean 1091.846590 1.272320 1.110387 4.142385
std_s 533.951117 0.959464 0.759946 4.762913
min 244.000000 -6.000000 -9.390000 -35.990000
lquart 684.000000 1.000000 1.000000 1.860000
median 937.000000 1.000000 1.000000 2.970000
uquart 1492.000000 1.000000 1.000000 4.670000
max 1998.000000 70.000000 59.960000 261.780000
price cost customer_key promo_flag
count 100000.000000 100000.000000 0 100000.0
mean 3.417335 2.837458 NaN 0.0
std_s 3.645787 3.758547 NaN 0.0
min 0.090000 -22.780000 <NA> 0.0
lquart 1.570000 1.080000 <NA> 0.0
median 2.570000 1.870000 <NA> 0.0
uquart 3.870000 3.130000 <NA> 0.0
max 148.000000 147.640000 <NA> 0.0
In [2]: rsales[['xsales','qty']].describe()
Out[2]:
xsales qty
count 100000.000000 100000.000000
mean 4.142385 1.272320
std_s 4.762913 0.959464
min -35.990000 -6.000000
lquart 1.860000 1.000000
median 2.970000 1.000000
uquart 4.670000 1.000000
max 261.780000 70.000000
You can also use include="all"
, like in pandas (but you need some
string columns in the data in order for it to make sense)
In [1]: stations[['id','name','state','lat','lon']].describe(include="all")
Out[1]:
name state id lat lon
count 262 262 262.000000 262.000000 262.000000
ucnt 253 52 NaN NaN NaN
mode CHARLESTON AK NaN NaN NaN
modef 2 20 NaN NaN NaN
mean NaN NaN 28496.011450 40.120634 99.640729
std_s NaN NaN 27768.083922 8.604403 22.598400
min NaN NaN 3103.000000 13.550000 66.000000
lquart NaN NaN 13957.000000 34.900000 82.533000
median NaN NaN 14942.500000 40.100000 93.733500
uquart NaN NaN 24230.000000 43.567000 112.433000
max NaN NaN 94910.000000 71.300000 170.217000
In [2]:
For a more complex set of aggregations, you can use .agg()
and some
other aggregation functions just like in pandas:
In [1]: sales.agg({"transid":"nunique", "xsales":"sum"})
Out[1]:
transid_nunique xsales_sum
0 812962531.0 2.336189e+10
In [2]: rsales.transid.value_counts()
Out[2]:
transid c0
0 111357 97.0
1 114142 76.0
2 115120 74.0
3 110972 73.0
4 111161 70.0
5 108341 69.0
6 107850 66.0
7 109714 64.0
8 111952 64.0
9 112615 64.0
You can even do things pandas can’t do, since .agg()
accepts the
“adjoin=” parameter, like aggregating functions do (see below).
Aggregations may cause the resulting data to be an in-memory table on the accum.
Groupby#
As in pandas, you use the .groupby()
method to indicate the groups by which
to aggregate. Groupby works with the various aggregation functions just
mentioned.
In [1]: sales = tf.TenFrame(session, "retaildemo.retail.sales_detail")
In [2]: prods = tf.TenFrame(session, "retaildemo.retail.products")
In [3]: sales2019 = sales[sales.date.year() == 2019]
In [4]: both2019 = sales2019.merge(prods, on="sku")
In [5]: sales2019.groupby('store')[['xsales','qty']].sum()
Out[5]:
store xsales_sum qty_sum
0 728 6306913.33 2032782.0
1 1596 12908921.46 4466709.0
2 1440 9515432.37 3222894.0
3 937 15657491.69 5282039.0
4 244 16939703.38 5413033.0
5 1492 16179375.92 5041127.0
6 1998 8511838.72 2948421.0
7 1588 5394074.24 1965080.0
8 115 17105819.02 5199226.0
9 126 21463133.90 6854166.0
In [6]: rsales.groupby('store')[['xsales','qty']].describe()
Out[6]:
store xsales_count xsales_mean xsales_std_s ... qty_lquart qty_median qty_uquart qty_max
0 728 6853.0 3.974398 4.638158 ... 1.0 1.0 1.0 18.0
1 1596 13570.0 4.022160 4.124360 ... 1.0 1.0 1.0 30.0
2 1843 10043.0 4.058331 4.229268 ... 1.0 1.0 1.0 20.0
3 1440 11378.0 3.956890 4.124843 ... 1.0 1.0 1.0 28.0
4 937 15444.0 4.271222 5.086939 ... 1.0 1.0 1.0 70.0
5 684 10729.0 3.978461 3.993783 ... 1.0 1.0 1.0 28.0
6 244 17551.0 4.379790 5.354245 ... 1.0 1.0 1.0 36.0
7 1492 14393.0 4.239222 5.532240 ... 1.0 1.0 1.0 40.0
8 1998 39.0 2.755897 2.156277 ... 1.0 1.0 1.0 3.0
[9 rows x 17 columns]
# etc...
Used this way, groupby works pretty much the same as it does in pandas: you can group by multiple columns by using a list of names, etc.
Note
As of version 6.0.0, tenFrame has actual “GroupBy” objects created by
the .groupby()
method, so therefore should behave about the same as
pandas (as opposed to earlier behavior, in which the groups were stored
as an instance variable on the TenFrame and .groupby()
created a
copy of the frame with the variable set.) Ideally, the only thing the
user should notice is that .groupby()
creates objects belonging to
classes with odd names, like Mframe_8768892854914x1dz2j5w
. These
objects will still test as being instances of TenFrame, and will behave
as TenFrames in all respects. You can tell if an object is one of these
wrappers by testing isinstance(groupobj, tf.GroupByMarker)
.
Pivot tables#
pvt = frame.pivot_table(values, index, column, aggfunc="mean", rename=True)
The TenFrame method .pivot_table()
works approximately like the corresponding Pandas method, but there are currently some differences and limitations.
Pandas column names are unrestricted: pretty much any string of characters is a valid name for a column in Pandas. This is not so in 1010data. 1010data column names must begin with a letter and consist of only letters, numbers, and underscores, and what’s more, the letters must be lowercase. It’s one thing to keep to these restrictions when you are setting the column names yourself (assigning TenSeries into a TenFrame, etc), but it’s another matter when dealing with a pivot table, whose columns are supposed to reflect the values of a column (the index, in Pandas).
There’s no simple way to make the 1010data restrictions truly handle all the possibilities the way Pandas does; TenFrame does its level best. All the values which are to become column names are lowercased, and all strings of non-letter (or underscore) characters are replaced with a single underscore each. Numerical values are prefixed with an alphabetic string (separated by an underscore) to make them acceptable. You can also use rename=False
in the method invocation (keyword-only) to disable the renaming, and leave the columns as they come from 1010data: m0
, m1
, m2
, …
Note
Fair Warning
It is possible that this processing may cause some values to collapse to the same result, which could cause odd results (1010data does not permit two columns to have the same name, even if one is hidden). The process of finding the values to use is also subject to error with certain characters in the values, etc. Usually, things will work just fine, but in oddball cases you might get some puzzling errors and have to look deeper. Trying to use multiple cbreaks (i.e., a multi-level index, in Pandas terminology) will definitely cause things to break, and at best you’ll wind up with some unhelpfully-named columns.
Pivot tables use the .agg()
method, with the addition of a parameter for
the column whose values are to be the new columns, known in 1010data as
the “cbreaks.” cbreaks=
is an available keyword-only parameter of the
.groupby()
method.
The TenSeries#
The TenSeries object is the tenFrame analogue of a pandas Series object. Unlike a Series object, it doesn’t really contain any data on its own (neither do TenFrames, for that matter: they fetch data at need), and are not much more than a column name and a reference to a TenFrame. So there is no such thing as a TenSeries which is not associated with some TenFrame, unlike pandas Series which can have an independent existence as data holders.
You generally get TenSeries by subsetting a TenFrame (see above); you shouldn’t have a need to call the TenSeries constructor directly. Many methods and operations on TenSeries also return TenSeries.
Subsetting#
A TenSeries may be subsetted by an integer or a slice, which returns the value at that row-number or an iterator over the values specified by that slice, respectively.
If the subset is not an integer or a slice, but is, say, a string, TenSeries attempts to provide functionality similar to that provided by Pandas. Working on a copy of the TenFrame to which this TenSeries belongs, an operator is added which selects on the first element of list of columns on which the TenFrame is grouped by being equal to the specified value (and then the corresponding TenSeries from that frame is returned). To put that in terms that can be understood without reading it through five times, saying frame.groupby("month").cost["January"]
means the same as frame[frame.month == "January"])['cost']
.
Properties#
Some properties on TenSeries include:
column: The underlying py1010 Column object which this TenSeries represents.
frame: The TenFrame which this TenSeries is part of.
size: The length of the TenSeries, i.e., the number of rows. Same as the
len()
of the TenFrame it belongs to.ndim: Always returns 1.
str.: TenSeries have a bunch of string-related functions on a special
.str.
member, for compatibility with Pandas.str.contains(self, substr, case=True, regex=True): Like pandas,
frame.col.str.contains(REGEXP)
returns a boolean TenSeries that is True just where the value of theframe.col
column contains the regular expressionREGEXP
. Thecase
parameter controls case-sensitivity (default case-sensitive, True), and if theregex
parameter is False, then the comparison is made as a simple string and not as a regular expression, with case-sensitivity controlled by thecase
parameter.str.startswith(self, pfx): A boolean TenSeries that is True just where the value of this column starts with the string given as
pfx
(1010data beginswith()).str.endswith(self, pfx): A boolean TenSeries that is True just where the value of this column ends with the string given as
pfx
(1010data endswith())str.lower(self): This TenSeries, lowercased (1010data strdowncase()).
str.upper(self): This TenSeries, uppercased (1010data strupcase()).
str.count(self, reg): Number of times the given regular expression
reg
appears in the each value in this TenSeries (1010data regex_count()).str.len(self): The string length of each string in this TenSeries (1010data strlen()).
str.find(self, sub): Find the first occurrence of the substring within each value. This uses the 1010data strfind() function, which is a little different than Pandas’
.str.find()
. It does not have start and end parameters, and it returns NA on failure (where Pandas’ find() returns -1).str.get(self, ind): Select a substring out of a delimited string by index (1010data strpick(), index adjusted).
str.lstrip(self, to_strip=None): (also str.rstrip() and str.strip(), as in Pandas.) Strip off whitespace (only space characters!) from the left side of the string (or right side, for str.rstrip(), or both sides, for str.strip()).
to_strip
should be a string of characters to be stripped out, or None for just spaces. (1010data strtrim())str.match(self, pat, case=True, flags=0): Boolean TenSeries showing which values start with a match for the given regular expression. If
case=True
, the check is case-sensitive.flags
are an integer made up of flags from the “re” library. (1010data regex_beg())str.pad(self, width, side=’left’): Pad the string to the given length. Unlike Pandas, this function can only pad with space characters. (1010data padleft() or padright())
str.capitalize(self): Capitalize string: all letters lowercased except first letter and any letter following a non-letter, which are uppercased. (1010data propercase())
str.removeprefix(self, prefix): Remove the given prefix from strings, if it is found at the beginning.
str.removesuffix(self, suffix): Remove the given suffix from strings, if it is found at the end.
str.repeat(self, repeats): Repeat a string a given number of times. Unlike pandas, the
repeats
parameter cannot be a sequence of integers, but it can be an integer-valued TenSeries. (1010data strrepeat())
dt.: TenSeries have properties (and a few methods) on a special
.dt.
member for use on datetime-valued TenSeries, for compatibility with Pandas.dt.date: Returns a “date”-type column containing the date portion of the datetime. Uses the 1010data date() function.
dt.time: Returns a “time”-type column containing the time portion of the datetime. Uses the 1010data time() function.
dt.year: Returns an integer column containing the year portion of the datetime. Uses the 1010data year() function.
dt.month: Returns an integer column containing the month portion of the datetime. Uses the 1010data month() function.
dt.day: Returns an integer column containing the day (of month) portion of the datetime. Uses the 1010data day() function.
dt.hour: Returns an integer column containing the hour portion of the datetime. Uses the 1010data hour() function.
dt.minute: Returns an integer column containing the minute portion of the datetime. Uses the 1010data minute() function.
dt.second: Returns an integer column containing the seconds portion of the datetime. Uses the 1010data second() function.
dt.dayofweek: Returns an integer column containing the day of the week as an integer, according to the Pandas/ISO convention: Monday=0, Tuesday=1, … Sunday=6. Not the same as the 1010data dayofwk() function, from which it is converted! Also available as dt.day_of_week.
dt.dayofyear: Returns an integer column containing the day of the year as an integer. Also available as dt.day_of_year.
dt.quarter: Returns an integer column containing the quarter (of the year) which contains this datetime. Uses the 1010data quarter() function.
dt.is_month_start: A boolean TenSeries, equal to 1 just when the datetime occurs on the first of a month.
dt.is_month_end: A boolean TenSeries, equal to 1 just when the datetime occurs on the last day of a month.
dt.daysinmonth: An integer TenSeries, showing how many days are in the month in which the datetime occurs.
dt.is_quarter_start: A boolean TenSeries, equal to 1 just when the datetime occurs on the first day of a quarter.
dt.is_quarter_end: A boolean TenSeries, equal to 1 just when the datetime occurs on the last day of a quarter.
dt.is_year_start: A boolean TenSeries, equal to 1 just when the datetime occurs on the first day of a year.
dt.is_year_end: A boolean TenSeries, equal to 1 just when the datetime occurs on the last day of a year.
dt.is_leap_year: A boolean TenSeries, equal to 1 just when the datetime occurs in a leap year (in the Gregorian calendar. Not valid for pre-Gregorian dates.)
dt.month_name(self, *args): The name of the month in which the datetime occurs. See the documentation for the 1010data function tomonthname() for possible valid args.
dt.day_name(self, *args): The name of the weekday on which the datetime occurs. See the documentation for the 1010data function todayname() for possible valid args.
dt.weekofyear: The ISO week number of the year (deprecated in Pandas, but easier to work with in tenFrame than the dt.isocalendar() method (see below) which is supposed to replace it.) Uses the 1010data isoweek() function.
dt.isocalendar(self): Returns a TenFrame extending the parent TenFrame by adding columns named
year_*
,week_*
, andday_*
columns for the year, week number, and day number in the ISO calendar. The column names are actually the strings"year_"
,"week_"
, and"day_"
suffixed with a random string to ensure uniqueness.dt.normalize(self): Normalize datetimes to midnight.
Everything Else: Accessing any member of a TenSeries that isn’t otherwise defined results in a method which you call to apply some function to the TenSeries, see below.
Methods#
Many of the “methods” on TenSeries are really calls to 1010data functions, but there are also some that involve more work on the client side, or otherwise are not simply an interface to server-side functions. Some are mainly proxies to similar methods on the underlying TenFrame, like these:
copy(self): Makes a copy of the underlying TenFrame, and returns the “same” TenSeries as this one from that copy (i.e., the TenSeries with the same name.)
online(self): Calls the online() method of the underlying TenFrame.
groupby(self, *args): Like running groupby() on the underlying TenFrame.
head(self, n=20): Returns the first n values of this TenSeries as a one-column TenFrame (same as
self.frame[[self.colname()]].head(n)
)sort_values(self, ascending=True, inplace=False, **kwargs): Performs
sort_values()
on the underlying TenFrame and returns the TenSeries from it corresponding to this one (i.e., with the same name).isnull(self), isna(self): Returns a boolean TenSeries showing where this TenSeries is NA.
notnull(self), notna(self): Returns a boolean TenSeries showing where this TenSeries is not NA.
fillna(self, *args, **kwargs): Same as the
fillna()
method on the underlying TenFrame acting only on this column.replace(self, *args, **kwargs), where(self, *args, **kwargs): Replace values where the condition is/is not True; runs the same-named method on the underlying TenFrame, restricting to this column.
Pandas-like methods that are specific to TenSeries and aren’t just calls to 1010data functions include:
isin(self, values): Returns a boolean TenSeries on whether or not the value of this TenSeries is to be found in the values (a set, list, tuple, etc.)
In [1]: sol[sol.type.isin(['star', 'moon'])] Out[1]: name type orbits rkm vol mass dist 0 Sun star - 696324.0 1.414300e+09 1.988550e+09 0.0 1 Ganymede moon Jupiter 2634.0 7.600000e+01 1.480000e+02 1070000.0 2 Titan moon Saturn 2576.0 7.100000e+01 1.340000e+02 1221850.0 3 Callisto moon Jupiter 2410.0 5.800000e+01 1.080000e+02 1880000.0 4 Io moon Jupiter 1821.0 2.500000e+01 8.900000e+01 422000.0 5 Moon moon Earth 1737.0 2.100000e+01 7.400000e+01 385000.0 6 Europa moon Jupiter 1560.0 1.600000e+01 4.800000e+01 670900.0 7 Triton moon Neptune 1353.0 1.000000e+01 2.100000e+01 354759.0 8 Titania moon Uranus 788.0 2.000000e+00 3.000000e+00 436000.0 In [2]:
unique(self, sort=True): As in Pandas, returns the unique values in this TenSeries. Unlike Pandas, which returns a list, this returns a TenFrame, since after all the number of uniques might still be too big to fit into one computer. So a TenFrame is constructed, a copy of the one underlying this TenSeries, with operations added to compute the unique values. The
.unique()
method takes “groupby” into account, so you get the unique values for each group. By default, the TenFrame is sorted so that you can see all the unique values for a given block together, but since there are limits on the size of a table when sorting it, you can passsort=False
to disable the sorting and avoid the server-side error.In [1]: states = stations.state.unique() In [2]: states Out[2]: state 0 AK 1 AL 2 AR 3 AZ 4 CA 5 CO 6 CT 7 DE 8 FL 9 GA In [3]: len(states) Out[3]: 52 # Because of PR (Puerto Rico) and PI (Pacific Islands, i.e., Guam in this case.) In [4]: sales.transid.unique() # too many to sort! TentenException... TentenException: ('function run failed, returned 1 (Sort is limited to 100000000 rows. Limit table size via selection or tabulation if practical.)', 1) In [5]: sales.transid.unique(sort=False) Out[5]: transid 0 100001 1 100002 2 100003 3 100004 4 100005 5 100006 6 100007 7 100008 8 100009 9 100010 # They happened to be sorted anyway. In [6]: len(sales.transid.unique(sort=False)) Out[6]: 812962531 # Good thing we didn't try to download it all! In [7]: sales.groupby(["store", "date"]).customer_key.unique().dropna() Out[7]: store date customer_key 0 2 20150221 117 1 2 20150222 577 2 2 20150223 595 3 2 20150223 8489 4 2 20150224 577 5 2 20150224 595 6 2 20150224 902 7 2 20150224 903 8 2 20150225 902 9 2 20150225 1165 In [8]:
nlargest(self, n), nsmallest(self, n): As in pandas, returns a TenSeries with the n largest (resp. smallest) values.
Functions#
Just like accessing a member of a TenFrame that isn’t otherwise defined always succeeds and gives you a TenSeries, accessing a member of a TenSeries that isn’t otherwise defined also will always succeed and will give you a method which, when called, results in a TenSeries which applies the named function to the original TenSeries. Fortunately, this is easier done than said:
In [1]: points['x'] = points['theta'].sin()
will set a new x
column in the TenFrame points
to be equal to the 1010data
function sin() applied to the column points['theta']
. So internally,
points['theta'].sin()
is translated into a column in the TenFrame whose
value is defined as sin(theta). By making this happen for any function
name you use on a TenSeries, tenFrame makes it possible to access all the
1010data functions without having to handle each one individually, or
needing to update tenFrame when new ones are added. You can see a list of
all the functions that are available in the documentation.
As shown here, points.theta.sin()
becomes sin(theta)
internally, i.e., the name of the column becomes the parameter of the function in the underlying query. If the function has more than one parameter, you can supply the others in the parentheses. So if you’re using the 1010data builtin amort() function, which computes mortgage amortization and takes three parameters (rate, remaining term, and number of months of the amortization), you can say
In [1]: morts['amort'] = morts.rate.amort(morts.remterm, morts.totmonths)
which will become a column in your table whose value is computed (on the
server) by amort(rate;remterm;totmonths)
(the underlying 1010data code
uses semicolons to separate parameters, not commas), so the rate
column
on which you called the method becomes the first parameter. If you need
to use a method like this on this column but want to have a different
column or a constant first, that has to be done in other ways, such as
using the tf.func
object described below.
Note: The functions listed on this page in the 1010data documentation,
i.e., all functions whose name starts with g_
(and a few special cases
besides) are group-based functions and work differently; see below. So do
the tabulation functions listed here, and their pandas equivalents
(e.g. mean
, where 1010data uses avg
, etc.)
Aggregating Functions#
Some functions (like .sum()
or .mean()
) have the effect of calculating an aggregated statistic and not a simple function on each element of a column. These will return a TenFrame, unlike functions like .sin()
which will return a TenSeries. This is a little different from Pandas, once again going back to the lack of an index in tenFrame. Pandas will return a Series in such cases, and the index of the series will show the values of each group corresponding to each sum or average, etc. If tenFrame returned just a TenSeries, you wouldn’t see the values, which would be quite confusing. You could still get at them by looking at the .frame
of the TenSeries, but that might be counterintuitive. You can access the “results” column without much trouble if you remember that it is the last column, no matter how many variables you’re grouping by. You can use frame.lastCol
as a quick accessor to this column.
Calling Aggregating Functions#
You call these functions much like you call ordinary functions on a
column, just using their names as methods on the column you want to
aggregate. You just say things like sales.groupby('date').xsales.sum()
or sales.groupby('date').xsales.cumsum()
.
In [1]: rsales.groupby('store')['xsales'].sum()
Out[1]:
store xsales_sum
0 728 27236.55
1 1596 54580.71
2 1843 40757.82
3 1440 45021.49
4 937 65964.75
5 684 42684.91
6 244 76869.70
7 1492 61015.12
8 1998 107.48
In [2]:
It’s important to realize that though you have performed an aggregation (.sum()
) on a TenSeries, the result is actually a TenFrame with two columns (where Pandas would have a Series, i.e., a single column, labeled with the index.)
All 1010data tabulation functions (and g_funs) respect a “group” parameter which specifies how the data should be grouped. This comes from the .groupby()
part of the expression above, just as in Pandas, or you can provide it explicitly, as sales.xsales.sum(group='date')
.
Some 1010data g_funs have another special parameter, the “order”. This is important in functions like cumsum
, rank
(for ranking within a group), etc. You can supply it the same way, with the order=
keyword. If omitted, the order is simply the order in which the rows appear in the database. If supplied, it should be a TenSeries (or list of TenSeries) which determine the row order in which values will be considered. Pandas uses the index to provide this, but tenFrame lacks index information.
There’s an important distinction between the “ordered” g_funs and the others. Ordinary g_funs are like tabulations or aggregations, in that they generally return one value per group, i.e., fewer values than the number of rows they operate on. For example, computing the sum of xsales grouped by date results in only one value per date (whether done by g_fun or tabulation function). But sales.groupby('date').xsales.cumsum()
or sales.groupby('date').rank()
have different values for every row within the group, so they return columns that are the same length as their input. You generally can’t run an “ordering” aggregation like this unless the table is segmented appropriately for the computation you’re doing.
g_funs#
1010data “g_funs” are a special kind of grouping or aggregation functions. Take a look at the documentation about them for some more understanding. g_funs have an important limitation, and it is one that applies to server-side Python coding in general: you can only use g_funs if your table is “segmented” in some way appropriate to how your aggregation works (or if the table is a single segment, less than about 8 million rows long).
Segmentation#
Segmentation is a topic for another document, but the basic notion
is that the huge tables in 1010data are stored in smaller sections, and
a table can be “segmented” on one or more columns such that a block of
rows sharing the same values in those columns is never split up among
multiple segments. So if your table is segmented on storeid
, then
when you’re processing a segment (in a subproc), you know that for
each value of storeid
in that segment, all the rows for the value
are also there. Segmentation is also important for server-side coding
because when running a server-side function, the function runs on one
segment of the table (possibly on multiple machines processing
different segments at once, but each operating on one segment at a
time.) You don’t get the whole table, because once again, if you could
hold the whole table in memory at once, even on a more powerful desktop
machine than you use, you wouldn’t need the services of a Big Data
company.
g_funs vs Tabulations#
Generally, g_funs and ordinary tabulations/aggregations have a lot of overlap; they do many of the same things. But if a table is segmented such that a g_fun can be used, a g_fun will be significantly faster than a tabulation. (Note, however, that the overlap of function is not true of certain g_funs, mainly the “ordered” g_funs mentioned above. 1010data tabulations cannot perform the same functions as “ordered” g_funs. Also, there remain some subtle differences between performing a tabulation vs computing a g_fun and selection, regarding how and if the data remains distributed among multiple machines, etc.)
You don’t really have to worry much about g_funs vs tabulations (unless
you start writing your own server-side g_funs), because if your table’s
segmentation supports the grouping you are doing (and you’re in online
mode, so we can tell), tenFrame will always try to use a g_fun if one is
available. If the table’s segmentation is not in your favor, you’ll get
a tabulation instead. Basically, you write what you want and tenFrame
will work out what’s best to use. You can demand a g_fun by explicitly
using the g_
prefix in the function name, as
sales.groupby("date").xsales.g_sum()
. If you use the g_
prefix,
tenFrame assumes there is a g_fun by that name and that you know what
you’re doing by calling it. If the table isn’t segmented in a way that
accommodates your groups, the server side will throw an error. You can
use the frame.segby()
method to see how a table is segmented.
The “select” parameter is a boolean column specifying which rows are to
participate in the aggregation. If the column’s value is False (or 0)
on a particular row, the value in that row is not used in the
aggregation. sales.xsales.sum(group='date', select='promo_flag')
is
effectively the same as
sales[sales.promo_flag].xsales.sum(group='date')
, though it might be
done more efficiently on the server side. (For g_funs, this is
accomplished by means of special “select” parameter that g_funs have; if
you are doing a tabulation, a selection operation is inserted before the
tabulation.)
The actual decision process regarding what to do when faced with
frame.column.function()
is really complicated, perhaps unnecessarily
so, precisely because it’s trying to look “simple” and guess at what a
user will typically want, and like all attempts to guess what the user
wants, the results can be quite mystifying if the guess is wrong.
You probably don’t want to read this.
Translate the function name from Pandas to 1010data, if needed (e.g. pandas
mean
corresponds to 1010dataavg
).Look for
g_NAME
in the table of 1010data g_funs (i.e., add theg_
prefix and look). If it’s there:If it’s an “ordered” function (has “order”/”O” as one of its parameters), build a column defined by a call to
g_NAME(groups;select;order;X[;Y...])
. Tabulations don’t do ordered functions (except for thefirst
andlast
functions, which are checked for separately, and a tabulation will be done for them if the segmentation is not appropriate for g_funs). If the segmentation is unfriendly, the server side will complain.Otherwise,
If the segmentation of the table is appropriate for these groups (i.e., at least one group is in the set of things the table is segmented by, or the table is only one segment long anyway, etc.), build a column defined by a call to
g_NAME(groups;select;X[;Y...])
, followed by a selection ong_first1()
to select only the first value, so as to make this behave like a tabulation, which changes the rank of the column.Otherwise, do a tabulation, using
fun='NAME'
Otherwise (if
g_NAME
is not listed as a known 1010data g_fun),If the name actually starts with
g_
already, then we assume the user knows what they’re doing and build a column withNAME(groups;select;X[;Y...])
, orNAME(groups;order;select;...)
if theorder=
parameter was explicitly given, followed by a selection as before.Otherwise,
If
NAME
is known to be a function used with the tabulation operator, perform a tabulation with it.If there is some grouping specified, either by
groupby()
on the table or column being operated on or by thegroup=
parameter explicitly, we assume the user must want some kind of aggregation happening, so do a tabulation or g_fun column (without addingg_
to the function name), accordingly as the table’s segmentation includes something we’re grouping on.Finally, if all else fails and there are no groups specified, assume the user is after a non-aggregating ordinary function, any of the functions in the function reference that aren’t listed as g_funs or some user-defined u_fun (as with
@frame.def_ufun
).
In a call to .agg()
where multiple functions are used together,
either all of them become tabulations or all of them become g_funs. If
some must be tabulations (because there is no corresponding g_fun) and
some must be g_funs (because the function starts with g_
or other
reasons), an exception is raised.
Adjoining Aggregations - computing new aggregated columns without joins#
Aggregating functions (as well as the .agg()
method on TenFrames) can optionally be called with the adjoin = True
parameter to obtain some very useful results not easily achievable with pandas. Most aggregations result in a table which is shorter than the original data. For example, frame.groupby("month").sales.sum()
can be expected to have only 12 rows, no matter how long the original data is. But sometimes it’s useful to retain the granularity of the original data along with adding the aggregated information, so that each row also has information on the sum of sales that happened in the same month as the transaction listed in that row. In Pandas, you can do something similar with
frame.merge(frame.groupby("month").sales.sum(), on="month")
The resulting column names wind up a bit confusing, but what’s worse is
that you have to perform a potentially time-consuming join (the
.merge()
method). The adjoin
parameter in aggregations lets you do
this in a single expression, but more importantly, if your table is
segmented by the groupby column(s), no join is required, so the
calculation is extremely fast.
Adjoining returns a new TenFrame, not a TenSeries to assign into an existing one, so you can just say
frame_sums = frame.groupby("month").sales.sum(adjoin=True, name="monthly_sales")
using the name=
parameter to set the name for the new column (if not
set, it would default to "sales_sum"
).
There is even some special-case detection done so that you can do it automatically during an assignment and simply say
frame['monthly_sales'] = frame.groupby("month").sales.sum()
Note
Arguably, this shouldn’t make sense, since the frames aren’t “compatible” at first glance: the right-hand frame is an aggregation and has fewer rows than the left-hand frame. Besides, the right-hand expression actually returns a TenFrame, and you can’t assign a TenFrame as an element (a column) of another TenFrame! However, the meaning of such an assignment is obvious and it’s so useful that tenFrame makes a special exception for this sort of thing and does what you want.
An example using the weather stations data mentioned above:
stations = tf.TenFrame(session, "pub.demo.weather.stations")
stations.groupby("state").elev.mean(adjoin=True)
stations
results in
id name state tz lat lon elev anht elev_mean
0 3103 FLAGSTAFF AZ -7 35.133 111.667 2135 6 1196.000000
1 3812 ASHEVILLE NC -5 35.433 82.533 661 6 218.333333
2 3813 MACON GA -5 32.700 83.650 110 7 144.333333
3 3816 PADUCAH KY -6 37.067 88.767 125 10 251.800000
4 3820 AUGUSTA GA -5 33.367 81.967 45 6 144.333333
5 3822 SAVANNAH GA -5 32.133 81.200 16 9 144.333333
6 3856 HUNTSVILLE AL -6 34.650 86.767 190 8 127.750000
7 3860 HUNTINGTON WV -5 38.367 82.550 255 6 476.000000
8 3870 GREENVILLE SC -5 34.900 82.217 296 7 125.666667
9 3872 BECKLEY WV -5 37.783 81.117 765 10 476.000000
Note that all the rows are still there, and the mean elevation for each
state is on all the rows for stations in that state (look at the three
GA
rows in the example above, or the two WV
rows.) Each record
automatically has the correct value for elev_mean , that is no join/merge
was required as in pandas.
that
and func
#
TenFrame provides some special values which can be very useful in making code more concise and simpler to write and to read. There are things that they make possible that would be extremely difficult to express in other ways.
that
#
that
is a sort of generic TenFrame. In a sense, it can be used to be
referring to whatever TenFrame it’s being used with. It’s probably easier
to understand from examples:
In [1]: s2019 = tf.TenFrame(session, "retaildemo.retail.sales_detail")[tf.that.date.year() == 2019]
In [2]: bigdays = s2019.groupby('date').agg({'price':'sum', 'cost':'sum'})[tf.that.price_sum > 7.0e6]
In [3]: bigdays
Out[3]:
date price_sum cost_sum
0 20190201 7.288699e+06 6.376741e+06
1 20190216 7.055807e+06 5.846130e+06
2 20190303 8.405907e+06 7.064833e+06
3 20190420 9.655613e+06 8.208903e+06
4 20190505 8.504962e+06 7.208672e+06
5 20190622 7.200566e+06 5.986890e+06
6 20190623 7.708646e+06 6.389132e+06
7 20190810 7.641546e+06 6.518093e+06
8 20190825 8.023629e+06 6.682147e+06
9 20191012 7.211072e+06 6.209403e+06
In [4]:
As you see, we are selecting from a frame on a condition involving its columns, without needing to assign the frame to a variable in order to access its columns. tf.that
already has all its columns, so to speak. That’s where the name comes from: you’re working with some TenFrame or another, possibly a complicated expression for one, and you just want to refer to something in “that.”
Or you can use columns from that
in expressions with another TenFrame, as a way of referring to that frame’s columns:
In [1]: s2019 = sales[sales.date.year() == 2019]
In [2]: sumcosts = s2019.groupby('date').agg({'price': 'sum', 'cost': 'sum'})['price_sum'] - tf.that['cost_sum']
In [3]: sumcosts
Out[3]:
xvwm83p5
0 8.707065e+05
1 1.012163e+06
2 9.873511e+05
3 9.119584e+05
4 9.840235e+05
5 1.209677e+06
6 1.341073e+06
7 1.228974e+06
8 1.027738e+06
9 9.813348e+05
Note that sumcosts
is a TenSeries (not a TenFrame), which makes sense as it is the difference of two TenSeries: the price_sum
of the aggregation done on s2019, and the cost_sum
of… that same table, the aggregated table you got from running agg()
! Using tf.that
saves you from having to type the whole thing in all over again.
But I’m sure you’re wondering: what kind of name is xvwm83p5
?? It’s an artifact of how tenFrame has to operate. Usually you don’t see it, and we’ll look at ways to avoid it in a moment, but in this simplest case, without any embellishments to fix it, unfortunately it is exposed for you to see. If all you need is the values from the column (which is likely, if you’re not bothering to assign the aggregate TenFrame to a variable), it won’t matter to you what it’s called, so that’s okay. But this kind of thing will turn up now and then as you experiment with tenFrame, so you should at least know that it happens and that it’s normal.
that
can also be helpful in subscripting. If you try to do this
In [1]: base = tf.TenFrame(session, "pub.demo.baseball.master")
In [2]: fifties = base[base.birthyear >= 1950][base.birthyear < 1960]
you’ll get an error that the frames are not “compatible.” Why that is so is another matter, but you can easily make it work by saying
fifties = base[base.birthyear >= 1950][tf.that.birthyear < 1960]
You can even use tf.that
for both subscripts. Of course, in this case syntax like
fifties = base[(base.birthyear >= 1950) & (base.birthyear < 1960)]
is also available and probably preferable.
func
#
The tf.func
object uses some of the machinery from tf.that
and
makes it a little easier to write function expressions on columns without
being subject to the method-calling syntax.
As mentioned above, something like frame.col1.fun(frame.col2, frame.col3, frame.col4)
becomes an expression in the 1010data code that looks like fun(col1; col2; col3; col4)
, i.e., the object on which the function is called becomes the first parameter. But maybe that isn’t what you need, and it isn’t convenient to reorder things so you have a different column at the front. Or maybe the column is a complicated expression, like below. You can build a function expression with tf.func
to specify exactly what you mean to call. So using the example above and the 1010data amort
function,
In [1]: morts['amort'] = morts.rate.amort(morts.remterm, morts.totmonths)
You can say the same thing without worrying about what column you’re calling a method on like this:
In [1]: morts['amort'] = tf.func.amort(morts.rate, morts.remterm, morts.totmonths)
or even
In [1]: morts['amort'] = tf.func.amort(tf.that.rate, tf.that.remterm, tf.that.totmonths)
using tf.that
to provide you with a generic way to refer to a column.
If you had used strings, like "rate"
and "remterm"
, they would
have been taken to be strings and quoted when translated into XML, and you
would have gotten something like amort('rate';'remterm';'totmonths')
,
which is not what you want.
Module-Level Functions#
There are a few module-level functions in tenFrame, often there to emulate a similar usage in pandas. These include:
tf.islive(session): Returns True if the given py1010 Session object is not only logged in, but is currently “live” and responsive to queries. It actually creates and runs a dummy query on the session to confirm that the session really is running queries and not timed out, etc.
tf.directory(session, dirname, to_df=True, **kwargs): Returns a Pandas DataFrame containing the names and metadata about all the tables and subdirectories in the directory named in the 1010data object tree. If
to_df
is False, returns a TenFrame instead.tf.merge(frame, *args, **kwargs): The same as
frame.merge(*args, **kwargs)
. Even thoughmerge()
makes sense as an instance method on frames, pandas chooses to define it as a module-level method, so TenFrame supports the same syntax.tf.melt(frame, *args, **kwargs): The same as
frame.melt(*args, **kwargs)
, for the same reason asmerge()
. In this case, pandas supports the latter syntax as well.tf.wide_to_long(frame, *args, **kwargs): The same as
frame.wide_to_long(*args, **kwargs)
, for the same reason as merge().tf.concat(frames, *args, **kwargs): Takes a sequence of frames and runs the
.concat()
method on them sequentially to concatenate them all together. Equivalent toreduce(lambda x, y: x.concat(y, *args, **kwargs), frames)
. Again, provided for compatibility with how Pandas does it.tf.select(condlist, choicelist, default=0): Returns a string which contains a 1010data expression equivalent to the numpy
np.select()
method. This string is suitable to be set as a new column of a TenFrame, and will not be quoted. Soframe['success'] = tf.select([frame.totsales > 10000, frame.totsales > 1000, frame.sales > 10], ["Great", "So-so", "Lousy"], "Rotten")
will create a column
success
which will contain one of the strings listed above, depending on the value of thesales
column.
More Examples: Merging Tables#
The tenFrame .merge()
method works pretty much the same as the pandas
one, subject to what the 1010data engine can do. It uses the 1010data
“link” operator, whose capabilities overlap those of pandas .merge(),
but neither is a subset of the other.
The merge() signature looks like this:
merge(left, right, on, right_on=None, how=”left”, cols=None, **kwargs)
TenFrame merge()
works about the same as pandas:
sales = tf.TenFrame(session, "retaildemo.retail.sales_detail")
salesprod = tf.merge(sales, "retaildemo.retail.products", on="sku")
yields a frame that starts with:
transid date tme sku upc store qty weight xsales price ... group groupdesc category categorydesc pl size is_random_wgt brand brand_name manuf
0 100001 2012-12-31 03:52:07 222218 7084781260 728 1.0 1.0 1.97 1.97 ... 2045 ENERGY DRINKS 2045010 ENERGY-SINGLE SERVE 0 15 OZ 0 MONS Monster Monster Beverage
1 100002 2012-12-31 04:07:58 260548 81809400001 728 1.0 1.0 1.39 1.39 ... 2045 ENERGY DRINKS 2045010 ENERGY-SINGLE SERVE 0 16 OZ 0 ROCK Rockstar Rockstar
2 100003 2012-12-31 05:10:44 224423 7116603129 728 1.0 1.0 2.77 2.77 ... 1520 BREAD/BAKED GDS 1520100 BRD-BRKFST CAKES/SWEET ROLLS 0 5 OZ 0 SVNH Svnh <NA>
3 100004 2012-12-31 05:38:59 306141 999999907314 728 1.0 1.0 0.49 0.49 ... 1460 SNACK -CHIPS/PRETZELS 1460030 SNACK-POTATO CHIPS 1 1.5 OZ 0 PRV Priv Lbl PL Mfg
4 100004 2012-12-31 05:38:59 101528 1200000452 728 1.0 1.0 2.57 2.57 ... 2040 NEW AGE BEVERAGE 2040060 NAB COFFEE 0 13.7 OZ 0 STRBK Starbucks Pepsi/Starbucks
5 100004 2012-12-31 05:38:59 110966 2200000898 728 1.0 1.0 1.27 1.27 ... 1390 CONFECTION 1390050 CONF GUMS-MINTS 0 15 CT 0 WRIGL Wrigleys Mars
6 100005 2012-12-31 06:04:51 161510 6150000594 728 2.0 1.0 1.42 0.71 ... 2045 ENERGY DRINKS 2045010 ENERGY-SINGLE SERVE 0 16 OZ 0 RRAIN Rrain <NA>
7 100006 2012-12-31 06:31:45 105159 1590000209 728 1.0 1.0 1.29 1.29 ... 2620 MEAT DEPT. DELI 2620115 FRANKS 0 16OZ 0 BAR-S Bar-S Alfa
8 100006 2012-12-31 06:31:45 102143 1230000055 728 2.0 1.0 13.32 6.66 ... 2160 CIG SINGLE PACKS 2160010 CIG-SINGLE PACKS 0 EACH 0 MORE More <NA>
9 100006 2012-12-31 06:31:45 298945 999999900118 728 1.0 1.0 0.82 0.82 ... 1520 BREAD/BAKED GDS 1520020 BRD-BUNS-FRESH 1 8 CT 0 PRV Priv Lbl PL Mfg
You have a left TenFrame and a right TenFrame (which may be just a string
naming a table), you name the column to join on with on=
. If the
column has a different name in the right TenFrame you use on=
and
right_on=
(again, like in Pandas). Note, however, that TenFrame joins
default to how="left"
, and only how="left"
and how="inner"
are
supported. Values of "right"
, "outer"
, and "cross"
are not
supported in TenFrame merging. However, the 1010data-specific values
"include"
, "exclude"
, and "asof"
are supported; see the
documentation for the link op.
The cols=
parameter lets you specify exactly which columns from the right-hand TenFrame are to be adjoined in the merge. The kwargs
can be used to set other attributes on the XML code being generated, for users who are familiar with our macrocode. One thing you can add this way is suffix="_y"
, to specify a suffix to add to the columns from the right-hand table to avoid conflict. (There is no way to specify a suffix for the left-hand table.)
Concatenating#
There is also a concat()
function, like in pandas:
newframe = tf.concat([frame1, frame2], axis=0)
The axis can be 0
(rows, the default) or 1
(columns). (The rest of the arguments for pandas concat()
function mainly deal with indexes and other things that are not relevant to tenFrame.) The two types of concatenation are actually handled rather differently in 1010data terms. Note that if you are joining by columns, the right-hand frame cannot be a “worksheet,” meaning a 1010data query whose data has all been brought into the accum (by doing a tabulation, for example.)
Both merge()
and concat()
are also available as TenFrame methods, which may be more intuitive.
salesprod = sales.merge("retaildemo.retail.products", on="sku")
newframe = frame1.concat(frame2, axis=0)
Offline Mode#
TenFrame actually does two distinct cool things: (1) it constructs a 1010data macrocode query mimicking the effects of various pandas-like manipulations, and (2) it runs that query and makes the results available to the programmer, in a more or less pythonic/pandas-like way. The first of these is actually independent of the second, and it may be useful to develop a query “offline,” not dependent on a connection to 1010data’s servers, and then use the macrocode derived from the TenFrame, perhaps modifying it further, etc. To that end, tenFrame does not require py1010 to run (it doesn’t even need to be installed), and you can create a TenFrame without a py1010 Session. To do so, supply None
in the TenFrame constructor in place of the session:
frame = tf.TenFrame(None, "pub.cool.table")
The resulting TenFrame will do its level best to perform all the functions of any other TenFrame without ever trying to contact the 1010data servers. Then you can get the XML of the macrocode query using the .extractXML()
, .prettyXML()
, or .printprettyXML()
methods as mentioned above.
In offline mode, there are some things a TenFrame can’t do that it could if it were online. Obviously, it can’t retrieve data from the results, since there are no results because the query can’t be run. If you try to get data from the results, you’ll get back the value NotImplemented
. Similarly, the .size
and .shape
properties are unavailable (and will return NotImplemented
); this also applies to using a negative index when subsetting the rows (because the number of rows is not known), and the len()
of a TenFrame in offline mode will always be 0
. But there are some other limitations too.
In particular, if you do an aggregation on a TenFrame and don’t supply a dictionary ({'column':'function',...}
or {'column':['function',...],...}
) but only a function name or a list of function names, those functions are supposed to be applied to all applicable columns. Not all functions can be applied to all function types. Specifically, string columns cannot be tabulated by numeric functions like sum or average, but only by minimum, maximum, count, nunique, and first (a function not found in pandas.) So in order to know how to tabulate an underspecified aggregation like this, tenFrame needs to know all the columns and their types – which it cannot know in offline mode, so this results in an error. You’d do best simply to provide a fully-specified aggregation dictionary to .agg()
. Relatedly, when offline tenFrame can’t tell what a table’s segmentation is like, so it can’t determine when it can use g_funs and defaults to tabulations always.
Important Differences from Pandas#
We’ve already touched on some of the differences between tenFrame and pandas. It’s unlikely that this section will ever be an exhaustive list of all the differences, but some notable ones bear mentioning (or repeating.)
Nonexistent Columns#
One striking difference between a TenFrame and a DataFrame is how they
handle trying to access a column that isn’t in the table. A DataFrame
knows exactly what’s in it, what all the columns (Series) are and even
what’s inside them, and if you try to access a column that isn’t there,
either with df.wrongcolumn
syntax or df['wrongcolumn']
, you’ll get
an AttributeError or a KeyError respectively. But there is no way to know
what columns are in a 1010data query without actually running the query
(example: cross-tabulation), and we’d like not to have to do that unless we
really need to. (It’s true that 1010data queries are cached on the server
side and so repeated runs will be fast, but there’s still the network
latency, which adds up when there are many calls, and also for various
reasons there may be slight differences in the queries that defeat the
cache.) And besides, tenFrame should also work in “offline mode,” where
it has no access to running the queries at all. So it turns out to be
simplest to make it so that trying to access a column (TenSeries) in a
TenFrame always succeeds. So instead of raising exceptions, you’ll find
yourself getting mysterious empty TenFrames, as tenFrame runs the query and
then removes all the columns except for one – and that one isn’t there.
There is some attempt to avoid this issue when you’re working from the
command-line, in the REPL (or jupyter, etc.) When a TenSeries is being
converted for display in the REPL, there is a check to see if the column
really exists (the query has to be run then anyway, after all), and if not,
an exception is raised.
Nonexistent Functions#
Likewise, you can call a function on a TenSeries and discover later that it doesn’t exist. In order to make the whole gamut of 1010data functions available automatically, tenFrame takes any attribute on a TenSeries (which isn’t already defined to mean something else) to mean a function that you intend to call on that TenSeries as part of the query. So if you try to call a function that doesn’t exist on a TenSeries, there will be no exception until the query is actually run. This is probably less jarring than non-existent columns, since after all, you do at least get an error, and query run-time is when you should expect to get it. Just that in tenFrame, query run-time isn’t necessarily immediately, like it is with pandas.
Selection by Callable#
Here’s an important thing from Pandas which will not always work in tenFrame, and it’s important to know why.
# Using the "pub.demo.weather.stations" dataset again
>>> df[lambda d: d.lon > 100]
id name state tz lat lon elev anht
0 3103 FLAGSTAFF AZ -7 35.133 111.667 2135 6
134 21504 HILO HI -10 19.717 155.067 11 6
135 22516 KAHULUI HI -10 20.900 156.433 15 6
136 22521 HONOLULU HI -11 21.333 157.917 5 8
137 22536 LIHUE HI -11 21.983 159.350 45 6
.. ... ... ... .. ... ... ... ...
243 94008 GLASGOW MT -7 48.217 106.617 700 6
244 94014 WILLISTON ND -6 48.183 103.633 579 6
245 94185 BURNS OR -8 43.583 119.050 1271 5
246 94224 ASTORIA OR -8 46.150 123.883 7 6
247 94240 QUILLAYUTE WA -8 47.950 124.550 55 7
[100 rows x 8 columns]
>>>
This will work in tenFrame only in environments where server-side python is available, and a little reflection on how tenFrame works can tell you why. Remember that a TenFrame really isn’t a Dataframe; it doesn’t really have its data in memory to apply functions to. It’s just a set of instructions being assembled into a query to ask the 1010data Edge server. In Pandas, you’re defining a function (with lambda
) and Pandas applies it to to the DataFrame and does the selection. But in tenFrame, the selection is being done on the server side (the whole point of 1010data being to be able to select and operate on such big tables), and your lambda needs to be defined there, not on your local side. So tenFrame will take the source code of your function and make it part of the query that is sent up to the server side, for it to be executed there – assuming that server-side python is available. This also has some other quirks that may raise their heads from time to time.
Dates, Times, and Datetimes#
1010data tables and queries support values representing dates and times, like Pandas does, but they are stored differently and may behave differently.
Normal data in a 1010data table has only 4 types: integer (32-bit), long integer (64-bit), float, and string. There is no real “date” type or “time” type, etc. But each column has “format” information as part of its metadata, which is used by some other 1010data products to specify just how to format the data for output (how many decimal places, commas, etc.) and the format data may also say that a column’s values are to be interpreted as dates or times or datetimes (a timestamp of a date and a time) or even things like a year and a fiscal quarter, etc. These are stored rather differently from one another, and not necessarily as consistently as you might thing (e.g. dates are stored as integers, giving the year, month, and day simply by their digits, as 20220325, while datetimes are stored as floating-point numbers counting the number of days and fractions of a day since a known fixed date.) The lower-level py1010 library on which tenFrame is based automatically detects these formats and converts date, times, and datetimes into Python date objects, time objects, and datetime objects respectively.
As with Pandas Series, TenSeries objects have a .dt
member which gives
access to various properties of datetime columns (and a few methods); see
above.