Basic usage

This section shows how to perform basic operations on the TenFrame.

We discuss tenFrame basics such as starting a session, creating a TenFrame from a 1010data table, subsetting rows and columns, computing new columns, aggregating data, and joining tables.

Opening a table

To open a table, you must supply the TenFrame object with a py1010 Session object and a 1010data table to open.

The following example creates a simple TenFrame object for the table pub.demo.weather.stations:

import py1010
import tenFrame as tf

session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
stations = tf.TenFrame(session, "pub.demo.weather.stations")
print(stations)
Note: 1010data URL is the gateway and version you use to access 1010data. USERNAME and PASSWORD are the user name and password you use to access 1010data.

First, import py1010, 1010data's Python library, and then import tenFrame. Next, start a session with the py1010 Session object. You can then create a TenFrame object stations based on the session and the table pub.demo.weather.stations.

Note: See Establishing a Session in the 1010data Python SDK User's Guide for more information about starting a py1010 session.

As mentioned above, stations is the resulting TenFrame object, which in the example above is based on the pub.demo.weather.stations table.

You can print stations to return a representation of the TenFrame object, or the first 10 entries of the table.

Note: To convert the results of a TenFrame to a DataFrame, use the to_df() function:
dstations = stations.to_df()
print(dstations)
You will see that pandas works similarly, although pandas displays the first and last few rows of a long frame and the number of rows and columns at the bottom, as follows.

Getting basic table statistics

The .describe() method is similar to the one in pandas. It returns a pandas DataFrame with the following statistics for all the numeric columns in the table:
  • count
  • mean
  • std
  • min
  • lquart
  • median
  • uquart
  • max
If include is set to all, the following rows are added for all string columns:
  • count of unique values
  • mode
  • frequency of mode
stations = tf.TenFrame(session, "pub.demo.weather.stations")
print(stations.describe(include='all'))

Subsetting rows

You can subset rows in tenFrame similarly to subsetting rows in pandas.

Once you have a TenFrame object, you can subset rows to refine the data.

In the following example, we create the TenFrame object retailitem based on the session and the pub.demo.retail.item table. Then, we can perform simple operations on the table, such as returning the results of store 1 only. You can use len() to return the number of rows in the selection (17).
import py1010
import tenFrame as tf

session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
retailitem = tf.TenFrame(session, "pub.demo.retail.item")
print(retailitem[retailitem.store == 1])
print(len(retailitem[retailitem.store == 1]))

The following are additional ways you can subset rows in tenFrame:

Notation Description Example
frame[value1:value2] Slice of data from value1 to value2. sales[3:1000]
frame[condition] Subset data that meets a certain condition sales[sales.qty > 19]
frame.sample(frac) Subset a random sample of the data sales.sample(frac=1e-4)
frame.head(n), frame.tail(n) Subset first/last rows sales.head(n)

Subsetting columns

You can subset one or more columns of a TenFrame much as you would subset one or more columns of a DataFrame in pandas.

When you subset a single column of a TenFrame, such as stations['id'] it yields a TenSeries, but when you subset more than one column, or even a list of a single column, such as stations[['id', 'state']] or stations[['id']], it yields a TenFrame.

If you subset by a string, tenFrame returns the column of that name which is in the frame. In the following example, stations['state'] evaluates to the state column, and the first few values of it are printed out.

import py1010
import tenFrame as tf

session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
stations = tf.TenFrame(session, "pub.demo.weather.stations")
print(stations['state'])

You can subset more than one column as follows:
stations[['id','name','state']]

It is possible to use the * wildcard character to subset multiple columns. For example, baseball[["birth*"]] is the same as baseball[["birthmonth", "birthday", "birthyear"]].

You can also subset a single column as an attribute, for example, stations.state, but to set a new column, you must use brackets, as follows:
sales['margin'] = sales.xsales - sales.cost

The new column margin is in brackets, while the existing columns xsales and cost are attributes. See Computing new columns for more information.

Note: Unlike pandas, tenFrame will allow you to subset by a non-existent column. This is because tenFrame has no way of knowing what the columns in a 1010data query are without first running the query.

tenFrame, unlike pandas, allows you to subset by an integer. stations[2] is the same as stations['state'] for the stations TenFrame.

Computing new columns

You can assign one new column to a TenFrame or assign several new columns at once in a new TenFrame.

To set a new column in a TenFrame, put the new column name in brackets, and then define the column, as follows:
retailitem['margin'] = retailitem.sales - retailitem.cost

The following example uses the retaildemo.retail.sales_detail table.

import py1010
import tenFrame as tf

session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
sales = tf.TenFrame(session, "retaildemo.retail.sales_detail")
sales['margin'] = sales.xsales - sales.cost
print(sales)

Remember, in tenFrame, the new column is not computed until you access the data.

TenFrame also allows you to add one or more new columns in a new TenFrame with assign(), as follows:
promo2 = promo.assign(promodiscount=promo.price * 0.05, 
    promoprice = promo.price - (promo.price * 0.05))
print(promo2[['price', 'promodiscount', 'promoprice']])

Manipulating dates and times

1010data stores dates differently from pandas. In 1010data, there are no real "date" or "time" types. Dates are stored as integers, giving the year, month, and day in the following format: YYYYMMDD. For example, 1010data stores the date 3/25/2022 as the integer 20220325. Datetimes are stored as floating-point numbers counting the number of days and fractions of a day since a known fixed date. Dates and times are covered extensively in the 1010data Reference Manual, in the Dates and time topic.

The following code selects all records with a sales date equal to or later than 1/1/2016 and earlier than 1/1/2017, and merges those records with the products table:
import datetime

sales = sales[((datetime.date(2016,1,1) <= sales.date) 
     & (sales.date < datetime.date(2017,1,1)))]
salesprod = tf.merge(sales, "retaildemo.retail.products", on="sku")
print(salesprod)

Aggregating data

Aggregation in tenFrame is similar to pandas aggregration.

You can use the groupby() method, along with aggregation functions, to aggregate data that is grouped by the values of one or more columns.

The following code groups sales by store, and then uses sum() to aggregate the total sales by store.

import py1010
import tenFrame as tf

session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
retailitem = tf.TenFrame(session, "retaildemo.retail.sales_detail")
aggsales = retailitem.groupby('store')[['xsales']].sum()
print(aggsales)

You can use the agg() method for more complex aggregations, such as aggregating multiple columns in different ways at the same time, as follows:

sales.agg({"price":"sum", "cost": ["max", "min"], "xsales":"mean"})

Note that tenFrame aggregations are always TenFrames containing the group columns (if any) and the result, while in pandas an aggregation such as df.col.sum() is a Series, labeled with the index, and not a DataFrame.

There is a large library of 1010data functions available to aggregate data. The 1010data Reference Manual contains a complete list of tabulation and group functions. The following is a sameple:

Function Description Example
sum() Sum sales.xsales.sum()
count() Count of N/A values sales.customer_key.count()
median() Median sales.qty.median()
min() Minimum sales.xsales.min()
max() Maximum sales.xsales.max()
mean() Mean sales.xsales.mean()
var() Variance sales.cost.var()
std() Standard Deviation sales.price.std()
Note: Some of the functions (such as max(), min(), and mean()) have different names as 1010data functions. tenFrame automatically converts them to the more familiar pandas function names.

Adjoining aggregations

Adjoining aggregations allows you to compute new aggregated columns easily. In addition, if your table is segmented by the groupby() column(s), tenFrame accomplishes aggregations without time-consuming joins. Adjoining returns a new TenFrame, not a TenSeries to assign into an existing TenFrame. Most adjoined aggregations result in a table that is shorter than the original data. You still use the groupby() method, but in the aggregation function, you would set adjoin=True, as follows:
frame_sums = frame.groupby("month").sales.sum(adjoin=True, name="monthly_sales")

The name parameter is optional, and if omitted, the default column name would be sales_sum.

The following example shows how to use adjoining aggregations to find the mean elevation of weather stations by state:

stations = tf.TenFrame(session, "pub.demo.weather.stations")
stations.groupby("state").elev.mean(adjoin=True)
stations

adjoin=True keeps all the rows of the table, instead of just providing a mean elevation once for each state (note the three GA rows and the two WV rows). If the table is properly segmented, tenFrame accomplishes this without a join, and performs the calculation quickly. This is useful when you want to perform aggregations, but you want to retain the granularity of the original data. In pandas, you cannot do this type of aggregation without a time-consuming join (the .merge() method).

Creating pivot tables

The tenFrame pivot_table() method works approximately like the corresponding pandas method. It creates a spreadsheet-style pivot table as a TenFrame. If you are familiar with 1010data, these are also referred to as cross tabulations.

The basic syntax of the pivot table is as follows:

pvt = frame.pivot_table(values, index, column, aggfunc="mean", rename=True)

In a pivot table, values become column names. While pandas column names are unrestricted, 1010data columns have restrictions, such as beginning with a letter and consisting of only lower case letters, numbers, and underscores. tenFrame does its best to rename columns using these restrictions, but you should be aware that tenFrame may need to rename columns.

In the following example, the TenFrame cust_sales_2019 filters the cust_sales TenFrame to contain the sales of 2019 only. We merge cust_sales_2019 with products to form the TenFrame cust_sales_2019_with_deptdesc. (See Joining TenFrames for more details.) Finally, we create the TenFrame cust_by_deptdesc. cust_by_deptdesc is a pivot table with the following parameters:

  • xsales is the column to aggregate (total sales)
  • customer_key is the column to group by
  • deptdesc is the column to use for the pivot table's column names
  • sum is the aggregating function
import py1010
import tenFrame as tf
session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
cust_sales = tf.TenFrame(session, "retaildemo.retail.sales_detail_customer", 
    meta=['alloc_rr=20','nomaxtabu','nomaxtabc'])
products = tf.TenFrame(session, "retaildemo.retail.products")
cust_sales_2019 = cust_sales[cust_sales.date.year() == 2019]
cust_sales_2019_with_deptdesc = cust_sales_2019.merge(products, on='sku')
cust_by_deptdesc = cust_sales_2019_with_deptdesc.pivot_table('xsales', 'customer_key', 'deptdesc', 'sum')
print (cust_by_deptdesc)

Joining TenFrames

The merge() method in tenFrame is similar to the pandas merge() method.

You can use the merge() method to join two TenFrames. The basic syntax is tf.merge(frame1, frame2, on=column). This is similar to the pandas syntax. You can also call merge() as a method on a TenFrame. In that case, the syntax is frame1.merge(frame2).

The following code merges two frames, retailitem and productdetail, on the column sku. By default, tenFrame performs a "left" join.
Note: Currently, tenFrame supports "left" and "inner" joins only.
import py1010
import tenFrame as tf

session = py1010.Session("1010data URL", "USERNAME", "PASSWORD", py1010.POSSESS)
retailitem = tf.TenFrame(session, "retaildemo.retail.sales_detail")
productdetail = tf.TenFrame(session, "retaildemo.retail.products")
mergedtable = tf.merge(retailitem, productdetail, on="sku")
If the right-hand side of merge() is just a table, you can use a string naming the table instead of making a TenFrame for it, for example:
frame1.merge("retaildemo.retail.products", on="sku")

or

tf.merge(frame1, "retaildemo.retail.products", on="sku")

Concatenating TenFrames

The concat() method is similar to the one in pandas. concat() adds rows (with axis=0, the default value) or columns (with axis=1). If you are appending rows, the two TenFrames should have the same number of columns. If you are joining by columns, the added TenFrame cannot be a worksheet. In 1010data terms, a worksheet is a 1010data query whose data has all been brought into the accum.
newframe = frame1.concat(frame2, axis=0)

newframe is simply the rows of frame1 followed by the rows of frame2.

Using group functions (g_functions)

If you are an experienced user of the 1010data macro language, you are probably already familiar with group functions, or g_functions. If not, you may want to review the Group functions section of the 1010data Reference Manual. Group functions are a special kind of grouping function that have an important limitation: you can use them only if your table is segmented in some way appropriate to your aggregation, or if your table is a single segment of fewer than 8 million rows. If you are able to use g_functions instead of a tabulation, g_functions are significantly faster. g_functions also allow you to create ordered aggregations.