Server-Side Python#
TenFrame offers support for running your own Python code on the server side, i.e. on the 1010data server which is doing the processing. The actual details and a better description of the limitations involved are the subject of server-side coding documentation, but we’ll look at how to use it in tenFrame. It gets a little complicated, because there are multiple layers of processing involved:
The client-side Python environment that you are running
The XML macrocode of the query, into which your code is inserted
The server-side Python environment invoked by the query.
It is important to keep them straight and recognize what does and does not happen at the various points.
Note also that server-side Python is only available in some environments, so depending on your particular situation it may not be available to you. In that case, any tenFrame functionality that depends on server-side activity will raise an exception.
Automatically Uploaded Functions#
We’ll see below that there are various ways of explicitly marking functions to be included in a TenFrame’s query, but there are also circumstances in which a function which is needed on the server side will “automagically” be made available there, with no effort required on your part. On one hand, this makes things easier for you, but on the other hand, it means that when little quirks show up that interfere, the results can be mystifying.
assign()#
One situation is the use of the .assign() method on TenFrames. Just as in Pandas, if one of the columns being assigned is a callable, the resulting value is computed by calling that function on the DataFrame (with one caveat, described in the next paragraph). You don’t have to worry about special decorators or anything: just define a function that takes a DataFrame and returns a numpy array of results (as you would in Pandas, which expects more or less the same thing) and use that callable in the assign() expression. The function will automatically be uploaded as part of your query and called in the right place.
One thing you need to be aware of, though, is that unlike in Pandas, your function might not be getting the whole of the data (see above: Big Data is Big). Your function will be passed a DataFrame that contains one segment of the data. A lot of the time, this is fine, since .assign() is frequently used to compute functions that are local to a row, either by explicitly iterating over the frame or, more likely, by relying on the implicit element-wise evaluation of operators in Pandas and numpy. But you need to know about it if you have large, segmented data.
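A minimal sketch of why segmentation matters, using plain Pandas on the client rather than tenFrame itself. The column names, both example functions, and the run_per_segment helper are hypothetical, and the two-way split only stands in for whatever segmentation the server actually applies.

```python
import numpy as np
import pandas as pd


def density(df):
    """Row-local: each result depends only on values in the same row."""
    return (df["mass"] / df["vol"]).to_numpy()


def share_of_total(df):
    """Not row-local: depends on the sum over every row the function sees."""
    return (df["mass"] / df["mass"].sum()).to_numpy()


def run_per_segment(func, segments):
    """Call func once per segment and stitch the results back together."""
    return np.concatenate([func(segment) for segment in segments])


df = pd.DataFrame({"mass": [2.0, 4.0, 6.0, 8.0], "vol": [1.0, 2.0, 2.0, 4.0]})
segments = [df.iloc[:2], df.iloc[2:]]  # pretend the table has two segments

# A row-local function gives the same answer whole or segment by segment.
print(np.allclose(density(df), run_per_segment(density, segments)))  # True

# A whole-column function does not: each call only sums its own segment.
print(share_of_total(df))
print(run_per_segment(share_of_total, segments))
```

The second pair of results differs because each call to share_of_total divides by the total of its own segment only.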
Selection#
Using a callable (function) to subset a TenFrame (i.e., placing it in the brackets of a subsetting expression) similarly grabs your function’s source code and makes it part of the query, and works the same way as assign(). In this case, it’s unlikely you’ll be using a function that depends on the whole of the data to decide which rows to select, so the data being broken into segments will almost surely not be an issue.
apply()#
TBD
Anonymous Functions (lambdas)#
Frequently, the functions used in these situations are quick one-off callables defined by the Python lambda keyword, which turns an expression into a function. This is quite convenient for programmers, but can cause some unexpected issues when used across different systems, as in tenFrame, where there is the server side and the client side to consider.
Simply put, the problem is that there’s no reliable way to get the source code for a lambda function in order to upload it to be run on the server side. (Python inspection will only give you the line(s) of code in which it was defined, and finding the lambda expression, let alone the right lambda expression, within those lines is not feasible.) To support lambdas, tenFrame instead “pickles” the lambda function, turning its compiled bytecode into a string of bytes that can be transmitted. Unfortunately, the standard-library pickle module serializes functions only by reference to their importable names, which does not work for lambdas, so tenFrame attempts to pickle the lambda using the dill or cloudpickle library, whichever is available on your local system. If neither is available, attempting to use a lambda expression in this manner will raise a ValueError.
Aside from making sure that the necessary library is available on your system, the other thing to watch out for is a mismatch in versions of Python (or of the pickling library) between client and server, which may leave the server unable to unpickle what your local side pickled. The error message from the server will probably make this situation clear.
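The behavior described above can be sketched with the standard library alone. This is not tenFrame’s actual code: the helper names are hypothetical, and trying dill before cloudpickle is an assumption made for illustration.

```python
import importlib
import pickle


def stdlib_can_pickle(func):
    """Return True if the standard pickle module can serialize func."""
    try:
        pickle.dumps(func)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def find_function_pickler():
    """Return the first available by-value pickler, or raise ValueError."""
    for name in ("dill", "cloudpickle"):  # order is an assumption
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ValueError("no library available that can pickle a lambda")


double = lambda x: 2 * x

# pickle looks functions up by name, and a lambda has no usable name.
print(stdlib_can_pickle(double))  # False

try:
    pickler = find_function_pickler()
    print("lambdas would be pickled with", pickler.__name__)
except ValueError as exc:
    print("cannot ship lambdas:", exc)
```

Both dill and cloudpickle expose a dumps() function that serializes the function body itself, which is why the unpickling side needs a compatible Python version.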
Other Quirks#
Of course, you must remember that the function you are writing is not actually being executed on your system. So if it needs some library that isn’t automatically imported in the server-side environment, you’ll need to be sure to import it in the function, even if you have already imported it into your current client-side session. And if it isn’t available at all on the server side, you’ll need to contact 1010data and have it added.
Another very important quirk is that the local Python process needs to be able to get the source code of your function in order to upload it at all, and sometimes that isn’t possible. Specifically, if you’re using the ordinary Python REPL, the source code isn’t saved, and Python inspection can’t retrieve it. So using functions in this way will be problematic in that environment. It won’t be a problem in Jupyter notebooks, in IPython, or in programs, though.
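A small sketch of that failure mode, using only the standard inspect module. The helper name source_or_none is hypothetical, and the function built with exec() merely stands in for one whose source text was never saved anywhere that inspection can find.

```python
import inspect


def source_or_none(func):
    """Return the source text of func, or None if it cannot be retrieved."""
    try:
        return inspect.getsource(func)
    except (OSError, TypeError):
        return None


# Build a function from a string, so there is no file for inspect to read.
namespace = {}
code = compile("def scale(x):\n    return 2 * x\n", "<generated>", "exec")
exec(code, namespace)
scale = namespace["scale"]

print(scale(21))              # 42: the function itself works fine
print(source_or_none(scale))  # None: but its source is not recoverable
```

A check like this before building a query can turn a confusing upload failure into a clear local error.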
u_funs#
The simplest kind of user-defined functions that can become part of queries in 1010data are called u_funs (as distinct from g_funs, which we will discuss later). These are user-defined functions which take columns (or constants) as their parameters and return a column. To be specific, the function receives and returns the columns involved as numpy arrays of the appropriate underlying type. Because numpy array operators generally operate elementwise, it may seem as if the function is operating on individual values, but it’s really working on arrays and returning an array. You should have some familiarity with numpy in order to use server-side Python.
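To illustrate why it matters that the arguments are arrays rather than single values, here is a sketch in plain numpy, independent of tenFrame. Both function names are hypothetical. A body written as if it received one number fails as soon as it branches on a whole array, while an array-aware body does the same job element by element.

```python
import numpy as np


def floor_at_zero_scalar_style(x):
    """Written as if x were a single number; breaks on a multi-element array."""
    if x < 0:
        return 0.0
    return x


def floor_at_zero(x):
    """Array-aware version: np.where chooses element by element."""
    return np.where(x < 0, 0.0, x)


column = np.array([-1.0, 2.0, -3.0, 4.0])

try:
    floor_at_zero_scalar_style(column)
except ValueError as exc:
    print("scalar-style body failed:", exc)

print(floor_at_zero(column))  # [0. 2. 0. 4.]
```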
It’s also important to understand that the array that the function gets is not (necessarily) the whole column. This gets back to why we’re using 1010data in the first place. A table’s column may literally be billions of rows long, and the server running the program cannot hold it in memory any more than your local machine can. What the function sees and returns is but a single segment of the column(s) involved. Sometimes the whole column may fit into a single segment, and sometimes you may know that in advance, and you can code with the knowledge that whole columns are available to you. Often, your function only depends on values within a single row, so the elementwise computation of numpy arrays will do what you need and how much of the column you have in memory won’t matter. The concept of segments will be particularly important when we come to discuss g_funs.
Here is an example of a simple u_fun:
In [1]: frame = tf.TenFrame(session, "default.test.solar")

In [2]: @frame.def_ufun
  ....: def prod(x, y):
  ....:     return x * y
  ....:

In [3]: frame['voldist'] = prod(frame.vol, frame.dist)

In [4]: frame[['name', 'vol', 'dist', 'voldist']]
Out[4]:
       name           vol          dist       voldist
0       Sun  1.414300e+09  0.000000e+00  0.000000e+00
1   Jupiter  1.431280e+06  7.783000e+08  1.113965e+15
2    Saturn  8.271300e+05  1.427000e+09  1.180315e+15
3    Uranus  6.834000e+04  2.871000e+09  1.962041e+14
4   Neptune  6.254000e+04  4.497100e+09  2.812486e+14
5     Earth  1.083000e+03  1.496000e+08  1.620168e+11
6     Venus  9.280000e+02  1.082000e+08  1.004096e+11
7      Mars  1.630000e+02  2.279000e+08  3.714770e+10
8  Ganymede  7.600000e+01  1.070000e+06  8.132000e+07
9     Titan  7.100000e+01  1.221850e+06  8.675135e+07
Note that to define the u_fun, you just decorate the function with @frame.def_ufun. The function will then be added to the “library” of the frame and become available to be called in that frame. Even though it looks like the function takes a pair of floating-point values and multiplies them, what’s really happening is the function receives two numpy arrays and multiplies them, resulting in another numpy array of the same length. So the function is only called once (this is a single-segment table); all the looping through the elements of the columns is handled by numpy.
Also note that the actual computation did not happen at line 3, just as it doesn’t happen when you type frame['x'] = frame.y + frame.z. Line 3 just adds an operation to the query that specifies what the value of this new column will be if and when it gets evaluated.
This also means something more significant and subtle. When you call the function produced by the @frame.def_ufun decorator on the client side (i.e. in your ordinary Python code), it does not actually do whatever you programmed it to do. Instead, it generates a TenSeries fit to be placed into its TenFrame, whose value is specified by running the code you wrote on the server side.
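The idea can be sketched with a toy decorator. This is not tenFrame’s implementation: ToyFrame and DeferredCall are hypothetical stand-ins, and the toy keeps the function object where tenFrame, per the text above, uploads its source. What it shows is that calling the decorated name records a call without running the body you wrote.

```python
class DeferredCall:
    """Describes a call to run later; it holds no computed values."""

    def __init__(self, func_name, args):
        self.func_name = func_name
        self.args = args

    def __repr__(self):
        return f"DeferredCall({self.func_name!r}, {self.args!r})"


class ToyFrame:
    """Hypothetical stand-in for a frame with a function library."""

    def __init__(self):
        self.library = {}

    def def_ufun(self, func):
        # Keep the user's function in the library for later execution.
        self.library[func.__name__] = func

        def stub(*args):
            # Calling the decorated name only records what should happen.
            return DeferredCall(func.__name__, args)

        return stub


frame = ToyFrame()
executed = []


@frame.def_ufun
def prod(x, y):
    executed.append((x, y))  # lets us observe whether the body ever ran
    return x * y


column = prod("vol", "dist")
print(column)    # DeferredCall('prod', ('vol', 'dist'))
print(executed)  # []
```

The empty executed list confirms that the multiplication was never performed on the client.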
g_funs#
The other kind of user-defined functions you might use are known as “g_funs,” or “group functions.” These are aggregating functions, like those described above and like the builtin 1010data g_funs, and like them require appropriate segmentation of the underlying table. You define them the same way you define u_funs, above, except you use @frame.def_gfun as the decorator. The resulting XML code is also pretty much the same, except that the call to your function is in a <def_gfun> tag instead of <def_ufun>. The server requires, however, that g_funs have names that start with g_, so make sure you name your function accordingly.
You call user-defined g_funs the same way you call any other function on a TenSeries, optionally passing it other TenSeries (or use func), etc. Parameters can also be lists of TenSeries. Because of its g_ prefix, the function will be treated as a g_fun, so you can use it with .groupby() or pass the group= parameter into the function call.
On the server side, the function doesn’t see the group-by information. It just gets numpy arrays for the columns, each array containing exactly one group. As with any grouping function, the user-defined g_fun is expected to return a scalar value for each group.
The select= parameter, however, is not handled by the server machinery before the function sees the data, and interpreting it is up to the user-defined function. A global variable called S_ind is provided in the context of the function on the server side. It is None if no select= was provided, and otherwise is an int32 numpy array containing the indices of all of the “selected” rows.
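A sketch of a g_fun body that honors the selection, runnable locally with numpy. The function name g_spread is hypothetical, the module-level S_ind below only simulates the global that the server provides, and treating its indices as positions within the array the function receives is an assumption not confirmed by the text above.

```python
import numpy as np

# Stand-in for the global that the server is described as providing.
S_ind = None


def g_spread(x):
    """Return max minus min of one group, honoring S_ind when it is set."""
    values = x if S_ind is None else x[S_ind]
    if values.size == 0:
        return float("nan")  # nothing selected in this group
    return float(values.max() - values.min())


group = np.array([3.0, 9.0, 4.0, 1.0])

# No select= given: the whole group is used.
print(g_spread(group))  # 8.0

# Simulate select= having picked rows 0 and 2 of this group.
S_ind = np.array([0, 2], dtype=np.int32)
print(g_spread(group))  # 1.0
```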
lib#
TenFrames have a special .lib member for managing the contents of their internal libraries.
lib.ufuns(self, sigs=False): Returns the names of all the u_funs defined in this TenFrame. If sigs=True, the names also have their arguments listed with them, e.g., ['sin(x)', 'atan(a, b)'].
lib.gfuns(self, sigs=False): Same as lib.ufuns(), except for g_funs.
lib.delfun(self, name): Removes the named u_fun or g_fun from this TenFrame’s library. This removes both the <def_ufun> (or <def_gfun>) op and the <resource> op that contains the function’s definition. Note that this has no effect on the actual definition of any functions locally or on the server side, and is only relevant when it comes to saving the library. It is not an error to try to delete a function that was not defined, and in fact the def_ufun and def_gfun decorators perform lib.delfun() to remove any prior definitions of the function being defined.
lib.getfun(self, name): Returns a string containing the source code for the function named. A KeyError is raised if the function cannot be found in this TenFrame’s library.
Note
These methods generally presume that this TenFrame was created in the usual fashion, as a TenFrame. If you load in a query that was not made with tenFrame with the load= parameter in the TenFrame constructor, it may use different conventions and not conform to what tenFrame expects, which could cause these functions to fail or to behave oddly.
Library#
Library is a subclass of TenFrame, and is essentially a TenFrame with no ops and no TenSeries. All it has is its <library> op (which unfortunately may cause confusion due to nomenclature). Generally, when you add functions into a TenFrame (usually using the @def_ufun or @def_gfun decorators, see above), they get inserted into the TenFrame’s <library>, which is a section of the macrocode that precedes the normal query operators. This way, the functions are all defined right with the query in which they appear, and it is self-contained and not dependent on anything else. And sometimes this is a good thing. But sometimes it’s preferable to separate out the functions and place them in their own place, and have queries that use them import them in. This allows for code reuse and code sharing, updating a common function in only one place, etc. For this purpose, the Library subclass provides a “query-free” kind of TenFrame that can be used just to store and load in functions.
The Library subclass does not have much code; it mostly just inherits from TenFrame. Much of what it inherits is useless, because TenFrames are mostly about analyzing their columns, and Libraries don’t have columns. There are just a few overrides, which are really just wrappers around the inherited methods, providing subtly different behavior that’s more in line with what would be expected of libraries.
__init__(self, session, path): Initialized just like a normal TenFrame, except the path isn’t the name of the table that will form the basis of the query. Instead, it’s the path to a saved library, saved as a QuickQuery in the 1010data object tree. This saved library is loaded in and parsed and its library (only!) retained. Essentially, the Library constructor calls the TenFrame constructor with load=path and throws away any ops (outside of the <library>). It is not an error to construct a Library with a path that does not currently exist; you just get an empty library. Also, unlike the TenFrame constructor, the Library constructor “runs” the Library immediately, thereby getting the functions defined in the library into the server-side environment. Libraries should run quickly (except for startup time of the server-side Python engine the first time) and generally are run when things change them, since those changes need to be reflected in the server-side environment.
def_ufun(), def_gfun(): Wrappers around the TenFrame decorators, with the difference that they run the Library immediately, to define the function on the server side. (You may need to run session.clearCache(), particularly if you’re redefining functions.)
save(self, path=None, *args, **kwargs): Wraps around TenFrame’s save() method, but has three differences:
    The path is optional, and defaults to the path given in this Library’s constructor call.
    The materialize option is silently ignored.
    The force option defaults to True. The assumption is that you might open a Library, add some stuff to it, and then just libr.save() to save it back in its place.
Everything else: Since Libraries have no columns, it doesn’t make sense to have them return TenSeries when accessing an attribute (or an item, for that matter). Libraries are best considered as containers for functions, so library.xyz is exactly the same as func.xyz (see the func element). So is library['xyz'].