Appendix A: Missing values
Most databases contain at least some missing data values. The values may be missing due to non availability, non applicability, data processing errors, or even intentional omission. Missing values are also referred to as "N/A" values (for "not available" or "not applicable").
N/A values can also be created as a result of certain analyses. Here are some examples:
- A cross tabulation produces results for every combination of values from the specified grouping columns. If there is no data for certain combinations, those results are N/A. For example, suppose we are analyzing a company's sales. The database shows the details of each transaction, including the product sold, the location in which it was sold, and dollar value of the sale. What we would like to know is the average sale size in dollars for each product in each location, a natural cross tabulation. But suppose that a particular product is not sold in a particular location. The average sale size for that product and location is N/A and shows as blank in the result of the cross tabulation.
- When tables are linked, if a particular row in the current table has no match in the "foreign" table, the row is padded with N/A values. For example, suppose we are looking at employee data and we have two tables that give information about each employee. The first shows the address for each employee and the second shows salary-related information. If we were looking at the address table and linked in the salary table, we would have a table that showed both address information and salary information. But suppose that, for some reason, there is no salary information for a particular employee; the salary for that employee shows as blank (N/A) in the combined table.
- Computed columns can have N/A values. This usually happens when one or more of the referenced columns contain N/As, but certain computations can produce N/A results even if the inputs are not N/A.
Accurate analysis requires understanding the nature and source of N/A values and the appropriate ways of dealing with them. The following sections describe how N/A values are represented in the system and how they are handled in selection and value expressions and tabulations. Some additional techniques for handling such values are discussed as well.
Representing N/A values
The first thing we should say is that missing values aren't really missing. That may sound like a contradiction in terms, but here is what we mean: When you see a blank value, it isn't that there is no value there at all, rather there is a special value that indicates that meaningful data is missing. The special value may display as blank, but in the database there really is a value. That is why we prefer to refer to a missing value as an N/A value.
Special values are commonly used to represent missing or meaningless data, although different people use different values. The number 999, for example, it often used as a numeric N/A value, especially in a column containing whole numbers. Take the table:
Since it is unlikely that anyone will actually be 999 years old, it is safe to use 999 as the N/A value for age. 999 also has the nice property that it is easily recognizable. Similarly, values like 999.99 or 999.999 are often used for decimal numbers.
In the 1010data Insights Platform, we
use our own special values. For example, in columns that contain text, we use
''
(the empty string). In numeric columns, we use one of two other
values, depending on whether the column contains whole numbers or decimal numbers. What
these values are isn't really important; what is important is that they are treated
differently than other values.
- They display as blank.
- They are treated specially in selection and value expressions and tabulations. For
example, if a computed column's value expression is
col1+col2
andcol1
(orcol2
) is N/A on a particular row, the computed column will also be N/A on that row. See the following sections about N/A handling in selection and value expressions and tabulations.
Because the system automatically treats these values specially, we try to use them to
represent N/As whenever we can. When loading data into a table, if the source data contains
other types of N/A values (999, 999.99,
NA, etc.), we usually convert those values into our N/A values. We
do not convert such values unless the technical documentation for the data
specifically says the values are in fact N/A values or if it is patently obvious that is the
case. It is therefore entirely possible that some columns in some tables may contain 999,
for instance. Please be aware of this when doing computations; if col1
has
999s in it, and you compute col1+col2
, the result may not be terribly
meaningful!
Looking for N/A values
Before using a database, it is only prudent to see how many N/A values there are in various columns. There are many ways of doing this.
You can use the Select rows panel to select rows where a particular column is N/A. For example:
Or you can tabulate on the column to get all its unique values and look for N/A in the result.
But perhaps the most convenient method is to use the tabulation summary designed for this very purpose: # invalid [COLUMN]. In a quick summary, this allows you to determine the number of N/As in each of several columns in one operation and, in a tabulation or cross tabulation, to determine the number of N/As for each group. Related summarization methods include # valid [COLUMN], # valid pairs [COLUMN], and # invalid pairs [COLUMN].