Removing duplicate rows from a table

Sometimes a table can contain multiple copies of the same information. Whether these duplicates were original to the table, or they occurred after combining data sets, you can easily locate and remove all of the duplicate rows from a table.

Difficulty

Objective

You merged two tables together and noticed that there are now some rows that contain the same information. You want to remove all of these duplicate rows from your worksheet.

Solution

<base table="pub.doc.gov.area.x2003"/>
<merge table2="pub.doc.gov.county.x2003"/>
<tabu label="Tabulation on Worksheet" breaks="msa,countyname,rent50_0">
  <tcol source="rent50_0" fun="first" label="First Rent"/>
</tabu>
<link table2="pub.doc.gov.area.x2003" col="msa,countyname,rent50_0" 
col2="msa,countyname,rent50_0" type="select">
  <merge table2="pub.doc.gov.county.x2003"/>
</link>

Alternate solution

<base table="pub.doc.gov.area.x2003"/>
<merge table2="pub.doc.gov.county.x2003"/>
<willbe name="test" value="g_cnt(msa countyname rent50_0;)>1"/>
<sel value="g_first1(msa countyname rent50_0;;)"/>

Discussion

Sometimes tables contain duplicate information. Additionally, when you merge tables, the resulting worksheet can contain duplicate information. This can cause problems when running analyses and could skew your results. This recipe discusses two ways to locate and remove duplicate information from two merged tables.

When finding duplicates, you need to establish the columns that uniquely identify each row. For example, a table containing sales information could have rows that can be uniquely identified by the transaction ID and the item SKU. In this recipe, rows are uniquely identified by the columns msa, countyname, and rent50_0.

The first solution uses a tabulation where the aforementioned columns are used for grouping and the first entry in each group is selected. Then the original merged worksheet is linked in and all of the duplicate rows are no longer present.

Alternatively, you can use g_cnt(G;S) to discover if there are any duplicates present in the table using the unique identifiers and then use g_first1(G;S;O) to select only the first instance. This method is particularly useful if your table is segmented.