Weighted Least Squares Regression

In this example, a weighted least squares regression is applied to a data set containing weighted census data to show the relationship between both the age and education level of a worker and that person's income.

The weighted least squares regression, using the 1010data function g_wlsq(G;S;Y;W;XX), is applied to the Census Income Data Set, which contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau.

The analysis uses the following 2 variables in that data set as predictors:
  • age
  • edu_year

It also uses the square of the age, which we calculate in this tutorial.

For the weight, we will use the column instance_weight, which represents how each person in the survey relates demographically to the overall population.

As a response, the column wage_per_hour is used.

After applying the weighted least squares technique, the results show the linear relationship between both the age and education level of the worker and that person's wages per hour.

This analysis will follow the following steps:
  • Select only those rows where wages are not equal to zero, since we only want to do the regression for those people who have a job.
  • Check the relationship of each of the predictors to the response and adjust for those that have a quadratic form.
  • Fit the model on the three predictors and the response.
  • Obtain the predicted value of the linear model.
  • Obtain the coefficients of the linear model.
  • Obtain the p-values of the coefficients.
  • Obtain various statistics for the model such as the degrees of freedom, residual sum of squares, mean squared error, and number of observations.
  • Calculate the AIC.
  • Visualize the results of the weighted least squares regression by plotting the age against both the wages per hour and the predicted value of the linear model.
  1. Open the Census Income data set (pub.demo.mleg.uci.censusincome).

  2. Select only those rows where wage_per_hour is not equal to 0, since we only want to do the regression for those people who have a job.
    <sel value="(wage_per_hour<>0)"/>
  3. Since we're using age as a predictor, let's see if it has a linear relationship to wage_per_hour, our response.
    1. For visualization purposes, we can look at the relationship between the average wage per hour and age.

      Let's first calculate the average wage per hour grouping by age. We'll use the function g_avg(G;S;X) and we'll specify the age column for the G parameter (we're grouping by age) and wage_per_hour as the X parameter (since we want to calculate the average wage per hour). We'll omit the S parameter since we want to consider all rows in the table.

      <willbe name="avg_wage_per_hour_age" value="g_avg(age;;wage_per_hour)" 
      format="dec:2"/>
    2. Click Chart > Line.
    3. Drag the age column to the DATA (X-AXIS) area.
    4. Drag the avg_wage_per_hour_age column to the DATA (Y-AXIS) area.
    5. Click Update.

      Since we can see from this chart that the relationship between age and wage_per_hour has a quadratic form, we will include the square of the age as one of the predictors.

  4. Let's create a column containing the square of the value in the age column, which we will use as one of the predictors.
    <willbe name="age_sq" value="age^2"/>
  5. Since we're also using edu_year as a predictor, let's see if it has a linear relationship to wage_per_hour.
    1. For visualization purposes, we can look at the relationship between the average wage per hour and years of education.

      Let's first calculate the average wage per hour grouping by years of education. We'll use the function g_avg(G;S;X) and we'll specify the edu_year column for the G parameter (we're grouping by years of education) and wage_per_hour as the X parameter (since we want to calculate the average wage per hour). We'll omit the S parameter since we want to consider all rows in the table.

      <willbe name="avg_wage_per_hour_edu_year" value="g_avg(edu_year;;wage_per_hour)" format="dec:2"/>
    2. Click Chart > Line.
    3. Drag the edu_year column to the DATA (X-AXIS) area.
    4. Drag the avg_wage_per_hour_edu_year column to the DATA (Y-AXIS) area.
    5. Click Update.

      We can see from this chart that the relationship between edu_year and wage_per_hour has a linear form.

  6. Now, let's show only those columns that we are going to use in the model (as well as those we used in the charts from the previous steps).
    <colord cols="avg_wage_per_hour_age,avg_wage_per_hour_edu_year,
    wage_per_hour,age,edu_year,instance_weight,age_sq"/>

  7. Next, we fit our model on the three predictors as well as the response that we have selected using the g_wlsq(G;S;Y;W;XX) function.
    <willbe name="model_wlsq" value="g_wlsq(;;wage_per_hour;instance_weight;1 age age_sq edu_year)"/>
    Note: As the first element of XX, we specify the special value 1 for the constant (intercept) term in the linear model.

    This creates a column named model_lsq that contains the results of the least squares regression:

    Clicking on the > opens a window containing a summary of the model output:

  8. We can then obtain the predicted value of the linear model using the score(XX;M;Z) function.
    <note>OBTAIN MODEL SCORE WITH SCORE FUNCTION</note>
    <willbe name="pred" value="score(1 age age_sq edu_year;model_wlsq;)" format="dec:7"/>
    Note: We specify format="dec:7" so that our results show with 7 decimal places.

  9. To obtain the coefficients of the linear model, we can use the param(M;P;I) function.

    In the following Macro Language code, we obtain the parameters for the intercept (b0) and the first two variables (b1 and b2).

    <note>OBTAIN MODEL COEFFICIENTS</note>
    <willbe name="b0" value="param(model_wlsq;'b';1)" format="dec:7"/>
    <willbe name="b1"value="param(model_wlsq;'b';2)" format="dec:7"/>
    <willbe name="b2" value="param(model_wlsq;'b';3)" format="dec:7"/>

  10. We can also obtain the p-values of the coefficients using the param(M;P;I) function. We will use these to conduct the variable selection later in the analysis.

    In the following Macro Language code, we obtain the p-value for the intercept (p0) and the first two variables (p1 and p2).

    <note>OBTAIN P-VALUES</note>
    <willbe name="p0" value="param(model_wlsq;'p';1)"/>
    <willbe name="p1" value="param(model_wlsq;'p';2)"/>
    <willbe name="p2" value="param(model_wlsq;'p';3)"/>

  11. However, one might want to obtain all of the coefficients in one column and the p-values in another, which can be achieved with the following Macro Language code:
    <note>CALCULATE ALL COEFFICIENTS IN ONE COLUMN AND P-VALUES IN ANOTHER</note>
    <willbe name="var_names_1" value="'intercept,age,age_sq,edu_year'"/>
    <willbe name="temp_i_1" value="mod(i_(1);4)"/>
    <willbe name="i_1" value="if(temp_i_1=0;4;temp_i_1)"/>
    <willbe name="b_1" value="param(model_wlsq;'b';i_1)" format="dec:7"/>
    <willbe name="p_1" value="param(model_wlsq;'p';i_1)"/>
    <willbe name="var_name_1" value="csl_pick(var_names_1;i_1)"/>

    Note: The number of coefficients and p-values we obtain from the linear model corresponds to the number of variables in our analysis. So, for our example, we obtain 4 coefficients and 4 p-values, which correspond to the intercept plus the 3 predictors.
  12. Using the param(M;P;I) function, we can obtain various statistics for this model such as the degrees of freedom of the model, residual sum of squares, mean squared error, number of observations, average of Y, R2, and adjusted R2.
    <note>OBTAIN VARIOUS MODEL STATISTICS</note>
    <willbe name="dof" value="param(model_wlsq;'df';)"/>
    <willbe name="sse_temp" value="(wage_per_hour-pred)^2"/>
    <willbe name="sum_sq_resids" value="g_sum(;;sse_temp)" format="dec:7"/>
    <willbe name="mean_sq_err" value="sum_sq_resids/dof" format="dec:7"/>
    <willbe name="num_observations" value="param(model_wlsq;'valcnt';)"/>
    <willbe name="avg_y" value="param(model_wlsq;'ybar';)" format="dec:7"/>
    <willbe name="R_squared" value="param(model_wlsq;'r2';)" format="dec:7"/>
    <willbe name="adjusted_R_squared" value="param(model_wlsq;'adjr2';)" format="dec:7"/>

  13. We can calculate the AIC using the following Macro Language code:
    <note>CALCULATE AIC</note>
    <willbe name="sum_log_likelihood" value="loge(sum_sq_resids/num_observations)"/>
    <willbe name="num_of_var" value="3"/>
    <willbe name="AIC" value="2*num_of_var+num_observations*sum_log_likelihood"/>

  14. Finally, we can visualize the results of the weighted least squares regression using the 1010data Chart Builder.
    1. Click Chart > Scatter.
    2. Drag the age column to the DATA (X-AXIS) area.
    3. Drag the wage_per_hour and pred columns to the DATA (Y-AXIS) area.
    4. Click Update.

    The results should look similar to the following chart:

If you want to see how to calculate the standard error of the coefficients or how to chart the residual plot, QQ plot, PP plot, or how to plot the predicted value against the original response, see the Least Squares Regression tutorial.