Frequently Asked "Pointed" Questions (FAQ)

"Isn't shrinkage in regression a dead topic?  I haven't seen any new papers in years!!!"

I don't know of any recent papers critical of shrinkage methods. Technometrics continues to publish important articles on shrinkage in regression, such as Burr and Fry (2005). Of course, there's also the exciting "Least Angle Regression" work of Efron, Hastie, Johnstone and Tibshirani (2004) in the Annals of Statistics, which qualifies as shrinkage at least when the LAR beta vector ultimately ends up shorter than the vector of least-squares estimates.

Frank and Friedman (1993) and Breiman (1995) express great confidence in "cross-validation" methods for shrinkage estimation. Although Mallows (1995) observes that minimizing Cp to pick a regressor subset can be misleading in situations that aren't "clear-cut," he apparently still recommends calculating Cp while shrinking along "smooth" paths. See also: Tibshirani (1996), Fu (1997) and LeBlanc and Tibshirani (1998).
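
For readers who want to see the quantity being minimized, here is a minimal sketch (Python with NumPy; the function and variable names are mine, not taken from any of the cited software) of Mallows' Cp for a candidate subset of regressors, using the full-model residual variance as the estimate of sigma-squared:

    import numpy as np

    def mallows_cp(X_full, X_subset, y):
        # Full-model residual variance serves as the estimate of sigma^2.
        n = len(y)
        beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
        sse_full = np.sum((y - X_full @ beta_full) ** 2)
        s2 = sse_full / (n - X_full.shape[1])
        # Residual sum of squares for the candidate subset.
        beta_sub, *_ = np.linalg.lstsq(X_subset, y, rcond=None)
        sse_sub = np.sum((y - X_subset @ beta_sub) ** 2)
        p = X_subset.shape[1]
        # Mallows' Cp = SSE_p / s^2 - n + 2p.
        return sse_sub / s2 - n + 2 * p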

"Aren't some of the early shrinkage/ridge methods still considered rather controversial?"

In a word: Yes! But the great, subjective "passions" (both for and against ridge methods) of the 1970s are now muted if not forgotten. In my opinion, the keys to avoiding controversy are (1) to use statistical inference to decide how much shrinkage of what type to perform and (2) to be rather conservative, as stressed by Burr and Fry (2005). For example, my computing algorithms stress maximum likelihood methods under normal distribution theory. See my "Shrinkage" pages for more details on this.

"How can I form confidence intervals for shrinkage estimates?"

A reasonable (and simple!) approach is to use classical confidence intervals, centered at the least-squares estimates and computed using your favorite statistics package; see Obenchain (1977). In other words, even though point estimates of effects change as shrinkage is imposed, there really is no basis in "classical" statistical theory for either shifting the location or changing the width of interval estimates. In fact, a shrunken estimate can look quite different, numerically, from the least-squares solution without being significantly different, statistically. (Obviously, you don't want to shrink so much that your point estimate ends up OUTSIDE your reported interval!)
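
For concreteness, here is a minimal sketch (Python with NumPy and SciPy; the helper name ols_intervals and the data objects X, y and beta_shrunk are hypothetical) of classical t-based intervals centered at the least-squares estimates, together with the "stay inside the interval" sanity check just mentioned:

    import numpy as np
    from scipy import stats

    def ols_intervals(X, y, level=0.95):
        # Classical intervals: beta_hat +/- t * (standard error).
        n, p = X.shape
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        s2 = resid @ resid / (n - p)
        se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
        t = stats.t.ppf(0.5 + level / 2.0, df=n - p)
        return beta - t * se, beta + t * se

    # Sanity check: shrunken point estimates should stay INSIDE
    # the reported least-squares intervals.
    # lower, upper = ols_intervals(X, y)
    # assert np.all((beta_shrunk >= lower) & (beta_shrunk <= upper))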

If you feel you ABSOLUTELY MUST have an interval centered near or at your shrunken estimate, you are going to have to use either bootstrap resampling, Vinod (1995), or Bayesian methods. Highest Posterior Density (HPD) intervals add "information" from your prior (centered at zero) to that from your sample, characterizing your shrunken estimates as "unbiased" compromises between prior and sample information.
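
If you do go the bootstrap route, a simple case-resampling, percentile version looks roughly like the sketch below; the fixed ridge constant k, the function name and the defaults are my own illustrative choices, not a prescription from Vinod (1995):

    import numpy as np

    def bootstrap_ridge_interval(X, y, k, n_boot=2000, level=0.95, seed=0):
        # Percentile bootstrap for ridge coefficients at a FIXED ridge
        # constant k, resampling (X, y) cases with replacement.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        draws = np.empty((n_boot, p))
        for b in range(n_boot):
            idx = rng.integers(0, n, size=n)
            Xb, yb = X[idx], y[idx]
            draws[b] = np.linalg.solve(Xb.T @ Xb + k * np.eye(p), Xb.T @ yb)
        alpha = (1.0 - level) / 2.0
        lower = np.percentile(draws, 100 * alpha, axis=0)
        upper = np.percentile(draws, 100 * (1.0 - alpha), axis=0)
        return lower, upper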

"How can a so-called OPTIMAL shrinkage estimator be inferior to a so-called GOOD shrinkage estimator?"

"Optimal" shrinkage estimators attempt to minimize a single (scalar valued) measure of overall MSE risk. "Good" shrinkage estimators are simply those that are better than Ordinary-Least-Squares (OLS) ...but they have to dominate OLS in EVERY (matrix valued) MSE sense. So good shrinkage estimators generally do much less shrinkage (are much closer to OLS, numerically) than optimal shrinkage estimators. In fact, a useful guideline is provided by the "2/p-ths rule-of-thumb," Obenchain(1978), where p=Rank(X). Namely, in terms of the MCAL measure of extent-of-shrinkage, the upper-limit on good shrinkage extents is only 2/p-ths of the extent of shrinkage most likely to the MSE optimal. For example, p=6 for the Longley data, and MCAL = 4 along the Q-shape = -1.5 path is most likely to be MSE optimal; thus good shrinkage estimates tend to be limited to MCAL of no more than 2*4/6 = 1.33 ...which is confirmed by the corresponding excess eigenvalue and inferior direction TRACES for the Q-shape = -1.5 path.

"Why not simply use either Stein-like or minimum-estimated-risk rules?

Minimax rules tend to do so little shrinkage that they are almost indistinguishable, for all practical purposes, from least squares. Minimum estimated risk rules, like those of Mallows (1973), can shrink quite aggressively. This can lead to a big reduction in MSE risk in "favorable" cases, but aggressive shrinkage can also lead to even bigger MSE risk penalties in "unfavorable" cases. Maximum likelihood approaches represent a sort of "middle ground" between these "extremes." They reduce risk by only about 50% even in the most favorable cases ...where the risk could be reduced 100% by shrinking all of the way to ZERO. But they also tend to increase MSE risk by at most 25% when truly unfavorable cases are encountered (i.e., when shrinkage factors in the .8 to .9 range are MSE optimal). See Gibbons (1981) and Obenchain (1996) for more on this.
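
To see why aggressive shrinkage can backfire in an "unfavorable" case, consider the one-coordinate toy calculation below (purely illustrative; it is not the multi-parameter risk calculation of the cited papers). The MSE of delta * betahat is a variance term plus a squared-bias term, and when the optimal factor is near 0.9, shrinking hard inflates the risk well above that of least squares:

    def scalar_shrinkage_mse(delta, beta, var):
        # MSE of delta * betahat when betahat ~ N(beta, var):
        # variance term + squared-bias term.
        return delta**2 * var + (1.0 - delta)**2 * beta**2

    # Unfavorable case: beta = 3, var = 1, so the optimal factor is
    # beta^2 / (beta^2 + var) = 0.9 and least squares has MSE = 1.
    # scalar_shrinkage_mse(0.9, 3, 1) -> 0.90  (a modest 10% gain)
    # scalar_shrinkage_mse(0.0, 3, 1) -> 9.00  (shrinking to zero: 9x the risk)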