Once I start shrinking, where do I stop???

Yes, this may well be **the** question!!! This is probably
why shrinkage methods in regression have been considered __highly
controversial__ for much of the last 40 years. A brief review
of the history of **ridge regression** may reveal some
potential "root causes" for this controversy...

Relatively widespread interest in ridge regression was
initially sparked by Hoerl and Kennard(1970) when they suggested
plotting the elements of their shrinkage estimator of regression
beta-coefficients in a graphical display called the **RIDGE
TRACE.** They observed that the relative magnitudes of the
fitted coefficients tend to __stabilize__ as shrinkage occurs.
And they over-optimistically implied that it is "easy" to
pick an extent of shrinkage, via visual examination of that
coefficient trace, that achieves lower MSE risk than least
squares.
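To make the idea of a coefficient trace concrete, here is a minimal sketch in Python with NumPy. It uses ordinary ridge regression on standardized predictors, with the usual ridge constant `k` as the shrinkage knob. The function name `ridge_trace`, the simulated data, and the grid of `k` values are all illustrative assumptions, not anything taken from Hoerl and Kennard(1970).

```python
import numpy as np

def ridge_trace(X, y, ks):
    """Return an array of shape (len(ks), p) of ridge coefficients, one row per k.

    X is centered and scaled to unit-length columns and y is centered,
    so that X'X is the predictor correlation matrix and the coefficients
    are on a comparable (standardized) scale.
    """
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    yc = y - y.mean()
    p = Xs.shape[1]
    G = Xs.T @ Xs   # correlation matrix of the predictors
    g = Xs.T @ yc
    # Ordinary ridge: solve (X'X + k I) b = X'y for each k; k = 0 is least squares.
    return np.array([np.linalg.solve(G + k * np.eye(p), g) for k in ks])

# Hypothetical, nearly collinear data (for illustration only).
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

ks = np.array([0.0, 0.01, 0.1, 1.0, 10.0])
trace = ridge_trace(X, y, ks)
for k, b in zip(ks, trace):
    print(f"k = {k:6.2f}  coefficients = {np.round(b, 3)}")
```

Plotting each column of `trace` against `ks` gives the classic display. The sketch only draws the path; it says nothing about where along that path to stop, which is exactly the hard part.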

Today, we **know** it just ain't easy. Even the relatively
"tiny" amount of shrinkage that results using
James-Stein-like estimators [Strawderman(1978),
Casella(1981,1985)] when R-squared is large (greater than .8,
say) is already "too much" to allow them to dominate
least squares in any __matrix-valued MSE risk sense__. In
other words, least squares is **admissible** in all __meaningful
(multivariate) senses__ [Brown(1975), Bunke(1975)].

Ok, so Hoerl-Kennard were wrong about **guaranteed** risk
improvements. But, in the eyes of their major critics, this
probably wasn't their **BIG** mistake! No, they were
unabashedly telling regression practitioners to... **LOOK AT
THEIR DATA** (via that trace display) ...before subjectively
"picking" a solution. And all "purists"
certainly consider any tactic like this a real **NO! NO!**

In all fairness, almost **everybody** does this sort of
thing in one way or another. Regression practitioners are
constantly being encouraged to actively explore many different
potential models for their data. Some of these alternatives change
the functional form of the model, drop relatively uninteresting
variables, or set aside relatively influential observations. But
who tells those practitioners that it can be __misleading__ to
simply report the least squares estimates and confidence
intervals for that __final model__ as if it were the **only
model** they ever even considered?

In other words, shrinkage/ridge methods have served as a
convenient "whipping-boy" for all sorts of statistical
practices that are __questionable__ simply because they are
based mainly upon heuristics.

My implementations of shrinkage/ridge regression
algorithms skirt the above somewhat delicate issues by providing
theoretically sound and **objective** (rather than subjective)
criteria for deciding which path to follow and where to stop
shrinking along that path. My algorithms use a normal-theory, **maximum likelihood**
formulation to quantify the effects of shrinkage. Simulation studies suggest that
my 2-parameter (Q-shape & M-extent) approach can work
very, very well in practice [Gibbons(1981)].

My shrinkage algorithms also display a **wide spectrum** of "trace"
visualizations. For example, they display traces of scaled (relative) MSE risk
estimates for individual coefficients...

traces of "excess eigen-value" estimates (with at most one negative estimate)...

traces of "inferior direction cosine" estimates (for the one negative eigen-value above)...

and even traces of multiplicative shrinkage factors (not shown) as well as the more traditional traces of shrunken coefficients.

Note that the __horizontal axis__ for all of these traces
is the **Multicollinearity Allowance** parameter, 0
<= MCAL <= p. This MCAL can usually be interpreted as the
approximate __rank deficiency__ in the predictor variables
X-matrix. Displays of fitted coefficients in least angle regression (LAR)
use a horizontal axis scaling quite similar to MCAL!
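As a rough illustration of why MCAL behaves like a rank deficiency, the sketch below assumes the definition commonly used for generalized ridge estimators: MCAL = p minus the sum of the multiplicative shrinkage factors applied along the principal axes of the predictors. It also assumes the ordinary ridge special case, where the factor for an axis with eigenvalue `lam` is `lam / (lam + k)`. Neither formula is stated in the text above, and the function names and simulated data are hypothetical.

```python
import numpy as np

def shrinkage_factors(X, k):
    """Ordinary-ridge shrinkage factors along the principal axes of X.

    Assumption: delta_i = lam_i / (lam_i + k), where lam_i are the
    eigenvalues of the predictor correlation matrix.
    """
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.sqrt((Xc ** 2).sum(axis=0))
    lam = np.linalg.eigvalsh(Xs.T @ Xs)
    return lam / (lam + k)

def mcal(X, k):
    """Assumed definition: MCAL = p - sum of the shrinkage factors."""
    delta = shrinkage_factors(X, k)
    return delta.size - delta.sum()

# Hypothetical data: three predictors, two of them nearly identical,
# so the X-matrix is close to being rank deficient by one.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

for k in [0.0, 0.05, 1.0, 100.0]:
    print(f"k = {k:7.2f}  MCAL = {mcal(X, k):.3f}")
```

Under these assumptions MCAL is 0 at the least squares solution and approaches p as shrinkage becomes total. A modest `k` shrinks the one ill-conditioned direction almost completely while leaving the other two nearly untouched, which puts MCAL near 1, the approximate rank deficiency of this X-matrix.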

Why should these trace displays be of interest to __YOU__?
Because... **OH!, WHAT A DATA ANALYTIC STORY THEY CAN TELL!** In
Obenchain(1980), I called this the **"see power" of
shrinkage in regression**.

For explanations of (and interpretations for) these
sorts of traces, see Obenchain(1984, 1995); for the underlying
maximum-likelihood estimation theory, see Obenchain(1978).