\documentclass[10pt,twoside,fleqn]{article}
\usepackage[margin=2cm]{geometry}
\usepackage{graphicx}
\usepackage{multirow}
\usepackage[displaymath, mathlines]{lineno}
\usepackage{titlesec}
\usepackage{color}
\usepackage{hyperref}
\usepackage{amsthm}
\usepackage{amsmath}
\usepackage{booktabs}
\usepackage{url}
\usepackage{amssymb}
\usepackage{epsfig}
\usepackage{enumerate}
\usepackage{fancyvrb}
\usepackage{epigraph}
\newtheorem{theorem}{Theorem}[section]
\newtheorem{lemma}{Lemma}[section]
\newtheorem{corollary}{Corollary}[section]
\newcommand{\Dconv}{\overset{\mathcal{D}}{\longrightarrow}}
\newtheorem{proposition}{Proposition}[section]
\theoremstyle{definition}
\newtheorem{definition}{Definition}[section]
\newtheorem{remark}{Remark}[section]
\numberwithin{table}{section}
\numberwithin{equation}{section} 
%\numberwithin{theorem}{section}
\numberwithin{remark}{section}
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
\newcommand{\argmin}{\operatornamewithlimits{argmin}}
\long\def\symbolfootnote[#1]#2{\begingroup%
\def\thefootnote{\fnsymbol{footnote}}\footnote[#1]{#2}\endgroup} 
\hypersetup{
  colorlinks, linkcolor=blue, urlcolor=blue, citecolor=blue
}
\titlelabel{\thetitle.\quad}
%\linenumbers
\usepackage[font=small,format=plain,labelfont=bf,up]{caption}
\setlength{\mathindent}{0cm}
\renewcommand\linenumberfont{\normalfont\bfseries\small}
\renewcommand{\baselinestretch}{1}

\setcounter{page}{1}
\pagestyle{myheadings}

\thispagestyle{empty}

\markboth{\footnotesize \emph{International Journal of Advanced Statistics and Probability}}{\footnotesize \emph{International Journal of Advanced Statistics and Probability}}
\date{}
\begin{document}

{\renewcommand{\arraystretch}{0.65}
\begin{table}[ht]
\begin{tabular}{ll}
\multirow{8}{*}{\includegraphics[width=1.4cm]{logo}}&\\&{\scriptsize\emph{\textbf{International Journal of Advanced Statistics and Probability},  2 (x) (2014) xxx-xxx}}\\
 &{\scriptsize\emph{\copyright Science Publishing Corporation}}\\
 &{\scriptsize\emph{www.sciencepubco.com/index.php/IJASP}}\\
 &{\scriptsize\emph{doi: }}\\
 &{\scriptsize\emph{Research paper}}
\end{tabular}
\end{table}}
%\renewcommand{\arraystretch}{1}
\centerline{}
\centerline{}
\centerline{}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\centerline {\huge{\bf Certain Effects of Uncertain Models}}

\centerline{}

%\centerline{\huge{\bf Title second line}}

%\centerline{}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%% My definition
\newcommand{\mvec}[1]{\mbox{\bfseries\itshape #1}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\centerline{\bf {Brian Knaeble$^{1*}$}}

\centerline{}
{\small
\centerline{\emph{$^{1}$University of Wisconsin--Stout }}

%\centerline{\emph{$^{2}$Affiliation of the second author-\textbf{delete if identical with the first author}}}

%\centerline{\emph{$^{3}$Affiliation of the third author-\textbf{delete if identical with the first and second author}}}

\centerline{\emph{*knaebleb@uwstout.edu}}}

\centerline{}
\centerline{}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent \hspace{-3 pt}{\scriptsize \textbf{Copyright \copyright\ 2014 Author. This is an open access article distributed under the \href{http://creativecommons.org/licenses/by/3.0/}{Creative Commons Attribution License}, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\smallskip

\noindent
\hrulefill


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent \textbf{Abstract}\\
\centerline{}
Statistical summaries of multiple regression analyses often state conclusions as if model uncertainty were of little concern.  The error due to a mis-specified model, however, can be larger in practice than the sampling error associated with commonly reported statistics.  The true effect of an explanatory variable may be opposite that indicated by a fitted coefficient of a linear model, even if the model is well fit and the coefficient is deemed statistically significant.  Here we study the sensitivity of the sign of a fitted coefficient to changes in the model structure.  As a consequence of the principle of least squares, we show that, in general, a set of covariates with a relatively weak coefficient of determination cannot reverse the sign of a relatively strong fitted coefficient of a linear model that has been fit with a regression matrix having orthogonal columns.  A consequence of the theory is a necessary condition for Simpson's paradox.\\
\centerline{}
\noindent {\footnotesize \emph{\textbf{Keywords}}:  \emph{confounding, least squares, model uncertainty, regression, sensitivity analysis}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent\hrulefill
%=============================
\section{Introduction}
%=============================
We start with a simple example that is meant to demonstrate the central problem of this paper.  The particular example has been chosen for illustrative purposes, and it has been built from readily accessible data.  The example highlights a danger of exploratory analysis specifically, and strengthens existing awareness of the difficulties associated with interpretation of observational data generally.  It is a striking example of how different models, each well fit to the same data, can lead to statistically significant yet opposite conclusions.  The example clearly demonstrates the need for more general theory as requested by Chatfield \cite{Chatfield95}.

The ``Swiss'' data set within R contains 47 observations on 6 variables: county-level measurements of fertility, agriculture, education, Catholicism, infant mortality, and examination scores.  A model of fertility in terms of agriculture alone indicates a positive, statistically significant effect of agriculture on fertility.  A model of fertility in terms of agriculture, education, and Catholicism, indicates a negative, statistically significant effect of agriculture on fertility.  The details are provided in Table \ref{table1} and Table \ref{table2}.  
\begin{table}[hb]
\caption{Fertility as a linear function of agriculture.}
\centering
\begin{tabular}{lcccc}
\hline
variable  & $\hat{\beta}_i$ & $\text{SE}(\hat{\beta}_i)$ & $t=\hat{\beta}_i/\text{SE}(\hat{\beta}_i)$ & two-sided $p$-value\\
\hline
\textbf{agriculture}	&$0.194$& $0.076$ & $2.532$ & $0.015$ \\
\hline
\end{tabular}
\label{table1}
\end{table}
\begin{table}[ht]
\centering
\caption{Fertility as a linear function of agriculture and covariates.}
\begin{tabular}{lcccc}
\hline
variable  & $\hat{\beta}_i$ & $\text{SE}(\hat{\beta}_i)$ & $t=\hat{\beta}_i/\text{SE}(\hat{\beta}_i)$ & two-sided $p$-value\\
\hline
\textbf{agriculture}	&$-0.203$& $0.071$ & $-2.854$ & $0.007$ \\
education	&$-1.072$  & $0.156$ & $-6.881$ & $1.91 \times 10^{-8}$ \\
Catholicism	&$0.145$& $0.030$ & $4.817$ & $1.84 \times 10^{-5}$ \\
\hline
\end{tabular}
\label{table2}
\end{table}
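The reversal illustrated by Tables \ref{table1} and \ref{table2} is easy to reproduce on a small scale.  The following pure-Python sketch uses contrived, hypothetical numbers (not the Swiss data) and fits both models by least squares via the normal equations: the covariate $x_2$ is built so that the fitted coefficient of $x_1$ is positive in the one-variable model and negative once $x_2$ enters the model.

```python
def solve(A, b):
    # Solve the square linear system A x = b by Gaussian elimination
    # with partial pivoting (sufficient for these tiny examples).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lstsq(columns, y):
    # Least-squares coefficients via the normal equations X'X b = X'y.
    XtX = [[sum(u * v for u, v in zip(ca, cb)) for cb in columns] for ca in columns]
    Xty = [sum(u * v for u, v in zip(ca, y)) for ca in columns]
    return solve(XtX, Xty)

# Contrived data: y = -x1 + 2*x2 exactly, with x2 positively correlated with x1.
e  = [1.0, 1.0, 1.0, 1.0]   # intercept column
x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [0.0, 1.0, 1.0, 2.0]
y  = [0.0, 1.0, 0.0, 1.0]

b_simple = lstsq([e, x1], y)       # coefficient of x1 is +0.2
b_full   = lstsq([e, x1, x2], y)   # coefficient of x1 is -1.0
print(b_simple[1], b_full[1])
```

The flip of the agriculture coefficient in Table \ref{table2} arises by the same mechanism: the added covariates absorb the part of the response that the one-variable model had attributed to the variable of interest.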

The problem is that the two models are incongruous with regard to the effect of agriculture on fertility.  In this instance, any conclusion drawn from the data depends strongly on the choice of model.  It would thus be misleading to summarize just one of the models, as this could foster a false sense of certainty.  Yet scientific articles often summarize results obtained through a single model, without any sensitivity analysis.  For further context see some recent examples \cite{Davis12,Jungert12,Nelson13,Lignell13,Cervellati12}.

By paying attention to issues of model selection, professional communities can guard against publication bias (see \cite{Dickersin90}) and reduce the prevalence of contradictory claims within scientific literature.  For example, Tarino et al.\ have written a paper summarizing the results of a meta-study analyzing the effect of saturated fat consumption on cardiovascular health \cite{Tarino10}, and Scarborough et al.\ have responded critically, arguing that some of the studies under consideration in the meta-analysis adjusted for covariates inappropriately \cite{Scarborough10}.  At the heart of this controversy is a disagreement regarding model structure.  There are also more general concerns about proper analysis of observational data.

Observational studies continue to play a significant role in health care \cite{Lu09}, and observational approaches are re-emerging within ecology \cite{Sagarin10}.  Meanwhile, economists continue to make use of observational data \cite{Wooldridge13}, as do social scientists \cite{Morgan07}.  Interestingly, even the ATLAS particle detector observed thirteen petabytes of data in 2010 \cite{Brumfiel11}.  Regarding observational study in general, Rosenbaum states that it is the unmeasured covariates that present the largest difficulties \cite{Rosenbaum05}.  This article aims to introduce mathematical theory that is meant to address some of these difficulties.

\section{Background}

Suppose that enough high-dimensional observations have been made to fit a linear model, and that the set of explanatory variables to be used in the model has yet to be determined.  An estimate for the qualitative nature of the unique effect of $X_i$ on $Y$ is desired, but the dimension is large enough that it is not computationally feasible to fit every possible model.  Thus, any conclusion reached regarding the effect of $X_i$ on $Y$ must be regarded with some degree of suspicion.

Specifically, suppose that subject matter knowledge has been used to select a linear model, with explanatory variables indexed by $I$.  Denote with $\,_I\hat{\beta}_i$ the $i$th fitted coefficient within this model, obtained using the principle of least squares; left subscripts indicate model structure.  As long as the vector of residuals is nonzero, it remains possible that consideration of data associated with additional explanatory variables, indexed by $J$, yields $\mathrm{sign}(\,_{J,I}\hat{\beta}_i)\neq \mathrm{sign}(\,_I\hat{\beta}_i)$.  We call this a reversal.

Relevant to our study of reversals is general theory regarding the least-squares fitting of linear models.  A formula expresses how a single appended covariate influences an existing model.  When an additional column of data $\mathbf{x}_j$ is appended to the regression matrix, the vector of fitted coefficients for the original columns changes by 
\begin{equation}\label{Seb} (X^tX)^{-1}X^t\mathbf{x}_j\frac{\mathbf{x}_j^t(I-X(X^tX)^{-1}X^t)\mathbf{y}}{\mathbf{x}_j^t(I-X(X^tX)^{-1}X^t)\mathbf{x}_j},
\end{equation} where $\mathbf{y}$ is the vector of response data and $X$ is the original regression matrix \cite{Seber03}.
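To make (\ref{Seb}) concrete, the sketch below (hypothetical data, pure Python) computes the displayed quantity and compares it with a direct refit.  The bracketed ratio is exactly the fitted coefficient $\hat{\gamma}$ of the appended column, and in this parametrization the original coefficients shift by the negative of the displayed vector:

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a square system A x = b.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def lstsq(columns, y):
    # Least-squares coefficients via the normal equations X'X b = X'y.
    XtX = [[sum(u * v for u, v in zip(ca, cb)) for cb in columns] for ca in columns]
    Xty = [sum(u * v for u, v in zip(ca, y)) for ca in columns]
    return solve(XtX, Xty)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def resid(columns, v):
    # v minus its projection onto span(columns): applies I - X(X'X)^{-1}X'.
    b = lstsq(columns, v)
    return [v[q] - sum(c * col[q] for c, col in zip(b, columns)) for q in range(len(v))]

e, x1 = [1.0] * 4, [0.0, 1.0, 2.0, 3.0]   # original regression matrix X
xj = [0.0, 1.0, 1.0, 2.0]                 # column to be appended
y  = [0.0, 1.0, 0.0, 1.0]
X  = [e, x1]

b_old = lstsq(X, y)
r_xj, r_y = resid(X, xj), resid(X, y)
gamma = dot(r_xj, r_y) / dot(r_xj, r_xj)   # the bracketed ratio in the display
shift = [c * gamma for c in lstsq(X, xj)]  # (X'X)^{-1} X' x_j  times gamma
b_new = lstsq(X + [xj], y)
# b_new agrees with b_old minus shift on the original coefficients,
# and its last entry equals gamma.
```

This is the partitioned-regression identity underlying (\ref{Seb}); the numeric check confirms it on the toy data.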

Hosman et al.\ have shown, for a single coefficient of interest, how the expression in (\ref{Seb}) decomposes into a ratio of standard errors, a fitted coefficient when $X_j$ is regressed onto the original explanatory variables, and the partial correlation between $X_j$ and $Y$ given the original explanatory variables \cite{Hosman10}.  It remains unclear, however, how to use such theory pragmatically as part of a model selection procedure.

According to Myers there are some situations where certain covariates should not be adjusted for \cite{Myers11}, although Rubin tends to disagree \cite{Rubin09}.  In a medical context, Kurth states that it is often insufficient to adjust only for a few demographic variables \cite{Kurth07}.  On the other hand, Robins et al.\ point out that adjusting for too many covariates can be problematic \cite{Robins86}.  Pearl suggests that practitioners should use graphs to determine an admissible set of covariates for adjustment \cite{Pearl09}.  It is apparent that there is a lack of consensus on how best to proceed.

In the presence of many unobserved and potentially confounding variables, it is inherently difficult to interpret results.  Chatfield has urged statisticians to ``stop pretending that model uncertainty does not exist and begin to find ways of coping with it \cite{Chatfield95}.''  Our strategy here is based on the intuitive nature of correlation.  The theoretical approach is applicable whenever a coefficient of determination can be estimated, even for unmeasured sets of covariates, where estimates are based on subject matter knowledge.  The mathematical theory then leads to strengthened defense of conclusions drawn from a specific model, even in the presence of substantial model uncertainty. 

\section{Model-Independent Estimation}
We prepare for a theorem that can make possible model-independent estimation of the direction of an effect.
Let $r$ denote Pearson's correlation coefficient, and let $R$ denote the positive square root of the coefficient of determination.  Let $I$ index centered, orthogonal columns of data for a subset of explanatory variables, and let $J$ index disjoint (from $I$), not necessarily centered or orthogonal (to themselves or the vectors indexed by $I$), additional columns of data.  All columns contain the same number of observations.
\begin{theorem}\label{mainTH}
\[\,_JR<|r(\mathbf{x}_i,\mathbf{y})| \implies \mathrm{sign}(\,_{J,I}\hat{\beta}_i)=\mathrm{sign}(\,_{I}\hat{\beta}_i).\]
\end{theorem}

Theorem \ref{mainTH} is reminiscent of a line of reasoning (see \cite{Cornfield59}) that was used to implicate smoking as a cause of lung cancer in American men \cite{Lin98}.  Fisher had earlier argued, essentially, that `correlation is not causation', and he maintained that the observed association between smoking and lung cancer could be due to a third factor \cite{Fisher58}.  Cornfield et al.\ then responded with ``the magnitude of the excess lung-cancer risk among cigarette smokers is so great that the results can not be interpreted as arising from an indirect association of cigarette smoking with some other agent or characteristic, since this hypothetical agent would have to be at least as strongly associated with lung cancer as cigarette use; no such agent has been found or suggested \cite{Cornfield59b}.''  

Theorem \ref{mainTH} is formulated to be applicable in much the same way that the argument of Cornfield et al.\ has been used.  Theorem \ref{mainTH} regards the sensitivity of a fitted coefficient to expansion of a linear model, assuming the principle of least squares.  The theorem can provide researchers with an argument to use against any claims that their model failed to account for a set of covariates---the covariates cannot reverse the observed direction of a unique effect unless they as a whole possess a relatively large coefficient of determination for the response variable.  Thus, in conjunction with subject matter knowledge, Theorem \ref{mainTH} can apply even to sets of unmeasured covariates.  This and the theorem's general formulation within the context of linear modeling distinguish it from other similar results (see \cite{Lin98}, \cite{Giles89}, \cite{McAleer86} or \cite{Rosenbaum83}). 
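A numerical illustration of Theorem \ref{mainTH} on hypothetical data, with $I=\{1\}$ and $J=\{2\}$ (so that $\,_JR=|r(\mathbf{x}_2,\mathbf{y})|$): all columns and the response are centered, the hypothesis $\,_JR<|r(\mathbf{x}_1,\mathbf{y})|$ holds, and the fitted coefficient of $\mathbf{x}_1$ keeps its sign when $\mathbf{x}_2$ is added.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Centered, hypothetical columns: x2 overlaps x1 but is weakly tied to y.
x1 = [1.0, -1.0, 1.0, -1.0]
x2 = [0.1, -0.1, -1.0, 1.0]
y  = [1.0, -1.0, 0.8, -0.8]

r1 = dot(x1, y) / (dot(x1, x1) * dot(y, y)) ** 0.5   # r(x1, y), roughly  0.99
r2 = dot(x2, y) / (dot(x2, x2) * dot(y, y)) ** 0.5   # r(x2, y), roughly -0.54

# Because every column (and y) is centered, the intercept decouples and the
# two-variable fit reduces to a 2x2 system, solved here by Cramer's rule.
s11, s12, s22 = dot(x1, x1), dot(x1, x2), dot(x2, x2)
b1_alone = dot(x1, y) / s11
b1_both  = (dot(x1, y) * s22 - s12 * dot(x2, y)) / (s11 * s22 - s12 ** 2)
# _JR = |r2| < |r1| here, and both fitted coefficients of x1 stay positive.
```

The example shows the theorem's guarantee in action; by contrast, Table \ref{tab2} shows what can happen when the hypothesis fails.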
%It also leads to a necessary condition for Simpson's paradox.

A few precautionary remarks are needed.  First, note that for Theorem \ref{mainTH} to apply, the regression matrix must have orthogonal columns.  The counterexample in Table \ref{tab1} shows that any weakening of this assumption leaves open the possibility that a covariate associated with neither the response variable nor the explanatory variable of interest may nonetheless induce a reversal.  In this sense, models fit over principal components, in addition to producing estimators with less variance (see \cite{Seber03b}), also lead to more robust interpretations and conclusions.  Second, it is not enough to consider potentially confounding covariates individually.  As shown in Table \ref{tab2}, a set of covariates, each correlating arbitrarily weakly with the response data, can together have an arbitrarily large coefficient of determination, and thus together they are capable of inducing reversals.

With $d$ covariates under consideration, a linear model can be linearly expanded in $2^d$ possible ways: one expansion for each subset of covariates.  If we allow for nonlinear expansions, say by using higher-order combinations of covariates, then the number of possible expansions is greater yet.  It may not be computationally feasible to fit all of these models.  However, we can compute $R^2$ for the largest conceivable set of covariates, and if this value is small enough, then we can conclude that this largest extension cannot induce a reversal {\it and} that none of the many smaller subextensions can produce a reversal either.  These conclusions follow from Theorem \ref{mainTH} and the observation that deleting explanatory columns of data from the analysis cannot increase $R^2$.  This latter claim can be rigorously justified using Definition 4.1, Lemma 4.2, Proposition 4.3 and Proposition 4.5 of this paper.  
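The monotonicity observation---deleting explanatory columns cannot increase $R^2$---can be sanity-checked numerically.  A pure-Python sketch on hypothetical data (the helper fits an intercept plus the given columns by the normal equations and returns the ratio of explained to total variation):

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting for a square system A x = b.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def r_squared(columns, y):
    # Fit y on an intercept plus the given columns, then return
    # R^2 = sum((yhat - ybar)^2) / sum((y - ybar)^2).
    cols = [[1.0] * len(y)] + columns
    XtX = [[sum(u * v for u, v in zip(ca, cb)) for cb in cols] for ca in cols]
    Xty = [sum(u * v for u, v in zip(ca, y)) for ca in cols]
    b = solve(XtX, Xty)
    yhat = [sum(c * col[q] for c, col in zip(b, cols)) for q in range(len(y))]
    ybar = sum(y) / len(y)
    return sum((h - ybar) ** 2 for h in yhat) / sum((v - ybar) ** 2 for v in y)

x1 = [0.0, 1.0, 2.0, 3.0]
x2 = [0.0, 1.0, 1.0, 2.0]
y  = [0.1, 1.0, 0.2, 0.9]

full = r_squared([x1, x2], y)
# Every sub-model explains no more variation than the full model.
assert r_squared([x1], y) <= full + 1e-12
assert r_squared([x2], y) <= full + 1e-12
```

On real data the same one-line comparison suffices: compute $R^2$ once for the largest conceivable covariate set and read off the bound for all of its subsets.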

The theory of the preceding paragraph is illustrated in Table \ref{tab3}.  The statistics on display were computed from data associated with an ecological study of mortality, biochemistry, diet, and lifestyle carried out in rural China in the 1980s and early 1990s (see \cite{Chen1990}).  For each of sixty-four counties, heart disease rates were obtained along with county-level consumption values for each of ten dietary variables.  These particular variables were selected for their familiarity and (among the dietary variables) their disparity.  The resulting data made for an interesting exploratory analysis.  

Wheat is the dietary variable most strongly correlated with heart disease, and the set of remaining dietary variables as a whole possesses a relatively weak coefficient of determination.  With an awareness of Theorem \ref{mainTH} and a familiarity with $R^2$ we can thus quickly conclude that none of the $2^9=512$ possible linear regression models that utilize wheat as an explanatory variable can contain a negative estimate for the unique effect of wheat on heart disease.  We summarize this theoretical conclusion by stating that the data most likely do not indicate a protective effect of wheat consumption on county-level heart disease rates. 

\begin{table}[hb]
\caption{A contrived data set where $\mathbf{x}_3$ is uncorrelated with both $\mathbf{x}_1$ and $\mathbf{y}$, yet $\mathrm{sign}(\,_{1,2,3}\hat{\beta}_1) \neq \mathrm{sign}(\,_{1,2}\hat{\beta}_1)$.}
\label{tab1}
\centering
    \begin{tabular}{cccc}
\toprule
        $\mathbf{y}$ & $\mathbf{x}_1$ & $\mathbf{x}_2$ & $\mathbf{x}_3$\\ 
\midrule
        $\sqrt{2}$	& $\sqrt{2}$ & $1$ & $5+\sqrt{2}$ \\
$-\sqrt{2}$	& $-\sqrt{2}$ & $1$ & $5-\sqrt{2}$ \\
$0$	& $2\sqrt{2}$ & $-5\sqrt{2}-1$ & $\sqrt{2}-5$ \\
$0$	& $-2\sqrt{2}$ & $5\sqrt{2}-1$ & $-\sqrt{2}-5$ \\
\bottomrule
\end{tabular}
\end{table}
\begin{table}[hb]
\caption{A contrived data set illustrating how the reversal potential of $\mathbf{x}_2$ and $\mathbf{x}_3$ combined can be greater than expected: $\,_{1}\hat{\beta}_1=0.5=\,_IR$, as $\epsilon \downarrow 0$ both $\,_2R \downarrow 0$ and $\,_3R\downarrow 0$, while $\,_{2,3}R\equiv 0.75>0.5$, and $\,_{1,2,3}\hat{\beta}_1 =-1.0$.  Incidentally, $\,_{1}\hat{\beta}_1=\,_{1,2,3}\hat{\beta}_1=0.5$ when $\epsilon =0$.}
\label{tab2}
\centering
    \begin{tabular}{cccc}
\toprule
        $\mathbf{y}$ & $\mathbf{x}_1$ & $\mathbf{x}_2$ & $\mathbf{x}_3$\\ 
\midrule
        $\sqrt{2}+\sqrt{3}$	& $2\sqrt{2}$ & $\sqrt{2}\sqrt{3}\epsilon+\epsilon$ & $\sqrt{2}\sqrt{3}\epsilon+\epsilon$ \\
$-\sqrt{2}+\sqrt{3}$	& $-2\sqrt{2}$ & $-\sqrt{2}\sqrt{3}\epsilon+\epsilon$ & $-\sqrt{2}\sqrt{3}\epsilon+\epsilon$ \\
$-\sqrt{3}$	& $2\sqrt{2}$ & $2\sqrt{2}-\epsilon$ & $-2\sqrt{2}-\epsilon$ \\
$-\sqrt{3}$	& $-2\sqrt{2}$ & $-2\sqrt{2}-\epsilon$ & $2\sqrt{2}-\epsilon$ \\
 \bottomrule
\end{tabular}
\end{table}
\begin{table}[hb]
\caption{Correlations between dietary variables and county level heart disease rates; with wheat excluded $R^2=0.30$.}
\label{tab3}
\centering
\begin{tabular}{lc}
\toprule
dietary & observed correlation \\%with {\it Heart Disease}\\
variable & with {\it Heart Disease}\\
\midrule
{\it Cholesterol} & -.15 \\
{\it Saturated Fat} & -.18 \\
{\it Fish} & -.21 \\
{\it Nuts} & ~.01 \\
{\it Salt} & ~.00 \\
{\it Spices} & ~.33 \\
{\it Wheat} & ~.64 \\
{\it Beans} & -.33 \\
{\it Fruits} & -.03 \\
{\it Vegetables} & -.13 \\
\bottomrule
\end{tabular}
\end{table}

\section{Simpson's Paradox}
The reversal of a fitted coefficient's sign brings to mind Simpson's paradox.  Wagner has described Simpson's paradox as ``the designation for a surprising situation that may occur when two populations are compared with respect to the incidence of some attribute: if the populations are separated in parallel into a set of descriptive categories, the population with higher overall incidence may yet exhibit a lower incidence within each such category \cite{Wagner82}.''  A mathematical definition is provided in Table \ref{artex}.  See Good and Mittal's article \cite{Good87} for an overview of related concepts and terminology in the literature.

\begin{table}[ht]
\caption{Simpson's paradox occurs when $\frac{\sum a_j }{\sum b_j} > \frac{\sum c_j}{\sum d_j}$, yet 
$\forall j~\frac{a_j}{b_j}<\frac{c_j}{d_j}$.}
\label{artex}
\centering
\begin{tabular}{lcccc}
\toprule
&category 1&category 2&$\cdots$&category s\\
\midrule
population 1&$a_1/b_1$&$a_2/b_2$&$\cdots$&$a_s/b_s$\\
population 2&$c_1/d_1$&$c_2/d_2$&$\cdots$&$c_s/d_s$\\
\bottomrule
\end{tabular}
\end{table}
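The inequalities of Table \ref{artex} are easy to verify mechanically.  The counts below are the well-known kidney-stone treatment figures that are commonly used to illustrate the paradox: population 1 has the higher overall success rate yet the lower rate within every category.

```python
# (successes, trials) per category; population 1 wins overall but
# loses within every category -- Simpson's paradox in its strong sense.
pop1 = [(234, 270), (55, 80)]
pop2 = [(81, 87), (192, 263)]

def overall(pop):
    # Aggregate success rate: sum of successes over sum of trials.
    return sum(a for a, _ in pop) / sum(b for _, b in pop)

aggregate_reversal = overall(pop1) > overall(pop2)
within_categories = all(a / b < c / d
                        for (a, b), (c, d) in zip(pop1, pop2))
print(aggregate_reversal and within_categories)  # → True
```

Both conditions of Table \ref{artex} hold simultaneously, which is precisely Wagner's (strong) formulation of the paradox.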

Julious has described Simpson's paradox in a medical setting \cite{Julious94}, and Bickel et al.\ have spoken of a related phenomenon when analyzing admissions data from the University of California at Berkeley \cite{Bickel75}.  They have written the following: ``Examination of aggregate data on graduate admissions to the University of California, Berkeley, for fall 1973 shows a clear but misleading pattern of bias against female applicants\ldots\ If the data is properly pooled, taking into account the autonomy of departmental decision making, thus correcting for the tendency of women to apply to graduate departments that are more difficult for applicants of either sex to enter, there is a small but statistically significant bias in favor of women.''

The details show that not {\it every} department had a higher acceptance rate for females.  Nonetheless, the authors have chosen to describe the reversal as ``a paradox, sometimes referred to as Simpson's'', likely because after adjusting for a confounding variable, namely `department', an opposite interpretation of the data becomes possible.  To recognize their use of the term, and other similar usage (see \cite{Appleton96}), an alternative, weaker definition of Simpson's paradox should be considered.  The terminology of linear modeling can apply.

Let $Y$ indicate the presence or absence of an attribute, taking the values one or zero.  Let $X_i$ indicate membership within one population or another, taking the values zero or one.  Let the $s$ indicator variables $X_{j_1},X_{j_2},...,X_{j_s}$ together indicate category.  With $I=\{i\}$ and $J=\{j_1,j_2,...,j_s\}$, Simpson's paradox, in its weaker sense, can be said to occur when $\mathrm{sign}(\,_{J,I}\hat{\beta}_i)\neq \mathrm{sign}(\,_{I}\hat{\beta}_i)$.  We say that Simpson's paradox, in its stronger sense, occurs when Wagner's previously stated definition is satisfied. 

\begin{lemma}
\label{sw}
Occurrence of the strong Simpson's paradox implies occurrence of the weak Simpson's paradox.
\end{lemma}
\begin{proof}
Let $\mathbf{y}$, $\mathbf{x}_1$, and $\{\mathbf{x}_j\}_{j=2,3,...,k}$ be vectors each taking only the values zero and one, with the latter set associated with a single categorical variable.  For each $j$ in $\{2,3,...,k\}$, let $\hat{\beta}_1(j)$ represent the first least-squares fitted coefficient when the model is fit over only those observations with $x_j=1$.  Let $\hat{\beta}_1(1)$ represent the first least-squares fitted coefficient when the model is fit over only those observations where $x_j=0$ for every $j$.  It suffices to show that if $\hat{\beta}_1(j)>0$ for all $j=1,2,3,...,k$, then with $J=\{2,3,...,k\}$ we have $\,_{J,1}\hat{\beta}_1>0.$

Start by considering a length-$n$ vector of real-valued observations, namely $\mathbf{x}$.  Consider the quantity $\sum_{i=1}^n(x_i-z)^2$ as a function of $z$, and note that it is concave up.  Its derivative with respect to $z$ is $-2\sum_{i=1}^n(x_i-z)$, which is equal to zero precisely when $z=\sum_{i=1}^nx_i/n=\bar{x}$.  We state these observations for future reference.  We also purposefully shift the entries of $\mathbf{x}_1$ so that instead of being $0$ or $1$ they are $-.5$ or $.5$.  This does not affect $\,_{J,1}\hat{\beta}_1$.

There are $k$ categories and within each there are two values for $X_1$.  We thus divide the sample of observations of $Y$ into $2k$ subsamples and compute each mean.  Our assumption that $\hat{\beta}_1(j)>0$ for $j=1,2,3,...,k$ ensures that paired means within a given category are different.  We can thus set each of $\{\,_{J,1}\beta_0,\,_{J,1}\beta_2,\,_{J,1}\beta_3,...,\,_{J,1}\beta_k\}$ within the closed interval bounded by the differing means of its associated category.  Note that the least-squares estimates must come from such a subset of the parameter space.  Observe, given our setup and due to the observations of the opening paragraph, that for any $\alpha>0$, setting $\,_{J,1}\beta_1=\alpha$ results in a lower sum of squared residuals than setting $\,_{J,1}\beta_1=-\alpha$.  We thus rule out the possibility of a negative value for $\,_{J,1}\hat{\beta}_1$.
\end{proof}

Lemma \ref{sw} combined with the contrapositive statement of Theorem \ref{mainTH} (after squaring the inequality) thus results in the following necessary condition for (the strong or weak) Simpson's paradox.  The coefficients of determination refer to the coefficients of determination for the incidence of the attribute of interest.
\begin{corollary}
For Simpson's paradox to occur it is necessary for the set of indicator variables associated with categorization to possess a coefficient of determination that is larger than the coefficient of determination possessed by the variable indicating population.
\end{corollary}

\section{Mathematical Theory}
\label{Defs}
This section develops some mathematics that can be used to prove Theorem \ref{mainTH}.  There is a geometric flavor to the definitions that is best embraced before moving on to the lemmas and propositions.  Attention should be drawn to Proposition \ref{uphill} in particular, as it may prove useful during future in-depth study of the least-squares fitting procedure.  A solid understanding of this proposition leads to a thorough understanding of the proof of the theorem. 
\subsection{Notation}
\label{notats}
The existence of a general data set as depicted in Table \ref{data} is assumed.  There are $n$ observations, each of dimension $m$.  Let $I$ index a subset of $\{1,2,...,m\}$, let $J$ index a disjoint subset, and let $K$ index a generic subset.  Let $i$ stand for a generic element of $I$, $j$ for a generic element of $J$, and $k$ for a generic element of $K$.  

Bold symbols indicate observed vectors of data within $\mathbb{R}^n$.  Also, $\langle \cdot,\cdot \rangle$ is used for the standard inner product, $|\cdot|$ for the associated Euclidean norm, and $\perp$ to indicate orthogonality. 

With $\mathbf{e}$ denoting a vector of $n$ ones, the vectors $\{\mathbf{e},\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_m\}$ are assumed to form a linearly independent set.  The span of $\mathbf{e}$ and a subset of vectors indexed by $K$ is a vector subspace denoted with $\,_KV$.  For every $K$, both $\mathbf{y} \not \in \,_KV$ and $\mathbf{y} \not \perp \,_KV$ are assumed. 

In general, $V$ stands for a vector subspace.  Also, left subscripts indicate a subset of explanatory variables, and a post subscript typically indicates a variable of interest.
\begin{table}[hb]
\caption{A sufficiently general data set that illustrates the notation.} 
\label{data}
\centering
    \begin{tabular}{lllll}
\toprule
        $\mathbf{y}$ & $\mathbf{x}_1$ & $\mathbf{x}_2$ & $\hdots$ & $\mathbf{x}_m$\\ %&$\mathbf{x}_{p+1}$&$\mathbf{x}_{p+2}$&$\cdots$&$\mathbf{x}_{p+r}$\\
\midrule
        $y_1$	& $x_{1,1}$	&$x_{2,1}$& $\cdots$ & $x_{m,1}$\\%&$x_{p+1,1}$&$x_{p+2,1}$&$\cdots$&$x_{p+r,1}$\\
        $y_2$	& $x_{1,2}$	&$x_{2,2}$& $\hdots$ & $x_{m,2}$\\%&$x_{p+1,2}$&$x_{p+2,2}$&$\cdots$&$x_{p+r,2}$\\
        $y_3$	& $x_{1,3}$      &$x_{2,3}$& $\hdots$ & $x_{m,3}$\\%$&$x_{p+1,3}$&$x_{p+2,3}$&$\cdots$&$x_{p+r,3}$\\
       $\vdots$	& $\vdots$        &$\vdots$& $\ddots$ & $\vdots$\\%$&$\vdots$&$\vdots$&$\ddots$&$\vdots$\\
        $y_n$ & $x_{1,n}$          &$x_{2,n}$& $\hdots$ & $x_{m,n}$\\%&$x_{p+1,n}$&$x_{p+2,n}$&$\cdots$&$x_{p+r,n}$\\
 \bottomrule
\end{tabular}
\end{table}
\subsection{Definitions}
In this subsection $K=\{k_1,k_2,...,k_p\}$.% and $V$ indicates a vector subspace.
\begin{definition}
\label{proj}
Denote the projection of $\mathbf{y}$ onto $V$ with
\[p_{V}(\mathbf{y})=\displaystyle \argmin_{\mathbf{v}\in V}(|\mathbf{y}-\mathbf{v}|).\]
\end{definition}
\begin{definition}
\label{betasdef}
The vector of fitted coefficients, 
$(\,_{K}\hat{\beta}_0,\,_{K}\hat{\beta}_{k_1},\,_{K}\hat{\beta}_{k_2},...,\,_{K}\hat{\beta}_{k_p})$, is the unique solution of
\[p_{\,_KV}(\mathbf{y})=\,_{K}\hat{\beta}_0\mathbf{e}+\,_{K}\hat{\beta}_{k_1}\mathbf{x}_{k_1}+\,_{K}\hat{\beta}_{k_2}\mathbf{x}_{k_2}+...+\,_{K}\hat{\beta}_{k_p}\mathbf{x}_{k_p}.\] 
\end{definition}
\begin{definition}
\label{thefittedmodel}
$\,_Ky$ is the function
\[\,_Ky:\mathbb{R}^p \to \mathbb{R}\]
\[\,_Ky:(\alpha_{k_1},\alpha_{k_2},...,\alpha_{k_p})\mapsto \,_K\hat{\beta}_0+\,_K\hat{\beta}_{k_1}\alpha_{k_1}+\,_K\hat{\beta}_{k_2}\alpha_{k_2}+...+\,_K\hat{\beta}_{k_p}\alpha_{k_p}.\]
\end{definition}
\begin{definition}
\label{fittedvalues}
The $q$th fitted value is
%For any indexing set $I=\{i_1,i_2,...,i_k\}$, denote the fitted value of the $l$th multivariate observation $(x_{i_1,l},x_{i_2,l},...,x_{i_k,l})$ with
\[\,_K\hat{y}_q=\,_Ky(x_{k_1,q},x_{k_2,q},...,x_{k_p,q}).\]
\end{definition}
\begin{definition}
The vector of fitted values is
\[\mathbf{\,_K\hat{y}}=(\,_K\hat{y}_1,\,_K\hat{y}_2,...,\,_K\hat{y}_n).\]
\end{definition}  
\begin{remark}
Within $\mathbb{R}^n$, $\mathbf{\,_K\hat{y}}=p_{\,_KV}(\mathbf{y}).$
\end{remark}
\begin{definition}
\label{R222}
Define $\,_KR$ as the positive square root of the coefficient of determination:
\[\,_KR=+\sqrt{\,_{K}R^2}=+\sqrt{\frac{\sum_{q=1}^n (\,_K\hat{y}_q-\bar{\mathbf{y}})^2}{\sum_{q=1}^n (y_q-\bar{\mathbf{y}})^2}}.\]
\end{definition} 
\begin{definition}
\label{r}
For generic vectors $\mathbf{x}=(x_1,x_2,...,x_n)$ and $\mathbf{y}=(y_1,y_2,...,y_n)$, and with $s$ denoting the sample standard deviation, define the Pearson correlation coefficient $r$ as
\[r(\mathbf{x},\mathbf{y})=\frac{1}{n-1}\sum_{q=1}^n\left(\frac{x_q-\bar{\mathbf{x}}}{s_{\mathbf{x}}}\right)\left(\frac{y_q-\bar{\mathbf{y}}}{s_{\mathbf{y}}}\right).\]
\end{definition}
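The displayed formula for $r$ can be verified against a library implementation. In this sketch (data invented for illustration), the sample standard deviations use the $n-1$ divisor, matching the definition:

```python
import numpy as np

# Illustrative sketch (invented data): the definition of the Pearson
# correlation, with ddof=1 sample standard deviations, agrees with corrcoef.
rng = np.random.default_rng(3)
n = 25
x = rng.standard_normal(n)
y = rng.standard_normal(n)

r = np.sum(((x - x.mean()) / x.std(ddof=1)) *
           ((y - y.mean()) / y.std(ddof=1))) / (n - 1)
```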
\subsection{\bf \large Geometry}\label{georesults}
The following lemmas are stated without proof; they are standard results that can be derived from material in texts on mathematical analysis (e.g., Cheney \cite{Cheney01}).  See the appendix of this article for a proof of Proposition \ref{uphill}.
\begin{lemma} \label{firstlem} For any $\mathbf{y}$ and for any $V$\[(\mathbf{y}-p_{V}(\mathbf{y})) \perp V.\]
\end{lemma}
\begin{lemma} \label{pyth} For any $\mathbf{y}$ and for any $V$
\[|p_{V}(\mathbf{y})|^2+|\mathbf{y}-p_{V}(\mathbf{y})|^2=|\mathbf{y}|^2.\]
\end{lemma}
\begin{lemma}\label{orth}
For any vectors $\mathbf{x},\mathbf{y}$
\[\mathbf{x}\perp \mathbf{y} \implies |\mathbf{x}|^2+|\mathbf{y}|^2=|\mathbf{x}+\mathbf{y}|^2.\]
\end{lemma}
\begin{lemma}\label{twoprojs} For $V_1\perp V_2$ and $V=\mathrm{span}\{V_1,V_2\}$
\[p_{V}(\mathbf{y})=p_{V_1}(\mathbf{y})+p_{V_2}(\mathbf{y}).\]
\end{lemma}
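Lemma \ref{twoprojs} admits a direct numerical check. In this hedged sketch (subspaces invented for illustration), two orthogonal subspaces are built from the columns of an orthonormal matrix:

```python
import numpy as np

# Illustrative sketch (invented data): for orthogonal subspaces V1 and V2, the
# projection onto span{V1, V2} is the sum of the two separate projections.
rng = np.random.default_rng(4)
y = rng.standard_normal(8)

def proj(A, v):
    # projection of v onto the column span of A, via least squares
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return A @ coef

Q, _ = np.linalg.qr(rng.standard_normal((8, 4)))   # orthonormal columns
V1, V2 = Q[:, :2], Q[:, 2:]                        # V1 and V2 are orthogonal
V = np.hstack([V1, V2])                            # spans span{V1, V2}
```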
\begin{definition}
\label{angles}
For nonzero vectors $\mathbf{y}\in \mathbb{R}^n$ and $\mathbf{v} \in V$, define $\theta(\mathbf{y},\mathbf{v})$, with $0\leq \theta \leq \pi$, via
\[\cos(\theta)=\frac{\langle \mathbf{y},\mathbf{v} \rangle}{|\mathbf{y}| |\mathbf{v}|}.\] 
\end{definition}
\begin{proposition}
\label{uphill}
Let $V$ be a vector subspace of $\mathbb{R}^n$.  For a fixed vector $\mathbf{y}\not \in V$, with $\mathbf{y}\not \perp V$, and for a fixed, nonzero vector $\mathbf{w}\in V$: 
\begin{enumerate}
\renewcommand{\theenumi}{(\roman{enumi})}
\renewcommand{\labelenumi}{\theenumi}
\item \label{part1} If $\mathbf{w}$ is a scalar multiple of $p_V(\mathbf{y})$, then $\theta(\mathbf{y},p_{V}(\mathbf{y})+t\mathbf{w})$ is non-decreasing on $\{t:t>0,\,p_V(\mathbf{y})+t\mathbf{w}\neq 0\}$.
\item \label{part2} If $\mathbf{w}$ is not a scalar multiple of $p_V(\mathbf{y})$, then $\theta(\mathbf{y},p_{V}(\mathbf{y})+t\mathbf{w})$ is a strictly increasing function of $t>0$.
\end{enumerate} 
\end{proposition}
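A hedged numerical illustration of part \ref{part2} of Proposition \ref{uphill} (data invented for illustration): moving away from $p_V(\mathbf{y})$ within $V$, along a direction that is not a multiple of $p_V(\mathbf{y})$, strictly increases the angle to $\mathbf{y}$. Here the direction is chosen orthogonal to $p_V(\mathbf{y})$, a special case of the hypothesis:

```python
import numpy as np

# Illustrative sketch (invented data): theta(y, p_V(y) + t*w) increases in t
# when w lies in V but is not a scalar multiple of p_V(y).
rng = np.random.default_rng(5)
n = 6

Q, _ = np.linalg.qr(rng.standard_normal((n, 3)))  # orthonormal basis of V
y = rng.standard_normal(n)

p = Q @ (Q.T @ y)                 # p_V(y)
w = Q[:, 0] + Q[:, 1]             # a vector of V
w = w - (w @ p) / (p @ p) * p     # remove the p-component: w is not parallel to p

def theta(t):
    v = p + t * w
    cos = y @ v / (np.linalg.norm(y) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

ts = np.linspace(0.0, 5.0, 26)
angles = np.array([theta(t) for t in ts])
```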
\subsection{\bf \large Simplifications}\label{simple}
Proofs of the propositions in this section are left to the reader.
\begin{definition}
\label{center}
A vector of data $\mathbf{x}$ is {\it centered} if $\bar{\mathbf{x}}=0$.
\end{definition}
\begin{definition}
A vector of data $\mathbf{x}$ is {\it geometrically standardized} if $\bar{\mathbf{x}}=0$ and $|\mathbf{x}|=1$.
\end{definition}
\begin{definition}
Given a vector of data $\mathbf{x}$ we use the term {\it standardization} to describe the process
\[\mathbf{x} \mapsto \frac{\mathbf{x}-\bar{\mathbf{x}}\mathbf{e}}{|\mathbf{x}-\bar{\mathbf{x}}\mathbf{e}|}.\]
\end{definition}
\begin{remark}
Standardization results in geometrically standardized data.
\end{remark}
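The remark is immediate, but as a sketch (data invented for illustration) the standardization map can be checked to produce a centered, unit-norm vector:

```python
import numpy as np

# Illustrative sketch (invented data): standardization yields geometrically
# standardized data: mean zero and unit Euclidean norm.
rng = np.random.default_rng(6)
x = 3.0 + 2.5 * rng.standard_normal(15)

z = (x - x.mean()) / np.linalg.norm(x - x.mean())
```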
\begin{proposition}
\label{projperp}
Standardization preserves the orthogonality of a set of centered vectors.
\end{proposition}
\begin{proposition}
\label{wlogstand}
For any $K$, standardization preserves the signs of $\{\,_K\hat{\beta}_k\}_{k \in K}$ and the value of $\,_KR$.
\end{proposition}
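Proposition \ref{wlogstand} can be illustrated numerically. In this hedged sketch (data invented for illustration), the response and both predictors are standardized and the slope signs and $\,_KR$ are compared against the raw fit:

```python
import numpy as np

# Illustrative sketch (invented data): standardizing y and the predictors
# preserves the signs of the slope coefficients and the value of _K R.
rng = np.random.default_rng(7)
n = 40
X = rng.standard_normal((n, 2)) * [2.0, 5.0] + [1.0, -3.0]
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.standard_normal(n)

def fit(X, y):
    design = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    y_hat = design @ b
    R = np.sqrt(np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2))
    return b[1:], R                 # slopes only, plus _K R

std = lambda v: (v - v.mean()) / np.linalg.norm(v - v.mean())
Xs = np.column_stack([std(X[:, 0]), std(X[:, 1])])

b_raw, R_raw = fit(X, y)
b_std, R_std = fit(Xs, std(y))
```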
\begin{proposition}
\label{betanot}
For any $K$, if the data is geometrically standardized, then $\,_K\hat{\beta}_0=0.$
\end{proposition}
\begin{proposition}
\label{uptwo}
For any $K$, if the data is geometrically standardized, then $\,_KR=\cos(\theta(\mathbf{y},p_{\,_KV}(\mathbf{y})))=|p_{\,_KV}(\mathbf{y})|.$
\end{proposition}
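A hedged numerical sketch of Propositions \ref{betanot} and \ref{uptwo} together (data invented for illustration): after geometric standardization the intercept vanishes, and $\,_KR$ coincides with both $\cos(\theta(\mathbf{y},p_{\,_KV}(\mathbf{y})))$ and $|p_{\,_KV}(\mathbf{y})|$:

```python
import numpy as np

# Illustrative sketch (invented data): with geometrically standardized data
# the fitted intercept is zero and _K R = cos(theta) = |p_V(y)|.
rng = np.random.default_rng(8)
n = 40
x_raw = rng.standard_normal(n)
y_raw = 0.7 * x_raw + rng.standard_normal(n)

std = lambda v: (v - v.mean()) / np.linalg.norm(v - v.mean())
x, y = std(x_raw), std(y_raw)

design = np.column_stack([np.ones(n), x])
betas, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ betas              # equals p_V(y) for V = span{e, x}

R = np.sqrt(np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2))
cos_theta = (y @ y_hat) / (np.linalg.norm(y) * np.linalg.norm(y_hat))
```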
\begin{proposition}
\label{rR}
For $k=1,2,...,m$, $\,_kR=|r(\mathbf{x}_k,\mathbf{y})|$.
\end{proposition}
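Proposition \ref{rR} can also be checked directly. In this sketch (data invented for illustration), a single-predictor fit with a deliberately negative association shows that $\,_kR$ matches the absolute Pearson correlation:

```python
import numpy as np

# Illustrative sketch (invented data): for one predictor plus intercept,
# _k R equals |r(x_k, y)|, the absolute Pearson correlation.
rng = np.random.default_rng(9)
n = 50
x = rng.standard_normal(n)
y = -1.3 * x + rng.standard_normal(n)   # negative association, so r < 0

design = np.column_stack([np.ones(n), x])
betas, *_ = np.linalg.lstsq(design, y, rcond=None)
y_hat = design @ betas

R = np.sqrt(np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2))
```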
\begin{proposition}
\label{propthree}
For any disjoint indexing sets $I$ and $J$, and for any $i \in I$, if $I$ indexes orthogonal vectors of data, then
$\mathrm{sign}(\,_{J,I}\hat{\beta}_i)=\mathrm{sign}(\,_{J,i}\hat{\beta}_i).$
\end{proposition}

\subsection{\bf \large Proof of Theorem \ref{mainTH}}
By Proposition \ref{wlogstand}, geometrically standardized data can be assumed, and by Proposition \ref{projperp}, orthogonality of the vectors indexed by $I$ is retained.  

Proposition \ref{rR} allows us to state the contrapositive of the implication from Theorem \ref{mainTH} as
\[\mathrm{sign}(\,_{J,I}\hat{\beta}_i)\neq \mathrm{sign}(\,_{I}\hat{\beta}_i) \implies \,_JR>\,_iR.\]
By Proposition \ref{propthree} it suffices to demonstrate
\begin{equation}
\label{contra}
\mathrm{sign}(\,_{J,i}\hat{\beta}_i)\neq \mathrm{sign}(\,_{i}\hat{\beta}_i) \implies \,_JR>\,_iR.
\end{equation}

The hypothesis, $\mathrm{sign}(\,_{J,i}\hat{\beta}_i)\neq \mathrm{sign}(\,_{i}\hat{\beta}_i)$, implies that within $\,_{J,i}V$
\[\,_{J}V \text{~separates~}  p_{\,_{J,i}V}(\mathbf{y}) \text{~from~} p_{\,_iV}(\mathbf{y}).\]
Thus the straight line from $p_{\,_{J,i}V}(\mathbf{y})$ to $p_{\,_iV}(\mathbf{y})$ intersects $\,_{J}V$ at a point $\mathbf{q}$.  

Consider the two-stage path consisting of two straight line segments: from $p_{\,_{J}V}(\mathbf{y})$ to $\mathbf{q}$ within $\,_{J}V$, and then from $\mathbf{q}$ to $p_{\,_{i}V}(\mathbf{y})$ within $\,_{J,i}V$.  Using Proposition \ref{uphill} we can conclude that
\begin{equation}
\label{best}
\theta(\mathbf{y},p_{\,_JV}(\mathbf{y})) \leq \theta(\mathbf{y},\mathbf{q}) < \theta(\mathbf{y},p_{\,_iV}(\mathbf{y})).
\end{equation}
This conclusion is valid for the following reasons.  We have assumed in Section \ref{notats} that $\mathbf{y} \not \perp \,_KV$ for any $K$, which implies, even for geometrically standardized data and again for any $K$, that $\,_K\hat{\beta}_i\neq 0$.  Also, if $p_{\,_iV}(\mathbf{y})-p_{\,_{J,i}V}(\mathbf{y})$ were a scalar multiple of $p_{\,_{J,i}V}(\mathbf{y})$, then $\mathbf{q}=\mathbf{0}$, and Proposition \ref{betanot} would ensure that $p_{\,_{J,i}V}(\mathbf{y})$ is a scalar multiple of $\mathbf{x}_i$.  This contradicts either $\,_{J,i}\hat{\beta}_i\neq 0$ or $\,_{J,i}\hat{\beta}_j\neq 0$ for $j\in J$.  Thus we conclude that $p_{\,_iV}(\mathbf{y})-p_{\,_{J,i}V}(\mathbf{y})$ is not a scalar multiple of $p_{\,_{J,i}V}(\mathbf{y})$, and we are justified in using part \ref{part2} of Proposition \ref{uphill} along the first segment.  Finally, note that Proposition \ref{uphill} applies along the second segment because that segment lies along a ray emanating from $p_{\,_{J,i}V}(\mathbf{y})$.

To finish this proof we apply the cosine function to (\ref{best}), reversing the ordering, resulting in 
\[ \cos(\theta(\mathbf{y},p_{\,_JV}(\mathbf{y})))\geq \cos(\theta(\mathbf{y},\mathbf{q}))> \cos(\theta(\mathbf{y},p_{\,_iV}(\mathbf{y}))).\] 
Proposition \ref{uptwo} then allows us to substitute $\,_JR$ for $\cos(\theta(\mathbf{y},p_{\,_JV}(\mathbf{y})))$ and $\,_iR$ for $\cos(\theta(\mathbf{y},p_{\,_iV}(\mathbf{y})))$, resulting in
\[\,_JR>\,_iR,\]
which is the desired conclusion from line (\ref{contra}).
\qed

\appendix
\setcounter{lemma}{0}
    \renewcommand{\thelemma}{\Alph{section}\arabic{lemma}}

\section{Appendix: Proof of Proposition \ref{uphill}}
For part \ref{part1}, with $\alpha \neq 0$, it suffices to show that \[\cos(\theta)=\frac{\langle \mathbf{y},p_V(\mathbf{y})+t (\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}| |p_V(\mathbf{y})+t (\alpha p_V(\mathbf{y}))|}\] is non-increasing on $\{t:t>0,t\neq -1/\alpha\}$.  
For $\alpha>0$, or for $\alpha<0$ and $t<-1/\alpha$, \[\frac{\langle \mathbf{y},p_V(\mathbf{y})+t (\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}| |p_V(\mathbf{y})+t (\alpha p_V(\mathbf{y}))|}=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}| |(1+t\alpha)p_V(\mathbf{y})|}=\frac{(1+t\alpha)}{(1+t\alpha)}\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}| |p_V(\mathbf{y})|}=\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}| |p_V(\mathbf{y})|},\] which is constant.  For $\alpha<0$ and $t>-1/\alpha$, \[\frac{\langle \mathbf{y},p_V(\mathbf{y})+t (\alpha p_V(\mathbf{y})) \rangle}{|\mathbf{y}| |p_V(\mathbf{y})+t (\alpha p_V(\mathbf{y}))|}=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}| |(1+t\alpha)p_V(\mathbf{y})|}=\frac{(1+t\alpha)}{-(1+t\alpha)}\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}| |p_V(\mathbf{y})|}=-\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}| |p_V(\mathbf{y})|},\] which is also constant. 
Furthermore, \[-\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}| |p_V(\mathbf{y})|}\leq\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}| |p_V(\mathbf{y})|}\] because Lemma \ref{pyth} states \[|p_{V}(\mathbf{y})|^2+|\mathbf{y}-p_{V}(\mathbf{y})|^2=|\mathbf{y}|^2,\] which expands to give \[\langle p_V(\mathbf{y}),p_V(\mathbf{y})\rangle+\langle \mathbf{y},\mathbf{y}\rangle-2\langle \mathbf{y},p_V(\mathbf{y})\rangle+\langle p_V(\mathbf{y}),p_V(\mathbf{y}) \rangle = \langle \mathbf{y},\mathbf{y} \rangle,\] which implies \[\langle \mathbf{y},p_V(\mathbf{y})\rangle \geq 0.\]   

For part \ref{part2}, with $\alpha \in \mathbb{R}$, write $\mathbf{w}=\alpha p_V(\mathbf{y})+\mathbf{u}$, where $\mathbf{u}\perp p_V(\mathbf{y})$.
%, and by assumption $\mathbf{u}\neq \mathbf{0}$.  
In this case $\cos(\theta)$ becomes \[\frac{\langle \mathbf{y},p_V(\mathbf{y})+t(\alpha p_V(\mathbf{y})+\mathbf{u}) \rangle}{|\mathbf{y}||p_V(\mathbf{y})+t(\alpha p_V(\mathbf{y})+\mathbf{u})|}=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u} \rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle+\langle \mathbf{y},t\mathbf{u}\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}.\]  The $\langle \mathbf{y},t\mathbf{u}\rangle$ term can be dropped since \[\langle \mathbf{y},t\mathbf{u}\rangle=t\langle \mathbf{y},\mathbf{u}\rangle=t\langle p_V(\mathbf{y})+(\mathbf{y}-p_V(\mathbf{y})),\mathbf{u}\rangle=t\langle p_V(\mathbf{y}),\mathbf{u}\rangle+t\langle \mathbf{y}-p_V(\mathbf{y}),\mathbf{u}\rangle=0+0,\] where the first zero holds because $\mathbf{u}\perp p_V(\mathbf{y})$ and the second is due to Lemma \ref{firstlem}.  Thus, it suffices to show that \[L=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}\] is decreasing for $t>0$.  

First we state and prove a lemma.
\begin{lemma}
\label{derivative}
For $(1+t \alpha) \neq 0$, $t/(1+t\alpha)$ is a strictly increasing function of $t$.
\end{lemma}
\begin{proof}
$\frac{d}{dt}\frac{t}{1+t\alpha}=\frac{1(1+t\alpha)-\alpha t}{(1+t\alpha)^2}=\frac{1}{(1+t\alpha)^2}>0.$
\renewcommand{\qedsymbol}{}
\end{proof}
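As a hedged numerical sketch of Lemma \ref{derivative} (the value of $\alpha$ is invented for illustration), monotonicity can be checked on each side of the singularity at $t=-1/\alpha$:

```python
import numpy as np

# Illustrative sketch (invented alpha): t / (1 + t*alpha) is strictly
# increasing on each interval where 1 + t*alpha != 0.
alpha = -0.5                          # singularity at t = -1/alpha = 2
f = lambda t: t / (1 + t * alpha)

left = np.linspace(0.1, 1.9, 20)      # here 1 + t*alpha > 0
right = np.linspace(2.1, 10.0, 20)    # here 1 + t*alpha < 0
```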

Now for $t$ such that $(1+t \alpha)>0$, \[L=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}=\frac{1/(1+t\alpha)}{1/(1+t\alpha)}\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}=\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}||p_V(\mathbf{y})+t\mathbf{u}/(1+t\alpha)|}.\]  Note that $t/(1+t \alpha)$ is positive because $t>0$ and $(1+t \alpha)>0$, and note also that $t/(1+t \alpha)$ is increasing by Lemma \ref{derivative}.  Thus, as a consequence of Lemma \ref{orth}, $|p_V(\mathbf{y})+t\mathbf{u}/(1+t\alpha)|$ is increasing in $t$, which implies that $L$ is decreasing in $t$ as desired.

For $t$ such that $(1+t \alpha)<0$, \[L=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}=\frac{1/(1+t\alpha)}{1/(1+t\alpha)}\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}=\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{-|\mathbf{y}||p_V(\mathbf{y})+t\mathbf{u}/(1+t\alpha)|}.\]  Note that $t/(1+t \alpha)$ is negative because $t>0$ and $(1+t \alpha)<0$, and note also that $t/(1+t \alpha)$ is increasing by Lemma \ref{derivative}.  Thus, as a consequence of Lemma \ref{orth}, $|p_V(\mathbf{y})+t\mathbf{u}/(1+t\alpha)|$ is decreasing in $t$, so that $-|\mathbf{y}||p_V(\mathbf{y})+t\mathbf{u}/(1+t\alpha)|$ is increasing in $t$, which implies that $L$ is decreasing in $t$, again as desired.

For $t$ such that $(1+t \alpha)=0$, note that $\alpha<0$ so that $0<t<-1/\alpha \iff (1+t\alpha)>0$, $t=-1/\alpha \iff (1+t\alpha)=0$, and $t>-1/\alpha \iff (1+t \alpha)<0$.  Note also that since $\mathbf{y}\not \in V$ and $\mathbf{y} \not \perp V$, Lemma \ref{pyth} implies not only $\langle \mathbf{y},p_V(\mathbf{y})\rangle\geq 0$ as derived previously, but also the strict inequality $\langle \mathbf{y},p_V(\mathbf{y})\rangle>0$.  Thus for $\{(t_1,t_2,t_3): 0<t_1<t_2=-1/\alpha<t_3<\infty\}$, \[\frac{\langle \mathbf{y},(1+t_1\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t_1\alpha)p_V(\mathbf{y})+t_1\mathbf{u}|}=\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{|\mathbf{y}||p_V(\mathbf{y})+t_1\mathbf{u}/(1+t_1\alpha)|}>0,~\frac{\langle \mathbf{y},(1+t_2\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t_2\alpha)p_V(\mathbf{y})+t_2\mathbf{u}|}=0,\] and \[\frac{\langle \mathbf{y},(1+t_3\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t_3\alpha)p_V(\mathbf{y})+t_3\mathbf{u}|}=\frac{\langle \mathbf{y},p_V(\mathbf{y})\rangle}{-|\mathbf{y}||p_V(\mathbf{y})+t_3\mathbf{u}/(1+t_3\alpha)|}<0.\]  This shows that \[L=\frac{\langle \mathbf{y},(1+t\alpha)p_V(\mathbf{y})\rangle}{|\mathbf{y}||(1+t\alpha)p_V(\mathbf{y})+t\mathbf{u}|}\] must be decreasing at any positive $t$ satisfying $(1+t\alpha)=0$.
\qed

\begin{thebibliography}{99}
{\small

\bibitem{Chatfield95} C. Chatfield, Model uncertainty, data mining and statistical inference, Journal of the Royal Statistical Society: Series A, 158, part 3, (1995), pp. 419–466.

\bibitem{Davis12} Davis et al, Rice consumption and urinary arsenic concentrations in U.S. children, Environmental Health Perspectives, vol. 120, issue 10, (2012), 1418-1424.

\bibitem{Jungert12} Jungert et al, Serum 25-hydroxyvitamin $D_3$ and body composition in an elderly cohort from Germany: a cross-sectional study, Nutrition \& Metabolism, 9, 42, (2012), Accessed in 2013 from \url{http://www.nutritionandmetabolism.com/content/9/1/42}.

\bibitem{Nelson13} Nelson et al, Daily physical activity predicts degree of insulin resistance: a cross-sectional observational study using the 2003--2004 National Health and Nutrition Examination Survey, International Journal of Behavioral Nutrition and Physical Activity, 10, 10, (2013), Accessed in 2013 from \url{http://www.ijbnpa.org/content/10/1/10}.

\bibitem{Lignell13} Lignell et al, Prenatal exposure to polychlorinated biphenyls and polybrominated diphenyl ethers may influence birth weight among infants in a Swedish cohort with background exposure: a cross-sectional study, Environmental Health, 12, 44, (2013), Accessed in 2013 from \url{http://www.ehjournal.net/content/12/1/44}.

\bibitem{Cervellati12} Cervellati et al, Bone mass density selectively correlates with serum markers of oxidative damage in post-menopausal women, Clinical Chemistry and Laboratory Medicine, volume 51, issue 2, (2012), pages 333-338.

\bibitem{Dickersin90} K. Dickersin, The existence of publication bias and risk factors for its occurrence, The Journal of the American Medical Association, (1990), 1385-1389. 

\bibitem{Tarino10}  Tarino et al, Meta-analysis of prospective cohort studies evaluating the association of saturated fat with cardiovascular disease, The American Journal of Clinical Nutrition, 91, 3, (2010), 535-546.

\bibitem{Scarborough10} Scarborough et al, Meta-analysis of effect of saturated fat intake on cardiovascular disease: overadjustment obscures true associations, The American Journal of Clinical Nutrition, vol. 92, no. 2, (2010), 458-459.

\bibitem{Lu09} C.Y. Lu, Observational studies: a review of study designs, challenges and strategies to reduce confounding, The International Journal of Clinical Practice, Blackwell Publishing Ltd., 63, 5, (2009), 691-697.

\bibitem{Sagarin10} R. Sagarin, A. Pauchard, Observational approaches in ecology open new ground in a changing world, Frontiers in Ecology and the Environment, 8, (2010), 379-386.

\bibitem{Wooldridge13}  J. Wooldridge, Introductory Econometrics, A Modern Approach, South-Western Cengage Learning, USA, (2013).

\bibitem{Morgan07}  S.L. Morgan, C. Winship, Counterfactuals and Causal Inference: Methods and Principles for Social Research, Cambridge University Press, New York USA, (2007).

\bibitem{Brumfiel11}  G. Brumfiel, High-energy physics: down the petabyte highway, Nature, 469, (2011), 282-283.

\bibitem{Rosenbaum05} P.R. Rosenbaum, Observational study, Encyclopedia of Statistics in Behavioral Science, volume 3, (2005), pp. 1451-1462.

\bibitem{Seber03} G. Seber, A. Lee, Linear Regression Analysis, John Wiley \& Sons, Hoboken USA, (2003), Equation (3.32).

\bibitem{Hosman10}  C.A. Hosman, B.B. Hansen, P.W. Holland, The sensitivity of linear regression coefficients' confidence limits to the omission of a confounder, The Annals of Applied Statistics, vol. 4, no. 2, (2010), 849-870, Proposition 2.1.

\bibitem{Myers11} Myers et al, Effects of adjusting for instrumental variables on bias and precision of effect estimates, American Journal of Epidemiology, 174, 11, (2011), 1213-1222.

\bibitem{Rubin09} D. Rubin, Author's reply: Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups?, Statistics in Medicine, 28, 9, (2009), 1420-1423.

\bibitem{Kurth07}D. Kurth, J. Sonis, Assessment and control of confounding in trauma research, Journal of Traumatic Stress, vol. 20, no. 5, (2007), pp. 807–820.

\bibitem{Robins86}  J.M. Robins, S. Greenland, The role of model selection in causal inference from nonexperimental data, American Journal of Epidemiology, vol. 123, no. 3, (1986).

\bibitem{Pearl09} J. Pearl, Causal inference in statistics: an overview, Statistical Surveys, (2009), 96-146.

\bibitem{Cornfield59} Cornfield et al, Smoking and lung cancer: recent evidence and a discussion of some questions, Journal of the National Cancer Institute, 22, (1959), 173-203, Appendix A.

\bibitem{Lin98} D.Y. Lin, B.M. Psaty, R.A. Kronmal, Assessing the sensitivity of regression results to unmeasured confounders in observational studies, Biometrics, 54, (1998), 948-963.

\bibitem{Fisher58} R.A. Fisher, Cigarettes, cancer and statistics, Centennial Rev Arts and Sciences, Michigan State University, 2, 151, (1958). 

\bibitem{Cornfield59b} Cornfield et al, Smoking and lung cancer: recent evidence and a discussion of some questions, Journal of the National Cancer Institute, 22, (1959), 173-203.

\bibitem{Giles89} D. Giles, Coefficient sign changes when restricting regression models under instrumental variables estimation, Oxford Bulletin of Economics and Statistics, 51, (1989), 465-467.

\bibitem{McAleer86}  McAleer et al, A further result on the sign of restricted least-squares estimates, Journal of Econometrics, 32, (1986), 287-290.

\bibitem{Rosenbaum83} P.R. Rosenbaum, D.B. Rubin, Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome, Journal of the Royal Statistical Society, Series B, 11, (1983), 212-218.

\bibitem{Seber03b} G. Seber, A. Lee, Linear Regression Analysis, John Wiley \& Sons, Hoboken USA, (2003), Section 3.6.

\bibitem{Chen1990} Chen et al, Geographic study of mortality, biochemistry, diet and lifestyle in rural China, Epidemiological Studies Unit, Oxford, Revised (1990), Accessed 2009 from \url{http://www.ctsu.ox.ac.uk/~china/monograph/}.

\bibitem{Wagner82}  C.H. Wagner, Simpson's paradox in real life, The American Statistician, 36, 1, (1982), 46–48.

\bibitem{Good87} I.J. Good, Y. Mittal, The amalgamation and geometry of two-by-two contingency tables, The Annals of Statistics, vol. 15, no. 2, (1987), pp. 694-711.

\bibitem{Julious94} S.A. Julious, M.A. Mullee, Confounding and Simpson's paradox. British Medical Journal, 309, 6967, (1994), 1480–1481.

\bibitem{Bickel75} P.J. Bickel, E.A. Hammel, J.W. O'Connell, Sex bias in graduate admissions: data from Berkeley, Science, 187, 4175, (1975), 398–404.

\bibitem{Appleton96} D.R. Appleton, J.M. French, M. Vanderpump, Ignoring a covariate: an example of Simpson's paradox, The American Statistician, volume 50, issue 4, (1996), 340-341.

\bibitem{Cheney01} W. Cheney, Analysis for Applied Mathematics, Springer, New York USA, (2001).

%\bibitem{McNamee04} R. McNamee, Regression modeling and other methods to control confounding, Occupational \& Environmental Medicine, 62, (2004), 500-506, doi:10.1136/oem.2002.001115.

}

\end{thebibliography}

\end{document}