Fast online detection of outliers using leasttrimmed squares regression with nondominated sorting based initial subsets
 Abstract
 Keywords
 References

Abstract
In this paper, a new algorithm is devised for calculating the Least Trimmed of Squares (LTS) estimator. The algorithm consists of two steps. In the first step, the nondominated sorting algorithm is applied on the design matrix of regression data for selecting a clean subset of observations. In the second step, Csteps are iterated to adjust the LTS estimators. The algorithm is fast and precise for small sample sizes, however, the sorting algorithm is computationally inefficient in large datasets. A fast update mechanism can be used in online data with a linear increase in computation time. Some properties of the sorting algorithm are also investigated under some transformations. Results of applying the algorithm on some wellknown datasets and Monte Carlo simulations show that the proposed algorithm is suitable to use in many cases when the computation time is the major objective and a moderate level of precision is enough.

Keywords
Data streams; Initial subset selection; Nondominated sorting; Outlier detection; Robust regression.

References
[1] J. Agullo. New algorithms for computing the least trimmed squares regression estimator. Computational Statistics & Data Analysis, 36(4):425439, 2001.
[2] A. Atkinson. Fast very robust methods for the detection of multiple outliers. Journal of the American Statistical Association, 89(428):13291339, 1994.
[3] A. C. Atkinson and T.C. Cheng. Computing least trimmed squares regression with the forward search. Statistics and Computing, 9(4):251263, 1999.
[4] V. Barnett and T. Lewis. Outliers in statistical data. John Wiley & Sons, 286:293, 1978.
[5] C. Beguin and B. Hulliger. The baconeem algorithm for multivariate outlier detection in incomplete survey data. Survey Methodology, 34(1):91, 2008.
[6] N. Billor, S. Chatterjee, and A. S. Hadi. A reweighted least squares method for robust regression estimation. American journal of mathematical and management sciences, 26(34):229252, 2006.
[7] N. Billor, A. S. Hadi, and P. F. Velleman. Bacon: blocked adaptive computationally efficient outlier nominators. Computational Statistics & Data Analysis, 34(3):279298, 2000.
[8] N. Billor and G. Kiral. A comparison of multiple outlier detection methods for regression data. Communications in Statistics Simulation and Computation, 37(3):521545, 2008.
[9] B. Brown. Statistical uses of the spatial median. Journal of the Royal Statistical Society. Series B (Methodological), pages 2530, 1983.
[10] K. A. Brownlee and K. A. Brownlee. Statistical theory and methodology in science and engineering, volume 150. Wiley New York, 1965.
[11] L. Davies. The asymptotics of rousseeuw's minimum volume ellipsoid estimator. The Annals of Statistics, pages 18281843, 1992.
[12] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan. A fast elitist nondominated sorting genetic algorithm for multiobjective optimization: Nsgaii. Lecture notes in computer science, 1917:849858, 2000.
[13] D. Eddelbuettel and R. Francois. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8):118, 2011.
[14] A. Giloni and M. Padberg. Least trimmed squares regression, least median squares regression, and mathematical programming. Mathematical and Computer Modelling, 35(9):10431060, 2002.
[15] A. S. Hadi and J. S. Simonoff. Procedures for the identification of multiple outliers in linear models. Journal of the American Statistical Association, 88(424):12641272, 1993.
[16] J. Hardin and D. M. Rocke. Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Statistics & Data Analysis, 44(4):625638, 2004.
[17] D. M. Hawkins, D. Bradu, and G. V. Kass. Location of several outliers in multipleregression data using elemental sets. Technometrics, 26(3):197208, 1984.
[18] R. A. Horn. The hadamard product. In Proc. Symp. Appl. Math, volume 40, pages 87169, 1990.
[19] A. M. Leroy and P. J. Rousseeuw. Robust regression and outlier detection. 1987.
[20] R. A. Maronna and V. J. Yohai. Robust estimation of multivariate location and scatter. Encyclopedia of Statistical Sciences, 1998.
[21] R. Nunkesser and O. Morell. An evolutionary algorithm for robust regression. Computational Statistics & Data Analysis, 54(12):32423248, 2010.
[22] D. Pena and V. J. Yohai. The detection of influential subsets in linear regression by using an influence matrix. Journal of the Royal Statistical Society. Series B (Methodological), pages 145156, 1995.
[23] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
[24] P. J. Rousseeuw and K. Van Driessen. An algorithm for positivebreakdown regression based on concentration steps. In Data Analysis, pages 335346. Springer, 2000.
[25] P. J. Rousseeuw and K. Van Driessen. Computing lts regression for large data sets. Data mining and knowledge discovery, 12(1):2945, 2006.
[26] P. J. Rousseeuw and B. C. Van Zomeren. Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411):633639, 1990.
[27] M. H. Satman. A genetic algorithm based modification on the lts algorithm for large data sets. Communications in StatisticsSimulation and Computation, 41(5):644652, 2012.
[28] M. H. Satman. A new algorithm for detecting outliers in linear regression. International Journal of Statistics and Probability, 2(3):p101, 2013.
[29] D. M. Sebert, D. C. Montgomery, and D. A. Rollier. A clustering algorithm for identifying multiple outliers in linear regression. Computational statistics & data analysis, 27(4):461484, 1998.
[30] N. Srinivas and K. Deb. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary computation, 2(3):221248, 1994.
[31] A. J. Stromberg. Computing the exact least median of squares estimate and stability diagnostics in multiple linear regression. SIAM Journal on Scientific Computing, 14(6):12891299, 1993.
[32] K. Tamura and S. Miura. Necessary and sufficient conditions for local and global nondominated solutions in decision problems with multiobjectives. Journal of Optimization Theory and Applications, 28(4):501523,1979.
[33] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Springer, New York, fourth edition, 2002. ISBN 0387954570.

View 
Download  Article ID: 4439 
