Principal Component Regression Modelling with Variational Bayesian Approach to Overcome Multicollinearity at Various Levels of Missing Data Proportion

Authors

  • Nabila Azarin Balqis, Department of Statistics, University of Brawijaya
  • Suci Astutik, Department of Statistics, University of Brawijaya
  • Solimun Solimun, Department of Statistics, University of Brawijaya

DOI:

https://doi.org/10.31764/jtam.v6i4.10223

Keywords:

Missing Data, Multicollinearity, Principal Component Analysis, Principal Component Regression, Variational Bayesian PCA.

Abstract

This study models Principal Component Regression (PCR) using Variational Bayesian Principal Component Analysis (VBPCA), with Ordinary Least Squares (OLS) as the method for estimating the regression parameters, to overcome multicollinearity at various proportions of missing data. The data are secondary data and simulated data whose predictor variables are contaminated with collinearity, with missing data proportions of 1%, 5%, and 10%. The secondary data are the Human Depth Index in Java in 2021, a complete dataset without missing values. The results indicate that multicollinearity in the secondary and original data can be optimally overcome: PCR using VBPCA yields smaller standard errors for the regression parameters and a relative efficiency value of less than 1. VBPCA can handle missing data proportions of less than 10%. Missing data reduce the information retained from the original variables, as evidenced by larger MAPE values and larger parameter estimation bias. The cross-validation value (Q^2) and the adjusted coefficient of determination (adjusted R^2) also decrease as the proportion of missing data increases.
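The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the simulated data, the function names `impute_low_rank` and `pcr_fit`, and the 5% missing rate are assumptions chosen for illustration. In particular, the imputation step is a plain iterative low-rank SVD fill used as a simple stand-in for VBPCA, and relative efficiency is assumed here to be the ratio of the PCR coefficient variance to the OLS coefficient variance.

```python
import numpy as np

def impute_low_rank(X, k, n_iter=50):
    """Fill NaNs by iterating a rank-k SVD reconstruction (a stand-in, not VBPCA)."""
    mask = np.isnan(X)
    filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        approx = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        filled[mask] = approx[mask]  # only the missing cells are updated
    return filled

def pcr_fit(X, y, k):
    """PCR: OLS of centered y on the first k component scores of standardized X."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    V, lam = Vt[:k].T, s[:k] ** 2      # loadings (p x k) and eigenvalues of Xs'Xs
    Z = Xs @ V                         # component scores, mutually orthogonal
    gamma = (Z.T @ yc) / lam           # OLS on the scores, since Z'Z = diag(lam)
    beta = V @ gamma                   # coefficients on the standardized predictors
    resid = yc - Z @ gamma
    sigma2 = resid @ resid / (n - k - 1)
    se = np.sqrt(sigma2 * ((V ** 2) / lam).sum(axis=1))
    r2 = 1 - (resid @ resid) / (yc @ yc)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return beta, se, adj_r2

# Simulated predictors: x2 is nearly collinear with x1 (assumed design).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(50, 10, n)
x2 = x1 + rng.normal(0, 0.5, n)
x3 = rng.normal(30, 5, n)
X_full = np.column_stack([x1, x2, x3])
y = 1 + 0.2 * x1 + 0.2 * x2 + 0.5 * x3 + rng.normal(0, 1, n)

# Remove 5% of the predictor cells at random, then impute them.
mask = rng.random(X_full.shape) < 0.05
X_miss = X_full.copy()
X_miss[mask] = np.nan
X_imp = impute_low_rank(X_miss, k=2)

# MAPE between imputed and true values on the cells that were removed.
mape = 100 * np.mean(np.abs((X_full[mask] - X_imp[mask]) / X_full[mask]))

# PCR with 2 components versus OLS (PCR with all 3 components is OLS).
beta_pcr, se_pcr, adj_r2 = pcr_fit(X_imp, y, k=2)
_, se_ols, _ = pcr_fit(X_imp, y, k=3)
rel_eff = (se_pcr / se_ols) ** 2       # assumed definition: variance ratio

print("MAPE of imputed cells (%):", mape)
print("Relative efficiency (PCR vs OLS):", rel_eff)
print("Adjusted R^2 of PCR:", adj_r2)
```

Keeping all components in `pcr_fit` reproduces OLS on the standardized predictors, so the same function supplies both sides of the comparison. Repeating the run with missing rates of 1%, 5%, and 10% would mirror the comparison described in the abstract, though the figures from this sketch are not expected to match the published results.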

 

Published

2022-10-07

Section

Articles