Principal Component Regression Modelling with Variational Bayesian Approach to Overcome Multicollinearity at Various Levels of Missing Data Proportion
DOI:
https://doi.org/10.31764/jtam.v6i4.10223
Keywords:
Missing Data, Multicollinearity, Principal Component Analysis, Principal Component Regression, Variational Bayesian PCA.
Abstract
This study aims to model Principal Component Regression (PCR) using Variational Bayesian Principal Component Analysis (VBPCA), with Ordinary Least Squares (OLS) as the method for estimating regression parameters, to overcome multicollinearity at various proportions of missing data. The data used are secondary data and simulated data contaminated with collinearity in the predictor variables, with missing data proportions of 1%, 5%, and 10%. The secondary data are the Human Development Index in Java in 2021, a complete data set without missing values. The results indicate that multicollinearity in the secondary and original data can be overcome effectively: the standard errors of the regression parameters are smaller for PCR using VBPCA, and the relative efficiency is less than 1. VBPCA can handle missing data proportions of less than 10%. A growing proportion of missing data reduces the information retained from the original variables, as evidenced by larger MAPE values and larger parameter estimation bias. The cross-validation (Q^2) value and the adjusted coefficient of determination (adjusted R^2) also decrease as the proportion of missing data increases.
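The PCR idea summarized in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: ordinary PCA (via SVD) stands in for VBPCA, the data are complete and synthetic, relative efficiency is taken as the ratio of the PCR coefficient variance to the OLS coefficient variance, and the function names `ols` and `pcr` are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS with an intercept; returns coefficients and their standard errors."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = A.shape[0] - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta, np.sqrt(np.diag(cov))

def pcr(X, y, n_components):
    """PCR sketch: standardize X, keep the leading principal components,
    regress y on the component scores by OLS, then map the coefficients
    back to the standardized predictors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T              # loadings, shape (p, k)
    scores = Z @ V                       # mutually orthogonal, centered scores
    gamma, se_gamma = ols(scores, y)
    beta_z = V @ gamma[1:]
    # The scores are orthogonal, so the score coefficients are uncorrelated
    # and Var(V @ gamma) has diagonal sum_j V[i, j]**2 * Var(gamma_j).
    se_beta_z = np.sqrt((V ** 2) @ (se_gamma[1:] ** 2))
    return beta_z, se_beta_z

# Synthetic data (an assumption for illustration): x2 is nearly collinear with x1.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + x1 + x2 + x3 + rng.normal(size=n)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
beta_ols, se_ols = ols(Z, y)                   # OLS on all standardized predictors
beta_pcr, se_pcr = pcr(X, y, n_components=2)   # drop the near-null direction

# Assumed definition: variance of the PCR estimate over variance of the OLS estimate.
rel_eff = (se_pcr / se_ols[1:]) ** 2
print("relative efficiency per predictor:", rel_eff)
```

In this sketch, a relative efficiency below 1 for the collinear predictors reflects the smaller standard errors that the abstract reports for PCR. Handling missing values would require replacing the SVD step with a method such as VBPCA, which is not implemented here.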
License
Authors who publish articles in JTAM (Jurnal Teori dan Aplikasi Matematika) agree to the following terms:
- Authors retain copyright of the article and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike (CC BY-SA) License.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).