{"id":269761,"date":"2025-07-17T16:30:18","date_gmt":"2025-07-17T16:30:18","guid":{"rendered":"https:\/\/www.europesays.com\/uk\/269761\/"},"modified":"2025-07-17T16:30:18","modified_gmt":"2025-07-17T16:30:18","slug":"prognosis-of-air-quality-index-and-air-pollution-using-machine-learning-techniques","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/uk\/269761\/","title":{"rendered":"Prognosis of air quality index and air pollution using machine learning techniques"},"content":{"rendered":"<p>This research has progressed through three main stages. The first one included preparing and processing air quality parameters. The second stage comprised calculating the AQI. The third involved developing and evaluating ML models. The adopted framework is presented in Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>.<\/p>\n<p><b id=\"Fig1\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
1<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41598-025-11260-y\/figures\/1\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig1\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/07\/41598_2025_11260_Fig1_HTML.png\" alt=\"figure 1\" loading=\"lazy\" width=\"685\" height=\"201\"\/><\/a><\/p>\n<p>A flowchart illustrating the machine learning approach for AQI prediction.<\/p>\n<p>Data preparation and processing<\/p>\n<p>The air pollution data utilized in this study is accessible online at A Real-time Dataset of Air Pollution Monitoring Generated Using IoT\u2014Mendeley Data<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 22\" title=\"Islam, M. M., Jibon, F. A., Tarek, M. M., Kanchan, M. H. &amp; PerbhezShakil, S. U. A real-time dataset of air quality index monitoring using IoT and machine learning in the perspective of Bangladesh. Data Brief 55, 110578 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#ref-CR22\" id=\"ref-link-section-d4934181e520\" target=\"_blank\" rel=\"noopener\">22<\/a>. This dataset was collected hourly from 1st January 2022 to 31st December 2022 in Gazipur, Bangladesh, using an IoT-based monitoring system. It includes concentration levels of six pollutants: PM2.5, PM10, CO, NO2, SO2, and O3, which were used to compute the Air Quality Index (AQI). The AQI was calculated following the methodology of the U.S. 
Environmental Protection Agency (EPA), using the linear interpolation formula and the national air quality breakpoints adopted by the Department of Environment (DoE) in Bangladesh (see Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Tab1\" target=\"_blank\" rel=\"noopener\">1<\/a>). For each pollutant, the sub-index \\({I}_{p}\\) was calculated using Eq.\u00a0(<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"equation anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Equ1\" target=\"_blank\" rel=\"noopener\">1<\/a>)<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"Gao, L., Cai, C. &amp; Hu, X. M. Air quality prediction using machine learning. In Machine learning in chemical safety and health: fundamentals with applications 267&#x2013;288 (2022) &#010;                  https:\/\/doi.org\/10.1002\/9781119817512.CH11&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#ref-CR23\" id=\"ref-link-section-d4934181e552\" target=\"_blank\" rel=\"noopener\">23<\/a>, and the overall AQI was determined as the maximum of the computed sub-indices for the six pollutants.<\/p>\n<p><b id=\"Tab1\" data-test=\"table-caption\">Table 1 AQI Standards by DoE, Bangladesh.<\/b><\/p>\n<p>$${I}_{p}= \\frac{{I}_{high}-{I}_{low}}{{C}_{high}- {C}_{low}}\\left({C}_{p}- {C}_{low}\\right)+ {I}_{low}$$<\/p>\n<p>\n                    (1)\n                <\/p>\n<p>where:<\/p>\n<p>\\({I}_{p}\\)\u2009=\u2009The AQI sub-index corresponding to pollutant p<\/p>\n<p>\\({C}_{p}\\)\u2009=\u2009The measured concentration of pollutant p<\/p>\n<p>\\({C}_{low}\\)\u2009=\u2009The threshold of the concentration that is\u2009\u2264\u2009\\({C}_{p}\\)<\/p>\n<p>\\({C}_{high}\\)\u2009=\u2009The threshold of the concentration that 
is\u2009\u2265\u2009\\({C}_{p}\\)<\/p>\n<p>\\({I}_{low}\\)\u2009=\u2009The index threshold associated with \\({C}_{low}\\)<\/p>\n<p>\\({I}_{high}\\)\u2009=\u2009The index threshold associated with \\({C}_{high}\\)<\/p>\n<p>To ensure data quality, box plots (Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>) were first used to identify and remove outliers from the raw concentration values of each pollutant. Each box plot displays the distribution of one pollutant using its actual measurement unit: PM2.5 and PM10 (\u03bcg\/m3), CO (mg\/m3), and SO2, NO2, and O3 (\u03bcg\/m3). Following outlier removal, all variables were normalized to a range between 0 and 1 using the min\u2013max scaling technique, which preserved the original distribution shapes while bringing the features onto a comparable scale suitable for machine learning algorithms. The cleaned dataset was split into 80% for training and 20% for testing. To reduce sampling bias and improve generalizability, training and testing were repeated multiple times, and tenfold cross-validation was conducted to evaluate model stability.<\/p>\n<p><b id=\"Fig2\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
2<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41598-025-11260-y\/figures\/2\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig2\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/07\/41598_2025_11260_Fig2_HTML.png\" alt=\"figure 2\" loading=\"lazy\" width=\"685\" height=\"620\"\/><\/a><\/p>\n<p>Box plots of input and output parameters: (<b>a<\/b>) sulfur dioxide, (<b>b<\/b>) nitrogen dioxide, (<b>c<\/b>) ozone, (<b>d<\/b>) carbon monoxide, (<b>e<\/b>) particulate matter (D\u2009\u2264\u20092.5 \u00b5m), and (<b>f<\/b>) particulate matter (2.5 \u00b5m\u2009\u2264\u2009D\u2009\u2264\u200910 \u00b5m).<\/p>\n<p>To identify the most influential input variables for AQI prediction, a Random Forest model was employed for feature importance evaluation. This technique captures nonlinear relationships and interactions among variables, enabling a robust, data-driven approach to feature selection. The analysis revealed that PM2.5 had the highest importance score (12.6654), followed by PM10 (1.8387) and CO (1.7082). Although PM2.5 and PM10 exhibited a moderate correlation (r\u2009=\u20090.3014), computed using the Pearson correlation coefficient as defined in Eq.\u00a0(<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"equation anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Equ2\" target=\"_blank\" rel=\"noopener\">2<\/a>), both were retained due to their distinct and substantial contributions to AQI prediction. In contrast, NO2 (0.7395), SO2 (0.6767), and O3 (0.6499) demonstrated lower importance and were excluded from the final model. 
A bar chart summarizing these feature importance scores is presented in Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Fig3\" target=\"_blank\" rel=\"noopener\">3<\/a> to make the variable selection process clear, transparent, and reproducible, in line with best practices in machine learning-based environmental modeling.<\/p>\n<p><b id=\"Fig3\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 3<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41598-025-11260-y\/figures\/3\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig3\" src=\"https:\/\/www.europesays.com\/uk\/wp-content\/uploads\/2025\/07\/41598_2025_11260_Fig3_HTML.png\" alt=\"figure 3\" loading=\"lazy\" width=\"685\" height=\"393\"\/><\/a><\/p>\n<p>A bar chart for Random Forest variable importance.<\/p>\n<p>$$r= \\frac{\\sum_{i=1}^{n}\\left({x}_{i}- \\overline{x}\\right)\\left({y}_{i}- \\overline{y}\\right)}{\\sqrt{\\sum_{i=1}^{n}{\\left({x}_{i}- \\overline{x}\\right)}^{2}} \\cdot \\sqrt{\\sum_{i=1}^{n}{\\left({y}_{i}- \\overline{y}\\right)}^{2}}}$$<\/p>\n<p>\n                    (2)\n                <\/p>\n<p>where:<\/p>\n<p>\\(r\\)\u2009=\u2009Pearson correlation coefficient<\/p>\n<p>\\({x}_{i}\\) and \\({y}_{i}\\)\u2009=\u2009Individual values of pollutants x and y<\/p>\n<p>\\(\\overline{x}\\) and \\(\\overline{y}\\)\u2009=\u2009Means of x and y, respectively<\/p>\n<p>n\u2009=\u2009Number of data points<\/p>\n<p>Developing and evaluating ML models<\/p>\n<p>The Regression Learner app is a graphical interface provided within MATLAB\u2019s Statistics and Machine Learning Toolbox<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"Paluszek, M., Thomas, S. 
&amp; Ham, E. Practical MATLAB deep learning: A projects-based approach. In Practical MATLAB Deep Learning: A Projects-Based Approach 1&#x2013;329 (2022) &#010;                  https:\/\/doi.org\/10.1007\/978-1-4842-7912-0\/COVER&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#ref-CR24\" id=\"ref-link-section-d4934181e1203\" target=\"_blank\" rel=\"noopener\">24<\/a>. This tool simplifies the development and analysis of regression models for predictive modeling tasks. Its interface supports interactive data exploration and analysis, forecasting model construction, algorithm performance assessment, and prediction. This study uses five regression techniques: GPR, ER, SVM, RT, and KAR. Each model was selected based on its theoretical suitability and previous success in environmental prediction tasks. GPR provides probabilistic outputs and robustness to noise; ER enhances generalization by aggregating multiple base learners; SVM is well-suited for high-dimensional data spaces; RT offers interpretability and simplicity; and KAR strengthens the model\u2019s ability to capture complex, nonlinear relationships.<\/p>\n<p>Model training was conducted using standardized input variables (PM2.5, CO, and PM10), and hyperparameters were tuned to optimize performance. To ensure robustness and minimize overfitting, all models were evaluated with tenfold cross-validation, a method that systematically partitions the data to reduce model bias and variance. Performance was evaluated using three established regression metrics: R2, RMSE, and MAE. 
The detailed configurations and optimized hyperparameters applied for each model are summarized in Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Tab2\" target=\"_blank\" rel=\"noopener\">2<\/a>.<\/p>\n<p><b id=\"Tab2\" data-test=\"table-caption\">Table 2 Applied Hyperparameters during the training phase.<\/b><\/p>\n<p>Performance evaluation of machine learning models<\/p>\n<p>Model evaluation is critical when using ML to predict the AQI, and the Regression Learner app reports three main metrics for this purpose: mean absolute error (MAE), root mean square error (RMSE), and the coefficient of determination (R2). These statistical indicators are defined by the following equations:<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    <p>(a)<\/p>\n<p>MAE.<\/p>\n<\/li>\n<\/ol>\n<p>MAE measures the magnitude of the errors in the forecast dataset regardless of their direction. It is the average of the absolute deviations between observed and predicted values across the test samples, as given in Eq.\u00a0(<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"equation anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Equ3\" target=\"_blank\" rel=\"noopener\">3<\/a>):<\/p>\n<p>$$MAE=\\frac{1}{n}\\sum_{i=1}^{n}\\left|{x}_{i}-{y}_{i}\\right|$$<\/p>\n<p>\n                    (3)\n                <\/p>\n<p>where:<\/p>\n<p>\\(n\\) = Number of data points.<\/p>\n<p>\\({x}_{i}\\) = Actual value.<\/p>\n<p>\\({y}_{i}\\) = Predicted value.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    <p>(b)<\/p>\n<p>RMSE.<\/p>\n<\/li>\n<\/ol>\n<p>The RMSE is also used to estimate the magnitude of the errors. 
It is the square root of the mean of the squared differences between the actual and predicted values, as calculated in Eq.\u00a0(<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"equation anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Equ4\" target=\"_blank\" rel=\"noopener\">4<\/a>):<\/p>\n<p>$$RMSE=\\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}{\\left({x}_{i}-{y}_{i}\\right)}^{2}}$$<\/p>\n<p>\n                    (4)\n                <\/p>\n<p>where:<\/p>\n<p>\\({x}_{i}\\) = Actual value.<\/p>\n<p>\\({y}_{i}\\) = Predicted value.<\/p>\n<p>n\u2009=\u2009Number of data points.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    <p>(c)<\/p>\n<p>R2.<\/p>\n<\/li>\n<\/ol>\n<p>The coefficient of determination measures how well a model accounts for the variance in the observed data. Specifically, it quantifies the proportion of the total variability in the actual values that the model\u2019s predictions explain. Its values typically range between 0 and 1, with higher values indicating better model performance; it can fall below 0 when a model fits worse than simply predicting the mean of the observations. An R2 value near 1 indicates that the model\u2019s predictions closely match the actual data values. 
It can be calculated as shown in Eq.\u00a0(<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"equation anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-11260-y#Equ5\" target=\"_blank\" rel=\"noopener\">5<\/a>):<\/p>\n<p>$${R}^{2}=1-\\frac{\\sum_{i=1}^{n}{\\left({x}_{i}-{y}_{i}\\right)}^{2}}{\\sum_{i=1}^{n}{\\left({x}_{i}- \\overline{x}\\right)}^{2}}$$<\/p>\n<p>\n                    (5)\n                <\/p>\n<p>where:<\/p>\n<p>\\({x}_{i}\\) = Actual values.<\/p>\n<p>\\({y}_{i}\\) = Predicted values.<\/p>\n<p>\\(\\overline{x}\\) = The mean of actual values.<\/p>\n<p>n\u2009=\u2009Number of data points.<\/p>\n","protected":false},"excerpt":{"rendered":"This research has progressed through three main stages. The first one included preparing and processing air quality parameters.&hellip;\n","protected":false},"author":2,"featured_media":269762,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3843],"tags":[728,79421,2202,3965,3966,70,16,15],"class_list":{"0":"post-269761","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-environment","8":"tag-environment","9":"tag-environmental-chemistry","10":"tag-environmental-impact","11":"tag-humanities-and-social-sciences","12":"tag-multidisciplinary","13":"tag-science","14":"tag-uk","15":"tag-united-kingdom"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@uk\/114869530584907985","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/269761","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp
\/v2\/comments?post=269761"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/posts\/269761\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media\/269762"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/media?parent=269761"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/categories?post=269761"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/uk\/wp-json\/wp\/v2\/tags?post=269761"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}