{"id":140497,"date":"2025-08-12T19:54:13","date_gmt":"2025-08-12T19:54:13","guid":{"rendered":"https:\/\/www.europesays.com\/us\/140497\/"},"modified":"2025-08-12T19:54:13","modified_gmt":"2025-08-12T19:54:13","slug":"ai-big-model-and-text-mining-driven-framework-for-urban-greening-policy-analysis","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/us\/140497\/","title":{"rendered":"AI big model and text mining-driven framework for urban greening policy analysis"},"content":{"rendered":"<p>Framework for intelligent analysis of greening policy texts based on text mining and AI big models<\/p>\n<p>In this study, we developed an intelligent analysis framework for greening policy texts based on text mining and AI big models, aiming to improve the rationality and practicality of policy analysis. The framework consists of the following seven main components (Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#Fig1\" target=\"_blank\" rel=\"noopener\">1<\/a>). 1) Automated timed data collection and preprocessing, which allows automatic collection of greening policy texts from government gazettes and related agencies to ensure real-time data updates and consistency for subsequent analysis. 2) Policy keyword extraction, which employs NLP techniques to extract keywords and phrases from the texts to reveal the core content of the policy. 3) Policy topic categorization, which automatically identifies and classifies the main topics in policy documents using topic modeling techniques, enabling quick understanding of the policy focus. 4) Extraction of greening core indicators, which identifies and extracts key greening planning indicators, such as green space area and greenway construction, from the policy documents and is key to assessing the effectiveness of policy implementation. 
5) Policy AI interpretation, which uses AI big models to deeply interpret the policy text, analyze its main goals, and predict the possible outcomes of policy implementation. 6) Real-time policy tracking, which collects dynamic data related to greening policies to provide real-time feedback to policy makers. 7) Visualization of the intelligent analysis results, which displays all analysis results through user-friendly interfaces, including charts, timelines, and maps, to help policy makers and researchers intuitively understand policy trends and specific objectives. Overall, the framework integrates seven functions to create a multi-level system for analyzing urban greening policies across macro, meso, and micro dimensions, addressing the time-consuming nature and interpretive biases of traditional methods in processing large-scale policy texts. These functions are interconnected through multidimensional analysis logic: (1) automated data collection and preprocessing establish the data foundation; (2) keyword extraction and thematic categorization identify macro trends and meso-level priorities of greening policies; (3) core indicator extraction and AI-driven interpretation assess micro-level indicators and policy outcomes; and (4) visualization integrates multi-level results into intuitive decision-making insights. This multi-layered design ensures comprehensive and systematic analysis while addressing specific policy needs, thereby enhancing the scientific rigor, timeliness, and practicality of policy formulation.<\/p>\n<p><b id=\"Fig1\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
1<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41598-025-05842-z\/figures\/1\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig1\" src=\"https:\/\/www.europesays.com\/us\/wp-content\/uploads\/2025\/08\/41598_2025_5842_Fig1_HTML.png\" alt=\"figure 1\" loading=\"lazy\" width=\"685\" height=\"448\"\/><\/a><\/p>\n<p>Framework for intelligent analysis of greening policy texts based on AI big models and text mining.<\/p>\n<p>Study area<\/p>\n<p>Wuhan was chosen as the study area for this study due to its unique urban characteristics and environmental challenges. As an important industrial and commercial center in central China, Wuhan is characterized by abundant water resources and vast green spaces, but it is also faced with great environmental pressures due to rapid urbanization<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 37\" title=\"Lyu, F. &amp; Zhang, L. Using multi-source big data to understand the factors affecting urban park use in Wuhan. Urban For. Urban Green. 43, 126367 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR37\" id=\"ref-link-section-d27007064e728\" target=\"_blank\" rel=\"noopener\">37<\/a>. In recent years, urban heat island effect and air pollution have become increasingly serious<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 38\" title=\"Chen, H., Deng, Q., Zhou, Z., Ren, Z. &amp; Shan, X. Influence of land cover change on spatio-temporal distribution of urban heat island &#x2014;a case in Wuhan main urban area. Sustain. Cities Soc. 
79, 103715 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR38\" id=\"ref-link-section-d27007064e732\" target=\"_blank\" rel=\"noopener\">38<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 39\" title=\"Bi, S., Dai, F., Chen, M. &amp; Xu, S. A new framework for analysis of the morphological spatial patterns of urban green space to reduce PM2.5 pollution: A case study in Wuhan, China. Sustain. Cities Soc. 82, 103900 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR39\" id=\"ref-link-section-d27007064e735\" target=\"_blank\" rel=\"noopener\">39<\/a>. Official data indicate that Wuhan\u2019s average annual temperature increased significantly from 1951 to 2018, with a warming rate of 0.30\u00a0\u00b0C per decade<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 40\" title=\"Hubei Provincial Department of Ecology and Environment. Notice on Printing and Issuing the Action Plan for Adapting to Climate Change in Hubei Province (2023&#x2013;2035). (2023).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR40\" id=\"ref-link-section-d27007064e739\" target=\"_blank\" rel=\"noopener\">40<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 41\" title=\"Zheng, X. et al. Temporal characteristics of extreme high temperatures in Wuhan since 1881. Clim. Res. 92, 1&#x2013;20 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR41\" id=\"ref-link-section-d27007064e742\" target=\"_blank\" rel=\"noopener\">41<\/a>. 
The average annual PM2.5 concentration in 2023 was 38\u00a0\u03bcg\/m3, exceeding the national ambient air quality secondary standard by 9% and rising 8.6% from 2022<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 42\" title=\"Wuhan Municipal Bureau of Ecology and Environment. Bulletin on the Ecological Environment of Wuhan City in 2023. (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR42\" id=\"ref-link-section-d27007064e748\" target=\"_blank\" rel=\"noopener\">42<\/a>, highlighting the urgent need for effective greening policies. In this context, Wuhan government has implemented several greening and ecological policies<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wuhan Municipal People&#x2019;s Government. Notice of the Municipal People&#x2019;s Government on Issuing the Ecological Compensation Measures for Wuhan Wetland Nature Reserve. &#10;                  https:\/\/www.wuhan.gov.cn\/zwgk\/xxgk\/zfwj\/gfxwj\/202201\/t20220111_1893695.shtml&#10;                  &#10;                 (2021).\" href=\"#ref-CR43\" id=\"ref-link-section-d27007064e753\">43<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wuhan Municipal People&#x2019;s Government. Notice of the Municipal People&#x2019;s Government on Issuing the 14th Five Year Plan for the Protection and Development of Wuhan East Lake Ecological Tourism Scenic Area. 
&#10;                  https:\/\/www.wuhan.gov.cn\/zwgk\/xxgk\/zfwj\/szfwj\/202206\/t20220621_1990810.shtml&#10;                  &#10;                 (2022).\" href=\"#ref-CR44\" id=\"ref-link-section-d27007064e753_1\">44<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" title=\"Wuhan Municipal People&#x2019;s Government. Notice of the Municipal People&#x2019;s Government on the Issuance of the Work Program for the Creation of a National Eco-Garden City in Wuhan (2022&#x2013;2023). &#10;                  https:\/\/www.wuhan.gov.cn\/zwgk\/xxgk\/zfwj\/szfwj\/202210\/t20221021_2064439.shtml&#10;                  &#10;                 (2022).\" href=\"#ref-CR45\" id=\"ref-link-section-d27007064e753_2\">45<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 46\" title=\"Wuhan Municipal Landscape Gardens and Forestry Bureau. Notice of the Municipal Bureau of Landscape Architecture and Forestry on Issuing the Key Points for Landscape Architecture and Forestry Work in the City in 2024. &#010;                  https:\/\/ylj.wuhan.gov.cn\/zwgk\/zcwj\/qtwj\/202404\/t20240408_2385743.shtml&#010;                  &#010;                 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR46\" id=\"ref-link-section-d27007064e756\" target=\"_blank\" rel=\"noopener\">46<\/a> to increase the green space area, improve the ecological environment, and enhance the sustainable development of the city. These policies make Wuhan an ideal city to study the impact and effectiveness of urban greening policies and provide a good basis for this study. 
Therefore, this study explores the rationality, design effectiveness, and core content of these policies to facilitate informed urban greening policymaking in the future.<\/p>\n<p>Automatic data collection and pre-processing<\/p>\n<p>In the automatic data collection and pre-processing stage, this study focuses on the systematic collection and organization of urban greening policy texts in Wuhan. The data were mainly obtained from two official platforms: the Wuhan Municipal People\u2019s Government Portal (<a href=\"https:\/\/www.wuhan.gov.cn\/\" target=\"_blank\" rel=\"noopener\">https:\/\/www.wuhan.gov.cn\/<\/a>) and the official website of the Wuhan Municipal Bureau of Landscape and Forestry (<a href=\"https:\/\/ylj.wuhan.gov.cn\/\" target=\"_blank\" rel=\"noopener\">https:\/\/ylj.wuhan.gov.cn\/<\/a>). Searching these websites with the keywords \u201cgreening program\u201d, \u201cgreening\u201d, and \u201cparks\u201d precisely locates the relevant policy documents. The specific steps are as follows.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (1)<\/p>\n<p>Automated data collection. Automated crawler scripts were developed in Python to crawl the relevant text data from the specified websites. The captured text data include 12 greening policy texts (2009\u20132024) for thematic analysis and 10 additional greening-related activity records from the past six months for real-time information delivery (Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"table anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#Tab1\" target=\"_blank\" rel=\"noopener\">1<\/a>). 
The acquired textual data were subsequently stored in a MySQL database, with records including document source, year, and textual content, to ensure that the data are organized and searchable.<\/p>\n<\/li>\n<\/ol>\n<p><b id=\"Tab1\" data-test=\"table-caption\">Table 1 Policy documents related to greening in Wuhan.<\/b><\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (2)<\/p>\n<p>Data pre-processing. The following steps are taken to improve the accuracy of the analysis.<\/p>\n<\/li>\n<\/ol>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>Text cleaning: Python\u2019s standard library and the re module are used to remove HTML tags, extra spaces, special characters, and numbers, as these are irrelevant to most text analysis.<\/p>\n<\/li>\n<li>\n<p>Segmentation processing: Text segmentation splits continuous text into individually manageable lexical units. Chinese text segmentation is particularly critical because Chinese writing, unlike English, does not separate words with spaces. This study utilizes a specialized Chinese word segmentation tool, Jieba, to ensure efficient and accurate recognition of Chinese words<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 33\" title=\"Dai, Y., Xu, L., Zhang, X., Fu, Y. &amp; Dong, W. Promoting sustainable development: A study of China&#x2019;s bicycle sharing industry policies based on text analysis. Res. Transp. Bus. Manag. 52, 101085 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR33\" id=\"ref-link-section-d27007064e1519\" target=\"_blank\" rel=\"noopener\">33<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 47\" title=\"Xu, H. &amp; Lv, Y. Mining and application of tourism online review text based on natural language processing and text classification technology. Wirel. Commun. Mob. Comput. 
2022, 1&#x2013;13 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR47\" id=\"ref-link-section-d27007064e1522\" target=\"_blank\" rel=\"noopener\">47<\/a>.<\/p>\n<\/li>\n<li>\n<p>Removal of stop words: Common stop words that do not contribute to the analysis, such as \u201cand\u201d and \u201cis\u201d, are removed, and a customized stop-word list for the greening domain is used to exclude specialized but uninformative words, such as \u201chectare\u201d and \u201carea\u201d.<\/p>\n<\/li>\n<li>\n<p>These preprocessing steps optimize the dataset and provide clean, accurate input data for subsequent topic modeling and keyword extraction.<\/p>\n<\/li>\n<\/ul>\n<p>Policy keyword extraction<\/p>\n<p>TF-IDF (Term Frequency-Inverse Document Frequency) is a common weighting technique used in information retrieval and text mining to assess how important a word is to a document within a collection or corpus<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 48\" title=\"Kim, S.-W. &amp; Gil, J.-M. Research paper classification systems based on TF-IDF and LDA schemes. Hum. Cent. Comput. Inf. Sci. 9, 30 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR48\" id=\"ref-link-section-d27007064e1550\" target=\"_blank\" rel=\"noopener\">48<\/a>. In this study, the TF-IDF model was used to extract keywords from policy texts. The algorithm involves two concepts: term frequency (TF) and inverse document frequency (IDF).<\/p>\n<p>TF represents the occurrence frequency of a word in a document, normalized by the total number of words in the document. For the word \\(t\\) in document \\(d\\), its TF is calculated as follows<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 49\" title=\"Yao, L., Pengzhou, Z. &amp; Chi, Z. 
Research on news keyword extraction technology based on TF-IDF and TextRank. In: 2019 IEEE\/ACIS 18th International Conference on Computer and Information Science (ICIS) 452&#x2013;455 (IEEE, Beijing, China, 2019). &#010;                  https:\/\/doi.org\/10.1109\/ICIS46139.2019.8940293&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR49\" id=\"ref-link-section-d27007064e1579\" target=\"_blank\" rel=\"noopener\">49<\/a>:<\/p>\n<p>$$\\begin{array}{*{20}c} {TF_{ij} = \\frac{{{\\text{n}}_{ij} }}{{\\mathop \\sum \\nolimits_{{\\text{k}}} {\\text{n}}_{kj} }}} \\\\ \\end{array}$$<\/p>\n<p>\n                    (1)\n                <\/p>\n<p>where \\({\\text{n}}_{ij}\\) is the number of occurrences of the word \\(t\\) in document \\(d\\), and the denominator is the sum of occurrences of all words in document \\(d\\).<\/p>\n<p>IDF measures the general importance of a word in a document collection. To calculate the IDF value of a particular word, the total number of documents in the document collection is first divided by the number of documents containing the word, and then the natural logarithm is taken to obtain the IDF value for the word:<\/p>\n<p>$$\\begin{array}{*{20}c} {IDF_{i} = \\log \\frac{\\left| D \\right|}{{\\left| {\\left\\{ {j:t_{i} \\in d_{j} } \\right\\}} \\right|}}} \\\\ \\end{array}$$<\/p>\n<p>\n                    (2)\n                <\/p>\n<p>where \\(\\left| D \\right|\\) represents the total number of texts in the document collection and \\(\\left| {\\left\\{ {j:t_{i} \\in d_{j} } \\right\\}} \\right|\\) denotes the number of documents containing the particular word \\(t\\). 
To prevent the denominator from becoming zero when \\(t\\) does not appear in any document, one is usually added to the denominator to ensure numerical stability and avoid division-by-zero errors.<\/p>\n<p>Finally, TF-IDF is obtained by multiplying the two:<\/p>\n<p>$$\\begin{array}{*{20}c} {TF{\\text{-}}IDF_{ij} = TF_{ij} \\times IDF_{i} } \\\\ \\end{array}$$<\/p>\n<p>\n                    (3)\n                <\/p>\n<p>The TF-IDF score indicates the importance of a word in a document, with a higher value indicating that the word is more distinctive within the document.<\/p>\n<p>The TF-IDF model was chosen because it can effectively distinguish between high-frequency words and key words in a document, and it is particularly powerful for policy texts containing specialized terminology, where it can accurately identify highly indicative words. This model improves the accuracy and depth of text analysis by attenuating the influence of common words and emphasizing key thematic vocabulary<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 50\" title=\"Zhang, W., Yoshida, T. &amp; Tang, X. A comparative study of TF*IDF, LSI and multi-words for text classification. Expert Syst. Appl. 38, 2758&#x2013;2765 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR50\" id=\"ref-link-section-d27007064e1733\" target=\"_blank\" rel=\"noopener\">50<\/a>. Specifically, this study calculates the TF-IDF value of each word in a policy document and selects the top 30 words with the highest TF-IDF values as the keywords of the document, which are used to reveal the policy content and tendency in depth. 
This method not only improves the accuracy of keyword extraction but also makes the policy text analysis more systematic.<\/p>\n<p>Policy topic classification<\/p>\n<p>Latent Dirichlet Allocation (LDA) is a statistical model with a three-layer structure of words, topics, and documents, designed to identify implicit topics in large-scale document sets. In this study, we used LDA as the core tool for policy topic classification. The model assumes that each document is generated by a set of implicit topics, each defined by a probability distribution over words<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Blei, D. M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993&#x2013;1022 (2003).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR51\" id=\"ref-link-section-d27007064e1746\" target=\"_blank\" rel=\"noopener\">51<\/a>. It can reveal the statistical patterns of texts and deeply parse the structure and topic content of policy texts<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Song, C., Liu, Z., Yuan, M. &amp; Zhao, C. From text to effectiveness: Quantifying green industrial policies in China. J. Clean. Prod. 446, 141445 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR26\" id=\"ref-link-section-d27007064e1750\" target=\"_blank\" rel=\"noopener\">26<\/a>. The LDA model can be applied to a wide range of policy documents. 
Although the framework supports dynamic analysis, we selected traditional LDA with topic similarity analysis (Step 4) over Dynamic Topic Modeling (DTM) to track policy topic evolution, as LDA is more computationally efficient and balances complexity and dynamism effectively<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 52\" title=\"Lei, L., Qiao, G., Qimin, C. &amp; Qitao, L. LDA boost classification: Boosting by topics. EURASIP J. Adv. Signal Process. 2012, 233 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR52\" id=\"ref-link-section-d27007064e1754\" target=\"_blank\" rel=\"noopener\">52<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 53\" title=\"Chen, L.-C. An effective LDA-based time topic model to improve blog search performance. Inf. Process. Manage. 53, 1299&#x2013;1319 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR53\" id=\"ref-link-section-d27007064e1757\" target=\"_blank\" rel=\"noopener\">53<\/a>. Additionally, LDA reduces computational costs with limited data, minimizing the overfitting risk associated with DTM<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Blei, D. M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993&#x2013;1022 (2003).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR51\" id=\"ref-link-section-d27007064e1761\" target=\"_blank\" rel=\"noopener\">51<\/a>.<\/p>\n<p>Topic modeling with LDA has significant advantages over traditional text analysis methods such as frequency statistics or manual classification. First, it can automatically extract topics through probability distribution, which significantly reduces human error and subjectivity. 
Second, the model can handle unstructured text and reveal deep semantic connections by simulating the text generation process. In addition, LDA supports the categorization of multi-topic documents, which can accurately reflect the complexity of documents and provide a dynamic and precise analytical perspective for complex policy texts<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Song, C., Liu, Z., Yuan, M. &amp; Zhao, C. From text to effectiveness: Quantifying green industrial policies in China. J. Clean. Prod. 446, 141445 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR26\" id=\"ref-link-section-d27007064e1768\" target=\"_blank\" rel=\"noopener\">26<\/a>.<\/p>\n<p>The operation process of the LDA model can be divided into the following steps. First, the text is preprocessed (e.g., word segmentation and stop-word removal) to construct a document-term matrix. Next, topics are randomly assigned to the words in each document, and based on these initial topic assignments, iterative computations are performed to adjust the probability distribution of each word across topics and the distribution of each topic across documents. 
This iterative process relies on two key Dirichlet distributions: for each topic \\(k\\), the distribution \\(P\\left( {w\\left| k \\right.} \\right)\\) of the word \\(w\\) is governed by the prior distribution \\(Dir\\left( \\beta \\right)\\), while the topic distribution \\(P\\left( {k\\left| d \\right.} \\right)\\) of document \\(d\\) is controlled by the prior distribution \\(Dir\\left( \\alpha \\right)\\), where \\(\\alpha\\) and \\(\\beta\\) are model hyperparameters that control how dispersed topics are within a document and how dispersed words are within a topic, respectively<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 54\" title=\"Jelodar, H. et al. Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimed. Tools Appl. 78, 15169&#x2013;15211 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR54\" id=\"ref-link-section-d27007064e1875\" target=\"_blank\" rel=\"noopener\">54<\/a>.<\/p>\n<p>By iterating until convergence, the LDA model outputs the topic distribution of each document and the lexical distribution of each topic. The results not only reveal the underlying topic structure in the document set but also improve understanding of the implicit semantic hierarchy in the text. By analyzing these probability distributions, we can characterize each topic in detail and thus further explore the key messages and trends of the policy texts. The specific steps are as follows.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (1)<\/p>\n<p>Data categorization: According to the needs of the study, the greening policy texts are categorized into two classes for different analysis purposes. The first category of data is used for topic evolution analysis, and the policy texts were segmented primarily according to the distribution of document counts over time. 
Specifically, the data are segmented into four phases: 2009\u20132012, 2013\u20132015, 2016\u20132021, and 2022\u20132024, with each phase including three policy texts. This arrangement not only ensures data consistency and a sufficient sample size within each phase, but also allows comparison of policy changes and development trends between different phases. The second category of data is used to reveal the distribution of annual topics in recent years, including separate analyses of the data for 2022, 2023, and 2024 to explore in detail the policy priorities and changes in each year.<\/p>\n<\/li>\n<li>\n                    (2)<\/p>\n<p>Determination of the optimal number of topics \\(k\\): To determine the optimal number of topics \\(k\\), it is necessary to calculate the consistency (coherence) score and perplexity of the topic model, both of which are effective measures of model performance. Perplexity measures the model\u2019s ability to predict new text, reflecting the model\u2019s generalization ability, while the consistency score assesses the differentiation between the topics generated by the model, i.e., whether the vocabulary of each topic is distinctive and internally relevant. A lower perplexity typically indicates that the model has better predictive accuracy and internal consistency, whereas a high consistency score indicates clear boundaries between topics, with the vocabulary contained in each topic being highly relevant and unambiguous<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Song, C., Liu, Z., Yuan, M. &amp; Zhao, C. From text to effectiveness: Quantifying green industrial policies in China. J. Clean. Prod. 
446, 141445 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR26\" id=\"ref-link-section-d27007064e1927\" target=\"_blank\" rel=\"noopener\">26<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 55\" title=\"O&#x2019;Callaghan, D., Greene, D., Carthy, J. &amp; Cunningham, P. An analysis of the coherence of descriptors in topic modeling. Expert Syst. Appl. 42, 5645&#x2013;5657 (2015).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR55\" id=\"ref-link-section-d27007064e1930\" target=\"_blank\" rel=\"noopener\">55<\/a>. Although both perplexity and consistency score can be used to determine the \\(k\\), we found that consistency score is more effective in evaluating the model as it can more directly and clearly reflect the optimal number of topics, which better meets our research needs of analyzing policy texts and revealing policy topics. Therefore, the \\(k\\) with the highest consistency score was chosen as the basis for topic categorization in this study (Fig.\u00a0<a data-track=\"click\" data-track-label=\"link\" data-track-action=\"figure anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#Fig2\" target=\"_blank\" rel=\"noopener\">2<\/a>).<\/p>\n<\/li>\n<\/ol>\n<p><b id=\"Fig2\" class=\"c-article-section__figure-caption\" data-test=\"figure-caption-text\">Fig. 
2<\/b><a class=\"c-article-section__figure-link\" data-test=\"img-link\" data-track=\"click\" data-track-label=\"image\" data-track-action=\"view figure\" href=\"https:\/\/www.nature.com\/articles\/s41598-025-05842-z\/figures\/2\" rel=\"nofollow noopener\" target=\"_blank\"><img decoding=\"async\" aria-describedby=\"Fig2\" src=\"https:\/\/www.europesays.com\/us\/wp-content\/uploads\/2025\/08\/41598_2025_5842_Fig2_HTML.png\" alt=\"figure 2\" loading=\"lazy\" width=\"685\" height=\"333\"\/><\/a><\/p>\n<p>Thematic consistency scores of greening policies in Wuhan.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (3)<\/p>\n<p>Topic classification with LDA: After determination of \\(k\\), the LDA model was parameterized. Supported by prior validations and research<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"Song, C., Liu, Z., Yuan, M. &amp; Zhao, C. From text to effectiveness: Quantifying green industrial policies in China. J. Clean. Prod. 446, 141445 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR26\" id=\"ref-link-section-d27007064e2005\" target=\"_blank\" rel=\"noopener\">26<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 56\" title=\"Lu, Y., Mei, Q. &amp; Zhai, C. Investigating task performance of probabilistic topic models: an empirical study of PLSA and LDA. Inf. Retr. 14, 178&#x2013;203 (2011).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR56\" id=\"ref-link-section-d27007064e2008\" target=\"_blank\" rel=\"noopener\">56<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 57\" title=\"Xu, G., Wu, X., Yao, H., Li, F. &amp; Yu, Z. 
Research on topic recognition of network sensitive information based on SW-LDA model. IEEE Access 7, 21527&#x2013;21538 (2019).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR57\" id=\"ref-link-section-d27007064e2011\" target=\"_blank\" rel=\"noopener\">57<\/a>, we adopted default settings of \u03b1\u2009=\u200950\/k and \u03b2\u2009=\u20090.01 for LDA. The parameter \u03b1 controls topic sparsity per document, with a higher value (e.g., 50\/k) promoting more topics per document, which suits diverse policy texts. Conversely, a lower \u03b2 (e.g., 0.01) concentrates the vocabulary distribution of each topic, improving interpretability<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 51\" title=\"Blei, D. M. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993&#x2013;1022 (2003).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR51\" id=\"ref-link-section-d27007064e2015\" target=\"_blank\" rel=\"noopener\">51<\/a>. Such settings help the model better learn the associations between documents and topics as well as between words and topics. The model was iterated 1000 times to ensure adequate learning and stable results.<\/p>\n<\/li>\n<li>\n                    (4)<\/p>\n<p>Calculation of topic similarity: After topic model training was completed for each stage, the topic similarity between successive stages was calculated to identify the evolution and persistence of topics over time. Specifically, the calculation involves combining the vocabulary of each stage\u2019s topics into text strings. 
For two consecutive stages, we transformed the textual data using the TF-IDF vectorization method and computed the cosine similarity between the TF-IDF vectors, a metric of the directional similarity between two vectors, using the following equation<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 15\" title=\"Park, C. &amp; Yong, T. Prospect of Korean nuclear policy change through text mining. Energy Proc. 128, 72&#x2013;78 (2017).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR15\" id=\"ref-link-section-d27007064e2030\" target=\"_blank\" rel=\"noopener\">15<\/a>:<\/p>\n<\/li>\n<\/ol>\n<p>$$\begin{array}{*{20}c} {Cosine Similarity\left( {A,B} \right) = \frac{A \cdot B}{{\left| {\left| A \right|} \right|\left| {\left| B \right|} \right|}}} \\ \end{array}$$<\/p>\n<p>\n                    (4)\n                <\/p>\n<p>where \(A\) and \(B\) are two TF-IDF vectors, \(A \cdot B\) is the dot product of the vectors, and \(\left| {\left| A \right|} \right|\) and \(\left| {\left| B \right|} \right|\) are the norms of the vectors (i.e., their magnitudes). The result of this formula ranges from\u2009\u2212\u20091 to 1: a value of 1 indicates that the two vectors point in exactly the same direction, \u2212\u20091 that they point in opposite directions, and 0 that they are orthogonal.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (5)<\/p>\n<p>Calculation of the annual share of each topic: For each year, we calculated the share of each topic in that year\u2019s documents. By counting the occurrence frequency of each topic in each year\u2019s documents and assigning the topics to documents proportionally, this step reveals the annual focus and shifts in policy concerns, providing data support for policy development and adjustment. 
The specific calculation process is as follows.<\/p>\n<\/li>\n<\/ol>\n<ul class=\"u-list-style-bullet\">\n<li>\n<p>Topic occurrence frequency: The number of times each topic appears in each document is counted to obtain the absolute frequency \(f_{t,d}\) of each topic, where \(t\) represents a specific topic and \(d\) represents a specific document.<\/p>\n<\/li>\n<li>\n<p>Total topic frequency: The total occurrence frequency \(F\) of all topics in all documents, i.e., the sum of all \(f_{t,d}\), is calculated.<\/p>\n<\/li>\n<li>\n<p>Topic percentage calculation: For each topic, the percentage is derived from the ratio of its occurrence frequency in all documents to the total frequency of all the topics:<\/p>\n<\/li>\n<\/ul>\n<p>$$\begin{array}{*{20}c} {P\left( t \right) = \left( {\frac{{\mathop \sum \nolimits_{d} f_{t,d} }}{F}} \right) \times 100\% } \\ \end{array}$$<\/p>\n<p>\n                    (5)\n                <\/p>\n<p>where \(P\left( t \right)\) denotes the percentage of topic \(t\), reflecting the importance of that topic relative to all topics in the year.<\/p>\n<p>In this way, the percentage of each topic indicates its relative weight in the annual policy, providing a quantitative basis for analyzing the policy focus in each year.<\/p>\n<p>AI extraction of greening core indicators<\/p>\n<p>In policy text analysis, extraction of core indicators is a key step in understanding and quantifying the effects of policy implementation. The common approaches include named entity recognition and entity relationship extraction, which are mainly used to identify and classify key entities in unstructured text (such as locations, person names, organizations, or other proper nouns) and the relationships between them. 
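The TF-IDF vectorization and cosine-similarity computation of Eq. (4) above can be sketched in plain Python. This is a minimal sketch, not the study's implementation: whitespace tokenization and smoothed IDF over just the two stage strings are simplifying assumptions, and a library such as scikit-learn would normally be used.

```python
import math
from collections import Counter

def tfidf_vectors(stage_a, stage_b):
    """Build TF-IDF vectors for two stage topic strings over their joint
    vocabulary. Raw term frequency and smoothed IDF are simplifying choices."""
    toks_a, toks_b = stage_a.split(), stage_b.split()
    vocab = sorted(set(toks_a) | set(toks_b))
    tf_a, tf_b = Counter(toks_a), Counter(toks_b)
    n_docs = 2  # only the two consecutive stages are compared
    vec_a, vec_b = [], []
    for w in vocab:
        df = (w in tf_a) + (w in tf_b)            # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF
        vec_a.append(tf_a[w] * idf)
        vec_b.append(tf_b[w] * idf)
    return vec_a, vec_b

def cosine_similarity(a, b):
    """Equation (4): cos(A, B) = A . B / (||A|| ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Because TF-IDF weights are non-negative, the similarity between two stages falls in [0, 1], with higher values indicating stronger topical persistence across stages.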
Named entity recognition tags specific entities in the text, while entity relationship extraction further analyzes the semantic connections between these entities, such as attribution and location relationships, thereby enabling more accurate and data-driven analysis and decision support<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 58\" title=\"Sumathy, L. K. &amp; Chidambaram, M. Text mining: Concepts, applications, tools and issues an overview. IJCA 80, 29&#x2013;32 (2013).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR58\" id=\"ref-link-section-d27007064e2250\" target=\"_blank\" rel=\"noopener\">58<\/a>,<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 59\" title=\"Jusoh, S. &amp; Alfawareh, H. M. Techniques, applications and challenging issue in text mining. Int. J. Comput. Sci. Issues (IJCSI) 9, 431 (2012).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR59\" id=\"ref-link-section-d27007064e2253\" target=\"_blank\" rel=\"noopener\">59<\/a>.<\/p>\n<p>In this study, we chose Baidu\u2019s AI Qianfan Big Model Platform (<a href=\"https:\/\/qianfan.cloud.baidu.com\/\" target=\"_blank\" rel=\"noopener\">https:\/\/qianfan.cloud.baidu.com\/<\/a>) to extract the core metrics of greening efforts. This platform supports one-stop big-model development and service operation for enterprise developers, providing multi-functional natural language processing capabilities, including the underlying model of Wenxin Yiyan and third-party open-source big models. The platform is based on Baidu Intelligent Cloud and adopts the PaddlePaddle deep learning framework as the underlying support, and can deliver high-precision, high-performance model output after fine-tuning with a small amount of data. 
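For illustration, the kind of structured record targeted here (year, indicator, value, unit) could be approximated for simple sentences with a rule-based baseline. The patterns below are hypothetical examples, not part of the study, and are far less robust than the large-model approach the study adopts, which is one motivation for using an LLM.

```python
import re

# Hypothetical patterns for simple English policy sentences; real policy
# texts (often Chinese, with highly varied phrasing) motivate the
# LLM-based extraction described in the text.
INDICATOR_RE = re.compile(
    r"(?:[Bb]y\s+(?P<year>\d{4}).*?)?"
    r"(?P<indicator>green space area|greenway(?:s)?|park green space)"
    r".*?(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>hectares|ha|km|kilometers|%)"
)

def extract_indicators(text):
    """Return a list of {year, indicator, value, unit} dicts found in text."""
    records = []
    for m in INDICATOR_RE.finditer(text):
        records.append({
            "year": m.group("year"),          # None if no "By <year>" clause
            "indicator": m.group("indicator"),
            "value": float(m.group("value")),
            "unit": m.group("unit"),
        })
    return records
```

Every new indicator type or phrasing would require a new hand-written pattern, whereas the prompt-driven approach described next generalizes across wording.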
Applying the Qianfan big model avoids the complex training process that traditional deep learning models require for entity relationship extraction, greatly reducing development and training time. In addition, the platform provides AI development tools and a complete development environment, enabling fast and accurate processing of large-scale, diverse policy documents<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Baidu. Baidu AI Cloud Qianfan Large Model Platform. &#010;                  https:\/\/qianfan.cloud.baidu.com\/&#010;                  &#010;                 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR60\" id=\"ref-link-section-d27007064e2267\" target=\"_blank\" rel=\"noopener\">60<\/a>. The Baidu AI Qianfan big model enables researchers to focus on building and validating a methodological framework for policy analysis without devoting excessive resources to the development and optimization of technical details. In this way, researchers can concentrate on parsing and applying the policy data, thereby promoting the integration of theoretical research and practical application.<\/p>\n<p>The module leverages a large language model and prompt engineering to generate structured greening metrics (e.g., year, region, metric value) in JSON format, enabling subsequent visualization and analysis (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#MOESM2\" target=\"_blank\" rel=\"noopener\">S1<\/a>). The specific operational process includes the following steps.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (1)<\/p>\n<p>Prompt design: Prompts are instructions that guide a large model. 
The instruction can be a question or a text description with multiple parameters. Based on the prompts provided, the large model generates corresponding texts or images<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 60\" title=\"Baidu. Baidu AI Cloud Qianfan Large Model Platform. &#010;                  https:\/\/qianfan.cloud.baidu.com\/&#010;                  &#010;                 (2024).\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#ref-CR60\" id=\"ref-link-section-d27007064e2291\" target=\"_blank\" rel=\"noopener\">60<\/a>. As instructions to the large model, prompts directly affect output quality. We carefully designed and debugged the prompts to ensure that they correctly guide the model in extracting key information from the text, such as \u201cyear\u201d, \u201cregion\u201d, \u201ctype of indicator\u201d, \u201cvalue\u201d, and \u201cunit\u201d (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#MOESM2\" target=\"_blank\" rel=\"noopener\">S1<\/a>).<\/p>\n<\/li>\n<li>\n                    (2)<\/p>\n<p>Interface call: The API of the Baidu AI Qianfan model is called programmatically, and the pre-processed policy text and prompts are supplied as input.<\/p>\n<\/li>\n<li>\n                    (3)<\/p>\n<p>Data storage: The results extracted by the model are formatted and stored directly in a MySQL database, which facilitates subsequent data analysis and visualization and provides structured data support for policy evaluation and decision-making.<\/p>\n<\/li>\n<\/ol>\n<p>Policy AI interpretation<\/p>\n<p>This module uses the Baidu AI Qianfan Big Model for in-depth analysis and understanding of policy texts, following the AI processing workflow outlined in \u201c<a data-track=\"click\" data-track-label=\"link\" 
data-track-action=\"section anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#Sec8\" target=\"_blank\" rel=\"noopener\">AI extraction of greening core indicators<\/a>\u201d section, to produce standardized policy interpretations for decision-making support. Its advanced semantic understanding is especially suited to complex or large-scale policy documents and can greatly improve the standardization and objectivity of interpretation. For instance, using tailored prompt engineering, the model generates a summary of policy objectives, key terms, and potential impacts (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#MOESM2\" target=\"_blank\" rel=\"noopener\">S1<\/a>), enabling policymakers to quickly understand core policy elements.<\/p>\n<p>The operation process is similar to that of the greening core indicator extraction module: interpretation needs are first determined, and appropriate prompts are designed to direct the AI model to the key contents or issues of the policy text. Through the programming interface, the pre-processed policy text is submitted to the large model, and parameters are configured to perform in-depth semantic analysis (Supplementary Table <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s41598-025-05842-z#MOESM2\" target=\"_blank\" rel=\"noopener\">S1<\/a>). Ultimately, the model outputs form policy interpretations that provide insights and support to policy makers and analysts.<\/p>\n<p>Real-time policy tracking<\/p>\n<p>Real-time policy tracking is a key component of this study, designed to enable continuous monitoring of policy releases. 
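Both the indicator-extraction and interpretation modules share the same prompt-call-parse workflow. A minimal sketch follows, under stated assumptions: `call_model` is a hypothetical stand-in for the Qianfan API client (the real call requires the platform's SDK and authentication), and the prompt wording is illustrative rather than the study's actual prompt.

```python
import json

def build_prompt(policy_text):
    """Illustrative prompt asking the model for structured greening metrics
    in the JSON shape the text describes (year, region, indicator, value, unit)."""
    return (
        "Extract all greening indicators from the policy text below. "
        "Reply only with a JSON list of objects with the keys "
        '"year", "region", "indicator", "value", "unit".\n\n' + policy_text
    )

def extract_metrics(policy_text, call_model):
    """Run the prompt through a model and validate the returned JSON.

    `call_model` is any callable prompt -> str; in the study this role is
    played by the Baidu AI Qianfan API."""
    raw = call_model(build_prompt(policy_text))
    records = json.loads(raw)
    required = {"year", "region", "indicator", "value", "unit"}
    # Keep only well-formed records so malformed model output cannot
    # corrupt the downstream database tables.
    return [r for r in records if required <= set(r)]
```

In the study, the validated records are then inserted into MySQL; here they are simply returned as dictionaries, and the interpretation module differs only in the prompt and the free-text (rather than JSON) shape of the output.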
While current data sources remain limited, the tracking module lays the foundation for integrating multi-source data (e.g., greening policies from government platforms, social media evaluations) to provide policymakers with comprehensive policy dynamics for timely adjustments and optimized designs. The specific operation process is as follows.<\/p>\n<ol class=\"u-list-style-none\">\n<li>\n                    (1)<\/p>\n<p>Setting of a monitoring module: An automated policy monitoring module is set up, which scans the official websites of the government and related departments daily to identify newly released or updated policy information. The module combines keyword search (e.g., \u201cgreening\u201d and \u201cparks\u201d) with automated crawling to ensure that all relevant policies are captured promptly.<\/p>\n<\/li>\n<li>\n                    (2)<\/p>\n<p>Real-time push: Once a new policy is identified, the module automatically extracts and stores key metadata in the database, including the policy\u2019s publication source, date, title, and link. The data are pushed to the front-end display interface in real time through a customized interface, ensuring that decision makers and analysts instantly receive the latest policy changes and information.<\/p>\n<\/li>\n<\/ol>\n<p>Visualization of intelligent analysis results<\/p>\n<p>In this study, a series of cutting-edge front-end and back-end technologies and tools were used to visualize the results of intelligent analysis. The front end is built with HTML, CSS, and JavaScript, combined with the Vue.js framework, to provide a responsive and interactive user interface. For data visualization, we chose the G2Plot charting library and the L7 geographic information visualization tool from AntV, which together support the graphical representation of complex data and the dynamic display of geographic information. 
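The daily monitoring-and-push workflow described in the tracking section above can be sketched as follows. This is a simplified sketch: the crawler itself is out of scope (pages arrive as already-fetched title/link pairs), the keyword list mirrors the examples in the text, and the deduplication set stands in for a database lookup.

```python
from datetime import date

KEYWORDS = ("greening", "parks")  # search terms named in the text

def scan_sources(pages, seen_links):
    """Scan fetched pages for new keyword-matching policies.

    `pages` is an iterable of (title, link) tuples produced by a crawler;
    `seen_links` holds links already stored, so each policy is pushed
    only once across repeated daily scans."""
    new_records = []
    for title, link in pages:
        if link in seen_links:
            continue  # already captured on an earlier scan
        if any(kw in title.lower() for kw in KEYWORDS):
            new_records.append({
                "source": link.split("/")[2] if "//" in link else link,
                "date": date.today().isoformat(),  # capture date
                "title": title,
                "link": link,
            })
            seen_links.add(link)
    return new_records  # metadata rows to store and push to the front end
```

In the full system, the returned metadata rows would be written to the MySQL database and pushed to the front-end interface; the publication date would normally be parsed from the page rather than stamped with the capture date as done here.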
The back end uses Python with the Flask framework to develop APIs that process front-end requests and interact with the MySQL database, ensuring real-time data processing and feedback.<\/p>\n<p>The visualization provides not only a convenient tool for researchers to observe and analyze policy impacts, but also an intuitive platform for policy makers and the public to understand policy directions and trends. This greatly enhances transparency and public participation in policy research, promoting open information sharing and a more democratic decision-making process.<\/p>\n","protected":false},"excerpt":{"rendered":"Framework for intelligent analysis of greening policy texts based on text mining and AI big models In this&hellip;\n","protected":false},"author":3,"featured_media":140498,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21],"tags":[691,83917,738,857,83915,10046,65,83916,10047,83914,159,793,1763,158,83918,67,132,74288,68],"class_list":{"0":"post-140497","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"tag-ai","9":"tag-ai-big-model","10":"tag-artificial-intelligence","11":"tag-environmental-impact","12":"tag-greening","13":"tag-humanities-and-social-sciences","14":"tag-information-technology","15":"tag-methodological-framework","16":"tag-multidisciplinary","17":"tag-policy-analysis","18":"tag-science","19":"tag-software","20":"tag-sustainability","21":"tag-technology","22":"tag-text-mining","23":"tag-united-states","24":"tag-unitedstates","25":"tag-urban-ecology","26":"tag-us"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@us\/115017552004727230","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/140497","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.eu
ropesays.com\/us\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/comments?post=140497"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/posts\/140497\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media\/140498"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/media?parent=140497"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/categories?post=140497"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/us\/wp-json\/wp\/v2\/tags?post=140497"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}