{"id":298306,"date":"2026-01-22T20:59:14","date_gmt":"2026-01-22T20:59:14","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/298306\/"},"modified":"2026-01-22T20:59:14","modified_gmt":"2026-01-22T20:59:14","slug":"nutrimatch-harmonizing-food-composition-databases-with-large-language-models-for-enhanced-nutritional-prediction","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/298306\/","title":{"rendered":"NutriMatch: harmonizing food composition databases with large language models for enhanced nutritional prediction"},"content":{"rendered":"<p>Data collection<\/p>\n<p>We utilized data from the 10K Project, a prospective human cohort study involving over 10,000 healthy participants aged 40\u201370 at recruitment. The study focuses on in-depth clinical, physiological, behavioral, and multi-omic profiling. Specific exclusion criteria were applied to maintain the relevance and homogeneity of the cohort<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 31\" title=\"Shilo, S. et al. 10K-a large-scale prospective longitudinal study in Israel. Eur. J. Epidemiol. 36, 1187&#x2013;1194 (2021).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR31\" id=\"ref-link-section-d91699474e1052\" rel=\"nofollow noopener\" target=\"_blank\">31<\/a>.<\/p>\n<p>Dietary data were collected via continuous real-time diet logging. Participants recorded daily food and beverage consumption using a dedicated mobile app for a continuous 2-week period. 
The HPP FCDB linked to this app contains 7765 unique food items, categorized into 33 distinct food categories and associated with 718 short food names for high-level grouping.<\/p>\n<p>As part of our external validation, we utilized data from the Australian PREDICT cohort<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 27\" title=\"Htet, T. D. et al. Rationale and design of a randomised controlled trial testing the effect of personalised diet in individuals with pre-diabetes or type 2 diabetes mellitus treated with metformin. BMJ Open 10, e037859 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR27\" id=\"ref-link-section-d91699474e1062\" rel=\"nofollow noopener\" target=\"_blank\">27<\/a>. It is a randomized controlled trial of personalized diet interventions in individuals with prediabetes or early-stage T2DM on metformin (N\u2009=\u2009138). Detailed dietary logging and clinical measurements were collected using a dedicated mobile app, as previously described<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 27\" title=\"Htet, T. D. et al. Rationale and design of a randomised controlled trial testing the effect of personalised diet in individuals with pre-diabetes or type 2 diabetes mellitus treated with metformin. BMJ Open 10, e037859 (2020).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR27\" id=\"ref-link-section-d91699474e1069\" rel=\"nofollow noopener\" target=\"_blank\">27<\/a>.<\/p>\n<p>Ethical approval<\/p>\n<p>All participants signed an informed consent form upon arrival at the research site. All identifying details of the participants were removed prior to the computational analysis. 
The 10K cohort study is conducted according to the principles of the Declaration of Helsinki and was approved by the Institutional Review Board of the Weizmann Institute of Science.<\/p>\n<p>External databases<\/p>\n<p>Our alignment process involved matching the HPP FCDB with several key external FCDBs. These databases were selected to provide comprehensive coverage of regional and global dietary habits:<\/p>\n<p>USDA SR Legacy, a comprehensive source of nutritional data for U.S. foods that provides detailed profiles of macronutrients, vitamins, minerals, and bioactive compounds and is widely used in diet-related research<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 21\" title=\"Fukagawa, N. K. et al. USDA&#x2019;s FoodData Central: what is it and why is it needed today?. Am. J. Clin. Nutr. 115, 619&#x2013;624 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR21\" id=\"ref-link-section-d91699474e1092\" rel=\"nofollow noopener\" target=\"_blank\">21<\/a>;<\/p>\n<p>USDA FNDDS, used primarily for dietary intake surveys in the U.S., which offers nutrient content, serving sizes, and food descriptions and is frequently updated for public health research<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 21\" title=\"Fukagawa, N. K. et al. USDA&#x2019;s FoodData Central: what is it and why is it needed today?. Am. J. Clin. Nutr. 
115, 619&#x2013;624 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR21\" id=\"ref-link-section-d91699474e1099\" rel=\"nofollow noopener\" target=\"_blank\">21<\/a>;<\/p>\n<p>Tzameret, an Israeli FCDB focused on nutrient data for locally consumed foods, essential for studying Israeli dietary patterns<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 23\" title=\"&#x5DE;&#x5D0;&#x5D2;&#x5E8; &#x5D4;&#x5EA;&#x5D6;&#x5D5;&#x5E0;&#x5D4; &#x5D4;&#x5DC;&#x5D0;&#x5D5;&#x5DE;&#x5D9; &#x5D4;&#x5D9;&#x5E9;&#x5E8;&#x5D0;&#x5DC;&#x5D9; - &#x5DE;&#x5D0;&#x5D2;&#x5E8;&#x5D9; &#x5DE;&#x5D9;&#x5D3;&#x5E2; - Government Data. &#010;                  https:\/\/data.gov.il\/dataset\/nutrition-database&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR23\" id=\"ref-link-section-d91699474e1106\" rel=\"nofollow noopener\" target=\"_blank\">23<\/a>;<\/p>\n<p>MEXT (Japan), which provides nutrient profiles of Japanese foods, reflects regional dietary habits, and is commonly used in studies of Japanese diets<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 24\" title=\"MEXT: Standards Tables of Food Composition in Japan -2015- (Seventh Revised Edition) Documentation and Table. 
&#010;                  https:\/\/www.mext.go.jp\/en\/policy\/science_technology\/policy\/title01\/detail01\/sdetail01\/sdetail01\/1385122.htm&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR24\" id=\"ref-link-section-d91699474e1114\" rel=\"nofollow noopener\" target=\"_blank\">24<\/a>;<\/p>\n<p>Bahrain Food Database, developed by Bahrain\u2019s Ministry of Health, provides essential nutritional data on local foods to support public health and dietary research<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 25\" title=\"Musaiger, A. Food Composition Tables for Kingdom of Bahrain. (2011).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR25\" id=\"ref-link-section-d91699474e1121\" rel=\"nofollow noopener\" target=\"_blank\">25<\/a>;<\/p>\n<p>AUSNUT, the Australian food composition database, was developed for the 2011\u20132013 Australian Health Survey (AHS), providing detailed nutrient profiles for foods and dietary supplements consumed in Australia<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 26\" title=\"AUSNUT 2011-13 | Food Standards Australia New Zealand. &#010;                  https:\/\/www.foodstandards.gov.au\/science-data\/food-composition-databases\/ausnut&#010;                  &#010;                .\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR26\" id=\"ref-link-section-d91699474e1128\" rel=\"nofollow noopener\" target=\"_blank\">26<\/a>.<\/p>\n<p>Alignment methodology<\/p>\n<p>Our alignment methodology follows four stages:<\/p>\n<p>Dataset Standardization: We used structured outputs from LLMs to classify food item names and categories consistently across all datasets. 
This ensured uniformity in food classifications.<\/p>\n<p>Embedding Projections: We converted food items into semantic embeddings using a model from OpenAI (<a href=\"https:\/\/platform.openai.com\/docs\/guides\/embeddings\" rel=\"nofollow noopener\" target=\"_blank\">https:\/\/platform.openai.com\/docs\/guides\/embeddings<\/a>). We used the \u201ctext-embedding-3-large\u201d model to represent each food item as a 3072-dimensional vector.<\/p>\n<p>Matching: We employed cosine similarity as the similarity metric to compare and match food items from different databases.<\/p>\n<p>Validation with LLM: Finally, we used a prompt-based approach with an LLM to validate that the matched food items were indeed equivalent. The validation focused on ensuring that nutrients from one food item could be accurately imputed to the matched item.<\/p>\n<p>Imputation methodology<\/p>\n<p>To address missing nutrient data in FCDBs, NutriMatch employs a structured imputation strategy that integrates embedding-based matching and LLM-assisted validation. This approach ensures that missing nutrients are inferred based on the most robust and validated sources while maintaining transparency in decision-making.<\/p>\n<p>Embedding-Based Candidate Selection: for each food item requiring nutrient imputation, we first identify the top 5 closest matches based on their embeddings. These embeddings, derived from a deep-learning model trained on food descriptions and nutrient compositions, enable semantic comparisons beyond simple keyword matching.<\/p>\n<p>LLM Validation of Food Equivalence: the match between the original food item and each of the closest matches is then evaluated using an LLM. The LLM is prompted with structured queries to determine whether the candidate foods are nutritionally equivalent to the target food item (in our case, the standardized food item structure is that of SR Legacy, containing a description and a category). 
If the LLM confirms equivalence, these matches are flagged as valid references for nutrient imputation.<\/p>\n<p>While this automated validation reduces the need for manual expert review and enables greater scalability, occasional mismatches may still arise in edge cases where domain expertise could offer added value.<\/p>\n<p>Hierarchical Dataset Ranking for Selection: we prioritize FCDBs based on their validation rigor and data robustness. Databases with stringent quality control measures, such as USDA Standard Reference (SR Legacy) and USDA FNDDS, are given higher priority over sources with less validation, such as Tzameret. This ranking ensures that imputed values are derived from the most reliable sources whenever possible.<\/p>\n<p>Selecting the Closest Match for Nutrient Imputation: once the top LLM-validated match is identified, nutrient values are imputed sequentially, starting from the highest-ranked database. If a match is found within a highly validated FCDB, its nutrient composition is directly transferred. Otherwise, the best available match in the embedding space is selected to provide the missing values.<\/p>\n<p>Post-Imputation Matching for Unresolved Cases: for food items without an exact LLM-confirmed match, we leverage the embedding space to identify the most similar food and assign its nutrient values. This ensures that all food items receive a complete nutrient profile, even when exact database matches are unavailable.<\/p>\n<p>This systematic imputation methodology makes NutriMatch fully explainable, as every imputed nutrient can be traced back to a specific food item in a known FCDB. By combining semantic embeddings, LLM validation, and dataset prioritization, we enhance the completeness and reliability of dietary data while maintaining methodological transparency.<\/p>\n<p>Quantifying intra\u2011 and inter\u2011FCDB nutrient variability<\/p>\n<p>We assessed the inter-database correlations using the shared nutrients. 
The three study databases, AUSNUT (PREDICT cohort), Tzameret, and SR Legacy, share 37 non-imputed nutrients. After NutriMatch alignment, we retained every nutrient represented by at least 50 food items in each comparison (all 37 met this criterion). Match counts were 1964 foods for AUSNUT\u2009\u2194\u2009SR Legacy, 4132 for AUSNUT \u2194 Tzameret, and 3409 for Tzameret \u2194 SR Legacy. Because some of these nutrients have a large zero tail, Pearson correlations were computed on log-transformed values (clipped at a minimum of 1e-5) nutrient\u2011wise for each two\u2011way combination; they are displayed in Extended Data Fig. <a data-track=\"click\" data-track-label=\"link\" data-track-action=\"supplementary material anchor\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#MOESM1\" rel=\"nofollow noopener\" target=\"_blank\">S2<\/a>.<\/p>\n<p>To estimate the upper bound of reproducibility expected under ideal conditions, we used the Foundation Foods subset of USDA FoodData Central<a data-track=\"click\" data-track-action=\"reference anchor\" data-track-label=\"link\" data-test=\"citation-ref\" aria-label=\"Reference 21\" title=\"Fukagawa, N. K. et al. USDA&#x2019;s FoodData Central: what is it and why is it needed today?. Am. J. Clin. Nutr. 115, 619&#x2013;624 (2022).\" href=\"http:\/\/www.nature.com\/articles\/s44482-025-00001-7#ref-CR21\" id=\"ref-link-section-d91699474e1204\" rel=\"nofollow noopener\" target=\"_blank\">21<\/a>, which includes repeated analytical measurements for the same food items, within the same country, and measured by the same laboratory methods. Even in this best-case scenario, where all external sources of variability are minimized, nutrient values still show variation due to intrinsic measurement noise. Of the 37 nutrients analyzed in our inter-database comparison, 25 were represented in Foundation Foods with \u22654 replicate determinations, yielding 10,076 food\u2013nutrient pairs. 
Since only summary statistics (minimum, maximum, etc.) were available, we approximated the within-food standard deviation as \u03c3\u2009\u2248\u2009(max\u2013min)\/4. This is the range rule of thumb, which treats the observed range as spanning about four standard deviations of an approximately normal distribution; it is a widely used range-based estimator when an empirical variance is unavailable. We drew 100 pseudo\u2011observations from N(mean, \u03c3\u00b2) for each pair and calculated log Spearman correlations across all non-identical food pairs. The 5th and 95th percentiles of this distribution (\u03c1\u2009\u2248\u20090.81\u20130.99) define an empirical \u201cbest-case\u201d reproducibility band against which inter-database correlations were compared.<\/p>\n<p>Machine learning models<\/p>\n<p>For regression and classification tasks, we utilized the LightGBM library, implementing a fivefold cross-validation approach to evaluate model performance. Dietary log data were preprocessed to include only days with a recorded intake of at least 800\u2009kcal.<\/p>\n<p>We compared three hierarchical feature subsets in our predictive models: (1) age and sex only, (2) basic nutrients (macronutrients and sodium) along with age and sex, and (3) all nutrients, including the basic set, expanded by NutriMatch imputation. Each subsequent subset fully contains the previous one, allowing clear assessment of incremental predictive value from additional nutrient features.<\/p>\n<p>To compare macronutrient and micronutrient consumption between the Australian and Israeli cohorts, participants were matched based on age, gender, and BMI using propensity score matching.<\/p>\n<p>Propensity score matching<\/p>\n<p>Propensity score matching balances baseline covariates by pairing participants with similar estimated probabilities of group assignment based on age, gender, and BMI. Matching was carried out via nearest-neighbor selection without replacement to create comparable groups. 
The matched cohort, with aligned distributions of age, gender, and BMI, was then used for downstream effect estimation.<\/p>\n<p>SHAP<\/p>\n<p>For model interpretability, SHAP (SHapley Additive exPlanations) decomposes individual predictions into per-feature contributions, quantifying the extent to which each variable shifts the prediction from its baseline. Positive and negative SHAP values indicate upward or downward effects on the model output, respectively. Contribution distributions are summarized with a beeswarm plot: features are ordered by mean absolute SHAP value, each point represents a sample\u2019s SHAP value for that feature, horizontal position denotes effect size and direction, and color encodes the raw feature value. This visualization simultaneously conveys feature importance and inter-sample variability in effect magnitude and direction.<\/p>\n","protected":false},"excerpt":{"rendered":"Data collection We utilized data from the 10K Project, a prospective human cohort study involving over 10,000 
healthy&hellip;\n","protected":false},"author":2,"featured_media":298307,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[277],"tags":[3547,4154,9111,133942,147730,18,910,135,62062,2100,19,17,7482,508,2101,147729],"class_list":{"0":"post-298306","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-nutrition","8":"tag-biomedical-engineering-biotechnology","9":"tag-computational-biology-and-bioinformatics","10":"tag-computational-models","11":"tag-data-integration","12":"tag-data-mining-and-knowledge-discovery","13":"tag-eire","14":"tag-general","15":"tag-health","16":"tag-health-informatics","17":"tag-health-promotion-and-disease-prevention","18":"tag-ie","19":"tag-ireland","20":"tag-medicine-public-health","21":"tag-nutrition","22":"tag-public-health","23":"tag-special-purpose-and-application-based-systems"},"share_on_mastodon":{"url":"https:\/\/pubeurope.com\/@ie\/115940764159168564","error":""},"_links":{"self":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/298306","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/comments?post=298306"}],"version-history":[{"count":0,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/posts\/298306\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media\/298307"}],"wp:attachment":[{"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/media?parent=298306"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/categorie
s?post=298306"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.europesays.com\/ie\/wp-json\/wp\/v2\/tags?post=298306"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}