To evaluate our approach, we tested GENIUS with multiple LLMs of increasing capability, obtaining more comprehensive diagnostic insight into the workflow’s performance. By gathering a large dataset of human-generated prompts for DFT calculations using QE and the corresponding workflow logs (see Fig. 4), we analyzed the number of attempts required to successfully complete each prompt. These prompts were authored by chemists and physicists who routinely perform DFT simulations with electronic-structure packages other than Quantum ESPRESSO, ensuring that the benchmark reflects realistic expert usage while remaining unbiased toward QE-specific syntax.

To understand the diversity of the prompt dataset, we converted the 295 prompts into 3072-dimensional embedding vectors with OpenAI’s text-embedding-3-large model. A 10 × 10 self-organizing map (SOM)34 was then trained to visualize their semantic landscape. The SOM is an unsupervised neural network that performs dimensionality reduction and clustering by projecting high-dimensional input data onto a lower-dimensional grid (here, two-dimensional) while preserving topological relationships among the inputs. Each of the 100 SOM neurons was initialized with a random weight vector of equal dimensionality, and training proceeded for 50,000 iterations in mini-batches of 50 samples. During each iteration, the neurons compete to best represent the input pattern: the algorithm identifies the best matching unit (BMU) for each input vector, defined as the neuron whose weight vector has the smallest Euclidean distance to that input. The BMU and its neighboring neurons are then updated, with the adjustment magnitude decreasing as a function of distance from the BMU according to a Gaussian neighborhood function. This neighborhood influence, combined with a linearly decreasing learning rate, allows the SOM to form a topologically ordered map of the input space. The convergence and quality of the resulting map were validated quantitatively with standard SOM metrics: the quantization error, the average distance between input vectors and their BMU’s weight vector, and the topological error (TE), the proportion of data points for which the first and second BMUs are not adjacent on the map grid.
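The procedure above can be reproduced in a few lines. The following is a minimal sketch assuming the open-source MiniSom library; note that MiniSom trains sample-by-sample rather than in the mini-batches of 50 described above, so it approximates the procedure, and the embedding file name is illustrative.

```python
import numpy as np
from minisom import MiniSom

# 295 prompts x 3072-dim embeddings from text-embedding-3-large,
# unit-normalized as stated in the text (file name is a placeholder).
embeddings = np.load("prompt_embeddings.npy")
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# 10 x 10 hexagonal grid; weights are randomly initialized by the constructor.
som = MiniSom(
    10, 10, input_len=embeddings.shape[1],
    neighborhood_function="gaussian",   # Gaussian neighborhood, as in the text
    topology="hexagonal",               # six equidistant neighbors per neuron
    random_seed=42,
)
som.train(embeddings, num_iteration=50_000, random_order=True)

# Map-quality metrics reported in the text (topographic error on a hexagonal
# grid requires a recent MiniSom release).
qe = som.quantization_error(embeddings)   # mean distance to each BMU
te = som.topographic_error(embeddings)    # share of non-adjacent 1st/2nd BMUs
print(f"quantization error = {qe:.4f}, topological error = {te:.4f}")
```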

The SOM and BMU analyses provide information on the prompts’ degree of complexity and semantic similarity. A score-based metric evaluation, in which an LLM assigns a numerical value to text data35, shows that the dataset comprises 44.3% basic, 48.5% standard, and 7.2% complex prompts. The SOM analysis produces a grid of hexagonally packed neurons (Fig. 2), trained on the prompt embeddings, which are then clustered into distinct groups; the map reveals these clusters while preserving the topology of the embedding space. The quality of the SOM representation reinforces the reliability of the observed prompt distribution (Fig. 2). The low TE (0.0373) confirms excellent preservation of the original data’s neighborhood structure. The quantization error (0.4970) indicates good representational fidelity: given that the input vectors were unit-normalized (maximum pairwise distance of 2.0), the average distance between data points and their map representatives is low, especially considering the significant dimensionality reduction. To analyze the learned representations, two complementary visualizations were generated. The U-matrix (unified distance matrix) in Fig. 2a visualizes the average distance between neighboring neurons, with higher values indicating cluster boundaries and lower values indicating dense regions of similar inputs; the hexagonal structure captures six equidistant neighbors, compared with only four on a square grid. The BMU activation count plot (hit map) in Fig. 2b shows how frequently each neuron was selected as a BMU, revealing the distribution of input assignments across the SOM grid. Neurons with zero activations serve as boundary regions or represent semantic areas not covered by the current dataset; these empty neurons are crucial for preserving topology, helping maintain proper inter-cluster distance relationships.
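Both views in Fig. 2 can be generated directly from the trained map. Below is a minimal sketch continuing the MiniSom example above; for brevity it draws square cells rather than the paper’s hexagonal layout.

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# (a) U-matrix: mean distance between each neuron's weights and its
# neighbors; ridges mark cluster boundaries, basins mark dense clusters.
ax1.imshow(som.distance_map().T, cmap="bone_r")
ax1.set_title("U-matrix")

# (b) Hit map: how often each neuron was selected as the BMU; zero-hit
# neurons act as boundary regions between semantic clusters.
ax2.imshow(som.activation_response(embeddings).T, cmap="viridis")
ax2.set_title("BMU activation counts")

plt.tight_layout()
plt.show()
```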

Fig. 2: Self-organizing map (SOM) analysis of user input prompt embeddings.

a U-matrix visualization of the SOM trained on user prompt embeddings, where regions of larger distance indicate cluster boundaries and regions of lower distance denote dense clusters of similar prompts. b SOM hit map (BMU activation count) showing the distribution of prompts across the neuron grid. Five distinct high-activation regions appear, with low-activation regions scattered between them, suggesting a balance between semantic clustering and diversity of request types in the input data.

A similar SOM representation can be generated for each component of the embedding vectors, revealing which dimensions contribute most significantly to semantic distinctions and helping identify correlated components that form meaningful semantic features. These visualizations are available in the project’s GitHub repository. The figure highlights the five neurons with the highest activation counts. These neurons are well separated on the SOM grid, indicating that the calculation prompts can be grouped primarily into five semantic clusters, with additional residual prompts that share similarities with the core concepts. Overall, the SOM grid shows that the prompt collection is dispersed, providing a good balance between semantic cohesion within the identified clusters and conceptual diversity across them. Manual inspection reveals that the prompts fall mainly into two major categories, namely structural relaxations and a variety of single-shot DFT calculations using the different methodologies available in the QE code, a grouping further confirmed by a simpler k-means analysis (see the GitHub repository and the sketch below). These findings highlight the framework’s sensitivity to input quality and type, potentially informing future improvements to user interaction or prompt pre-processing. This analysis uncovers latent structures in the user-request space that are directly relevant to the development of automated protocol-generation mechanisms.
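A minimal sketch of such a k-means cross-check, assuming scikit-learn, with k = 5 chosen to match the five high-activation SOM regions:

```python
from sklearn.cluster import KMeans

# Cluster the same unit-normalized embeddings used for the SOM.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(embeddings)
for label in range(5):
    size = int((kmeans.labels_ == label).sum())
    print(f"cluster {label}: {size} prompts")
```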

In Fig. 3, the user’s prompt (classified as standard) is shown at the top, specifying a geometry optimization for a 2D PdS2 structure using the B3LYP functional, whereas the bottom portion highlights the valid QE input file generated by the GENIUS framework. In Fig. 4, we illustrate the complete timeline log for the same prompt, emphasizing the sequence of events, indicated by color-coded statuses (PENDING, SUCCESS, RETRY, and ERROR), as the system progresses from the interface agent phase to the final solution generation. As evidenced in Fig. 4, the extended time required to evaluate input parameters is a secondary constraint imposed by LLM API providers, which a self-hosted service could mitigate; without such limitations, parameters could be processed in parallel, reducing overall latency. The stepwise entries in the log demonstrate the framework’s resilience, including how QE crashes (red dots) can result from hallucinations or confabulations. Our framework automatically detects and resolves these failures, iteratively refining and validating the input parameters until the final QE calculation is completed. It is important to note that the wall-clock performance illustrated in Fig. 4 reflects the specific benchmark setup used in this research, in which each QE attempt is validated for a maximum of 60 s. As a result, the overall workflow runtime is primarily influenced by orchestration factors, such as model inference and retry mechanisms, rather than the time required to complete a full production DFT calculation. A sketch of how such a timeline can be rendered from the log follows below.
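This is an illustrative sketch only; the event list and its (timestamp, message, status) layout are hypothetical stand-ins for GENIUS’s actual log schema.

```python
import matplotlib.pyplot as plt

STATUS_COLORS = {"PENDING": "orange", "SUCCESS": "green",
                 "RETRY": "gray", "ERROR": "red"}

# Hypothetical log events: (seconds since launch, message, status).
events = [
    (0.0,   "parse user prompt",     "PENDING"),
    (12.0,  "harvest documentation", "SUCCESS"),
    (45.0,  "build parameter graph", "SUCCESS"),
    (80.0,  "generate QE input",     "SUCCESS"),
    (95.0,  "QE run crashed",        "ERROR"),
    (110.0, "AEH retry 1",           "RETRY"),
    (170.0, "QE run finished",       "SUCCESS"),
]

fig, ax = plt.subplots(figsize=(7, 3))
for i, (t, msg, status) in enumerate(events):
    ax.scatter(t, i, color=STATUS_COLORS[status])   # newer events plot higher
    ax.annotate(msg, (t, i), xytext=(6, 0),
                textcoords="offset points", fontsize=8)
ax.set_xlabel("wall-clock time (s)")
ax.set_yticks([])
plt.show()
```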

Fig. 3: Real simulation protocol example of Quantum Espresso generated by GENIUS.

The user’s prompt request is displayed at the top, instructing GENIUS to perform a geometry optimization for 2D PdS2 in the P21/c space group using Quantum ESPRESSO with the B3LYP exchange-correlation functional. The generated protocol is provided in two columns for compactness. The framework parses these instructions and automatically generates the valid QE input. The protocol specifies 20% exact exchange, a plane-wave basis set, smearing for the occupations, a mixing parameter for the SCF cycle, and a 7 × 7 × 2 k-point mesh. Additionally, the file includes detailed control parameters for geometry relaxation (via BFGS), pseudopotentials for Pd and S, and the required settings, including ecutwfc, ecutrho, occupations, and spin polarization, as shown in the output.

Fig. 4: Live timeline of a self-healing (AEH) GENIUS job.

Each dot marks a log event (y-axis, newest at top) plotted against wall-clock time (x-axis); colors denote status: PENDING (orange), SUCCESS (green), RETRY (gray), ERROR (red). The workflow first parses the user prompt, harvests documentation, builds a parameter graph, and generates a QE input template. After launch, QE crashes once (red); the finite-state loop applies a single retry (gray) within the AEH, resolves the issue, and the simulation reaches steady execution and completion (green) in ≈3 min. The timeline exposes full provenance and illustrates how GENIUS autonomously recovers from runtime failures while streaming real-time status updates.

An overview of the outcomes for the 295 test prompts is presented in Fig. 5, which depicts the split between successful and failed runs, the path to success (zero-shot or via specific models in the AEH system), and a breakdown by initial prompt complexity. Additionally, when prompts are evaluated using only base LLMs, without the GENIUS framework, they almost never produce valid QE input files containing the correct cards and mutually consistent parameters for a given geometric structure, regardless of their nominal reasoning enhancements. This limitation likely arises because the models do not embed explicit crystallographic information and cannot infer the subtle interdependencies among geometry, namelist keywords, and the card syntax required by QE. We reiterate that Model 1, the first component in the protocol-generation hierarchy, produces the initial simulation protocol version. A zero-shot success therefore corresponds to cases in which the first output generated by Model 1 is valid and requires no further correction. Within the GENIUS framework, Model 1 proceeds with the first cycle of automated error-handling retries if this initial attempt fails.

Fig. 5: GENIUS performance benchmark on 295 tested prompts.

The stacked bar chart reports the percentage of successful runs (y-axis) for the zero-shot pass (GENIUS without AEH) and GENIUS using AEH combined with Model 1, Model 2, and referee models. Shaded vertical panels group the bars for systems assessed by the same model. Within every bar, colored segments disaggregate the total success rate by prompt complexity: basic, standard, and complex. The cumulative solved percentage (right y-axis) is overlaid in dark blue, showing the total proportion of prompts solved after each successive attempt.

In Fig. 5, we present the distribution of successful runs that reached the FINISHED state in the workflow after a given number of attempts. Zero-shot success denotes scenarios in which a request was FINISHED using only the workflow’s recommendation system, without invoking the automated error-handling system. This scenario accounts for 17.9% of prompts, comprising 9.4% basic, 7.2% standard, and 1.3% complex prompts. A similar distribution pattern can be observed for subsequent attempts. If the initial execution fails, the GENIUS AEH system is triggered. Each retry within an attempt cycle uses the same model that generated the initial protocol. After three attempts per model, the process switches to the next model in the hierarchy if no solution is found. Based on the user’s calculation prompt, the workflow then resets from the recommendation system’s output, which serves as a template for generating a simulation protocol. At each model exchange, the previous changelog of attempts is not passed on; the new model receives only the error message, the relevant documentation, the latest version of the simulation protocol, and the original user calculation prompt. A sketch of this loop is given below.
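The following minimal sketch captures the retry hierarchy in code; `generate_protocol`, `run_qe`, and `lookup_docs` are hypothetical stand-ins for GENIUS components, not its actual API.

```python
MODEL_HIERARCHY = ["model_1", "model_2", "referee"]
MAX_ATTEMPTS_PER_MODEL = 3

# Hypothetical stand-ins for the real GENIUS components:
def generate_protocol(model, prompt, base, error=None, docs=None):
    ...  # LLM call that produces (or repairs) a QE input file

def run_qe(protocol):
    ...  # launch the QE validation run; return None on success, else the error

def lookup_docs(error):
    ...  # retrieve the QE documentation relevant to the error message

def solve(user_prompt, template):
    """Resolve a calculation request via the AEH model hierarchy."""
    for model in MODEL_HIERARCHY:
        # Context resets at each model switch: the new model starts from the
        # recommendation-system template, not previous models' changelogs.
        protocol = generate_protocol(model, user_prompt, template)
        for attempt in range(MAX_ATTEMPTS_PER_MODEL):
            error = run_qe(protocol)
            if error is None:
                return protocol            # FINISHED
            if attempt < MAX_ATTEMPTS_PER_MODEL - 1:
                # Retry with the same model, given only the error, the docs,
                # the latest protocol, and the original prompt.
                protocol = generate_protocol(model, user_prompt, protocol,
                                             error=error,
                                             docs=lookup_docs(error))
    return None                            # unresolved after all models
```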

Our results demonstrate that successful cases across attempts comprise a mixture of basic, standard, and complex calculation prompts. This observation shows that prompt complexity (basic, standard, or complex) is not inherently problematic for the framework’s performance; complex prompts can even contain more distinctive instructions, enhancing the framework’s ability to generate valid protocols. The general trend reveals that after the initial attempts with Model 1, the number of successful attempts stabilizes at a baseline level. This initial high success rate is followed by a plateau: the number of cases resolved at each successive attempt resembles an exponential decay. For the model selection hierarchy, we assume a performance ordering of Model 1 < Model 2 < referee. Any set of language models can be used, but the predefined ordering characterizes the framework’s behavior; here the referee was chosen to be a state-of-the-art (SOTA) model and is used to establish the performance baseline. This outcome indicates that the framework itself, rather than just the power of the strongest model, is responsible for handling most cases, as the referee model is not disproportionately utilized, which would otherwise suggest a failure in the preceding stages. This demonstrates that the GENIUS framework can be used with any model (it is model-agnostic) and that its overall performance is attributable to its architectural intelligence, not just the underlying language model’s capabilities. The opposite scenario would manifest as the absence of a baseline, with successes concentrated in later attempts, indicating that success relied primarily on the strongest model rather than on the framework’s architectural design.

From our total dataset of 295 calculation requests, 235 successfully produced a valid simulation protocol, with 42 of these succeeding in the zero-shot scenario, defined here as the framework converging to a correct protocol on its very first attempt, without invoking any automated error-handling loops. This yields an overall system success ratio of \(P(S)=\frac{235}{295}\approx 0.7966\), a zero-shot (ZS) success ratio of \(P(\mathrm{ZS})=\frac{42}{295}\approx 0.1424\), and a ratio of success through automated error handling (given that zero-shot fails) of \(P(\mathrm{AEH}\mid \neg \mathrm{ZS})=\frac{193}{253}\approx 0.7628\). To characterize how the success rate evolves with successive attempts, we fitted an exponential decay model to the observed success rates (S) across multiple attempts, where x denotes the attempt number. The function takes the form, Eq. 1:

$$S(x)=A\,{e}^{-bx}+C,\,{{{\rm{RMSE}}}}=1.9 \% ,$$

(1)

obtaining A = 11.1 ± 1.0%, b = 0.46 ± 0.10 (1/attempt), and C = 7.0 ± 0.7%. In this parametrization, the initial amplitude, A (%), represents the maximum influence of the zero-shot attempt; the decay rate, b (1/attempt), determines how quickly this initial advantage decreases over successive attempts; and the baseline, C (%), is the asymptotic success probability reached after many retries.
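Such a fit takes a few lines with SciPy. The sketch below uses placeholder data regenerated from the reported parameters; the real input would be the observed per-attempt success percentages from the workflow logs.

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(x, A, b, C):
    # Eq. 1: S(x) = A * exp(-b * x) + C
    return A * np.exp(-b * x) + C

attempts = np.arange(1, 13, dtype=float)
# Placeholder: regenerate S(x) from the reported fit instead of the logs.
success_rate = decay(attempts, 11.1, 0.46, 7.0)

popt, pcov = curve_fit(decay, attempts, success_rate, p0=(10.0, 0.5, 7.0))
perr = np.sqrt(np.diag(pcov))                     # one-sigma uncertainties
rmse = np.sqrt(np.mean((decay(attempts, *popt) - success_rate) ** 2))
print(f"A = {popt[0]:.1f} ± {perr[0]:.1f} %, "
      f"b = {popt[1]:.2f} ± {perr[1]:.2f} /attempt, "
      f"C = {popt[2]:.1f} ± {perr[2]:.1f} %, RMSE = {rmse:.1f} %")
```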

Figure 6 delineates three distinct operational regimes within GENIUS: recommendation system, maximum workflow utilization, and shallow workflow utilization. The opening recommendation-system regime coincides with the zero-shot pass, highlighting the framework’s ability to generate protocols successfully without model switching or fallback mechanisms. Immediately thereafter, the curve drops into the maximum workflow utilization regime: each early retry unlocks deeper cross-model synergies, yielding rapidly diminishing but still substantive gains. Once the process reaches roughly six attempts, the trajectory flattens into the shallow workflow utilization regime, where performance asymptotically converges toward the baseline value of C ≈ 7%. Within this regime, further retries contribute marginal benefit; success is governed primarily by the workflow’s inherent competence rather than by additional computation. The slight oscillations superimposed on the fitted curve stem from the design choice of resetting the context and switching the model after every third attempt: each reset isolates the subsequent block of attempts, shortening inter-block influence. Allowing more consecutive attempts per model would amplify these oscillations, which could be captured quantitatively by extending the fitting function with an explicit periodic component, thereby requiring finer-grained modeling.

Fig. 6: Exponential decay fit (red curve) applied to the observed fraction of successful runs per attempt number (black points).

A three-parameter exponential, S(x) = 11.1e^(−0.46x) + 7.0% (red line), captures the trend. Shaded bands mark the three operating regimes: the opening recommendation-system zone (zero-shot wins), the steep maximum workflow utilization zone where early retries yield rapidly diminishing but still substantive gains, and the long-tail shallow workflow utilization zone in which performance plateaus at the 7% baseline. The fit confirms that most recoverable errors are corrected within the first three attempts, after which additional computation yields marginal returns.

The recommendation system (Rec) includes the smart knowledge graph, extracts boundary conditions, and evaluates key parameters for each user query (Q). The workflow is complete if the calculation request is successfully resolved in a zero-shot scenario; otherwise, the request proceeds to the AEH subsystem, which may in turn resolve it. Given an effective recommendation system, we can decompose P(S) as in Eq. 2:

$$P(S)=P(\mathrm{ZS})+\left(1-P(\mathrm{ZS})\right)P\left(\mathrm{AEH}\mid \neg \mathrm{ZS}\right).$$

(2)

To estimate the system’s performance in the absence of Rec, we introduce the scaling factors α and β, which quantify how much Rec multiplies the zero-shot and AEH success probabilities, respectively, assuming α, β ≥ 1. Using Eq. 2, the hypothetical Q-only success probabilities are then,

$$P(\mathrm{ZS}\mid Q\text{-only})=\frac{0.1424}{\alpha },\quad P\left(\mathrm{AEH}\mid \neg \mathrm{ZS},\,Q\text{-only}\right)=\frac{0.7628}{\beta },$$

(3)

so that

$$P(S\mid Q\text{-only})=\frac{0.1424}{\alpha }+\frac{0.7628}{\beta }-\frac{0.1086}{\alpha \beta }.$$

(4)

Setting α = β = γ gives

$$P(S\mid Q\text{-only})=\frac{0.9052}{\gamma }-\frac{0.1086}{{\gamma }^{2}}.$$

(5)

The dependence of the success probability on the effectiveness of the recommendation system can be illustrated with a few representative examples. In the limiting case where the recommendation system has no effect (γ = 1), the success probability is 0.7966, identical to the overall success probability with the recommendation system. If the recommendation system boosts effectiveness by 50% (γ = 1.5), the success rate without it drops to 0.56; if it doubles effectiveness (γ = 2), the success rate declines further to 0.43. The sensitivity of the success rate to variations in the recommendation system’s effectiveness is given by the derivative in Eq. 6:

$$\frac{d}{d\gamma }P(S\mid Q\text{-only})=-\frac{0.9052}{{\gamma }^{2}}+\frac{0.2172}{{\gamma }^{3}},\quad \text{for}\ \gamma > 1.$$

(6)
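These numbers are straightforward to verify. Below is a minimal numerical check of Eqs. 2, 5, and 6, using only the ratios reported above:

```python
p_zs, p_aeh = 0.1424, 0.7628   # P(ZS) and P(AEH | not ZS) from the benchmark

def p_success_q_only(gamma):
    # Eq. 5 with both scaling factors set to gamma.
    return 0.9052 / gamma - 0.1086 / gamma**2

def sensitivity(gamma):
    # Eq. 6: derivative of Eq. 5 with respect to gamma.
    return -0.9052 / gamma**2 + 0.2172 / gamma**3

# Eq. 2 sanity check: at gamma = 1 the decomposition reproduces P(S) = 0.7966.
assert abs(p_zs + (1 - p_zs) * p_aeh - p_success_q_only(1.0)) < 1e-3

for gamma in (1.0, 1.5, 2.0):
    print(f"gamma = {gamma:.1f}: P(S|Q-only) = {p_success_q_only(gamma):.2f}, "
          f"dP/dgamma = {sensitivity(gamma):+.3f}")
```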

The derivative in Eq. 6 shows that the reduction in success rate upon removing the recommendation system is most sensitive when its effectiveness factor γ is close to 1, and that as γ increases (i.e., as the recommendation system becomes more effective), the success probability in the Q-only regime diminishes sharply. These results imply that the recommendation system significantly boosts system performance: even small enhancements due to the recommendation system can lead to substantial differences in overall performance.

The benchmark reported here evaluates protocol executability using the controlled validation procedure described in the “Methods” section. Accordingly, a successful case indicates that GENIUS generated a QE protocol that passed parsing and early runtime validation within the benchmark window. This metric does not by itself establish that the resulting workflow is scientifically optimal, that it is the unique protocol an expert user would have chosen, or that it improves end-user productivity relative to manual practice. These questions require dedicated comparative studies with human users and are therefore left for future work.