Genetic Algorithms (GAs) are known for their versatility and robustness as optimization techniques, showcasing their effectiveness in a diverse array of domains. They have gained widespread recognition for their ability to tackle various challenges, including machine learning tasks such as clustering5,6, regression7, time-series analysis8, and improving the accuracy of classification algorithms9,10. Notably, Genetic Algorithms have demonstrated robust efficacy in the execution of classification tasks11,12.
However, despite their wide-ranging applicability, GAs can fall short of expectations, often because of poor parameter selection. This becomes especially critical in complex problems or when performance must be optimized. Suboptimal performance frequently arises from misaligned GA parameters, including population size, crossover probability, and mutation probability. These parameters, in both their absolute values and relative proportions, determine the GA's behavior during the search process, and striking a delicate balance between exploiting the current best solutions and exploring the wider solution space through recombination is essential13.
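The interplay of these parameters can be seen in a minimal GA sketch (illustrative Python, not any specific implementation from the literature): the population size fixes how many candidates are evaluated per generation, while the crossover and mutation probabilities control how aggressively solutions are recombined and perturbed.

```python
import random

def genetic_algorithm(fitness, n_bits, pop_size=50, p_cross=0.8,
                      p_mut=0.01, generations=100):
    """Minimal generational GA over bit strings (maximization)."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        next_pop = scored[:2]  # elitism: carry the two best forward
        while len(next_pop) < pop_size:
            # tournament selection (exploitation pressure)
            p1 = max(random.sample(pop, 3), key=fitness)
            p2 = max(random.sample(pop, 3), key=fitness)
            # single-point crossover with probability p_cross (exploration
            # by recombining solutions)
            if random.random() < p_cross:
                cut = random.randrange(1, n_bits)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # bit-flip mutation with probability p_mut per gene
            child = [1 - g if random.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)

random.seed(0)  # reproducible demo
# OneMax toy problem: fitness is simply the number of ones
best = genetic_algorithm(sum, n_bits=20)
```

Raising `p_mut` or lowering `p_cross` shifts the balance toward exploration; shrinking `pop_size` speeds up each generation at the cost of diversity.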
One common issue encountered in GAs is premature convergence14, in which the algorithm settles on a suboptimal solution too early and fails to explore potentially superior alternatives. Researchers have recognized this challenge, and in15 the focus is on identifying methods to prevent genetic algorithms from converging prematurely.
The process of selecting suitable parameters for a genetic algorithm often involves a trial-and-error approach, which can be time-consuming and computationally demanding. To address this issue, researchers have proposed three methods: empirical studies, parameter adaptation, and theoretical modeling.
Empirical studies involve conducting analyses on various test functions to observe the relationships between different aspects of a genetic algorithm. While this method provides valuable insights, it may not always yield generalizable results for all cases. Additionally, deriving precise recommendations from these studies can be challenging.
Parameter adaptation methods have also been suggested, focusing on dynamically controlling one or more operators, such as crossover probability or mutation rate. These methods adaptively adjust parameter values based on fitness values obtained from each generation. By continuously monitoring the performance of the algorithm, these adaptive techniques aim to maintain population diversity and improve convergence towards optimal solutions.
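As a sketch of this idea (a hypothetical adaptation rule for illustration, not a specific published scheme), a controller might raise the mutation rate when per-generation fitness statistics suggest the population has collapsed onto one region, and decay it when diversity is still present:

```python
def adapt_rates(pop_fitness, p_cross, p_mut,
                p_mut_bounds=(0.001, 0.25), diversity_target=0.1):
    """Adjust the mutation probability from one generation's fitness values.

    Hypothetical rule: if the best and mean fitness are nearly equal
    (the population has converged), increase mutation to restore
    diversity; otherwise decay it toward exploitation.
    """
    mean_f = sum(pop_fitness) / len(pop_fitness)
    best_f = max(pop_fitness)
    # relative spread: approaches 0 as the population converges
    spread = (best_f - mean_f) / (abs(best_f) + 1e-12)
    if spread < diversity_target:
        p_mut = min(p_mut * 1.5, p_mut_bounds[1])   # explore more
    else:
        p_mut = max(p_mut * 0.9, p_mut_bounds[0])   # exploit more
    return p_cross, p_mut

# converged population -> mutation rate increases
_, pm_up = adapt_rates([1.0] * 10, 0.8, 0.01)
# diverse population -> mutation rate decays
_, pm_down = adapt_rates([float(i) for i in range(10)], 0.8, 0.01)
```

The same monitoring loop could equally drive the crossover probability; the threshold `diversity_target` and the update factors are assumptions chosen for the example.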
Theoretical modeling is another approach that seeks to provide a deeper understanding of the GA dynamics and the influence of parameter values on its behavior. Through mathematical models and analyses, researchers aim to uncover the underlying principles governing the GA’s performance. This theoretical knowledge can guide the selection of parameter values and lead to more informed decision-making. Theoretical studies on GA parameter tuning are based on simplified models of the problem and GA behavior, which may not accurately reflect real-world situations. Additionally, these models often rely on binary coding, which limits their applicability to problems that benefit from more complex representations. As a result, theoretical studies require knowledge of the problem that is not readily available, and their results may not be generalizable to other problems or encoding schemes.
Over the past decade, automated methods for configuring state-of-the-art algorithms have gained widespread recognition as a highly effective alternative. Notable examples of such general-purpose automated algorithm configurators include irace16, ParamILS17, and SMAC18. These automated tools have streamlined the process of fine-tuning algorithms, providing a robust alternative to manual configuration.
The software package irace implements several automated configuration methods16. Specifically, it introduces iterated racing procedures, which have proven successful in automating the configuration of various cutting-edge algorithms. Irace's primary objective is to streamline the demanding task of fine-tuning optimization algorithm parameters, and it can also be applied to discover optimal settings in diverse computational systems, such as robotics and traffic light controllers. However, the current iterated racing algorithms within irace have some limitations; most notably, they were primarily designed for scenarios where reducing computation time is not the primary focus.
In a related context,17 presents ParamILS, an approach grounded in iterated local search. ParamILS searches the configuration space of an algorithm directly, seeking parameter settings that enhance overall algorithm performance.
Another method tailored to computationally expensive black-box problems is SMAC (Sequential Model-based Algorithm Configuration)18. This approach builds a model of performance over the configuration space to adapt a general algorithm to a specific problem, effectively replacing manual parameter adjustment with automatic determination.
One of the primary challenges in automated algorithm configuration is the time required to assess a configuration's performance and the size of the configuration space, particularly when dealing with complex training datasets. Randomly selecting configurations can lead to timeouts and provide limited insight. While irace employs racing methods based on statistical tests to discard underperforming configurations, it may still yield insufficient information about the configurations' effectiveness.
In the subsequent sections, we will emphasize the primary approaches for overseeing and automating two crucial GA parameters: the number of iterations and the population size, both of which constitute the central focus of our research.
Number of iterations
Our research, as outlined in3, extensively investigates the impact of the number of iterations on both PDMS and PDMD parallel GA models. Our findings consistently reveal a notable trend: in most scenarios, GA achieves its final optimal solution using only half or even fewer iterations than the default predefined value, rendering the remaining iterations redundant. However, there are instances where the predetermined fixed number of iterations falls short of identifying the optimal solution. This underscores the critical need for an automated approach to regulate the number of iterations, a development that holds substantial potential for enhancing overall efficiency.
Nevertheless, it is essential to acknowledge that many GA applications lack an automatic termination mechanism. Typically, Evolutionary Algorithms (EAs) cannot autonomously determine when to terminate and instead rely on users to predefine a termination criterion. The most commonly employed termination criteria are as follows: (1) reaching a maximum number of generations or function evaluations, a straightforward yet non-optimal approach that is often problem-dependent and requires prior knowledge about the test problem; (2) achieving a fitness value that approaches the known global minimum, a rarity in real-world scenarios; and (3) hitting a maximum number of consecutive generations without improvement. For instance, in19, termination occurs after 50 generations without improvement, while20 allows users to decide when to halt the algorithm. It is worth noting that these termination criteria are generally inapplicable to many real-world problems, since they are based on unconstrained test problems with known minima.
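The three common criteria above can be combined into a single stop test. The following sketch assumes maximization; the parameter names and default values are illustrative, not taken from any cited implementation:

```python
def should_terminate(generation, best_history, max_generations=500,
                     target_fitness=None, stall_limit=50, tol=1e-8):
    """Check the three common termination criteria described above.

    best_history: best fitness recorded at each generation so far.
    """
    # (1) budget: maximum number of generations reached
    if generation >= max_generations:
        return True
    # (2) known global optimum reached (rare in real-world problems)
    if (target_fitness is not None and best_history
            and best_history[-1] >= target_fitness - tol):
        return True
    # (3) stagnation: no improvement over the last stall_limit generations
    if (len(best_history) > stall_limit
            and best_history[-1] - best_history[-1 - stall_limit] <= tol):
        return True
    return False
```

In practice, criterion (2) is only usable on benchmark problems where the optimum is known, which is exactly the limitation noted above.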
Recent studies in the field of EAs have explored various termination criteria, with some notable examples found in21 and22. In21, empirical studies aim to detect the number of generations required for population convergence based on problem characteristics. Meanwhile,22 conducted a survey that categorizes prominent termination criteria in EAs. The study emphasizes the effectiveness of combining direct termination criteria with threshold-based criteria to ensure reliable EA convergence. The prevailing trend, as highlighted in23, leans toward adopting adaptive termination conditions that rely on either genotypical or phenotypical terminations instead of a fixed number of iterations.
In a broader context, recent termination criteria for EAs can be classified into four categories23:
TFit Criterion: This category relies on convergence measures of the best fitness function values across generations.
TPop Criterion: It focuses on convergence measures of the entire population across generations.
TBud Criterion: This category relies on a computational budget, characterized by a maximum number of generations or function evaluations, which is dependent on the specific problem.
TSFB Criterion: This category employs search feedback measures to assess the progress of exploration and exploitation processes.
Using only TFit as a termination criterion carries risks, as it might cause EAs to get stuck in local minima, particularly if the algorithm quickly settles into a profound local minimum. Depending solely on TPop for termination isn’t typically suitable either, as driving the whole population or a large part of it to converge genotypically can be computationally expensive.
Using TBud depends on the specific problem, and the user should already know about the addressed problem before employing it. Often, having one individual reach a global minimum is sufficient. Although24 demonstrated that an upper bound for the number of iterations required for convergence can be theoretically estimated based on a desired confidence level,25 pointed out that these criteria are of limited practical interest, primarily because they tend to be excessively large. Finally, while employing TSFB appears to be efficient, it might encounter challenges related to the curse of dimensionality. Additionally, storing and reviewing extensive historical search data presents complexity issues.
As a result, an intelligent termination strategy becomes imperative. In our modified implementation of both the PDMD and PDMS models, we have replaced the fixed number of iterations used in the original BioHEL with a convergence condition that combines TFit and TPop. Under this approach, the algorithm continually assesses improvements in both the best-found fitness and the average fitness of the population. If a fixed count of consecutive generations (in our case, 3) passes with no improvement in either measure, the rule-learning process is terminated. This threshold was chosen based on preliminary experiments conducted across multiple datasets, including KDD and HEPMASS, where we observed that longer periods without improvement rarely yielded significantly better solutions. Limiting the count to three generations ensures computational efficiency while maintaining solution quality. This choice is further supported by prior studies recommending short stagnation thresholds to prevent overfitting or unnecessary computational overhead in genetic algorithms [CITATION].
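The combined TFit/TPop condition can be sketched as follows (an illustration of the rule described above, assuming maximization; this is not the BioHEL source code):

```python
def converged(best_hist, avg_hist, patience=3, tol=1e-8):
    """Combined TFit/TPop stop test.

    Stop once `patience` consecutive generations have passed with no
    improvement in either the best fitness (TFit) or the average
    population fitness (TPop). Histories hold one value per generation.
    """
    if len(best_hist) <= patience:
        return False  # not enough history yet
    best_improved = best_hist[-1] - best_hist[-1 - patience] > tol
    avg_improved = avg_hist[-1] - avg_hist[-1 - patience] > tol
    # terminate only when neither measure has improved
    return not (best_improved or avg_improved)
```

Requiring stagnation in both measures makes the test stricter than TFit alone: a population whose average fitness is still rising keeps evolving even if the best individual is momentarily stuck.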
Population size
Since the early days of Genetic Algorithm (GA) research, the influence of population size on solution quality has been a subject of investigation26. Yet, there is a compelling argument in favor of considering variable population sizes in GAs, supported by insights documented in27. The author noted that population size significantly impacts algorithm performance and highlighted instances where adaptive population size adjustments improved results. Notable approaches, such as the Genetic Algorithm with Variable Population Size (GAVaPS)28, replace the fixed population size parameter with two properties: age and maximum lifetime. Similarly, the Adaptive Population Size-based GA (APGA)29, a variant of GAVaPS, employs a steady-state GA and maintains the best individual’s lifetime unchanged when individuals age. In30, a growing population size is introduced when individuals exhibit high fitness or during prolonged stagnation periods, effectively decreasing the population size during short stagnation phases.
Collectively, these studies underscore the pivotal role of selecting an appropriate population size to enhance GA efficiency. Notably, one of the most comprehensive GA calibration strategies has been proposed in31. This approach runs multiple populations of varying sizes concurrently, organizing a competitive race among them. These populations progress at different rates, with smaller populations leading in terms of generations completed. For example, at a specific point in time, this GA may feature populations of sizes 256, 512, and 1024. The population of size 256 might be in its 30th generation, while the population of size 512 is in its 6th generation, and the population of size 1024 is still in its 1st generation. Over time, smaller populations are phased out, and larger ones are introduced. These transitions are guided by monitoring average fitness levels within each population. For instance, if the population of size 512 exhibits superior average fitness compared to the population of size 256, the smaller population is terminated, as it’s unlikely to outperform the larger one. This approach operates indefinitely and doesn’t decide when an optimal population size is reached, making it unsuitable for an IRL/sequential covering setting. In our context, we seek an approach that can eventually terminate with confidence, having achieved a high-quality solution derived from an appropriately sized population for the problem at hand. Thus, our approach involves sequential evolution, doubling the population size until further doubling no longer enhances solution quality. This aligns with the parallel case, ensuring that if a population begins to drift, it is promptly terminated due to lack of progress and restarted with a larger population.
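Our sequential doubling strategy can be sketched as follows (illustrative only: the `run_ga` callback, the size bounds, and the no-gain tolerance are assumptions for the example, not BioHEL internals):

```python
def run_with_doubling(run_ga, initial_size=5, max_size=4096, tol=1e-8):
    """Sequentially rerun a GA with doubled population sizes until a
    further doubling no longer improves the best solution found.

    run_ga(pop_size) is assumed to return (best_solution, best_fitness),
    with higher fitness being better.
    """
    size = initial_size
    best_sol, best_fit = run_ga(size)
    while size * 2 <= max_size:
        size *= 2
        sol, fit = run_ga(size)
        if fit - best_fit <= tol:       # doubling brought no gain: stop
            return best_sol, best_fit, size // 2
        best_sol, best_fit = sol, fit
    return best_sol, best_fit, size

# toy stand-in for a GA run: quality improves with population size
# up to a plateau at 40, after which doubling yields nothing
result = run_with_doubling(lambda size: ("solution", min(size, 40)))
```

Unlike the parallel racing scheme in31, this loop terminates on its own once the population is large enough for the problem, which is what an IRL/sequential covering setting requires.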
The overall outcome of this approach closely resembles a continuous increase in population size over time. Furthermore, our intention is to commence our GA with a modest initial population size. In theory, as long as the initial size is adequate to kickstart the learning process, the specific value becomes less critical because the population can expand dynamically if required. However, when the population starts exceedingly small, such as with just one or two individuals, it may not facilitate an efficient initial phase for addressing large-scale data problems, potentially leading to premature learning termination. To establish an appropriate starting point, we draw inspiration from the micro-genetic algorithm (micro-GA) as outlined by26 and first implemented by32. Krishnakumar employed a population size of 5 and observed that his micro-GA surpassed the standard GA in terms of processing speed.
The following sections will introduce a strategy for managing and automating the number of iterations and population size of BioHEL to enhance the performance of PDMD and PDMS.