Optimization model of spatial layout driven by AI

Figure 1 presents the design of the spatial optimization model, illustrating the entire process from data input to dynamic optimization. At the core of the system is the RL model, which drives layout adjustments based on real-time reward feedback. Each step is supported by segmented descriptions, preprocessing techniques, feature threshold settings, and relevant literature, highlighting the scientific rationale behind the model’s design.

Fig. 1. Design of spatial optimization model.

Design of the spatial optimization model

To enhance both the exhibition effect and the audience experience, this study proposes a spatial layout optimization model based on AI technology. The model integrates RL and CV to enable automated, dynamic adjustment of exhibition layouts. The optimization process begins by abstracting the exhibition space as a dynamic environment: visitor behavior data—such as dwell time, navigation paths, and interaction frequency—serve as the state input, while the model’s actions include adjusting exhibit positions, reordering content, and redefining exhibition zones. To train the model to select optimal layout strategies, a reward function is constructed from visitor behavior: layout changes that lead to longer dwell times, more efficient navigation, or increased interaction receive higher rewards, incentivizing the model to continuously improve the layout. The DQN algorithm is employed to implement this strategy, using a Q-value function to evaluate state-action pairs and updating its parameters through temporal difference learning. By interacting with the environment over time, the RL model refines its decision-making based on real-time feedback, making the layout more adaptive and enhancing visitor engagement and satisfaction. Within this RL framework, the museum environment serves as the agent’s environment, audience behavior as the state input, and layout adjustments as the output actions, and the reward mechanism prioritizes configurations that maximize audience interaction and experiential satisfaction. Therefore, this study focuses on the following key audience behaviors and constructs the reward function accordingly; an illustrative environment sketch follows the list.

1) Dwell Time: Longer visitor dwell time in front of an exhibit typically indicates higher interest and engagement. The model therefore assigns positive rewards for increased dwell time, encouraging the system to adjust exhibit positions in ways that capture and sustain audience attention.

2) Visiting Path: A shorter or more coherent visiting path suggests that the spatial layout effectively guides the audience through the exhibition. The model rewards rational, efficient navigation paths to reduce unnecessary detours and improve overall flow and user experience.

3) Interaction Frequency: The frequency of interactions—such as touchscreen use, AR/VR engagement, or physical exhibit interaction—serves as a key metric of exhibit attractiveness. Higher interaction frequency receives greater rewards, prompting the model to favor more interactive and engaging layout configurations.

4) Emotional Feedback: Using CV and affective computing, the system analyzes facial expressions to detect positive emotional responses such as joy and surprise. Areas that consistently elicit positive emotional reactions receive higher rewards, guiding layout optimization toward emotionally resonant spatial arrangements.

5) Audience Distribution Balance: Uneven visitor distribution—where some areas are overcrowded while others are underutilized—is detrimental to the visitor experience. The model rewards more balanced audience distribution across exhibition zones to improve space utilization and visitor comfort.
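Taken together, these behavioral signals define the agent’s state and reward. The sketch below illustrates how the exhibition space could be wrapped as an RL environment; the per-zone state layout, action count, and reward weights are illustrative assumptions rather than the study’s exact implementation.

```python
import numpy as np

class ExhibitionLayoutEnv:
    """Illustrative museum-layout environment (assumed structure, not the study's exact code).

    State  : per-zone visitor statistics [dwell time, interaction rate, crowd density], flattened.
    Action : a discrete layout adjustment (e.g., swap exhibits, reorder content, resize a zone).
    Reward : weighted combination of the behavioral signals listed above.
    """

    def __init__(self, n_zones: int = 8, n_actions: int = 16):
        self.n_zones = n_zones
        self.n_actions = n_actions
        self.state = np.zeros(n_zones * 3, dtype=np.float32)

    def reset(self) -> np.ndarray:
        self.state = np.zeros(self.n_zones * 3, dtype=np.float32)
        return self.state

    def step(self, action: int):
        # In deployment, applying `action` changes the physical/virtual layout and the next
        # state comes from the CV sensing pipeline; here it is simulated with small noise.
        next_state = self.state + np.random.normal(0.0, 0.05, self.state.shape).astype(np.float32)
        reward = self._reward(next_state)
        self.state = next_state
        return next_state, reward, False, {}

    def _reward(self, s: np.ndarray) -> float:
        dwell, interact, density = s[0::3], s[1::3], s[2::3]
        # Hypothetical weights; the study describes the reward terms qualitatively.
        balance_penalty = np.abs(density - density.mean()).mean()
        return float(0.4 * dwell.mean() + 0.3 * interact.mean() - 0.3 * balance_penalty)
```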

Dwell time is a key indicator of visitor interest and has been widely validated in museum behavior studies40. This study tracked the duration of time visitors spent in front of exhibits in real time. Following the threshold method proposed by Jouibari—where a stay of ≥ 3 s indicates valid interest—the attractiveness of exhibits was quantitatively assessed41. For visiting path optimization, this study referred to the path-planning framework developed by Ntakolia42, which identified path coherence and redundancy as critical factors influencing the visitor experience. Interaction frequency was measured using the standards outlined in Wu’s interaction design theory43. A high interaction rate (≥ 80%) was used as a benchmark for participatory and user-friendly exhibit design. Audience emotional feedback was analyzed based on EC theory. Multimodal data—including facial expressions, voice tone, and body posture—were integrated to classify emotional states in real time44. Audience distribution balance was evaluated using the social force model, with the Gini coefficient applied to measure the evenness of crowd density across the space. This approach aimed to prevent the negative impact of local overcrowding on the overall visitor experience45. The goal was not to eliminate natural crowd clustering around popular exhibits, but to distinguish it from inefficient congestion using dynamic thresholds. Following Easson’s visitor interest-driven theory46, a density threshold (\(\rho_{max}=5\ \text{people/m}^2\)) was set for exhibit areas. When local density remained at or below this threshold (\(\rho\le\rho_{max}\)), the system interpreted it as reasonable clustering and applied a base-level reward. If density exceeded the threshold (\(\rho>\rho_{max}\)), a congestion optimization mechanism was triggered—such as path redirection or minor exhibit repositioning—and a negative reward proportional to the excess density (\(\rho-\rho_{max}\)) was assigned. This mechanism preserved natural interest clusters (e.g., crowds around iconic pieces such as the Mona Lisa) while preventing disruptive congestion. Furthermore, \(\rho_{max}\) was adjusted adaptively according to exhibit type: interactive zones (e.g., VR areas), which require more space for safe and effective engagement, were assigned a lower threshold (3 people/m²), whereas static display zones were allowed a higher limit (6 people/m²).
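For clarity, this density-handling rule can be written as a short reward term. The thresholds (5 people/m² by default, 3 for interactive zones, 6 for static zones) follow the text, while the base-reward and penalty magnitudes are assumed values for illustration.

```python
# Density-handling reward term following the thresholds described above.
RHO_MAX = {"interactive": 3.0, "static": 6.0, "default": 5.0}  # people per square metre

def density_reward(rho: float, zone_type: str = "default",
                   base_reward: float = 1.0, penalty_scale: float = 0.5) -> float:
    """Base reward while clustering stays within the zone-specific threshold;
    negative reward proportional to the excess density once it is exceeded.
    (base_reward and penalty_scale are assumed values, not reported in the study.)"""
    rho_max = RHO_MAX.get(zone_type, RHO_MAX["default"])
    if rho <= rho_max:
        return base_reward                    # natural interest clustering is preserved
    return -penalty_scale * (rho - rho_max)   # congestion triggers a negative reward
```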

Realization of DQN algorithm

The DQN was employed as the primary RL algorithm47. In this framework, each spatial layout decision—such as adjusting exhibit positions or redefining exhibition zones—was treated as a distinct state. The reward signal was generated based on real-time audience behavior data, including dwell time and path selection. Through iterative training, the model progressively learned the optimal spatial layout strategy. The overall optimization objective is expressed in Eq. (1):

$$R_{t}=\sum_{i=1}^{n}\gamma^{i}\cdot r_{i}$$

(1)

\(R_{t}\) represents the total reward at time step t, where γ is the discount factor, \(r_{i}\) is the immediate reward for the \(i\)-th behavior, and n is the length of the behavior sequence. This equation indicates that the model’s reward depends not only on the current action but also on the expected future rewards. As a result, the model optimizes the museum’s spatial layout with a long-term perspective rather than focusing solely on immediate feedback.
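As a rough illustration of Eq. (1) and the temporal-difference update described above, the sketch below computes the discounted return and performs one DQN update step. The network sizes, discount factor, and learning rate are assumptions; the exact hyperparameters are not reported here.

```python
import torch
import torch.nn as nn

def discounted_return(rewards, gamma: float = 0.95) -> float:
    """Eq. (1): R_t = sum_i gamma^i * r_i over a behavior sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Q-network and target network over an assumed 24-dimensional state and 16 discrete actions.
q_net = nn.Sequential(nn.Linear(24, 128), nn.ReLU(), nn.Linear(128, 16))
target_net = nn.Sequential(nn.Linear(24, 128), nn.ReLU(), nn.Linear(128, 16))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def td_update(state, action, reward, next_state, gamma: float = 0.95) -> float:
    """One temporal-difference step toward the DQN target r + gamma * max_a' Q_target(s', a')."""
    q_sa = q_net(state)[action]
    with torch.no_grad():
        target = reward + gamma * target_net(next_state).max()
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```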

To more accurately analyze visitor behavior within the exhibition space, CV technology is integrated into the system. Sensors and cameras are deployed to capture real-time behavioral data, including movement trajectories, dwell times, and gaze points. These data not only assist in evaluating the current layout’s effectiveness but also provide critical feedback for ongoing optimization. Figure 2 illustrates a sample spatial layout used in some museums.

Fig. 2. Some examples of exhibition space layout in museums.

In processing image data, CV technology is used to track visitor behavior through image recognition and analysis. Cameras installed throughout the exhibition space capture facial expressions, eye movements, body posture, and motion. These image inputs are processed using CNNs to extract key behavioral features.

Facial expression recognition, in particular, relies on a CNN-based deep learning model to identify and classify emotional states from captured facial images. High-resolution cameras record these facial images in real time within the museum environment. The raw data then undergoes several preprocessing steps—including image denoising, grayscale conversion, and face detection—to ensure input quality and accuracy.

The core of facial expression recognition involves feature extraction using a CNN. The convolutional layers identify critical facial features, such as the shape, position, and dynamic changes of the eyebrows, eyes, and mouth. Pooling layers reduce data dimensionality and enhance feature robustness. Finally, fully connected layers classify the extracted features into specific emotional categories such as joy, surprise, anger, sadness, and interest.
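A compact CNN of the kind just described might look as follows; the 48×48 grayscale input, channel widths, and five output classes are assumptions rather than the exact architecture used in the study.

```python
import torch
import torch.nn as nn

class ExpressionCNN(nn.Module):
    """Small CNN as described above: convolutional layers extract facial features,
    pooling reduces dimensionality, fully connected layers classify emotions.
    Layer sizes and the 48x48 grayscale input are assumptions (FER-style datasets)."""

    def __init__(self, n_classes: int = 5):   # joy, surprise, anger, sadness, interest
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 6 * 6, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):          # x: (batch, 1, 48, 48) preprocessed face crops
        return self.classifier(self.features(x))
```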

Dynamic layout optimization

To enhance the accuracy and adaptability of facial expression recognition, a pre-trained neural network model is fine-tuned using an open-access facial expression dataset. This allows the model to better accommodate the diversity and complexity of real-world facial expressions in a museum setting. Data augmentation techniques—such as rotation, translation, scaling, and image flipping—are also applied to improve the model’s robustness under varying environmental conditions.
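The augmentation pipeline mentioned above could be assembled with standard torchvision transforms, for example; the specific parameter ranges below are assumed rather than reported values.

```python
from torchvision import transforms

# Illustrative augmentation pipeline for fine-tuning; exact parameter ranges are assumptions.
train_transforms = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.RandomRotation(degrees=15),                      # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1),    # translation
                            scale=(0.9, 1.1)),                  # scaling
    transforms.RandomHorizontalFlip(p=0.5),                     # flipping
    transforms.Resize((48, 48)),
    transforms.ToTensor(),
])
```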

The results of facial expression recognition are then integrated with other behavioral data. For instance, by analyzing changes in facial expressions, the system can infer visitor interest. Strong positive emotions (e.g., joy or surprise) observed while viewing an exhibit suggest high engagement, prompting the system to prioritize or enhance that exhibit’s location and interactivity. Conversely, negative emotions (e.g., boredom or confusion) may trigger adjustments to the exhibit’s content or presentation to improve visitor engagement and satisfaction.

Facial expression data can also be fused with eye-tracking and body posture information. Eye movement analysis helps identify which exhibits draw the most attention, offering valuable insights for spatial layout decisions. Meanwhile, body posture cues—such as lingering, movement, or hesitation—can reflect the visitor’s level of engagement and intent to interact. This enables the system to dynamically tailor display modes or interactive content to suit different visitor preferences.

Once processed, these image-based behavior features are quantified and fed into the RL model to guide spatial layout optimization. For example, if a visitor lingers in front of an exhibit while displaying high interest, the system may increase that exhibit’s visibility or adjust its location to maximize impact. The CV processing flow is formalized in Eq. (2):

$$D_{feedback}=f(I_{input},W_{model})$$

(2)

\(D_{feedback}\) represents the audience behavior data captured from the cameras. \(I_{input}\) is the input image data. \(W_{model}\) denotes the weights of the CV model. The spatial layout optimization process is dynamic, with the model continuously adjusting based on real-time interactions with the museum environment and audience feedback. For instance, if the system detects excessive crowd density in a particular exhibition area, it can proactively modify the number of exhibits or adjust the spatial arrangement to alleviate congestion. These changes aim to enhance visitor comfort and engagement. The algorithmic structure of the spatial layout optimization model is illustrated in Fig. 3:

Fig. 3. AI-driven spatial layout optimization model structure.

As shown in Fig. 3, the process begins with analyzing audience behavior data using CV to extract key features. These features are then fed into the RL model, which makes optimal decisions for spatial layout. The optimized layout is subsequently applied to the museum space, enabling dynamic adjustments. By combining RL and CV, the model can adapt to varying museum environments. Since exhibit locations, exhibition area mobility, and audience preferences constantly change, the model continuously learns and adjusts to enhance the visitor experience.

Exhibition Liquidity is defined as the proportion of exhibits that visitors can efficiently access within a given time. Its calculation follows Eq. (3):

$$L=\frac{N_{visited}}{N_{total}}\times\left(1-\frac{T_{avg\_detour}}{T_{shortest}}\right)\times 100\%$$

(3)

\(N_{visited}\) is the number of exhibits actually visited by the audience. \(N_{total}\) is the total number of exhibits in the exhibition hall. \(T_{avg\_detour}\) is the average time difference between the actual path of the audience and the theoretical shortest path, and \(T_{shortest}\) is the total time of the theoretical shortest path. Equation (4) shows the calculation of Path Optimization Rate:

$$P=\left(1-\frac{\sum_{i=1}^{n}\left(D_{i}-D_{min}\right)}{\sum_{i=1}^{n}D_{min}}\right)\times 100\%$$

(4)

\(D_{i}\) is the actual path length of the \(i\)-th visitor. \(D_{min}\) is the theoretical shortest path length of the corresponding exhibit sequence. The balance of crowd density distribution is measured by the Gini coefficient, as shown in Eq. (5):

$$G=\frac{\sum_{i=1}^{k}\sum_{j=1}^{k}|x_{i}-x_{j}|}{2k\sum_{i=1}^{k}x_{i}}$$

(5)

\(x_{i}\) is the density of people in the \(i\)-th exhibition area, and \(k\) is the total number of exhibition areas. The Frequency of Congested Areas is defined as the proportion of exhibition areas where the pedestrian density exceeds the threshold (5 people/m²), as shown in Eq. (6):

$$F=\frac{\sum_{t=1}^{T}C_{t}}{T\times k}\times 100\%$$

(6)

\(C_{t}\) is the number of congested exhibition areas in the \(t\)-th hour. \(T\) is the total observation time (in hours), and \(k\) is the total number of exhibition areas.
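The four evaluation metrics in Eqs. (3) to (6) can be computed directly from logged visitor data; a minimal NumPy sketch is shown below, with variable names following the equations (input aggregation is assumed to happen upstream).

```python
import numpy as np

def exhibition_liquidity(n_visited, n_total, t_avg_detour, t_shortest):
    """Eq. (3): share of exhibits reached, penalized by the average detour time (%)."""
    return n_visited / n_total * (1 - t_avg_detour / t_shortest) * 100

def path_optimization_rate(actual_lengths, shortest_lengths):
    """Eq. (4): 1 minus total excess path length over total shortest path length (%)."""
    d, d_min = np.asarray(actual_lengths), np.asarray(shortest_lengths)
    return (1 - (d - d_min).sum() / d_min.sum()) * 100

def gini_coefficient(zone_densities):
    """Eq. (5): evenness of crowd density across the k exhibition zones."""
    x = np.asarray(zone_densities, dtype=float)
    k = len(x)
    return np.abs(x[:, None] - x[None, :]).sum() / (2 * k * x.sum())

def congestion_frequency(congested_counts, total_hours, n_zones):
    """Eq. (6): share of zone-hours whose density exceeds the 5 people/m^2 threshold (%)."""
    return sum(congested_counts) / (total_hours * n_zones) * 100
```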

This study referenced standard museum industry guidelines on crowd density, which recommended limiting exhibition areas to a maximum of five people per square meter to ensure visitor comfort. These guidelines were further adjusted to reflect the specific spatial layout and exhibit types of the target museum. A two-week on-site test was carried out to collect visitor flow data across different time periods. Simulations then assessed space utilization and visitor experience under varying density thresholds. The analysis showed that maintaining a density of five people per square meter effectively prevented congestion, maximized space use, and improved the overall visitor experience.

Interactive experience optimization model
Design of personalized recommendation model

To further enhance immersion in the museum space and increase audience engagement, this study also developed an AI-based interactive experience optimization model, as illustrated in Fig. 4, alongside the spatial layout optimization.

Fig. 4. Optimization model of museum interactive experience.

A key component of the interactive experience optimization model shown in Fig. 4 is an intelligent recommendation system driven by audience behavior data. By analyzing real-time data such as interest points, dwell time, and interaction frequency, the model infers individual preferences and automatically adjusts the displayed content to provide a personalized experience. For example, if the model detects that a visitor shows strong interest in a particular exhibit—indicated by prolonged viewing or frequent interactions—the system dynamically recommends additional information about that exhibit or other exhibits with similar themes. This recommendation process is implemented using a Collaborative Filtering approach.

Hybrid recommendation strategy and cold start processing

This study adopts an item-based collaborative filtering algorithm, a method widely used in product recommendation systems. It analyzes user behavior data—such as points of interest, dwell time, and interaction frequency—to calculate similarities between exhibits and recommend others with similar themes or styles. Unlike traditional product recommendations, museum exhibition recommendations prioritize enhancing visitor immersion and interactive participation. Building on the proven success of collaborative filtering in e-commerce, this study adapts the approach to the museum context by treating visitor behavior as implicit feedback. The resulting personalized recommendation model not only captures a visitor’s interest in specific exhibits but also uncovers potential preferences across different exhibition content by integrating multiple types of behavior data, and it adjusts display content in real time to match these inferred preferences. Moreover, the collaborative filtering algorithm operates in tandem with RL, CV, and emotion computing technologies, enabling dynamic optimization throughout the entire process—from data collection to exhibit recommendation and interaction mode adjustment. The algorithm thus retains its strength in accurately capturing user preferences while meeting the museum’s higher demands for personalization, interactivity, and immersive experience. To further improve recommendation accuracy and address data sparsity, the model incorporates implicit feedback such as browsing history and dwell time, which helps capture visitor preferences more comprehensively and boosts the recommendation system’s performance.
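A minimal sketch of the item-based collaborative filtering step is given below, assuming an implicit-feedback matrix whose rows are visitors and whose columns are exhibits (entries could combine dwell time and interaction counts); the matrix construction and top-k selection are illustrative choices, not the study’s exact pipeline.

```python
import numpy as np

def item_similarity(feedback: np.ndarray) -> np.ndarray:
    """Cosine similarity between exhibits, computed from an implicit-feedback matrix
    (rows = visitors, columns = exhibits). Normalization choice is an assumption."""
    norms = np.linalg.norm(feedback, axis=0, keepdims=True) + 1e-9
    unit = feedback / norms
    return unit.T @ unit                          # (n_exhibits, n_exhibits)

def recommend(feedback: np.ndarray, visitor: int, top_k: int = 5):
    """Score unseen exhibits by similarity to exhibits the visitor already engaged with."""
    sim = item_similarity(feedback)
    scores = sim @ feedback[visitor]              # similarity weighted by past engagement
    scores[feedback[visitor] > 0] = -np.inf       # do not re-recommend visited exhibits
    return np.argsort(scores)[::-1][:top_k]
```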

To address the cold start problem in collaborative filtering, this study implements a hybrid recommendation strategy. For newly added exhibits, the system begins by extracting content-based features such as exhibit type, historical period, material, and thematic tags. It then uses cosine similarity to compare these features with those of existing exhibits, generating an initial recommendation list. As interaction data for the new exhibit accumulates, the system gradually transitions to a collaborative filtering-based recommendation approach. For first-time visitors, the system assigns them to predefined audience groups based on demographic information, such as age and cultural background. It then recommends exhibits that have historically been favored by that group. As visitor behavior data becomes available, the model dynamically updates its parameters and shifts toward a personalized recommendation based on collaborative filtering. This hybrid approach effectively balances recommendation accuracy and data availability, ensuring reliable performance during the cold start phase for both new users and new exhibits.
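The cold-start path for new exhibits could be handled as sketched below: content features are compared by cosine similarity, and scores blend content-based and collaborative components as interaction data accumulates. The feature encoding and blending ramp are assumptions; the text specifies only the attribute categories and the gradual transition.

```python
import numpy as np

def content_cold_start(new_exhibit_feats: np.ndarray, catalog_feats: np.ndarray, top_k: int = 5):
    """Cold-start step for a new exhibit: cosine similarity between its content features
    (type, period, material, theme tags encoded as a vector) and those of existing exhibits.
    The feature encoding itself is an assumption; the text names only the attribute categories."""
    a = new_exhibit_feats / (np.linalg.norm(new_exhibit_feats) + 1e-9)
    b = catalog_feats / (np.linalg.norm(catalog_feats, axis=1, keepdims=True) + 1e-9)
    return np.argsort(b @ a)[::-1][:top_k]

def hybrid_score(cf_score: float, content_score: float, n_interactions: int, ramp: int = 50) -> float:
    """Blend that shifts from content-based to collaborative-filtering scores as interaction
    data for the new exhibit accumulates (the ramp length is an assumed parameter)."""
    alpha = min(n_interactions / ramp, 1.0)
    return alpha * cf_score + (1 - alpha) * content_score
```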

Adaptive adjustment of interactive mode

In the museum’s interactive experience optimization model, adaptive adjustment of interaction modes is incorporated alongside content recommendation. Interactive experiences may include touchscreen displays, AR/VR environments, audio feedback, and other formats. By analyzing real-time audience behavior, the model can intelligently modify the interaction mode to enhance engagement and participation. For instance, if the system detects low user engagement with touchscreen displays—such as infrequent touch activity—it can automatically switch to alternative modes like VR experiences or AR visualizations to improve interactivity.

This adaptive process relies on behavioral analysis and ML algorithms. The system continuously monitors audience behavior, including dwell time, viewing frequency, and interaction rates. If the interaction frequency within a particular exhibit area drops below a threshold, the system dynamically adjusts the display format to better capture attention and increase involvement. For example, introducing AR features to provide richer, more immersive content can re-engage visitors with low initial interest. The decision to change the interaction mode is guided by real-time data analysis, ensuring that the interactive experience remains responsive and tailored to visitor behavior. This optimization process is formally represented in Eq. (7).

$$I_{t}=f(B_{t},A_{t})$$

(7)

\(I_{t}\) represents the interaction mode provided by the model for the audience at time t. \(B_{t}\) represents the behavior data of the audience at time t (e.g., interaction frequency and dwell time). \(A_{t}\) represents the display characteristics of the current exhibit (e.g., exhibit type and supported interaction modes). Through this equation, the model can adjust the content and mode of interaction in real time.
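One simple way to realize Eq. (7) is a rule that keeps the current mode while engagement is adequate and switches to a richer supported mode when the interaction rate falls below a threshold. The threshold value, mode names, and dictionary fields below are assumptions for illustration.

```python
# Illustrative realization of Eq. (7): choose an interaction mode I_t from behavior data B_t
# and exhibit characteristics A_t. Threshold, mode names, and dictionary fields are assumptions.
def select_mode(behavior: dict, exhibit: dict, low_interaction: float = 0.2) -> str:
    current = behavior.get("current_mode", "touchscreen")
    if behavior["interaction_rate"] >= low_interaction:
        return current                                  # engagement is adequate, keep the mode
    # Low engagement: switch to a richer mode that this exhibit supports.
    for candidate in ("VR", "AR", "audio"):
        if candidate in exhibit["supported_modes"] and candidate != current:
            return candidate
    return current
```

For example, `select_mode({"interaction_rate": 0.05, "current_mode": "touchscreen"}, {"supported_modes": ("touchscreen", "AR")})` would return "AR", mirroring the touchscreen-to-AR switch described above.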

Audiences are grouped based on demographic attributes such as age and cultural background. Age categories may include adolescents, young and middle-aged adults, and seniors, while cultural backgrounds can be classified into local and international cultures. These groupings can be established through clustering analysis of historical museum visitor data or by referencing relevant academic research and the specific context of the museum. Before the system is officially deployed, a small group of visitors with known demographic information is invited for testing. This allows observation of whether the initial recommendations based on demographic profiles align with the actual interests of these visitors. Based on the test results, the predefined audience groupings are refined and optimized. As more data is collected, online learning algorithms are employed to continuously update and improve the recommendation model. For visitors who were initially misclassified, the system automatically adjusts future recommendations based on their behavioral data, gradually increasing recommendation accuracy. Additionally, visitors can provide direct feedback on recommendations—such as through likes or bookmarks—which is also incorporated into the system to further enhance recommendation strategies.

When switching interactive modes, the system relies on real-time data to ensure a seamless transition between touchscreen and VR/AR experiences. If the interaction frequency in touchscreen mode is low, the system automatically prompts the audience to switch to VR or AR, offering a more immersive experience. Intelligent algorithms and sensors support this transition, ensuring it occurs smoothly and without disrupting the visitor’s engagement. When a user shows limited responsiveness to touchscreen interaction, the system analyzes their behavior history and current activity to recommend alternative display modes. User identification and behavior tracking technologies are employed to preserve interaction history during mode switches, maintaining a continuous and uninterrupted experience. For instance, when transitioning from touchscreen to VR or AR, the system ensures that previously viewed content is carried over and displayed in the new mode.

This adaptive optimization enables the model to dynamically adjust interactive experiences and flexibly switch between different display modes. As a result, the system delivers a more personalized and immersive museum experience that aligns closely with visitors’ needs and preferences.

Emotional computing module

To further enhance audience immersion, the interactive experience optimization model integrates EC and situational awareness technologies to detect the audience’s emotional state in real time. Based on these emotional changes, the system dynamically adjusts both the exhibit content and the interaction mode. By analyzing physiological and behavioral cues—such as facial expressions, vocal tone, and body posture—the model can assess the visitor’s emotional response and make appropriate adjustments. For instance, if a visitor appears engaged or curious, the system may display more detailed content or introduce interactive elements. Conversely, if signs of fatigue or boredom are detected, the system may shift to more entertaining or stimulating content to sustain interest and participation.

The objective of EC is to predict and interpret the audience’s emotional states using multi-dimensional input signals. This process is modeled as a multi-input, multi-output function, formally expressed in Eq. (8):

$$E_{t}=g(F_{t},V_{t},A_{t})$$

(8)

\(E_{t}\) represents the emotional state of the audience at time t, expressed as a numerical emotion score mapped to categories such as positive, negative, excited, or calm. \(F_{t}\) is the audience’s facial expression data, extracted by the facial expression recognition algorithm and covering features such as smiles, frowns, and widened eyes. In this study, a CNN is employed to extract facial features and learn the mapping between these features and emotional states from a large set of labeled data. The model recognizes expressions such as smiles and frowns through a deep neural network and maps them to emotional categories (e.g., happiness, anger, sadness). These features serve as indicators of changes in audience emotion. When multiple viewers are present simultaneously, the system analyzes each individual’s facial expression data independently, identifying and classifying each viewer’s facial expressions to determine their emotional state within a specific time frame. The emotional data from all viewers are then aggregated to calculate the overall emotional distribution during that period. This approach avoids the complexity of merging raw facial data and ensures accurate emotional analysis.

\(V_{t}\) represents the audience’s voice tone data. Voice emotion is analyzed using speech recognition technology, focusing on features such as intonation, speech rate, and pitch. These vocal cues often convey subtle emotional states like happiness, surprise, or anxiety. \(A_{t}\) refers to body posture data, which captures audience movements and postures (e.g., standing, sitting, waving) through sensors or CV. These physical behaviors can indicate emotional and physiological responses. After collecting data from facial expressions, voice tone, and body posture, a fusion model integrates these multimodal signals to derive a comprehensive emotional assessment. The final emotional score is calculated using a weighted approach, as defined in Eq. (9):

$$E_{t}=w_{F}\cdot F_{t}+w_{V}\cdot V_{t}+w_{A}\cdot A_{t}$$

(9)

\(w_{F}\), \(w_{V}\), and \(w_{A}\) are the weights of facial expression, voice tone, and body posture, respectively, indicating the importance of each signal type to EC. These weights are optimized during training to learn the contribution of each modality to emotional judgment.

In the EC module, the weights for facial expressions, voice tone, and body posture (\(w_{F}\), \(w_{V}\), \(w_{A}\)) are learned through a supervised learning framework. A multimodal dataset comprising 1,000 samples of audience behavior—including facial expressions, audio recordings, and posture videos—is annotated by experts with emotion labels (positive, neutral, or negative). A fully connected neural network integrates these multimodal features, and the weights are optimized using gradient descent to minimize the cross-entropy loss in emotion classification.
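A sketch of this supervised fusion step is shown below; the per-modality feature dimensions, hidden width, and learning rate are assumptions, while the three-class labels and cross-entropy objective follow the text.

```python
import torch
import torch.nn as nn

class EmotionFusion(nn.Module):
    """Fully connected fusion of facial (F_t), vocal (V_t), and posture (A_t) features
    into three emotion classes (positive / neutral / negative).
    Feature dimensions and hidden size are assumptions."""

    def __init__(self, d_face=128, d_voice=64, d_pose=32, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_face + d_voice + d_pose, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, f, v, a):
        return self.net(torch.cat([f, v, a], dim=-1))

model = EmotionFusion()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(f, v, a, labels) -> float:
    """One gradient-descent step minimizing the cross-entropy emotion-classification loss."""
    logits = model(f, v, a)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```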

The experiment collected emotional data from audiences with diverse cultural backgrounds, including facial expressions, voice tone, and body posture, and analyzed these data using ML algorithms. By comparing emotion classification accuracy under different weight combinations, the current weight allocation was determined. Cross-cultural validation experiments further showed that the weights exhibit good adaptability and stability across cultural contexts, although slight adjustments may be necessary for specific cultures. These findings indicate that the model possesses a degree of generalizability while allowing for personalized adaptation based on cultural differences. The final weights, determined through 5-fold cross-validation, are \(w_{F}=0.52\pm 0.03\), \(w_{V}=0.28\pm 0.02\), and \(w_{A}=0.20\pm 0.02\) (mean ± standard deviation). Experiments show that this weight combination achieves an emotion classification accuracy of 87.6%, significantly higher than uniform weights (76.2%) and single-modality models (facial: 79.3%, voice: 68.5%, posture: 62.1%).

A multi-technology collaborative optimization framework

This study establishes a dynamic optimization feedback loop through a collaborative multi-technology framework, as illustrated in Fig. 5. CV continuously tracks audience behavior, including movement patterns and dwell time. EC analyzes facial expressions, voice tone, and body posture to generate emotional scores, such as interest or confusion. VR devices capture interactions in virtual environments, including hotspot clicks and navigation paths. Data from these three sources are integrated into an RL model, which dynamically adjusts exhibit layouts and interaction logic based on a multi-dimensional reward function. This function assigns weights of 0.4 to dwell time, 0.3 to path efficiency, and 0.3 to emotional feedback. For instance, if EC detects persistently low emotional engagement in a particular area, the RL model can trigger a transition to AR mode, presenting dynamic content such as 3D animations to recapture audience attention. The optimized layouts and interaction modes are implemented via VR devices and physical space sensors. Their effectiveness is evaluated through audience surveys measuring immersion and knowledge acquisition. The results are continuously fed back into the AI processing layer, enabling ongoing optimization. This integrated system is the first to achieve a multi-modal, closed-loop feedback cycle of “behavior–emotion–space,” overcoming the limitations of traditional single-technology approaches.
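The multi-dimensional reward described above reduces to a weighted sum; a minimal sketch is given below, with the 0.4/0.3/0.3 weights taken from the text and the assumption that each input has already been normalized to [0, 1].

```python
def layout_reward(dwell_time: float, path_efficiency: float, emotion_score: float,
                  w=(0.4, 0.3, 0.3)) -> float:
    """Multi-dimensional reward described above: 0.4 dwell time, 0.3 path efficiency,
    0.3 emotional feedback. Inputs are assumed to be normalized to the [0, 1] range."""
    return w[0] * dwell_time + w[1] * path_efficiency + w[2] * emotion_score
```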

Fig. 5. Multi-technology collaborative optimization framework.

Figure 5 illustrates a three-layer multi-technology collaborative optimization framework. The data acquisition layer collects real-time information on audience behavior, emotions, and interactions using CV sensors, EC analysis devices, and VR equipment. The AI processing layer integrates these data—such as movement paths, emotional scores, and VR interactions—through an RL model. This model dynamically optimizes spatial layouts by adjusting exhibit positions and refines interaction modes, including switching between AR and VR. The feedback execution layer applies these optimized results to both physical and virtual environments in real time and assesses audience experience through surveys. Together, these layers create a closed-loop system for continuous improvement.