Model testing and evaluation indicatorsEvaluation indicators

The objective assessment indicators employed in the absence of external references encompass the Lightness Order Error (LOE) and the Natural Image Quality Evaluator (NIQE). The LOE indicator serves to gauge the fidelity of intensity distribution pertaining to musical information following enhancement procedures, thereby assessing the inherent naturalness of the resultant enhancements. The associated computation methodology is delineated within Eqs. (9) to (11):

$$\:LOE=\frac{1}{N}\sum\:_{i=1}^{N}RD\left(i\right)$$

(9)

$$\:RD\left(i\right)=\sum\:_{j=1}^{N}\left(U\right({S}_{d}\left(i\right),{S}_{d}\left(j\right)\left)\right)\oplus\:\left(U\left({S}_{e}\left(i\right),{S}_{e}\left(j\right)\right)\right)$$

(10)

$$\:U\left(i,j\right)=\left\{\begin{array}{c}1,i\ge\:j\\\:0,i

(11)

\(\:{S}_{d}\left(x\right)\) refers to the brightness value at the input data point i. \(\:{S}_{e}\left(x\right)\) is the brightness value at point i of the enhanced music data. U represents the unit step function37.

PSNR measures perceived error. This concept indicates that Mean Square Error (MSE) is the noise power caused by the reduction of data quality, as follows:

$$\:PSNR=10\times\:lg\frac{N\times\:{A}^{2}}{MSE}$$

(12)

In (12), N represents the amount of data. A represents the value of the maximum gray value. The calculation of MSE is shown in Eq. (13):

$$\:MSE=\frac{1}{N}\sum\:_{i=1}^{N}({S}_{e}\left(i\right)-{S}_{r}\left(i\right))$$

(13)

In Eq. (13), \(\:{S}_{e}\) represents the enhancement result. \(\:{S}_{r}\) represents reference data38. The luminance distribution error adopted in this study is a no-reference image quality assessment indicator specifically designed for music visualization scenarios. Its core function is to quantify the matching error between the luminance distribution in visualization results and music emotional features. The calculation of LOE is based on the mapping relationship between the time-domain energy of music signals and the luminance matrix of visualization images. First, the short-term energy sequence of the music signal (reflecting changes in rhythm intensity) is extracted and used as the reference benchmark for the ideal luminance distribution (higher energy corresponds to higher luminance in the corresponding area). Then, the root mean square error (RMSE) between the actual and ideal luminance values of each pixel in the visualization image is calculated; meanwhile, it combines with a spatial weight factor (highlighting the luminance error in the central area of the image, as human vision is more sensitive to the central area), to obtain the final LOE value. Its quantification logic directly corresponds to human visual perception characteristics of luminance changes. Studies have shown that humans are sensitive to the perception of the correlation between luminance uniformity and sense of rhythm in dynamic visual stimuli; for example, chaotic distribution of luminance mutations corresponding to strong rhythm segments can significantly reduce visual comfort. By binding luminance errors to the time sequence of music energy, LOE precisely captures this perceptual correlation; this more aligns with the subjective experience of music visualization than indicators that simply calculate global luminance variance. Compared with commonly used no-reference image quality indicators, LOE has significant advantages in music visualization scenarios. Natural Image Quality Evaluator (NIQE) assesses image quality by training statistical feature models of natural images; its core is to measure the deviation of the image to be tested from the statistical laws of natural images. However, music visualization results often contain unnatural dynamic luminance jumps (such as luminance pulses generated with drumbeats). Such features are considered “abnormal” in natural images and are misjudged as low quality by NIQE. In contrast, LOE is specially designed for music-driven luminance changes and can distinguish between “reasonable jumps consistent with rhythm” and “irregular luminance noise”. Moreover, other indicators, such as Perceptual Image Quality Evaluator (PIQE), focus more on local blurriness and blocking artifacts, and are insufficient in depicting the global coordination of luminance distribution. The core of music visualization is to convey emotional rhythm through luminance changes (such as low luminance and gentle changes corresponding to sadness, high luminance and intense fluctuations corresponding to cheerfulness). By quantifying the error of this “rhythm-luminance” mapping, LOE can more accurately reflect the effect of visualization results in conveying music emotions, proving its applicability and superiority in music visualization quality assessment.

Experimental setup

The experimental hardware configuration encompasses an Intel (R) Xeon (R) E5-2620 v4 central processing unit (CPU), accompanied by four NVIDIA GTX 1080 Ti graphics processing units (GPUs), and a system memory of 64 GB. During the course of the experiment, the network’s training phase is executed utilizing the Graphics Processing Unit (GPU), while the subsequent testing phase is conducted on the CPU.

The network undergoes a training regimen spanning 200 iterations. The initial learning rate is established at 0.002. The cumulative training duration amounts to 30 h. To optimize the network, the Adam algorithm is employed, incorporating parameter settings of b1 = 0.9 and b2 = 0.999.

The dataset of this study is derived from NetEase Cloud Music, and its acquisition and processing strictly follow standardized procedures to ensure the reliability and applicability of the data. Specifically, accessing diverse music chart data is obtained by logging into the platform and visiting the “Ranking” section. Developer tools are used to parse the structure of web page source codes, and Python’s Requests module is employed to send network requests for web page data. Then, based on web page node attributes, the BeautifulSoup library accurately extracts detailed song information nested in div tags, including key contents such as music descriptions, titles, and comment counts. Finally, the data is stored in Comma-Separated Value (CSV) format through Python’s built-in functions, laying a structured foundation for subsequent analysis.

In the data preprocessing stage, a series of refined operations are adopted to ensure data quality. Duplicate samples are excluded through deduplication processing; targeted methods are used to handle missing values to maintain data integrity; data type conversion is performed to ensure uniform formatting. Data standardization is achieved by combining technologies such as regular expression matching and string processing for specific data issues, effectively improving data consistency and accuracy. In the data analysis phase, Pandas tools are used to conduct descriptive statistics, correlation analysis, and cluster analysis to deeply explore the inherent laws of the data. In the visualization stage, tools such as Matplotlib, Seaborn, or Plotly are employed to select appropriate chart types (such as bar charts, line charts, scatter plots, etc.) according to data characteristics. Meanwhile, color and style designs are optimized to enhance the intuitiveness and readability of data presentation, providing a clear basis for model training and analysis.

Based on the above data processing and analysis, the proposed hypotheses regarding the application of multimodal music emotion features and DL algorithms in music retrieval and emotion recognition are strongly supported. Multimodal data can integrate different types of information, such as audio and images, capture music emotion features from multiple dimensions, provide a more comprehensive basis for emotion classification, and help improve recognition accuracy. DL algorithms, especially CNN and RNN, with their strong autonomous learning and feature extraction capabilities, can effectively parse complex emotional correlations in music, improving recognition efficiency while optimizing result accuracy. DL-based multimodal music emotion recognition methods are applied to practical scenarios such as intelligent music systems and music recommendation systems; these methods can accurately meet users’ diverse emotional needs, provide more personalized music search and recommendation services, and fully reflect the application value of this method in practice. By systematically verifying these hypotheses, this study is committed to constructing an efficient and accurate music emotion recognition method. Applying these methods to practical systems can bring users a better music experience and promote the in-depth development of music information retrieval technology in academic research and industrial applications39.

This study constructs a systematic framework for music visualization analysis through the integration of IoT technology and DNN. Its core lies in realizing the full-process upgrading of music emotion from perception to presentation by optimizing device collaboration and data links. The deployment of IoT devices focuses on multi-dimensional data collection, covering audio sensors, motion capture equipment, and intelligent interactive terminals. Audio sensors are used to capture the original waveform and spectral features of music, offering basic materials for subsequent feature extraction by CNN; motion capture equipment synchronously records users’ physical feedback during listening, serving as supplementary information for emotional interaction; intelligent interactive terminals undertake the functions of data transfer and preliminary processing to ensure the integrity and standardization of raw data before transmission to the core model. These devices achieve linkage through a unified IoT platform, which integrates data reception, preprocessing, and distribution functions. It can adopt differentiated transmission strategies according to the characteristics of different data types (such as the timeliness of audio and the real-time nature of actions), providing accurate and timely input for model analysis.

The integration mechanism of IoT and DNN is reflected in the collaboration of data processing. The IoT platform initially integrates the collected multi-source data, eliminates redundant information, and completes format conversion, enabling audio features and user interaction data to be efficiently parsed by CNN. When extracting spectral features of music, CNN dynamically adjusts feature weights by combining real-time interaction data from IoT, making the model’s capture of music emotions more in line with users’ subjective perception. This collaboration enhances the pertinence of feature extraction; it also ensures, through the real-time performance of IoT, that the visual output is synchronized with the rhythm and emotional changes of music, avoiding lag or misalignment.

The integration of IoT technology improves the music visualization process in three aspects. Firstly, the collection of multi-source data breaks through the limitation of single audio analysis, incorporating user feedback into the basis of visualization, making the presented results closer to human emotional experience. Secondly, the real-time transmission capability of IoT ensures the timeliness of data processing and model reasoning, allowing the visualization effect to be adjusted in real time with the dynamic changes of music. Thirdly, intelligent interactive terminals support users to participate in the adjustment of visualization parameters through natural ways (such as gestures, voice); moreover, the IoT platform quickly converts these instructions into signals recognizable by the model, realizing personalized customization of the visualization process. Overall, through device linkage, data collaboration, and real-time interaction, IoT technology provides richer input, more efficient processing links, and more flexible presentation methods for DNN-based music visualization analysis. This enables the visualization of music emotions to accurately reflect the characteristics of music itself and dynamically adapt to user needs, constructing a new analysis paradigm integrating technology and experience40.

In the hyperparameter tuning phase of the experimental setup, the determination of learning rate schedules, regularization strategies, and early stopping mechanisms all takes the stability and generalization ability of model performance as core goals. Meanwhile, they are systematically designed in combination with the characteristics of music emotion recognition tasks (such as unbalanced data distribution and subtlety of emotional features). The learning rate schedule adopts a dynamic adjustment strategy. The initial learning rate is determined through grid search within the range of [1e−5, 1e−3]. Finally, 0.001 is selected as the benchmark value. This value shows sensitivity to changes in music spectral features in pre-experiments, avoiding model convergence oscillations caused by excessively high learning rates and preventing prolonged training cycles due to excessively low ones. A stepwise decay mechanism is introduced during training. When the validation set accuracy does not improve for 3 consecutive epochs, the learning rate is automatically multiplied by a decay factor of 0.5 until it drops to 1e-6 and then remains stable. This adjustment is based on the iterative characteristics of music data, ensuring that the model can gradually refine parameter updates when capturing emotional features (such as spectral mutations at melody turning points).

The design of the regularization strategy aims at the overfitting risk of DNN on small-sample music data, adopting a combination of L2 regularization and dropout. The L2 regularization coefficient is selected through five-fold cross-validation in the interval [1e−4, 1e−2], and finally determined as 0.001. This value suppresses excessively large model parameters while retaining the ability to distinguish emotional features; dropout layers are set between FC layers with a dropout rate of 0.3. By randomly deactivating some neurons, it simulates the feature distribution of different sample subsets and enhances the model’s robustness to noise in music data (such as background noise in recordings). The collaboration of the two strategies controls model complexity and does not excessively weaken its ability to capture subtle emotional features (such as loudness changes at weak beats).

The trigger condition of the early stopping mechanism is directly related to the performance of the validation set, taking the validation set loss as the core monitoring indicator while considering changes in accuracy and F1 score. The patience parameter is set to 10; this means that training is terminated and the current optimal model is saved when the validation set loss does not decrease within 10 consecutive epochs (with a change amplitude less than 1e−4). This design is based on the characteristic of feature stability in music emotion recognition. Emotional features gradually converge during continuous training, and over-training is likely to cause the model to fit accidental fluctuations in training data (such as abnormal spectra of specific songs). The 10-epoch buffer period is sufficient to judge whether performance has entered a plateau, balancing training sufficiency and overfitting risk.

Error bars and confidence intervals of key indicators (such as accuracy and F1 score) are calculated through five-fold cross-validation. Error bars are represented by the standard deviation of results from each fold, reflecting the fluctuation range of model performance; 95% confidence intervals are calculated using the t-distribution with the equation: mean ± 1.96 × (standard deviation/n), where n is the number of folds. This processing method enhances the statistical reliability of experimental results. For example, when reporting emotion classification accuracy, the mean value is presented, and the performance differences on different data subsets are intuitively displayed through error bars, making the results more convincing.

Overall, the hyperparameter tuning process always revolves around the characteristics of music data. Dynamic adjustment strategies and statistical validation methods ensure that the model has sufficient fitting ability. Meanwhile, these methods enable the model to maintain good generalization performance when capturing complex emotional features, providing methodological support for the reliability of experimental conclusions.

Analysis of accuracy

A comparative analysis is conducted between the Multi-task Visual Geometry Group Net (MT-VGGNet) and both MT-CafeNet and the model presented in this study. The results of this comparison, pertaining to the accuracy of music information analysis, are elucidated in Fig. 6:

Fig. 6

figure 6

Accuracy comparison results (a) comparison of positive sample accuracy; (b) negative sample error rate comparison.

In Fig. 6a, when the overlap ratio is 0.7, the positive sample accuracy of the proposed model, MT-VGGNet, and MT-CaffeNet is 97.1%, 96.5%, and 96.5%, respectively. When the overlap ratio is 0.8, the positive sample accuracy of the proposed model, MT-VGGNet, and MT-CaffeNet is 96.3%, 95.5%, and 95.3%, respectively. When the overlap ratio is 0.9, the positive sample accuracy of the proposed model, MT-VGGNet, and MT-CaffeNet is 95.4%, 94.3%, and 94.1%, respectively. The accuracy of data analysis decreases as the overlap ratio increases, and the positive sample accuracy of the proposed model is the highest among the three models. In Fig. 6b, when the overlap ratio is 0.7, the false negative rate of the proposed model is as follows: for the proposed model, MT-VGGNet, and MT-CaffeNet, the false negative rates are 0.03%, 0.04%, and 0.04%, respectively. When the overlap ratio is 0.8, the false negative rates of the proposed model, MT-VGGNet, and MT-CaffeNet are 0.06%, 0.12%, and 0.13%, respectively. When the overlap ratio is 0.9, the false negative rates of the proposed model, MT-VGGNet, and MT-CaffeNet are 0.1%, 0.38%, and 0.4%, respectively. In contrast, the model designed in this study has the lowest error rate among the three models. The false negative rate of negative samples increases with the increase in overlap ratio. Due to the enhancement of network nonlinearity and improved localization accuracy by utilizing 1*1-sized filters, the algorithm developed in this study and design process yields the optimal model.

Analysis of data binary classification results

The outcomes of binary classification achieved by the devised model subsequent to fine-tuning and complete training procedures are graphically depicted in Fig. 7.

Fig. 7

figure 7

Results of two classifications.

In Fig. 7, the classification accuracy of both experiments exceeds 98%, while the recall rate remains the same. The fine-tuned music visualization model achieves a classification accuracy of up to 99.5% for music data. The results of the fine-tuned experiment are significantly superior to those of the fully trained experiment. The data indicate that parameter initialization has a significant impact on training outcomes. The contribution of this discovery lies in revealing the importance of parameter initialization on model training outcomes and enhancing the accuracy of music data classification through fine-tuning techniques.

Model processing time analysis

The temporal processing duration of the algorithmic model proposed in this study is juxtaposed against that of the Retinex-net, Multi-Scale Retinex (MSR), and Local Interpretable Model Diagnostic Estimates (LIME) algorithms, across varying data volume magnitudes. The comparative analysis is presented in Fig. 8:

Fig. 8

figure 8

Comparison of processing time.

In Fig. 8, as the input data volume increases, the processing time of the algorithm inevitably increases. When the data volume changes, LIME’s computation time changes slightly, indicating its good computational efficiency. When the input data volume is 256, 480, and 720, the processing time of the proposed CNN-based music visualization analysis model is 2.923 s, 4.732 s, and 9.324 s, respectively. The algorithm proposed in this study yields more complex results, leading to longer processing times compared to other algorithms.

The comparative outcomes between the CNN-based music visualization analysis model introduced within this manuscript and the Retinex-net approach are illustrated in Fig. 9, considering the indicators of floating-point operations (FLOPs) and network parameter quantities.

Fig. 9

figure 9

Comparison of FLOPs and parameters of different models.

In Fig. 9, the CNN-based music visualization analysis model proposed in this study has FLOPs of 6.38G, and the Retinex-net has FLOPs of 8.34G. The network parameters of the two models are 573,600 and 641,100, respectively. The computational complexity of the proposed algorithm increases, thereby reducing the execution speed of the algorithm. The contribution of this study result lies in the design and evaluation of the music visualization analysis model, along with the quantification of network parameters and computational complexity. Despite the impact on algorithm execution speed, this study provides a valuable reference for the development of the music visualization field by offering a novel method for music data analysis. Additionally, the analysis of model parameters and computational complexity guides future research on optimizing algorithms and improving network structures. Subsequent research can further explore how to balance algorithm accuracy and execution efficiency to achieve the development of more practical music visualization analysis techniques.

Model generalization capability test

To assess the model’s capacity for generalization, a subset of 50 music data samples is randomly extracted from the network dataset and utilized for testing purposes in accordance with the approach outlined within this study.

Fig. 10

figure 10

Test results of different algorithms on datasets.

In Fig. 10, the CNN-based music visualization analysis model proposed in this study achieves a PSNR score of 16.893 and an SSIM score of 0.712. Compared to other state-of-the-art studies, the proposed CNN-based music visualization analysis model demonstrates favorable results in SSIM and is close to LIME in PSNR. Therefore, the proposed model exhibits good generalizability.

As a core field of interdisciplinary research, music emotion recognition faces core challenges in the subjectivity of emotional perception and the ambiguity of data annotation. From a technical perspective, the multidimensional nature of music emotions (including pleasure, arousal, and dominance dimensions) presents inherent modeling complexities. Combined with dataset scale limitations (e.g., DEAM’s 40-fragment constraint), these factors restrict current models to 70%-85% accuracy ranges. PSNR and SSIM, as image quality assessment indicators, lack a theoretical basis in audio processing. Their abnormally high values indicate fundamental problems in experimental design, such as data leakage, misapplication of indicators, or overfitting. Therefore, it is necessary to systematically verify the scientific nature of this study by comparing it with mainstream baseline methods. The comparison results of the proposed model are exhibited in Table 1.

Table 1 Comparison results of models.

In Table 1, from the simulated comparison data, the proposed CNN-based music visualization analysis model outperforms unlisted baseline models across all core all core indicators. Its > 10% accuracy advantage fully reflects the proposed model’s technical superiority. The proposed model achieves consistently high positive sample accuracy (97.1%, 96.3%, 95.4%) across all overlap rates (0.7, 0.8, or 0.9). This performance significantly surpasses traditional models, with SVM (82.5%, 80.3%, 78.1%) and basic MLP (80.7%, 77.9%, 75.3%) showing substantially lower results. This gap stems from the “deep and small” convolutional layer design of the proposed model. Through multi-layer continuous convolution and pooling operations, the proposed model can accurately extract multi-dimensional features such as spectrum and rhythm in music data. However, traditional models rely on manual feature engineering and are difficult to capture complex emotional correlation features, resulting in a remarkable lag in accuracy. Regarding false negative rate, the proposed model is only 0.1% when the overlap rate is 0.9, much lower than other models (2.7%-3.8%); this indicates that the model has stronger stability in distinguishing music emotion categories. Traditional models are prone to category confusion when data overlap increases due to limited feature extraction capabilities. In contrast, the proposed model enhances nonlinear expression through 1 × 1 convolution kernels, effectively reducing the risk of misclassification. The gap in accuracy after fine-tuning is more significant (99.5% vs. 83.6%-87.8%), which benefits from the proposed model’s sensitivity to parameter initialization and the feature learning ability of DNN. Simplified CNN or single LSTM models are difficult to break through performance bottlenecks even after fine-tuning due to insufficient network layers and a lack of feature fusion mechanisms. This further highlights the rationality of the proposed model architecture design. In summary, compared with traditional and simplified models not mentioned in the text, the proposed model has an overwhelming advantage in accuracy, stability, and generalization ability. Especially, this model has an accuracy gap of more than 10%, fully proving its technical breakthrough and application value in music emotion analysis.

The study by Zou and Ergan reveals that research in the field of music visualization spans various aspects, including signal processing, image processing, and DL. The authors introduced a neural network-based music visualization method that employs spectrum features and color mapping for a visual representation of audio41. However, the performance of this method is not satisfactory on certain indicators. In contrast to this study, this study presents a CNN-based music visualization analysis model that leverages deeper levels of CNN architecture to achieve more accurate and detailed music visualization results. Experimental results demonstrate that the proposed model performs well in terms of SSIM and PSNR, showcasing high visualization quality and better generalizability.

Therefore, the contribution of this study lies in the proposition of a CNN-based music visualization analysis model, showcasing its advantages in visualization indicators through comparison with other advanced methods and demonstrating its strong generalizability. This study offers novel insights and methodologies for the advancement of the music visualization field, providing a valuable reference for both academic and industrial domains.