Experimental setting

To validate our method, we utilized the LMT-108 dataset46, which contains 108 objects divided into 9 categories: 1) Meshes; 2) Stones; 3) Glossy; 4) Wood Types; 5) Rubbers; 6) Fibers; 7) Foams; 8) Foils and Papers; and 9) Textiles and Fabrics. The LMT-108 dataset provides various forms of data, including acceleration, friction, image, metal detection, IR reflection, and sound; we use the image, sound, and acceleration data in our experiments. Acceleration signals are recorded by a three-axis ADXL335 accelerometer (±3 g, 10 kHz), and sound is captured using a CMP-MIC8 microphone (44.1 kHz). Images have a resolution of 320\(\times\)480. Each of the 108 objects has 10 samples (1080 in total). For evaluation, we randomly split each object's samples in half for training and testing, ensuring no sample overlap between the two sets. Samples from the LMT-108 dataset are shown in Fig. 3.

To verify the robustness and adaptability of DRDL, we also conduct experiments on a second, more challenging dataset: SpectroVision, a multimodal collection of 14,400 paired samples capturing near-infrared (NIR) spectral measurements and high-resolution texture images (1,600 \(\times\) 1,200 pixels) from 144 household objects47. Data was gathered non-invasively using a PR2 mobile manipulator equipped with a SCiO spectrometer (740–1,070 nm range) and a 2MP endoscope camera with 12-LED ring lighting for consistent illumination. The objects span eight material categories: ceramic, fabric, foam, glass, metal, paper, plastic, and wood. Each object underwent 100 randomized interactions at diverse surface points and orientations (vertical: height/roll variations; horizontal: planar position sampling), ensuring real-world generalizability. This dataset enables robust material recognition without physical contact. We select four objects from each material category as unseen test objects, 32 in total.

Feature representation via data dimension reduction

Because the number of samples is small, we reduce the dimensionality of the acceleration signal and extract features from the trimodal data with conventional methods in order to improve the generalization ability of our model. The following four methods48 are used to reduce the dimensionality of the acceleration signal (a code sketch follows the list):

  1. SA-x/SA-y/SA-z: Single Axis (SA) is the simplest method; it simply takes the acceleration signal measured on the corresponding x, y, or z axis.

  2. SoC: The Shadow of Clustering (SoC) method is relatively simple to implement; it simply sums the data from the three axes.

  3. Mag: The Magnitude (Mag) method is slightly more complex than the former two. It takes the square root of the sum of squares of the data measured on the three axes. Although commonly used, it turns negative values into positive ones, so part of the original information is lost.

  4. PCA: Principal component analysis (PCA) is a widely used dimension reduction method. Its main idea is to map n-dimensional features onto k new orthogonal features, known as principal components, which are reconstructed from the original n-dimensional features.
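
For concreteness, the four reductions can be sketched as follows, assuming the recorded signal is an \(N \times 3\) NumPy array (the helper name is ours for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_acceleration(acc, method="SA-z", k=1):
    """Reduce an (N, 3) acceleration signal to a single channel.

    acc    : array of shape (N, 3) with columns x, y, z.
    method : one of "SA-x", "SA-y", "SA-z", "SoC", "Mag", "PCA".
    """
    if method in ("SA-x", "SA-y", "SA-z"):
        axis = {"SA-x": 0, "SA-y": 1, "SA-z": 2}[method]
        return acc[:, axis]                     # keep a single axis
    if method == "SoC":
        return acc.sum(axis=1)                  # add the three axes
    if method == "Mag":
        return np.sqrt((acc ** 2).sum(axis=1))  # root of sum of squares; sign is lost
    if method == "PCA":
        # project the 3-D samples onto the first k principal components
        return PCA(n_components=k).fit_transform(acc).squeeze()
    raise ValueError(f"unknown method: {method}")
```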

Drawing on our previous work and on the comparative experiments for each method in48, we choose SA-z, the most effective of these methods, for dimensionality reduction of the acceleration signal.

Feature extraction

Extracting features from the three modalities is also a key step. Based on the analysis of the experimental results in our previous work46, we use the best-performing feature extraction algorithms: local binary pattern (LBP) features for the images, and Mel-Frequency Cepstral Coefficient (MFCC) features for the sound and acceleration signals.
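
A minimal sketch of this feature extraction, using scikit-image for LBP and librosa for MFCC, is given below; the parameter values (P, R, n_mfcc) and the time-averaging of the MFCC matrix are illustrative assumptions rather than the exact settings used in our experiments.

```python
import numpy as np
import librosa
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_image, P=8, R=1):
    """Uniform LBP codes pooled into a normalized histogram."""
    codes = local_binary_pattern(gray_image, P, R, method="uniform")
    hist, _ = np.histogram(codes, bins=int(codes.max()) + 1, density=True)
    return hist

def mfcc_features(signal, sr, n_mfcc=13):
    """MFCC matrix averaged over time into a fixed-length vector."""
    mfcc = librosa.feature.mfcc(y=signal.astype(float), sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# e.g. sound at 44.1 kHz and the SA-z acceleration channel at 10 kHz:
# sound_vec = mfcc_features(sound, sr=44100)
# accel_vec = mfcc_features(sa_z, sr=10000)
```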

Performance comparison

To compare the performance of our proposed method with traditional and commonly used methods, ten methods are evaluated in our experiments.

  1. K-SVD: The dictionary is updated by K-SVD, and orthogonal matching pursuit (OMP) is used to solve the sparse coding problem (see the sketch after this list).

  2. Support vector machine (SVM): A class of generalized linear classifiers trained by supervised learning; its decision boundary is the maximum-margin hyperplane of the training samples.

  3. MLP: A classic feedforward neural network with at least three layers of nodes that uses backpropagation for training.

  4. Convolutional Neural Network based on Vision (CNN-V): CNNs have become a common choice for texture recognition in recent years owing to their powerful ability to extract texture features. Considering the problem of multimodal fusion, we take the visual modality as the input of the CNN.

  5. Greedy Deep Dictionary Learning (GDDL)49: A deep dictionary learning method using greedy iteration without any fusion. Experiments were carried out with the visual, sound, and acceleration signals as inputs.

  6. Twin-incoherent Self-expressive Latent Dictionary Pair Learning (SLatDPL)50: A dictionary learning model that integrates feature extraction and coding-coefficient representation and introduces twin incoherent locality constraints with self-expressiveness and adaptability.

  7. Robust Adaptive Projective Dictionary Pair Learning (RA-DPL)51: A dictionary pair learning model that retains the local neighborhood relationship of intra-class sparse codes, so that the learned codes are discriminative.

  8. Relaxed Block-diagonal Dictionary Pair Learning with a Locality Constraint (RBD-DPL)52: A relaxed block-diagonal structure is introduced to improve the discriminability of the dictionary.

  9. One-Shot Learning Method for Texture Recognition (OSL)48: Multimodal fusion and dictionary learning are combined to address texture recognition. The model uses only one sample for training, overcoming the dependence on large numbers of training samples.

  10. DRDL: We set up a 3-layer deep dictionary learning model and obtain the corresponding results on the test set.
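
As a reference for the first baseline, the following is a minimal sketch of single-layer dictionary learning with OMP sparse coding; scikit-learn's online dictionary learner stands in for the exact K-SVD column updates, and all sizes are illustrative.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import orthogonal_mp

def ksvd_style_baseline(X_train, X_test, n_atoms=256, n_nonzero=30):
    """X_train, X_test: (n_samples, n_features) fused feature vectors."""
    # dictionary-update step (online solver in place of exact K-SVD updates)
    dico = MiniBatchDictionaryLearning(n_components=n_atoms,
                                       transform_algorithm="omp").fit(X_train)
    D = dico.components_            # (n_atoms, n_features)
    # sparse coding with OMP: X_test ≈ codes @ D
    codes = orthogonal_mp(D.T, X_test.T, n_nonzero_coefs=n_nonzero).T
    return codes                    # codes can then feed a simple classifier
```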

The classification accuracy of the ten methods is shown in Fig. 4. The accuracy of DRDL is higher than that of every other method, which demonstrates the effectiveness of the proposed DRDL. To provide a more explicit comparison, our analysis is as follows:

Comparison with Single-Layer Methods: Methods like K-SVD are based on traditional single-layer dictionary learning. While effective at capturing shallow features, they lack the ability to learn deeper, more abstract representations, which limits their accuracy to around 89%. Our DRDL model, by incorporating a deep architecture, significantly outperforms these methods.

Comparison with Standard Deep Learning: The CNN-V model, a standard deep learning approach, achieves around 90% accuracy using only visual data. This demonstrates the power of deep feature extraction. However, our DRDL model surpasses it by not only using deep learning principles but also by fusing multimodal data, providing a richer source of information.

Comparison with Other Deep Dictionary Learning Methods: The GDDL model, another deep dictionary learning approach, shows poor performance when used with single modalities. This highlights a key problem that our model solves: traditional deep dictionary learning can lose important features. Our dictionary reconstruction method explicitly addresses this by fusing features from all levels, which is critical for high performance.

Comparison with a Strong Fusion-Based Model: The OSL method, which also uses multimodal fusion, is a very strong competitor. However, it is still based on single-layer dictionary learning. Our DRDL model gains its final performance edge by combining both early multimodal fusion and a deep architecture with multi-level feature fusion. This unique combination of strategies allows our model to achieve the highest accuracy of 97.7%, demonstrating that both rich input data and a comprehensive feature hierarchy are essential for state-of-the-art performance.

Statistical performance validation

To rigorously validate our performance comparisons, we conducted statistical significance tests. We performed a 10-fold cross-validation to obtain 10 accuracy scores for DRDL and each of the top 4 comparison methods. We then used the Wilcoxon signed-rank test to compare our DRDL model against each competing method individually, with a significance level (\(\alpha\)) of 0.05.
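
A minimal sketch of this procedure with SciPy is given below; the per-fold accuracy values are placeholders for illustration, not our measured scores.

```python
from scipy.stats import wilcoxon

# 10 per-fold accuracies from 10-fold cross-validation (values illustrative)
drdl_scores = [0.977, 0.972, 0.981, 0.975, 0.979, 0.974, 0.978, 0.976, 0.980, 0.973]
osl_scores  = [0.961, 0.958, 0.966, 0.960, 0.963, 0.957, 0.962, 0.959, 0.964, 0.956]

stat, p_value = wilcoxon(drdl_scores, osl_scores)
print(f"Wilcoxon statistic={stat:.3f}, p={p_value:.4f}")
if p_value < 0.05:   # significance level alpha = 0.05
    print("difference is statistically significant")
```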

The results of the statistical analysis are presented in Table 1. The p-values for the comparison between DRDL and all other methods were below 0.05. This indicates that the performance improvement of our proposed model is statistically significant and not due to random chance. This provides strong evidence for the superior feature fusion and dictionary reconstruction capabilities of the DRDL framework.

Additional verification

To further validate the robustness and generalizability of our proposed DRDL model, we conducted experiments on the more challenging SpectroVision dataset. We extract LBP features from the texture images and apply only PCA dimensionality reduction to the spectral data. As shown in Fig. 5, our DRDL method once again achieved state-of-the-art performance, with an accuracy of 89.4%, surpassing all other compared methods, including deep learning approaches such as CNN-V and dictionary learning methods such as OSL and K-SVD. Notably, the gap between DRDL and the next-best method was 3.3%, more pronounced than on the LMT-108 dataset. This demonstrates our model's superior ability to learn discriminative features from complex and varied texture data.

Influence of layer count on performance

The number of layers directly affects the performance of the whole model. When the number of layers is set to 1, the model reduces to an ordinary single-layer dictionary learning model. We set the number of layers to 1, 2, 3, and 4 and carry out experiments. We also set up an additional experiment in which only the features learned at the deepest layer of the 3-layer model are used, without dictionary reconstruction. The results are shown in Table 2.

Table 2 shows that the 3-layer DRDL model performs best in this experiment. Performance improves as model depth increases, but the 4-layer model is inferior to the 3-layer model because the additional depth introduces redundancy into the fused features.

We find that appropriately increasing the number of layers improves classification accuracy, but the improvement diminishes as further layers are added, so selecting an appropriate depth is particularly important. The experimental results also show that considering deep features alone fails to achieve the best results; our proposed DRDL is superior because the features learned at all layers are considered in the reconstruction stage. Depth also affects training speed: as the number of layers increases, so does the training time. Weighing recognition performance against computational cost, we set the number of dictionary layers to 3 in the following experiments.

Material comparison

The LMT-108 dataset contains objects of nine different materials. We therefore conducted further experiments to test how the method of this paper performs on surface texture classification for the different materials.

In53, the authors proposed an ensemble learning method (ELM) with optimized features for multimodal surface material recognition, against which we compare our method. To be consistent with the sample-number setting in53, the number of samples in both the training set and the test set is set to 1080.

The experimental results of the material identification task are shown in Fig. 6; the recognition accuracy is 0.978. Table 3 compares the results of the two methods. For every material except meshes, the material identification accuracy of our proposed DRDL is higher than that of ELM.

Influence of parameters

We analyzed the effects of the parameters on the results, as shown in Fig. 7. We performed a joint analysis of the parameters \(\lambda\) and \(\mu\), each varied from 0.0001 to 100. In Fig. 7, the \(Z\)-axis represents accuracy, while the \(X\) and \(Y\) axes are \(\log (\lambda )\) and \(\log (\mu )\). In all other experiments, \(\lambda\) and \(\mu\) are set to 0.15 and 1.8, respectively, and the dictionary sizes of the three-layer model are set to 196, 196, and 256, respectively.
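
The sweep can be sketched as a joint log-spaced grid search; train_and_evaluate is a hypothetical stand-in for training and testing the model at one parameter setting.

```python
import numpy as np
from itertools import product

# log-spaced joint sweep of lambda and mu over [1e-4, 1e2]
lambdas = np.logspace(-4, 2, 7)   # 0.0001, 0.001, ..., 100
mus     = np.logspace(-4, 2, 7)

results = {}
for lam, mu in product(lambdas, mus):
    acc = train_and_evaluate(lam=lam, mu=mu)  # hypothetical training routine
    results[(lam, mu)] = acc

best = max(results, key=results.get)
print("best (lambda, mu):", best, "accuracy:", results[best])
```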

Ablation study

We performed three ablation sub-experiments: Experiment A, Experiment B, and Experiment C. The details are as follows: 1) Experiment A: we separate the multimodal fusion method from DRDL and individually take the three data modalities, i.e., images, sound, and acceleration, as the input of the model; these runs are marked Experiment A-V, Experiment A-S, and Experiment A-A, respectively. 2) Experiment B: we remove both the dictionary reconstruction and fine-tuning stages from DRDL. 3) Experiment C: we remove only the fine-tuning stage from DRDL. The final results are shown in Table 4.

Computational efficiency analysis

DRDL takes 10.57 s to classify the test set containing 540 samples. The execution time of all compared methods for classifying the test samples is shown in Table 5.

Compared with CNN-V, MLP, and SVM, our DRDL, which involves dictionary learning, requires more time to classify the test samples because it uses CVXPY to solve a convex optimization problem for dictionary learning. Compared with the single-layer dictionary learning methods, i.e., OSL and K-SVD, a deep dictionary learning method must consider deep-level characteristics, which lengthens its execution time. Compared with GDDL, a greedy iteration method without any fusion, our DRDL takes longer because it considers the characteristics of multiple layers, which yields a large increase in accuracy, as shown in Fig. 4.

Future work will focus on the following strategies to improve the efficiency of DRDL and make it more suitable for real-time applications.

Faster Solver with ADMM: Instead of the general-purpose CVXPY solver, we will use the Alternating Direction Method of Multipliers (ADMM). ADMM splits the lasso problem into smaller subproblems that can be solved (often analytically) in parallel, cutting runtime significantly.
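
A minimal NumPy sketch of scaled-form ADMM for the lasso subproblem \(\min _x \frac{1}{2}\Vert Dx-y\Vert _2^2+\lambda \Vert x\Vert _1\) is shown below; the penalty \(\rho\) and the fixed iteration count are illustrative choices.

```python
import numpy as np

def admm_lasso(D, y, lam, rho=1.0, n_iter=100):
    """Solve min_x 0.5*||D x - y||_2^2 + lam*||x||_1 with ADMM."""
    n = D.shape[1]
    # factor (D^T D + rho*I) once; it is reused in every x-update
    L = np.linalg.cholesky(D.T @ D + rho * np.eye(n))
    Dty = D.T @ y
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    for _ in range(n_iter):
        # x-update: ridge-type linear solve using the cached factor
        x = np.linalg.solve(L.T, np.linalg.solve(L, Dty + rho * (z - u)))
        # z-update: soft-thresholding, the proximal operator of the l1 norm
        v = x + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # dual ascent on the scaled multiplier
        u = u + x - z
    return z
```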

Dictionary Pruning: Once the dictionary D is trained, many atoms may be redundant. We will rank atoms by how often and how much they reduce the reconstruction error, then drop the weakest ones. A smaller dictionary speeds up sparse coding by reducing the number of variables.
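
A minimal sketch of the pruning step is given below; ranking atoms by total coefficient magnitude is used here as a simple proxy for the reconstruction-error criterion described above.

```python
import numpy as np

def prune_dictionary(D, codes, keep_ratio=0.75):
    """Keep only the most-used atoms of a learned dictionary.

    D     : (n_atoms, n_features) dictionary.
    codes : (n_samples, n_atoms) sparse codes over the training set.
    """
    # proxy usage score: total absolute coefficient mass per atom
    # (a stand-in for ranking atoms by reconstruction-error reduction)
    usage = np.abs(codes).sum(axis=0)
    n_keep = max(1, int(keep_ratio * D.shape[0]))
    keep = np.argsort(usage)[::-1][:n_keep]   # indices of the strongest atoms
    return D[keep], keep
```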

Quantization for Deployment: To save memory and boost speed on limited hardware, we will convert the 32-bit floats in the dictionary and features to lower precision (e.g., 16-bit floats or 8-bit integers). This slashes storage needs and taps into faster, hardware-accelerated matrix operations.
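
A minimal sketch of symmetric int8 quantization (with a half-precision alternative in the comments) under these assumptions:

```python
import numpy as np

def quantize_int8(M):
    """Symmetric linear quantization of a float32 array to int8."""
    scale = np.abs(M).max() / 127.0     # largest magnitude maps to 127
    q = np.round(M / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# half-precision alternative: D16 = D.astype(np.float16)
# int8: Dq, s = quantize_int8(D)   # 4x smaller than float32
```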

In our experiments, all methods are implemented in Python 3.8 on a computer platform with a 2.6-GHz CPU and 16 GB of RAM.

Limitations and generalizability

Although our proposed DRDL model demonstrates state-of-the-art performance, it is important to acknowledge its limitations and consider the context of its evaluation.

(1) Model limitations

The primary limitation of the DRDL model is its computational complexity. The multi-stage process, which includes layer-by-layer pre-training, dictionary reconstruction, and fine-tuning, is inherently more complicated than end-to-end deep learning models or single-layer dictionary methods. The use of CVXPY to solve the convex optimization problem for sparse coding is a significant bottleneck. While this trade-off yields higher accuracy, the model’s current implementation may not be suitable for real-time applications where low latency is critical.
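
For concreteness, the kind of per-sample sparse-coding problem handed to CVXPY looks roughly like the sketch below (dimensions are illustrative; \(\lambda =0.15\) follows the setting reported above); each such solve invokes a general-purpose convex solver, which is the source of the latency.

```python
import numpy as np
import cvxpy as cp

# one sparse-coding solve of the kind issued per sample and per layer
D = np.random.randn(128, 256)   # dictionary (illustrative size)
y = np.random.randn(128)        # input feature vector
lam = 0.15

x = cp.Variable(256)
objective = cp.Minimize(0.5 * cp.sum_squares(D @ x - y) + lam * cp.norm1(x))
cp.Problem(objective).solve()
print("nonzero coefficients:", int(np.sum(np.abs(x.value) > 1e-6)))
```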

(2) Potential biases in the data

While our evaluation used two established texture recognition benchmarks, their controlled laboratory acquisition may limit the model’s robustness in real-world settings with variable lighting, background clutter, or dynamic contact conditions.