The proposed SA-HCNN deep learning approach efficiently and accurately classifies pathologies in gigapixel digital histology slides. The approach can process pathologies of any size and learns to filter out irrelevant pixel data. High F-scores were achieved on all classes across all tissue types, even when the positive class was severely underrepresented. By utilizing a highly efficient training generator, the neural network queries only the necessary number of samples from the majority class, minimizing training time and improving performance on datasets with extreme class imbalance.
The approach also includes a super-pixel annotation generator which takes trained models and a tissue slide as input and constructs super-pixel predictions that can be viewed in QuPath in a matter of minutes. Using a statistical approach inspired by the researchers at Skinner labs, the deep learning method also produces WSI predictions at the level of a human pathologist, apart from a negligible false-positive rate (visible in the negative-class predictions in the supplementary material). Given the human-level segmentation ability of the deep learning model and the speed of the inference procedure and super-pixel generator, the proposed approach can be very helpful for biomedical researchers using histopathological slides to make a diagnosis. Not only would this approach save significant amounts of time and money, it would also deliver a consistent level of precision, whereas the precision of human annotators tends to diminish over time because annotating gigapixel tissue images is painstaking, lengthy work. Moreover, the SA-HCNN model can complete the same task orders of magnitude faster than humans and at higher accuracy.
In conclusion, the use of deep learning in histopathology diagnostics holds significant promise for the advancement of medicine and biotechnology. With the speed and accuracy that deep learning provides, researchers will be able to better understand the environmental effects on biological systems, including humans, and develop technologies and medicines to combat the negative effects of these interactions. The findings from this research will contribute to the fight against cancer and other gene-related diseases. The potential impact of this technology is immense and presents an exciting opportunity for researchers to push the boundaries of medical research and improve human health.
Methods
Tiling procedure
Tiling the gigapixel tissue slides was done using a sliding window of size 256 × 256 pixels. The motivation for this tile size stems from the standard input size of pretrained models (traditionally 224 × 224 in older models) and from previous deep histology studies in which the authors use a tile size of either 128 × 128 or 256 × 256. This tile size is the nearest power of 2 that balances computational overhead against the amount of context captured per tile. The increment size, i.e., the amount by which the tile window is moved, differs between training and testing. For training, a sliding window increment of 1/4 of the window size (64 px) was used to build a large dataset. For inference, a sliding window increment of 1/2 of the window size (128 px) is used for a 4x speedup in processing time. A smaller increment can be afforded when building the training dataset because runtime is less of a concern during training, whereas a 1/2 overlap is all that is needed to make accurate super-pixel predictions during inference, which also reduces the computation required.
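For illustration, the sketch below shows how tile coordinates could be generated for both settings. This is a minimal example assuming the WSI is available as an in-memory array; the function name tile_coords is chosen here for exposition only:

```python
import numpy as np

def tile_coords(height, width, tile=256, stride=64):
    """Top-left (y, x) coordinates of a sliding-window tiling.
    stride=64 (1/4 of the tile) matches the training setting;
    stride=128 (1/2 of the tile) matches the inference setting."""
    ys = np.arange(0, height - tile + 1, stride)
    xs = np.arange(0, width - tile + 1, stride)
    return [(y, x) for y in ys for x in xs]

# A 2048 x 2048 region yields 29 x 29 = 841 training tiles at stride 64,
# but only 15 x 15 = 225 inference tiles at stride 128 (the ~4x speedup).
train_coords = tile_coords(2048, 2048, stride=64)
test_coords = tile_coords(2048, 2048, stride=128)
```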
For building the training dataset, a piecewise condition is used to determine which class a tile belongs to:
$$L_{i,j}=\begin{cases}1 & \text{if } \mathrm{Intersection}\left({p}_{j},\:{T}_{i}\right)>\varphi\cdot \mathrm{Area}\left({T}_{i}\right)\\ 1 & \text{if } \mathrm{Intersection}\left({p}_{j},\:{T}_{i}\right)=\mathrm{Area}\left({p}_{j}\right)\\ 0 & \text{otherwise}\end{cases}$$
In the piecewise condition shown above, \(\:{L}_{i,j}\) is the class label associated with the tile \(\:T\) at index \(\:i\) and pathology instance \(\:p\) at index \(\:j\), where \(\:p\) represents any pathology instance that overlaps \(\:T\). \(\:i\in\:[0,\:I]\), where \(\:I\) is the total number of tiles in a tissue slide (potentially tens of thousands of tiles in one image), and \(\:j\in\:[0,J]\), where \(\:J\) is the total number of pathologies (classes) that a classifier is trained to predict. \(\:Intersection\) is a function that returns the area in pixels of the intersection between pathology instance \(\:{p}_{j}\) and tile \(\:{T}_{i}\), and \(\:Area\) is a function that simply returns the area in pixels of a polygon. Note that there can be multiple instances of any given pathology in one image, which is why each tile is given its own label. Finally, \(\:\varphi\:\) is a scalar that controls how much overlap is needed for a tile to be considered a positive example for any given pathology, where \(\:\varphi\:\in\:[0,\:1.0)\). The second constraint in the piecewise function ensures that a positive label is assigned if the entire instance of a pathology is present inside a tile. This constraint is necessary for small pathologies that would result in \(\:Intersection\left({p}_{j},\:{T}_{i}\right)\ll\:\varphi\:*Area\left({T}_{i}\right)\).
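A minimal sketch of this labeling rule is given below; it assumes the intersection and polygon areas (in pixels) have already been computed, e.g., from the annotation polygons:

```python
def tile_label(intersection_area, tile_area, pathology_area, phi=0.3):
    """Piecewise labeling condition for tile T_i and pathology instance p_j.

    intersection_area: Intersection(p_j, T_i) in pixels
    tile_area:         Area(T_i), e.g. 256 * 256
    pathology_area:    Area(p_j)
    phi:               overlap threshold, phi in [0, 1)
    """
    if intersection_area > phi * tile_area:
        return 1  # enough of the tile is covered by the pathology
    if intersection_area == pathology_area:
        return 1  # the whole (small) instance lies inside the tile
    return 0
```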
To find an optimal value for the overlap parameter \(\:\varphi\:\), several values ranging from 0.1 to 0.6 were tried. Figure 4 shows the F-score for each overlap value for each tissue type. The results indicated that an overlap of \(\:\varphi\:=0.3\) yields good overall performance across all tissue types. An overlap of 30% is large enough that the model sees a reasonable amount of a pathology to make an accurate classification, yet small enough to yield a larger training dataset. An overlap value of \(\:\varphi\:=0.3\) is used in all results presented in this paper, including results on other datasets.
A smaller overlap constraint results in a larger training dataset, since the looser requirement for positive class examples admits more examples from each class. Conversely, with a large overlap parameter, say \(\:\varphi\:=0.9\), a tile is only taken into consideration as a positive example if a pathology overlaps close to the entire tile. In more traditional sliding window semantic segmentation algorithms, the class label for any given tile is decided by the center pixel of the tile, i.e., the class label for a tile equals the class that the center pixel overlaps. This method works well but is costly, given that it generally only works with a small sliding window increment, i.e., 1/8 or 1/16 of the tile size. The reason lies in small class instances: assuming a tile size of 256 × 256 pixels and a pathology that is ~ 10 pixels in diameter on average, the model is unlikely to see many instances of this pathology under the center-pixel condition if a sliding window increment of 1/1 or 1/2 were used for inference.
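For contrast, the center-pixel rule can be stated in a couple of lines; class_map here is a hypothetical rasterized annotation mask giving the class id of every pixel:

```python
def center_pixel_label(tile_y, tile_x, class_map, tile=256):
    # Baseline "Center tiling" rule: the tile inherits the class of
    # its center pixel from the rasterized annotation map.
    return class_map[tile_y + tile // 2, tile_x + tile // 2]
```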
Deep learning model with spatial awareness
The main issue with classifying an entire tile, as opposed to the center-pixel method for tile labeling, is that spatial information can be lost, since class-relevant information may lie close to the edge of the tile or outside the tile boundary. For instance, when a model is trained to classify what class the center pixel of a tile belongs to, the model can see 128 pixels in all directions to inform its prediction. With the overlap method for tile labeling, some of that important information could be just outside of the tile being classified, especially if the overlap \(\:\varphi\:\) is small. This is where introducing additional spatial information becomes useful. Inspired by the glimpse-net34, spatial awareness is implemented in the deep learning model through pyramid tiling: the process of zooming out from the original tile by powers of 2 and inputting all images into the model to be processed in parallel. All zoomed-out images are resized to the model's input size, in this case 256 × 256. Pyramid tiling is illustrated in Fig. 5. Pyramid tiling combined with the tile overlap approach constitutes the Pyramid Tiling with Overlap (PTO) approach used in SA-HCNN.
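The following is a minimal sketch of pyramid tiling under stated assumptions: the WSI is an in-memory array, OpenCV is used for resizing (any image library would do), and the number of zoom levels is an illustrative choice, since only the powers-of-2 zoom is specified above:

```python
import cv2
import numpy as np

def pyramid_tiles(wsi, y, x, tile=256, levels=3):
    """Original tile plus zoomed-out (2x, 4x, ...) crops centered on it,
    each resized back to tile x tile so they share the model input size.
    Crops are clamped at the image border for simplicity."""
    cy, cx = y + tile // 2, x + tile // 2
    views = []
    for k in range(levels):
        half = (tile << k) // 2  # half-width of the 2^k-times-larger view
        y0, y1 = max(cy - half, 0), min(cy + half, wsi.shape[0])
        x0, x1 = max(cx - half, 0), min(cx + half, wsi.shape[1])
        views.append(cv2.resize(wsi[y0:y1, x0:x1], (tile, tile)))
    return np.stack(views)  # shape: (levels, tile, tile, channels)
```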
Table 2 shows the macro F-score for each tissue type for SA-HCNN with PTO tiling, SA-HCNN with Center tiling, and the baseline network EfficientNetV2 with Center tiling. The results for SegFormer with PTO tiling and Center tiling are also included for comparison. The results show that PTO is superior to the Center tiling approach overall and that the HCNN model is superior to the baseline EfficientNetV2 model. Furthermore, the SegFormer method performed worse than all HCNN variants. Thus, SA-HCNN with PTO tiling is the recommended approach for histology in gigapixel images.
The model used for tile classification (see Fig. 6) was built using the latest standard practices for high precision image classification. Google's EfficientNetV2 models15,16 were implemented as the convolutional backbone of the neural network. The EfficientNetV2 architecture provides the best performance while also minimizing the number of trainable parameters. The exact model used is EfficientNetV2B2, since its input size of 260 × 260 is the closest to the tile size of 256. To leverage the effectiveness of transfer learning, we use the pretrained ImageNet35 weights instead of training from scratch, as that has been shown to significantly improve training time and classification performance. There are seven total convolutional blocks in EfficientNetV2B2; the first three blocks are left frozen while the last four blocks are re-trained so that the convolutional filters learn to be more representative of the tissue data being passed through the model. For the output of the model, \(\:N\) classifier blocks are used for prediction. Each classifier block takes the output of the CNN backbone, applies attention, followed by global average pooling and finally a fully connected layer used for prediction.
Sigmoid activations are used for classification rather than softmax since multiple classes can be present in one tile, making this a multilabel problem, not a multiclass problem. Additionally, we have found empirically that using sigmoid activations is the best way to handle training a model on heavily imbalanced data. Since the class predictions are independent of each other, training in this fashion effectively creates a set of one-vs-rest binary classifiers, which, as binary classifiers, are prone to overfitting under data imbalance; the balanced batch sampling described in the training procedure mitigates this. Furthermore, we use binary cross entropy to compute gradients for the model rather than categorical cross entropy. Binary cross entropy allows each classifier to learn independently, meaning that minority class learners are not oversaturated with gradients from the classes with high support. The traditional method of using categorical cross entropy is problematic when training models on highly imbalanced data: models trained using softmax activations paired with categorical cross entropy tended to be unstable during training, which could lead to model collapse, where the model predicts only a single class, in most cases the majority class.
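A minimal sketch of this architecture in Keras is shown below. The exact attention mechanism and the wiring of the pyramid views are not specified in detail above, so the head uses a simple 1 × 1-convolution spatial gate as a stand-in and takes a single tile input; wiring the per-head bagged training data (described in the training procedure below) is also omitted:

```python
import tensorflow as tf

def build_sa_hcnn(num_classes, ensemble_n=7, input_size=260):
    backbone = tf.keras.applications.EfficientNetV2B2(
        include_top=False, weights="imagenet",
        input_shape=(input_size, input_size, 3))
    # Freeze the stem and the first three convolutional blocks;
    # blocks 4-7 stay trainable and adapt to the tissue data.
    for layer in backbone.layers:
        if layer.name.startswith(("stem", "block1", "block2", "block3")):
            layer.trainable = False

    features = backbone.output
    heads = []
    for i in range(ensemble_n):
        # Assumed attention form: a sigmoid spatial gate over the features.
        attn = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(features)
        gated = tf.keras.layers.Multiply()([features, attn])
        pooled = tf.keras.layers.GlobalAveragePooling2D()(gated)
        heads.append(tf.keras.layers.Dense(
            num_classes, activation="sigmoid", name=f"head_{i}")(pooled))

    model = tf.keras.Model(backbone.input, heads)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy")  # one-vs-rest, multilabel
    return model
```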
Training procedure
The ensemble method used for training is bootstrap aggregation, also known as bagging. The purpose of using bagging for training is to use the entire training dataset to train the model while simultaneously having large validation sets to accurately monitor model performance and support dynamic learning rates during training. Additionally, bagging is a standard machine learning technique which has been proven to improve accuracy with more stable predictions36. During training, \(\:N\) datasets are created by randomly sampling from the whole dataset with replacement to create the training data for each set. The number of samples taken for each set is equal to the number of images in the whole dataset. By randomly sampling with replacement, the number of unique images in each bagged dataset is approximately \(\:1-\frac{1}{e}\approx\:63\%\) of the original dataset, meaning that each classifier block has \(\:\sim37\%\) of the whole dataset available for validation, which is better than the standard 10% traditionally used. By combining what each classifier block learns independently, the fraction of the data seen by the entire classifier is \(\:1-\frac{1}{{e}^{N}}\), where \(\:N\) is the ensemble size. If \(\:N=7\), for example, then \(\:99.91\%\) of the whole dataset is used for training across all classifier blocks while still maintaining \(\:\sim37\%\) of the data for validation on each block. This method turns out to be highly beneficial for training stable models that avoid overfitting, and it makes dynamic training parameters more accurate due to the large validation size. For this reason, learning rate decay is used during training, starting the model at a learning rate of \(\:{10}^{-4}\) and decaying by a factor of \(\:0.1\) when the validation loss converges.
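The bagging scheme can be sketched as follows; since only tile indices are bagged (see the live tiling discussion below), each bag is just an index array, and the out-of-bag indices form that head's validation set:

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen for reproducibility

def bagged_index_sets(n_samples, ensemble_n=7):
    """N bootstrap bags over tile indices. Each bag draws n_samples
    indices with replacement (~63% unique); the never-drawn ~37%
    are held out as that classifier block's validation set."""
    bags = []
    for _ in range(ensemble_n):
        train_idx = rng.integers(0, n_samples, size=n_samples)
        val_idx = np.setdiff1d(np.arange(n_samples), train_idx)
        bags.append((train_idx, val_idx))
    return bags

train_idx, val_idx = bagged_index_sets(100_000)[0]
print(len(val_idx) / 100_000)  # ~0.368, i.e. ~1/e out-of-bag
```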
During each epoch of training, a randomly sampled batch of images is drawn from each dataset 100 times before computing validation statistics and moving on to the next epoch. Batches are generated by randomly sampling tiles such that the number of samples per class is equal. For example, if there are 4 classes and the batch size is 128, 32 random tiles are sampled from each class, allowing the model to train on balanced data batches, which leads to more stable training. This method is only applied during training; batches are sampled completely at random during validation. Since we do not move through the training data linearly, i.e., random batches are trained on at each step, the number of steps in each epoch is effectively arbitrary. Its only effect on training is how often the validation statistics, which are used to adjust the dynamic training parameters, are computed. The Adam optimizer is used, with the loss computed via binary cross entropy since the model is multi-label.
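A sketch of the balanced batch sampler is shown below, assuming a mapping from class id to the tile indices belonging to that class:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def balanced_batch(class_to_indices, batch_size=128):
    """Draw an equal number of tile indices per class, e.g. 4 classes
    with batch_size=128 gives 32 tiles per class."""
    per_class = batch_size // len(class_to_indices)
    picks = [rng.choice(indices, size=per_class, replace=True)
             for indices in class_to_indices.values()]
    return np.concatenate(picks)
```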
Part of the novelty of the SA-HCNN method lies in the algorithmic implementation of the training procedure. The training data generator is implemented such that memory usage is minimized without any increase in processing time. For instance, all tile batch samples are collected in real time via array slicing (live tiling), which adds negligible processing time to the training procedure. This means that the tiles for training can be created via the tiling procedure using any sliding window increment less than the tile size. For example, if the sliding window increment is 1/4 of the tile size, then the total number of tiles would take ~ 16 times as much memory as one WSI, since the number of tiles generated from an image grows quadratically as the increment shrinks. Using live pyramid tiling, an arbitrarily large virtual training dataset can be used while only needing as much memory as required to load the WSIs into memory. This makes it possible to train SA-HCNN on a set of gigapixel images using any high-end desktop rather than a computing cluster. Additionally, this can be used to quadratically increase the number of unique tiles in each class, reducing the probability of overfitting to the minority classes. Lastly, since the only information about the tiles stored in memory during training is their coordinate locations, the memory efficiency of live tiling extends to the bagging method by creating \(\:N\) bagged datasets of tile indices, not tiles. The training algorithm therefore uses constant memory with respect to both the sliding window increment of the tiling procedure and the ensemble size, without any significant increase in training time.
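Live tiling itself reduces to an array slice, which is why only coordinates need to be stored; a minimal sketch:

```python
def live_tile(wsi, y, x, tile=256):
    # Pixel data is sliced from the in-memory WSI on demand, so memory
    # stays constant regardless of the sliding-window increment.
    return wsi[y:y + tile, x:x + tile]
```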
Inference procedure
During inference on a new image, a sliding window increment of 1/2 (128-pixel stride) is used. A set of blank canvases of all zeros with the same shape as the input image is created on which to overlay the tile predictions. The number of blank canvases in the set is equal to the number of classes that the model predicts. The prediction value for each tile/class pairing is added to all the corresponding pixels on the respective canvas for all tiles in an image. Recall that the value of a prediction is a probability, i.e., a scalar between 0 and 1. The pixel values of the canvases are then normalized, since the maximum pixel value is equal to the inverse of the sliding window increment squared. For example, with an increment of 1/2, the maximum value of any given pixel is 4 given the quadratic increase in tile overlap. Finally, all pixel values are rounded to the nearest whole number, resulting in pixel groupings with values of 1 and zeros everywhere else. This is how pixel masks are created for prediction. The predicted pixel masks are jagged since they are created from patched tile predictions, but the images are so large that the mask predictions are sufficient for the segmentation task. Figure 7 shows two examples of pixel mask predictions overlaid on the input image. The images shown are snippets from an annotated testis image. The circular annotations are the perimeter of the ground truth, and the polygonal annotations are the perimeter of the model's predicted super-pixel masks. The green and light pink annotated regions show the manual and predicted annotations for two different pathologies. There is no prediction for the yellowish annotation since that annotation is part of the negative class. The long dark blue line denotes the predicted boundary of the tissue. Overall, the model predictions show good alignment with the manual annotations, and the pixel mask images can be used to visually confirm predictions and direct the human pathologist to the locations of different pathologies. Additional pixel mask predictions are shown in Supplementary Fig S5 (Testis), Fig S6 (Prostate), Fig S7 (Female Kidney), Fig S8 (Male Kidney), and Fig S9 (Ovary).
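A minimal sketch of this canvas-accumulation procedure is shown below; model_predict is a hypothetical callable mapping a tile to a vector of per-class probabilities, and border tiles are handled naively:

```python
import numpy as np

def predict_masks(wsi, model_predict, num_classes, tile=256, stride=128):
    """Accumulate tile probabilities onto per-class canvases, normalize
    by the maximum overlap count ((tile / stride)^2 = 4 for a 1/2
    increment), and round to binary pixel masks."""
    h, w = wsi.shape[:2]
    canvases = np.zeros((num_classes, h, w), dtype=np.float32)
    for y in range(0, h - tile + 1, stride):
        for x in range(0, w - tile + 1, stride):
            probs = model_predict(wsi[y:y + tile, x:x + tile])
            for c in range(num_classes):
                canvases[c, y:y + tile, x:x + tile] += probs[c]
    overlap = (tile // stride) ** 2  # maximum per-pixel accumulation
    return np.rint(canvases / overlap).astype(np.uint8)
```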
Evaluation and WSI classification
Given the ability to accurately classify individual tiles in a whole tissue slide, a method analogous to manual instance counting was devised to determine whether an animal is diseased. Since the model prediction results allow counting the number of predicted tiles for each class, the ratio of the total area of each pathology to the total area of tissue can be computed. This is better than simply using the tile counts to compute statistics because the sizes of the images may vary; the ratio serves as a normalization that tracks the relative amount of pathology in an image. For a given pathology \(\:p\), the area ratio \(\:{R}_{p}\) represents the normalized area, which can be used to compute group statistics. In the manual counting approach, the determining factor for an animal being diseased is solely whether the mean count of any pathology is 1.5 standard deviations higher than the mean count of the same pathology in the control group. The same procedure can be replicated using the area ratios of the pathologies rather than the instance counts. The mean \(\:\mu\:\) and standard deviation \(\:\sigma\:\) of the area ratios of the control group are computed for each pathology. The area ratios of the experimental group can then be standardized by subtracting \(\:\mu\:\) from each area ratio and dividing by \(\:\sigma\:\). With the experimental values standardized relative to the control group, the classification of whether any whole slide tissue sample taken from the experimental group belongs to a diseased animal is made as follows:
$$\mathrm{Diseased}\left(P\right)=\begin{cases}\text{True}, & \text{if } \exists\, p\in P:\ \dfrac{R_{p}-{\mu}_{p}}{{\sigma}_{p}}\ge \delta\\ \text{False}, & \text{otherwise}\end{cases}$$
In the equation above, \(\:P\) represents the set of pathologies that belong to any one image, and \(\:\delta\:\) is the threshold on the number of standard deviations which must be met or exceeded by the standardized area ratio for the statement to become true. In other words, if the area ratio \(\:{R}_{p}\) for any pathology \(\:p\) in the given image is \(\:\delta\:\) standard deviations above the mean of the control group, then the entire image belongs to the diseased class. In this case, \(\:\delta\:=2\) resulted in final predictions that aligned more closely with the manual counts that used \(\:\delta\:=1.5\). This is likely due to noise in the deep learning predictions caused by a low but consistent false-positive rate, which justifies the higher WSI threshold.
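The WSI-level decision rule translates directly into code; a minimal sketch assuming the area ratios and control-group statistics are stored in dictionaries keyed by pathology:

```python
def diseased(area_ratios, control_mu, control_sigma, delta=2.0):
    """True if the standardized area ratio of ANY pathology in the
    image is at least delta control-group standard deviations above
    the control mean."""
    return any((area_ratios[p] - control_mu[p]) / control_sigma[p] >= delta
               for p in area_ratios)
```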