VN September 2025

instances. (6) This process is repeated until we exhaust the budget on the number of labels that can be provided.

Since we are also interested in settings with multiple data modalities, we develop a modification of the above procedure to accommodate an ensemble of models, each trained on its own data modality. Specifically, we modify how the prediction for each instance is calculated. Above, the prediction was simply extracted from the output of a single model, which contains the probabilities that the instance belongs to each of the possible classes. In the multimodal setting, we have the outputs of multiple models with possibly differing accuracies, so we want to weight their predictions accordingly. In particular, we assign each instance a score for each of the possible classes given by a weighted sum of the models’ outputs. If there are M models, then the score for an instance belonging to class i is calculated using Equation 1. Then, as in the unimodal case, the prediction for the instance is obtained by sampling from a multinomial distribution with the class scores as inputs. After being initialised to 1/M, the weights are updated in each subsequent round according to the number of queried instances that the models have classified correctly. The weight for model m is given by Equation 2, where correct_m is the number of instances queried so far that were classified correctly by model m. By construction, the weights sum to 1, so the resulting scores can be interpreted as probabilities.

Intuition for Ranking Idea

Having described our active learning algorithm, we now present additional analysis that provides intuition for our choice of ranking metric for our dataset. Note that the following analysis is not necessary to use the algorithm and is solely for explanatory purposes. For our setting, we have chosen the maximum thermal pixel value as the metric to be used for ranking in descending order.
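The ensemble scoring, weight update, and descending ranking steps described above can be sketched as follows. The exact forms of Equations 1 and 2 are not reproduced in this text, so the weighted sum and the proportional-to-correct-count weight update are assumptions, as are all function and variable names.

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_scores(model_probs, weights):
    """Score for each class as a weighted sum of the M models' predicted
    class probabilities (the assumed form of Equation 1)."""
    # model_probs: shape (M, num_classes); weights: shape (M,), summing to 1.
    return weights @ model_probs  # shape (num_classes,)

def sample_prediction(scores):
    """As in the unimodal case, sample the predicted class from a
    multinomial distribution with the class scores as probabilities."""
    return rng.choice(len(scores), p=scores)

def update_weights(correct):
    """Assumed form of Equation 2: weight each model by its share of the
    queried instances it has classified correctly, falling back to the
    uniform 1/M initialisation before any model has been correct.
    The weights sum to 1 by construction."""
    correct = np.asarray(correct, dtype=float)
    if correct.sum() == 0:
        return np.full(len(correct), 1.0 / len(correct))
    return correct / correct.sum()

def rank_by_max_pixel(thermal_images):
    """Rank images for labelling by maximum thermal pixel value,
    descending, so the warmest (most midden-like) images come first."""
    mpv = np.array([img.max() for img in thermal_images])
    return np.argsort(-mpv)
```

Because each model's probabilities sum to 1 and the weights sum to 1, the ensemble scores also sum to 1 and can be passed directly to the multinomial sampler.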
This sets the target as the maximum pixel value across all of the thermal images, exploiting our knowledge that the sought-after rhino middens are warm. To demonstrate that this ranking technique effectively prioritises midden images for labelling, we compute the probability of a thermal image containing a midden given that its maximum pixel value (MPV) is no less than a threshold t. To calculate this, we first apply Bayes’ rule, shown in Equation 3. Let m be the total number of middens and m_t be the number of midden images with MPV no less than t. The first factor in the numerator is then P(MPV ≥ t | midden) = m_t / m. Plugging this into Equation 3 gives Equation 4. We plot Equation 4 in Figure 5, which shows that the probability of an image containing a midden tends to increase as its maximum pixel value nears the target value, demonstrating the utility of the ranking method. More generally, we expect any dataset with an appropriately chosen informative metric and target to obey a similar pattern: the probability of being a positive instance drops with increasing distance from the target value. Our ranking-based active learning query strategy is designed for any such dataset.

Results

In this section, we present the performance of models passively trained on thermal, RGB, LiDAR, and fused imagery and show that the trained models are able to detect middens in a held-out test set with high accuracy. We also compare the performance of our MultimodAL active learning algorithm to several baselines, demonstrating that our method efficiently selects images for labelling and achieves fast midden retrieval in a large, imbalanced, and initially unlabelled dataset. All error bars in Subsections 5.1 and 5.2 show one standard error of the mean in each direction. All models are trained for 10 epochs, and a threshold of 0.5 is applied to the models’ sigmoid output for test image classification. Each experiment is run 30 times.
Detecting Middens with Passive Learning

We passively train models with images in different modalities to establish that neural networks are capable of accurately detecting rhino middens in remotely sensed imagery. To train our passive learning models, we assume that the system has access to all of the images’ labels from the start. We split this labelled data into training and test sets with the following random selections. First, we add 80% of the midden images (71) to the training set and leave 20% (18) for the test set. We then add 18 empty (non-midden) images to the test set, yielding a balanced test set of 36 images. Of the remaining empty images, we randomly sample 71 and add them to the training set to balance it, yielding a balanced training set of 142 images.

For each trial, we train the model on the thermal, RGB, LiDAR, or fused imagery and then record the accuracy on the test set at the end, graphed in Figure 6, where each trial has a different random assignment of images to the training and test sets. We also report the mean and standard errors of the accuracy, precision, recall, and F1 score across the passive trials in Table 1. We observe that the Thermal+RGB Fused model achieves the best accuracy and precision, and the Thermal+LiDAR Fused model achieves the best recall and F1 score. Among the three individual data modalities, the Thermal model significantly outperforms the RGB and LiDAR models on our dataset.

Figure 5: Probability that an image contains a midden tends to increase with its maximum thermal pixel value.
Figure 6: Mean accuracy for the passive models across 30 trials after training for 10 epochs. Models are ranked in descending order by accuracy.
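The balanced 80/20 split described above can be sketched as follows, assuming lists of midden and empty image identifiers; the helper name and seed handling are illustrative, not from the paper.

```python
import random

def balanced_split(midden_ids, empty_ids, seed=0):
    """Assumed reconstruction of the paper's split: 80% of midden images
    go to the training set and 20% to the test set, with equal numbers of
    randomly sampled empty images added to balance each set."""
    rng = random.Random(seed)
    middens = midden_ids[:]
    rng.shuffle(middens)
    n_test = round(0.2 * len(middens))           # 18 of 89 midden images
    test_middens = middens[:n_test]
    train_middens = middens[n_test:]             # the remaining 71

    empties = empty_ids[:]
    rng.shuffle(empties)
    test_empties = empties[:len(test_middens)]   # balance the test set
    train_empties = empties[len(test_middens):
                            len(test_middens) + len(train_middens)]

    train = train_middens + train_empties        # 71 + 71 = 142 images
    test = test_middens + test_empties           # 18 + 18 = 36 images
    return train, test
```

With 89 midden identifiers this reproduces the 142-image training set and 36-image test set described above.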
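The reported metrics can be computed from the models’ sigmoid outputs with the stated 0.5 threshold. This sketch uses the standard definitions and treats midden as the positive class (an assumption, as is the function name).

```python
import numpy as np

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Accuracy, precision, recall, and F1 for one trial, thresholding the
    sigmoid output at 0.5 as described in the text (1 = midden)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # middens found
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    fn = np.sum((y_pred == 0) & (y_true == 1))   # middens missed
    accuracy = np.mean(y_pred == y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Averaging these quantities over the 30 trials, and dividing the sample standard deviation by the square root of the trial count, gives the means and standard errors reported in Table 1.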
