VN September 2025

While remote sensing technology is increasingly being used in ecological research [Marvin et al., 2016], the rate at which large volumes of data are collected often outpaces processing and analysis, preventing crucial insights from being gained rapidly and at scale [Tuia et al., 2022]. To overcome this challenge, we develop machine learning models utilising thermal (heat), RGB (colour), and LiDAR (light detection and ranging) imagery of a site in Kruger National Park with identified middens.

We believe all three modalities can play an important role in rhino midden detection. Due to their warm temperature, middens often show up as bright areas in thermal imagery, as seen in the left image of Figure 2(a). In RGB imagery, middens often appear brown, as seen in the right image of Figure 2(b). LiDAR imagery is expected to be more helpful at other sites with large numbers of termite mounds because, although both middens and mounds tend to be warm, the latter often have steeper slopes.

First, we consider whether passive (i.e., supervised) deep learning techniques can detect rhino middens in multimodal imagery. Second, we determine which data modalities, and which combinations thereof, are most informative for automatic midden detection, a salient question given the limited resources available for conservation and ecosystem monitoring. However, due to geographic differences between ecological sites, a deep learning model that performs well on one site may not perform well on another [Beery et al., 2018], creating a cumbersome labelling burden. Our third contribution is therefore to develop active learning methods that strategically select images to be labelled by an expert in order to find rhino middens in an unlabelled dataset in which most images are empty. The goal is to reach an accuracy that competes with passive learning methods despite having far fewer labelled data points.
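The fused combinations of modalities mentioned above can be formed by simple channel stacking ("early fusion") before the images are fed to a classifier. The sketch below illustrates the idea; the function name, the 5-channel layout, and the min-max normalisation are assumptions for illustration, not the authors' exact pipeline.

```python
import numpy as np

def fuse_modalities(thermal, rgb, lidar):
    """Early fusion: stack a single-channel thermal tile, a 3-channel RGB
    tile, and a single-channel LiDAR-derived raster (e.g. elevation or
    slope) of the same area into one 5-channel array for a classifier.
    Each channel is min-max normalised so no modality dominates purely
    because of its raw value range."""
    def norm(x):
        x = x.astype(np.float64)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    channels = [norm(thermal)] + [norm(rgb[..., i]) for i in range(3)] + [norm(lidar)]
    return np.stack(channels, axis=-1)

# Example: fuse a 64x64 tile from each modality.
fused = fuse_modalities(np.random.rand(64, 64),
                        np.random.rand(64, 64, 3),
                        np.random.rand(64, 64))
print(fused.shape)  # → (64, 64, 5)
```

A network trained on fused input then simply accepts five input channels instead of one or three.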
However, when the dataset has extreme class imbalance, predominant active learning methods [Lewis and Gale, 1994; Kellenberger et al., 2019] are unlikely to query rare positive samples, impeding the model's learning. To overcome this challenge, we introduce the MultimodAL active learning system, which leverages information about the signal of interest to rank the instances. We then prioritise querying those most likely to be rare positive samples, accelerating the model's learning. Within MultimodAL, we also introduce an ensemble active learning strategy that dynamically weights the predictions from several models to query instances more likely to be positive samples. Our methods apply to the general problem of identifying a rare signal of interest, about which we have some prior knowledge, in an imbalanced dataset for which complete annotation is impractical.

We train and evaluate our methods using 9,772 images of a site in Kruger National Park captured in three modalities in collaboration with South African National Parks. We perform image classification to identify midden and non-midden images and map middens geographically for the first time. We train a passive neural network on each of our data modalities (thermal, RGB, and LiDAR) as well as on fused combinations of these data types. For the middens at this site, thermal imagery is the most informative, RGB provides a slight boost in accuracy when fused with thermal, and fusing thermal with LiDAR improves recall. Next, we design and implement a novel multimodal active learning system, MultimodAL, that exploits the fact that middens are warm. We compare the performance of our query strategies against several standard baselines. MultimodAL achieves statistically identical performance to the best passive learning model with 94% fewer labels, greatly easing the labelling burden on domain experts.
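The core ranking idea behind MultimodAL, prioritising tiles most likely to contain a warm midden, could be sketched as follows. The top-k-pixel heuristic and the function name are illustrative assumptions rather than the paper's exact scoring rule.

```python
import numpy as np

def thermal_priority(tiles, k=20):
    """Rank unlabelled thermal tiles so that those containing small warm
    blobs (candidate middens) are queried first. The score is the mean of
    each tile's k hottest pixels, which responds to a compact bright patch
    without being washed out by the tile's overall average temperature."""
    scores = np.array([np.sort(tile.ravel())[-k:].mean() for tile in tiles])
    return np.argsort(scores)[::-1]  # tile indices, hottest-first

# Toy example: the second tile contains a warm 2x2 patch and is ranked first.
cold = np.zeros((8, 8))
warm = np.zeros((8, 8))
warm[2:4, 2:4] = 40.0
print(thermal_priority([cold, warm], k=4))  # → [1 0]
```

Labels requested in this order reach the rare positives far sooner than random sampling would in a dataset where most tiles are empty.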
Finally, mapping the rhino middens at this site reveals that they are not distributed randomly across the region but rather form clusters, so ranger patrols ought to be targeted at the areas with high midden densities. Thus, we have provided actionable information for rhino conservation as a result of our endeavour to map rhino middens rather than rhinos directly, and our methods facilitate scaling these insights to additional rhino habitats.

Related Work

We discuss related multimodal deep learning and active learning methods in this section. Several studies have fused thermal and RGB data to improve the performance of deep learning models. Alexander et al. [2022] and Speth et al. [2022] utilised thermal and RGB fusion in deep learning methods to detect cracks in civil infrastructure and to locate civilians in disaster zones, respectively. In our setting, we consider fusions of thermal, RGB, and LiDAR imagery. While the above works consider only passive learning settings, we also investigate multimodality in active learning environments. For our active learning models, we evaluate performance both when fusing several types of imagery beforehand and when allowing distinct models trained on different image modalities to form a "committee" for the active learning system.

Active learning algorithms are generally distinguished by their strategy for evaluating how informative an unlabelled sample is [Settles, 2009]. One of the most common active learning techniques is uncertainty sampling [Lewis and Gale, 1994], wherein the model requests labels for the images about which it is most uncertain. This method does not explicitly prioritise a particular class, so it is not designed to find extremely rare positive samples in a highly imbalanced dataset. To address this mismatch, Kellenberger et al. [2019] introduced positive certainty sampling, which instead prioritises for labelling the images that are likely to be positive samples.
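The contrast between the two query strategies just described can be made concrete in a few lines. This is a minimal sketch of the selection rules only; the function names are ours, and a real system would wrap them around a trained classifier's predicted probabilities.

```python
import numpy as np

def uncertainty_query(probs, n):
    """Uncertainty sampling [Lewis and Gale, 1994]: request labels for the
    n samples whose predicted positive probability is closest to 0.5."""
    return np.argsort(np.abs(probs - 0.5))[:n]

def positive_certainty_query(probs, n):
    """Positive certainty sampling [Kellenberger et al., 2019]: request
    labels for the n samples the model is most confident are positive."""
    return np.argsort(probs)[::-1][:n]

probs = np.array([0.05, 0.48, 0.92, 0.30, 0.71])
print(uncertainty_query(probs, 2))         # → [1 3] (closest to 0.5)
print(positive_certainty_query(probs, 2))  # → [2 4] (highest probability)
```

Under extreme imbalance, the second rule spends the labelling budget on likely positives, which is what midden detection needs.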
Both uncertainty and positive certainty sampling were designed for a single data modality. Zhang et al. [2021] developed an active learning algorithm for thermal and RGB data that prioritises images that are classified differently by the separate thermal and RGB models. While this method accommodates two data modalities, like uncertainty sampling it is not designed for imbalanced datasets. We introduce a form of multimodal positive certainty sampling, which prioritises images that an ensemble of models (one for each modality) determines are likely rare positive samples. Because of the difficulty of identifying these rare samples, we need to make the query method as powerful as possible. An example of a method that uses expert knowledge to identify images likely to contain an object of interest is described in De Oliveira and Wehrmeister.

Figure 2: Each pair shows the thermal (left) and RGB (right) images of the same area containing a midden. Green boxes outline middens. Red boxes outline areas that falsely appear to be middens. In (a), the midden is more obvious in the thermal image than in the RGB image; in (b), the reverse is true.

Figure 3: Thermal (left), RGB (middle), and LiDAR (right) orthomosaics comprising the dataset under study.
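The multimodal ensemble query described in this section could take the shape below. The weighting scheme shown (normalised per-model weights, e.g. each model's running accuracy on labels gathered so far) and all names are illustrative assumptions; the exact dynamic weighting used by MultimodAL is not specified above.

```python
import numpy as np

def ensemble_query(model_probs, model_weights, n):
    """Query the n unlabelled samples with the highest weighted-average
    positive probability across per-modality models. model_probs has shape
    (n_models, n_samples); model_weights could be, e.g., each model's
    running accuracy on the labels gathered so far."""
    w = np.asarray(model_weights, dtype=float)
    w = w / w.sum()                          # normalise the weights
    combined = w @ np.asarray(model_probs)   # -> (n_samples,)
    return np.argsort(combined)[::-1][:n]

# Toy example: thermal model (weight 3) and RGB model (weight 1) score 3 tiles.
probs = np.array([[0.9, 0.1, 0.4],   # thermal model
                  [0.2, 0.8, 0.5]])  # RGB model
print(ensemble_query(probs, [3.0, 1.0], 2))  # → [0 2]
```

Re-computing the weights after each labelling round lets the ensemble lean on whichever modality is currently most reliable.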
