Weakly Supervised Fruit Counting for Yield Estimation Using Spatial Consistency
Abstract - Fruit counting is a fundamental component of yield estimation applications. Most existing approaches address this problem by relying on fruit models (i.e., by using object detectors) or by explicitly learning to count. Despite the impressive results achieved by these approaches, all of them need strong supervision during the training phase. In agricultural applications, manual labeling may require a huge effort or, in some cases, it may be impossible to acquire fine-grained ground truth labels. In this letter, we tackle this problem by proposing a weakly supervised framework that learns to count fruits without the need for task-specific supervision labels. In particular, we devise a novel convolutional neural network architecture that requires only a simple image-level binary classifier, which detects whether or not the image contains instances of the fruits, and combines this information with image spatial consistency constraints. The result is an architecture that learns to count without task-specific labels (e.g., object bounding boxes or the multiplicity of fruit instances in the image). Experiments on three different varieties of fruit (i.e., olives, almonds, and apples) show that our approach achieves performance comparable to state-of-the-art (SotA) approaches based on the supervised paradigm.
Among the multitude of agricultural processes that draw the attention of computer science and robotics researchers, yield estimation certainly plays an important role. An accurate estimate of the yield of a crop helps the farmer plan harvesting operations and crop sales. However, the standard practice for yield estimation often relies only on coarse measurements and direct inspection, which has high costs and low accuracy. To address this problem, we have recently witnessed a widespread adoption of camera-equipped robots in agricultural fields. The use of automated vehicles combined with SotA computer vision techniques both reduces costs and increases yield estimation accuracy.
In order to build more reliable systems to achieve fruit counting, many vision-based strategies have been proposed by the research community. Most of the existing approaches exploit three fundamental paradigms:
- counting by using specific object detectors
- counting by estimating density maps
- counting by explicitly training an object counter.
Despite the impressive results presented in these works, all of the proposed strategies require detailed supervision for training, in the form of object bounding boxes, density maps, or instance multiplicity. This labeling procedure is very burdensome and error-prone and, in many agricultural scenarios, it may be impossible to acquire enough labelled samples to achieve consistent performance that is robust to image noise or other forms of covariate shift.
Ideally, we want to remove these labels and have a network that is able to learn from the images alone what and how to count. Stated like this, the problem is ill-posed, because the network needs at least a slight hint about what to learn. For this reason, we introduce the multi-branch counting CNN (MBC-CNN), which operates on different image sub-windows at different levels. More precisely, it works on three scales: the whole image, the image divided into quadrants, and the image divided into 16 parts. While we do not use the correct number of fruits in the image as a supervisory signal, we impose the constraint that, at each scale, the total count regressed on the corresponding tiles must be consistent with the total count at the other levels. Since this constraint cannot be used naively to train a counting network, we introduce an image-level binary classifier, which will be referred to as PAC (Presence-Absence Classifier), and use it to train the actual counting network. The key intuition, in addition to the counting consistency, is to force consistency between the output of each counting branch and the prediction of the PAC. If the classifier predicts the presence of object instances, the counter should output a number greater than zero. Conversely, when the classifier predicts the absence of fruits, the count must be zero.
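The two consistency ideas described above can be illustrated with a minimal sketch. It is not the loss used to train MBC-CNN: the function names, the absolute-difference penalty, and the hinge with a `margin` parameter are assumptions made for illustration only, and the per-tile counts are taken as plain numbers rather than network outputs.

```python
from typing import Sequence


def spatial_consistency_loss(
    count_full: float,
    counts_4: Sequence[float],
    counts_16: Sequence[float],
) -> float:
    """Penalise disagreement between the total counts at the three scales.

    Assumption: an absolute-difference penalty between each pair of scales.
    """
    total_4 = sum(counts_4)    # total over the 4 quadrants
    total_16 = sum(counts_16)  # total over the 16 tiles
    return (
        abs(count_full - total_4)
        + abs(count_full - total_16)
        + abs(total_4 - total_16)
    )


def pac_consistency_loss(count: float, pac_present: bool, margin: float = 1.0) -> float:
    """Penalise a count that contradicts the presence/absence prediction.

    Assumption: when fruit is predicted present, a hinge pushes the count
    to at least `margin`; when predicted absent, the count is pushed to zero.
    """
    if pac_present:
        return max(0.0, margin - count)
    return abs(count)


# Consistent totals across the three scales give zero spatial loss.
consistent = spatial_consistency_loss(4.0, [1.0, 2.0, 0.0, 1.0], [0.25] * 16)

# Predicting zero everywhere is also spatially consistent, which is why
# the presence/absence term is needed as an additional hint.
trivial = spatial_consistency_loss(0.0, [0.0] * 4, [0.0] * 16)
penalised = pac_consistency_loss(0.0, pac_present=True)
```

In this sketch the all-zero prediction satisfies the spatial term but is penalised by the presence/absence term whenever fruit is predicted to be present, which mirrors the intuition that the two constraints must be combined.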
To evaluate the performance of both the baseline methods and our approach, the count estimates are compared with the ground truth values using the RMSE metric. We start our discussion by commenting on the RMSE obtained by each model over the whole test set of each fruit dataset. The results are shown in the right-most column of the Results table.
It can be observed that the best performance is achieved by RCNNCount. This is to be expected, since that model is specifically trained with the most informative labels (bounding boxes on instances). The S-COUNT and MBS-COUNT networks give higher, but still comparable, errors with respect to RCNNCount, showing that end-to-end counting on the total number of instances is effective. The most important result is that WS-COUNT, despite being trained in a weakly supervised manner, achieves performance close to that of the supervised baselines. It is also important to observe that the errors obtained by the MB-PAC-Only baseline are considerably higher than those of WS-COUNT, which indicates that the combination of the classifier consistency and the spatial consistency losses gives the network a better capability to count the fruit instances.
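For reference, the RMSE between predicted and ground truth counts can be computed as in the short sketch below. The function name and the example numbers are illustrative assumptions and are not taken from the experiments.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Root mean squared error between predicted and true per-image counts."""
    if len(predicted) != len(ground_truth) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - g) ** 2 for p, g in zip(predicted, ground_truth)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical counts for two images: one is off by 2, the other is exact.
example_error = rmse([2.0, 4.0], [4.0, 4.0])
```

Because the errors are squared before averaging, a few images with large counting errors weigh more heavily on the RMSE than many images with small errors.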
Dataset download link: sira.diei.unipg.it/