Design Specification

This pipeline is meant as a plug-and-play system: it can be applied to a variety of similar agricultural problems with few changes. Each step described below can be swapped out for an alternative suited to the specific task at hand; we have already implemented some of these alternatives.

Stalk detection is packaged as a ROS node that registers a ROS service (allowing request-reply communication between two nodes). The service takes the number of images to process and a timeout as parameters, and returns the final stalk information.
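A rough client-side sketch of calling such a service is shown below; the service name, package, and request fields are illustrative assumptions, not the actual interface:

```python
# Hypothetical client-side usage sketch; the service name, package, and
# field names below are assumptions for illustration, not the real interface.
import rospy
from stalk_detect.srv import GetStalks  # assumed service definition

rospy.init_node('stalk_detect_client')
rospy.wait_for_service('get_stalks')
get_stalks = rospy.ServiceProxy('get_stalks', GetStalks)

# Request: number of frames to process and a timeout (seconds).
response = get_stalks(num_frames=5, timeout=10.0)
print(response)  # final stalk information (e.g. grasp point) returned by the node
```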

Installation instructions and more usage documentation can be found in the code:

Code

2D Detection

Dataset Collection and Labeling

Video data was collected in multiple regions of a corn field with a mobile phone, capturing diverse lighting, weather conditions, viewpoints, and plant ages. The phone was positioned low to the ground to mimic the approximate viewpoint of the camera on the manipulator. One frame per second was then extracted from the videos, and extraneous images were removed.
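A minimal sketch of the frame-extraction step, using OpenCV; the filename and output layout are assumptions:

```python
# Extract one frame per second from a video using OpenCV (illustrative sketch).
import cv2

cap = cv2.VideoCapture("field_video.mp4")  # assumed filename
fps = cap.get(cv2.CAP_PROP_FPS)

frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % int(round(fps)) == 0:  # keep roughly one frame per second
        cv2.imwrite(f"frames/frame_{saved:05d}.jpg", frame)
        saved += 1
    frame_idx += 1
cap.release()
```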

The remaining 2667 images were labeled using a tool which uses Meta's Segment Anything model. Masks were drawn for all stalks in the foreground, extending from the bottom of the visible stalk to the first occlusion by the plant's leaves (which, for our purposes, marks the approximate top of the stalk). This produced 7681 stalk instance annotations. The images were then split randomly into 80% training, 10% validation, and 10% testing data.

Points are manually selected inside the desired mask (blue) and outside it (red); Meta's Segment Anything model then produces the mask.
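For reference, point-prompted mask generation with Segment Anything looks roughly like this; the checkpoint path, image, and point coordinates are placeholders:

```python
# Point-prompted segmentation with Meta's Segment Anything (sketch).
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("stalk.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Clicked points: label 1 = inside the stalk (blue), label 0 = outside (red).
points = np.array([[320, 400], [330, 250], [100, 100]])
labels = np.array([1, 1, 0])

masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels,
                                     multimask_output=False)
```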

Training

We use the Mask R-CNN architecture, as implemented in Facebook AI Research's Detectron2 library, with a ResNet-50 backbone and Feature Pyramid Network for segmentation. Starting from Detectron2's benchmark model for this architecture, we fine-tuned it on our training data for 40,000 iterations, with a maximum learning rate of 0.00025, the "WarmupMultiStepLR" scheduler, and a batch size of 4. Training took 6.7 hours on an NVIDIA GeForce RTX 4080. Our trained model can be found here (TODO).
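The fine-tuning setup corresponds roughly to the following Detectron2 configuration; the dataset names, class count, and output paths are assumptions, while the solver settings match those stated above:

```python
# Detectron2 fine-tuning configuration matching the settings above (sketch;
# dataset registration and paths are placeholders).
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")  # benchmark weights
cfg.DATASETS.TRAIN = ("stalks_train",)   # assumed registered dataset names
cfg.DATASETS.TEST = ("stalks_val",)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1      # single "stalk" class
cfg.SOLVER.IMS_PER_BATCH = 4
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 40000
cfg.SOLVER.LR_SCHEDULER_NAME = "WarmupMultiStepLR"

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```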

Model metrics are below.

Grasp Point Estimation

Stalk Analysis

We treat each stalk as a 3D line extending from a point on the ground to a point in space; this simplified model trades some 3D accuracy for lower computational complexity. Let's take a look at how a single frame (a color image paired with a depth image) is processed:

First, our segmentation model produces masks for the color image. The masks are then partitioned into vertical slices (we found 10 pixels to be a sufficient vertical distance between slices to balance performance and accuracy, but this hyper-parameter can be adjusted to account for varying stalk height). We call the centerpoint of each slice its feature point. In each slice, the median depth measurement among all points is extracted from the depth image and assigned to the slice's feature point.
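One possible sketch of this slicing step, assuming the mask and depth image are aligned NumPy arrays (not the exact implementation):

```python
# Reduce a binary stalk mask to per-slice feature points with median depth (sketch).
import numpy as np

def mask_to_feature_points(mask, depth, slice_height=10):
    """mask: HxW bool array; depth: HxW depth image aligned with the color image."""
    points = []
    rows = np.where(mask.any(axis=1))[0]
    for top in range(rows.min(), rows.max() + 1, slice_height):
        slice_mask = mask[top:top + slice_height]
        ys, xs = np.nonzero(slice_mask)
        if xs.size == 0:
            continue
        u = xs.mean()                      # horizontal center of the slice
        v = top + ys.mean()                # vertical center of the slice
        d = np.median(depth[top:top + slice_height][slice_mask])  # median depth in slice
        points.append((u, v, d))
    return points  # (pixel u, pixel v, depth) feature points, one per slice
```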

The 3D feature points are then transformed into the robot's base frame. For each stalk, RANSAC line fitting followed by least-squares refinement is run on its feature points, and the resulting line is extended to the world's ground plane. The grasping point of the stalk is simply the point along this line at a specified Z height (a hyper-parameter) above the ground.
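The line fit and grasp-point computation could look roughly like the following minimal RANSAC-plus-least-squares sketch; thresholds and iteration counts are illustrative, not the values used in the pipeline:

```python
# Fit a 3D line to a stalk's feature points and compute the grasp point (sketch;
# a minimal RANSAC loop with a least-squares refit, not the exact implementation).
import numpy as np

def fit_stalk_line(points, n_iters=100, inlier_thresh=0.02):
    """points: Nx3 array of feature points in the robot base frame (meters)."""
    best_inliers = None
    for _ in range(n_iters):
        a, b = points[np.random.choice(len(points), 2, replace=False)]
        direction = (b - a) / np.linalg.norm(b - a)
        # Distance of every point to the candidate line through a with this direction.
        diffs = points - a
        dists = np.linalg.norm(diffs - np.outer(diffs @ direction, direction), axis=1)
        inliers = dists < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Least-squares refinement: PCA on the inliers gives the best-fit direction.
    inlier_pts = points[best_inliers]
    centroid = inlier_pts.mean(axis=0)
    direction = np.linalg.svd(inlier_pts - centroid)[2][0]
    return centroid, direction

def grasp_point(centroid, direction, grasp_height=0.1):
    """Intersect the line with the plane z = grasp_height above the ground (z = 0)."""
    t = (grasp_height - centroid[2]) / direction[2]
    return centroid + t * direction
```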

Processing of a single frame: the input image is segmented into detections, which are partitioned into slices and reduced to feature points. The 3D feature points are fed to RANSAC and least squares for line fitting.

Best Stalk Determination

We've now produced a number of 3D grasping points for each frame, so we need to decide which stalk is best to grasp:

Based on the manipulator's dimensions and shape, stalks that are not graspable are eliminated: those where another stalk blocks the predicted path the manipulator would take to insert the sensor, or that are out of the manipulator's reach. Stalks with unrealistic positions, caused by sporadic segmentation output, misalignment between the depth map and the image, or mis-measurements in the depth map, are eliminated as well. The best stalk is then determined from among those remaining by a heuristic weighting function over the stalk's height and width, the confidence of the segmentation model's prediction, and the distance of the stalk from the optimal grasping position for the arm. This weighting function favors stalks that are more easily graspable and have a higher chance of successful insertion; the precise weights could be adjusted for a different task based on the most favorable conditions for the manipulator.
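One way such a weighting function could be expressed is sketched below; the features and weights are placeholders, not the tuned values used by the pipeline:

```python
# Illustrative stalk-scoring heuristic; the weights and features below are
# placeholder assumptions, not the tuned values used by the pipeline.
from dataclasses import dataclass

@dataclass
class StalkCandidate:
    height: float           # estimated stalk height (m)
    width: float            # estimated stalk width (m)
    confidence: float       # segmentation confidence, 0..1
    dist_to_optimal: float  # distance from the arm's preferred grasp pose (m)

def stalk_score(s, w_height=0.3, w_width=0.2, w_conf=0.3, w_dist=0.2):
    # Favor tall, wide, confidently detected stalks close to the optimal pose.
    return (w_height * s.height + w_width * s.width
            + w_conf * s.confidence - w_dist * s.dist_to_optimal)

candidates = [StalkCandidate(1.2, 0.03, 0.9, 0.10),
              StalkCandidate(0.9, 0.04, 0.8, 0.05)]
best = max(candidates, key=stalk_score)
```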

Consensus Among Frames

This process is repeated for a number of frames, after which the best stalks from all frames are clustered and the largest cluster is taken to be the consensus stalk. A representative stalk from this cluster is then chosen, and the grasping point for this stalk is determined to be the final grasping point.
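A sketch of this consensus step, using DBSCAN and a cluster-mean representative as illustrative choices (the actual clustering method and threshold may differ):

```python
# Cluster per-frame grasp points and take the largest cluster as consensus (sketch;
# DBSCAN and the 3 cm threshold are illustrative choices, not the exact implementation).
import numpy as np
from sklearn.cluster import DBSCAN

def consensus_grasp_point(grasp_points, eps=0.03):
    """grasp_points: Nx3 array, one best grasp point per processed frame."""
    labels = DBSCAN(eps=eps, min_samples=1).fit_predict(grasp_points)
    largest = np.bincount(labels).argmax()   # label of the biggest cluster
    cluster = grasp_points[labels == largest]
    return cluster.mean(axis=0)              # representative grasp point
```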

If no valid stalks are detected in any frame (which may occur if there are no stalks in view, or if the stalks are too difficult to process), a reposition response is returned, indicating to the robot that the current viewpoint is insufficient to estimate a reasonable insertion pose.

Qualitative Evaluation with Mock Setup

We tested the detection pipeline on synthetic corn and on real corn grown in an indoor greenhouse.
