In this blogpost, we explore our AI assistant, which handles monitoring, alignment, and welding in our process using Meta's latest Segment Anything Model 2 (SAM-2). Our previous blogpost discussed our camera calibration system, capable of steering the robot precisely using keypoints from various camera sources. To further automate our process, we need to accurately segment the camera input and take action based on specific scenarios. When the robot introduces the feeding wire, we perform several checks:
Wire straightness: Creating straight wires is a complex process. We need to verify that the feeding wire is straight and conforms to our theoretical model using multiple cameras. We discard any wire that falls outside our acceptable error margin (a simplified version of this check is sketched after this list).
Contact with the structure: Due to slight variations that occur during welding or cutting, the wire might not make contact with the structure, requiring us to move it closer.
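Our production straightness check compares the segmented wire against the theoretical model across multiple cameras. As a simplified, single-camera illustration, one can fit a line to the wire mask and measure the worst perpendicular deviation; the function name and the 3-pixel margin below are purely illustrative:

```python
import numpy as np

def wire_straightness_error(mask: np.ndarray) -> float:
    """Return the maximum perpendicular deviation (in pixels) of the wire
    mask from its best-fit line. `mask` is a binary HxW array."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float64)
    # Fit a line through the centroid using PCA (principal direction of the points).
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]                       # unit vector along the wire
    normal = np.array([-direction[1], direction[0]])
    deviations = np.abs(centered @ normal)  # perpendicular distance to the fitted line
    return float(deviations.max())

# Illustrative usage: reject the wire if it deviates more than a 3-pixel margin.
# if wire_straightness_error(wire_mask) > 3.0:
#     discard_wire()
```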
Current Implementation Using Dichotomous Image Segmentation (DIS)
To address this challenge, we fine-tuned a Dichotomous Image Segmentation (DIS) neural network on 1,000 annotated images. This method works well for background removal. We require highly accurate segmentations, a clear distinction between the incoming wire and the surrounding structure, robustness to different lighting conditions and background objects, and high processing speed; DIS meets these requirements, making it a viable solution.
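At inference time, driving such a network is straightforward. The sketch below is illustrative only: `load_finetuned_isnet`, the 1024x1024 input size, and the 0.5 threshold are placeholders standing in for our actual weights and preprocessing.

```python
import cv2
import numpy as np
import torch

# Illustrative inference loop for a DIS-style binary segmentation network.
# `load_finetuned_isnet` is a hypothetical helper; the real preprocessing and
# output format depend on how the network was fine-tuned.
model = load_finetuned_isnet("dis_wire.pth").eval().cuda()

def segment_wire(frame_bgr: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    h, w = frame_bgr.shape[:2]
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (1024, 1024)).astype(np.float32) / 255.0
    tensor = torch.from_numpy(resized).permute(2, 0, 1).unsqueeze(0).cuda()
    with torch.no_grad():
        logits = model(tensor)                 # assumed to return a 1x1xHxW map
        prob = torch.sigmoid(logits)[0, 0].cpu().numpy()
    # Resize the probability map back to the camera resolution and binarize it.
    mask = (cv2.resize(prob, (w, h)) > threshold).astype(np.uint8)
    return mask
```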
This approach worked well for monitoring the wire's straightness, as we can perform this check before moving to the actual target, ensuring a clear background with no structure in view. However, once close to the structure, we encountered issues with leaking masks and inaccurate predictions. For example, the following case demonstrates a much more complex situation, with occlusion from the rod holder in the top-left corner, a complex node, and a grainy image. The image quality is relatively low because the robot was still moving slightly. While we could prevent this, doing so would slow down our process.
We attempted to solve this problem by segmenting the incoming wire before moving to the structure, then using OpenCV's template matching to track the rod tip (our primary area of interest). This solution worked but wasn't as robust as we had hoped. That's why we were excited when Meta released Segment Anything Model 2 (SAM-2) a month ago, as it offers some interesting features.
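For context, that template-matching stopgap is only a few lines of OpenCV. The sketch below is illustrative rather than our exact code: the function names, the 40-pixel crop size, and the assumption of grayscale frames are ours.

```python
import cv2
import numpy as np

def crop_tip_template(frame_gray: np.ndarray, tip_xy: tuple[int, int], size: int = 40) -> np.ndarray:
    # Crop a small patch around the rod tip from a frame where the wire was cleanly segmented.
    x, y = tip_xy
    return frame_gray[y - size // 2 : y + size // 2, x - size // 2 : x + size // 2]

def locate_tip(frame_gray: np.ndarray, template: np.ndarray) -> tuple[tuple[int, int], float]:
    # Slide the template over the new frame and return the best-matching tip position.
    scores = cv2.matchTemplate(frame_gray, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(scores)
    h, w = template.shape
    center = (max_loc[0] + w // 2, max_loc[1] + h // 2)
    return center, max_val  # a low score signals the match is unreliable
```

Normalized cross-correlation works well as long as the tip stays visible and its appearance doesn't change much, which is exactly where it fell short once occlusions appeared.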
SAM-2
We initially attempted to use SAM-1 to solve this problem. However, it presented several issues that led us to abandon it in favor of DIS:
Speed: It couldn't perform real-time image segmentation on a "normal" GPU, especially since we need to process 4 different cameras simultaneously. While a "FastSAM" variant claims to be 50x faster with similar quality, we found that the mask quality wasn't crisp enough for monitoring wire quality.
Lack of tracking support: Ideally, we would track the wire during robot movement, as it can become partially occluded as the robot moves in.
Segmentation quality: It fell short of our expectations. Many situations still resulted in mask leakage or incorrectly combined masks.
These factors led us to train a simpler and faster model, as previously explained. However, SAM-2 appears to have addressed all these issues! The demo is impressive and showcases the new model's capabilities. It offers a 6x speed increase, better accuracy, and streaming memory capabilities, enabling real-time object tracking in video. This last feature makes the model particularly interesting for tackling our problem. After integrating it into our stack, we noticed immediate improvements. For example, the following image shows the results without using the tracking mechanism, with a single prompt roughly at the wire location. What's especially impressive is that this is not fine-tuned to our data, suggesting potential for even better performance with customization.
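For reference, prompting the model on a single frame takes only a few lines. This is a minimal sketch rather than our production code; `frame_rgb`, the prompt coordinates, and the config and checkpoint names are placeholders for whichever SAM-2 variant is installed.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder config/checkpoint names; pick the SAM-2 variant you actually use.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))

with torch.inference_mode():
    predictor.set_image(frame_rgb)  # HxWx3 uint8 RGB frame from one camera
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[wire_x, wire_y]]),  # one rough click on the wire
        point_labels=np.array([1]),                 # 1 = foreground
        multimask_output=False,
    )
wire_mask = masks[0].astype(bool)
```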
The results become even more impressive when we utilize the tracking mechanism built into the model itself.
This feature accurately tracks the rod frame by frame in real time across multiple videos. We use an NVIDIA RTX 4090 graphics card with 24 GB of memory for inference. By loading four models in memory (one for each camera), we can track the wire in real time on all cameras simultaneously.
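As a rough sketch of what that tracking looks like in code, the public SAM-2 repository exposes a video predictor driven by the same streaming memory; the example below uses its frame-directory interface rather than our camera pipeline, and the paths, prompt coordinates, and object id are placeholders.

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# One predictor per camera in our setup; config/checkpoint names are placeholders.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    state = predictor.init_state(video_path="camera_0_frames/")

    # Prompt the wire once, on the first frame, with a single foreground point.
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[wire_x, wire_y]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # The streaming memory then propagates the mask through the following frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        wire_mask = (mask_logits[0] > 0.0).cpu().numpy()
```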
By adopting Meta's open-source model, we transitioned from experiments to production within a week without needing to fine-tune any models. This efficiency allows us to focus our time and resources on building our French Deep-Tech company.
Merci beaucoup, Yann LeCun!
P.S. If you're interested in working on exciting projects, check out our website and job openings on LinkedIn.