Vision YOLO: Object Detection

This tutorial demonstrates real-time object detection using YOLO on your robot's camera feed. Detected objects are annotated with bounding boxes and republished for visualization.

Overview

The vision YOLO script:

  • Subscribes to camera images (/camera/image_raw)
  • Runs YOLO object detection (CPU-only, no GPU required)
  • Draws bounding boxes and labels on detected objects
  • Publishes annotated images to /camera/image_annotated

Prerequisites

  • Active TensorFleet VM with robot + camera simulation
  • Image Panel open in VS Code sidebar
  • Additional dependencies for vision processing

Additional Dependencies

The JavaScript version uses ONNX Runtime for inference:

# Dependencies included in package.json
bun install
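
The exact manifest ships with the project; as a rough illustration, a minimal package.json for this pipeline might look like the following (the package name and versions are illustrative, not the project's actual ones):

{
  "name": "tensorfleet-robot-scripts",
  "type": "module",
  "scripts": {
    "robot:vision": "bun src/vision_yolo.js",
    "robot:vision:colors": "bun src/vision_colors.js"
  },
  "dependencies": {
    "roslib": "^1.4.1",
    "onnxruntime-node": "^1.17.0"
  }
}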

Running the Script

bun robot:vision
# or
bun src/vision_yolo.js

Viewing Detection Output

  1. Open the Image Panel from the TensorFleet sidebar
  2. In the dropdown, select /camera/image_raw to see the raw camera feed
  3. Run the vision script
  4. Switch to /camera/image_annotated to see detections with bounding boxes

Expected Output

Connected to rosbridge.
Subscribing to '/camera/image_raw' (sensor_msgs/Image) and republishing to '/camera/image_annotated'.
Waiting for images...
Received image message on '/camera/image_raw'.
Running YOLO inference on image ...
YOLO inference finished.
Processed image 1 (640x480), published to /camera/image_annotated
Image #1: 3 detections from YOLO (first: { classId: 0, label: 'person', score: 0.89 })

How It Works

Image Pipeline

┌───────────────────────────────────────────────────────────────┐
│                        Vision Pipeline                        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  /camera/image_raw     ┌──────────────┐    /camera/image_     │
│  ────────────────▶     │  YOLO Model  │    annotated          │
│  (sensor_msgs/Image)   │              │    ────────────────▶  │
│                        │  - Decode    │    (sensor_msgs/Image)│
│                        │  - Detect    │                       │
│                        │  - Annotate  │                       │
│                        └──────────────┘                       │
└───────────────────────────────────────────────────────────────┘

Subscribing to Camera Images

const subscriber = new ROSLIB.Topic({
  ros,
  name: IMAGE_TOPIC,
  messageType: IMAGE_MESSAGE_TYPE,
  queue_length: 1 // Only keep the latest frame
});

const publisher = new ROSLIB.Topic({
  ros,
  name: ANNOTATED_IMAGE_TOPIC,
  messageType: IMAGE_MESSAGE_TYPE
});

subscriber.subscribe(async (msg) => {
  // Detect objects in the frame, draw the results, and republish
  const detections = await runYoloOnImageMsg(msg);
  const annotatedMsg = annotateImageMessage(msg, detections);
  publisher.publish(annotatedMsg);
});
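
One caveat worth noting: the callback is async, so if inference takes longer than the gap between frames, runs can overlap. A simple busy flag (a sketch, not part of the script above) drops frames that arrive mid-inference:

// Sketch: serialize processing so at most one inference runs at a time.
let busy = false;
subscriber.subscribe(async (msg) => {
  if (busy) return; // Drop frames that arrive while inference is running
  busy = true;
  try {
    const detections = await runYoloOnImageMsg(msg);
    publisher.publish(annotateImageMessage(msg, detections));
  } finally {
    busy = false;
  }
});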

Decoding ROS Images

function decodeRosImage(msg) {
  const height = msg.height;
  const width = msg.width;
  const encoding = (msg.encoding || "rgb8").toLowerCase();
  const dataField = msg.data;

  let buffer;

  // rosbridge sends image data as a base64 string
  if (typeof dataField === "string") {
    buffer = Buffer.from(dataField, "base64");
  } else {
    buffer = Buffer.from(dataField);
  }

  return {
    buffer,
    meta: { height, width, encoding }
  };
}
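
Camera drivers may publish bgr8 rather than rgb8, which is why the decoder keeps the encoding in meta. If downstream code assumes RGB order, a small in-place channel swap handles that case; this is a sketch with a hypothetical helper name, not part of the script above:

// Sketch: swap the B and R channels in place so a bgr8 buffer reads as rgb8.
// Assumes 3 bytes per pixel.
function bgrToRgbInPlace(buffer) {
  for (let i = 0; i + 2 < buffer.length; i += 3) {
    const b = buffer[i];
    buffer[i] = buffer[i + 2];
    buffer[i + 2] = b;
  }
  return buffer;
}

// Usage: if (meta.encoding === "bgr8") bgrToRgbInPlace(buffer);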

Running YOLO Detection

// Uses ONNX Runtime for CPU inference
async function runYoloOnImageMsg(msg) {
const { buffer, meta } = decodeRosImage(msg);

// Preprocess image for YOLO input
const inputTensor = preprocessImage(buffer, meta);

// Run inference
const outputs = await yoloSession.run({ images: inputTensor });

// Post-process to get bounding boxes
const detections = postprocessYoloOutput(outputs, meta);

return detections;
}
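
The preprocessImage and postprocessYoloOutput helpers aren't shown in the excerpt above. As a rough sketch of what preprocessing typically involves for a YOLOv8 ONNX export (assuming a 640x640 model input, an rgb8 buffer, and simple nearest-neighbor resizing; the actual helper may letterbox instead):

import * as ort from "onnxruntime-node";

// Sketch: resize an rgb8 buffer to 640x640 (nearest neighbor), scale
// pixel values to [0, 1], and lay them out as NCHW float32 — the input
// layout YOLOv8 ONNX exports expect.
function preprocessImage(buffer, meta) {
  const size = 640;
  const data = new Float32Array(3 * size * size);
  for (let y = 0; y < size; y++) {
    const srcY = Math.min(meta.height - 1, Math.floor((y * meta.height) / size));
    for (let x = 0; x < size; x++) {
      const srcX = Math.min(meta.width - 1, Math.floor((x * meta.width) / size));
      const src = (srcY * meta.width + srcX) * 3;
      for (let c = 0; c < 3; c++) {
        // CHW destination index: channel plane, then row, then column
        data[c * size * size + y * size + x] = buffer[src + c] / 255;
      }
    }
  }
  return new ort.Tensor("float32", data, [1, 3, size, size]);
}

Post-processing works in the other direction. A YOLOv8-style export produces a [1, 84, 8400] tensor: 4 box values (center x, center y, width, height) plus 80 class scores per candidate. A minimal sketch under the same assumptions, without non-maximum suppression (a real helper would add it, plus a class-id-to-label lookup):

// Sketch: decode a YOLOv8-style output tensor of shape [1, 84, 8400].
// Layout is channel-major: data[c * 8400 + i] is channel c of candidate i.
function postprocessYoloOutput(outputs, meta, threshold = 0.5) {
  const out = outputs[Object.keys(outputs)[0]]; // first (only) output tensor
  const [, channels, candidates] = out.dims;
  const data = out.data;
  const detections = [];
  for (let i = 0; i < candidates; i++) {
    // Pick the best-scoring class among channels 4..83
    let classId = 0;
    let score = 0;
    for (let c = 4; c < channels; c++) {
      const s = data[c * candidates + i];
      if (s > score) { score = s; classId = c - 4; }
    }
    if (score < threshold) continue;
    // Box channels 0..3 are (cx, cy, w, h) in 640x640 model space
    const cx = data[i];
    const cy = data[candidates + i];
    const w = data[2 * candidates + i];
    const h = data[3 * candidates + i];
    detections.push({
      classId,
      score, // label lookup from a COCO class-name table omitted here
      x: ((cx - w / 2) * meta.width) / 640,
      y: ((cy - h / 2) * meta.height) / 640,
      width: (w * meta.width) / 640,
      height: (h * meta.height) / 640
    });
  }
  return detections;
}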

Drawing Annotations

function annotateImageMessage(msg, detections) {
  if (!detections || detections.length === 0) {
    return msg;
  }

  const { buffer, meta } = decodeRosImage(msg);

  const colors = [
    { r: 0, g: 255, b: 255 },  // Cyan
    { r: 255, g: 0, b: 255 },  // Magenta
    { r: 255, g: 255, b: 0 },  // Yellow
    { r: 0, g: 255, b: 0 },    // Green
    { r: 255, g: 128, b: 0 }   // Orange
  ];

  detections.forEach((det, idx) => {
    const color = colors[idx % colors.length];

    // Draw bounding box
    drawRectOnBuffer(buffer, meta, det, { color, thickness: 4 });

    // Draw label
    const label = `${det.label} ${Math.round(det.score * 100)}%`;
    drawLabelOnBuffer(buffer, meta, det, label, { color });
  });

  return encodeRosImage(buffer, msg, meta);
}
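
drawRectOnBuffer writes pixels directly into the raw image buffer. A minimal sketch of the idea, assuming an rgb8 buffer and a detection with x, y, width, and height in pixels (the script's actual helper may handle clipping and encodings differently):

// Sketch: paint a hollow rectangle into a raw rgb8 buffer by setting
// the pixels along each edge. Coordinates are clamped to the image.
function drawRectOnBuffer(buffer, meta, det, { color, thickness }) {
  const x0 = Math.max(0, Math.round(det.x));
  const y0 = Math.max(0, Math.round(det.y));
  const x1 = Math.min(meta.width - 1, Math.round(det.x + det.width));
  const y1 = Math.min(meta.height - 1, Math.round(det.y + det.height));

  const setPixel = (x, y) => {
    const i = (y * meta.width + x) * 3;
    buffer[i] = color.r;
    buffer[i + 1] = color.g;
    buffer[i + 2] = color.b;
  };

  for (let t = 0; t < thickness; t++) {
    for (let x = x0; x <= x1; x++) {
      setPixel(x, Math.min(y0 + t, y1)); // Top edge
      setPixel(x, Math.max(y1 - t, y0)); // Bottom edge
    }
    for (let y = y0; y <= y1; y++) {
      setPixel(Math.min(x0 + t, x1), y); // Left edge
      setPixel(Math.max(x1 - t, x0), y); // Right edge
    }
  }
}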

YOLO Model Options

Model     Size     Speed     Accuracy
yolov8n   6 MB     Fastest   Good
yolov8s   22 MB    Fast      Better
yolov8m   52 MB    Medium    High
yolov8l   87 MB    Slow      Higher
yolov8x   137 MB   Slowest   Highest

For real-time robotics on CPU, yolov8n (nano) is recommended.
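Whichever size you pick, loading it with onnxruntime-node is the same one-time call; the file path below is an assumption, so point it at your own export:

import * as ort from "onnxruntime-node";

// Create the inference session once at startup and reuse it per frame.
// "models/yolov8n.onnx" is an illustrative path, not the script's actual one.
const yoloSession = await ort.InferenceSession.create("models/yolov8n.onnx");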

Configuration

Variable                Default                   Description
IMAGE_TOPIC             /camera/image_raw         Input camera topic
ANNOTATED_IMAGE_TOPIC   /camera/image_annotated   Output annotated topic
NO_YOLO                 false                     Set true to disable YOLO (passthrough)
MAX_IMAGES              0                         Limit frames processed (0 = unlimited)
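
These are environment variables. A plausible sketch of how the script might resolve them (the names and defaults match the table above):

// Resolve configuration from the environment, falling back to the defaults.
const IMAGE_TOPIC = process.env.IMAGE_TOPIC || "/camera/image_raw";
const ANNOTATED_IMAGE_TOPIC =
  process.env.ANNOTATED_IMAGE_TOPIC || "/camera/image_annotated";
const NO_YOLO = process.env.NO_YOLO === "true";
const MAX_IMAGES = Number(process.env.MAX_IMAGES || 0); // 0 = unlimited

For example, NO_YOLO=true bun robot:vision would run the script in passthrough mode.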

Detectable Objects

YOLO is trained on the COCO dataset with 80 classes including:

Common in robotics:

  • person, car, truck, bus, bicycle, motorcycle
  • chair, couch, bed, dining table
  • bottle, cup, bowl
  • cat, dog, bird

For simulation, consider using vision_colors.js, which detects colored shapes that are easier to place in Gazebo.

Performance Tips

  1. Use queue_length=1: Only process the latest frame
  2. Resize images: Smaller images = faster inference
  3. Use nano model: yolov8n is fastest for CPU
  4. Skip frames: Process every Nth frame if needed, as in the snippet below

// Skip frames for better performance
let frameCount = 0;
subscriber.subscribe(async (msg) => {
  frameCount++;
  if (frameCount % 3 !== 0) return; // Process every 3rd frame
  // ... process
});
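
Alternatively, rosbridge can throttle on the server side via roslib's throttle_rate option (milliseconds between messages), so unprocessed frames never cross the websocket; a sketch with an assumed 500 ms rate:

// Let rosbridge drop frames server-side: at most one message per 500 ms.
const throttledSubscriber = new ROSLIB.Topic({
  ros,
  name: IMAGE_TOPIC,
  messageType: IMAGE_MESSAGE_TYPE,
  throttle_rate: 500, // minimum milliseconds between messages
  queue_length: 1
});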

Common Issues

Issue                Cause                    Solution
No images received   Topic mismatch           Check IMAGE_TOPIC matches the simulation
Slow inference       Model too large          Use the yolov8n (nano) model
No detections        Objects not in COCO      Use color detection for simulation
Memory issues        Too many frames queued   Set queue_length=1

Alternative: Color Detection

For simulation environments, color detection (vision_colors.js) may work better, since you can place distinctly colored objects in Gazebo:

bun robot:vision:colors

Next Steps

  • Combine vision with obstacle avoidance for smarter navigation
  • Train a custom YOLO model for your specific objects
  • Implement object tracking across frames

Computer vision enables robots to understand their environment beyond geometric sensor data. YOLO provides a robust starting point for object detection.