MissionRobo · Career roadmap
Perception / Computer Vision Engineer
Skydio · Anduril · Dedrone · Apptronik · 20 weeks
The 20-week path from "I can run YOLO on a webcam" to "I ship robust perception on a robot." Targets perception roles at the top autonomy companies, where the bar includes multi-view geometry, sensor fusion, and real-time deployment.
Advanced · ~20 weeks · 12 topics · 15 resources
01. CV foundations
Image formation, multi-view geometry, classical features.
Camera models, intrinsics, extrinsics (Required): Pinhole + distortion, what camera calibration actually computes.
You will read camera_info.yaml every day. Know what every field means.
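As a rough illustration of what those fields encode, here is a minimal sketch of pinhole projection with a two-coefficient radial distortion term. The function name and the intrinsics values are hypothetical and chosen for illustration; real calibration files often carry more distortion coefficients than the two assumed here.

```python
def project_point(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D point in the camera frame (meters) to pixel coordinates.

    Assumes a pinhole model plus radial distortion with coefficients k1, k2.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Perspective divide: normalized image coordinates.
    xn, yn = x / z, y / z
    # Radial distortion scales the normalized point by a polynomial in r^2.
    r2 = xn * xn + yn * yn
    scale = 1.0 + k1 * r2 + k2 * r2 * r2
    xd, yd = xn * scale, yn * scale
    # Intrinsics (focal lengths and principal point) map to pixels.
    return fx * xd + cx, fy * yd + cy


# Hypothetical intrinsics for a 640x480 camera, no distortion.
u, v = project_point((0.2, -0.1, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(u, v)
```

With a negative k1 (barrel distortion), the same point lands closer to the principal point, which is one quick sanity check when reading a calibration result.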
Classical features (SIFT, ORB, FAST) (Recommended): Pre-deep-learning detection. Still used inside many SLAM stacks.
ORB-SLAM3 is built on classical features. You cannot debug it without knowing them.
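ORB descriptors are binary strings, so they are compared with Hamming distance rather than Euclidean distance. The sketch below shows brute-force matching on toy 8-bit descriptors stored as Python ints; real ORB descriptors are 256 bits, and the function names and the distance cutoff are assumptions made for illustration.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def match_descriptors(query, train, max_distance=64):
    """Brute-force nearest-neighbor matching with a distance cutoff.

    Returns a list of (query_index, train_index, distance) tuples.
    """
    matches = []
    if not train:
        return matches
    for qi, q in enumerate(query):
        # Closest train descriptor to this query descriptor.
        ti, dist = min(
            ((i, hamming(q, t)) for i, t in enumerate(train)),
            key=lambda pair: pair[1],
        )
        if dist <= max_distance:
            matches.append((qi, ti, dist))
    return matches


# Toy 8-bit descriptors (hypothetical values).
query = [0b10110010, 0b00001111]
train = [0b00001110, 0b10110011, 0b11111111]
print(match_descriptors(query, train, max_distance=2))
```

Production stacks usually add a ratio test or cross-check on top of this to reject ambiguous matches.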
Epipolar geometry + triangulation (Required): How two cameras give you depth. The math behind stereo and SfM.
Hartley & Zisserman chapters 9-12 are the canonical reference. Heavy but worth it.
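To make the two ideas concrete, here is a small sketch on synthetic data. It assumes the convention X2 = R·X1 + t between the two camera frames, checks the epipolar constraint x2ᵀ E x1 = 0 with E = [t]ₓR, and recovers the 3D point with midpoint triangulation (a simple alternative to the linear and optimal methods in the book). All names and numbers are illustrative.

```python
import math


def matvec(M, v):
    """3x3 matrix times 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def epipolar_residual(x1, x2, R, t):
    """x2^T E x1 with E = [t]_x R; zero for a noise-free correspondence."""
    return dot(x2, cross(t, matvec(R, x1)))


def triangulate_midpoint(x1, x2, R, t):
    """Midpoint of the closest points on the two viewing rays (camera-2 frame)."""
    d1 = matvec(R, x1)  # ray 1 direction in camera-2 frame; the ray starts at t
    d2 = x2             # ray 2 starts at the camera-2 origin
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    p, q = dot(t, d1), dot(t, d2)
    det = b * b - a * c
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel; depth is unobservable")
    s = (p * c - b * q) / det  # distance parameter along ray 1
    u = (b * p - a * q) / det  # distance parameter along ray 2
    on_ray1 = [t[i] + s * d1[i] for i in range(3)]
    on_ray2 = [u * d2[i] for i in range(3)]
    return [(on_ray1[i] + on_ray2[i]) / 2.0 for i in range(3)]


# Synthetic two-view setup: small rotation about y plus a sideways baseline.
theta = 0.1
R = [[math.cos(theta), 0.0, math.sin(theta)],
     [0.0, 1.0, 0.0],
     [-math.sin(theta), 0.0, math.cos(theta)]]
t = [-0.3, 0.0, 0.05]
X1 = [0.5, 0.2, 4.0]                              # point in camera-1 frame
X2 = [r + ti for r, ti in zip(matvec(R, X1), t)]  # same point in camera-2 frame
x1 = [c / X1[2] for c in X1]                      # normalized image coordinates
x2 = [c / X2[2] for c in X2]

print("epipolar residual:", epipolar_residual(x1, x2, R, t))
print("triangulated:", triangulate_midpoint(x1, x2, R, t))
print("ground truth:", X2)
```

With noisy measurements the residual is no longer zero and the two rays no longer intersect, which is why the midpoint (or a reprojection-error minimizer) is needed at all.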
02. Modern deep-learning vision
Detection, segmentation, depth, NeRF, transformers.
Deep learning for vision (Stanford CS231n) (Required): CNNs, attention, modern training tricks.
The Stanford course is still the best free intro. Lectures + assignments take ~6 weeks if you do them properly.
Object detection (YOLO → DETR) (Required): YOLO for speed, transformer-based detectors for quality.
YOLOv8/v9 is the workhorse on real robots. DETR variants are the research direction.
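One practical difference between the two families: YOLO-style detectors such as YOLOv8 typically rely on non-maximum suppression (NMS) to remove duplicate boxes, while DETR frames detection as set prediction and was designed to avoid that step. A minimal sketch of IoU and greedy NMS, with hypothetical boxes in (x1, y1, x2, y2) format:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections and one separate object (made-up values).
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```

On an edge device this step often runs inside the exported model or a vendor plugin rather than in Python, but the logic is the same.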
Segmentation (Mask R-CNN, SAM 2) (Recommended): Per-pixel labeling for scene understanding.
SAM 2 (Meta, 2024) is the default for new pipelines. Mask R-CNN still ships in legacy stacks.
Depth estimation (stereo + mono) (Recommended): Stereo block matching, monocular depth networks (MiDaS, Depth Anything).
Stereo is the reliable workhorse; mono is the cheap option that's gotten surprisingly good.
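For a rectified stereo pair, depth follows from disparity as Z = f·B/d, and a fixed disparity error grows into a depth error roughly proportional to Z². The sketch below uses a hypothetical rig (700 px focal length, 12 cm baseline) and an assumed quarter-pixel matching error to show how quickly accuracy falls off with range.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        return float("inf")  # zero disparity means the point is at infinity
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=0.25):
    """First-order depth uncertainty: dZ ~= Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


# Hypothetical rig: 700 px focal length, 0.12 m baseline.
f_px, baseline = 700.0, 0.12
for d in (42.0, 4.2):
    z = depth_from_disparity(d, f_px, baseline)
    err = depth_error(z, f_px, baseline)
    print(f"disparity {d:5.1f} px -> depth {z:5.2f} m, error about {err:.3f} m")
```

The near point comes out accurate to about a centimeter while the far one is off by more than a meter, which is the usual argument for a wider baseline or a fused range sensor at long distances.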
Depth Anything V2 (FREE): Current best free monocular depth model. Plug-and-play for many use cases.
NeRF and 3D reconstruction (Optional): Neural radiance fields, Gaussian splatting, novel view synthesis.
The hot research area. Production use is still emerging, but it is worth being fluent in for senior interviews.
03. Sensor fusion + calibration
Camera × LiDAR × IMU: what separates demo perception from production.
Multi-sensor calibration (Kalibr) (Required): How to align camera + IMU + LiDAR coordinate frames.
Kalibr is the open-source standard for multi-camera and camera-IMU calibration. Reading the source is half the job.
Camera + LiDAR fusion (Recommended): Project LiDAR points onto images, project image features into 3D.
Autonomous vehicles have done this for a decade. Robotics is catching up.
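The first half of that, projecting LiDAR points onto an image, can be sketched in a few lines. The example assumes the extrinsics (R, t) map LiDAR coordinates into the camera frame, that the LiDAR frame is x-forward, y-left, z-up, and that the camera optical frame is x-right, y-down, z-forward. The intrinsics, the zero translation, and the point values are made up for illustration.

```python
def project_lidar_to_image(points_lidar, R, t, fx, fy, cx, cy, width, height):
    """Return (u, v, depth) for LiDAR points that land inside the image."""
    pixels = []
    for p in points_lidar:
        # Extrinsics: LiDAR frame -> camera frame.
        xc = R[0][0] * p[0] + R[0][1] * p[1] + R[0][2] * p[2] + t[0]
        yc = R[1][0] * p[0] + R[1][1] * p[1] + R[1][2] * p[2] + t[1]
        zc = R[2][0] * p[0] + R[2][1] * p[1] + R[2][2] * p[2] + t[2]
        if zc <= 0:
            continue  # behind the camera
        # Intrinsics: camera frame -> pixels (distortion ignored here).
        u = fx * xc / zc + cx
        v = fy * yc / zc + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v, zc))
    return pixels


# Axis swap from LiDAR (x fwd, y left, z up) to camera (x right, y down, z fwd).
R = [[0.0, -1.0, 0.0],
     [0.0, 0.0, -1.0],
     [1.0, 0.0, 0.0]]
t = [0.0, 0.0, 0.0]  # hypothetical: sensors co-located
points = [(5.0, 0.0, 0.0),    # straight ahead
          (-5.0, 0.0, 0.0),   # behind the camera, dropped
          (5.0, 10.0, 0.0),   # far to the left, outside the image
          (4.0, -1.0, 0.5)]   # ahead, slightly right and up
print(project_lidar_to_image(points, R, t, 500.0, 500.0, 320.0, 240.0, 640, 480))
```

The returned depth per pixel is what lets image detections be lifted back into 3D, which is the second half of the fusion described above.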
04. Production deployment
TensorRT, ONNX, real-time pipelines, edge inference.
TensorRT for edge inference (Recommended): NVIDIA's optimizer for running models fast on Jetson.
Senior perception engineers can convert a PyTorch model to TensorRT in a day. Junior engineers can't. Be the senior.
ONNX as the interchange format (Recommended): Convert between PyTorch, TF, JAX, edge runtimes.
ONNX is the duct tape that holds modern MLOps together for vision.
Generated from missionrobo.com/roadmaps/perception-cv-engineer · Updated 6/10/2026