Pose Estimation
License: Unknown


We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. The training data consists of a texture-mapped 3D object model or images of the object in known 6D poses. The benchmark comprises: i) eight datasets in a unified format that cover different practical scenarios, including two new datasets focusing on varying lighting conditions, ii) an evaluation methodology with a pose-error function that deals with pose ambiguities, iii) a comprehensive evaluation of 15 diverse recent methods that captures the status quo of the field, and iv) an online evaluation system that is open for continuous submission of new results. The evaluation shows that methods based on point-pair features currently perform best, outperforming template-matching methods, learning-based methods and methods based on 3D local features.

Data Collection


LM (a.k.a. Linemod) has been the most commonly used dataset for 6D object pose estimation. It contains 15 texture-less household objects with discriminative color, shape and size. Each object is associated with a test image set showing one annotated object instance with significant clutter but only mild occlusion. LM-O (a.k.a. Linemod-Occluded) provides ground-truth annotation for all other instances of the modeled objects in one of the test sets. This introduces challenging test cases with various levels of occlusion.


IC-MI (a.k.a. Tejani et al.) contains models of two texture-less and four textured household objects. The test images show multiple object instances with clutter and slight occlusion. IC-BIN (a.k.a. Doumanoglou et al., scenario 2) includes test images of two objects from IC-MI, which appear in multiple locations with heavy occlusion in a bin-picking scenario. We have removed test images with low-quality ground-truth annotations from both datasets, and refined the annotations for the remaining images in IC-BIN.


T-LESS features 30 industry-relevant objects with no significant texture or discriminative color. The objects exhibit symmetries and mutual similarities in shape and/or size, and a few objects are a composition of other objects. T-LESS includes images from three different sensors and two types of 3D object models. For our evaluation, we only used RGB-D images from the Primesense sensor and the automatically reconstructed 3D object models.


RU-APC (a.k.a. Rutgers APC) includes 14 textured products from the Amazon Picking Challenge 2015, each associated with test images of a cluttered warehouse shelf. The camera was equipped with LED strips to ensure constant lighting. From the original dataset, we omitted ten objects that are non-rigid or poorly captured by the depth sensor, and included only one of the four images captured from the same viewpoint.


TUD-L (TU Dresden Light) and TYO-L (Toyota Light) are two new datasets with household objects captured under different settings of ambient and directional light. TUD-L contains training and test image sequences that show three moving objects under eight lighting conditions. The object poses were annotated by manually aligning the 3D object model with the first frame of the sequence and propagating the initial pose through the sequence using ICP. TYO-L contains 21 objects, each captured in multiple poses on a table-top setup, with four different table cloths and five different lighting conditions. To obtain the ground-truth poses, manually chosen correspondences were utilized to estimate rough poses, which were then refined by ICP. The images in both datasets are labeled by categorized lighting conditions.

Data Format

Directory structure

The datasets have the following structure:

  • models[_MODELTYPE] - 3D object models.

  • models[_MODELTYPE]_eval - "Uniformly" resampled and decimated 3D object models used for calculation of errors of object pose estimates.

  • train[_TRAINTYPE]/X (optional) - Training images of object X.

  • val[_VALTYPE]/Y (optional) - Validation images of scene Y.

  • test[_TESTTYPE]/Y - Test images of scene Y.

  • camera.json - Camera parameters (for sensor simulation only; per-image camera parameters are in files scene_camera.json - see below).

  • - Dataset-specific information.

  • test_targets_bop19.json - A list of test targets used for the evaluation in the BOP Challenge 2019/2020. The same list was also used in the ECCV 2018 paper, with the exception of T-LESS, for which the list from test_targets_bop18.json was used.

MODELTYPE, TRAINTYPE, VALTYPE and TESTTYPE are optional and used if more data types are available (e.g. images from different sensors).

The images in train, val and test folders are organized into subfolders:

  • rgb/gray - Color/gray images.
  • depth - Depth images (saved as 16-bit unsigned short).
  • mask (optional) - Masks of object silhouettes.
  • mask_visib (optional) - Masks of the visible parts of object silhouettes.

The corresponding images across the subfolders have the same ID, e.g. rgb/000000.png and depth/000000.png are the color and depth images of the same RGB-D frame. The naming convention for the masks is IMID_GTID.png, where IMID is an image ID and GTID is the index of the ground-truth annotation (stored in scene_gt.json).
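The naming convention above can be captured in a small helper. This is a sketch, not part of the official tooling; the function name `frame_paths` is our own, and it assumes the 6-digit zero padding used for scene, image and ground-truth IDs throughout the datasets:

```python
import os

def frame_paths(split_dir, scene_id, im_id, gt_id=0):
    """Build the file paths that belong to one RGB-D frame.

    split_dir is e.g. "test" or "test_primesense"; scene, image and
    ground-truth IDs are zero-padded to 6 digits as in the datasets.
    """
    scene = os.path.join(split_dir, f"{scene_id:06d}")
    return {
        "rgb": os.path.join(scene, "rgb", f"{im_id:06d}.png"),
        "depth": os.path.join(scene, "depth", f"{im_id:06d}.png"),
        # Masks are named IMID_GTID.png (one mask per annotated instance).
        "mask": os.path.join(scene, "mask", f"{im_id:06d}_{gt_id:06d}.png"),
        "mask_visib": os.path.join(
            scene, "mask_visib", f"{im_id:06d}_{gt_id:06d}.png"),
    }
```

For example, `frame_paths("test", 1, 0, 2)` points at `test/000001/rgb/000000.png` and the visible-part mask `test/000001/mask_visib/000000_000002.png`.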

Training, validation and test images

If both validation and test images are available for a dataset, the ground-truth annotations are public only for the validation images. Performance scores for test images with private ground-truth annotations can be calculated in the BOP evaluation system.

Camera parameters

Each set of images is accompanied by a file scene_camera.json, which contains the following information for each image:

  • cam_K - 3x3 intrinsic camera matrix K (saved row-wise).
  • depth_scale - Multiply the depth image with this factor to get depth in mm.
  • cam_R_w2c (optional) - 3x3 rotation matrix R_w2c (saved row-wise).
  • cam_t_w2c (optional) - 3x1 translation vector t_w2c.
  • view_level (optional) - Viewpoint subdivision level, see below.

The matrix K may be different for each image. For example, the principal point is not constant for images in T-LESS as the images were obtained by cropping a region around the projection of the origin of the world coordinate system.

Note that the intrinsic camera parameters can be found also in file camera.json in the root folder of a dataset. These parameters are meant only for simulation of the used sensor when rendering training images.
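Putting the two fields together, a minimal sketch of reading one per-image record (the function name `parse_image_camera` is ours; `cam` is one entry of scene_camera.json, `depth_raw` the corresponding 16-bit depth image):

```python
import numpy as np

def parse_image_camera(cam, depth_raw):
    """Return the 3x3 intrinsic matrix K and the depth image in mm.

    cam is one per-image record from scene_camera.json; cam_K is
    stored row-wise, and the raw depth values are multiplied by
    depth_scale to obtain millimeters.
    """
    K = np.array(cam["cam_K"], dtype=np.float64).reshape(3, 3)
    depth_mm = depth_raw.astype(np.float64) * cam["depth_scale"]
    return K, depth_mm
```

Note that depth_scale varies per image in some datasets, which is why it must be read from scene_camera.json rather than assumed constant.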

P_w2i = K * [R_w2c, t_w2c] is the camera matrix which transforms 3D point p_w = [x, y, z, 1]' in the world coordinate system to 2D point p_i = [u, v, 1]' in the image coordinate system: s * p_i = P_w2i * p_w.

Ground-truth annotations

The ground truth object poses are provided in files scene_gt.json which contain the following information for each annotated object instance:

  • obj_id - Object ID.
  • cam_R_m2c - 3x3 rotation matrix R_m2c (saved row-wise).
  • cam_t_m2c - 3x1 translation vector t_m2c.

P_m2i = K * [R_m2c, t_m2c] is the camera matrix which transforms 3D point p_m = [x, y, z, 1]' in the model coordinate system to 2D point p_i = [u, v, 1]' in the image coordinate system: s * p_i = P_m2i * p_m.
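The projection s * p_i = K * [R_m2c, t_m2c] * p_m can be sketched in a few lines (the function name `project_model_points` is ours; inputs are the fields from scene_camera.json and scene_gt.json, with translations in mm):

```python
import numpy as np

def project_model_points(pts_m, K, R_m2c, t_m2c):
    """Project Nx3 model-space points to pixel coordinates.

    Implements s * p_i = K * [R_m2c | t_m2c] * p_m: transform to the
    camera frame, apply the intrinsics, then divide by depth s.
    """
    pts_c = pts_m @ R_m2c.T + t_m2c.reshape(1, 3)  # model -> camera frame
    proj = pts_c @ K.T                             # apply intrinsics K
    return proj[:, :2] / proj[:, 2:3]              # perspective divide by s
```

The same function projects world-space points when given R_w2c and t_w2c instead.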

Meta information about the ground-truth poses

The following meta information about the ground-truth poses is provided in files scene_gt_info.json (calculated using scripts/, with delta = 5mm for ITODD, 15mm for other datasets, and 5mm for all photorealistic training images provided for the BOP Challenge 2020):

  • bbox_obj - 2D bounding box of the object silhouette given by (x, y, width, height), where (x, y) is the top-left corner of the bounding box.
  • bbox_visib - 2D bounding box of the visible part of the object silhouette.
  • px_count_all - Number of pixels in the object silhouette.
  • px_count_valid - Number of pixels in the object silhouette with a valid depth measurement (i.e. with a non-zero value in the depth image).
  • px_count_visib - Number of pixels in the visible part of the object silhouette.
  • visib_fract - The visible fraction of the object silhouette (= px_count_visib/px_count_all).
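The pixel counts above can be recomputed from the masks and the depth image; this sketch (the function name `silhouette_stats` is ours, and the official values come from the BOP toolkit scripts) shows how the fields relate:

```python
import numpy as np

def silhouette_stats(mask, mask_visib, depth_mm):
    """Recompute the px_count_* and visib_fract fields of scene_gt_info.json.

    mask / mask_visib are the per-instance masks of the full and visible
    silhouette; a depth pixel is valid if it is non-zero.
    """
    m = mask > 0
    mv = mask_visib > 0
    px_count_all = int(m.sum())
    px_count_visib = int(mv.sum())
    px_count_valid = int((m & (depth_mm > 0)).sum())
    visib_fract = px_count_visib / px_count_all if px_count_all else 0.0
    return {
        "px_count_all": px_count_all,
        "px_count_valid": px_count_valid,
        "px_count_visib": px_count_visib,
        "visib_fract": visib_fract,
    }
```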

Acquisition of training images

Most of the datasets include training images which were obtained either by capturing real objects from various viewpoints or by rendering 3D object models (using scripts/).

The viewpoints, from which the objects were rendered, were sampled from a view sphere as in the paper by recursively subdividing an icosahedron. The level of subdivision at which a viewpoint was added is saved in scene_camera.json as view_level (viewpoints corresponding to vertices of the icosahedron have view_level = 0, viewpoints obtained in the first subdivision step have view_level = 1, etc.). To reduce the number of viewpoints while preserving their "uniform" distribution over the sphere surface, one can consider only viewpoints with view_level <= n, where n is the highest considered level of subdivision.
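Selecting such a reduced viewpoint set is a simple filter over scene_camera.json (the function name `filter_viewpoints` is ours; keys in the file are image IDs as strings):

```python
def filter_viewpoints(scene_camera, max_level):
    """Image IDs whose viewpoint was added at subdivision level <= max_level.

    scene_camera is the parsed scene_camera.json dict; view_level = 0
    corresponds to the vertices of the initial icosahedron.
    """
    return sorted(
        int(im_id) for im_id, cam in scene_camera.items()
        if cam.get("view_level", 0) <= max_level
    )
```

For example, with records at levels 0, 2 and 1, `max_level=1` keeps only the first and the third viewpoint.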

For rendering, the radius of the view sphere was set to the distance of the closest occurrence of any annotated object instance over all test images. The distance was calculated from the camera center to the origin of the model coordinate system.

3D object models

The 3D object models are provided in PLY (ASCII) format. All models include vertex normals. Most of the models also include vertex color or vertex texture coordinates, with the texture saved as a separate image. The vertex normals were calculated using MeshLab as the angle-weighted sum of the normals of the faces incident to a vertex.

Each folder with object models contains file models_info.json, which includes the 3D bounding box and the diameter for each object model. The diameter is calculated as the largest distance between any pair of model vertices.
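The diameter definition is a max over all vertex pairs; a brute-force sketch (the function name `model_diameter` is ours, and the O(N^2) memory use is only practical for the decimated models):

```python
import numpy as np

def model_diameter(vertices):
    """Largest distance between any pair of model vertices (Nx3 array).

    Brute-force over all pairs; for dense meshes one would restrict
    the search to the vertices of the convex hull first.
    """
    diff = vertices[:, None, :] - vertices[None, :, :]
    return float(np.sqrt((diff ** 2).sum(-1)).max())
```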

Coordinate systems

All coordinate systems (model, camera, world) are right-handed. In the model coordinate system, the Z axis points up (when the object is standing "naturally up-right") and the origin coincides with the center of the 3D bounding box of the object model. The camera coordinate system is as in OpenCV with the camera looking along the Z axis.

Units


  • Depth images: See files camera.json/scene_camera.json in individual datasets.
  • 3D object models: 1 mm
  • Translation vectors: 1 mm


Citation

@article{hodan2018bop,
       author = {{Hodan}, Tomas and {Michel}, Frank and {Brachmann}, Eric and {Kehl}, Wadim and {Glent Buch}, Anders and {Kraft}, Dirk and {Drost}, Bertram and {Vidal}, Joel and {Ihrke}, Stephan and {Zabulis}, Xenophon and {Sahin}, Caner and {Manhardt}, Fabian and {Tombari}, Federico and {Kim}, Tae-Kyun and {Matas}, Jiri and {Rother}, Carsten},
        title = "{BOP: Benchmark for 6D Object Pose Estimation}",
      journal = {arXiv e-prints},
     keywords = {Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics},
         year = 2018,
        month = aug,
          eid = {arXiv:1808.08319},
        pages = {arXiv:1808.08319},
archivePrefix = {arXiv},
       eprint = {1808.08319},
 primaryClass = {cs.CV}
}