The PubFig dataset is divided into 2 parts:

  1. The Development Set contains images of 60 individuals. This set should be used when developing your algorithm, so as to avoid overfitting on the evaluation set. There is NO overlap between this set and the evaluation set, nor between this set and the people in the LFW dataset.
  2. The Evaluation Set contains images of the remaining 140 individuals. This is the dataset on which you can evaluate your algorithm to see how it performs.

Due to copyright issues, we cannot distribute image files in any format to anyone. Instead, we have made available a list of image URLs from which you can download the images yourself. We realize that this makes it impossible to compare numbers exactly, as image links will slowly disappear over time, but we have no other option. This appears to be the direction in which other large web-based databases are evolving. We hope to periodically update the dataset, removing broken links and adding new ones, allowing for close-to-exact comparisons.
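
Since the images must be fetched from third-party URLs, it can help to check each download against the md5sum field listed in the URL files. The following Python sketch shows one way to do this. It assumes that md5sum is the hexadecimal MD5 digest of the raw image bytes, which this page does not state explicitly; the function names md5_matches and fetch_and_verify are illustrative and not part of the dataset.

```python
import hashlib
import urllib.request
from typing import Optional


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the MD5 digest of data equals expected_hex.

    Assumption: the md5sum field holds a hex digest of the file contents.
    """
    return hashlib.md5(data).hexdigest() == expected_hex.strip().lower()


def fetch_and_verify(url: str, expected_hex: str, timeout: float = 10.0) -> Optional[bytes]:
    """Download url and return its bytes, or None if the link is dead
    or the content no longer matches the expected checksum."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            data = response.read()
    except OSError:
        # Covers URLError, HTTPError, and timeouts: treat as a broken link.
        return None
    return data if md5_matches(data, expected_hex) else None
```

Images for which fetch_and_verify returns None can be recorded as missing, so that reported results state how many images of each set were actually available.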

Data Format

Almost all datafiles follow a "tab-separated values" format. The first two lines are generally like this:

# PubFig Dataset v1.2 - filename.txt -
#    person    imagenum    url    rect    md5sum

The first line identifies the name and version of the dataset, the filename, and has a link back to this website. The second line defines the fields in the file, separated by tabs ('\t'). In this example (similar to the dev_urls.txt and eval_urls.txt files), there are 5 fields: person, imagenum, url, rect, and md5sum. The first two are common to many of the datafiles and are the name of the person and an image index number used to refer to a specific image of that individual. Note that image numbers are not necessarily sequential for each person -- there are "holes" in the counting.

Subsequent lines contain one entry per line, with field values also separated by tabs.
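
The format described above can be read with a few lines of Python. This is a minimal sketch, not an official loader: the function name read_pubfig_tsv is hypothetical, and the sample rows (names, URLs, rect values, and checksums) are invented placeholders rather than real dataset entries. All values are kept as strings, since the layout of fields such as rect is not specified in this section.

```python
import io
from typing import Dict, Iterable, Iterator


def read_pubfig_tsv(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per entry, keyed by the field names in the second header line."""
    it = iter(lines)
    banner = next(it)  # line 1: dataset name, version, filename
    header = next(it)  # line 2: tab-separated field names
    if not banner.startswith("#") or not header.startswith("#"):
        raise ValueError("expected two header lines starting with '#'")
    fields = header.lstrip("#").strip().split("\t")
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        values = line.split("\t")
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
        yield dict(zip(fields, values))


# Invented sample data in the same shape as dev_urls.txt / eval_urls.txt.
sample = (
    "# PubFig Dataset v1.2 - filename.txt -\n"
    "#\tperson\timagenum\turl\trect\tmd5sum\n"
    "Example Person\t1\thttp://example.com/a.jpg\tRECT_A\t0123456789abcdef0123456789abcdef\n"
    "Example Person\t4\thttp://example.com/b.jpg\tRECT_B\tfedcba9876543210fedcba9876543210\n"
)
records = list(read_pubfig_tsv(io.StringIO(sample)))
for rec in records:
    print(rec["person"], rec["imagenum"], rec["url"])
```

Because image numbers have holes (1 and 4 in the sample), an image is best identified by the pair (person, imagenum) rather than by its position in the file.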


The database is made available only for non-commercial use. If you use this dataset, please cite the following paper:

