MegaFace
2D Classification
Face
License: Research Only

Overview

In total, once clustered and optimized, MF2 contains 4,753,320 faces and 672,057 identities. On average this is 7.07 photos per identity, with a minimum of 3 and a maximum of 2,469 photos per identity. We expanded the tight-crop version by re-downloading the clustered faces and saving a loosely cropped version. The tightly cropped dataset requires 159GB of space, while the loosely cropped version is split into 14 files of 65GB each, for a total of 910GB. To gather statistics on age and gender, we ran the WIKI-IMDB age and gender detection models over the loosely cropped version of the data set. We found that females accounted for 41.1% of subjects and males for 58.8%. The median gender variance within identities was 0. The average age range within identities was 16.1 years, while the median was 12 years. The distributions can be found in the supplementary material.

A trade-off of this algorithm is that its parameters must strike a balance between noise and quantity of data. The VGG-Face work noted that, given the choice between a larger but more impure data set and a smaller hand-cleaned one, the larger can actually give better performance. A strong reason for opting to remove most faces from the initial unlabeled corpus was detection error: we found that many images were actually non-faces. There were also many identities that did not appear more than once, and these would be less useful for learning algorithms. Visual inspection of 50 faces randomly discarded by the algorithm showed that 14 were non-faces and 36 were not found more than twice in their respective Flickr accounts. In a complete audit of the clustering algorithm, the reasons for discarding faces break down as follows: 69% of faces fell below the 3-photo threshold for an identity, 4% were removed from clusters as impurities, and 27% belonged to clusters that remained impure even after purification.
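As a quick sanity check, the headline figures above are internally consistent; the short Python sketch below reproduces the per-identity average and the loose-crop storage total from the quoted numbers alone.

# Sanity-check the MF2 statistics quoted above (numbers from this page).
faces = 4_753_320        # total clustered faces
identities = 672_057     # total identities
parts, part_gb = 14, 65  # loosely cropped archive: 14 files of 65GB each

print(f"photos per identity: {faces / identities:.2f}")  # -> 7.07
print(f"loose-crop total:    {parts * part_gb} GB")      # -> 910 GB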

Data Collection

To create a data set that includes hundreds of thousands of identities, we utilize the massive collection of Creative Commons photographs released by Flickr. This set contains roughly 100M photos from over 550K individual Flickr accounts. Not all photographs in the set contain faces. Following the MegaFace challenge, we sift through this massive collection and extract faces detected using DLIB's face detector. To conserve disk space across millions of faces, we initially saved only the crop plus 2% of the cropped area for further processing; after collecting and cleaning our final data set, we re-downloaded the final faces at a higher crop ratio (70%). As the Flickr data is noisy and has sparse identities (many identities have only a single photo, while we target multiple photos per identity), we processed the full 100M-photo Flickr set to maximize the number of identities. We employed a distributed queue system, RabbitMQ, to distribute face detection work across 60 compute nodes, each saving detected faces locally; a second collection process then aggregates the faces onto a single machine. To favor Flickr accounts more likely to contain multiple faces of the same identity, we ignore all accounts with fewer than 30 photos. In total we obtained 40M unlabeled faces across 130,154 distinct Flickr accounts (representing all accounts with more than 30 face photos). The cropped photos take over 1TB of storage. As the photos were taken with different camera settings, they range from low resolution (90x90px) to high resolution (800x800+px). In total, the distributed process of collecting and aggregating photos took 15 days.
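To make the pipeline concrete, a single detection worker in this setup might look like the sketch below. It uses the real pika (RabbitMQ) and dlib APIs, but the queue name, the message format (a plain image path), and the margin helper are illustrative assumptions rather than the authors' actual code; the 2% margin here is interpreted as an expansion of each side of the detected box.

# Illustrative RabbitMQ worker: consume image paths, detect faces with
# DLIB, and save each crop expanded by a small margin. Queue name,
# message format, and margin interpretation are assumptions of this sketch.
import os
import numpy as np
import pika          # RabbitMQ client
import dlib
from PIL import Image

detector = dlib.get_frontal_face_detector()

def crop_with_margin(img, box, margin=0.02):
    # Expand the detected box by `margin` of its width/height per side.
    dx = int((box.right() - box.left()) * margin)
    dy = int((box.bottom() - box.top()) * margin)
    return img.crop((max(box.left() - dx, 0), max(box.top() - dy, 0),
                     min(box.right() + dx, img.width),
                     min(box.bottom() + dy, img.height)))

def on_message(channel, method, properties, body):
    path = body.decode()
    img = Image.open(path).convert("RGB")
    for i, box in enumerate(detector(np.asarray(img))):
        crop_with_margin(img, box).save(
            f"{os.path.splitext(path)[0]}_face{i}.jpg")
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="face_jobs", durable=True)
channel.basic_consume(queue="face_jobs", on_message_callback=on_message)
channel.start_consuming()

Running one such consumer per node yields the work distribution described above, with the broker handling acknowledgement and redelivery if a node fails.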

Data Annotation

Labeling million-scale data manually is challenging, and while such labels are useful for developing algorithms, there are almost no established approaches for producing them at controlled cost. Companies like Mobileye, Tesla, and Facebook hire thousands of human labelers, costing millions of dollars. Additionally, people make mistakes and get confused with face recognition tasks, creating a need to re-test and validate that further adds to costs. We thus look to automated or semi-automated methods to improve the purity of collected data.

There have been several approaches to automated data cleaning. O. M. Parkhi et al. used near-duplicate removal to improve data quality. G. Levi et al. used age and gender consistency measures. T. L. Berg et al. and X. Zhang et al. included text from news captions describing celebrity names. H.-W. Ng et al. pose data cleaning as a quadratic programming problem, with constraints enforcing the assumptions that noise makes up a relatively small portion of the collected data, that gender is uniform within an identity, that identities consist of a majority of the same person, and that a single photo cannot contain the same person twice. All of these methods proved important for data cleaning given rough initial labels, e.g., a celebrity name. In our case, rough labels are not given. We observe, however, that face recognizers perform well at small scale, and we leverage their embeddings as a similarity measure for labeling.
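A minimal sketch of that embedding-based cleaning idea, assuming L2-normalized embeddings from some face recognizer, might look like the following; the use of DBSCAN, the cosine-distance threshold, and the parameter values are illustrative assumptions, not the paper's actual algorithm.

# Illustrative cleaning pass: cluster one account's face embeddings by
# cosine distance and keep only identities with at least 3 photos.
# The eps value and the choice of DBSCAN are assumptions of this sketch.
import numpy as np
from sklearn.cluster import DBSCAN

def clean_account(embeddings, min_photos=3):
    # embeddings: (n_faces, d) array of L2-normalized face embeddings.
    labels = DBSCAN(eps=0.4, min_samples=min_photos,
                    metric="cosine").fit_predict(embeddings)
    identities = {}
    for cluster_id in set(labels) - {-1}:       # -1 marks unclustered noise
        members = np.flatnonzero(labels == cluster_id)
        if len(members) >= min_photos:          # enforce the 3-photo minimum
            identities[cluster_id] = members
    return identities                           # identity -> face indices

# Usage with stand-in random embeddings (real ones come from a recognizer):
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 128))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(clean_account(emb))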

Citation

Please use the following citation when referencing the dataset:

@inproceedings{nech2017level,
  title={Level Playing Field For Million Scale Face Recognition},
  author={Nech, Aaron and Kemelmacher-Shlizerman, Ira},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2017}
}

License

By downloading the dataset you must agree to the following terms:

[RESEARCHER_FULLNAME] (the "Researcher") has requested permission to use the MegaFace database (the "Database") at the University of Washington. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:

  1. Researcher shall use the Database only for non-commercial research and educational purposes.
  2. University of Washington makes no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
  3. Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the University of Washington, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database.
  4. Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
  5. The University of Washington reserves the right to terminate Researcher's access to the Database at any time.
  6. If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
  7. The law of the State of Washington shall apply to all disputes under this agreement.
Data Summary

Data Format: image
Data Volume: 4700K
File Size: 65GB
Publisher: MegaFace

The MegaFace dataset is the largest publicly available facial recognition dataset, with a million faces and their respective bounding boxes. All images were obtained from Flickr (Yahoo's dataset) and are licensed under Creative Commons.