We select one million celebrities, who are real people and have or had public attention. The selection steps are described in detail in the following paragraphs. First, we select a subset of entities from a knowledge base called freebase, based on the information within freebase. In freebase, each entity is identified by a unique key (called the machine identifier, or mid) and is associated with rich properties. More specifically, we select the entities whose properties satisfy all three of the following conditions.
• The object type of the entity is defined as “people.person” in freebase. This condition means that we select entities that freebase claims to be real people. We do not include movie characters, since their appearance is not strictly defined, especially when a classic movie is remade.
• The entity is required to have at least one of the properties unique to human beings, such as “person’s name”, “place of birth”, “date of birth”, or “person’s professions”. This condition removes entities whose information is too sparse for us to collect and label images. It also helps remove some entities whose object type is mislabeled as “people.person” in freebase.
• If the date of birth is available for a given entity in freebase, the entity cannot be selected if he or she was born before the mid-nineteenth century. The reason is as follows. The first camera designed for roll film, the “Kodak”, was invented in 1888 and started to become popular in the late nineteenth century, so photographs of people born much earlier are scarce. We cannot rely on drawings or sculptures to recognize people’s faces, since whether they are visually similar to the actual person can be subjective and arguable. An interesting example is the sculpture of John Harvard at Harvard University, which is claimed to have been inspired by a Harvard student, Sherman Hoar, rather than Harvard himself, since no one knew what John Harvard looked like.
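The three conditions above can be read as a single predicate over an entity record. The following is a minimal sketch under stated assumptions, not the pipeline actually used: the dictionary layout, the property keys in `HUMAN_PROPERTIES`, and the ISO-like date format are hypothetical, and the cutoff year 1850 is only one possible reading of "mid-nineteenth century".

```python
from typing import Optional

# Hypothetical property keys; the real freebase schema names may differ.
HUMAN_PROPERTIES = ("name", "place_of_birth", "date_of_birth", "professions")

# Assumed cutoff year: the text only says "mid-nineteenth century".
BIRTH_YEAR_CUTOFF = 1850


def birth_year(entity: dict) -> Optional[int]:
    """Return the birth year, assuming a 'YYYY-MM-DD'-style string, else None."""
    dob = entity.get("date_of_birth")
    if not dob:
        return None
    return int(dob[:4])


def is_candidate(entity: dict) -> bool:
    """Apply the three selection conditions to one entity record."""
    # Condition 1: the entity must be typed as a real person.
    if "people.person" not in entity.get("types", ()):
        return False
    # Condition 2: at least one human-specific property must be present.
    if not any(entity.get(prop) for prop in HUMAN_PROPERTIES):
        return False
    # Condition 3: reject only when a birth date exists and is too early;
    # entities with no recorded birth date pass this check.
    year = birth_year(entity)
    if year is not None and year < BIRTH_YEAR_CUTOFF:
        return False
    return True
```

Note that the third condition is deliberately one-sided: an entity with no recorded date of birth is kept, which matches the wording "if the date of birth is available".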
In the second step, we rank all the entities in the above subset according to the frequency of their occurrence on the web. Then, we select the top one million entities to form our one-million-celebrity list and provide their entity keys (mid) in freebase. The occurrence frequency of a given entity is obtained by counting how many documents contain this entity in a large corpus of billions of web documents.
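The ranking step amounts to a document-frequency count followed by a top-k cut. The sketch below illustrates this under simplifying assumptions: each document is assumed to be already reduced to the set of mids it mentions (how entities are detected in web documents is not specified here), and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def document_frequency(docs: Iterable[Set[str]], candidates: Set[str]) -> Counter:
    """Count, for each candidate mid, the number of documents that contain it."""
    counts: Counter = Counter()
    for mids_in_doc in docs:
        # A document adds at most 1 per entity, however often it mentions it.
        counts.update(mids_in_doc & candidates)
    return counts


def top_k(counts: Counter, k: int) -> List[str]:
    """Return the k most frequent mids, most frequent first."""
    return [mid for mid, _ in counts.most_common(k)]
```

With `k` set to one million and `candidates` set to the output of the first step, `top_k` would yield the final list of entity keys. At the scale of billions of documents the counting would have to be distributed, which this in-memory sketch does not attempt.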