The COX Face Dataset is designed for the problems of Video-to-Still (V2S), Still-to-Video (S2V), and Video-to-Video (V2V) face recognition. The dataset contains 1,000 subjects, each with 1 high-quality still image and 3 video sequences captured in a simulated video-surveillance scenario. Specifically, the still images were collected in a controlled environment and are therefore of high quality and resolution, in frontal view, with normal lighting and neutral expression. In contrast, the video frames are of low resolution and low quality, blurred, captured under poor lighting, and in non-frontal views.
Welcome to the COX Face Database, a database of still and video faces designed for studying three typical scenarios of video-based face recognition: Video-to-Still (V2S), Still-to-Video (S2V), and Video-to-Video (V2V), as shown in Table I. The database comprises still images, taken with a digital camera (DC) of 1,000 seated subjects, and surveillance-like videos of the same 1,000 subjects walking, captured by three different camcorders.
The setting of the COX Face Database simulates real-world V2S, S2V, and V2V matching conditions, providing researchers with solid and challenging experimental data. Admittedly, our dataset is not suited for studying individual sub-problems of face recognition in isolation (such as specific poses or illuminations), but rather for exploring how the usual problems come together in real-world scenarios. By releasing this database to the research community, we hope to encourage the exploration of V2S, S2V, and V2V face matching.
For each subject, still face images were taken with the DC. To capture ID-photo-like images, the DC was mounted on a tripod about 3 meters away from the subject, who was asked to sit on a chair with the face upright and a neutral expression. The photographing room was set up with standard indoor lighting, and the flash of the DC was always used to alleviate shadows caused by top lighting.
As shown in the following figure, to simulate surveillance video, we recorded each subject while he or she was walking. To include more variation in facial appearance, we carefully pre-designed the walking route as well as the mounting of the cameras. Each subject was asked to walk freely from the starting point to the end point, roughly along an S-shaped route. Three camcorders, Cam1, Cam2, and Cam3, were placed at 3 fixed locations, respectively capturing video of the subject while walking on the parts of the route marked in red, green, and blue. The radius of each of the two semicircles in the S-shape is 3 meters.
We de-interlaced the videos with a commercial tool, Aunsoft Final Mate, and ran a commercial face detector, OKAO, to detect the faces in the video clips. Since the face detector is not perfect, it may produce inaccurate or even incorrect detections. For the convenience of later processing, we exploited a simple tracking-like strategy to remove likely outlier detections: if the center of the detected face in one frame is too far from those in the preceding frames, the face in that frame is removed as an outlier. This processing unavoidably discards a small number of video frames, which we believe does no harm to the evaluation. The following three figures show the number of frames per video for each of the three cameras; most video clips have more than 100 frames per subject, and clips from Cam3 mostly have 170 frames.
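The tracking-like filtering described above can be sketched as follows. This is a minimal illustration, not the released preprocessing code; the function name, the pixel threshold, and the "compare against the previous kept frame" simplification are all assumptions for the example.

```python
import math

def remove_outlier_detections(centers, max_jump=40.0):
    """Keep frame indices whose detected face center stays close to the
    center in the previous kept frame; drop the rest as outliers.
    (Illustrative sketch; max_jump is an assumed pixel threshold.)"""
    kept = []
    prev = None
    for idx, (x, y) in enumerate(centers):
        if prev is not None and math.hypot(x - prev[0], y - prev[1]) > max_jump:
            continue  # detection jumped too far: treat as outlier, skip frame
        kept.append(idx)
        prev = (x, y)
    return kept

# A smooth face track with one spurious detection at frame 2:
track = [(100, 100), (104, 101), (400, 300), (108, 103), (112, 104)]
print(remove_outlier_detections(track))  # -> [0, 1, 3, 4]
```

Frame 2 is discarded because its center is far from frame 1's, while the remaining frames form a smooth track.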
As defined in Table I, in the V2S scenario the target set contains still images of persons with known identities, while the query samples are video clips of faces to be recognized, generally by matching against the target still face images. For this scenario, we designed the protocol with the training and testing data configured as in Table IV. As shown in the table, the videos taken by the three camcorders form three separate experiments, i.e., V1-S, V2-S, and V3-S. The 10 random partitions of the 300/700 subjects for training and testing are given in the "V2S partitions" folder of the released database.
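The structure of the 10 random 300/700 train/test splits can be illustrated as below. For reproducible comparisons you should use the partition files released with the database; this sketch only shows how such partitions are shaped (the function name, seed, and 1-based subject IDs are assumptions).

```python
import random

def make_partitions(n_subjects=1000, n_train=300, n_rounds=10, seed=0):
    """Generate n_rounds random disjoint train/test splits of subject IDs.
    (Illustrative only; the official splits ship with the database.)"""
    rng = random.Random(seed)
    partitions = []
    for _ in range(n_rounds):
        ids = list(range(1, n_subjects + 1))
        rng.shuffle(ids)
        partitions.append({"train": sorted(ids[:n_train]),
                           "test": sorted(ids[n_train:])})
    return partitions

parts = make_partitions()
print(len(parts), len(parts[0]["train"]), len(parts[0]["test"]))  # -> 10 300 700
```

Each round uses 300 subjects for training and the remaining 700 for testing, and the two sets are disjoint within every round.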
Compared with the V2S scenario, the target set of the S2V scenario conversely contains videos, while the queries are still face images. Therefore, as shown in Table V, we can likewise form three different experiments, i.e., S-V1, S-V2, and S-V3, according to the source camcorder of the videos in the target set. Similarly, the 10 random partitions of the 300/700 subjects for training and testing are given in the "S2V partitions" folder of the released database.
To form the V2V evaluations, for either the target set or the query set, we have 3 videos per subject, respectively from Cam1, Cam2, and Cam3. Therefore, they can mutually form 6 experiments, as shown in Table VI. Alternatively, one could set up more experiments by taking one or two of the three videos to form the target set while keeping the remainder as queries; this is not considered in this work in order to limit the evaluation burden. Similarly, the 10 random partitions of the 300/700 subjects for training and testing are given in the "V2V partitions" folder of the released database.
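The 6 V2V experiments correspond to the ordered (target, query) pairs of the three camcorders, which can be enumerated directly. The naming below is only illustrative; the official experiment labels are those in Table VI.

```python
from itertools import permutations

cams = ["Cam1", "Cam2", "Cam3"]
# Each ordered pair of distinct camcorders is one V2V experiment:
# the first camcorder supplies the target videos, the second the queries.
experiments = [(target, query) for target, query in permutations(cams, 2)]
print(experiments)  # 6 (target, query) pairs
```

With 3 camcorders there are 3 × 2 = 6 such ordered pairs, matching the 6 experiments of Table VI.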
Qi Wang (email@example.com), Institute of Computing Technology, Chinese Academy of Sciences
Shiguang Shan (firstname.lastname@example.org), Institute of Computing Technology, Chinese Academy of Sciences
Ruiping Wang (email@example.com), Institute of Computing Technology, Chinese Academy of Sciences
The COX Face Database is released to universities and research institutes for research purposes only. To request a copy of the COX Face Database, please proceed as follows: