What's the current state-of-the-art method for determining if an image contains one or more people, and their locations in the image?

EDIT: I've seen some similar work using OpenCV's Haar classifier. Can anyone comment on how this fares compared to other methods?

asked Sep 21 '11 at 15:35

Cerin


edited Sep 22 '11 at 13:29

It's not state-of-the-art any more, but look at "Histograms of Oriented Gradients for Human Detection", an enormously influential paper that describes a system for human detection (with other applications too).

(Sep 22 '11 at 01:55) Jacob Jensen
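
For readers who have not seen the paper, here is a rough sketch of its core descriptor step: a histogram of gradient orientations over one small cell of the image. This is a simplified illustration, not the paper's full pipeline. The function name `cell_histogram`, the list-of-rows image format, and the hard (non-interpolated) binning are assumptions made for brevity; the real method also normalizes histograms over overlapping blocks and feeds the result to a linear SVM in a sliding window.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Orientation histogram for one cell of grayscale values.

    `cell` is a list of pixel rows (an assumed format). Gradients use
    centered differences on interior pixels only. Orientations are
    unsigned (0 to 180 degrees) and each pixel votes with its magnitude.
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal gradient
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical gradient
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The final modulo guards against an angle that rounds to 180.0.
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# A vertical edge: all gradient energy lands in the first orientation bin.
vertical_edge = [[0, 0, 10, 10] for _ in range(4)]
print(cell_histogram(vertical_edge))
```

A detector built on this would concatenate the histograms of all cells in a window into one feature vector and classify that vector as person or background.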

While I agree that it is definitely worth looking at the paper, this is pedestrian detection, not human detection. This is a quite restricted setting.

(Sep 22 '11 at 03:44) Andreas Mueller

About your edit: I have never used the Haar classifier in OpenCV. I think it is what people use when they want very fast detection and don't want to implement something themselves.

I have not seen Haar features used in any recent research, so I would expect the performance to be quite poor.

(Sep 24 '11 at 07:15) Andreas Mueller

One Answer:

There are several methods that can do this, depending on what you want to achieve. First of all, what does location mean? Is it a bounding box or a pixel-level segmentation of the image? What kind of scenes do you have? Are people in all sorts of positions? And do you just have single still images?

If you can assume that the faces of the people are visible, you might just want to try a face detection algorithm. Finding a face is a much easier problem than finding a person. A place to start here would be the classic Viola and Jones algorithm.
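
To give a feel for why Viola and Jones is fast, here is a minimal sketch of its two basic ingredients: the integral image and a two-rectangle Haar-like feature computed from it. The function names and the list-of-rows image format are assumptions for illustration. The actual detector additionally selects features with AdaBoost and arranges the resulting classifiers in a cascade, which is not shown here.

```python
def integral_image(img):
    """Summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle at (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii, x, y, w, h):
    """Left half minus right half of a rectangle; w is assumed to be even."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

img = [[1, 2, 3, 4],
       [5, 6, 7, 8]]
ii = integral_image(img)
print(rect_sum(ii, 1, 0, 2, 2))          # 2 + 3 + 6 + 7
print(two_rect_feature(ii, 0, 0, 4, 2))  # (1+2+5+6) - (3+4+7+8)
```

Once the table is built, every rectangle sum costs the same four lookups regardless of its size, which is what makes evaluating many features per window cheap.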

I recommend you look at the outcome of the Pascal VOC challenge. There are detection (bounding box), segmentation (pixel level), and human action recognition challenges.
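
In the VOC detection task, a predicted box counts as correct when its overlap with a ground-truth box, measured as intersection over union, exceeds 0.5. A minimal sketch of that overlap measure follows. The `(x_min, y_min, x_max, y_max)` tuple format and the continuous-coordinate area are assumptions here; the official development kit works with inclusive pixel coordinates, so its numbers can differ slightly.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero so disjoint boxes give no overlap.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(iou(prediction, ground_truth) > 0.5)  # would this detection count?
```

The same measure is handy when comparing detectors on your own data, since it gives a single threshold for deciding whether a box is a hit.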

One possibility would be to use some variant of Felzenszwalb's "deformable part models". This is a generic object detector that works very well in practice, and I think code is available.

If you want a stick figure of the people, you should look at the action recognition work. It focuses more on people and returns more than a bounding box.

For segmentation, I can't really say if there is one approach I'd totally recommend.

edit: How could I forget the "poselets" work by Malik's group? Maybe check that out first :)

This answer is marked "community wiki".

answered Sep 22 '11 at 03:51

Andreas Mueller

edited Sep 22 '11 at 03:56

By location, I was thinking of just the bounding box. Most of my scenes would include only single people, but I'd like to handle scenes with multiple people. My plan is to use the bounding box to focus additional classifiers to analyze and identify each detected person.

(Sep 22 '11 at 13:28) Cerin
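
The plan described above, using each detected box to focus later classifiers, could look roughly like this. It is only a sketch: the function name `crop_detections`, the list-of-rows image format, and the box convention (`x_max` and `y_max` exclusive) are assumptions, and the detector that produces the boxes is left out.

```python
def crop_detections(image, boxes):
    """Return one sub-image per detected box.

    `image` is a list of pixel rows; each box is
    (x_min, y_min, x_max, y_max) with exclusive maxima (assumed convention).
    """
    height, width = len(image), len(image[0])
    crops = []
    for x_min, y_min, x_max, y_max in boxes:
        # Clamp to the image so boxes that spill over the border still crop.
        x0, y0 = max(0, x_min), max(0, y_min)
        x1, y1 = min(width, x_max), min(height, y_max)
        if x1 > x0 and y1 > y0:  # skip boxes that fall entirely outside
            crops.append([row[x0:x1] for row in image[y0:y1]])
    return crops

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11]]
boxes = [(1, 0, 3, 2), (2, 1, 10, 10)]  # hypothetical detector output
for crop in crop_detections(image, boxes):
    print(crop)  # each crop would be handed to the follow-up classifier
```

Handling several people then needs no special case, since each box is cropped and analyzed independently.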

Check out this one: http://citeseerx.ist.psu.edu/viewdoc/download?doi= I haven't read it, but it seems to address your problem, and it's a recent paper using an approach that is pretty successful as far as I can tell.

Btw: do you have arbitrary scenes or a controlled setup?

(Sep 22 '11 at 13:34) Andreas Mueller

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.