 |

4
 |
 |
| I4 |
|
 |
United Kingdom |
University of Oxford -
Department of Engineering Science - Active Vision Group |
|
 |
|
|
 |
 |
 |
 |
 |
|
|
|
| Navigation |
 |
|
 |
|
Activities
The Active Vision Group seeks to advance knowledge in computational vision,
particularly in the areas of detection and tracking of moving objects, and
structure recovery from calibrated and partially calibrated imagery.
The Group works on applications for surveillance, wearable and assistive
computing, cognitive vision, augmented reality, human motion analysis,
teleoperation, and navigation.
|
|
|
|
Projects
|
|
|
|
This research integrates the fields of activity
recognition with active sensing. A particular focus lies on a fusion of the
data acquisition process at varying levels of resolution with pro-active
sensing, which includes higher level reasoning.
The topics described next bring together techniques in visual tracking,
activity recognition, and intelligent control of pan/tilt/zoom devices in
order to be able to reason about visual scenes, infer causal relationships,
and detect unusual or otherwise interesting behaviour.
Behaviour from Head Pose
The aim of the project is to automatically identify the direction in which
people are facing from a distant camera in a surveillance situation to
provide input to higher level reasoning systems. The direction in which
somebody is facing provides a good estimate of their gaze direction, which
can be used to infer familiarity between people or interest in surroundings.
It can be seen as closing the gap between a coarse description of humans
from a distance and a more detailed motion of limbs, usually obtained from a
closer view. The work is partly funded by HERMES, located in work package 3
and 4.
Active Scene Exploration
Effective use of resources is an underlying theme of this project. The
resources in question are a set of cameras which overlook a common area from
varying viewing angles. These cameras are heterogenous and have different
parameters for control, e.g. some are static, some are pan, tilt and zoom
cameras. Information theoretic measures are used to choose the best
surveillance parameters for these cameras, whereas best can be defined by
higher level reasoning, or human operators. Currently, the work concentrates
on objective functions from information-theory and the use of sensor data
fusion techniques to make informed decisions.
As part of the HERMES project, the goal is to establish a perception/action
cycle with specific consideration of varying zoom levels. The distributed
camera system can be interpreted as an abstract sensor which is content with
higher level objectives as input.
The coarsest scale of an agent representation is considered to track agents
and note their trajectories, together with other coarse scale features, that
will be useful for action and intention recognition. The aim is then to
generate behaviour and conceptual descriptions about the agent itself and
its relationship with respect to other agents and predefined objects in the
scene.
Cognitive Computer Vision
Recent work in visual tracking and camera control has looked at the issues
involved in activity recognition using parametric and non-parametric belief
propagation in Bayesian Networks, and begun to touch on the issues of
causality. The current research takes all of these areas forward. The
ultimate goal will be to combine these techniques to produce a pan/tilt/zoom
camera system, and/or network of cameras, that can allocate attention in an
intelligent fashion via an understanding of the scene, inferred
automatically from visual data.
The topic is directly related to the EU project HERMES, which is in the
exciting and socially relevant area of intelligent visual surveillance. The
aim of the research is to develop cameras systems that could be considered
to exhibit emergent cognitive behaviour, through developing algorithms and
ontologies for understanding of visual scenes.
|
|
|
|
|
|
|
|
Recent work in visual tracking and camera control has
looked at the issues involved in activity recognition using parametric and
non-parametric belief propagation in Bayesian Networks, and begun to touch
on the issues of causality. The current research takes all of these areas
forward. The ultimate goal will be to combine these techniques to produce a
pan/tilt/zoom camera system, and/or network of cameras, that can allocate
attention in an intelligent fashion via an understanding of the scene,
inferred automatically from visual data.
The topic is directly related to the EU project HERMES, which is in the
exciting and socially relevant area of intelligent visual surveillance. The
aim of the research is to develop cameras systems that could be considered
to exhibit emergent cognitive behaviour, through developing algorithms and
ontologies for understanding of visual scenes.
In this context, solutions based on fuzzy temporal logic are initially
investigated, connecting fuzzy inference to action in an attempt to control
pan/tilt/zoom cameras in real-time. Algorithms are to be tested on a network
of camera nodes, each one provided with a computer unit for local processing
and low-level control of the actuators.
Another important aspect of the project is to investigate new solutions for
cognitive vision with an emphasis on intelligent surveillance devices. More
specifically, to conduct research into causal reasoning from video, and
integrate this with work on activity recognition, visual tracking
algorithms, and active control of pan/tilt/zoom devices and other techniques
applicable to the broad problem of creating intelligent visual surveillance
devices.
|
|
|
|
|
 |
|
 |
|
|
|
|
|
|
|
|
Possibly the most salient large-scale behavioural
indicator that can be measured optically is a person's head pose. Knowing
the direction in which a person is facing provides a strong indication of
the direction in which they are looking, which is an important cue for
higher level behavioural analysis. The focus of an individual's attention
often indicates their desired destination whereas mutual attention between
people indicates familiarity, and any single object or person receiving
attention from a large number of people is likely to be worthy of further
investigation. In systems controlling dynamic cameras, a pose estimation
from a low resolution head image can be used to determine whether or not a
close-up from a dynamic camera would provide a face image that is suitable
for identification. The aim of this project is to estimate the head pose of
individuals in a surveillance situation for use in higher-level reasoning.
|
|
|
|
|
 |
|
 |
|
|
|
|
We have developed algorithms to estimate head pose
through a novel use of randomised fern classifiers. Instead of measuring the
head pose of an image directly, classifiers are used to categorise images
into groups according to the head pose. For a head pose estimator to be
effective in real-world situations it must be able to cope with different
skin and hair colours as well as wide variations in lighting direction,
intensity and colour. Most existing classifiers are susceptible to these
variations and require examples with different combinations of lighting
conditions and skin/hair colour variations in order to make an accurate
classification. The approach that we have taken effectively learns a model
of the skin and hair colours for each new person that is observed, making it
largely invariant to lighting and the individual characteristics of the
people in the video. The result is a classifier which works in very low
resolution video where heads have a diameter of just 10 pixels.
|
|
|
|
|
 |
|
 |
|
|
|
|
|
|
|
|
To work at video rate, the maps that monocular SLAM
builds are bound to be sparse, making them sensitive to the erroneous
inclusion of moving points and to the deletion of valid points through
temporary occlusion. This system provides the parallel implementation of
monoSLAM (monocular simultaneous localization and mapping) with a 3D object
tracker, allowing reasoning about moving objects and occlusion. The SLAM
process provides the object tracker with information to register objects to
the map's frame, and the object tracker allows the marking of features,
either those moving features on moving objects, or those pseudo-features
created by their occluding edges, or those occluded by objects. While a
traditional monoSLAM, assuming a rigid environment, degrades performance,
sometimes terminally, when moving features are included, the combined system
is more robust to dynamic environments. In addition, knowledge that some
static features are occluded rather than unreliable avoids the need to
invoke the somewhat cumbersome process of feature deletion, followed later
perhaps by unnecessary re-initialization, allowing the lifetime of occluded
static features to be extended.
The object tracker is done using a modified version of Harris' RAPiD
tracker. The identification and pose initialization are at present done by
hand. The videos are presented to verify the recovered geometry and to
indicate the impact on camera pose in monoSLAM of including and avoiding
moving features. The system without the object tracker gives the incorrect
camera's pose due to moving features, but still survives until the end of
the video. The system with the object tracker, on the other hand, estimates
more correct camera's pose through the image sequences.
|
|
|
|
|
 |
|
 |
|
|
|
|
|
|
|
|
The purpose of this project is to provide a natural
and intuitive way to interact with a computer, by interpreting hand
movements and gestures in real time. A cost effective way of obtaining this
in a non invasive way is to use visual sensing from cameras.
The core of this project is an algorithm that integrates segmentation, 3D
pose estimation of a human hand by use of a simplified 3D hand model and a
mapping of the pose parameters into a latent space. In order to be able to
track in 3D a non rigid articulated object (like a human hand) it must first
be able to track in 3D rigid non articulated objects.
The algorithm we have been working on involves adding 3D shape information
to a tracking algorithm for 2D rigid object tracking developed inside the
Active Vision Group by Charles Bibby and Ian Reid, in their paper, Robust
Real-Time Visual Tracking using Pixel-Wise Posteriors. The algorithm treats
the image as a bag of pixels (the position of the pixels in the image is
considered a random variable), then evolves a level set function by use of
pixel-wise posteriors, rather than likelihoods. This approach works in real
time on standard hardware. We are working on adding a new prior: the norm of
the difference between the rendering of the 3D object model, with properly
adjusted pose parameters, and the segmented region of the image, defined by
the level set function. This region will then evolve towards the projection
of the 3D object.
While the algorithm presented above should work in the case of optimizing
towards the pose parameters of a rigid object it will probably be too slow
for obtaining the pose of a non rigid, articulated, object. To this end we
are looking at using a Gaussian Processes Latent Variable Model mapping
between the high dimensional pose space and a low dimensional latent space.
The system uses a custom 3D engine. Traditional 3D rendering engines (like
OpenGL or DirectX) lose the relation been 3D points before a transform
(rotation, translation and projection) and their resulting 2D projections,
during the rendering process. Out engine is able to keep this relation and
render a 3D object in wireframe, filled and outline only mode, apply a
Scharr filter and compute the distance transform in only a couple of
milliseconds. This level of performance is achieved using parallel
algorithms developed for the NVIDIA CUDA framework.
|
|
|
|
|
|
|
A system has been developed which combines
single-camera SLAM (Simultaneous Localization and Mapping) with established
methods for feature recognition. Besides using standard salient image
features to build an on-line map of the camera's environment, it is capable
of identifying and localizing known planar objects in the scene, and
incorporating their geometry into the world map. Continued measurement of
these mapped objects improves both the accuracy of estimated maps and the
robustness of the tracking system. In the context of hand-held or wearable
vision, the system's ability to enhance generated maps with known objects
increases the map's value to human operators, and also enables meaningful
automatic annotation of the user's surroundings. The presented solution lies
between the high order enriching of maps such as scene classification, and
the efforts to introduce higher geometric primitives such as lines into
probabilistic maps. The object detection is done using SIFT. A database of
known objects are compared to scene images and when a match is found the 3D
location of the object is calculated using a homography and placed in the
SLAM map with a high level of accuracy.
The video compares the monocular SLAM system
running with and without object detection in a spit-screen view. The system
without the object detection looses track due to insufficient features, and
at this point the video is slowed down to highlight this. The system with
the object detection continues and at the end of the video it has
successfully detected all five objects and accurately localized them in the
world.
|
|
|
|
|
 |
|
 |
|
|
|
|
|
|
|
A relocalisation module for a single-camera SLAM system has been
devised. When the camera is occluded or the motion is fast then
the system recognises it has become lost and attempts to
relocalise. The camera pose is found relative to the map of point
features in the world which has already been created by the SLAM
system. Potential matches to these features are found in the
image and then RANSAC is used with a three-point-pose algorithm
to robustly determine the pose from these matches. In this video, the potential matches
are found using correlation between the stored feature patches
and the detected corners in the image. In an improvement of the method, we have used increased the
speed of feature matching using a modification of randomized
tree search. This allows the system to recover quickly even in
large maps.
At the end of the video we demonstrate how single-camera SLAM
can be used for augmented reality applications.
|
|
|
|
|
 |
|
 |
|
|
|
|
|
| |
|
|
|
|
|
|
 |
|
 |
|

6
 |
 |