Seeing Action: Part 2
Statistics and/or Structure
Aaron Bobick ([email protected])
School of Interactive Computing College of Computing
Georgia Tech
Continuing from the lower middle???
Three levels of understanding motion or behavior:
Movement – atomic behaviors defined by motion
– "Bending down", "(door) rising up", swinging a hammer
Action – a single, semantically meaningful "event"
– "Opening a door", "Lifting a package"
– Typically short in time
– Might be definable in terms of motion, especially so in a particular context
Activity – a behavior or collection of actions with a purpose/intention
– "Delivering packages"
– Typically has causal underpinnings
– Can be thought of as statistically structured events
Maybe actions are movements in context??
Context
Structure and Statistics (Old and new)
Grammar-based representation and parsing
– Highly expressive for activity description
– Easy to build higher-level activity from a reused low-level vocabulary
P-Nets (propagation nets) – really stochastic Petri nets
– Specify the structure; with some annotation, can learn detectors and triggering probabilities
Statistics of events
– Low-level events are statistically sequenced – too hard to learn a full model
– N-grams or suffix trees
"Higher-level" Activities: Known structure, uncertain elements
Many activities are comprised of a priori defined sequences of primitive elements.
– Dancing, conducting, pitching, stealing a car from a parking lot
– The states are not hidden.
The activities can be described by a set of grammar-like rules; often ad hoc approaches are taken.
But the sequences are uncertain:
– Uncertain performance of elements
– Uncertain observation of elements
The basic idea and approach
Idea: split the problem into:
– Low-level primitives with uncertain feature detection (individual elements might be HMMs)
– High-level description found by parsing the input stream of uncertain primitives
Approach: extend stochastic context-free grammars (SCFGs) to handle perceptually relevant uncertainty.
Stochastic CFGs
Traditional SCFGs have probabilities associated with the production rules. Traditional parsing yields the most likely parse given a known set of input symbols.
PIECE -> BAR PIECE [0.5] | BAR [0.5]
BAR -> TWO [0.5] | THREE [0.5]
THREE -> down3 right3 up3 [1.0]
TWO -> down2 up2 [1.0]
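A minimal sketch (in Python; not from the slides) of the grammar above as a data structure, with a sampler that draws terminal strings according to the rule probabilities:

    import random

    # The SCFG above; rule probabilities follow the bracketed numbers.
    GRAMMAR = {
        "PIECE": [(["BAR", "PIECE"], 0.5), (["BAR"], 0.5)],
        "BAR":   [(["TWO"], 0.5), (["THREE"], 0.5)],
        "THREE": [(["down3", "right3", "up3"], 1.0)],
        "TWO":   [(["down2", "up2"], 1.0)],
    }

    def sample(symbol="PIECE"):
        """Expand a nonterminal by sampling productions; terminals pass through."""
        if symbol not in GRAMMAR:                     # terminal symbol
            return [symbol]
        options, weights = zip(*GRAMMAR[symbol])
        rhs = random.choices(options, weights=weights)[0]
        return [tok for s in rhs for tok in sample(s)]

    print(" ".join(sample()))   # e.g. "down2 up2 down3 right3 up3"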
Thanks to Andreas Stolcke's prior work on parsing SCFGs using an efficient Earley parser.
Extending SCFGs (Ivanov and Bobick, PAMI)
Within the parser we handle:
– Uncertainty about input symbols: the input is a multi-valued string (a vector of likelihoods per symbol); a toy sketch follows below
– Deletion, substitution, and insertion errors: introduce error rules
– Individually recognized primitives that are typically temporally inconsistent: introduce a penalty for overlap; spatial and temporal consistency are enforced
Need to define when a symbol has been generated.
How do we learn production probabilities? (Not many examples.) Make sure the parse is not too sensitive to them.
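As a toy illustration of parsing such a multi-valued input, the sketch below (reusing the GRAMMAR structure from the previous snippet) brute-forces every derivation up to the input length and scores it against per-position terminal likelihoods. Ivanov and Bobick instead use Stolcke's efficient Earley parser, and error rules are omitted here:

    def derivations(symbol, max_len):
        """Enumerate (terminal string, rule probability) pairs of length <= max_len."""
        if symbol not in GRAMMAR:                     # terminal: yields itself
            return [([symbol], 1.0)]
        out = []
        for rhs, p_rule in GRAMMAR[symbol]:
            partial = [([], p_rule)]
            for s in rhs:                             # expand the right-hand side left to right
                partial = [(toks + t, p * q)
                           for toks, p in partial
                           for t, q in derivations(s, max_len - len(toks))
                           if len(toks) + len(t) <= max_len]
            out.extend(partial)
        return out

    def most_likely_parse(likelihoods, start="PIECE"):
        """likelihoods: one dict per input position mapping terminal -> P(obs | terminal).
        Returns the yield that maximizes rule probabilities times terminal likelihoods."""
        best, best_p = None, 0.0
        for toks, p in derivations(start, len(likelihoods)):
            if len(toks) != len(likelihoods):
                continue
            for tok, lik in zip(toks, likelihoods):
                p *= lik.get(tok, 0.0)                # unmodeled symbols get zero
            if p > best_p:
                best, best_p = toks, p
        return best, best_p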
Enforcing temporal consistency
(Figure: output of one HMM parsing backwards – P(primitive) plotted over time, with output events marked.)
Video Sample
Event Grammar and Parsing
Tracker generates events: ENTER, LOST, FOUND, EXIT, STOP. Tracks have properties (e.g. size) and trajectories.
Tracker assigns class to each event, though only probabilistically.
Parser parses single stream that contains interleaved events: (CAR-ENTER, CAR-STOP, PERSON-FOUND, CAR-EXIT, PERSON-EXIT)
Parser enforces spatial and temporal consistency for each object class and interactions (e.g. to be a PICK-UP, the PERSON-FOUND event must be close to CAR-STOP)
Spatial and temporal consistency eliminates symbolic ambiguity.
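For instance, a minimal sketch of the consistency check just described (thresholds and field names are illustrative, not the system's actual values):

    from dataclasses import dataclass

    @dataclass
    class Event:
        label: str        # e.g. "CAR-STOP", "PERSON-FOUND"
        t: float          # time stamp
        x: float          # image position
        y: float

    MAX_DIST, MAX_GAP = 50.0, 5.0   # illustrative thresholds

    def pickup_consistent(car_stop: Event, person_found: Event) -> bool:
        """For a PICK-UP, the PERSON-FOUND event must occur close to,
        and shortly after, the CAR-STOP event."""
        dist = ((car_stop.x - person_found.x) ** 2 +
                (car_stop.y - person_found.y) ** 2) ** 0.5
        return dist <= MAX_DIST and 0.0 <= person_found.t - car_stop.t <= MAX_GAP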
Advantages of SCFGs
What grammar can do (simplified):
CAR_PASS -> CAR_ENTER CAR_EXIT |
            CAR_ENTER CAR_HIDDEN CAR_EXIT
CAR_HIDDEN -> CAR_LOST CAR_FOUND |
              CAR_LOST CAR_FOUND CAR_HIDDEN
SKIP allows concurrency (and junk):
PERSON_LOST -> person_lost | SKIP person_lost
Concurrent parse:
Events: ce pe cl cf cs px pl cx
PICKUP -> ce pe cl cf cs px pl cx
P_PASS -> ce pe cl cf cs px pl cx
Parsing System
Parse 1: Person-pass-through
Parse 2: Drive-in
Parse 3: Car-pass-through
Parse 4: Drop-off
Advantages of STCFG approach
Structure and components of activities defined a priori and are the right levels of annotation to recover (compare to HMMs).
FSM vs. CFG is not the point; rather, it is the explicit representation of structural elements and uncertainties.
Often many (enough) examples of each primitive to support training, but not of higher level activity.
Allows for integration of heterogeneous primitive detectors; only assumes likelihood generation.
More robust than ad hoc rule-based techniques: handles errors through probability.
No notion of causality, or anything other than (multi-stream) sequencing.
Some Q's about Representations…
Scope and Range:
– Thoughts???
"Language" of the representation:
– Grammar of explicit symbols
Computability of an instance:
– Quite easy. Given the input string, the parsing is both the computation and the matching.
Learnability of the "class":
– The inside-outside algorithm can learn SCFG probabilities, but let's be serious…
Stability in the face of perceptual uncertainty:
– Explicitly designed to handle this uncertainty
Inference support:
– Depends on what you mean by inference. No notion of real semantics or explicit time.
P-Nets (Propagation Networks) (Shi and Bobick, ’04 and ’06)
Nodes represent activation intervals
– Active vs. inactive: token propagation
– More than one node can be active at a time!
Links represent partial order as well as logical constraints
Duration model on each link and node:
– Explicit model of the length of an activation
– Explicit model of the gap between successive intervals
Observation model on each node
Conceptual Schema
Logical relations
– Autonomous assumption: logical constraints exist only at the start/end points of intervals
– A conditional probability function can represent any logical function (see the sketch below)
Examples of logical constraints
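As a small illustration, a conditional probability table over the end-states of two predecessor nodes can encode the logical AND "C may start only after A and B have both ended" (node names are made up for the example):

    # P(C starts | A has ended, B has ended) written out as a table;
    # any Boolean function of the predecessors can be encoded this way.
    P_C_STARTS = {
        (True, True):   1.0,   # both predecessors done -> C may start
        (True, False):  0.0,
        (False, True):  0.0,
        (False, False): 0.0,
    }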
Propagation Net – Computing
Computational schema: a DBN-style rollout to compute the corresponding conceptual schema
Experiment: Glucose Project
Task: monitor a user calibrating a glucose meter and point out operating errors as feedback.
Constructed a 16-node P-Net as the representation.
3 subjects with a total of 21 perfect sequences, 10 missing_1_step sequences, and 10 missing_6_steps sequences.
D-Condensation
Initiate 1 particle at the dummy starting node
Repeat:
  For each particle:
    Generate all possible consequent states
    Calculate the probability of each state
  Select n particles to survive
Until the final time step is reached
Output the path represented by the particle with the highest probability
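A minimal Python sketch of that rollout; successors() and score() are hypothetical hooks standing in for the P-Net's link structure and its duration/observation models:

    import heapq

    def d_condensation(successors, score, start, T, n=100):
        """successors(state) -> iterable of next states (from the P-Net links);
        score(state, prev_p, t) -> updated particle probability (duration and
        observation models folded in). Neither is specified on the slides."""
        particles = [(1.0, [start])]                  # one particle at the dummy start node
        for t in range(T):                            # until the final time step
            expanded = []
            for p, path in particles:                 # generate all consequent states
                for s in successors(path[-1]):
                    expanded.append((score(s, p, t), path + [s]))
            # select the n most probable particles to survive
            particles = heapq.nlargest(n, expanded, key=lambda e: e[0])
        return max(particles, key=lambda e: e[0])[1]  # highest-probability path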
Experiment: Glucose Meter Calibration
Experiment: Classification Performance
Experiment: Label individual frames
Labeling individual nodes. Labels on Node J: Insert.
And now some statistics…
Problem: the higher-level world of activity is not usually a P-Net or an FSM or an HMM or …
Two possible solutions:
1. Understand what's really going on… …another time.
2. Lose the structure.
Stochastic sequencing (Hamid and Bobick)
A priori, define some low-level "actions"/events that can be stochastically detected in context – e.g., a door opening
Collect training data (streams of events) of activities – making a delivery, a UPS pick-up, trash collection
Collect histograms of n-tuples and do both activity discovery and recognition
– Later, can focus on anomalies
Advantages: cheat where it's easy, learn the hard stuff, exploit the context
Barnes & Noble Loading Dock
Two levels in the representation
Low level: events (the computer vision problem)
– Background subtraction and foreground extraction (better "modeling")
– Classifying (per frame) each foreground object as either:
  • Person
  • Vehicle (what type, if possible)
  • Package
  • Tool used to move packages
  • Miscellaneous object
– Tracking people, vehicles, packages, tools, and miscellaneous objects over multiple frames
Two levels in the representation
Higher level: statistical characterization of subsequences
– Instances of the same activity class have certain common subsequences.
– But partial ordering will typically rearrange subsequences within the sequence.
– Find a "soft" characterization of the statistics of the subsequences.
– Define a similarity measure for such characterizations.
  • Allows discovery of activity classes
  • Allows detection of anomalous examples
Caveats:
– We provide the events – whether manually or by specifying the detector doesn't really matter (except for publication)
– Training needs pre-segmentation
Stochastic sequencing: n-grams
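A minimal sketch of the low-level representation: the histogram of event n-grams for one activity instance (event labels are whatever the detectors emit):

    from collections import Counter

    def ngram_histogram(events, n=3):
        """Count the n-grams (n-tuples of consecutive events) in one
        time-ordered stream of detected event labels."""
        return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))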
Experimental Setup – Loading Dock
Barnes & Noble Loading Dock Area
One month's worth of data:
– 5 days a week
– 9 a.m. till 5 p.m.
Event vocabulary – 61 events
– Hand-labeled for testing activity labeling and noise sensitivity
– Training detectors for these events
Bird’s Eye View of Experimental Setup
B&N Processing Video
Activity-Class Discovery
Treating activities as individual instances
Activity-class discovery – finding maximal cliques in edge-weighted graphs
Need to come up with:
– An activity similarity metric
– A procedure to group similar activities
Activity Similarity
Two types of differences:
– Structural differences
– Frequency differences
Sim(A,B) = 1 – normalized difference between the counts of non-zero event n-grams (a sketch follows the list below)
Properties:
– Has an additive identity
– Is commutative
– Does not satisfy the triangle inequality
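One plausible reading of that measure as code (the exact normalization Hamid and Bobick use is not spelled out on the slide). It operates on the Counter histograms from ngram_histogram(), is commutative, and does not satisfy the triangle inequality, as noted above:

    def similarity(hist_a, hist_b):
        """Sim(A, B) = 1 - normalized difference between the counts of
        the n-grams that are non-zero in either histogram."""
        keys = set(hist_a) | set(hist_b)
        diff = sum(abs(hist_a[k] - hist_b[k]) for k in keys)
        total = sum(hist_a[k] + hist_b[k] for k in keys)
        return 1.0 - diff / total if total else 1.0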
Activity-Class Discovery
A graph-theoretic problem of finding maximal cliques in edge-weighted graphs [Pavan, Pelillo '03]
Sequentially find maximal cliques in an edge-weighted graph of activities
Activities different enough from all the regular activities are anomalies
Activity-Class Discovery – Dominant Sets
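A sketch of the dominant-sets machinery, following the replicator-dynamics formulation of Pavan and Pelillo '03 (the support threshold and the peeling loop details are assumptions for the example):

    import numpy as np

    def dominant_set(W, iters=1000, tol=1e-8):
        """Extract one dominant set from similarity matrix W (zero diagonal)
        via discrete replicator dynamics; returns the supporting indices."""
        n = W.shape[0]
        x = np.full(n, 1.0 / n)                   # start at the barycenter
        for _ in range(iters):
            y = x * (W @ x)                       # replicator update
            s = y.sum()
            if s <= tol:
                break
            y /= s
            done = np.abs(y - x).sum() < tol
            x = y
            if done:
                break
        return np.where(x > 1.0 / (2 * n))[0]     # heuristic support threshold

    def discover_classes(W, min_size=2):
        """Sequentially peel off dominant sets; leftovers are anomaly candidates."""
        remaining = np.arange(W.shape[0])
        classes = []
        while remaining.size >= min_size:
            idx = dominant_set(W[np.ix_(remaining, remaining)])
            if idx.size < min_size:
                break
            classes.append(remaining[idx])
            remaining = np.delete(remaining, idx)
        return classes, remaining                 # remaining: potential anomalies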
Anomaly Detection
Compute the within-class similarity of the test activity w.r.t. previous class members.
Learn the detection threshold from training data – can be done using an ROC curve.
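A minimal sketch reusing similarity() from above: score a test activity by its mean similarity to the members of each discovered class, and flag it when even the best class falls below the learned threshold:

    def is_anomalous(test_hist, classes, threshold):
        """classes: list of lists of member histograms, one list per class.
        `threshold` would be chosen from training data, e.g. via an ROC curve."""
        best = max(sum(similarity(test_hist, h) for h in members) / len(members)
                   for members in classes)
        return best < threshold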
Anomaly "Explanation"
Explanatory features – their frequency has a high mean and low variance across the regular members.
Explanation based on features that were (as sketched below):
– Missing from an anomaly but frequently and consistently present in regular members
– Extraneous in an anomaly but consistently absent from the regular members
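A sketch of that explanation rule; the mean and variance thresholds are illustrative stand-ins for whatever the system learns:

    from statistics import mean, pvariance

    def explain_anomaly(regular_hists, anomaly_hist, mean_thresh=2.0, var_thresh=1.0):
        """Flag n-grams with high-mean, low-variance counts across the regular
        members as 'missing' when absent from the anomaly, and report n-grams
        present in the anomaly but absent from all regulars as 'extraneous'."""
        regular_keys = set().union(*regular_hists)
        missing = []
        for k in regular_keys:
            counts = [h[k] for h in regular_hists]
            if (mean(counts) >= mean_thresh and pvariance(counts) <= var_thresh
                    and anomaly_hist[k] == 0):
                missing.append(k)
        extraneous = [k for k in anomaly_hist if k not in regular_keys]
        return missing, extraneous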
Results
UPS delivery vehicles
FedEx delivery vehicles
Delivery trucks – multiple packages delivered
Cars and vans – only 1 or 2 packages delivered
Motorized cart used to pick up and drop off packages
Van deliveries – no use of the motorized cart
Delivery trucks – multiple people
General Characteristics of Discovered Activity Classes
A Few of the Detected Anomalies
(a) Back door of a delivery vehicle not closed
(b) More than the usual number of people involved in unloading
(c) Very few vocabulary events performed
Results
Are the detected anomalous activities 'interesting' from a human viewpoint?
Anecdotal validation:
– Studied 7 users
– Showed each user 8 randomly selected regular activities
– Showed each user 10 test activities: 5 regular and 5 detected anomalous activities
– 8 out of 10 of the users' activity labels matched the labels of our system
– The probability of this match happening by chance is 4.4%
Some Q's about Representations… (more discussion)
Scope and Range:
– A monitored scene with pre-designed detectors
"Language" of the representation:
– Histograms and other statistics of feature n-gram occurrences
Computability of an instance:
– Given detectors, easy to compute
Learnability of the "class":
– Full power of statistical learning, even allowing a notion of outlier detection
Stability in the face of perceptual uncertainty:
– Fair. Needs to be better.
Inference support:
– Distance-in-feature-space reasoning only.