Agenda
• Vision
• Current State
• Our Approach: towards three main areas
  o Dynamic Gesture Recognition: using machine learning
  o Modeling 3D Objects: building the environment, modeling styles, other features
  o Improving Accuracy and Interface: IMU/Kinect integration, mobile interface
• Conclusion
• Demo
• Future Directions
• References
Vision
“Devise newer interfaces and techniques that will provide a totally immersive experience for modeling 3D objects in real time.”
Current State: Wacom & AutoCAD
o The industry standard is to create 3D objects in AutoCAD using a Wacom pen tablet.
o The problem: we think and draw in 3D, but the tablet's interface is 2D. We can do better by moving towards natural 3D gestures, which take the artist's imagination to the next level.
o No commercial application solves this problem yet; it is still in research. The major challenges are accuracy, precision, and control.
Approach
[Fig. 1: Training pipeline: sensor data flows into the application, which stores samples in a library that feeds the ML training stage]
[Fig. 2: Classification pipeline: sensor data flows into the application, which passes samples to the trained ML recognizer to produce a classification]
Basic
Training of a new gesture
[Fig. 3: Training: the Kinect depth sensor delivers depth frames to OpenNI, which extracts 3D coordinates; the application stores samples in a library (last k frames) and feeds them to the artificial neural network for supervised training]
Classification of a gesture
[Fig. 4: Classification: the Kinect depth sensor delivers depth frames to OpenNI, which extracts 3D coordinates; the application passes them to the trained artificial neural network, which outputs the gesture classification]
ANN
[Fig. 5: ANN net: 3 × k inputs (the x, y, z coordinates at times t, t−1, …, t−k), a hidden layer h1 … hn, and p outputs G1 … Gp]
• k: the number of discrete time steps over which a gesture is performed.
• p: the number of gestures.
• n: the number of neurons in the hidden layer.
• The input and hidden layers are fully connected, as are the hidden and output layers.
• A tangent sigmoid activation function f(x) is used at the neurons.
• The network learns using backpropagation of errors.
ANN-Features
$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
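This net maps naturally onto Neuroph, which is listed under Tools. Below is a minimal training-and-classification sketch, assuming Neuroph 2.6+ and purely illustrative sizes for k, n, and p; the real system's data capture, normalization, and persistence are omitted.

```java
import org.neuroph.core.data.DataSet;
import org.neuroph.core.data.DataSetRow;
import org.neuroph.nnet.MultiLayerPerceptron;
import org.neuroph.util.TransferFunctionType;

public class GestureNetSketch {
    public static void main(String[] args) {
        int k = 25, p = 5, n = 40;          // assumed sizes: k frames, p gestures, n hidden neurons
        int inputs = 3 * k;                  // x, y, z coordinates for each of the last k frames

        // Fully connected 3k -> n -> p net with tanh (tangent sigmoid) activations.
        MultiLayerPerceptron net =
            new MultiLayerPerceptron(TransferFunctionType.TANH, inputs, n, p);

        // One row per recorded sample: flattened coordinate window in,
        // one-hot gesture label out (values assumed normalized to [-1, 1]).
        DataSet samples = new DataSet(inputs, p);
        double[] window = new double[inputs];   // would come from the OpenNI hand tracker
        double[] label  = new double[p];
        label[0] = 1.0;                          // this sample belongs to gesture G1
        samples.addRow(new DataSetRow(window, label));

        net.learn(samples);                      // backpropagation training

        // Classification: feed the last k frames, pick the strongest output.
        net.setInput(window);
        net.calculate();
        double[] out = net.getOutput();
        int best = 0;
        for (int i = 1; i < p; i++) if (out[i] > out[best]) best = i;
        System.out.println("Recognized gesture G" + (best + 1));
    }
}
```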
ANN-Limitations
• The network must be rebuilt whenever a new gesture is added to the library.
• Training needs comparatively large data sets.
• The detection rate drops when both hands are considered.
Results
[Fig. 6: Detection rate (%) vs. number of data sets for a single gesture]
The detection rate approaches 79% with no false positives.
Approach
“The vision behind this is to devise newer interfaces and techniques that will provide a totally immersive experience for modeling 3D objects in real time.”
Create
• Model
• Change color
• Change texture
• Pause
• Stop & Save
• Load object
• Place object
Visualize
• 6 DOF
• Rotate
• Zoom
• Fork and Create
• Stop
• Load object
Share
• PDF
• DXF
• Save in real time
Application
• OpenNI / NITE: detect hand location
• Voce (Sphinx4): voice recognition and control
• Application logic: plot spline points for the sculpture; draw balls, spheres, etc.
• OpenGL: render the scene; set perspective, background color, etc.; display / save
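A minimal sketch of this control flow, assuming hypothetical HandTracker and VoiceCommands wrappers around OpenNI/NITE and Voce; only the flow itself comes from the slide, and every name below is illustrative rather than the project's actual API.

```java
import java.util.ArrayList;
import java.util.List;

public class ModelingLoopSketch {
    interface HandTracker { float[] handPosition(); }   // hypothetical OpenNI/NITE wrapper
    interface VoiceCommands { String poll(); }          // hypothetical Voce (Sphinx4) wrapper

    private final List<float[]> splinePoints = new ArrayList<>();
    private boolean modeling = false;

    void step(HandTracker tracker, VoiceCommands voice) {
        String cmd = voice.poll();                      // e.g. "model", "pause", "stop"
        if ("model".equals(cmd)) modeling = true;
        if ("pause".equals(cmd) || "stop".equals(cmd)) modeling = false;

        if (modeling) {
            // Each tracked hand position becomes a spline control point of the sculpture.
            splinePoints.add(tracker.handPosition());
        }
        // Rendering (OpenGL) would sweep a brush profile along the spline and draw it.
    }
}
```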
Results
Approach
“The vision behind this is to devise newer interfaces and techniques that will provide a totally immersive experience for modeling 3D objects in real time.”
Limitations of our recognizer
• Trade-off between accuracy and latency.
• Recognition using the Kinect sensor runs at ~20 pps.
• Because a large amount of visual data is fed through pre-processing, feature extraction, and classification in real time, the latency of such a system is high.
Solution
• Analogy with the GPS/IMU systems used in airplane navigation:
  o GPS: the Kinect depth sensor
  o Inertial Measurement Unit (IMU): the smartphone's sensors
  o Dead reckoning
• We apply data fusion to the streams from the Kinect and the smartphone sensors.
  o The two data streams are complementary, which helps us better estimate the current state.
• The user holds a smartphone during hand recognition.
IMU / smartphone sensors: position estimation
[Dead-reckoning pipeline: integrate the gyroscope rates to obtain orientation; rotate the accelerometer readings into the local-level navigation frame; remove the effect of gravity; double-integrate to obtain position]

$$v(t) = \int_0^t a(\tau)\,d\tau, \qquad p(t) = \int_0^t v(\tau)\,d\tau$$
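The pipeline above reduces to a few lines of arithmetic. A minimal sketch, assuming the phone delivers body-frame gyro rates (rad/s) and accelerations (m/s²) at a fixed interval dt, and using a small-angle rotation update purely for illustration (a real implementation would use quaternions):

```java
public class DeadReckoningSketch {
    double[] velocity = new double[3];
    double[] position = new double[3];
    double[][] R = {{1,0,0},{0,1,0},{0,0,1}};    // body-to-navigation rotation
    static final double G = 9.81;                 // gravity along the navigation z-axis

    void step(double[] gyro, double[] accel, double dt) {
        // 1) Integrate gyroscope rates: R <- R * (I + [w]x * dt), small-angle form.
        double[][] omega = {
            { 0,           -gyro[2]*dt,  gyro[1]*dt },
            { gyro[2]*dt,   0,          -gyro[0]*dt },
            {-gyro[1]*dt,   gyro[0]*dt,  0          }};
        double[][] Rn = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                Rn[i][j] = R[i][j];
                for (int m = 0; m < 3; m++) Rn[i][j] += R[i][m] * omega[m][j];
            }
        R = Rn;

        // 2) Rotate accelerometer readings into the navigation frame.
        double[] aNav = new double[3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) aNav[i] += R[i][j] * accel[j];

        // 3) Remove the effect of gravity (z-up convention assumed).
        aNav[2] -= G;

        // 4) Double integrate: acceleration -> velocity -> position.
        for (int i = 0; i < 3; i++) {
            velocity[i] += aNav[i] * dt;
            position[i] += velocity[i] * dt;
        }
    }
}
```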
Recognition Modeling Improving
Limitations of an IMU
• The major problem is drift, which is worse in low-cost sensors (such as those in a smartphone).
• If one of the accelerometers has a bias error of just 0.001 g (0.0098 m/s²), the position estimate, obtained by double integration, diverges from the true position quadratically; after a mere 30 seconds it has drifted by nearly 4.5 meters!
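The drift figure follows from double-integrating a constant accelerometer bias b:

$$p_{\text{err}}(t) = \tfrac{1}{2}\,b\,t^{2} = \tfrac{1}{2}\,(0.0098\ \mathrm{m/s^{2}})\,(30\ \mathrm{s})^{2} \approx 4.4\ \mathrm{m}$$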
Fusion
[Fusion diagram: the visual tracker (Kinect) and the IMU feed a Kalman filter, which estimates the errors and outputs the corrected position]

Time update:
$$x_k^- = A\,x_{k-1} + B\,u_{k-1}, \qquad P_k^- = A\,P_{k-1}A^T + Q$$

Measurement update:
$$K_k = P_k^- H^T \left(H P_k^- H^T + R\right)^{-1}, \qquad x_k = x_k^- + K_k\left(z_k - H x_k^-\right), \qquad P_k = (I - K_k H)\,P_k^-$$
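The Tools slide lists the Apache Commons Math library for the Kalman filter; a minimal sketch of how these update equations map onto it, assuming an illustrative 1-D constant-velocity state [position, velocity] with the IMU acceleration as control input and the Kinect position as measurement. All noise values and rates below are assumptions, not the project's tuned parameters.

```java
import org.apache.commons.math3.filter.DefaultMeasurementModel;
import org.apache.commons.math3.filter.DefaultProcessModel;
import org.apache.commons.math3.filter.KalmanFilter;
import org.apache.commons.math3.filter.MeasurementModel;
import org.apache.commons.math3.filter.ProcessModel;

public class FusionFilterSketch {
    public static void main(String[] args) {
        double dt = 0.05;                        // assumed 20 Hz update rate

        double[][] A = {{1, dt}, {0, 1}};        // state transition
        double[][] B = {{0.5 * dt * dt}, {dt}};  // control (acceleration) matrix
        double[][] Q = {{1e-4, 0}, {0, 1e-4}};   // process noise (assumed)
        double[][] H = {{1, 0}};                 // Kinect measures position only
        double[][] R = {{1e-2}};                 // measurement noise (assumed)
        double[] x0 = {0, 0};
        double[][] P0 = {{1, 0}, {0, 1}};

        ProcessModel pm = new DefaultProcessModel(A, B, Q, x0, P0);
        MeasurementModel mm = new DefaultMeasurementModel(H, R);
        KalmanFilter filter = new KalmanFilter(pm, mm);

        double imuAccel = 0.2;                   // illustrative IMU reading (m/s^2)
        double kinectPos = 0.01;                 // illustrative Kinect position (m)

        filter.predict(new double[] {imuAccel});   // time update with IMU control input
        filter.correct(new double[] {kinectPos});  // measurement update with Kinect position

        System.out.println("Corrected position: " + filter.getStateEstimation()[0]);
    }
}
```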
Fusion contd.
[System diagram: sensor data from the Android phone and the Kinect feed the fusion stage; the fused estimate drives the application and its display, with which the user interacts]
Results
Fig. 7: A quick loop simulation, using only the Kinect (left) and using the Kinect and the IMU (right)
Results contd.
• The Kinect's position was fed in at 10 Hz and checked against the original 20 Hz stream.
• Compared to plain linear interpolation, our IMU-assisted system improved the location estimates by a factor of 1.37.
• Further tuning of the initial parameters for the application at hand can decrease the errors.
Conclusion
• An architecture was presented for recognizing dynamic gestures from a depth camera using neural networks, which expanded the ways of interacting with 3D objects. [ML]
• We also developed a new set of gestures (e.g. a pottery style) for 3D modeling, producing structures of real significance that can be imported and used in other applications. [HCI]
• Finally, to improve the location estimates, we showed how to integrate data from the inertial sensors and the Kinect to obtain high-quality results. [Data Fusion]
Short Video Demo
Tools used
• Hardware
  o PC (4 GB RAM, at least 1 GB free disk space for the environment, 2.3 GHz dual core)
  o Microsoft Kinect sensor
  o Android phone (with accelerometers and a gyroscope, OS version >= 2.3)
• Software
  o OpenNI / NITE
  o Point Cloud Library (PCL)
  o OpenCV, OpenGL
  o Processing, Eclipse IDE
  o Voce voice recognition
  o Neuroph
  o Apache Commons Math library (for the Kalman filter)
Future Directions
Our system is still far from its original vision, but in its current state it can be used for initial abstract designs.
• More 3D interaction techniques
  o The survey of 3D interaction by Chris Hand [2] can be used as a starting point for developing further interaction techniques with an unambiguous set of gestures that helps users create sculptures in 3D.
• Deep learning
  o Deep learning has proved to deliver higher classification accuracy than traditional ANN techniques, but it requires large data sets and hence computing power. Once trained, however, it performs far better.
• Others
  o Many other features are possible, such as network-based multi-user interaction, enhanced brush and color support, integration of a physics engine, and increased interactivity with the user. Multiple users could also draw at the same instance, using robust distributed computing.
References
1) Roope Raisamo, "Multimodal Human-Computer Interaction: a Constructive and Empirical Study", Academic Dissertation, University of Tampere, 1999.
2) Chris Hand, "A Survey of 3D Interaction Techniques", Computer Graphics Forum, 16(5), pp. 269–281, Wiley, 1997.
3) Christoph Arndt and Otmar Loffeld, "Information Gained by Data Fusion", SPIE Conference Volume 2784, 1996.
4) Dipen Dave, Ashirwad Chowriappa and Thenkurussi Kesavadas, "Gesture Interface for 3D CAD Modeling using Kinect", Computer-Aided Design & Applications, 9(a), 2012.
5) Gabrielle Odowichuk, Shawn Trail, Peter Driessen, Wendy Nie, "Sensor Fusion: Towards a Fully Expressive 3D Music Control Interface", University of Victoria, 2011.
6) Matthew Tang, "Recognizing Hand Gestures with Microsoft's Kinect", Stanford University, 2011.
7) Rufeng Meng, Jason Isenhower, Chuan Qin, Srihari Nelakuditi, "Can Smartphone Sensors Enhance Kinect Experience?", MobiHoc '12, June 11–14, 2012.
Thank you!
Questions?