Grammarly Report generated on Saturday, May 6, 2017, 7:40 AM
Document score: 93 of 100
Issues found in this text: 54
Plagiarism: checking disabled
Contextual Spelling: 5 (Confused Words 3; Mixed Dialects of English 1; Misspelled Words 1)
Grammar: 25 (Determiner Use (a/an/the/this, etc.) 11; Faulty Subject-Verb Agreement 5; Incorrect Verb Forms 4; Incorrect Noun Number 2; Wrong or Missing Prepositions 1; Pronoun Use 1; Incorrect Phrasing 1)
Punctuation: 11 (Comma Misuse within Clauses 6; Punctuation in Compound/Complex Sentences 5)
Sentence Structure: 1 (Incomplete Sentences 1)
Style: 12 (Passive Voice Misuse 4; Wordy Sentences 4; Improper Formatting 3; Politically Incorrect or Offensive Language 1)
Vocabulary enhancement: no errors
Abstract
More and more datasets for Human Activity Recognition (HAR) have been made available to the public in recent years. HAR has gained attention due to its wide range of applications, from surveillance, medical and personal assistive tools, and robotics to human-machine interaction. With deep learning techniques applied recently, especially to image classification, researchers have increasingly shifted their focus from traditional processing to deep learning. However, extracting the correct features for further processing is still a challenge, and traditional techniques are still used in HAR to avoid the computational complexity that comes with deep learning. Understanding human behavior is a challenging problem in computer vision, and we have recently witnessed significant advances with novel methods proposed for tracking, pose estimation, and movement recognition. This survey is a succinct description of the existing techniques and methods applied in HAR, following previous surveys and papers.
Keywords: Human action recognition, Activity recognition, Feature extraction
1. Introduction
Interest in human motion dates back to the fifteenth century, when Leonardo da Vinci studied human appearance to help his students draw human actions such as people climbing, going upstairs, or going downstairs [https://www.slideshare.net/zukun/cvml2011-human-action-recognition-ivan-laptev-9017571]. In this work, one of the well-documented early studies related to human action recognition, da Vinci insisted that a painter
should be fully aware of the body structure (nervous system, muscles, bone structure, etc.) to understand various motions.
Intelligent environments (intelligent homes, intelligent electronic devices) exploit data collected from users and anticipate the probability of the outcome, whether bad or a worst-case scenario. The system can get the information, interpret it, and then take or suggest an action. We are in the era of intelligent automated systems, and common tasks such as walking, standing, running, and sleeping are being studied and interpreted by computer systems.
Identifying humans from video sources has attracted increasing attention in several application domains, such as content-based video annotation and retrieval, video surveillance, and other applications [1]–[3]. Giving semantic meaning to human action or behavior is challenging, however; in fact, it is not necessarily easy to understand what an action means. This complexity is a source of challenges from an academic point of view. There is no single best way to categorize the research, but mainly following [4] we can distinguish three types: Surveillance, Control, and Analysis.
People counting and crowd flux, flow, and congestion analysis in public areas such as train stations, bus stations, or malls [5] belong to Surveillance applications; human-computer interfaces [6] and virtual reality belong to Control applications; and patient diagnosis belongs to Analysis applications of human action recognition or the computer vision field.
The potential number of applications, the speed and price of current hardware, especially in low-income countries, and the focus on security issues have intensified the work within the computer vision community toward retrieving, collecting, and analyzing human behavior. Furthermore, the
rise of terrorism and security issues has greatly expanded the research field, especially in security [7], that is, in surveillance.
Major applications of HAR are found in security, medicine, entertainment, and interaction. Thanks to previous studies, counterterrorism teams can detect and predict suspicious behavior from a number of patterns and techniques. In medicine, personal devices can help provide a live and accurate health status of a patient (in particular elderly people) and thus enable a direct and quick response from the doctor. In entertainment, HAR methods can help identify and even predict a player's next move, and in interaction, HAR methods provide robotic systems that come close to perfectly expressing, understanding, and reflecting human behavior. Depending on the complexity of the situation at hand, three categories can be defined: action behavior, gesture behavior, and interaction behavior [8], as in Figure 1 below.
An action is a form of expression composed of different gestures; running and climbing are examples of common actions, which have variable timing. A gesture is a non-vocal form of communication in which the actor expresses and exchanges information via one part, or a combination of parts, of the body, mostly the hands, feet, and head. A gesture often does not last long. An interaction is an action during which actors (human or non-human) exchange information or interact, as in hugging or scanning a QR code using one device over another device.
Human Activity Recognition faces challenges and issues such as intra-class variations, viewpoint variations, environmental complexities, occlusions, and more, so current systems still do not produce accurate results. Although studies in HAR date back decades, researchers are still trying to come close to the human capacity
of taking a few items from a series and categorizing them (what will later be called the filter or training set) and, from these filters, being able to classify any other element encountered. In computer vision, researchers are trying to match that human ability. We must acknowledge that significant advances have been made so far, even though they still cannot match the human vision system.
There are methods with manually designed features and data-driven approaches, which are distinguished by the way classification is applied. Manually designed features include the Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), Scale-Invariant Feature Transform (SIFT), Hessian3D, and Enhanced Speeded-Up Robust Features (ESURF). Data-driven approaches mostly use deep learning, where the features are detected, interpreted, and processed automatically by the system, in contrast to older approaches where the features are chosen by a human.
In general, traditional approaches apply a bottom-up methodology in three steps: foreground detection, feature extraction, and finally classification (Figure 2). As previously noted, multiple surveys and reviews have been published with different taxonomies and approaches to human action recognition. [8] classifies HAR into two categories, the single-layered approach and the hierarchical approach. Single-layered approaches focus on gestures and actions, in other words low-level human activities, whereas hierarchical approaches focus on more complex, high-level human activities (with sub-events sometimes being mentioned in this context). The single-layered approach is subdivided into space-time and sequential approaches, and the hierarchical approach into statistical, syntactic, and description-based approaches.
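The three bottom-up steps named above can be sketched on a toy example. This is only an illustration under stated assumptions: the frame-differencing threshold, the bounding-box aspect-ratio feature, and the two class prototypes are hypothetical choices made for this sketch, not the method of any surveyed paper.

```python
# Toy bottom-up pipeline: foreground detection -> feature extraction -> classification.
# Frames, threshold, and prototypes are made-up illustrative values.

def foreground_mask(frame, background, threshold=30):
    """Step 1: mark pixels that differ from the background model."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(row, brow)]
            for row, brow in zip(frame, background)]

def extract_features(mask):
    """Step 2: describe the silhouette by (area, bounding-box height/width)."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return (0, 0.0)
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return (len(coords), height / width)

def classify(features, prototypes):
    """Step 3: nearest-prototype classification on the aspect ratio."""
    _, ratio = features
    return min(prototypes, key=lambda label: abs(prototypes[label] - ratio))

background = [[10] * 6 for _ in range(6)]
frame = [row[:] for row in background]
for r in range(1, 5):            # a tall, thin bright blob: 4 rows x 1 column
    frame[r][2] = 200

prototypes = {"standing": 4.0, "lying": 0.25}   # hypothetical aspect ratios
mask = foreground_mask(frame, background)
features = extract_features(mask)
label = classify(features, prototypes)
print(features, label)
```

Real systems replace each step with far stronger components (learned background models, HOG or LBP descriptors, an SVM), but the data flow stays the same.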
[9] presented the available resources, datasets, and libraries, as well as the challenges of HAR, dealing with the problems of background subtraction: change detection and salient motion detection. Other researchers study video-based
representation; in particular, [10] categorizes global and local feature extraction, where background construction-based methods and foreground extraction-based methods were used. [11] and [12] respectively provide a review covering the stages of HAR, from low-level processing to high-level feature processing and applications with a focus on healthcare, and a review of object segmentation, image processing, and activity recognition that briefly covers sensor-based and vision-based methods, the Hidden Markov Model (HMM), and Principal Component Analysis (PCA).
Occlusion, variation in execution rate, anthropometry, camera motion, and background clutter are some of the challenges faced in HAR, as mentioned earlier and as noted in [13]. Mid-level feature representation using a sparse classifier for discriminative part selection was studied in [14]; similarly, [15] studied a confidence-based approach to HAR by proposing a method for choosing between Dense Trajectories (DT) features and high-level pose features. A literature review of semantic-based HAR systems using semantic features is presented in [16].
Acquiring data is one of the most essential steps in computer vision, and data can be obtained from multiple sources. As such, the overall functionality of the system is affected by the use of an appropriate tool, and convincing improvements have been made toward this end [17][18]. Depending on dimensionality and depth, the data obtained from these devices are classified as 2D or 3D. When data are acquired in 2D form, the information from one dimension is lost, because real-world data are usually three-dimensional. This also implies that systems applying a 3D approach tend to be more accurate than 2D systems.
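The loss of one dimension can be made concrete with a pinhole projection. The sketch below assumes an idealized camera at the origin with a focal length of 1; the two joint positions are hypothetical values chosen to lie on the same viewing ray.

```python
def project(point, focal_length=1.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (focal_length * x / z, focal_length * y / z)

near_hand = (0.5, 1.0, 2.0)   # hypothetical joint position, 2 m from the camera
far_hand = (1.0, 2.0, 4.0)    # a different position, 4 m away on the same ray

p_near = project(near_hand)
p_far = project(far_hand)
print(p_near, p_far)          # both map to the same image point
```

Two distinct 3D positions become indistinguishable in the image, which is the depth information that 3D sensors preserve.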
Reviews and surveys of HAR exist, but because of the popularity the field is gaining, those documents are becoming outdated; writing a review in a field where improvements are so frequent is intrinsically challenging. In
this paper, we contribute a discussion and comparison of the methods applied in HAR. The rest of the survey is organized as follows: Section 2 discusses the manual design features approach, Section 3 discusses the data-driven approach (deep and non-deep learning), Section 4 provides some discussion, Section 5 introduces some existing datasets, and Section 6 concludes.
2. Manual design features
The manual design features approach applied in HAR has accomplished impressive results over the years of its application. The approach uses feature detectors (global or local) for low-level or high-level features, passing through mid-level features, to extract important features (properties of a portion of the overall image or sequence of images). It then classifies them by training a classifier such as the Support Vector Machine (SVM) [19][20][21][22]. The approach includes space-time-based, space-time volume, space-time trajectory, space-time feature, appearance-based, shape-based, motion-based, hybrid, local binary pattern, and fuzzy logic-based techniques, as shown in Figure 2, with an emphasis on low-level, mid-level, and high-level features [23], spatio-temporal features inspired by the data model of [24][25], and many more that have attained good results for action recognition.
The popularity of human action recognition, or human behavior recognition, has led to numerous published articles and papers [6], [26]–[31]. These articles focus on the different features and classifiers used in human behavior recognition. In practice, considerable hardware resources and vision algorithms are required to process the data (acquiring, saving, and processing 2D and 3D, fixed and moving data inputs). 3D data can be obtained mainly through three categories of components: marker-based motion capture systems, of which MoCap [http://mocap.cs.cmu.edu/] is the
perfect illustration; stereo cameras; and range or depth sensors such as the Microsoft Kinect.
Although vision-based action recognition continues to grow, various challenges are still not completely resolved: the variety of actions, the mood of the actor, occlusion, camera position, background, etc. Some researchers have used wearable inertial sensors, including accelerometers and gyroscopes (mostly in smartphones) [32]–[37], to address these issues. Even though there are many papers on human behavior recognition using either depth sensors or inertial sensors, the purpose of this survey is to report on the current state of applications in the computer vision field.
Acquiring 3D data requires tools. The basic one, affordable to almost everyone, is the Kinect (Microsoft or Xbox), but the cheapest and easiest tool is the smartphone, with the latest statistics reporting 2.32 billion users worldwide (Figure 3) [https://www.statista.com/statistics/330695/number-of-smartphone-users-worldwide/ (accessed April 2017)]. The Kinect sensor includes a camera, an infrared depth sensor, a microphone, and an LED light, as shown in Figure 4 and Figure 5. It can capture 8-bit and 16-bit data per channel at resolutions of 320×240 and 640×480 pixels. Heterogeneous methods have been applied to process the data obtained from these tools [2], [3], [38]–[41].
Wearable inertial sensors are usually attached to or placed directly on the human (smartphones and other sensor equipment) and, in rarer cases, indirectly; they generate acceleration and rotation signals corresponding to an action performed by the actor (mostly a human). Figure 4 shows a captured image of a 3D skeleton as a data source. Acquiring 2D information, by contrast, requires only a simple tool accessible to all, such as a mobile phone with a camera. This shows that accessing 2D data is more
straightforward than accessing 3D data.
2.1. Appearance-based approach
Shape-based, motion-based, and hybrid approaches are discussed in this part, where methodologies and techniques are applied to 2D as well as 3D data. Shape-based recognition is the objective pursued in [42], whose authors propose a bag-of-words (BOW) framework to classify each frame of a video, and in [43]; a tensor shape descriptor and tensor dynamic time warping were used by [44]. Other articles also applied appearance-based approaches: gesture recognition [45] and blob analysis [46].
2.1.1. Shape-based
In this approach, features are obtained from the shape of the silhouette. [47] obtained 3D data that were converted to 2D using the spatial distribution of gradients; the data were then processed with the R-transform, and the technique was applied to the Weizmann, KTH, and Ballet datasets. In [17], the authors analyzed feature maps to separate the silhouette from a noisy background; the framework then performs tracking to check the silhouette's movement in the scene. The method creates a sequence of scenes from the human silhouette map representation and uses a hybrid classifier. In practice, an HAR method should be computationally lean. A similar method was proposed using a K-nearest-neighbor classifier in [48].
[49] proposed a pose-based, view-invariant HAR method based on contour points with a sequence of multi-view key poses. In [50], the authors employ the contour points of the human silhouette and a radial scheme, with an SVM as classifier. [51], [52] build a region-based descriptor by extracting features from the regions surrounding the silhouette in the image. [52] used pose information by first extracting scale-invariant features, then clustering them to build the key poses, and finally classifying with a weighted voting scheme.
2.1.2. Motion-based
In this approach, the representation is built from motion
features, which are used with a generic classifier. A motion descriptor was proposed in [53] for representing unconstrained videos. The descriptor is based on explicit motion modeling operating on codewords generated by dense local patch trajectories, and so does not need foreground-background separation. Another motion-based method was introduced by [54] using histograms of oriented gradients. In [55], an action recognition method was proposed based on a human-object interaction descriptor and pose estimation. Other authors applied kinematic spline curves [56], multiple key motion history images [57], motion trajectories [58], and joint motion similarity [59].
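To make the notion of a motion history image (MHI) more tangible, here is a minimal sketch of the commonly used update rule: a pixel where motion is detected is set to a maximum value tau, and every other pixel decays by one per frame. The one-row "image", the binary motion masks, and tau = 3 are assumptions for illustration and are not taken from [57].

```python
def update_mhi(mhi, motion_mask, tau):
    """One MHI step: moving pixels are set to tau, the rest decay by 1 down to 0."""
    return [[tau if moving else max(0, h - 1)
             for h, moving in zip(h_row, m_row)]
            for h_row, m_row in zip(mhi, motion_mask)]

tau = 3
mhi = [[0, 0, 0]]                                   # single-row image for readability
masks = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]     # a blob moving left to right
for mask in masks:
    mhi = update_mhi(mhi, mask, tau)
print(mhi)   # brighter values mark more recent motion
```

The resulting intensity ramp encodes both where motion occurred and in which direction it progressed, which is what makes a single MHI usable as a motion template.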
2.1.3. Hybrid
Hybrid approaches combine shape-based and motion-based features. Map-level and silhouette-based shape features were used to separate noise from the actual silhouette in [54], followed by histograms of oriented gradients for better classification. Other methods based on the hybrid approach were proposed in [60], [61]. BOW and a block-wise weighted kernel function matrix were used for multi-view recognition in [62], while [63] applied shape-motion prototype trees, representing an action as a sequence of prototypes and using a distance measure for sequence matching; the method was tested on five datasets. [64] proposed a key-pose method as a variant of motion energy and motion history images with a simple nearest-neighbor classifier.
2.2. Space-time-based approach
These approaches focus on recognizing activities based on space-time features or on trajectory matching; an activity is represented by a set of space-time features. The approach has four major components: the space-time interest point detector, with two sub-categories, dense detectors and sparse detectors; the feature descriptor, with local and global features as types; the vocabulary, comprising BOW and model-based
approaches; and finally the classifier, with supervised and unsupervised categories. Figure 6 shows an example of human actions with dense trajectories as applied in [65].
Figure 7 shows the major components available and applied in the space-time approach. Moreover, [66] employed motion features as input to hidden conditional random fields (HCRF) to tackle a much broader range of complex hidden structures, whereas [67] proposed real-time classification and prediction of actions.
An action descriptor of HIP, relying on the work of [68], was proposed by [69], and [70] proposed incorporating information from human-object interactions, applied over several datasets.
2.2.1. Space-time volumes
In [71], an HAR system was proposed using temporal-spatial semantics; instead of using space-time volumes (STV), the authors used templates composed of 2D observations. The approach was then extended by [72], where motion history images, a foreground image approach, and HOG were combined, with SMILE-SVM finally used for classification. Applying space-time-based approaches to different datasets has shown outstanding accuracy, such as 98.2% on the KTH dataset in [73] and 89.4% on the UCF (University of Central Florida) dataset in [74], which used discriminative clustering, tree mining, tree clustering, and ranking to select discriminative tree patterns.
2.2.2. Space-time trajectories
A human action can be seen as a set of spatio-temporal trajectories. Space-time trajectories have different levels of abstraction, from low-level trajectories to high-level trajectories such as handwritten characters. However, all space-time trajectory approaches share a common property: time-structured patterns. Space-time trajectories are applied to joint positions (body joints) to differentiate actions. From this notion, many papers have
been published and approaches proposed [75], [76].
Inspired by dense sampling in image classification, [65] introduced the concept of dense trajectories applied to video action recognition. Dense points are sampled from each image frame and tracked using displacement information from a dense optical flow field. The approach is robust to irregular motion changes. [77] improved on the work of [65] by using SURF descriptors and dense optical flow to optimize the estimation. However, when the approach is applied to videos with a high density of trajectory features, the computational cost increases. There have been attempts to reduce this cost; a saliency map method was used to capture salient regions within a frame, as in [78], [79], [80]. Applying the saliency map makes it possible to drop some dense trajectory features during processing without compromising the frame input.
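The tracking step behind dense trajectories can be sketched as follows: a sampled point is moved, frame by frame, by the displacement that the optical flow field stores at its current position. This is a simplified sketch; the flow fields below are synthetic and uniform, the nearest-pixel lookup is an assumption, and the filtering of the flow field used in practice is omitted.

```python
def track_point(start, flows):
    """Follow one sampled point through a list of dense flow fields.

    Each flow field is a grid with flow[y][x] = (dx, dy). The point is moved by
    the displacement at its rounded position: P(t+1) = P(t) + flow_t(P(t)).
    """
    x, y = start
    trajectory = [(x, y)]
    for flow in flows:
        height, width = len(flow), len(flow[0])
        xi = min(max(int(round(x)), 0), width - 1)    # clamp to the image
        yi = min(max(int(round(y)), 0), height - 1)
        dx, dy = flow[yi][xi]
        x, y = x + dx, y + dy
        trajectory.append((x, y))
    return trajectory

def uniform_flow(dx, dy, size=5):
    """Synthetic flow field in which every pixel moves by (dx, dy)."""
    return [[(dx, dy)] * size for _ in range(size)]

flows = [uniform_flow(1, 0), uniform_flow(1, 0), uniform_flow(0, 1)]
trajectory = track_point((0, 0), flows)
displacements = [(x1 - x0, y1 - y0)
                 for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]
print(trajectory)
```

The sequence of displacements, rather than the absolute positions, is what typically describes the shape of a trajectory, and every additional tracked point adds to the computational cost mentioned above.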
In 2016, two major publications [81], [82] represented skeleton shapes as trajectories on Kendall's shape manifold. The method uses transported square-root vector fields (TSRVFs) of trajectories and the standard Euclidean norm to reduce the computational cost and increase computational efficiency. [83] used HOG, HOF, and MBH descriptors for trajectories and recorded the highest accuracy. [53] proposed an explicit motion modeling method to address the challenge of HAR in unconstrained video input.
2.2.3. Space-time features
In general, space-time features are local properties that contain discriminative action characteristics. They can be divided into two categories: sparse and dense. Feature detectors based on interest point detectors, such as BOW [84] and 3D HOF [85], belong to the sparse category, while those based on optical flow belong to the dense category. They (interest point
detectors) provide the foundation for most recently proposed methods (algorithms).
[86] built a feature descriptor framework and applied PCA-SVM for classification, and [87] used a comparison of Harris3D and Multimodal Decomposable Models for classification. BOW is still the most popular method for representation, with all its variations such as BOVW, which follows a feature extraction step, a codebook generation step, an encoding step, and a pooling step [88], [89], [90], [91]. The performance of the BOVW variant of BOW is due to effective low-level dense trajectory features. To further improve space-time feature methods and provide better performance, some researchers applied Fisher vectors and space-time occurrence.
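The encoding and pooling steps listed above can be illustrated with a small sketch. It assumes the codebook has already been generated (in practice typically by clustering training descriptors); the two-dimensional descriptors, the three codewords, and the hard-assignment encoding are hypothetical simplifications.

```python
def nearest_codeword(descriptor, codebook):
    """Encoding step: index of the closest codeword (squared Euclidean distance)."""
    distances = [sum((d - c) ** 2 for d, c in zip(descriptor, word))
                 for word in codebook]
    return distances.index(min(distances))

def bovw_histogram(descriptors, codebook):
    """Pooling step: normalized histogram of codeword assignments."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_codeword(descriptor, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]                    # assumed codewords
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]     # one video's features
histogram = bovw_histogram(descriptors, codebook)
print(histogram)
```

The fixed-length histogram is what a classifier such as an SVM receives, regardless of how many local descriptors the video produced.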
Space-time approaches that rely on global feature detectors have the disadvantage of being sensitive to noise and occlusion. The presence of multiple persons in a scene makes it hard for space-time approaches to recognize actions, since space-time features focus mainly on spatio-temporal information. Other limitations are that STV approaches lack the capacity to recognize multiple entities (persons) in a multi-person image frame, and trajectory-based approaches lack precision in localizing joint positions. The space-time approach, even though suitable for simple datasets, requires combining multiple features to handle complex datasets, which also increases the computational complexity. To overcome these limitations, we may apply background subtraction, sliding windows, and other methods.
2.3. Other approaches
In contrast to the previous sections, there are other methods, techniques, and approaches that can be categorized as traditional but do not fit into the appearance-based or space-time approaches above. We group them under other approaches, such as the Local Binary Pattern and fuzzy logic-based approaches.
2.3.1. Local binary pattern
This is a type of visual descriptor, or Texture Spectrum model, used for classification in computer vision and introduced to the field in 1990 by [92] [https://en.wikipedia.org/wiki/Local_binary_patterns]. Since its introduction, LBP combined with HOG has shown considerable improvement in detection performance, and a full survey of the different LBP versions was published by [93] in 2016.
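As a concrete illustration of the descriptor, the sketch below computes the basic 3×3 LBP code of a pixel and the histogram of codes over an image. The clockwise neighbor order starting at the top-left corner and the "greater than or equal" comparison are conventions assumed here; implementations differ in both.

```python
def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch (a list of 3 rows).

    Neighbors are visited clockwise from the top-left corner; a neighbor
    contributes a 1 bit when its value is >= the center value.
    """
    center = patch[1][1]
    neighbors = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                 patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbors):
        if value >= center:
            code |= 1 << bit
    return code

def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist

patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)
print(code)
```

The histogram, not the individual codes, is usually the feature vector handed to the classifier.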
Several versions have been proposed for different classification tasks [94], [95]. A face recognition method was proposed by [96] based on a nearest-neighbor interpolation classifier. This method was applied to the Olivetti Research Laboratory dataset, resulting in a recognition rate of 97.5%. Another human action recognition approach, using LBP with a Gaussian mixture, was presented in [97]; on top of the intensity difference property of LBP, the authors' method introduces the extraction of multiple features with error-correcting output codes applied over a support vector machine classifier.
The local binary pattern approach has also been applied to multi-view HAR, as in [98], where a multi-view method based on contour-based pose features and uniform rotation-invariant patterns is used with a support vector machine classifier. The Motion Binary Pattern was introduced for multi-view HAR by [99], in combination with the Volume Local Binary Pattern and optical flow. It was tested on the INRIA Xmas Motion Acquisition Sequences (IXMAS) dataset, with a reported accuracy of 80.5%.
2.3.2. Fuzzy logic
Traditional approaches employ spatial or temporal features with a generic classifier for representation and classification. However, it is challenging to handle the uncertainty and complexity involved in real-world applications. To resolve this issue, the fuzzy logic approach was introduced, benefiting from its characteristic of
treating the truth of a statement as a real value anywhere in the range 0 to 1, rather than only the integers 0 and 1. The notion and term were first introduced in 1965, in fuzzy set theory, by Lotfi Zadeh [https://en.wikipedia.org/wiki/Lotfi_A._Zadeh].
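A short sketch may help show what a truth value between 0 and 1 looks like in practice. It uses triangular membership functions and the min/max operators for fuzzy AND and OR; the "walking" and "running" sets and their speed ranges are hypothetical values invented for this example.

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

speed = 2.0                                   # hypothetical movement speed in m/s
walking = triangular(speed, 0.0, 1.5, 3.0)    # assumed fuzzy set "walking"
running = triangular(speed, 1.0, 4.0, 7.0)    # assumed fuzzy set "running"
print(walking, running, fuzzy_and(walking, running), fuzzy_or(walking, running))
```

A speed of 2 m/s is thus partly "walking" and partly "running" at the same time, which is how fuzzy systems represent the uncertainty that a crisp threshold would hide.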
To handle this uncertainty, fuzzy logic-based approaches have been applied, as in [100], based on Interval Type-2 Fuzzy Logic Systems with feature information optimized by the Big Bang-Big Crunch algorithm. The experiments were performed on the Weizmann human action dataset, where the method outperformed the equivalent Type-1 Fuzzy Logic System and non-fuzzy methods in recognition accuracy and analysis performance. In [101], the authors used silhouette slice features and movement speed features, and employed the fuzzy c-means clustering technique to acquire the membership function. In [102], a fuzzy logic-based classifier was used to recognize human intention, and [103] applied a fuzzy view estimation framework to predict the squat evolution of scenarios.
Most HAR approaches are view-dependent and recognize an activity from a fixed viewpoint. However, in real-world applications, recognition must work from any viewpoint, which motivates the use of multiple cameras to collect the data; this solution is difficult in practice because of camera calibration. Along these lines, [104] proposed a view-invariant method using a single camera and a clustering algorithm; the method was applied to the IXMAS dataset. In addition, other approaches focused on neuro-fuzzy systems have been proposed, for gesture recognition in particular [105] and for other behavior recognition [106], and have also been very successful.
3. Data-Driven Approach
As mentioned previously, the performance of HAR depends on the methods and the appropriately chosen features, as well as on an efficient representation of the data.
Unlike traditional approaches, where the action is represented by hand-picked feature detectors and descriptors, learning-based approaches can automatically learn the features from raw data, thereby introducing the concept of end-to-end learning, meaning conversion from the pixel level to the action classification level. These approaches are grouped into non-deep learning approaches and deep learning approaches, as shown in Figure 8 below.
3.1. Non-deep learning-based
As one of these categories, dictionary learning is a type of representation generally focused on sparse representation. It has been used in many applications, such as image classification and action recognition [107]. The concept is similar to the BOVW methodology because it is based on vector representation; these vectors are called codewords, or sometimes dictionary atoms. In [108], four datasets were studied, with the authors applying spatio-temporal motion features.
Genetic programming is an evolutionary technique inspired by the process of natural evolution. It may be used to solve problems without prior knowledge and to help maximize recognition performance. Along the way, feature descriptors have been evolved using 3D operators such as the 3D Gabor filter and wavelets.
[109] proposed a discriminative Bayesian method, evaluated on five datasets, to recognize actions and faces. [110] addressed the problem of cross-view action recognition by using a transferable dictionary pair. The authors learn view-specific dictionaries, where each dictionary corresponds to one camera view. Moreover, [111] extended the work of [110] with a common dictionary technique that acquires information from different views. A weakly supervised dictionary learning-based approach with trace lasso was proposed in [112]. The approach uses a dictionary and fully exploits visual attribute correlations rather than prior label
information. In [111], the authors applied dictionary learning-based methods to cross-view action recognition. This method used two dictionary learning approaches to learn sparse representations of videos regardless of the view, by enforcing correspondence between the videos in a set. It was evaluated on three datasets and showed great performance.
3.2. Deep learning-based
Deep learning is a class of machine learning algorithms that use a cascade of nonlinear processing layers to extract features and transform the input into multiple levels of features. Each layer uses the output of the previous layer as input. The algorithms may be supervised, for classification, or unsupervised, for pattern analysis [https://en.wikipedia.org/wiki/Deep_learning#Definitions].
Previous studies on different datasets show that the traditional approach does not fully meet the needs of computer vision and action recognition. As such, an HAR system that can automatically determine feature descriptors, and learn and evolve without human intervention, will be crucial for the progress of action recognition. This is where deep learning comes in handy. Past studies have shown how important it is in machine learning, with the aim of learning multiple levels of representation and abstraction to make information meaningful. Deep learning has also shown higher accuracy and performance than the traditional approach, and it is applied to speech, image, video, and text extraction, representation, and classification. As in Figure 8, deep learning can be grouped into two families: unsupervised approaches, such as Deep Belief Networks, Deep Boltzmann Machines, Restricted Boltzmann Machines, and regularized auto-encoders, and supervised approaches, such as Deep Neural Networks, Recurrent Neural Networks, and Convolutional Neural Networks.
However, owing to the success of models such as the support
vector machine and the lack of available training data, deep learning approaches received little attention in the early days of the computer vision field, and of action recognition in particular.
3.2.1. Unsupervised deep learning model
During training, this model needs no class labels, meaning it is applied when labeled data are unavailable. In 2006, the work of [113] triggered the notion of deep learning by proposing the deep belief network method, which uses an unsupervised algorithm to train a DNN one layer at a time. The same year, [114] followed the same path by proposing a feature reduction technique for deep learning. Since the introduction of the deep learning approach, there has been increasing interest in applying it to diverse applications, whether image classification, human action recognition, speech recognition, healthcare systems, intelligent homes, object recognition, or more.
[115] proposed an unsupervised learning approach for video action recognition, in which the authors used a spatial appearance feature and incorporated it with a CNN. The proposed solution was applied to the ImageNet dataset. [116] proposed a DBN with Restricted Boltzmann Machines. Although unsupervised approaches offer higher performance than the traditional approaches seen before, researchers still face a challenge, because processing unlabeled video data remains difficult.
To shed some light on this, [117] used an unsupervised approach in which data were collected from four different datasets and processed with hybrid feature models and active learning. Another study using Deep Belief Networks was proposed by [118], where the authors used skeleton coordinate features obtained from depth images. Even though the unsupervised approach has shown good performance in application,
researchers are abandoning the unsupervised approach in favor of the supervised approach, especially with the rise of Convolutional Neural Networks. However, the study in [119] argues that in the future the unsupervised approach will be applied more than the supervised approach because, just as human recognition and identification of objects come from observation and not from being told, future systems will be able to recognize elements without supervision.
3.2.2. Supervised deep learning model
There has been a significant increase in studies related to deep learning in recent years, whether applied to classification, texture modeling, regression, information retrieval, robotics, fault diagnosis, or many other tasks, with deep CNNs or RNNs. Many reasons can be listed for this, but here we name only access to data, access to equipment, and computational capabilities.
To date, the CNN is considered one of the most effective and
powerful solutions for action recognition. It has shown strong
performance across applications and tasks such as HAR, image
classification, and handwriting recognition [120], [121], [122],
[123]. A Convolutional Neural Network passes the input through
multiple layers: convolutional layers, Rectified Linear Units
(ReLU), pooling layers, and fully connected layers. In theory, only
three categories are usually cited: convolution layers, subsampling
layers, and full connection layers, as in Figure 9.
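The layer types listed above can be illustrated with a minimal sketch in plain Python. It is not taken from any of the cited works: the function names (`conv2d`, `relu`, `max_pool`, `fully_connected`), the 5x5 input, and the filter and weight values are all assumptions chosen for readability.

```python
# Illustrative sketch only: a tiny forward pass through the layer types
# named in the text (convolution, ReLU, pooling/subsampling, fully connected).
# All names, sizes, and weights here are assumptions, not a cited method.

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu(feature_map):
    """Rectified Linear Unit: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling (the subsampling layer)."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

def fully_connected(feature_map, weights, bias):
    """Flatten the feature map and take one weighted sum per output unit."""
    flat = [v for row in feature_map for v in row]
    return [sum(w * x for w, x in zip(ws, flat)) + b
            for ws, b in zip(weights, bias)]

image = [
    [1, 2, 0, 1, 3],
    [0, 1, 3, 1, 0],
    [2, 1, 0, 0, 1],
    [1, 0, 2, 3, 1],
    [0, 1, 1, 0, 2],
]
kernel = [[1, 0], [0, -1]]          # hypothetical 2x2 filter

conv_out = conv2d(image, kernel)    # 4x4 feature map
act_out = relu(conv_out)            # negatives clipped to 0
pool_out = max_pool(act_out)        # 2x2 feature map
scores = fully_connected(pool_out, [[1, 1, 1, 1], [1, -1, 1, -1]], [0, 0])
```

In a trained network the filter and the fully connected weights would be learned from labeled data rather than fixed by hand as they are here.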
[124] extended the work of [125] by applying the technique to video,
using fixed scene frames as the input data matrix; unfortunately,
the resulting performance was not useful. Later, [126] used a
two-stream convolutional neural network with late fusion to resolve
the issues faced by [124], and the method produced strong results.
However, because of its computational complexity, the two-stream
technique is not recommended for real-time applications.
In general, deep learning deals with information expressed in two
dimensions, but some applications work with three-dimensional data
and therefore require a 3D convolutional neural network. [127] and
[128] both applied 3D CNNs. The first reached a high sensitivity of
93.16% with an average of 2.74 false positives for the detection and
recognition of microbleeds in magnetic resonance images. The second,
inspired by VoxNet and 3D ShapeNets, applied a 3D CNN to the
ModelNet dataset to recognize and classify the data.
Issues remain, such as computational complexity and the amount of
data required to build a fully reliable HAR system. Along these
lines, [129] proposed a CNN variant called the Factorized
Spatiotemporal Convolutional Network. The approach factorizes the
standard three-dimensional convolutional model into two-dimensional
spatial kernels to reduce the number of learned parameters and the
complexity of the network. Another study, also using 3D CNNs, was
carried out by [129]. It shows that the approach captures
spatiotemporal information better than 2D models and that, combined
with a linear classifier, it exceeds state-of-the-art methods. Other
researchers have combined traditional approaches with CNNs, arguing
that this improves performance [130]. Another CNN variant was
introduced in [131] by adjusting a pre-trained convolutional neural
network, extracting frame-level features, and applying PCA and an
SVM.
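The parameter saving that motivates factorizing a 3D kernel can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions, not the exact architecture of [129]: it compares one full 3D kernel bank with a 2D spatial kernel bank followed by a 1D temporal kernel bank. The temporal stage, the channel counts, and the kernel sizes are assumptions chosen for the example, and bias terms are ignored.

```python
def params_3d(k_t, k_h, k_w, c_in, c_out):
    """Weights in a full 3D convolution layer (bias terms ignored)."""
    return k_t * k_h * k_w * c_in * c_out

def params_factorized(k_t, k_h, k_w, c_in, c_out):
    """Weights when the layer is split into a 2D spatial convolution
    (k_h x k_w) followed by a 1D temporal convolution (k_t).
    Assumption: the temporal stage maps c_out channels to c_out channels."""
    spatial = k_h * k_w * c_in * c_out
    temporal = k_t * c_out * c_out
    return spatial + temporal

# Hypothetical layer: 3x3x3 kernels, 64 input and 64 output channels.
full = params_3d(3, 3, 3, 64, 64)
split = params_factorized(3, 3, 3, 64, 64)
print(full, split, round(split / full, 3))
```

For this hypothetical layer the factorized form needs fewer than half the weights of the full 3D kernel bank; the exact ratio depends on the kernel sizes and channel counts.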
Another supervised deep learning method uses semantic features such
as pose, which are also used in computer vision to describe an
action [132], [133]. The descriptors of this method are based on
motion and appearance information from the joints of human body
parts. The approach was evaluated on the Berkeley MHAD, JHMDB, and
MPII Cooking datasets, and the results showed better performance
than some other techniques. [134] used contextual information and
adapted the region-based CNN for classification, and [135] addressed
the task of semantic image segmentation. [136] proposed a method for
multi-view data sources that learns from 2D dense trajectories and
rendered synthetic 3D models, showing once more that deep learning
approaches perform far better than traditional approaches.
Learning-based approaches also have the advantage of learning
features from raw or unlabeled data.
However, learning-based approaches have limitations, such as the
amount of data needed for training. To address this problem, [137]
proposed in 2016 a dataset comprising 200 actions and 849 hours of
video to support learning-based algorithms. Recent statistics show
increasing interest in computer vision, human action recognition,
and convolutional neural networks, so it would not be surprising if
researchers developed breakthrough algorithms for action recognition
in the near future.
4. Discussion
A low-level feature is a portion of an image that reduces the
complexity of the image by keeping only the properties related to a
certain pattern. The input may thus be an image with M values on the
X axis, N values on the Y axis, and 3 values for the RGB color
property, which gives the value of a low-level feature entity.
Extracting such valuable information is one of the first tasks faced
by a computer vision system.
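As a concrete illustration of an M x N x 3 input and a low-level feature derived from it, the sketch below builds a small RGB image as nested lists and computes a coarse color histogram. The helper name `color_histogram`, the pixel values, and the choice of 2 bins per channel are assumptions made for the example; color is only one of several possible low-level properties.

```python
def color_histogram(image, bins_per_channel=2):
    """Count pixels per quantized (R, G, B) bin.

    `image` is a list of rows; each pixel is an (R, G, B) tuple with
    values in 0..255. Returns a flat list of bins_per_channel**3 counts.
    """
    step = 256 // bins_per_channel

    def quantize(value):
        # Clamp so that 255 never falls outside the last bin.
        return min(value // step, bins_per_channel - 1)

    hist = [0] * (bins_per_channel ** 3)
    for row in image:
        for r, g, b in row:
            # Combine the three quantized channels into one bin index.
            idx = (quantize(r) * bins_per_channel + quantize(g)) \
                * bins_per_channel + quantize(b)
            hist[idx] += 1
    return hist

# A hypothetical image with M = 3 columns, N = 2 rows, and 3 color channels.
image = [
    [(255, 0, 0), (250, 10, 5), (0, 0, 255)],
    [(0, 255, 0), (10, 20, 30), (255, 255, 255)],
]
hist = color_histogram(image)
```

The resulting vector has a fixed length regardless of M and N, which is what makes it usable as classifier input.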
As humans, we deal easily with images from the day we are born, and
natural instinct takes the lead in categorizing the environment
around us [138]. One may therefore wonder whether a computer can
also adapt and recognize entities in images. In past years the
answer would have been no, but recent research has made it possible
for a computer to obtain, read, and understand an image, and thus
classify it. To reach this goal, the computer does not read the
input as one single element. This is where division matters,
following the proven principle “divide to better reign”: the system
divides the input into the smallest possible entities and treats
each one as a single element. Three properties are commonly used
during the extraction process: color, shape, and texture [25], [27],
[139]–[143]. The performance of the system depends on a good choice
of features and extraction method.
Given the properties noted above and the importance of feature
extraction in computer vision, predictive modeling, and
probabilistic data mining, challenges remain before Human Activity
or Human Behavior can be fully captured and classified, owing to the
complexity of human actions and reactions to reality, the
environment, and so on. Advances have been made in computer vision
by applying different techniques and methods over multiple
properties to overcome this challenge. Some researchers have
combined face and speech features [144]–[146], text and speech
features [147], [148], or multiple features (posture, speech, face,
etc.) [149]–[151]. Using multiple features increases performance and
accuracy but also increases complexity.
Note that all approaches represent activities (actions, behaviors)
as a sequence of frames over time and spatial locations, whether
extracted from moving entities or still images, and use different
classification models. [70] proposed transferring from one dataset
to another after incorporating information ranging from human
interaction to object interaction. In [66], hidden conditional
random fields were applied to motion feature input, combining
large-scale global features and local patch features to distinguish
various actions. [152] and [153] used random forests for action
representation: the former classifies and localizes human actions in
video using a Hough transform voting framework, and the latter uses
a vocabulary of local appearance-motion features with fast
approximate search in a large number of trees. A real-time algorithm
for describing interactions was proposed in the early 2000s; it
detects and tracks movements and creates a feature vector that is
given as input to a Hidden Markov Model for classification of the
motion [154]. [155] proposed complex activity recognition in two
sequential sub-tasks of increasing granularity, first applying
human-to-object interaction techniques and then using context-based
information to train a conditional random field model.
[156] used self-organizing maps to learn body postures with fuzzy
distances for time-invariant action representation, in an algorithm
based on multilayer perceptrons. [157] proposed a local occupancy
pattern and an actionlet ensemble model, in which the authors first
captured the human body parts and then captured intra-class
variations to handle errors in depth camera data. Other work has
used the interaction between activity and scene to recognize human
activities with 3D skeletal representations and geometric
representations of the scene, as well as appearance-to-pose mapping
for the activity problem. [75] applied Gaussian processes as an
online probabilistic feature using sparse representation to reduce
computational complexity, and [158] used a sparse representation of
skeletal data with a dissimilarity space to recognize behaviors or
activities.
To describe an event, an action can be represented with multiple
features containing meaningful information. As noted in the previous
paragraphs, more and more papers are being published in the computer
vision field. These articles are mostly based on feature fusion,
which can be either early fusion or late fusion. Using a single key
feature can work, but using multiple key features dramatically
increases the chance of better performance and higher accuracy, and
therefore improves recognition performance.
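The difference between early and late fusion can be sketched as follows. This is a minimal illustration under assumed inputs: the feature vectors, the per-modality class scores, and the equal fusion weights are hypothetical and do not come from any of the cited systems.

```python
def early_fusion(feature_vectors):
    """Early fusion: concatenate per-modality features into one vector,
    which a single classifier would then consume."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused

def late_fusion(score_lists, weights=None):
    """Late fusion: each modality has its own classifier; combine their
    per-class scores with a weighted average and pick the best class."""
    if weights is None:
        weights = [1.0 / len(score_lists)] * len(score_lists)
    n_classes = len(score_lists[0])
    fused = [
        sum(w * scores[c] for w, scores in zip(weights, score_lists))
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: fused[c])
    return fused, best

# Hypothetical features from two modalities (e.g. visual and audio).
visual = [0.2, 0.7, 0.1]
audio = [0.9, 0.4]
joint_vector = early_fusion([visual, audio])   # one vector of length 5

# Hypothetical class scores from two separately trained classifiers.
visual_scores = [0.6, 0.3, 0.1]
audio_scores = [0.2, 0.7, 0.1]
fused_scores, predicted = late_fusion([visual_scores, audio_scores])
```

Early fusion lets one classifier model interactions between modalities but grows the input dimension; late fusion keeps each classifier small and tolerates a missing modality more easily, at the cost of ignoring cross-modal feature interactions.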
[150] proposed a novel method that applies Kernel Canonical
Correlation Analysis and Multi-view Hidden Conditional Random Fields
to Human Activity Recognition in order to detect and interpret
agreement and disagreement from nonverbal audio-visual cues.
However, the proposed method faces difficulties during
classification, and audio samples are sometimes lost in the
procedure. On the other hand, [145] applied multiple hierarchical
classification models based on neural networks (NN) to recognize
audio and visual emotional features instead of labels.
[159] used the Hollywood Human Actions dataset and took advantage of
video sequences to propose a HAR system: the researchers first
extracted visual features, then audio features, and finally applied
support vector machine classifiers. [160] used audio and visual cues
and applied several classifiers to separate the information and
categorize it as audio or visual content, using spatio-temporal
features to extend the spatio-temporal bag of features with geometry
and applying kernel-based learning techniques. Similarly, [161],
having previously used a multiple kernel learning algorithm for
better estimation, applied fuzzy techniques to combine the outputs
of support vector machine classifiers.
However, human actions and activities are complex and influenced by
mood, emotions, interactions, and so on. This explains the
complexity of the computer vision field, and choosing the right
parameters, properties, or features for HAR becomes key to progress
in recognizing and predicting human behavior, as in [162]. Some
researchers focus on audio data, such as [163], where the authors
use canonical correlation analysis (CCA) to propose another way of
interpreting human behavior, applied to lip features and speech
synchronization. [164], by contrast, uses a canonical kernel and a
Support Vector Machine (SVM) to learn and classify images. Other
researchers have taken advantage of facial expressions and the
Facial Action Coding System (FACS) [165] to describe all possible
behaviors through combinations of action units (AU), together with
audio information to identify the emotions of actors, followed by a
real-time 3D system [166]. [167] applied conditional random fields
to the challenging task of recognizing and classifying human
behavior into three main classes (friendly, aggressive, or neutral);
the authors applied their method to a dataset obtained from speeches
in the Greek parliament.
[168] applied a Hierarchical Dirichlet Process, which allows the
creation of multiple hidden states, and used Markov chain Monte
Carlo to sample the data, making it possible to classify behavior
into two types, agreement or disagreement, from nonverbal features
and cues. [169] studied mimicry during human interactions, noting
that these signals were first studied by psychologists before being
used and classified by computer vision researchers; the authors were
among the first in the field to apply computational techniques to
this type of feature for continuous detection of human behavioral
mimicry. [170] coupled a notion from psychology [171] with a
computational method to classify human activity by decomposing an
activity and using each segment of the action as a feature or input.
Compared with a Hidden Markov Model classifier, the authors found a
significant improvement.
Whether in healthcare systems [172], security [173], or autonomous
prediction [174], computer vision will keep attracting researchers
because the complexity of human behavior continues to separate
humans from machines. Although machine learning techniques for
understanding human behavior are improving, it is still a challenge
to understand fully and accurately how humans behave or could
behave. Selecting the useful and important key elements for
interpreting human activity therefore remains an issue. Even though
[175], [176] attempt to characterize or classify human actions, this
is still not sufficient; to date, only a combination of multiple
different features comes close to describing human behavior.
Nonetheless, high-level features lead to computationally complex
classification, and as a result little research has been done with
these properties. Learning-based approaches have also been
categorized into dictionary learning, supervised approaches, genetic
approaches, and unsupervised deep learning approaches. However,
these categories may overlap, so the boundaries between them are not
strict.
5. Examples of some public datasets
Many public-domain datasets have been made available to all; below
is a non-exhaustive list of some of these data sources.
Common well-known public datasets
5.1. Berkeley MHAD dataset [http://tele-immersion.citris-uc.org/berkeley_mhad#about]
Generated as part of the NSF-funded project (#0941382) CDI-Type I:
Collaborative Research: A Bio-Inspired Approach to Recognition of
Human Movements and Movement Styles. The Berkeley Multimodal Human
Action Database (MHAD) contains 11 actions performed by 7 male and 5
female subjects aged 23-30 years, except for one elderly subject,
for a total of 660 action sequences such as jumping in place,
jumping jacks, throwing, waving hands, clapping hands, sitting down,
and standing up [177].
[178] applied a meta-cognitive radial basis function network and its
projection-based learning algorithm to achieve over 97% recognition
accuracy.
5.2. URFD dataset
Created by Michal Kępski of the Interdisciplinary Centre for
Computational Modelling at the University of Rzeszow in December
2014. The dataset consists of 70 sequences (30 falls and 40
activities of daily living) recorded with 2 Microsoft Kinect
cameras.
[179] and [180] both applied their methods to the URFD dataset: the
former used a statistical control chart and a neural network for
classification to improve HAR system output, and the latter a
strategy for fall event detection.
5.3. UTD-MHAD dataset
Collected as part of research on HAR using fusion of depth and
inertial sensor data, the dataset was created at the Department of
Electrical Engineering, University of Texas at Dallas. It consists
of 300 actions (wave, throw, catch, draw, etc.) performed by six
actors (3 males and 3 females), with depth sequences of size
424 x 512 x number of frames.
The method in [181] applied Spatio-Temporal Interest Points to
detect changes, then extracted appearance and motion features at the
interest points using the HOG and Histogram of Optical Flow (HOF)
descriptors, and finally passed a bag-of-words (BOW) representation
of the space-time interest point descriptors to an SVM. [182]
encoded the spatio-temporal information of skeleton sequences with
ConvNets.
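The bag-of-words step in such a pipeline can be sketched as below: each local descriptor is assigned to its nearest codeword, and the video is represented by the normalized histogram of assignments, which would then be passed to a classifier such as an SVM. The codebook, the 2-D descriptors, and the function names are hypothetical and not taken from [181]; real HOG/HOF descriptors have many more dimensions, and the codebook is usually learned by clustering.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda k: sq_dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """Normalized histogram of codeword assignments for one video."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    # An empty descriptor set yields an all-zero histogram.
    return [c / total for c in counts] if total else counts

# Hypothetical 3-word codebook and four local descriptors from one video.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.1, 0.9)]
hist = bow_histogram(descriptors, codebook)
```

Because the histogram length equals the codebook size, videos with different numbers of interest points map to vectors of the same dimension.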
5.4. Weizmann Human Action Dataset
Introduced by the Weizmann Institute of Science in 2005, this
dataset consists of 10 simple actions with a static background:
walk, run, skip, jack, jump forward (jump), jump in place (pjump),
gallop sideways (side), bend, wave1, and wave2. It contains 90
videos at a resolution of 180x144 recorded with a static camera. The
dataset has homogeneous outdoor backgrounds and also provides
irregular versions (with dog, occluded, with bag, etc.) for
robustness experiments. Some research has reported 100% accuracy on
this dataset [52].
6. Conclusion.
This survey reviews different approaches used in Human Action
Recognition (HAR) or Human Behavior Recognition, along with the
techniques and methods applied, focusing on the categorization into
traditional representation-based and learning-based approaches.
Despite the enormous number of published papers, methodologies, and
techniques for collecting and processing data, challenging problems
remain in the interpretation and labeling of actions. A human can
sometimes perform an action that does not mean exactly what it
appears to mean, but instead means something different depending on
mood (e.g. putting both hands behind the neck) or other reasons.
There is therefore still room for improvement in the computer vision
field. Accuracy and performance depend on the features used, but the
system also becomes more complex as more features are extracted and
more methods are applied.
The next step for this document will be to cover more documents and
give more detail on the findings so far in computer vision, so that
new researchers have a document that reflects everything they need
to know before entering the field and gives them a solid foundation.
This research will help clarify where the notion of Human Activity
Recognition comes from, what its current state is, and finally how
future researchers can improve on it and solve the remaining
challenges.