
Visual recognition in the real world SKT services

박병관

SK Telecom

AI Center / Visual Recognition Technology Cell

2019.07.02

SKT Services

Contents

1. T map road traffic information recognition

a. Service Overview

b. Core Engine Architecture

c. Core Engine

d. Multi-frame Integration

e. Evaluation

2. NUGU nemo visual recognition

a. Hand Posture game

b. OksusuKids viewing guide

T map road traffic information recognition

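1. Service Overview

Service Overview

● Goal

○ Automatically recognize road guide signs and speed camera information in footage from data-collection cameras

○ Convert the results into road data to verify existing data and generate new data

● Expected benefits

○ Agile updates

■ Fast reflection of VoC and of new/changed road information

○ Expanded coverage

■ Capture coverage ~= verification coverage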

Service Overview (Example)

2. Core Engine Architectures

Core Engine Architectures

Original Image → mid-size Image → Road Sign Detector → Crop & Resize → Text Detector → Crop, Warp & Resize → Language Classifier → Text Recognition

Output: ... 인천국제공항 ... 청중로봉오대로 ...
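The stages in the diagram compose as a chain in which each stage narrows the region handed to the next. The following is a minimal sketch of that control flow only. The stage callables (`detect_signs`, `detect_text`, `classify_language`, `recognize`), the `crop` helper, and the `Word` record are hypothetical names that do not come from the slides, and the mid-size downscaling and warping steps are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) in pixels
Image = List[List[int]]          # toy grayscale image: rows of pixel values


@dataclass
class Word:
    sign_box: Box   # where the sign was found in the frame
    text_box: Box   # where the word was found inside the sign crop
    language: str
    text: str


def crop(image: Image, box: Box) -> Image:
    """Cut the (x, y, w, h) region out of a row-major image."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def run_pipeline(
    frame: Image,
    detect_signs: Callable[[Image], Sequence[Box]],
    detect_text: Callable[[Image], Sequence[Box]],
    classify_language: Callable[[Image], str],
    recognize: Callable[[Image, str], str],
) -> List[Word]:
    """Sign detection -> text detection -> language classification -> recognition."""
    words = []
    for sign_box in detect_signs(frame):
        sign = crop(frame, sign_box)
        for text_box in detect_text(sign):
            patch = crop(sign, text_box)
            language = classify_language(patch)
            words.append(Word(sign_box, text_box, language, recognize(patch, language)))
    return words
```

Passing the stages in as callables keeps the sketch independent of any particular detector or recognizer; each one can be swapped or stubbed without touching the loop.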

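Practical Issue #1 (Diverse outdoor environments)

Lighting / Blur / Occlusion

Practical Issue #2 (Non-standard signs)

3. Core Engine

Road Sign Detection

● Reliably detect the meaningful signs in road footage

○ Select the main targets among many signs

■ Speed limits / road traffic signs / speed cameras, etc.

■ Look-alike signs are also included in training

○ Object detection based on a two-stage R-CNN

Example labels: detection target / not a detection target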

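Text Detection (I)

● Overview of text detection techniques

○ Evolving from 4 DoF → 5 DoF → 8 DoF → 2N DoF

○ Text on signs follows a fixed standard (no arbitrary shapes)

■ but vehicle motion makes the sign appear rotated

○ Text on signs is detected with 5 DoF, then warped and passed to the text recognition engine

TYPE                 RECT             RBOX                Polygon
Degrees of Freedom   4 (x, y, w, h)   5 (x, y, w, h, θ)   2N (x1, y1, …, xN, yN)
Example              (image)          (image)             (image)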
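To make the 5 DoF representation concrete, the sketch below converts an RBOX (x, y, w, h, θ) into its four corner points and maps an image point back into the upright box frame, which is the geometric core of the warping step. It assumes that (x, y) is the box centre and that θ is given in radians; the slides do not state these conventions, and the function names are illustrative.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rbox_to_corners(x: float, y: float, w: float, h: float, theta: float) -> List[Point]:
    """Corners of a rotated box, assuming (x, y) is the centre and theta is in radians."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset by theta, then translate to the box centre.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half]


def to_box_frame(p: Point, x: float, y: float, w: float, h: float, theta: float) -> Point:
    """Map an image point into the upright w-by-h patch (inverse rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    dx, dy = p[0] - x, p[1] - y
    return (dx * c + dy * s + w / 2, -dx * s + dy * c + h / 2)
```

With θ = 0 the RBOX reduces to the 4 DoF RECT case, so the same code covers both rows of the table.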

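Text Recognition (I)

● Accurately recognize the detected text

○ CNN + RNN based Text Recognition Engine

○ Customized CNN + Attentive RNN that accounts for the complexity of Hangul

A B C D E F G... (English letters/digits/special characters: about 80 classes, phonemic script)

VS

닮닳쏘쪼개걔흥홍훙흉횽 (Hangul/English letters/digits/special characters: about 2,400 classes, phonemic/syllabic script)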

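Text Recognition (II)

● Accurately recognize the detected text

○ A costly and large Training DB is required

■ The complexity of Hangul calls for a large Training DB

■ But… labeling Hangul is a very expensive task

● Estimated cost of labeling 5 million five-syllable Hangul words

● 5 million * 5 * 10 = 250 million KRW (10 KRW per typed syllable)

○ Use a target-customized synthetic DB

■ Generation == Labeling

■ Augmentation by 3D Effect

■ Can also simulate jittering of the Text Detection Box, etc.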
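The cost estimate and the "generation == labeling" idea above can be sketched together. The snippet first reproduces the 250 million KRW figure, then generates random Hangul strings whose labels are known by construction, each with a jittered box that imitates detector noise. Sampling over the whole Unicode Hangul syllable block, the function names, and the jitter range are illustrative assumptions and not the author's generator; rendering the text onto a sign image is left out.

```python
import random
from typing import Tuple

WORDS, SYLLABLES_PER_WORD, KRW_PER_SYLLABLE = 5_000_000, 5, 10
manual_cost_krw = WORDS * SYLLABLES_PER_WORD * KRW_PER_SYLLABLE  # 250,000,000 KRW

HANGUL_BASE, HANGUL_COUNT = 0xAC00, 11_172  # Unicode precomposed Hangul syllables

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def random_word(rng: random.Random, length: int = 5) -> str:
    """Draw a random string of Hangul syllables; the string itself is the label."""
    return "".join(chr(HANGUL_BASE + rng.randrange(HANGUL_COUNT)) for _ in range(length))


def jitter_box(rng: random.Random, box: Box, ratio: float = 0.1) -> Box:
    """Perturb a box by up to +/- ratio of its size to mimic detector jitter."""
    x, y, w, h = box
    return (
        x + rng.uniform(-ratio, ratio) * w,
        y + rng.uniform(-ratio, ratio) * h,
        w * (1 + rng.uniform(-ratio, ratio)),
        h * (1 + rng.uniform(-ratio, ratio)),
    )


def make_sample(rng: random.Random, box: Box = (0.0, 0.0, 200.0, 40.0)):
    """Return (label, jittered_box); a renderer would draw `label` inside the box."""
    return random_word(rng), jitter_box(rng, box)
```

Because the label is produced before the image, no typing cost is incurred per sample, which is the point of the comparison with manual labeling.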

3D Plate Modeling for Text Detection
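One way to read "3D plate modeling" is to treat the sign as a flat rectangle in 3D, rotate it, and project it through a pinhole camera to obtain the distorted quadrilateral seen in the image. The sketch below does this for a rotation about the vertical axis. The slides do not describe the actual modeling, so the function name, the rotation convention, and all parameter values are assumptions.

```python
import math
from typing import List, Tuple


def project_plate(width: float, height: float, yaw: float,
                  distance: float, focal: float) -> List[Tuple[float, float]]:
    """Project the corners of a plate rotated by `yaw` (radians) about the vertical axis.

    The plate is centred on the optical axis at `distance`; pinhole model u = f * X / Z.
    """
    corners = [(-width / 2, -height / 2), (width / 2, -height / 2),
               (width / 2, height / 2), (-width / 2, height / 2)]
    projected = []
    for px, py in corners:
        x3d = px * math.cos(yaw)              # rotation about the Y axis
        z3d = distance + px * math.sin(yaw)   # positive yaw pushes the right edge away
        projected.append((focal * x3d / z3d, focal * py / z3d))
    return projected
```

For a non-zero yaw the far edge of the plate projects shorter than the near edge, which is the kind of perspective distortion a text detector has to cope with on real signs.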

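4. Multi-frame Integration

Multi-frame Integration

● Recognize each sign across multiple frames to improve per-sign recognition performance

○ Integrating the results of many frames mitigates the degradation caused by misrecognition, occlusion, etc. in individual frames

○ A sign that appears in multiple frames must be integrated into a single result

○ Scene Splitting → Tracking → Word Integration → Word Refining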
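The Word Integration step can be illustrated with a simple vote over the per-frame readings of one tracked sign. This is a generic majority vote chosen for illustration; the slides do not say how integration or refining is actually implemented, and `integrate_word` is a hypothetical name.

```python
from collections import Counter
from typing import List, Optional


def integrate_word(readings: List[Optional[str]]) -> Optional[str]:
    """Merge per-frame readings of one tracked word into a single result.

    None marks frames where the word was missed (e.g. occluded).
    """
    seen = [r for r in readings if r]
    if not seen:
        return None
    # Vote on the length first, then vote character by character among
    # the readings of that length, so one bad character cannot sink a frame.
    length = Counter(len(r) for r in seen).most_common(1)[0][0]
    same_length = [r for r in seen if len(r) == length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*same_length)
    )
```

Voting per character means the integrated word can be correct even when no single frame was read entirely correctly.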

5. Quantitative/Qualitative Evaluation

Quantitative/Qualitative Evaluation

● The evaluation set is split into a Hard set and a Normal set

Hard set cases: back light, blur, occlusion, exposure

case       Hard     Normal   Total
E2E Acc.   90.32%   95.65%   95.18%
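If the Total figure is the sample-weighted average of the Hard and Normal accuracies (an assumption, since the slide does not give the set sizes), the share of hard samples follows from the three numbers alone:

```python
def hard_share(hard_acc: float, normal_acc: float, total_acc: float) -> float:
    """Fraction p of hard samples, solving total = p * hard + (1 - p) * normal."""
    return (normal_acc - total_acc) / (normal_acc - hard_acc)


p = hard_share(90.32, 95.65, 95.18)
print(f"hard set share: {p:.1%}")
```

Under that assumption the hard set makes up a little under 9% of the evaluation samples, which explains why the Total accuracy sits close to the Normal figure.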



NUGU nemo visual recognition

Smart Display Speaker (with Camera)

Released April 26, 2019 (first in Korea)

with visual recognition

Hand Posture brain game

반짝반짝두뇌게임 (Twinkle Twinkle Brain Game)

● Hand - Natural User Interface



Input : 2D image

● 2D key points

○ OpenPose (CMU)

● 3D key points

○ Learning to Estimate 3D Hand Pose from Single RGB Images (ICCV 2017)

○ GANerated Hands for Real-time 3D Hand Tracking from Monocular RGB (CVPR 2018)

Input : 3D depth image

● 3D key points

○ Augmented Skeleton Space Transfer for Depth-based Hand Pose Estimation (CVPR 2018)

○ Occlusion-aware Hand Pose Estimation Using Hierarchical Mixture Density Network (ECCV 2018)


Output : Posture

● Input : Static One Frame

○ How many classes do you need to classify?

■ Hard to label

Output : Gesture

● Input : Dynamic Varying Frames

○ Real Time Processing with Tracking


● Training

○ Which classes should be trained?

■ Unambiguous hand postures: classes that overlap as little as possible in appearance with other classes

■ 7 classes + 1 negative = 8 classes

● The negative hand class is important
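The role of the negative class can be sketched as follows: with eight outputs, a hand that matches none of the seven postures has somewhere to go instead of being forced into the nearest posture. The class names and the position of the negative class are placeholders, not the service's actual labels.

```python
import math
from typing import List, Optional

POSTURES = [f"posture_{i}" for i in range(7)]  # placeholder names for the 7 classes
NEGATIVE = len(POSTURES)                       # index 7: a hand in none of the postures


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits: List[float]) -> Optional[str]:
    """Return the posture name, or None when the negative class wins."""
    if len(logits) != len(POSTURES) + 1:
        raise ValueError("expected 7 posture logits + 1 negative logit")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return None if best == NEGATIVE else POSTURES[best]
```

Returning `None` lets the game ignore the frame rather than react to a posture the user never made.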

● Choices

○ 2D (R, G, B) image vs 3D depth image

○ posture vs gesture

○ key point vs detection

■ rock, paper, scissors (3 types)

■ v pose, heart, palm, okay, thumbs up, thumbs down (6 types)


● Problems

○ Deciding where to draw the boundaries

■ Up to what rotation angle should be accepted?

○ Pose variation

■ Up to which poses should be accepted?

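● Solution

○ Learning by Failure

■ A perfect engine cannot be built at the outset.

■ An easy (plain) hand posture DB does not help training.

■ The most reliable source of the engine's problems is real users.

■ Continuous iteration of failure-case analysis and engine improvement through CBT (eight CBT rounds so far)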
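The CBT loop can be pictured as: run the current engine on field samples, keep the ones it gets wrong or is unsure about, and add those to the training DB for the next round. The record layout, the confidence threshold, and the function names below are assumptions made for illustration, not the team's tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CbtRecord:
    image_id: str
    label: str         # ground truth assigned during failure analysis
    predicted: str     # what the engine answered in the field
    confidence: float  # engine confidence in [0, 1]


def select_failures(records: List[CbtRecord], min_confidence: float = 0.6) -> List[CbtRecord]:
    """Keep wrong or low-confidence cases; easy, confident hits add little to training."""
    return [
        r for r in records
        if r.predicted != r.label or r.confidence < min_confidence
    ]


def next_round_db(train_db: List[CbtRecord], cbt_records: List[CbtRecord]) -> List[CbtRecord]:
    """Training DB for the next CBT round: the old DB plus newly found hard cases."""
    known = {r.image_id for r in train_db}
    return train_db + [r for r in select_failures(cbt_records) if r.image_id not in known]
```

Filtering out confident correct predictions mirrors the observation above that easy samples do not help training.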


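● Performance

○ CBT rounds 1 to 4 identified the problems of the baseline engine

■ Pose Variation

● Children's diverse hand movements

● Reinforced the DB with pose variation in the roll, pitch, and yaw directions

○ After the 6th test

■ Scale Variation

● The recognition rate is relatively lower at close range (20 cm or less)

● Reinforced the DB with diverse scales

○ CBT continues after release to keep improving performance

Face detection: OKSUSU Kids viewing habits

OksusuKids viewing guide service

● Visual recognition service for children's viewing habits

○ When the display is used within 15 cm, pause the VoD and show a 'move back' prompt

○ Active from 1 minute after the VoD starts; after one 'move back' prompt, the service resumes 5 minutes later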
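The timing rules above amount to a small state machine: a warm-up of 1 minute, a 15 cm trigger, and a 5 minute cool-down after each prompt. A minimal sketch follows; the class and method names are illustrative, and treating "no face detected" as "not too close" is an assumption.

```python
from typing import Optional

NEAR_CM = 15.0          # prompt when the face is within this distance
WARMUP_S = 60.0         # the guide starts 1 minute after VoD start
COOLDOWN_S = 5 * 60.0   # after a prompt, the guide sleeps for 5 minutes


class ViewingGuide:
    def __init__(self, vod_start_s: float) -> None:
        self.vod_start_s = vod_start_s
        self.last_prompt_s: Optional[float] = None

    def update(self, now_s: float, face_distance_cm: Optional[float]) -> bool:
        """Return True when the VoD should pause and the prompt should be shown."""
        if now_s - self.vod_start_s < WARMUP_S:
            return False
        if self.last_prompt_s is not None and now_s - self.last_prompt_s < COOLDOWN_S:
            return False
        if face_distance_cm is None or face_distance_cm > NEAR_CM:
            return False
        self.last_prompt_s = now_s
        return True
```

`update` would be called once per processed camera frame with the current time and the latest distance estimate.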

● Estimate the distance between the display and the face using embedded face detection

● Why embedded

○ Privacy concerns

○ Server cost

○ Prompt response
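A common way to turn a face detection into a distance, assumed here because the slides do not describe the method, is the pinhole relation distance = focal length × real width / width in pixels. The face width and focal length below are placeholder values that would need calibration for a real camera.

```python
ASSUMED_FACE_WIDTH_CM = 14.0   # placeholder; needs calibration for the target users
ASSUMED_FOCAL_PX = 600.0       # placeholder focal length in pixels


def face_distance_cm(face_width_px: float,
                     focal_px: float = ASSUMED_FOCAL_PX,
                     face_width_cm: float = ASSUMED_FACE_WIDTH_CM) -> float:
    """Pinhole estimate: distance = focal * real_width / pixel_width."""
    if face_width_px <= 0:
        raise ValueError("face width in pixels must be positive")
    return focal_px * face_width_cm / face_width_px
```

With these placeholder values a detected face 560 pixels wide corresponds to 15 cm, and the estimate doubles when the face box halves in width. Since only the box width is needed, the computation can run entirely on the device, in line with the reasons for an embedded design listed above.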

Legacy Face Detector

● Legacy embedded face detector

○ Shallow learning based (runs at 9 fps on NUGU nemo)

○ We need to go deeper...

Limitation/Performance

● NVIDIA GTX 1080 Ti vs NUGU nemo GPU

○ 11.34 TFLOPS vs 0.007 TFLOPS
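The gap between the two figures is easier to appreciate as a ratio. The snippet below computes it and, under the idealized assumption that runtime scales inversely with peak FLOPS (which ignores memory bandwidth and utilization), shows what a hypothetical desktop inference time would become on the device.

```python
GTX_1080_TI_TFLOPS = 11.34
NUGU_NEMO_TFLOPS = 0.007

ratio = GTX_1080_TI_TFLOPS / NUGU_NEMO_TFLOPS  # about 1,620 times


def scaled_latency_ms(desktop_ms: float) -> float:
    """Idealized device latency if runtime scaled inversely with peak FLOPS."""
    return desktop_ms * ratio
```

By this rough measure a network that takes 10 ms on the desktop GPU would need on the order of 16 seconds on the device, which is why an off-the-shelf deep detector cannot simply be dropped in.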

Current Face Detection @ NUGU

Wrap Up

Infra for Visual Recognition

Training GPU Infra : DGX-1V

Inference GPU Infra : V100

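Closing remarks

● The road to service deployment

○ Before release, secure a Training DB and a Test DB that fit the service

○ After release, an architecture that allows continuous updates

○ Beyond Open Source and Paper

■ Adaptation / modification beyond publicly available networks

○ Abundant GPU infrastructure

○ Affection and passion for the service (a mindset that can love even the VoC)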

Date post:	15-Jul-2020