Presentation Outline
1. Introduction to Language Engineering
2. Major Technology Advances through 2030
3. What Is Language Engineering Technology?
4. AI’s Core of Language Engineering Technology
(1) NLP (2) SP (3) OCR
5. Challenges Facing Advances in Arabic
6. Applied Examples in the Arabic Language
7. What Do We Need from the Datasets?
8. Conclusions
Major Technology Advances through 2030
▪ 90% of the global population will have a supercomputer in their pocket by 2023.
▪ 1 trillion sensors will be connected to the internet by 2022.
▪ 10% of reading glasses will be connected to the internet by 2023 (augmented reality and eye tracking).
▪ A government will collect taxes via blockchain for the first time by 2023.
▪ Driverless cars will account for 10% of all cars in the US by 2026.
▪ Robots will be more intelligent: they will listen and talk, and will be everywhere (homes, schools, etc.).
▪ 10% of the world's population will be wearing clothes connected to the internet by 2022.
▪ Up to 5% of products will be printed on 3D printers, and the first 3D-printed car will be in production by 2022.
▪ Artificial intelligence will take on decision making.
https://www.news-innovation.com/research-development/10-technologies-that-will-change-the-world-by-2030
https://www.hiveforhousing.com/article/21-technology-milestones-we-will-achieve-by-2030_c
Major Technology Advances through 2030
Language dependent
High performance processor
Huge memory capacity
Biometric and environment sensors
Wireless communication
Human language interaction
Very long battery life
What is Language Engineering Technology?
Language engineering technologies involve oral (spoken), signed, and written languages.
To enhance communication between:
Man and machine (in either direction)
Man and man (across different languages)
To ease access to information
AI’s Core of Language Engineering Technology
1) Natural Language Processing, which includes:
Machine Translation
Text Understanding, Generation and Summarization
Information Retrieval and Question Answering
2) Speech Processing, which includes:
Automatic Speech Recognition
Text to Speech
3) Computer Vision: document analysis using OCR, which covers:
Printed or typewritten, and handwritten text
Off-line and on-line recognition
Bilingual Machine
Translation Ecosystem
(English/Arabic)
Bilingual Machine Translation System
(Arabic/English)
▪ Develop an English-to-Arabic translation model whose quality can be improved continuously and that is flexible enough to be extended to other language pairs (multilingual).
▪ Bilingual corpora and dictionaries will be used for the proposed machine translation model, after cleaning them and removing non-alphanumeric text through linguistic modification tasks.
▪ Dataset creation, collection, annotation, adaptation, etc.
▪ A bilingual machine translation model based on neural networks will be developed.
▪ Encoder and decoder models are used for this machine translation.
▪ Evaluate the proposed translation model using standard methods.
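The corpus cleaning step mentioned above could look roughly like the sketch below. The slides do not state the exact cleaning rules, so the helper names (`clean_sentence`, `clean_pairs`) and the choices made here (stripping diacritics, lowercasing, dropping empty pairs) are assumptions for illustration only.

```python
import re
import unicodedata

# Assumed rule: keep letters, digits, and whitespace; replace everything else with a space.
_NON_ALNUM = re.compile(r"[^\w\s]|_")
_SPACES = re.compile(r"\s+")

def clean_sentence(text: str) -> str:
    """Normalize Unicode, strip diacritics and non-alphanumeric characters."""
    text = unicodedata.normalize("NFKC", text)
    # Delete combining marks (e.g. Arabic short-vowel signs) without splitting the word.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = _NON_ALNUM.sub(" ", text)
    return _SPACES.sub(" ", text).strip().lower()

def clean_pairs(pairs):
    """Clean (English, Arabic) pairs; drop pairs where either side becomes empty."""
    cleaned = []
    for en, ar in pairs:
        en_c, ar_c = clean_sentence(en), clean_sentence(ar)
        if en_c and ar_c:
            cleaned.append((en_c, ar_c))
    return cleaned

if __name__ == "__main__":
    print(clean_pairs([("Hello, World!", "مرحبا!"), ("...", "نص")]))
```

Whether diacritics should be removed depends on the translation model; here they are dropped only to keep the vocabulary small.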
Bilingual Machine Translation
Arabic/English Machine Translation
[Diagram: Natural Language #1 (text) → Natural Language Translator → Natural Language #2 (text)]
Example: Arabic "أنا آكل" → English "I am eating"
Machine Translation Accuracy Evaluation
Human Expertise 1
Using BERT-Score (English reference)
BERT: Bidirectional Encoder Representations from Transformers
BLEU: Bilingual Evaluation Understudy
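As an illustration of how BLEU scores a translation, here is a minimal sentence-level sketch with a single reference, whitespace tokenization, and no smoothing. These simplifications are assumptions made for brevity; the slides do not say which BLEU variant or tooling was used, and BLEU is normally reported at corpus level.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)

if __name__ == "__main__":
    print(bleu("I am eating an apple", "I am eating an apple"))
    print(bleu("I am", "I am eating", max_n=2))
```

An identical candidate and reference score 1.0; a correct but shorter candidate is reduced only by the brevity penalty.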
Automatic Speech Recognition
[Block diagram: Feature Vectors; Decoder (modeling/classification and search); Dictionary Model; Acoustic Model; Language Model; Training Data (1200 hours); Recognized Text]
Demos
Example 1
Example 2
Example 3
ArSL https://youtu.be/iBZqCt13JQs
ASL (Virtual Character) WebSign (SML&XML)
Jalees Reader http://www.jaleesreader.com/DemoArabic/OEBPS/pages/index00.html
Special Arabic Characteristics
1. Connectivity properties.
2. Dotting properties.
3. Multiple graphemes (position dependent).
4. Ligature properties.
Arabic word segments can be represented by a single atomic grapheme.
5. Overlapping properties.
6. Font size properties.
Arabic graphemes do not have a fixed height or a fixed width.
Font families and variations: Naskh, Ruq'ah, Kufi, ...
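The position-dependent graphemes and ligatures listed above are visible in Unicode itself. The sketch below uses the letter BEH and the LAM-ALEF ligature as examples chosen for this illustration (they are not taken from the slides) and shows, with Python's standard `unicodedata` module, that each positional form maps back to the same base letter.

```python
import unicodedata

# Unicode presentation forms of the letter BEH (U+0628), one per position in a word.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

# A ligature: one glyph standing for the two letters LAM + ALEF.
LAM_ALEF_LIGATURE = "\uFEFB"

def base_letters(presentation_form: str) -> str:
    """Map a presentation form back to its underlying letter(s) via NFKC."""
    return unicodedata.normalize("NFKC", presentation_form)

if __name__ == "__main__":
    for position, glyph in BEH_FORMS.items():
        print(position, unicodedata.name(glyph), "->", unicodedata.name(base_letters(glyph)))
    print([unicodedata.name(ch) for ch in base_letters(LAM_ALEF_LIGATURE)])
```

An OCR system sees the four shapes as different images, so it has to map several visual classes to one letter, and one ligature image to two letters.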
▪ Historical documents pose challenges and difficulties for recognizing Arabic calligraphy, which is cursive in nature, composed of dots and diacritics, and written in different styles.
▪ We propose an implementation approach covering document layout analysis, feature extraction, object segmentation, and recognition of Arabic historical documents using deep learning.
▪ Collect images of Arabic manuscripts in a dataset (dataset collection).
▪ Extract features from Arabic manuscripts.
Optical Character Recognition
Arabic OCR Overview
Large volumes of Arabic documents:
Early Printed Documents
Printed Documents
Calligraphy Documents
Handwritten Documents
Historical Documents
1. Which domains should the document datasets cover to start with?
2. What are the sizes and volumes of the images to be processed in the training and classification phases?
3. How should accuracy and system performance be measured?
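For the accuracy question, one common option is character-level accuracy derived from the edit distance between the OCR output and a ground-truth transcription. The sketch below shows that option only; it is an assumption, and not necessarily the metric behind the accuracy figures reported in this presentation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]

def character_accuracy(ocr_text: str, ground_truth: str) -> float:
    """Percentage of ground-truth characters recognized correctly (floored at 0)."""
    if not ground_truth:
        return 100.0 if not ocr_text else 0.0
    errors = edit_distance(ocr_text, ground_truth)
    return max(0.0, 100.0 * (1 - errors / len(ground_truth)))

if __name__ == "__main__":
    print(character_accuracy("abxd", "abcd"))
```

The same function works on Arabic strings, since it compares Unicode characters one by one.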
Document Understanding System Modules
First, an image is described by object data of different types:
1. Graphic information, where the whole image is represented as a sequence of orthogonal pixel runs [1]
2. Segmented information describing texture regions
3. Layout data describing the arrangement of objects (their geometry)
4. Symbolic data representing multiple glyph images
Modules / Approaches: Processes (extracted attributes and features)
Preprocessing: binarization; document enhancement and noise removal; skew and slant detection
Layout Analysis: document categorization; page orientation; segmentation into text and non-text regions
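As an example of the binarization step, the sketch below implements Otsu's global threshold in plain Python. Otsu's method is used here as one well-known choice; the slides do not say which binarization algorithm the system uses, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]            # pixels with gray level <= t
        if w_dark == 0:
            continue
        w_light = total - w_dark     # pixels with gray level > t
        if w_light == 0:
            break
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(image):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink (dark), 0 = background."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]

if __name__ == "__main__":
    print(binarize([[10, 200], [210, 12]]))
```

A global threshold is usually too crude for stained historical pages, where locally adaptive thresholds tend to do better.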
Arabic Document Types Overview
[Figure panels: Islamic/Christian observatory; a chemical process in an Arabic manuscript; patterns used to decorate buildings; painting in honor of Sultan Murad III (1574-95)]
https://www.google.com.sa/search?q=Islamic+manuscripts+%2B+PPT&safe=active&sa=N&biw=1164&bih=595&tbm=isch&tbo=u&source=univ&ved=0ahUKEwib9aPVm7HLAhVDORQKHY-9A4w4KBDsCQgu
Page Layout Analysis and Decomposition
[Diagram: Document Preprocessing & Layout Analysis; Textual Regions; Non-Textual Regions; Optical Character Recognition; Graphical Processing; Regions and Symbol Processing; Unified/Universal Description]
Document Image Analysis
  Textual Processing
    Optical Character Recognition: text
    Page Layout Analysis: skew, blocks, paragraphs
  Graphical Processing
    Line Processing: lines, curves, corners
    RoI Processing: filled regions
Line Separation
Ascenders and descenders interfere with adjacent lines
Region-growing approach
In Devanagari, a single word is a single connected component
Grow regions using horizontally adjacent components
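The region-growing idea can be sketched as follows, assuming connected components have already been extracted as bounding boxes `(x0, y0, x1, y1)`. The function name `grow_lines` and the thresholds `max_gap` and `min_overlap` are illustrative assumptions, not values from the slides.

```python
def vertical_overlap(a, b):
    """Height of the vertical overlap between two boxes (0 if they do not overlap)."""
    return max(0, min(a[3], b[3]) - max(a[1], b[1]))

def grow_lines(boxes, max_gap=20, min_overlap=0.5):
    """Group component boxes into text lines by growing regions horizontally."""
    lines = []  # each entry: {"bbox": [x0, y0, x1, y1], "members": [boxes]}
    for box in sorted(boxes):                 # sweep left to right by x0
        height = box[3] - box[1]
        target = None
        for line in lines:
            region = line["bbox"]
            gap = box[0] - region[2]          # horizontal distance to the region
            if gap <= max_gap and vertical_overlap(box, region) >= min_overlap * height:
                target = line
                break
        if target is None:
            lines.append({"bbox": list(box), "members": [box]})
        else:
            region = target["bbox"]
            region[1] = min(region[1], box[1])
            region[2] = max(region[2], box[2])
            region[3] = max(region[3], box[3])
            target["members"].append(box)
    lines.sort(key=lambda line: line["bbox"][1])  # top to bottom
    return [line["members"] for line in lines]

if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (12, 1, 22, 11), (0, 30, 10, 40), (13, 31, 20, 39)]
    print(grow_lines(boxes))
```

Because each region grows vertically as components are added, long ascenders or descenders can still bridge two lines, which is the interference problem noted above.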
Analysis of Arabic Calligraphy Pages
(a) Original document (b) After binarization (c) After de-noising
(d) After de-framing (f) After de-skewing (g) After segmentation
Example of an Arabic calligraphy document after the pre-processing steps
Example of an Arabic printed document after the pre-processing steps
(a) Original document (b) After binarization (c) After de-skewing
(d) After de-noising (f) After de-framing (g) After segmentation
Overall Evaluation
OCR results accuracy (%)
Scanned Image    Version 1.0    Version 2.0
#1               82.35          90.50
#3               0.74           90.07
Document Analysis / Machine Translation
Pipeline: source document → OCR (Tesseract) → translation (Google) → target translation
Vietnamese document → Arabic
German document → English: "I need a beer!"
New Technologies
BCI: Brain-Computer Interface, also called Brain-Machine Interface (BMI)
EEG: Electroencephalogram (electrical activity of thinking)
ECG: Electrocardiogram (heart tracing, electrical activity of the heart)
Head-Mounted Display (HMD): a display device paired with a headset such as a harness or helmet
Eyeglasses: eyewear that uses cameras to capture the real-world view and redisplay an augmented view of it through the eyepieces
Mobile Technology
Augmented, Virtual and Mixed Realities
Sign Language
Conclusion
▪ Several tests were carried out successfully, with 90% accuracy measured for all the examples.
▪ The audio dataset was prepared and labeled to contain Arabic audio signals with different accents, sizes, domains, styles, directions, and skews.
▪ The OCR system used basic computer vision and image processing algorithms (edge detection, contour detection, and contour filtering) to segment characters and words from an input image.
▪ The datasets included bilingual lexicons (MT), audio signals (SP), and document images of calligraphy manuscripts, early printed documents, and printed documents (OCR).
▪ We need to unify international languages so that they can be used in direct translation.
▪ Using new technologies:
  Study anytime and anywhere.
  Integrate with VR, AR, and mixed reality to help others.