Sentence Extraction from Images
Varsha Vegad1, Abhinay Pandaya2
1P.G. Student, 2Assistant Professor
1,2Department of Computer Engineering, LDRP-ITR, Gandhinagar, Gujarat, India.
ABSTRACT
The ways in which computer vision researchers and ordinary people describe the world differ considerably. At the same time, large amounts of natural-language text about the world are readily available, and people communicate about it in many forms: spoken, typed, and written. Visually descriptive language is thus a rich source of information about the visual world, and image descriptions written by people provide natural training data. We implement a system that automatically generates natural language descriptions of images, combining text data with object detection and recognition algorithms from computer vision. The system generates descriptive sentences from images using natural language processing and a conditional random field (CRF).
Keywords: Natural language description, computer vision, object recognition and detection, conditional random field
1. Introduction
People communicate using many kinds of language: spoken, typed, and written. The whole world cannot be described from any one person's communication alone; instead, the visual world is described through visually descriptive language, which may still leave information incomplete. Building on computer vision efforts in object and scene recognition, our system takes an image as input and produces sentences as output. People do not describe a scene with generic information alone; they mention specific objects, their locations, and additional context.
The core problem is how to connect the content of a particular image to the sentence-generation process. We address it by detecting objects, modifiers (adjectives), and spatial relationships (prepositions) in an image; together with descriptive text we already have, these results drive sentence generation, performed either with an n-gram language model or a simple template-based approach. The following example shows how the system describes an image or scene.
Example [1]
Our system automatically generates the following descriptive text for an example image:
"This picture shows: the person is in the chair. The person is near the green grass, which is by the chair, and near the potted plant."
A second example [1] illustrates the steps of our system:
1) object ("thing" and "stuff") detectors find candidate objects;
2) a set of attribute classifiers processes each candidate region;
3) prepositional relationship functions process each pair of candidate regions;
4) a CRF is constructed with higher-order text-based potentials computed from large document corpora;
5) a labeling of the graph is predicted;
6) sentences are generated based on the labeling.
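The six steps above can be sketched as a pipeline. Everything below is a hypothetical stub (the detector, classifier, and relation outputs are hard-coded for illustration), not the system's actual implementation:

```python
# Hypothetical sketch of the six-step pipeline; each stage is a stub
# standing in for the real detector / classifier / CRF components.

def detect_objects(image):
    # Step 1: object ("thing") and "stuff" detectors propose candidate regions
    return [{"label": "person", "box": (10, 20, 80, 200)},
            {"label": "sofa", "box": (0, 100, 300, 250)}]

def classify_attributes(region):
    # Step 2: attribute classifiers score modifiers for one region
    return "brown" if region["label"] == "sofa" else None

def prepositional_relation(r1, r2):
    # Step 3: preposition functions score each pair of regions
    return "against"

def describe(image):
    regions = detect_objects(image)                      # step 1
    attrs = [classify_attributes(r) for r in regions]    # step 2
    preps = [(i, j, prepositional_relation(regions[i], regions[j]))
             for i in range(len(regions)) for j in range(i + 1, len(regions))]  # step 3
    # Steps 4-5: a CRF over these potentials would predict the best labeling;
    # here we simply take the raw outputs as the labeling.
    labeling = [((attrs[i], regions[i]["label"]), p, (attrs[j], regions[j]["label"]))
                for i, j, p in preps]
    # Step 6: template-based sentence generation from the labeling
    return [f"The {a1 or ''} {o1} is {p} the {a2 or ''} {o2}.".replace("  ", " ")
            for (a1, o1), p, (a2, o2) in labeling]

print(describe(None))
```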
For the constructed CRF [1], the predicted labeling [1] is:
<< null, person_b >,against,< brown, sofa_c >>
<< null, dog_a >,beside,< brown, sofa_c >>
<< null, dog_a >,near,< null, person_b >>
6) Generated sentence:
"This is the picture of one lady person, one sofa and one dog. The person is against the brown sofa, as we see. A dog is near the person and also beside the sofa."
Is it important to emphasize that the dog is walking alongside the sofa? (And how did we infer that from a still image?) Here, we tackle the problem of automatically assigning correct textual labels to an input image. In addition to being useful as a description of the image, these labels are useful for retrieval in image search and retrieval systems.
For sentence generation, we use natural language processing as well as computer vision. Natural language generation is one of the fundamental research problems in natural language processing (NLP), the field concerned with interactions between computers and human (natural) languages. NLP comprises several major tasks (discourse analysis, machine translation, morphological segmentation, named entity recognition, natural language generation, natural language understanding, OCR), and natural language generation is core to a wide range of NLP applications.
Introduction to the problem:
"A picture is worth a thousand words." Wrong! Why? There is no explicit semantic representation of an image; we understand images in our brains through processes we do not yet know enough about. For example, consider an image for which the candidate descriptions are:
- a group of people sitting on land;
- a group of people selling vegetables;
- a tribal area where aboriginal people are selling vegetables in a local market.
The third seems the most correct. Would the answer be the same if we asked a 7-year-old? No. Why, then, do we make this inference? How much previously acquired knowledge does it require? How is that knowledge stored and retrieved, and, more importantly, how are we able to draw the inference from such "insufficient information"? Hence, we claim that in order to transfer information through images unambiguously, we must also attach textual cues that support the correct interpretation or viewpoint.
Why the problem is difficult / important research questions:
(1) How best to approach the conversion from visual information to linguistic expressions?
(2) Which parts of the visual information do humans describe?
(3) What is a good semantic representation (SR) of visual content, and what is the limit of such a representation given perfect visual recognition?
Further challenges include the availability of labeled datasets for validating our models, and the need for knowledge rich in both image processing and NLP. As outlined earlier, we need to find the correct "level" of description: in a hierarchy of descriptions from fine to coarse, we must decide at what level we wish to describe the input image textually. Since this is a novel application at the confluence of image processing and natural language processing, we decide to describe the image as the following tuple:
<
Named entities of the objects in the image,
Named entity representing the scene/background,
Prepositions describing relative locations of objects in the image
>
For example:
< (A dog, a woman, sofa), (a room), (woman on sofa, dog on floor)>
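This tuple can be encoded directly as a data structure; the field names below are illustrative, not part of the system:

```python
from collections import namedtuple

# Hypothetical encoding of the <objects, scene, relations> description tuple.
ImageDescription = namedtuple("ImageDescription", ["objects", "scene", "relations"])

desc = ImageDescription(
    objects=("a dog", "a woman", "sofa"),
    scene="a room",
    relations=(("woman", "on", "sofa"), ("dog", "on", "floor")),
)

print(desc.objects)    # the named entities detected in the image
print(desc.relations)  # prepositions giving relative object locations
```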
Motivation for the problem:
- to transfer image information from source to destination;
- image search/retrieval systems;
- to assist visually impaired people;
- human-computer interaction (HCI), especially robotics.
2. Method overview
A general approach to the problem:
Step 1: recognize objects and scene from the image. Detectors find "things" such as bird, bus, car, and person, and "stuff" such as grass, trees, water, and road. A set of attribute classifiers processes each candidate object (thing or stuff) region, and prepositional relationship functions process each pair of candidate regions.
Step 2: create a linguistic expression (such as subject-object-verb or agent-object-theme) from the objects extracted above; also extract prepositions from the scene.
Step 3: generate natural language sentences from the extracted triplets. A CRF is constructed that combines the unary image potentials computed in Step 1 with higher-order text-based potentials computed from large text corpora; a labeling of the graph is predicted, and sentences are generated from the labeling.
Step 4: validate these sentences against the training dataset.
Object detection and scene recognition
We extract object category detections using deformable part models (Felzenszwalb et al. 2011) for 89 common object categories (Li et al. 2010; Ordonez et al. 2011). Of course, running tens or hundreds of object detectors on an image would produce extremely noisy results. [10]
Some object detection methods make use of temporal information computed from a sequence of frames to reduce the number of false detections. Several common object detection methods are described below.
Point detectors
Point detectors are used to find interesting points in images. In the literature, commonly used interest point detectors include Moravec's detector, the Harris detector, the KLT detector, and the SIFT detector.
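As an illustration of interest-point detection, a minimal Harris corner response can be computed from image gradients alone. This is a simplified sketch (no Gaussian weighting or non-maximum suppression), not the full detector:

```python
import numpy as np

def harris_response(img, k=0.05, win=3):
    """Minimal Harris corner response: large positive values indicate
    corner-like interest points, negative values indicate edges."""
    iy, ix = np.gradient(img.astype(float))

    def box(a):
        # Simple box filter averaging over a win x win neighbourhood.
        out = np.zeros_like(a)
        h, w = a.shape
        r = win // 2
        for y in range(h):
            for x in range(w):
                out[y, x] = a[max(0, y - r):y + r + 1,
                              max(0, x - r):x + r + 1].mean()
        return out

    # Entries of the second-moment (structure) matrix.
    ixx, iyy, ixy = box(ix * ix), box(iy * iy), box(ix * iy)
    det = ixx * iyy - ixy * ixy
    trace = ixx + iyy
    return det - k * trace ** 2

# Synthetic image with one bright square: the response is positive near
# the square's corners and zero in flat regions.
img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0
r = harris_response(img)
print(r.max() > 0)
```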
Background Subtraction
Object detection can be achieved by building a representation of the scene called the
background model and then finding deviations from the model for each incoming frame. Any
significant change in an image region from the background model signifies a moving object.
The pixels constituting the regions undergoing change are marked for further processing.
This process is referred to as background subtraction.
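A minimal sketch of this idea follows, with the background modeled as the per-pixel median of past frames (a simplification of real background models):

```python
import numpy as np

# Minimal background-subtraction sketch: model the background as the
# per-pixel median of a frame history, then flag pixels that deviate.

def foreground_mask(frames, current, thresh=0.2):
    background = np.median(np.stack(frames), axis=0)   # background model
    return np.abs(current - background) > thresh       # deviation mask

# Static background frames, plus a new frame containing a "moving object"
# (a bright 2x2 patch).
frames = [np.zeros((8, 8)) for _ in range(5)]
current = np.zeros((8, 8))
current[2:4, 2:4] = 1.0
mask = foreground_mask(frames, current)
print(mask.sum())  # number of pixels marked as moving
```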
Segmentation
The aim of image segmentation algorithms is to partition the image into similar regions.
Every segmentation algorithm addresses two problems, the criteria for a good partition and
the method for achieving efficient partitioning. The literature discusses various segmentation techniques relevant to object tracking: mean-shift clustering, image segmentation using graph cuts (normalized cuts), and active contours.
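As a toy stand-in for these methods, pixel intensities can be grouped with a few k-means iterations; real mean-shift or graph-cut segmentation is considerably more involved:

```python
import numpy as np

# Sketch of segmentation as clustering: partition pixel intensities into
# k clusters with a few iterations of k-means. This is a simplified
# stand-in for the mean-shift / graph-cut methods discussed above.

def kmeans_segment(img, k=2, iters=10):
    pixels = img.reshape(-1, 1).astype(float)
    # Initialize cluster centers evenly across the intensity range.
    centers = np.linspace(pixels.min(), pixels.max(), k).reshape(-1, 1)
    for _ in range(iters):
        labels = np.argmin(np.abs(pixels - centers.T), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pixels[labels == c].mean()
    return labels.reshape(img.shape)

img = np.zeros((6, 6))
img[:, 3:] = 1.0          # two homogeneous regions
seg = kmeans_segment(img)
print(np.unique(seg[:, :3]), np.unique(seg[:, 3:]))
```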
The sliding window approach
Assume we are dealing with objects that have a relatively well-behaved appearance and do not deform much. Then we can detect them with a very simple recipe: build a dataset of labeled image windows of fixed size (say, n × m), train a classifier on these windows, and slide a window of the same size over the input image, classifying each window in turn.
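This recipe can be sketched as follows; the window "classifier" here is a stand-in (mean brightness), not a trained model:

```python
import numpy as np

# Sliding-window sketch: slide an n x m window over the image and score
# each window with a classifier; windows scoring above a threshold are
# reported as detections (top-left corners).

def sliding_window_detect(img, score_fn, n=4, m=4, stride=2, thresh=0.5):
    h, w = img.shape
    hits = []
    for y in range(0, h - n + 1, stride):
        for x in range(0, w - m + 1, stride):
            window = img[y:y + n, x:x + m]
            if score_fn(window) > thresh:
                hits.append((y, x))
    return hits

# Toy "classifier": mean brightness, so the bright patch is detected.
img = np.zeros((10, 10))
img[4:8, 4:8] = 1.0
print(sliding_window_detect(img, lambda w: w.mean()))
```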
Detecting deformable objects
High-level object models, under the name deformable-template models, were introduced in the statistics community by Grenander (1970, 1978). A statistical model is constructed that
describes the variability in object instantiation in terms of a prior distribution on deformations
of a template. The template is defined in terms of generators and bonds between subsets of
generators. The generators and the bonds are labeled with variables that define the
deformation of the template. In addition, a statistical model of the image data, given a
particular deformation of the template, is provided. The data model and the prior are
combined to define a posterior distribution on deformations given the image data. The model
proposed by Fischler and Elschlager (1973) is closely related, although not formulated in
statistical terms, and is quite ahead of its time in terms of the proposed computational tools.
The actual applications described in Grenander (1993) assume that the basic pose parameters,
such as location and scale, are roughly known—namely, the detection process is initialized
by the user. The models involve large numbers of generators with “elastic” types of
constraints on their relative locations. Because deformation space (the space of bond values) is high-dimensional, there is still much left to be done after location and scale are identified.
The algorithms are primarily based on relaxation techniques for maximizing the posterior
distributions.
Photometric invariance
Achieving photometric invariance, that is, invariance to variations in lighting, gray-scale maps, and so on, can be problematic. At the single-pixel level, the distributions can be rather
complex due to variable lighting conditions. Furthermore, the gray-level values have complex
interactions requiring complex distributions in high-dimensional spaces. The options are then
to use very simple models, which are computationally tractable but lacking photometric
invariance, or to introduce complex models, which entail enormous computational cost. An
alternative is to transform the image data to variables that are photometrically invariant, perhaps
at the cost of reducing the information content of the data.
Drawback:
Some form of initialization provided by the user is necessary. However, the introduction of
binary features of varying degrees of complexity allows us to formulate simpler and sparser
models with more-transparent constraints on the instantiations. Using these models, the
initialization problem can be solved with no user intervention and in a very efficient way.
Gist-based scene recognition
We encode global information of images using gist. Our features for scenes are the confidences of our AdaBoost-style classifier for scenes. First we build node features by training a discriminative classifier (a linear SVM) to predict each of the nodes independently from the image features. Although the classifiers are learned independently, they are well aware of other objects and scene information. We call these estimates node features. This is a number-of-nodes-dimensional vector, and each element in this vector provides a score for a node given the image; it can serve as a node potential for object, action, and scene nodes. We expect similar images to have similar meanings, so we obtain a set of features by matching our test image to training images. We combine these features into various other node potentials as follows:
- By matching image features, we obtain the k-nearest neighbours of the test image in the training set, then compute the average of the node features over those neighbours, computed from the image side. This gives a representation of what the node features are for similar images.
- By matching image features, we obtain the k-nearest neighbours of the test image in the training set, then compute the average of the node features over those neighbours, computed from the sentence side. This gives a representation of what the sentence representation does for images that look like our image.
- By matching the node features derived from classifiers and detectors (above), we obtain the k-nearest neighbours of the test image in the training set, then compute the average of the node features over those neighbours, computed from the image side. This gives a representation of what the node features are for images that produce similar classifier and detector outputs.
- By matching the node features derived from classifiers and detectors (above), we obtain the k-nearest neighbours of the test image in the training set, then compute the average of the node features over those neighbours, computed from the sentence side. This gives a representation of what the sentence representation does for images that produce similar classifier and detector outputs.
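The kNN pooling common to all four potentials can be sketched as follows, with toy feature vectors standing in for real image features:

```python
import numpy as np

# Sketch of the kNN node-feature pooling described above: match the test
# image's features against training-image features, then average the node
# features of the k nearest training images. All values are toy examples.

def knn_node_potential(test_feat, train_feats, train_node_feats, k=2):
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of k nearest images
    return train_node_feats[nearest].mean(axis=0)

train_feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])       # image features
train_node_feats = np.array([[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]])  # per-node scores
test_feat = np.array([0.05, 0.0])
print(knn_node_potential(test_feat, train_feats, train_node_feats))
```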
Create a linguistic expression
We decide to describe the image as the following tuple:
<
Named entities of the objects in the image,
Named entity representing the scene/background,
Prepositions describing relative locations of objects in the image
>
For example:
< (A dog, a woman, sofa), (a room), (woman on sofa, dog on floor)>
Constructed CRF
CRF labeling [1]
The task of assigning label sequences to a set of observation sequences arises in many fields, including bioinformatics, computational linguistics and speech recognition [6, 9, 12]. For example, consider the natural language processing task of labeling the words in a sentence with their corresponding part-of-speech (POS) tags. In this task, each word is labeled with a tag indicating its appropriate part of speech, resulting in annotated text. [11]
Conditional Random Field (CRF) [11]
We use a CRF to predict the best labeling for an image. The nodes of the CRF are:
a) objects (things or stuff);
b) attributes, which modify the appearance of an object;
c) prepositions (relationships between pairs of objects).
We minimize an energy function over labelings L of an image I, with text prior T. With N the number of objects, the factor 2/(N − 1) normalizes (for graphs with a variable number of nodes) the contribution of the object-pair terms so that they contribute equally with the single-object terms to the energy function.
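Assembling the pieces described above, the objective has the following general shape (a reconstruction from the description; the symbols ψ_i and ψ_ij are illustrative, not the paper's notation):

```latex
E(L; I, T) \;=\; \sum_{i=1}^{N} \psi_i(l_i;\, I, T)
\;+\; \frac{2}{N-1} \sum_{i<j} \psi_{ij}(l_i, l_j;\, I, T)
```

Each object participates in N − 1 pairs and each pair is shared by two objects, so the factor 2/(N − 1) gives the pair terms a total per-object weight of one, matching the unary terms.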
Converting to pairwise potentials [1]
Preposition nodes describe the relationship between a preposition label and two object labels, so they are most naturally modeled through trinary potential functions. Our CRF inference code, however, accepts only unary and pairwise potentials. We therefore convert each trinary potential into a set of unary and pairwise potentials by introducing an additional z node. For two object nodes, the z node has domain O1 × P × O2, where O1 is the domain of object node 1, P is the set of prepositional relations, and O2 is the domain of object node 2.
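The conversion can be sketched numerically: the trinary table becomes a unary potential on z, and large-penalty pairwise tables force the components of z's state to agree with the object nodes. The domain sizes and the penalty constant below are illustrative assumptions:

```python
import numpy as np

# Sketch of the trinary-to-pairwise conversion described above. A z node
# with domain O1 x P x O2 carries the trinary potential as a unary term;
# pairwise terms between z and each object node enforce consistency.

O1, P, O2 = 2, 3, 2                       # toy domain sizes
rng = np.random.default_rng(0)
trinary = rng.random((O1, P, O2))         # psi(o1, p, o2)

# Enumerate z's states and flatten the trinary table into a unary potential.
z_states = [(a, p, b) for a in range(O1) for p in range(P) for b in range(O2)]
z_unary = trinary.reshape(-1)

# Pairwise potential between z and object node 1: zero when the o1
# component of z's state agrees with node 1's label, a large penalty otherwise.
BIG = 1e9
pair_z_o1 = np.array([[0.0 if a == l else BIG for l in range(O1)]
                      for (a, p, b) in z_states])
print(z_unary.shape, pair_z_o1.shape)
```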
CRF learning [1]
We use a scoring function that is graph-size independent, measuring the score of a predicted labeling as:
a) the number of true object labels minus the number of false object labels, normalized by the number of objects; plus
b) the number of true mod-obj label pairs minus the number of false mod-obj pairs; plus
c) the number of true obj-prep-obj triples minus the number of false obj-prep-obj triples, normalized by the number of nodes and the number of pairs of objects.
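One plausible reading of this scoring rule in code (the exact normalization used in [1] may differ from this sketch):

```python
# Sketch of a graph-size-independent labeling score: true minus false
# counts for objects, mod-obj pairs, and obj-prep-obj triples, normalized
# by the number of objects and object pairs respectively.

def labeling_score(pred, gold, n_objects):
    obj = sum(+1 if o in gold["objects"] else -1 for o in pred["objects"])
    mod = sum(+1 if m in gold["mod_obj"] else -1 for m in pred["mod_obj"])
    trip = sum(+1 if t in gold["triples"] else -1 for t in pred["triples"])
    n_pairs = n_objects * (n_objects - 1) // 2
    return (obj + mod) / n_objects + trip / max(n_pairs, 1)

gold = {"objects": {"person", "sofa"},
        "mod_obj": {("brown", "sofa")},
        "triples": {("person", "against", "sofa")}}
pred = {"objects": ["person", "sofa"],
        "mod_obj": [("brown", "sofa")],
        "triples": [("person", "against", "sofa")]}
print(labeling_score(pred, gold, 2))
```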
CRF inference [1]
To predict the best labeling for an input image graph, we use the sequential tree-reweighted message passing (TRW-S) algorithm introduced by Kolmogorov [6], which improves upon the original TRW algorithm of Wainwright et al. [7]. These algorithms are inspired by the problem of maximizing a lower bound on the energy; TRW-S modifies TRW so that the value of the bound is guaranteed not to decrease. For our image graphs, the constructed CRF is relatively small (on the order of tens of nodes), so inference is quite fast, taking on average less than a second per image. [8]
Generation [1]
The output of our CRF is a predicted labeling of the image. The labeling encodes three kinds of information:
1. objects present in the image (nouns),
2. visual attributes of those objects (modifiers),
3. spatial relationships between objects (prepositions).
e.g. << white, cloud >, in, < blue, sky >>
Decoding using language models
An N-gram language model is a conditional probability distribution P(xi | xi−N+1, ..., xi−1) over N-word sequences (xi−N+1, ..., xi), such that the prediction of the next word depends only on the previous N − 1 words; that is, under an (N−1)-th order Markov assumption, P(xi | x1, ..., xi−1) = P(xi | xi−N+1, ..., xi−1). Language models have been shown to be simple but effective for improving machine translation and automatic grammar correction.
Suppose we want to determine whether to insert a function word x between a pair of words α and β in the meaning representation. Then we compare the length-normalized probability p̂(αxβ) with p̂(αβ), where p̂ takes the n-th root of the probability p of an n-word sequence, and p(αxβ) = p(α) p(x|α) p(β|x) using bigram (2-gram) language models. When considering more than two function words between α and β, dynamic programming can be used to find the optimal sequence of function words efficiently.
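This comparison can be sketched with a toy bigram table (the probabilities below are made up for illustration):

```python
# Sketch of the bigram insertion decision: compare the length-normalized
# probability of "cloud in sky" (inserting the function word "in") against
# "cloud sky" (no insertion). All probabilities are toy values.

bigram = {("cloud", "in"): 0.4, ("in", "sky"): 0.5, ("cloud", "sky"): 0.01}
unigram = {"cloud": 0.1, "in": 0.3, "sky": 0.1}

def norm_prob(words):
    p = unigram[words[0]]
    for a, b in zip(words, words[1:]):
        p *= bigram.get((a, b), 1e-6)     # chain of bigram probabilities
    return p ** (1.0 / len(words))        # length-normalized (n-th root)

with_x = norm_prob(["cloud", "in", "sky"])
without = norm_prob(["cloud", "sky"])
print(with_x > without)                   # insert "in" if True
```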
Templates with linguistic constraints [1]
Decoding based on language models is a statistically principled approach; however, it has two main limitations:
(1) it is difficult to enforce grammatically correct sentences using language models alone;
(2) it is ignorant of discourse structure (coherency among sentences), as each sentence is generated independently.
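A template-based alternative, which addresses limitation (1), can be as simple as filling a fixed pattern from a predicted labeling triple:

```python
# Template-based generation sketch: fill a fixed linguistic template from
# a << mod, obj >, prep, < mod, obj >> triple. The template wording is an
# illustrative assumption, not the system's exact template set.

def triple_to_sentence(triple):
    (mod1, obj1), prep, (mod2, obj2) = triple
    left = f"{mod1} {obj1}" if mod1 else obj1
    right = f"{mod2} {obj2}" if mod2 else obj2
    return f"The {left} is {prep} the {right}."

print(triple_to_sentence((("white", "cloud"), "in", ("blue", "sky"))))
```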
3. Conclusion
Generated sentences were evaluated by humans. Our system automatically mines and parses large text collections to obtain statistical models of visually descriptive language, takes advantage of state-of-the-art vision systems, and combines all of these in a CRF to produce input for language-generation methods.
References
[1] G. Kulkarni, V. Premraj, et al., "Understanding and Generating Simple Image Descriptions," NY 11794, USA.
[2] Y. Yang, C. L. Teo, H. Daumé III, and Y. Aloimonos, "Corpus-Guided Sentence Generation of Natural Images," ACM, 2011.
[3] L. Jie, B. Caputo, and V. Ferrari, "Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation," 2009.
[4] C. Lampert, H. Nickisch, and S. Harmeling, "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[5] A. Farhadi, S. M. M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. A. Forsyth, "Every Picture Tells a Story: Generating Sentences from Images," Springer, 2010.
[6] V. Kolmogorov, "Convergent Tree-Reweighted Message Passing for Energy Minimization," TPAMI, 28, Oct. 2006.
[7] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, "MAP Estimation via Agreement on (Hyper)trees: Message-Passing and Linear-Programming Approaches," IEEE Trans. Information Theory, 51:3697–3717, 2005.
[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," CVPR, 2009.
[9] M.-C. de Marneffe and C. D. Manning, "Stanford Typed Dependencies Manual."
[10] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, H. Daumé III, A. C. Berg, Y. Choi, and T. L. Berg, "Large Scale Retrieval and Generation of Image Descriptions."
[11] H. M. Wallach, "Conditional Random Fields: An Introduction," February 24, 2004.
[12] S. Sridhar, Image Processing.