
Multimedia databases and database contents retrieval

Date post: 30-May-2018
  • 8/14/2019 Multimedia databases and database contents retrieval

    1/21

    Multimedia Databases and Content-Based Retrieval

    Mais M. Fatayer

    Department of Computer Science

    Amman Arab University

Amman, Jordan

Email: [email protected]


Introduction

Traditional database management systems cannot handle the demands of managing multimedia data. With the rapid growth of multimedia platforms and the World Wide Web, database management systems must now process, store, index, and retrieve alphanumeric data, bitmapped and vector-based graphics, and video and audio clips, both compressed and uncompressed. Before the emergence of content-based retrieval, media were annotated with text, allowing the media to be accessed by text-based searching.

Through textual description, media can be managed based on the classification of subject or semantics. This hierarchical structure allows users to easily navigate and browse, and to search using standard Boolean queries. However, with the emergence of massive multimedia databases, traditional text-based search suffers from the following limitations:

- Manual annotations require too much time and are expensive to implement. As the number of media items in a database grows, the difficulty of finding desired information increases. It becomes infeasible to manually annotate all attributes of the media content. Annotating a sixty-minute video, containing more than 100,000 images, consumes a vast amount of time and expense.

- Manual annotations fail to deal with the discrepancy of subjective perception. The phrase "an image says more than a thousand words" implies that no textual description is sufficient for depicting subjective perception. To capture all concepts, thoughts, and feelings for the content of any media is almost impossible.

- Some media contents are difficult to describe concretely in words. For example, a piece of melody without lyrics or an irregular organic shape cannot easily be expressed in textual form, yet people expect to search for media with similar contents based on examples they provide.

In an attempt to overcome these difficulties, content-based retrieval employs


    content information to automatically index data with minimal humanintervention.

APPLICATIONS

Content-based retrieval has been proposed by different communities for various applications. These include:

Medical diagnosis: The amount of digital medical images used in hospitals has increased tremendously. As images with similar pathology-bearing regions can be found and interpreted, those images can be applied to aid diagnosis through image-based reasoning. For example, Wei & Li (2004) proposed a general framework for content-based medical image retrieval and constructed a retrieval system for locating digital mammograms with similar pathological parts.

Intellectual property: Trademark image registration has applied content-based retrieval techniques to compare a new candidate mark with existing marks to ensure that there is no repetition. Copyright protection can also benefit from content-based retrieval, as copyright owners are able to search for and identify unauthorized copies of images on the Internet. For example, Wang & Chen (2002) developed a content-based system using hit statistics to retrieve trademarks.

Broadcasting archives: Every day broadcasting companies produce large volumes of audio-visual data. To deal with these large archives, which can contain millions of hours of video and audio data, content-based retrieval techniques are used to annotate their contents and summarize the audio-visual data to drastically reduce the volume of raw footage. For example, Yang et al. (2003) developed a content-based video retrieval system to support personalized news retrieval.

Information searching on the Internet: A large amount of media has been made


available on the Internet for retrieval. Existing search engines mainly perform text-based retrieval. To access the various media on the Internet, content-based search engines can assist users in finding the information with the most similar contents based on their queries. For example, Hong & Nah (2004) designed an XML schema to enable content-based image retrieval on the Internet.

TEXT DOCUMENT INDEXING AND RETRIEVAL

IR (Information Retrieval) techniques are important in multimedia information management systems, since large numbers of text documents exist in many organizations such as libraries. Text is a very important information source for any organization, and it can be used to annotate other media such as audio, images, and video. The two major design issues of IR systems are how to represent documents and queries, and how to compare similarities between document and query representations. A retrieval model defines these two aspects. The most common technique is exact match; the Boolean model is discussed below as an example of this retrieval method.

Automatic Text Document Indexing and the Boolean Retrieval Model

Basic Boolean Retrieval Model

Most commercial IR systems can be classified as Boolean IR systems or text-pattern search systems. Text-pattern search queries are strings or regular expressions. During retrieval, all documents are searched, and those containing the query string are retrieved. Text-pattern systems are more common for searching small document databases or collections. In a Boolean retrieval system, documents are indexed by sets of keywords. Queries are also represented by sets of keywords, joined by logical (Boolean) operators that specify relationships between the query terms. Three types of operators are in common use: OR, AND, and NOT. Their retrieval rules are:

- The OR operator treats two terms as effectively synonymous. For example, given the query (term1 OR term2), the presence of either term in a record or document suffices to retrieve that record.

- The AND operator combines terms into term phrases; thus the query (term1 AND term2) indicates that both terms must be present in the document in order for it to be retrieved.

- The NOT operator is a restriction, or term-narrowing, operator that is normally used in conjunction with the AND operator to restrict the applicability of particular terms; thus the query (term1 AND NOT term2) leads to the retrieval of records containing term1 but not term2.
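The three operator rules above can be sketched in a few lines of Python. This is an illustrative toy, not part of the original text: the document names and keyword sets are invented for the example.

```python
# A minimal sketch of Boolean retrieval over keyword-indexed documents.
# Document names and keywords are illustrative, not from the text.

docs = {
    "d1": {"database", "retrieval", "multimedia"},
    "d2": {"database", "image"},
    "d3": {"audio", "retrieval"},
}

def boolean_or(term1, term2):
    """(term1 OR term2): either term suffices to retrieve the document."""
    return {d for d, kw in docs.items() if term1 in kw or term2 in kw}

def boolean_and(term1, term2):
    """(term1 AND term2): both terms must be present."""
    return {d for d, kw in docs.items() if term1 in kw and term2 in kw}

def boolean_and_not(term1, term2):
    """(term1 AND NOT term2): term1 present, term2 absent."""
    return {d for d, kw in docs.items() if term1 in kw and term2 not in kw}

print(sorted(boolean_or("database", "audio")))      # ['d1', 'd2', 'd3']
print(sorted(boolean_and("database", "retrieval"))) # ['d1']
print(sorted(boolean_and_not("database", "image"))) # ['d1']
```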


Term Operations and Automatic Indexing

A document contains many terms or words, but not every word is useful and important. For example, prepositions and articles such as "of", "the", and "a" are not useful for representing the content of the document. These terms are called stop words. During the indexing process, a document is treated as a list of words, and stop words are removed from the list. The remaining terms or words are further processed to improve indexing and retrieval efficiency and effectiveness. Common operations carried out on these terms are stemming, thesaurus lookup, and weighting.

Stemming is the automated conflation of related words, usually by reducing the words to a common root form. For example, suppose that the words "retrieval", "retrieved", "retrieving", and "retrieve" all appear in a document. Instead of treating these as four different words, for indexing purposes they are reduced to a common root, "retrieve", which is then used as an index term of the document.

Another way of conflating related terms is with a thesaurus that lists synonymous terms and sometimes the relationships among them. For example, the words "study", "learning", "schoolwork", and "reading" have similar meanings, so instead of using four index terms, a general term "study" can be used to represent all four.

Different index terms have different frequencies of occurrence and importance to the document. Note that the occurrence frequency of a term after stemming or thesaurus operations is the sum of the occurrence frequencies of all its variations. Introducing term-importance weights for document terms and query terms can distinguish the terms that are more important to the document for retrieval purposes from less important ones.
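The indexing pipeline just described (stop-word removal, stemming, frequency counting) can be sketched as follows. The stop-word list and suffix rules here are illustrative toys, far smaller than what a real stemmer such as Porter's algorithm uses; note that a crude suffix-stripper produces the root "retriev" rather than the word "retrieve".

```python
# A simplified sketch of the indexing pipeline described above:
# stop-word removal followed by crude suffix-stripping stemming.
from collections import Counter

STOP_WORDS = {"of", "the", "a", "and", "is", "in", "to"}

def stem(word):
    """Reduce a word to a rough root by stripping common suffixes."""
    for suffix in ("ing", "ed", "al", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Tokenize, drop stop words, stem, and count term frequencies.
    Frequencies of a term's variants are summed, as noted above."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(stem(w) for w in words)

terms = index_terms("the retrieval of retrieved and retrieving documents")
print(terms)  # all three variants conflate to the root 'retriev'
```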

    IMAGE INDEXING AND RETRIEVAL

Images are stored in a database in raw form, as a set of pixels or cell values, or in compressed form to save space. Each image is represented as a grid of cells. There are several approaches to image indexing and retrieval.

The first approach is attribute-based: the image contents are modeled as a set of attributes extracted manually and managed within the framework of conventional DBMSs, and queries are specified using these attributes. Examples of such attributes are image file name, image category, date of creation, subject, author, and image source. However, database attributes may not be able to describe the image contents completely. Another problem is that the types of queries are limited to those attributes.

The second approach, feature extraction/object recognition, depends on a subsystem to automate feature extraction and object recognition. The limitations of this approach are that it is computationally expensive, difficult to implement, and tends to be domain specific.

A third method is to annotate images with high-level features and use IR techniques to carry out retrieval: text describes the high-level features contained in the images, and retrieval uses relevance feedback and domain knowledge, which can overcome some problems of incompleteness and subjectiveness.

Finally, low-level features can be used to index and retrieve images.


In practice, the second and fourth approaches have provided good performance; however, the second approach is not applicable to general applications.

The following sections describe low-level-feature techniques, combined with text-based retrieval, in more detail: methods based on color, shape, and texture. In practice, text-based and low-level-feature-based techniques are combined to achieve relatively high performance.

Text-Based Image Retrieval

In text-based image retrieval, images are described with free text. Queries are in the form of keywords, with or without Boolean operators. The retrieval techniques are based on similarities between the query and the text descriptions of images. There are two main differences between text-based image retrieval and conventional text document retrieval.

First, text annotation is a manual process, because high-level image understanding is not yet possible automatically. In image annotation we care about efficiency and about describing image contents completely and consistently; domain knowledge or a thesaurus should be used to overcome completeness and consistency problems. Relationships between words or terms must also be considered. For example, images annotated with "child", "man", or "woman" should be found by a query using the keyword "human being", which intends to retrieve all images containing human beings.

Second, the text description may not be complete and may be subjective. Thus the use of a knowledge base and relevance feedback is extremely important for text-based image retrieval.

The advantage of text-based image retrieval is that it captures high-level abstractions and concepts, such as "smile" and "happy", contained in images. However, it cannot retrieve images based on an example, and some features, such as shape and texture, are difficult to describe in words.

Color-Based Image Indexing and Retrieval

This is a commonly used approach among content-based retrieval techniques. The idea of color-based image retrieval is to retrieve from the database images that have colors similar to the user's query.

Each image in the database is represented using the three channels of the chosen color space. The most common color space used is RGB (red, green, and blue). Each color channel is discretized into m intervals, so the total number of discrete color combinations (called bins), n, is equal to m^3. For example, if each color channel is discretized into 16 intervals, we have 4,096 bins in total.

A color histogram H(M) is a vector (h_1, h_2, h_3, ..., h_j, ..., h_n), where element h_j represents the number of pixels in image M falling into bin j. This histogram is the feature vector to be stored as the index of the image.

During image retrieval, a histogram is computed for the query image or estimated from the user's query. The distances between the histogram of the query image and the histograms of images in the database are measured. Images with a histogram distance smaller than a predefined threshold are retrieved from the database and presented to the user. Alternatively, the k images with the smallest distances are retrieved.
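The binning scheme above can be sketched as follows. This is an illustrative implementation, not from the text; the image (a list of (r, g, b) tuples) is invented for the example.

```python
# A sketch of the color-histogram index described above: each RGB channel
# is discretized into m intervals, giving n = m**3 bins. Pixel values
# are assumed to lie in 0..255.

def color_histogram(pixels, m=16):
    """Count pixels per bin; returns a vector of n = m**3 bin counts."""
    n = m ** 3
    hist = [0] * n
    width = 256 // m                      # size of each channel interval
    for r, g, b in pixels:
        bin_index = (r // width) * m * m + (g // width) * m + (b // width)
        hist[bin_index] += 1
    return hist

image = [(255, 0, 0)] * 40 + [(0, 255, 0)] * 24   # 64 pixels, two colors
h = color_histogram(image, m=16)
print(len(h), sum(h))   # 4096 bins; all 64 pixels are counted
```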


In the following formula, the L1 metric is defined as the distance between images I and H:

d(I,H) = Σ_{l=1}^{n} | i_l - h_l |

where i_l and h_l are the numbers of pixels falling into bin l in images I and H, respectively.

For example, suppose we have three images of 8×8 pixels, and each pixel is one of the eight colors C1 to C8. Image 1 has 8 pixels in each of the eight colors. Image 2 has 7 pixels in each of colors C1 to C4 and 9 pixels in each of colors C5 to C8. Image 3 has 2 pixels in each of colors C1 and C2, and 10 pixels in each of colors C3 to C8. Then we have the following three histograms:

H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)

The distances between these three images are:

d(H1,H2) = 1+1+1+1+1+1+1+1 = 8
d(H1,H3) = 6+6+2+2+2+2+2+2 = 24
d(H2,H3) = 5+5+3+3+1+1+1+1 = 20

Therefore, images 1 and 2 are the most similar and images 1 and 3 the most different, according to the histogram distance.
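The worked example above can be checked with a few lines of code (an illustrative sketch, using the three histograms from the text):

```python
# The L1 histogram distance d(I, H) = sum over bins of |i_l - h_l|,
# applied to the three example histograms above.

def l1_distance(h_a, h_b):
    """Sum of absolute per-bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))

H1 = (8, 8, 8, 8, 8, 8, 8, 8)
H2 = (7, 7, 7, 7, 9, 9, 9, 9)
H3 = (2, 2, 10, 10, 10, 10, 10, 10)

print(l1_distance(H1, H2))  # 8
print(l1_distance(H1, H3))  # 24
print(l1_distance(H2, H3))  # 20
```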

Image Retrieval Based on Shape

Shape representation is a fundamental issue in the newly emerging multimedia applications. In content-based image retrieval (CBIR), shape is an important low-level image feature. A good shape representation and similarity measure for recognition and retrieval purposes should have the following two important properties:

- Each shape should have a unique representation, invariant to translation, rotation, and scale.

- Similar shapes should have similar representations, so that retrieval can be based on distances among shape representations.

There are generally two types of shape representations: contour-based and region-based. Contour-based methods need extraction of boundary information, which in some cases may not be available. Region-based methods, on the other hand, do not necessarily rely on shape boundary information, but they do not reflect local features of a shape. Therefore, for generic purposes, both types of shape representations are necessary. Two shape descriptors that have been widely adopted for CBIR are Fourier descriptors (FD) and grid descriptors (GD).

Fourier descriptors method: In the Fourier descriptor-based method, a shape is first represented by a feature function called a shape signature. A discrete Fourier transform is applied to the signature to obtain the FDs of the shape. These FDs are used to index the shape and to calculate shape similarity.


Grid descriptors: In grid shape representation, a shape is projected onto a grid of fixed size. The grid cells are assigned the value 1 if they are covered by the shape (or covered beyond a threshold) and 0 if they are outside the shape. A shape number, consisting of a binary sequence, is created by scanning the grid in left-to-right and top-to-bottom order, and this binary sequence is used as the shape descriptor to index the shape.
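The grid-scanning step can be sketched as follows. The 4×4 grid and the shape (a filled 2×2 square) are invented for illustration:

```python
# A small sketch of the grid descriptor: cells covered by the shape are 1,
# cells outside it are 0; scanning left-to-right, top-to-bottom yields the
# binary shape number.

def grid_descriptor(grid):
    """Flatten a 0/1 grid into the binary shape-number string."""
    return "".join(str(cell) for row in grid for cell in row)

shape_grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(grid_descriptor(shape_grid))  # 1100110000000000
```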

Image Retrieval Based on Texture

Texture is an important image feature, but it is difficult to describe, and its perception is subjective to a certain extent. One of the best-known methods is the one proposed by Tamura, Mori, and Yamawaki. To find a texture description, they conducted psychological experiments, aiming to make the description conform to human perception as closely as possible. According to their specification, six features describe texture, as follows:

- Coarseness: coarse is the opposite of fine. Coarseness is the most fundamental texture feature, and to some people texture simply means coarseness. The larger the distinctive image elements, the coarser the image; so an enlarged image is coarser than the original one.

- Contrast: contrast is measured using four parameters: the dynamic range of gray levels of the image, the polarization of the distribution of black and white on the gray-level histogram (or the ratio of black and white areas), the sharpness of edges, and the period of repeating patterns.

- Directionality: a global property over the given region. It measures both element shape and placement; the orientation of the texture pattern itself is not important.

- Line-likeness: this parameter is concerned with the shape of a texture element. Two common types of shapes are line-like and blob-like.

- Regularity: this measures variation of the element-placement rule. It is concerned with whether the texture is regular or irregular. Differing element shapes reduce regularity. A fine texture tends to be perceived as regular.

- Roughness: this measures whether the texture is rough or smooth. It is related to coarseness and contrast.

Not all six features are used in texture-based image retrieval systems. For example, in the QBIC system, texture is described by coarseness, contrast, and directionality. Retrieval is based on similarity instead of exact match.

Integrated Image Indexing and Retrieval Techniques

An individual feature will not be able to describe an image adequately. For example, it is not possible to distinguish a red car from a red apple based on color alone. Therefore, a combination of features is required for effective image indexing and retrieval. A practical system, QBIC, was developed by IBM Corporation. It allows a large image database to be queried by visual properties such as colors, color percentages, texture, shape, and sketch, as well as by keywords.


QBIC capabilities have been incorporated into IBM's DB2 Universal Database product.

    VIDEO INDEXING AND RETRIEVAL

Video is information rich. A complete video may consist of text, sound track (both speech and nonspeech), and images, recorded or played out continuously at a fixed rate.

The following methods are used for video indexing and retrieval:

- Metadata-based method: video is indexed and retrieved based on structured metadata such as author/producer/director, date of production, and type of video.

- Text-based method: video can be indexed and retrieved using IR techniques on associated text.

- Audio-based method: using speech recognition and IR techniques, video can be indexed and retrieved based on the spoken words associated with video frames.

- Content-based method: there are two approaches. In the first, video is treated as independent frames or images, and the image indexing and retrieval methods are used. The other approach divides the video into groups of similar frames, and indexing is based on the representative frame of each group; this approach is called shot-based video indexing and retrieval.

- Integrated approach: two or more of the above methods can be combined to provide more effective video indexing and retrieval.

The following section describes the shot-based video indexing and retrieval technique.

Shot-Based Video Indexing and Retrieval Technique

A video sequence consists of a sequence of images taken at a certain rate. A long video contains many frames; if these are treated individually, indexing and retrieval become very hard. Therefore, video is divided into a number of logical units or segments called shots.

A shot can have the following features:

- The frames depict the same scene.
- The frames signify a single camera operation.
- The frames contain a distinct event and/or action, such as the significant presence of an object.
- The frames are chosen as a single indexable entity by the user.

[Figure: frames taken in the same scene and featuring the same group of people correspond to a shot; we need to identify the part of the video that contains the required information.]


Shot-based video indexing and retrieval consists of the following main steps:

1- Segment the video into shots (called video temporal segmentation, partitioning, or shot detection).

2- Index each shot. The common approach is to first identify key frames or representative frames (r frames) for each shot, and then use the image indexing methods described before.

3- Apply a similarity measurement between the query and the video shots, and retrieve shots with high similarity. This is achieved by using the image retrieval methods on the indexes or feature vectors obtained in step 2.

SEGMENT THE VIDEO INTO SHOTS

Video Shot Detection or Segmentation

Consecutive frames on either side of a camera break generally display a significant quantitative change, so a suitable quantitative measure that captures the difference between a pair of frames is needed. If the difference between two consecutive frames exceeds a given threshold, it may be interpreted as indicating a segment boundary. The camera break is the simplest transition between two scenes; a camera may also produce other transitions such as dissolve, wipe, fade-in, and fade-out. These operations produce a more gradual change between consecutive frames than a camera break does.

Basic Video Segmentation Techniques

The key issue in shot detection is how to measure frame-to-frame differences. The simplest way is to measure the sum of pixel-to-pixel differences between neighboring frames: if the sum is larger than a preset threshold, a shot boundary is declared between the two frames. However, this method is not effective, and many false shot detections will be reported, because two frames within one shot may have a large pixel-to-pixel difference due to object movement from frame to frame.


To overcome this limitation, newer methods measure the color histogram distance between neighboring frames. The principle behind these methods is that object motion causes only small histogram differences; if a large difference is found, a camera break has occurred. The following formula is used to measure the difference between the ith frame and its successor:

SD_i = Σ_j | H_i(j) - H_{i+1}(j) |

where H_i(j) denotes the histogram value of the ith frame at bin j, and j is one of the G possible gray levels. If SD_i is larger than a predetermined threshold, a shot boundary is declared.

Another simple but more effective approach compares histograms based on a color code derived from the R, G, and B components:

SD_i = Σ_j ( H_i(j) - H_{i+1}(j) )^2 / H_{i+1}(j)

This measurement is called the χ² test. Here, j denotes a color code instead of a gray level. In this technique, the selection of appropriate threshold values is a key issue in determining segmentation performance.
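The thresholded histogram-difference test can be sketched as follows. This is illustrative only: frames are given directly as small gray-level histograms, and the values and threshold are invented.

```python
# A sketch of histogram-based shot detection: compute SD_i between
# consecutive frame histograms and declare a boundary when it exceeds
# a threshold.

def histogram_difference(h_a, h_b):
    """SD_i = sum over levels j of |H_i(j) - H_{i+1}(j)|."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))

def detect_shot_boundaries(histograms, threshold):
    """Return indices i where a boundary lies between frames i and i+1."""
    return [
        i
        for i in range(len(histograms) - 1)
        if histogram_difference(histograms[i], histograms[i + 1]) > threshold
    ]

frames = [
    (30, 20, 10, 4),   # shot 1
    (29, 21, 10, 4),   # small change: object motion within the shot
    (2, 5, 27, 30),    # large change: camera break
    (3, 5, 26, 30),
]
print(detect_shot_boundaries(frames, threshold=10))  # [1]
```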

Detecting Shot Boundaries with Gradual Change

The techniques above rely on a single frame-to-frame difference threshold for shot detection. What has been found in practice is that they cannot detect shot boundaries when the change between frames is gradual, as in videos produced with fade-in, fade-out, dissolve, and wipe operations. They can also fail when the color histograms of two frames from two different scenes happen to be similar.

Fade-in is when a scene gradually appears. Fade-out is when a scene gradually disappears. Dissolve is when one scene gradually disappears while another gradually appears. Wipe is when one scene gradually enters across the frame while another gradually leaves. In such operations, the difference values tend to be higher than those within a shot but significantly lower than the shot threshold.

Here a single threshold does not work, because to capture these boundaries the threshold would have to be lowered significantly, causing many false detections. To overcome this situation, Zhang et al. developed a twin-comparison technique that can detect both normal camera breaks and gradual transitions. This technique requires the use of two difference thresholds:

- Tb: used to detect a normal camera break.
- Ts: a lower threshold used to detect the potential frames where a gradual transition may occur.

During the shot boundary detection process, consecutive frames are compared using one of the previously described methods.



If the difference is larger than Tb, a shot boundary is declared. If the difference is less than Tb but larger than Ts, the frame is declared a potential transition frame. The frame-to-frame differences of potential transition frames occurring consecutively are then accumulated. If the accumulated difference of consecutive potential frames is larger than Tb, a transition is declared and the consecutive potential frames are treated as a special segment. Note that the accumulated difference is only computed while the frame-to-frame difference remains larger than Ts.
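The twin-comparison logic can be sketched as below. This is one plausible reading of the rule, assuming the frame-to-frame differences have already been computed; all numbers are invented for illustration.

```python
# A sketch of the twin-comparison idea: Tb detects camera breaks; Ts flags
# potential transition frames whose differences are accumulated; if the
# accumulated difference exceeds Tb, a gradual transition is declared.

def twin_comparison(diffs, tb, ts):
    """Return a list of ('break', i) and ('gradual', start, end) events."""
    events, accumulated, start = [], 0, None
    for i, d in enumerate(diffs):
        if d > tb:                      # sharp change: camera break
            events.append(("break", i))
            accumulated, start = 0, None
        elif d > ts:                    # potential transition frame
            if start is None:
                start = i
            accumulated += d
            if accumulated > tb:        # accumulated change exceeds Tb
                events.append(("gradual", start, i))
                accumulated, start = 0, None
        else:                           # ordinary frame: reset accumulation
            accumulated, start = 0, None
    return events

diffs = [2, 3, 50, 2, 12, 14, 15, 2]   # one camera break, one dissolve
print(twin_comparison(diffs, tb=40, ts=10))  # [('break', 2), ('gradual', 4, 6)]
```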

VIDEO INDEXING AND RETRIEVAL

Now we need to represent and index each shot so that shots can be located and retrieved quickly in response to queries. The most common way is to represent each shot with one or more key frames or representative frames (r frames). Retrieval is then based on similarity between the query and the r frames.

Indexing and Retrieval Based on r Frames of Video Shots

Using a representative frame is the most common way to represent a shot. An r frame captures the main contents of the shot. Features of this frame are extracted and indexed based on color, shape, and texture, as in image retrieval. During retrieval, queries are compared with the indices or feature vectors of these frames. If an r frame is similar to the query, it is presented to the user, who can then play out the shot it represents. When a shot is static, any frame is good enough to serve as the representative frame; but when there is a lot of object movement in the shot, other methods should be used.

We need to address two issues regarding r frame selection: first, how many r frames should be used for a shot; second, how to select these r frames within a shot.

To determine how many r frames should be used, a number of methods have been proposed:

1- Use one r frame per shot. However, this method does not consider the length and content changes of shots.

2- Assign the number of r frames to shots according to their length: one r frame is assigned per second (or part of a second) of video. This method can partially overcome the limitation of the first method, but it ignores shot contents.

3- Divide a shot into subshots or scenes and assign one r frame to each subshot. A subshot is detected based on changes in contents, determined from motion vectors, optical flow, and frame-to-frame differences.

In the second step, we need to determine how these r frames are selected. Corresponding to the previous methods of determining the number of r frames for each shot, several possibilities have been proposed (here, the general term "segment" refers to a shot, a second of video, or a subshot, depending on the method used):

1- In the first method, the first frame of each segment is normally used as the r frame. This choice is based on the observation that cinematographers attempt to characterize a segment with the first few frames, before beginning to


track or zoom to a closeup. Thus the first frame of a segment normally captures the overall contents of the segment.

2- In the second method, an average frame is defined so that each pixel in this frame is the average of the pixel values at the same grid point in all frames of the segment. The frame within the segment that is most similar to this average frame is then selected as the representative frame of the segment.

3- In the third method, the histograms of all the frames in the segment are averaged. The frame whose histogram is closest to this average histogram is selected as the representative frame.

4- The fourth method is mainly used for segments captured using camera panning. Each image or frame within the segment is divided into background and foreground objects. A large background is constructed from the backgrounds of all frames, and then the main foreground objects of all frames are superimposed onto the constructed background.
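The average-histogram selection method (the third in this list) can be sketched as follows. The frame histograms here are invented for illustration, and L1 distance is assumed as the comparison measure:

```python
# A sketch of average-histogram r-frame selection: average the histograms
# of all frames in a segment and pick the frame whose histogram is closest
# (in L1 distance) to that average.

def select_r_frame(histograms):
    """Return the index of the frame closest to the average histogram."""
    n = len(histograms)
    avg = [sum(col) / n for col in zip(*histograms)]

    def dist(h):
        return sum(abs(a - b) for a, b in zip(h, avg))

    return min(range(n), key=lambda i: dist(histograms[i]))

segment = [
    (10, 0, 6),
    (8, 2, 6),
    (9, 1, 6),    # equal to the average histogram (9, 1, 6)
]
print(select_r_frame(segment))  # 2
```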

Among all the above-mentioned methods, it is hard to determine which is best; the choice of r frame is application dependent. The next sections address some additional techniques for video indexing and retrieval.

Indexing and Retrieval Based on Motion Information

Video indexing and retrieval based on motion information has been proposed to complement the r-frame-based approach, which treats a video as a collection of still images. In this method, motion information is derived from motion vectors and determined for each r frame, so r frames are indexed based on both image contents and motion information.

Indexing and Retrieval Based on Objects

Object-based indexing schemes find a way to distinguish individual objects throughout a given scene, that is, a complex collection of objects, and carry out the indexing process based on information about each object. This indexing strategy is able to capture changes in content throughout the sequence.

Indexing and Retrieval Based on Metadata

Metadata for video is available in some standard video formats. Video indexing and retrieval can be based on this metadata using conventional DBMSs.

Indexing and Retrieval Based on Annotation

Annotations can be produced by manual interpretation of the video, from transcripts and subtitles, or by applying speech recognition to the sound track to extract spoken words, which can then be used for indexing and retrieval.

AUDIO INDEXING AND RETRIEVAL

Digital audio is represented as a sequence of samples and is normally stored in compressed form.


For human beings, it is easy to recognize different types of audio: we can all tell whether a piece is music, noise, or human voice, and whether its mood is happy, sad, relaxing, and so on. For a computer, audio is just a sequence of sample values, so a retrieval technique is needed to access audio files and answer queries. The traditional method of accessing audio pieces is based on their titles or file names, which is not good enough to answer a query such as "find audio pieces similar to the one being played", in other words, query by example.

To overcome this problem, content-based audio retrieval techniques are required. The following general approach is normally taken:

- Audio is classified into some common types, such as speech, music, and noise.
- Different audio types are processed and indexed in different ways. For example, if the audio type is speech, speech recognition is applied and the speech is indexed based on the recognized words.
- Query audio pieces are classified, processed, and indexed in the same way.
- Audio pieces are retrieved based on the similarity between the query index and the indexes in the database.
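The general approach above can be sketched as a small dispatch pipeline. Every name and the toy classifier rule below are illustrative assumptions, not part of the text:

```python
# Illustrative sketch of content-based audio retrieval:
# classify the audio type, then dispatch to a type-specific indexer.
# The classifier rule and indexers here are toy placeholders.

def classify(samples):
    # Toy stand-in for a real speech/music/noise classifier.
    return "speech" if max(abs(s) for s in samples) < 0.5 else "music"

def index_speech(samples):
    # A real system would run speech recognition and index the words.
    return {"type": "speech", "words": []}

def index_music(samples):
    # A real system would extract features such as centroid and silence ratio.
    return {"type": "music", "features": [sum(s * s for s in samples)]}

INDEXERS = {"speech": index_speech, "music": index_music}

def build_index(samples):
    # Classify first, then process and index according to the type.
    return INDEXERS[classify(samples)](samples)
```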

Audio signals are represented in the time domain or the frequency domain, and different features are extracted from these two representations.

Time-Domain Features

- Average energy: indicates the loudness of the audio signal.
- Zero crossing rate: indicates the frequency of sign changes in the signal amplitude.
- Silence ratio: indicates the proportion of the sound piece that is silent.
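These three time-domain features can be computed directly from the sample values. A minimal sketch (the silence threshold is an illustrative assumption):

```python
def average_energy(samples):
    # Mean of squared sample values: a simple loudness measure.
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    # Fraction of adjacent sample pairs whose amplitudes change sign.
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)

def silence_ratio(samples, threshold=0.01):
    # Fraction of samples whose magnitude falls below a silence threshold.
    return sum(1 for s in samples if abs(s) < threshold) / len(samples)
```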

Frequency-Domain Features

- Sound spectrum: shows the frequency components and frequency distribution of a sound signal. In the frequency domain the signal is represented as amplitude varying with frequency, indicating the amount of energy at different frequencies.
- Bandwidth: indicates the frequency range of a sound; it can be taken as the difference between the highest and lowest frequencies of the non-zero spectrum components ("non-zero" may be defined as at least 3 dB above the silence level).
- Energy distribution: the distribution of signal energy across frequency components. One important feature derived from the energy distribution is the centroid, the mid-point of the spectral energy distribution of a sound; the centroid is also called brightness.
- Harmonicity: in harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and usually loudest, frequency, which is called the fundamental frequency. Music is normally more harmonic than other sounds.
- Pitch: the distinctive quality of a sound, dependent primarily on the frequency of the sound waves produced by its source. Only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch. In practice, the fundamental frequency is used as an approximation of the pitch.


Spectrogram

The previous two representations are simple, but each is incomplete: the amplitude-time representation does not show the frequency components of the signal, and the spectrum does not show when the different frequency components occur. To overcome these limitations, a combined representation called the spectrogram is used. The spectrogram of a signal shows the relation between three variables: frequency content, time, and intensity. Frequency content is shown along the vertical axis and time along the horizontal axis, with gray scale marking intensity, the darkest parts indicating the greatest amplitude or power.
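A minimal spectrogram sketch, assuming non-overlapping rectangular frames (practical systems use overlapping, windowed frames, e.g. with a Hamming window):

```python
import math

def spectrogram(samples, frame_size=8):
    # Magnitude spectrum per non-overlapping frame.
    # Rows = time frames, columns = frequency bins, values = intensity,
    # i.e. the three variables (time, frequency, intensity) described above.
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size + 1, frame_size)]
    out = []
    for frame in frames:
        n = len(frame)
        row = []
        for k in range(n // 2):
            re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(frame))
            im = sum(s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(frame))
            row.append(math.hypot(re, im))
        out.append(row)
    return out
```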

Audio Classification

Audio needs to be classified into speech, music, and possibly other categories and subcategories, because different audio types require different processing, indexing, and retrieval techniques, and they have different significance to different applications.

Main Characteristics of Different Types of Sound

The following characteristics of speech and music are the basis for audio classification.

Speech
Speech has a low bandwidth compared to music, within the range of 0-7 kHz; hence, the spectral centroid (brightness) of speech signals is usually lower than that of music. Speech signals have a higher silence ratio than music, because of the frequent pauses occurring between words and sentences.

Music
Music normally has a wide frequency range, from 16 to 20,000 Hz; thus, its spectral centroid is higher than that of speech. Music has a low silence ratio compared to speech; one exception may be music produced by a solo instrument or singing without accompanying music.

Audio Classification Framework

All classification methods are based on calculated feature values; however, they differ in how these features are used.

Step-by-Step Classification
Each feature is used individually in a different classification step; for example, one feature is used on its own to determine whether an audio piece is speech or music. Each feature acts as a filtering criterion, and at each step an audio piece is determined to be one type or another. In this classification method, the centroid of each input audio piece is calculated first: if the centroid is higher than a pre-determined threshold, the piece is music; otherwise it is either music or speech (since some music has a low centroid). The silence ratio is then calculated: if it is low, the piece is music; otherwise it is either solo music or speech (solo music has a high silence ratio). Finally, the ZCR (zero crossing rate) variability is calculated: if it is high, the piece is speech; otherwise it is solo music.


The order of steps in the algorithm is based on the differences between features: the less complicated feature with the higher differentiating power is used first. A possible filtering process is shown in Figure 1.

Figure 1: Audio classification process
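The step-by-step filtering described above can be sketched as a simple decision function; the threshold values are illustrative placeholders, not values given in the text:

```python
def classify_step_by_step(centroid, silence_ratio, zcr_variability,
                          centroid_thresh=2000.0,
                          silence_thresh=0.2,
                          zcr_thresh=0.1):
    # Step 1: a high spectral centroid indicates music.
    if centroid > centroid_thresh:
        return "music"
    # Step 2: remaining pieces are speech or music; a low silence
    # ratio indicates music.
    if silence_ratio < silence_thresh:
        return "music"
    # Step 3: remaining pieces are speech or solo music; high ZCR
    # variability indicates speech, otherwise solo music.
    if zcr_variability > zcr_thresh:
        return "speech"
    return "solo music"
```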

Feature-Vector-Based Audio Classification

The values of a set of features are calculated and used as a feature vector. During the training stage, the average feature vector is found for each class of audio. During classification, the feature vector of an input is calculated, and the vector distance between the input feature vector and each of the reference vectors is calculated. The input is classified into the class from which it has the least vector distance.
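The training and classification stages amount to a nearest-centroid classifier; a minimal sketch (a real system would also normalize the features):

```python
import math

def train(labelled_vectors):
    # Average feature vector per class from (label, vector) training pairs.
    sums, counts = {}, {}
    for label, vec in labelled_vectors:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
    return {label: [v / counts[label] for v in acc]
            for label, acc in sums.items()}

def nearest_class(reference, vec):
    # Assign the class whose reference vector is nearest in Euclidean distance.
    return min(reference, key=lambda label: math.dist(reference[label], vec))
```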



Speech Recognition

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as in applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding.

Speech recognition systems can be characterized by many parameters, some of the more important of which are:

- Speaking mode: isolated words to continuous speech
- Speaking style: read speech to spontaneous speech
- Enrollment: speaker-dependent to speaker-independent
- Vocabulary: small (under 20 words) to large (over 20,000 words)
- Language model: finite-state to context-sensitive
- Perplexity: small (under 10) to large (over 100)
- SNR: high (above 30 dB) to low (below 10 dB)


There are two types of music: structured (synthetic) music and sample-based music.

Indexing and Retrieval of Structured Music and Sound Effects

Structured music and sound effects are represented by a set of commands or algorithms. The most common structured music format is MIDI, which represents music as a number of notes and control commands. MPEG-4 is a newer standard for structured audio, which represents sound in algorithms and control languages. These standards were developed for sound transmission, synthesis, and production; they are not designed for indexing and retrieval purposes. However, the explicit structure and note descriptions in these formats make the retrieval process easy, since there is no need for feature extraction from audio signals. User queries for sound files will also depend on an exact match between the query and the database sound files. Sometimes the sound produced by a retrieved sound file may not be what the user wants, because different devices can render the same structured sound file differently.

Indexing and Retrieval of Sample-Based Music

There are two general approaches to indexing and retrieval of sample-based music.

Retrieval based on a set of features
A model is built for each class based on a set of features, and the similarity between the features of the query and the models is then computed.

Retrieval based on pitch
The pitch of each note is extracted or estimated, converting the musical sound into a symbolic representation.
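As an illustration of converting pitch into a symbolic representation, the sketch below maps fundamental frequencies to MIDI note numbers and derives an up/down/same contour string, one common symbolic index for melody matching (the contour scheme is an illustrative choice, not prescribed by the text):

```python
import math

def freq_to_midi(freq_hz):
    # Nearest MIDI note number for a fundamental frequency (69 = A4 = 440 Hz).
    return round(69 + 12 * math.log2(freq_hz / 440.0))

def pitch_contour(freqs):
    # Up/Down/Same contour of successive notes: a coarse symbolic index
    # that is robust to transposition of the melody.
    notes = [freq_to_midi(f) for f in freqs]
    return "".join("U" if b > a else "D" if b < a else "S"
                   for a, b in zip(notes, notes[1:]))
```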

FUTURE RESEARCH ISSUES AND TRENDS

Since the 1990s, remarkable progress has been made in theoretical research and system development. However, there are still many challenging research problems. This section identifies and addresses some issues in the future research agenda.

Automatic Metadata Generation

Metadata (data about data) is the data associated with an information object for the purposes of description, administration, technical functionality, and so on. Metadata standards have been proposed to support the annotation of multimedia content. Automatic generation of annotations for multimedia involves high-level semantic


representation and machine learning to ensure the accuracy of the annotation. Content-based retrieval techniques can be employed to generate the metadata, which can then be used by text-based retrieval.

Embedding Relevance Feedback

Multimedia contains large quantities of rich information and involves the subjectivity of human perception. The design of content-based retrieval systems has consequently come to emphasize an interactive approach instead of a computer-centric approach. A user-interaction approach requires human and computer to interact in refining the high-level queries. Relevance feedback is a powerful technique for facilitating interaction between the user and the system. The research issues include the design of the interface with regard to usability, and learning algorithms that can dynamically update the weights embedded in the query object to model high-level concepts and perceptual subjectivity.
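One well-known instance of such a weight-updating scheme is the classic Rocchio formula, which moves the query vector toward relevant examples and away from non-relevant ones; it is shown here only as an illustration, since the text does not prescribe a particular formula:

```python
def rocchio_update(query, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    # One Rocchio-style relevance-feedback step: weight the original
    # query, add the mean of user-marked relevant vectors, and subtract
    # the mean of non-relevant ones. Weights here are common defaults.
    def mean(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    pos, neg = mean(relevant), mean(non_relevant)
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos, neg)]
```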

Bridging the Semantic Gap

One of the main challenges in multimedia retrieval is bridging the gap between low-level representations and high-level semantics (Lew & Eakins, 2002). The semantic gap exists because low-level features are easily computed in the system design process, whereas high-level queries are the starting point of the retrieval process. Bridging it involves not only the conversion between low-level features and high-level semantics, but also the understanding of the contextual meaning of the query, which involves human knowledge and emotion. Current research aims to develop mechanisms or models that directly associate high-level semantic objects with representations of low-level features.

    Conclusion

So far, the main concepts, issues, and techniques in developing multimedia information indexing and retrieval systems have been discussed. The importance of multimedia databases has led researchers to focus their efforts on designing more efficient methods and techniques for retrieving the best results from these databases.

    References

Guojun Lu, Multimedia Database Management Systems, Artech House Publishers, 1999.

Chia-Hung Wei and Chang-Tsun Li, Design of Content-based Multimedia Retrieval, Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK.

Leung, Survey Papers on Audio Indexing and Retrieval, 2004/2005, http://www.it.cityu.edu.hk

Jahanzeb Farooq and Michael Osadebey, Content-based Image Retrieval & Shape as Feature of Image, Media Signal Processing, presentation.

Dengsheng Zhang and Guojun Lu, Content-Based Shape Retrieval Using Different Shape Descriptors: A Comparative Study, Gippsland School of Computing and Information Technology, Monash University, Churchill, Victoria 3842, Australia.


    Terms and Definitions

Boolean Query: a query that uses Boolean operators (AND, OR, and NOT) to formulate a complex condition. An example of a Boolean query is "university OR college".

Content-Based Retrieval: an application that directly makes use of the contents of the media, rather than annotations input by humans, to locate desired data in large databases.

Feature Extraction: a subject of multimedia processing that involves applying algorithms to calculate and extract attributes describing the media.

High-Level Features: features such as timbre, rhythm, instruments, and events, involving different degrees of semantics contained in the media.

Intensity: the power of a frequency component at a particular time interval.

Low-Level Features: features such as object motion, color, shape, texture, loudness, power spectrum, bandwidth, and pitch.

Query by Example: a method of searching a database using example media as the search criteria. This mode allows users to select predefined examples without requiring them to learn the use of query languages.

Segmentation: the process of dividing a video sequence into shots.

Shot: a short sequence of contiguous frames.

Similarity Measure: a measure that compares the similarity of any two objects represented in a multi-dimensional space. The general approach is to represent the data features as multi-dimensional points and then to calculate the distances between the corresponding points.

Video: a combination of text, audio, and images with a time dimension.

