research - Tamara Bergtamaraberg.com/Tamara/research.pdf · Claudia Schiffer German supermodel...

Research Statement

Tamara L. Berg

My research straddles the boundary between Computer Vision and Natural Language Processing, focusing on organizing large collections of images with associated text. There are billions of photographs with associated text available on the web. In order to organize, search and exploit these collections it is necessary to utilize both the visual and textual information effectively.

Images and words are often naturally linked. Some common examples include: web pages, captioned photographs, and video with speech or closed captioning. The central challenge that needs to be solved is how to extract images in which specified objects are depicted from large pools of pictures with noisy text. This is a very difficult problem; the relationship between words associated with an image and objects depicted within the image is often complex.

My work has demonstrated that for many situations these collections can be mined successfully. Some projects that I have worked on include: automatically labeling faces in news photographs, classifying images from the web, and ranking iconic images from consumer photo collections. All papers, created datasets, and demos are available online at: http://www.cs.berkeley.edu/ millert/.

Kate Winslet

Sam Mendes

British director Sam Mendes and his partner actress Kate Winslet arrive at the London premiere of ’The Road to Perdition’, September 18, 2002. The film stars Tom Hanks as a Chicago hit man who has a separate family life and co-stars Paul Newman and Jude Law. REUTERS/Dan Chung

Claudia Schiffer

German supermodel Claudia Schiffer gave birth to a baby boy by Caesarian section January 30, 2003, her spokeswoman said. The baby is the first child for both Schiffer, 32, and her husband, British film producer Matthew Vaughn, who was at her side for the birth. Schiffer is seen on the German television show ’Bet It...?!’ (’Wetten Dass...?!’) in Braunschweig, on January 26, 2002. (Alexandra Winkler/Reuters)

Figure 1: Example news photographs with associated captions. Faces have been detected and automatically labeled with names extracted from the associated captions.

Face Labeling

The majority of photographs contain people. This means that there are many wonderful resources for pictures of people. However, these resources are not easy to use directly because individual faces are rarely labeled with names. My work shows [1, 2] that for one such source, online news photographs with associated captions, it is possible to automatically label faces that appear in photographs. I accomplish this by analyzing the faces contained within the pictures in conjunction with the names contained within the captions. Two such labelings appear in figure 1.
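As a rough illustration of how a caption can constrain face labeling, the sketch below assigns a face only to a name that appears in its caption, choosing the candidate whose average face descriptor is closest. This is not the method of [1, 2]: the function `label_face`, the 2-D descriptors, and the per-name centroids are all hypothetical and chosen only to keep the example small.

```python
from math import dist

def label_face(face_vec, caption_names, name_centroids):
    """Return the caption name whose centroid is nearest to face_vec.

    Only names found in the caption are considered, so the text restricts
    the set of labels the visual comparison may choose from.
    Returns None when no caption name has a known centroid.
    """
    candidates = [name for name in caption_names if name in name_centroids]
    if not candidates:
        return None
    return min(candidates, key=lambda name: dist(face_vec, name_centroids[name]))

# Hypothetical 2-D face descriptors (real descriptors would be much larger).
centroids = {
    "Kate Winslet": (0.0, 1.0),
    "Sam Mendes": (1.0, 0.0),
    "Claudia Schiffer": (0.9, 0.1),
}
caption_names = ["Sam Mendes", "Kate Winslet"]

# The nearest centroid overall belongs to a name absent from the caption,
# so the caption constraint changes the answer.
print(label_face((0.8, 0.2), caption_names, centroids))
```

The design point is that neither cue suffices alone: the caption narrows the candidates, and the face descriptor picks among them.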

This labeling provides two major end results. The first is a large and varied face dataset which is already being used by several research groups outside of Berkeley. This dataset, containing 30,281 face images, is more realistic than typical face recognition datasets because it contains faces captured “in the wild” rather than in the lab: the faces display a wide range of positions, poses, facial expressions, and illuminations. The second product is a face dictionary (figure 2) which allows users to browse news photographs and articles according to the individuals present.

This project shows how one might exploit previous successes in Computer Vision (face detection) and Natural Language Processing (proper name detection) to accomplish a task that would be impossible to achieve using only one field or the other. It also demonstrates a new method for incorporating simple natural language techniques to model name context, which improves the labeling process significantly. Following publication, this work spawned several related projects in Computer Vision, including an extension used to label faces in video.

Actress Jennifer Lopez was nominated for a Golden Raspberry or Razzie award as "the year’s worst actress" for "Enough" and "Maid in Manhattan" on February 10, 2003. Lopez is shown at the premiere of "Maid in Manhattan" on Dec. 8 and is joined by Madonna, Britney Spears, Winona Ryder and Angelina Jolie for the dubious honor. (Jeff Christensen/Reuters)

Figure 2: I have created a web interface for organizing and browsing news photographs by individual. The dataset consists of 30,281 faces depicting approximately 3,000 different individuals. Here I show a screenshot from this face dictionary. Each face in the top image represents one cluster (one individual) in the dictionary. The bottom left image shows one cluster, actress Jennifer Lopez. The bottom right image shows one of that cluster’s indexed news photographs with its corresponding caption. Note the variety of expressions and images within the cluster. Available at: http://www.cs.berkeley.edu/ millert/

Figure 3: Some example images from the “monkey” category. Animal images are challenging to classify visually because they typically include a wide range of aspects, configurations, appearances and even multiple species.

Ranking Web Images

Truly automatic image search is one of the grand challenges of Computer Vision. This work focuses on a sub-problem of general object search: identifying web images containing various categories of animals [3]. Animals are particularly challenging visual categories because they are often depicted in a wide range of aspects, configurations and appearances, as seen in figure 3. In addition, the images in a category typically portray multiple species that differ in appearance (e.g., uakaris, vervet monkeys, spider monkeys and rhesus monkeys). My classification of the images is accurate despite this variation, and although the method was applied only to animal categories, it should be directly applicable to other general categories of objects.

The first step is to harvest a collection of pictures of animals from the web. For each category, I retrieve the 1000 top-ranked pages using Google text search. I then re-rank the images contained on these pages using a classification scheme that incorporates multiple text and image cues to determine whether an image depicts an animal of the considered category. Unlike the face labeling project with its structured captions, this system is applied to web pages with free text, making the word cue extremely noisy. This noisiness leads to poor performance for a purely text-based classification, but with the addition of visual information performance is quite good. This result is significant given that most commercial image search systems focus almost entirely on the available text, ignoring the images themselves.
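The effect of adding visual cues to a noisy text cue can be sketched as follows. This is a minimal illustration and not the classifier from [3]: it assumes each cue has already produced a score in [0, 1] per image and simply takes a weighted average, and the cue names ("text", "color", "shape"), the scores, and the function `rerank` are hypothetical.

```python
def rerank(images, weights=None):
    """Sort (image_id, {cue_name: score}) pairs by combined cue score.

    With weights=None every cue present counts equally; otherwise cues
    missing from `weights` get weight 0 and are ignored.
    """
    def combined(cues):
        w = weights if weights is not None else {name: 1.0 for name in cues}
        total = sum(w.get(name, 0.0) for name in cues)
        if total == 0:
            return 0.0
        return sum(w.get(name, 0.0) * score for name, score in cues.items()) / total

    return sorted(images, key=lambda item: combined(item[1]), reverse=True)

# Hypothetical per-cue scores for three retrieved images.
images = [
    ("a.jpg", {"text": 0.9, "color": 0.2, "shape": 0.1}),  # strong text, weak visuals
    ("b.jpg", {"text": 0.5, "color": 0.8, "shape": 0.9}),  # moderate text, strong visuals
    ("c.jpg", {"text": 0.1, "color": 0.1, "shape": 0.2}),
]

text_only = [name for name, _ in rerank(images, weights={"text": 1.0})]
all_cues = [name for name, _ in rerank(images)]
print(text_only)
print(all_cues)
```

In this toy data the text-only ranking puts `a.jpg` first, while the combined ranking promotes `b.jpg`, whose visual cues agree with the category.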

For the “monkey” category, I use the same algorithm to label a much larger collection of images. The dataset produced from this set of images is highly accurate (81% precision for the first 500 images) and displays great visual variety. This suggests that it should be possible to build enormous, rich sets of labeled images in a mostly automatic fashion.
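Precision for the first k images of a ranking is the fraction of those k images that truly belong to the category. The short sketch below computes it for a hypothetical list of relevance judgments; the function name and the sample data are illustrative and do not reproduce the reported experiment.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k ranked items that are relevant.

    ranked_relevance: booleans in ranked order (True = correct category).
    If fewer than k items exist, precision is computed over those present.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top = ranked_relevance[:k]
    if not top:
        return 0.0
    return sum(1 for relevant in top if relevant) / len(top)

# Hypothetical judgments for a ranked list of five images.
judgments = [True, True, False, True, False]
print(precision_at_k(judgments, 4))  # 3 of the first 4 are relevant
```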

Identifying Iconic Images

There are now many popular websites where people share and store pictures (such as Flickr or Shutterfly). Typically these pictures are labeled, often with the names of well-known objects. However, the labelings are not particularly accurate, perhaps because people will label all pictures in a memory card or from a trip with a particular word. This means that these collections are hard to use as is for training object recognition programs, or, for that matter, as a source of illustrations. It would be useful to be able to automatically select images from the set that depict the category well. An image that depicts a category well, from a good aspect and in an uncluttered way, is defined as an iconic image. Such iconic representations should exist for many categories, especially landmarks such as the Chrysler Building in figure 4, because people tend to take many photographs of these objects from characteristic viewpoints. These iconic images are also arguably the images that are most relevant to the category.

In this project, I show that iconic images can be identified rather accurately in natural datasets by segmenting images with a procedure that identifies foreground pixels, then ranking based on the appearance and shape of those foreground regions. This foreground/background segmentation also yields a good estimate of where the subject of the image lies. Results are evaluated qualitatively and with a user study, which indicates that this segmentation procedure is very helpful for classification.
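The idea of comparing images through their foreground regions only can be sketched in a few lines. This is a simplified, appearance-only illustration under strong assumptions, not the procedure from [4]: the foreground masks are taken as given rather than computed, shape is ignored, images are flattened lists of intensities in [0, 1], and histogram intersection is a stand-in similarity measure chosen for this example.

```python
def foreground_histogram(pixels, mask, bins=4):
    """Normalized intensity histogram over pixels where mask is truthy."""
    counts = [0] * bins
    for value, is_foreground in zip(pixels, mask):
        if is_foreground:
            counts[min(int(value * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * bins

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def rank_by_foreground(reference, candidates, bins=4):
    """Rank candidate names by foreground similarity to a reference image.

    reference: (pixels, mask); candidates: {name: (pixels, mask)}.
    """
    ref_hist = foreground_histogram(reference[0], reference[1], bins)
    scored = [
        (histogram_intersection(ref_hist, foreground_histogram(p, m, bins)), name)
        for name, (p, m) in candidates.items()
    ]
    return [name for _, name in sorted(scored, key=lambda t: t[0], reverse=True)]

# Hypothetical 4-pixel images: bright subject on a dark background.
reference = ([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 0])
candidates = {
    "cluttered": ([0.9, 0.4, 0.1, 0.6], [1, 1, 1, 0]),
    "clean": ([0.8, 0.85, 0.3, 0.2], [1, 1, 0, 0]),
}
print(rank_by_foreground(reference, candidates))
```

Restricting the histogram to masked pixels keeps background clutter from influencing the comparison, which is the intuition behind ranking on segmented subjects rather than whole images.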

Figure 4: Ranked images of the Chrysler Building. Top: images ranked using similarity computed from the shape and appearance of segmented objects. Bottom: images ranked using appearance similarity computed across the whole image. Incorporating the automatic detection of photographic subjects improves classification performance significantly.

Future Goals

The main thrust of my work will be to explore more areas and applications that can benefit from using a combination of text and image information. One problem that I am extremely excited about pursuing further is image search. Current image search technologies rely largely on textual information, using cues like the image file name or nearby words. My work on ranking web images shows that incorporating image information can boost performance significantly. I will extend and improve techniques used in my projects toward a goal of better automated image search for general object categories. This will involve a deeper understanding of the link between text and images as well as the development of techniques specific to different types of visual categories.

Another area of my future research will be to exploit the huge number of consumer photographs publicly available on the web. Photo sharing sites currently receive upwards of 10,000 photographs per hour. My work will focus on ways to organize these photo collections and procedures for automatically labeling pictures with appropriate text. From this immense amount of data I also hope to learn about the statistics of natural photographs and about what makes particular photographs aesthetically interesting. One application of this work would be to select subsets of interesting or similarly themed photographs from large collections of images.

Lastly, I would like to extend our understanding of visual and semantic categories. It has been estimated that a human can discriminate between about 3,000 object categories¹. Current object recognition algorithms can discriminate between on the order of 100 object categories, but much work will need to be done to identify a taxonomy of objects and to extend systems to deal with the complex and varied categories existing in the world.

References

[1] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, E. Learned-Miller, D. A. Forsyth, “Names and Faces in the News,” Computer Vision and Pattern Recognition, 2004.

[2] T. L. Berg, A. C. Berg, J. Edwards, D. A. Forsyth, “Who’s in the Picture,” Neural Information Processing Systems, 2004.

[3] T. L. Berg, D. A. Forsyth, “Animals on the Web,” Computer Vision and Pattern Recognition, 2006.

[4] T. L. Berg, D. A. Forsyth, “Automatically Ranking Iconic Images,” Berkeley Technical Report, Jan. 2007.

¹ Biederman, Recognition-by-Components: A Theory of Human Image Understanding
