
WordsEye - from Text to Image

Svetlana Stoyanchev

SLTC Newsletter, February 2012

WordsEye is software that generates three-dimensional graphical scenes. Unlike 3ds Max or Maya, WordsEye takes plain text as its input. The idea behind the project is to let users create scenes quickly by simply describing what they want to see.

Overview

When reading fiction (or non-fiction), we often use our imagination to visualize the scenes described, creating vivid images from textual narration. WordsEye aims to do something similar. The goal of WordsEye "is to provide a blank slate where the user can literally paint a picture with words, where the description may consist not only of spatial relations, but also actions performed by objects in the scene"[1].

The richness and expressiveness of natural language allow for the description of complex and dynamic scenes. However, textual scene descriptions are often vague and underspecified, leaving many details, such as the background of a scene and the relative positioning, sizes, and colours of objects, up to the reader's imagination. For example, the description "a person is sitting at a table and reading a book" evokes a visual scene based on our knowledge of how a chair is positioned with respect to a table, the sitting pose of a person, and the person's gaze towards the book. Although these specifics are absent from the textual description, they will be common to the scenes that different people imagine. When imagining a scene, we fill in these details from our background and common knowledge of the world.

When images are created automatically, the underspecified details have to be reconstructed by the system. WordsEye uses existing semantic resources, including WordNet and FrameNet. In addition, the authors are developing a new theory and lexical resource called VigNet[2]. VigNet records world knowledge that distinguishes between variants of relations based on their attributes. This world knowledge is then used when generating the visual representation of these relations.

How it works

In order to generate an image, the software has to understand natural language text.

Input text is parsed and converted to a semantic representation. Nouns are mapped to objects that correspond to 3D models available in the object library. Verbs, adjectives, and prepositions are mapped to semantic relations that represent, for example, an object's location, colour, or material. The semantic representation is then rendered as graphical objects in a 3D scene. WordsEye uses a library of 3D models with approximately 2000 different objects, including household items such as tables and chairs, animals, and less usual objects such as dragons or statues.

The software interprets spatial language and other low-level graphical relations in the input text. For example, "the cat is on the table" creates an image of a cat on top of a table.

The user may select different versions of the cat and table objects, and may add colours, textures, and lighting to the scene.
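The mapping from text to a scene can be illustrated with a toy sketch. It is not WordsEye's implementation: the object library, the preposition table, the sentence pattern, and the names `parse` and `layout` are all assumptions made for this example. It handles only sentences of the form "the X is on/under the Y", maps the nouns to library objects and the preposition to a spatial relation, and then derives a height for each object.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a 3D object library: name -> height in metres.
OBJECT_LIBRARY = {"cat": 0.3, "table": 0.75, "book": 0.05}

# Hypothetical mapping from prepositions to spatial relation names.
PREPOSITIONS = {"on": "ON_TOP_OF", "under": "BELOW"}

# Only one sentence shape is supported in this sketch.
PATTERN = re.compile(r"^the (\w+) is (\w+) the (\w+)\.?$", re.IGNORECASE)


@dataclass(frozen=True)
class Relation:
    name: str    # spatial relation, e.g. "ON_TOP_OF"
    figure: str  # object being placed
    ground: str  # reference object


def parse(sentence):
    """Turn 'the X is <prep> the Y' into a relation between library objects."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        raise ValueError(f"unsupported sentence: {sentence!r}")
    figure, prep, ground = (part.lower() for part in match.groups())
    for noun in (figure, ground):
        if noun not in OBJECT_LIBRARY:
            raise ValueError(f"no 3D object for noun: {noun!r}")
    if prep not in PREPOSITIONS:
        raise ValueError(f"unsupported preposition: {prep!r}")
    return Relation(PREPOSITIONS[prep], figure, ground)


def layout(relation):
    """Return the base height of each object; the ground object sits on the floor."""
    positions = {relation.ground: 0.0}
    if relation.name == "ON_TOP_OF":
        # The figure rests on the top surface of the ground object.
        positions[relation.figure] = OBJECT_LIBRARY[relation.ground]
    else:
        # BELOW: the figure also stands on the floor, beneath the ground object.
        positions[relation.figure] = 0.0
    return positions


relation = parse("The cat is on the table.")
print(relation)
print(layout(relation))
```

A real system needs a full parser and much richer geometry; the sketch only shows the division of labour between nouns (objects) and prepositions (relations).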

Using the software

Creating images from text is a fun experience: images appear as you describe them, much as they would in your imagination. It does take some learning to figure out which natural language expressions give the desired result. This can often be learned by example, by looking at other images with the desired effects and adopting their syntax or vocabulary.

The WordsEye website has been running for 6 years. It has 7000 users, who have created over 10,000 images. Some generated images are surreal and fairy-tale-like, and make you think of Alice in Wonderland or Dali's paintings.

WordsEye has also been piloted in an educational setting, where using the software helped sixth-graders improve their essay-writing skills[3].

Questions for one of WordsEye's creators, Bob Coyne

We asked Bob Coyne, the creator of WordsEye (in collaboration with Richard Sproat), about the current research directions of the project.

Q: Theatre set descriptions seem to be very suitable for automatic scene generation. For example, from Chekhov's Uncle Vanya: "A country house on a terrace. In front of it a garden. In an avenue of trees, under an old poplar, stands a table set for tea, with a samovar, etc. Some benches and chairs stand near the table. On one of them is lying a guitar. A hammock is swung near the table. It is three o'clock in the afternoon of a cloudy day." How far are you from generating images using such flexible language, and what, in your view, are the main challenges?

Bob: Something like that is certainly in the realm of possibility and represents the type of language the system is designed to handle as well as some of the key issues we're currently working on. For example, the current system is unable to depict words (e.g., for locations like "garden") that denote arrangements of multiple objects. Other compound object relations, such as "table set for tea", also pose a problem and would involve interpreting and translating a multiword description into a situation-specific configuration (what we call vignettes). Another challenge (e.g., in "hammock is swung") is that except for poses and facial expressions, WordsEye only deals with rigid 3D objects. This is especially important when processing descriptions of how people are dressed, where clothing must conform to the person wearing it. And this is also an issue when describing the shape of a particular object or its parts (e.g., "chair with a high curved back"). In fact, even for single-word or simple multiword nominals (e.g., "samovar", "country house", "old poplar"), we won't always have a 3D object corresponding to the specified type of entity or be able to modify an existing object by changing its shape or style. So all these cases would be covered by a fallback strategy where we instead use a closely related object (e.g., samovar - urn or kettle), or just drop the modifier and use a more generic form (country house - house; old poplar - poplar). Other aspects of the Uncle Vanya example, such as time of day and cloudiness, are both depictable in the current system, though the variation of language specification for that hasn't been extensively fleshed out. And some aspects (e.g., "near the table") are handled as-is. And of course there are many other issues, such as PP-attachment disambiguation, word sense disambiguation, etc.
But, overall, I think the Chekhov description represents the type of low-level descriptions the system should eventually be able to handle given the work we are doing to flesh out the system (such as adding support for words that denote arrangements of multiple objects) and a robust set of fallback strategies.
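The fallback strategy described in this answer can be illustrated with a short sketch. It is only one possible reading, not the WordsEye code: the model set, the table of related objects, the order in which the two fallbacks are tried, and the function name `resolve_model` are assumptions made for this example.

```python
# Hypothetical set of 3D models that exist in the library.
AVAILABLE_MODELS = {"house", "poplar", "urn", "kettle", "table"}

# Hypothetical table of closely related objects to substitute for a missing one.
RELATED_OBJECTS = {"samovar": ["urn", "kettle"]}


def resolve_model(phrase):
    """Pick a model for a noun phrase, falling back when no exact match exists.

    Returns the name of a model, or None if nothing suitable is found.
    """
    words = phrase.lower().split()
    if not words:
        return None
    # Try the full phrase first, then drop leading modifiers one at a time
    # ("country house" -> "house", "old poplar" -> "poplar").
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate in AVAILABLE_MODELS:
            return candidate
    # No generic form available: substitute a closely related object
    # for the head noun ("samovar" -> "urn").
    for substitute in RELATED_OBJECTS.get(words[-1], []):
        if substitute in AVAILABLE_MODELS:
            return substitute
    return None


for phrase in ("country house", "old poplar", "samovar", "guitar"):
    print(phrase, "->", resolve_model(phrase))
```

Treating the last word as the head noun is a simplification that works for these English examples; a real system would take the head from the parse.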

Q: A user can currently specify "low-level" graphical relations of the scene. Can you describe the direction that you are currently working on?

Bob: Any given scene can usually be described in a couple of very different ways. What we call low-level language can be used to describe what the scene looks like. This will include spatial relations between objects, surface properties like color and texture, poses and facial expressions of humans, etc. Low-level language can conceptually be mapped into graphical objects and a limited set of graphical constraints -- as exemplified by the Chekhov example. In contrast, high-level language describes not what a scene looks like but what event or state-of-affairs is represented by the scene. For example, if you say "wash an apple" you might expect to see a person holding an apple and standing in front of a sink. However, "wash the floor" might imply a person kneeling on the floor and holding a sponge. And "washing a car" might involve being outside on a driveway and holding a hose and pointing it at the car. The high-level semantics is basically the same -- an object is being washed -- but the low-level semantics are very different. The mappings between high-level semantics and low-level semantics (i.e., standard ways of doing things or common configurations of objects) are what we call vignettes. It might seem that there are an unbounded number of vignette types, but if you examine the actual structure of visual scenes, you'll notice that a limited set of structures are repeated again and again -- it's mostly the mapping between vignettes and high-level semantics that varies. For example, cutting carrots and writing on a piece of paper are structurally very similar -- both involve the same vignette type of applying a hand-held instrument (knife or pencil) to a patient (carrot or piece of paper) that is resting on a horizontal work surface (kitchen counter or desk).
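The idea of a vignette as a mapping from high-level to low-level semantics can be sketched as a lookup table. This is an illustration of the "wash" examples only and does not reflect how VigNet is actually represented: the `Vignette` fields, the table keys, and the function `expand` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vignette:
    location: str  # where the person is
    pose: str      # what the person's body is doing
    holding: str   # what the person has in hand


# Hypothetical table: (action, patient) -> low-level configuration.
# The three entries follow the "wash" examples in the answer above.
VIGNETTES = {
    ("wash", "apple"): Vignette("in front of a sink", "standing", "an apple"),
    ("wash", "floor"): Vignette("on the floor", "kneeling", "a sponge"),
    ("wash", "car"): Vignette(
        "outside on a driveway", "pointing the hose at the car", "a hose"
    ),
}


def expand(action, patient):
    """Expand a high-level event into low-level statements about the scene."""
    vignette = VIGNETTES.get((action, patient))
    if vignette is None:
        raise KeyError(f"no vignette for {action!r} applied to {patient!r}")
    return [
        f"the person is {vignette.location}",
        f"the person is {vignette.pose}",
        f"the person is holding {vignette.holding}",
    ]


for patient in ("apple", "floor", "car"):
    print(f"wash {patient}:", expand("wash", patient))
```

The same high-level action yields three different low-level descriptions, while the structure of each entry (location, pose, held object) stays the same.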

Q: How do you see WordsEye system used in the future?

Bob: I think WordsEye (and text-to-scene generation more generally) has tremendous potential in several application areas. We recently performed a controlled experiment using WordsEye as a tool to enhance literacy skills in a middle school summer enrichment program at the Harlem Educational Activities Fund. Students using it showed significantly greater growth between a pre-test and a post-test in writing and literary response compared to students in a control group. We're planning to apply it next to English Language Learners. Another potential area is social media, where the speed and ease of creating pictures can empower people to express themselves not just by what they write but by the pictures they create. We're also experimenting with using WordsEye to automatically visualize existing text -- in particular with Twitter to see if some percentage of Tweets can be automatically turned into pictures. A third, very interesting potential application area is in 3D games, where language could be used in the gameplay itself to change and interact with the environment. Games are increasingly allowing more in-game variation of the graphical content, and using natural language would provide an exciting new way to do that.

Acknowledgements

Participants in the WordsEye project at Columbia University include Richard Sproat, Owen Rambow, Julia Hirschberg, Daniel Bauer, Masoud Rouhizadeh, Morgan Ulinski, Margit Bowler, Jack Crawford, Kenny Harvey, Alex Klapheke, Gabe Schubiner, Cecilia Schudel, Sam Wiseman, Mi Zhou, Yen-Han Lin, Yilei Yang, and Victor Soto. This project is supported by a grant from the National Science Foundation, IIS-0904361.

References

  • [1] B. Coyne and R. Sproat. WordsEye: An automatic text-to-scene conversion system. In SIGGRAPH Proceedings of the Annual Conference on Computer Graphics, 2001.

  • [2] B. Coyne, O. Rambow, J. Hirschberg, and R. Sproat. Frame Semantics in Text-to-Scene Generation. In Proceedings of the KES'10 workshop on 3D Visualisation of Natural Language, 2010.

  • [3] B. Coyne, C. Schudel, M. Bitz, and J. Hirschberg. Evaluating a Text-to-Scene Generation System as an Aid to Literacy. In Proceedings of the ISCA workshop on Speech and Language Technology in Education, 2011.

  • [4] Interview with Bob Coyne on The Creators Project Blog

  • [5] Current version of the WordsEye website (new user registration available)

  • [6] Previous version of the WordsEye website

If you have comments, corrections, or additions to this article, please contact the author: Svetlana Stoyanchev, sstoyanchev [at] cs [dot] columbia [dot] edu.

Svetlana Stoyanchev is a Postdoctoral Research Fellow at Columbia University. Her interests are in dialog and information presentation.