PHOIBLE


PHOIBLE: PHOnetics Information Base and LExicon

Introduction and Rationale

Accessible language data is central to research in a variety of scientific disciplines, including natural language processing, speech and hearing sciences, pedagogy, and linguistics. Researchers who use computational methods in these areas recognize that more detailed information about languages' phonological systems will yield insights and advances in speech recognition and synthesis, language identification, computer-assisted language learning, machine translation, and the typological, historical, and genetic classification of languages. However, the full potential of language information will only be realized when language data is ported in a clear, well-structured form into accessible electronic resources.

Every spoken language has a phonology: a set of language-specific speech sounds. In languages with sound-based writing systems, literate speakers visualize these sounds through graphic representations. In English, for example, the letter 'a' represents the first sound in the word 'apple', 'b' the first sound in 'book', and so forth. However, languages also have a phonological component below the level of speech sounds that determines how speech sounds pattern together within a language. For example, the sounds 'm' and 'n' are both voiced and therefore pattern with other voiced sounds like 'z', 'b' and 'a'. At the same time, 'm' is made with the lips, as are 'b' and 'p', and therefore patterns with these labial sounds to the exclusion of 'n'. Finally, 'm' and 'n' pattern together to the exclusion of sounds like 'b' and 'd' because they are produced with a lowered velum, which makes them nasals. The dimensions along which sounds can be classified in this way are referred to as distinctive features. A language's phonological system can be defined by these distinctive features, which contrast to produce the wide variety of phonologies found in the languages of the world. This set of features plays both a paradigmatic and a syntagmatic role in a language's phonology: it defines the language's sound inventory and governs the combination of sounds into higher-level structures like syllables and words. For example, many languages, like Russian, permit clusters of consonants only if they all have the same value for voicing, while some, such as Tsou, permit combinations of voiced and voiceless elements in the same cluster.
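The patterning described above can be sketched in a few lines of Python. This is only an illustration: the feature table is a deliberately tiny toy with three binary features, and the names FEATURES, natural_class and agrees_in_voicing are hypothetical, not part of PHOIBLE.

```python
# Toy feature table for six sounds. The three binary features and their
# values are an illustrative assumption, not data taken from PHOIBLE.
FEATURES = {
    "m": {"voiced": True,  "labial": True,  "nasal": True},
    "n": {"voiced": True,  "labial": False, "nasal": True},
    "b": {"voiced": True,  "labial": True,  "nasal": False},
    "p": {"voiced": False, "labial": True,  "nasal": False},
    "d": {"voiced": True,  "labial": False, "nasal": False},
    "t": {"voiced": False, "labial": False, "nasal": False},
}


def natural_class(**spec):
    """Return the set of sounds whose features match every value in spec."""
    return {
        sound
        for sound, feats in FEATURES.items()
        if all(feats[name] == value for name, value in spec.items())
    }


def agrees_in_voicing(cluster):
    """True if every sound in the cluster has the same voicing value."""
    return len({FEATURES[sound]["voiced"] for sound in cluster}) <= 1


print(sorted(natural_class(nasal=True)))    # 'm' and 'n' pattern as nasals
print(sorted(natural_class(labial=True)))   # 'm' patterns with 'b' and 'p'
print(agrees_in_voicing("bd"), agrees_in_voicing("pd"))
```

A language that requires voicing agreement in clusters would accept only clusters for which agrees_in_voicing returns True, while a language that allows mixed clusters would not apply this check.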

All languages follow the principle of economy: the use of a few features to create a large number of phonemic contrasts. Competing with economy is markedness, the observation that some types of sounds are more complex than others and tend to be less well represented in languages. These observations led to the development of distinctive feature theory through the works of Trubetzkoy (e.g., 1939) and Jakobson (e.g., 1929) in the 1930s and 1940s. Features also define natural classes of sounds that commonly function together to produce phonological patterns, and they provide a fine-grained mechanism for linguistic analysis. Because the set of distinctive features is finite, as is the range of sounds humans can produce given anatomical restrictions, phonetics and phonology form the most computationally tractable problem space within language.
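As a rough illustration of economy, n binary features can distinguish at most 2^n sounds, and an inventory typically uses only some of those combinations. The sketch below counts used and unused combinations for a toy six-sound inventory; the inventory, the feature names and the function names are all assumptions made for the example.

```python
from itertools import product

# Toy inventory: each sound is a tuple of values for FEATURE_NAMES.
# The sounds and values are illustrative assumptions, not PHOIBLE data.
FEATURE_NAMES = ("voiced", "labial", "nasal")
INVENTORY = {
    "m": (True, True, True),
    "n": (True, False, True),
    "b": (True, True, False),
    "p": (False, True, False),
    "d": (True, False, False),
    "t": (False, False, False),
}


def possible_contrasts(num_binary_features):
    """Upper bound on the sounds that n binary features can distinguish."""
    return 2 ** num_binary_features


def used_combinations(inventory):
    """Number of distinct feature combinations the inventory uses."""
    return len(set(inventory.values()))


def unused_combinations(feature_names, inventory):
    """Feature combinations that no sound in the inventory occupies."""
    used = set(inventory.values())
    all_combos = product((False, True), repeat=len(feature_names))
    return [combo for combo in all_combos if combo not in used]


print(possible_contrasts(len(FEATURE_NAMES)))
print(used_combinations(INVENTORY))
print(unused_combinations(FEATURE_NAMES, INVENTORY))
```

In this toy inventory, three features yield six sounds out of eight possible combinations. The two unused combinations are the voiceless nasals, a class that is rare across languages, which is the kind of gap the markedness observation describes.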

Linguistics, like many data-driven sciences, is a discipline that stands to reap large advances from computational methods, computer processing and statistical models. Although the phonologies and phonetics of numerous individual languages have been described, the majority of these descriptions still reside in largely inaccessible formats, including paper, proprietary software programs, obsolete hardware, or non-interoperable encodings. To date there is no central repository for the sounds from all known languages that includes theoretical models of distinctive feature sets. Such a resource would allow researchers in a wide array of scientific disciplines to search the data, test theoretical hypotheses, and apply computational and machine learning analyses to a wide variety of languages.

We are developing the PHOnetics Information Base and Lexicon (PHOIBLE), a typological phonological database to encompass the feature sets and sound systems from all known languages for which resources can be discovered. The system will allow researchers, for example, to access detailed multilingual phonological data for computational analysis. It will provide language resources to support the development of language processing applications that are increasingly important to the global society. It will support bootstrapping from sparse data on lesser-known languages, automatic speech recognition and synthesis applications, language and dialect identification, and the testing of phonological rules. It will also serve as a tool for pedagogy. And it will provide linguists with a tool for testing current theories about the distribution of sounds across languages, as well as data for detailed comparison of sound change across languages in genetic, historical, and typological studies. Thus PHOIBLE will be a comprehensive, multidisciplinary resource for the sounds of the world's languages.

PHOIBLE will also be an archive of linguistic knowledge. Linguistics, like biology, is in the unfortunate position that its object of study is rapidly disappearing. Current estimates are that as many as 40% of the 6,000 languages currently spoken will be extinct within the next century (Krauss 1992). The PHOIBLE project will archive phonological information and make available online phonologies from resources that currently exist only in largely inaccessible formats.

PHOIBLE is being funded by a grant from the Royalty Research Fund at the University of Washington. Funding began in January 2009.

The PHOIBLE database currently contains the phonemic and corresponding graphemic inventories of 200 African languages. Try the Search PHOIBLE page to search and browse the languages in the database. Within a few weeks, the PHOIBLE database will also include the phonemic and corresponding allophonic inventories of 200 languages from the Stanford Phonology Archive (Crothers et al. 1979) and the phonemic inventories of 451 languages from UPSID (Maddieson 1984).

Under development is a Web-based GUI for inputting the phonemes, allophones and graphemes of sound inventories. This online input tool is currently being tested by students working with the PHOIBLE project. It will be made available to the public with the additional feature that it will provide users with an IPA chart of the language they input. For example, students will use the input app to enter the phonemic inventory of a language they are writing a term paper about. The student will then receive a formatted IPA chart in PDF for importing into a word processing document. The tool will give students an easy method for including phonetic charts in their papers, and it will give instructors a uniform format across students' papers. It will also save time, giving students more time to work on the paper's analysis.
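One way to picture the data such an input tool collects, and the chart it returns, is sketched below. This is a minimal illustration, not the PHOIBLE implementation: the Inventory class, the CHART_CELLS table (covering only eight consonants) and the plain-text rendering, which stands in for the PDF output, are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Inventory:
    """Hypothetical record for one language's sound inventory."""
    language: str
    phonemes: List[str]
    allophones: Dict[str, List[str]] = field(default_factory=dict)
    graphemes: Dict[str, List[str]] = field(default_factory=dict)


# (manner, place) of a few consonants, following the standard IPA chart.
CHART_CELLS = {
    "p": ("plosive", "bilabial"), "b": ("plosive", "bilabial"),
    "t": ("plosive", "alveolar"), "d": ("plosive", "alveolar"),
    "m": ("nasal", "bilabial"), "n": ("nasal", "alveolar"),
    "s": ("fricative", "alveolar"), "z": ("fricative", "alveolar"),
}
PLACES = ("bilabial", "alveolar")
MANNERS = ("plosive", "nasal", "fricative")


def chart_rows(inventory: Inventory) -> Dict[str, Dict[str, List[str]]]:
    """Group phonemes into a manner-by-place grid; unknown symbols are skipped."""
    grid = {manner: {place: [] for place in PLACES} for manner in MANNERS}
    for phoneme in inventory.phonemes:
        if phoneme in CHART_CELLS:
            manner, place = CHART_CELLS[phoneme]
            grid[manner][place].append(phoneme)
    return grid


def render_chart(inventory: Inventory) -> str:
    """Render the grid as fixed-width plain text, one row per manner."""
    grid = chart_rows(inventory)
    width = 12
    lines = ["".ljust(width) + "".join(place.ljust(width) for place in PLACES)]
    for manner in MANNERS:
        cells = "".join(" ".join(grid[manner][place]).ljust(width) for place in PLACES)
        lines.append(manner.ljust(width) + cells)
    return "\n".join(lines)


# Made-up example inventory, used only to exercise the functions.
toy = Inventory(
    language="Toy language",
    phonemes=["p", "b", "m", "n", "s"],
    allophones={"n": ["n", "ŋ"]},
    graphemes={"s": ["s", "ss"]},
)
print(render_chart(toy))
```

Keeping the chart layout in a lookup table separate from the inventory record means the same record could be rendered in other formats without changing the stored data.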

Objectives

The proposed project, PHOnetics Information Base and Lexicon (PHOIBLE), will:

The PHOIBLE project will also integrate distinctive feature sets from a selection of theoretically divergent models, such as Chomsky and Halle (1968) and related feature sets such as those proposed by Sagey (1990), Goldsmith (1990), and Bates et al. (2007), as well as sets from a variety of publications such as Flemming (2002) and Ladefoged and Maddieson (1996). For practical reasons we have chosen a relatively narrow sample of feature sets from leading theoretical approaches in the field, but once constructed the PHOIBLE database will be readily extensible to include any feature set. This will be accomplished by creating a mapping from each feature set to the complete IPA. In this way, the IPA will act as the pivot for interoperability across all resources in PHOIBLE.
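The pivot idea can be sketched as follows: each feature set is stored as a mapping keyed by IPA symbol, so a query phrased in one set can be answered in another by passing through the shared symbols. The two feature sets below ("set_a" and "set_b") are invented for illustration and do not reproduce the values of any published model; the function name translate is likewise hypothetical.

```python
# Two invented feature sets keyed by IPA symbol. Labels and values are
# illustrative assumptions, not taken from any of the cited models.
FEATURE_SETS = {
    "set_a": {
        "p": {"voice": "-", "nasal": "-"},
        "b": {"voice": "+", "nasal": "-"},
        "m": {"voice": "+", "nasal": "+"},
    },
    "set_b": {
        "p": {"laryngeal": "voiceless", "manner": "stop"},
        "b": {"laryngeal": "voiced", "manner": "stop"},
        "m": {"laryngeal": "voiced", "manner": "nasal"},
    },
}


def translate(spec, source, target):
    """Find IPA symbols matching spec in the source set, then return
    their descriptions in the target set, using IPA as the pivot."""
    matches = [
        symbol
        for symbol, feats in FEATURE_SETS[source].items()
        if all(feats.get(name) == value for name, value in spec.items())
    ]
    return {
        symbol: FEATURE_SETS[target][symbol]
        for symbol in matches
        if symbol in FEATURE_SETS[target]
    }


# A fully specified query picks out one sound; a partial query picks a class.
print(translate({"voice": "+", "nasal": "+"}, "set_a", "set_b"))
print(translate({"voice": "+"}, "set_a", "set_b"))
```

Under this design, adding a new feature set only requires one new mapping to IPA symbols rather than a pairwise mapping to every existing set.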

Linguistic information will be collected from authoritative resources and dynamically integrated into PHOIBLE throughout the duration of the project. Many of the languages to be added are understudied languages whose resources are electronically inaccessible, such as paper grammars from the 18th, 19th and 20th centuries. In addition, PHOIBLE will be built in a manner that allows users to add language resources as well, much as wiki software provides tools for collaborative documentation. This feature will be particularly useful for field linguists who are documenting previously undocumented languages.

The PHOIBLE project will also leverage computational research and facilities such as the Online Database of INterlinear Glossed Text (ODIN) for collecting information about languages and language resources automatically from the Web. Additional language resources will be gleaned from other Internet sources.

PHOIBLE is innovative in that it will provide researchers with resources for phonological investigation that do not yet exist. It will map feature sets from competing theories onto language data, allowing for fine-grained analysis or coarse underspecification of the features of speech sounds. In this way, PHOIBLE will build on and extend the work of UPSID, the Stanford Phonology Archive, and field linguists who have collected data from around the world, and it will make that work available to the larger scientific community.

Moran, Steven and Richard Wright. 2009. Phonetics Information Base and Lexicon (PHOIBLE). Online: http://phoible.org