Word Association Norms, Mutual Information, and Lexicography
Kenneth Ward Church, Patrick Hanks
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor.) We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). This paper will propose an objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (The standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) The proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words.
4985 citations
en
Computer Science
Dictionaries and lexicography in the AI era
Robert Lew
This paper examines the implications of AI and machine translation for traditional lexicography, using three canonical scenarios for dictionary use: text reception, text production, and text translation as test cases. With the advent of high-capacity, AI-driven language models such as OpenAI’s GPT-3 and GPT-4, and the efficacy of machine translation, the utility of conventional dictionaries comes under question. Despite these advancements, the study finds that lexicography remains relevant, especially for less-documented languages, where AI falls short but human lexicographers excel in data-sparse environments. It argues for the importance of lexicography in promoting linguistic diversity and maintaining the integrity of lesser-known languages. Moreover, as AI technologies progress, they present opportunities for lexicographers to expand their methodology and embrace interdisciplinarity. The role of lexicographers is likely to shift towards guiding and refining increasingly automated tools, ensuring ethical linguistic data use, and counteracting AI biases.
The ROI of AI in lexicography
Erin L. McKean, W. Fitzgerald
Large Language Models (LLMs) are being used for many language-based tasks, including translation, summarization and paraphrasing, sentiment analysis, and for content-generation tasks, such as code generation, answering search queries in natural language, and to power chatbots in customer service and other domains. Since much modern lexicography is based on investigation and analysis of large-scale corpora analogous to the (much larger) corpora used to train LLMs, we hypothesize that LLMs could be used for typical lexicographic tasks. A commercially-available LLM API (OpenAI’s ChatGPT gpt-3.5-turbo) was used to complete typical lexicographic tasks, such as headword expansion, phrase and form finding, and creation of definitions and examples. The results showed that the output of this LLM is not up to the standard of human editorial work, requiring significant oversight because of errors and “hallucinations” (the tendency of LLMs to invent facts). In addition, the externalities of LLM use, including concerns about environmental impact and replication of bias, add to the overall cost.
The Definition, Presentation and Automatic Generation of Contextual Data in Lexicography
M. J. Domínguez, R. Gouws
This paper deals with several aspects of context in lexicography. Section 1 briefly mentions some different approaches to the concept context in various fields. Section 2 puts the focus on different uses and perceptions of the concept context in lexicography, contrasting it with related concepts, such as cotext, contextualization and contextual information. A more comprehensive discussion also covers different aspects of the occurrence of the concept context in dictionary research, with specific reference to central aspects of the so-called inner and outer context. Various portals, dictionaries and dictionary entries will illustrate the above-mentioned approaches. Section 3 approaches the subject from a user perspective. Section 4 addresses the question How can contextual data be extracted or generated? To answer this question, some methods and tools for (automatic) acquisition and analysis of contextual data – in particular of the local contextual data in terms of Faber and León-Araúz (2016) – are introduced. Examples of these are lexical databases or semantic networks, like WordNet, and corpora, like Sketch Engine, or predictive methods, like Word2vec and similar ones. Some advantages and disadvantages of specific data acquisition tools used for the analysis of local contextual data are indicated. This section also contributes to a more detailed discussion of the automatic generation of the so-called local syntactic-semantic context or word environment, specifically of the building of syntactic-semantic argument patterns and their examples.
Terms of venery in Croatian and Russian languages
Lidija Milković
This paper analyses and compares the principles of linguistic categorisation of terms of venery in the Croatian and Russian languages. Since the monolingual and bilingual dictionaries of the Croatian and Russian languages do not exhaustively describe restrictions or criteria for the use of nouns that denote animals with terms of venery, a corpus study was conducted on the collocations of terms of venery, which aimed to determine more precisely how terms of venery classify groups of animals in the examined languages. In total, the paper analysed the use of 10 Croatian and 10 Russian terms of venery in the corpus of the Croatian language hrWaC and the corpus of the Russian language ruTenTen11 using the Sketch Engine tool. The sample was obtained by excerpting monolingual dictionaries of the Russian and Croatian languages. The similarities and differences in the classification of animal classes and the terms of venery used with each are schematically represented by four tables in the paper, which can serve as an aid when translating from one language to the other. The analysis showed that terms of venery do not categorise all groups of animals, but those that interact closely with humans. Etymological and historical lexicographical sources were used in the paper, with the goal of determining how the lexical combination of venery terms changed and to what extent the meaning of lexemes affects the principles of classification. Three naming models of terms of venery were singled out, namely: metonymic shift, types of animal movement, and the shape formed by a group of animals in motion. Some naming models are also classification criteria by which animals are divided. Terms of venery classify the animal world in both studied languages according to whether the animals are adults or young, according to the manner of moving, belonging to a particular (super)class of animals, and some quantifiers cover only one type of animal. 
Terms of venery cover the analysed categories of animal species in different ways in the Russian and Croatian languages, but the principles of organisation for the categories are very similar.
Steel metallurgy in the Republic of Croatia
Mirko Gojić
This paper examines the centuries-long history of the castle Kamenica, situated above the identically named settlement in the Croatian Zagorje region, not far from Lepoglava. It was first mentioned in writing in 1311. It was built in the second half of the 13th century at the latest, by an unknown Zagorje noble, as a fortified residence and centre of the identically named estate. Between 1399 and 1405, Kamenica became the property of Herman II of Celje, whose descendants held it until 1456. Although no source directly mentions the fall of Kamenica, it likely happened during the war for the Celje succession, and the castle was first mentioned as a ruin in 1459, when King Matthias Corvinus donated it to John Vitovac. During the time it was under the Vitovacs (1459–88), Kamenica was permanently abandoned, and its holdings were combined with the neighbouring Trakošćan estate into the joint Trakošćan-Kamenica estate, which was formally seated in Trakošćan, but factually in the fortified manor Klenovnik. The existing, meagre architectural elements suggest Kamenica was a small castle (castrum) dating from the second half of the 13th century. It consisted of a trapezoidal core at the top of a steep, conical eminence, formed of a walled, fortified house with a smallish courtyard protected by a wooden palisade, and a fortified, circular suburb, raised around the core with a combination of wooden palisades and earthen ramparts.
The Intellectualization of African Languages through Terminology and Lexicography: Methodological Reflections with Special Reference to Lexicographic Products of the University of KwaZulu-Natal
Langa Khumalo, D. Nkomo
Terminology development and practical lexicography are crucial in language intellectualization. In South Africa, the Department of Sport, Arts and Culture, National Lexicography Units, universities, commercial publishers and other organizations have been developing terminology and publishing terminographical/lexicographical resources to facilitate the use of African languages alongside English and Afrikaans in prestigious domains. Theoretical literature in the field of lexicography (e.g., Bergenholtz and Nielsen (2006); Bergenholtz and Tarp (1995; 2010); Gouws 2020) has attempted to resolve traditional distinctions between lexicography and terminology while also addressing terminological imprecisions in the relevant scholarship. Taking the cue from such scholarship, this article reflects on the methodological approaches for developing lexicographical products for specific subject fields, i.e., resources that document and describe terminology from specialized academic and professional fields. Its focus is on the use of traditional methods vis-à-vis the application of electronic corpora and their technologies in key practical tasks such as term extraction and lemmatization. The article notes that the limited availability of specialized texts in African languages hampers the development and deployment of advanced electronic corpora and their applications to improve the execution of terminological and lexicographical tasks, while also enhancing the quality of the products. The Illustrated Glossary of Southern African Architectural Terms (English–isiZulu), A Glossary of Law Terms (English–isiZulu) and the forthcoming isiZulu dictionary of linguistic terms are used for special reference. Keywords: intellectualization of African languages, lexicography, terminology, terminography, dictionary, subject field dictionaries, subject field lexicography, glossary, electronic corpora
The Problem of Choosing the Language of Communication: Ukrainian Realities
Svitlana Sokolova
The article analyses the situational change in the language behaviour (code switching) of representatives from different regions of Ukraine based on the data of a statistically significant mass survey of the adult population of different regions in 2017. The dependence of language behaviour on the language situation in a region is confirmed and certain patterns inherent in each of the regions are revealed. The use of the Ukrainian language has a similar dependence on the language situation in all regions, but at different levels: it is used more often when communicating with Ukrainian-speakers, officials, in education, etc. It is rarely used in the information sphere. The connection between the degree of Russification of a certain territory and the laws of language choice in a particular situation is revealed. The article proposes using the results of mass surveys to study language behaviour through the prism of territorial features, which will help to pursue a balanced state language policy, adjusting tactics in accordance with regional language features.
Computational linguistics. Natural language processing, Semantics
The history of the creation of the author's lexicography
Orzigul Shodimurodovna Khodieva, S. S. Sharipov
Between Coronationalism and Infodemic: Covid-19, New Words and New Connotations
Paola Attolino
Significant social change brings with it significant linguistic change. The recent global emergency caused by Covid-19, which has remorselessly spread all over the world in a few months, has significantly changed our lives and, consequently, our language. What is extraordinary is the rapidity with which this alteration in language has happened, so much so that the Oxford English Dictionary broke its quarterly publication cycle to update its coverage in April 2020. The aim of this paper is to give an overview of how language use has changed over a few weeks in response to an extraordinary event such as the Coronavirus pandemic. On the one hand, taking as a starting point the OED update, I will highlight the way technical terms have entered everyday language. On the other hand, I will observe to what extent common words and expressions have come to assume new connotative meanings.
Social Sciences, Language and Literature
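PENGEMBANGAN PERPUSTAKAAN KAMUS DIGITAL DI ERA DIGITECH EDUCATION PADA MASA PANDEMI COVID-19 [Development of a Digital Dictionary Library in the DigiTech Education Era during the COVID-19 Pandemic]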
Cicik Tri Jayanti, Lidya Amalia Rahmania, Moh. Safii
Abstract
This study aims to develop a web-based digital dictionary library information system as an effort to expand the inventory of cultural entries in the field of Lexicography in the DigiTech Education era during the current COVID-19 pandemic. This digital dictionary library information system is expected to serve as a means of archiving dictionaries of the archipelago's cultures created by students, and to be widely accessible to people who need information related to the contents of those dictionaries. The digital learning model and digital content were developed to improve the quality of learning, especially in Lexicography.
Keywords: web-based digital dictionary library, cultural entries, Lexicography, DigiTech Education
Abstrak
Penelitian ini bertujuan untuk mengembangkan sistem informasi perpustakaan kamus digital berbasis web sebagai upaya meningkatkan inventarisasi lema kebudayaan keilmuan Leksikografi di era DigiTech Education pada Masa Pandemi Covid-19 saat ini. Sistem informasi perpustakaan kamus digital ini diharapkan mampu menjadi sarana pengarsipan kamus-kamus kebudayaan nusantara hasil karya mahasiswa serta dapat diakses secara luas oleh masyarakat yang membutuhkan informasi terkait dengan konten kamus tersebut. Model pembelajaran digital dan juga konten digital ini dikembangkan untuk meningkatkan mutu kualitas pembelajaran khususnya dalam keilmuan Leksikografi.
Kata Kunci: Perpustakaan kamus digital berbasis web, lema kebudayaan, Leksikografi, DigiTech Education
Education (General), Language and Literature
A window to the future: Proposal for a lexicography-assisted writing assistant
Pedro A. Fuertes-Olivera, S. Tarp
The paper initially discusses some of the challenges posed to contemporary lexicography and stresses the need to move upstream in the value chain to guarantee future work. Today’s lexicographers must accept that their product par excellence is not dictionaries, but lexicographical data that can either be presented to the users in the form of dictionaries or be integrated into various types of tools, platforms, and services. From this perspective, the paper describes the functionalities of various digital writing assistants and focuses on one of them, namely the Spanish-English Write Assistant. It illustrates some decisions that have to be taken to prepare a database to feed both this tool and a series of online dictionaries. A proposal on how a large amount of lexicographical data can be presented in a small pop-up window without resorting to data overload will be discussed. In this connection, alternative ways of doing user testing in a digital environment are introduced. Finally, the paper stresses the importance of a human-centered design and terminology.
Slipping Through the Cracks in e-Lexicography
A. Frankenberg-Garcia, G. Rees, R. Lew
Despite the remarkable advances made in recent years to facilitate the lexicographer’s work of interpreting and synthesizing the complexity of language uncovered by corpora, an uncritical use of cutting-edge corpus tools and resources can instill a false sense of assurance. In this paper, authentic examples pertaining to wordlist use, collocation research and example selection that arose when compiling a real-world lexical database are discussed through the lens of problems that can easily slip through the cracks in e-lexicography. In doing so, we emphasize the importance of solid training and sound lexicographic judgment when using corpora, corpus tools and corpus-derived resources, and provide an opportunity to reflect on how e-lexicography can be further refined in the future.
The confessional sciences: scientific lexicography and sexology in the Oxford English Dictionary
S. Turton
The empirical advancements of scientific lexicography, on whose principles the Oxford English Dictionary was founded, paralleled the developments made in a range of other sciences in the nineteenth century. Rejecting the overt linguistic prescriptivism of many earlier lexicographers, the OED’s editors aimed to approach language as a natural system akin to any other; like their fellow scientists, they were concerned only with objective fact. Yet the representation of any human behaviour will necessarily be selective and subjective, and the conflicting evidence of real linguistic usage would complicate the lexicographers’ ideals of impassive collection and analysis. The better to cast this problem into relief, this paper juxtaposes scientific lexicography with another, more controversial nineteenth-century science: sexology. Sexologists’ pathologisation of ‘deviant’ sexual desires gave rise to an extensive new taxonomy, which the OED began documenting in the early decades of the twentieth century. Drawing on unpublished draft material from the dictionary’s archives, this paper examines the scientific ideologies of lexicography and sexology as they interacted in the OED, exploring what they reveal about the tension between scholarly principles and social practice.
Towards a Discourse Approach to Critical Lexicography
Wenge Chen
Ideology and power, the vital concerns of critical lexicography, are aspects of a dictionary that a lexicographer and a discerning dictionary user have to encounter in any serious lexicographical enterprise (Kachru 1995); however, critical lexicography as a theme did not receive much attention until Kachru and Kahane (1995). This term later appeared in Hornscheidt (2011) and Moon (2014). However, to date there has not been any systematic theoretical exemplification of what critical lexicography is and how critical lexicographical research is done. Additionally, the scope and function of critical lexicography is relatively limited when we consider the global context, since it fails to take into account theoretical and methodological inspirations from other disciplines such as Critical Discourse Studies and/or Postcolonial Studies, which would make it more theoretically robust and analytically explanatory. With this gap in mind, this paper proposes a discourse approach to Critical Lexicography, termed Critical Lexicographical Discourse Studies (CLDS), as a response to the call for lexicographers’ ‘social accountability’. Specifically, the article puts forward a definition of CLDS and its key concepts, denotes its ontological, epistemological and methodological orientations, delineates its principles, proposes a tentative analytic framework and demonstrates a simplified case study. The article argues that a discourse approach to critical lexicography opens up space to understand different meaning-making practices and contestation in lexicography. In doing so, this article contributes to the development of international (English) lexicography and the language(s) it represents.
30 citations
en
Computer Science
Towards electronic lexicography for the Kurdish language
Sina Ahmadi, Hossein Hassani, John P. McCrae
24 citations
en
Political Science
Language Picture of the World and Associative Lexicography
Natalya Ufimtseva, O. Balyasnikova
Conceptually, the language picture of the world is defined as the representation of the real world in the system of concepts and categories of the language. Language processes (e.g., nomination) are inextricably related to the processes of identifying objects from the outside world and their understanding. The three interdependent basic concepts, culture, activity, and language consciousness, are implied to reflect the activity which is preconditioned by the system of cultural regulators and the expression of these relations in communication by means of units and language categories. The Moscow psycholinguistic school, grounded in the speech activity theory, regards the world picture as a system phenomenon in the consciousness of a language speaker / culture bearer, interprets meaning as an ideal form for the objective world to exist, and specifies optimal methods to register the knowledge behind the word in the ordinary consciousness of a native language speaker / culture bearer. Among the methods applied, the authors use the free associative experiment to collect data from numerous respondents, speakers of the national language / members of a cultural community. The associative and verbal network, built on mass associative experiments, reflects the systematic connections between its elements, i.e. words, each of which, in turn, has its own meaning and systematicity. Comparing fragments of the language picture of the world (the model components) allows conclusions to be drawn about the discrepancy both between the meanings (knowledge behind the words) of components and their systematicity. Thus, the associative and verbal network as an integral model of the language picture of the world is capable of reflecting elements in the system at micro and macro levels, estimating the meaning and significance of each system element.
The advent of associative lexicography has made it possible to obtain the first ever comprehensive model of the world picture from a "naïve" native language speaker / a culture-bearer, and to observe the real interaction and interdependence between the meaning and significance of a separate word within the entire associative verbal network and the associative field. The data of mass associative experiments reflect the real state of everyday consciousness of a speaker / bearer of a certain language / culture. This evidence is used to analyze the synchronous state of consciousness, as well as to register changes that develop diachronically, given the data were collected at definite time periods. The observed changes indicate shifts in the associative verbal network structure along with the changes in the content of the knowledge associated with the word stimulus, thus reflecting the evolution of the society.
Lexicography
S. M. Assi
The history of lexicography in Iran dates back more than 2,000 years, to the time of the compilation of bilingual and monolingual lexicons for the Middle Persian language. After a review of the long and rich tradition of Persian lexicography, the chapter gives an account of the state of the art in the modern era by describing recent advances and developments in this field. During the last three or four decades, in line with the advancements in western countries, Iranian lexicography evolved from its traditional state into a modern professional and academic activity trying to improve the form and content of dictionaries by drawing on the following: the latest achievements in theoretical and applied linguistics related to lexicography; and computer techniques, information technology and a corpus-based approach to lexicography.
Connecting the Dots: Tradition and Disruption in Lexicography
S. Tarp
This article botanizes in the history of lexicography trying to connect the dots and get a deeper understanding of what is happening to the discipline in the framework of the Fourth Industrial Revolution. The objective is to suggest possible ways out of the present deadlock. History shows that a sudden change of the technological base, like the one we are now experiencing, suggests a total revolution of the discipline in all its major dimensions. In order to be successful, such a revolution requires a mental break with past traditions and habits. By way of example, the article focusses on a series of bilingual writing assistants developed by the Danish company Ordbogen A/S and the new challenges posed to lexicography by these and similar tools. It argues that these challenges cannot be solved by means of traditional user research, which is retrospective as it unfolds in the framework of an old paradigm. As an alternative, and without excluding other types of user research, the article recommends disruptive thinking by means of brainstorming, immersion, and contemplation and provides some examples of how to proceed. Finally, it problematizes the incipient competition between human and artificial lexicographers and gives a brief account of a possible future redistribution of tasks.
Lexicography and Theory: Clearing the Ground
Arleta Adamska-Sałaciak