Learning Structured Embeddings of Knowledge Bases
Antoine Bordes, J. Weston, R. Collobert
et al.
Many Knowledge Bases (KBs) are now readily available and encompass colossal quantities of information thanks to either a long-term funding effort (e.g. WordNet, OpenCyc) or a collaborative process (e.g. Freebase, DBpedia). However, each of them is based on a different rigorous symbolic framework which makes it hard to use their data in other systems. It is unfortunate because such rich structured knowledge might lead to a huge leap forward in many other areas of AI like natural language processing (word-sense disambiguation, natural language understanding, ...), vision (scene classification, image semantic annotation, ...) or collaborative filtering. In this paper, we present a learning process based on an innovative neural network architecture designed to embed any of these symbolic representations into a more flexible continuous vector space in which the original knowledge is kept and enhanced. These learnt embeddings would allow data from any KB to be easily used in recent machine learning methods for prediction and information retrieval. We illustrate our method on WordNet and Freebase and also present a way to adapt it to knowledge extraction from raw text.
939 citations
en
Computer Science, Mathematics
Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev
Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
Bialgebraic Reasoning on Stateful Languages
Sergey Goncharov, Stefan Milius, Lutz Schröder
et al.
Reasoning about program equivalence in imperative languages is notoriously challenging, as the presence of states (in the form of variable stores) fundamentally increases the observational power of program terms. The key desideratum for any notion of equivalence is compositionality, guaranteeing that subprograms can be safely replaced by equivalent subprograms regardless of the context. To facilitate compositionality proofs and avoid boilerplate work, one would hope to employ the abstract bialgebraic methods provided by Turi and Plotkin's powerful theory of mathematical operational semantics (a.k.a. abstract GSOS) or its recent extension by Goncharov et al. to higher-order languages. However, multiple attempts to apply abstract GSOS to stateful languages have thus far failed. We propose a novel approach to the operational semantics of stateful languages based on the formal distinction between readers (terms that expect an initial input store before being executed), and writers (running terms that have already been provided with a store). In contrast to earlier work, this style of semantics is fully compatible with abstract GSOS, and we can thus leverage the existing theory to obtain coinductive reasoning techniques. We demonstrate that our approach generates non-trivial compositionality results for stateful languages with first-order and higher-order store and that it flexibly applies to program equivalences at different levels of granularity, such as trace, cost, and natural equivalence.
Idefix-Closed Languages and Their Application in Contextual Grammars
Marvin Ködding, Bianca Truthe
In this paper, we continue the research on the power of contextual grammars with selection languages from subfamilies of the family of regular languages. We investigate infix-, prefix-, and suffix-closed languages (referred to as idefix-closed languages) and compare such language families to some other subregular families of languages (finite, monoidal, nilpotent, combinational, (symmetric) definite, ordered, non-counting, power-separating, commutative, circular, union-free, star, and comet languages). Further, we compare the families of the hierarchies obtained for external and internal contextual grammars with the language families defined by these new types for the selection. In this way, we extend the existing hierarchies by new language families. Moreover, we solve an open problem regarding internal contextual grammars with suffix-closed selection languages.
Identity Testing for Stochastic Languages
Smayan Agarwal, Shobhit Singh, Aalok Thakkar
Determining whether an unknown distribution matches a known reference is a cornerstone problem in distributional analysis. While classical results establish a rigorous framework in the case of distributions over finite domains, real-world applications in computational linguistics, bioinformatics, and program analysis demand testing over infinite combinatorial structures, particularly strings. In this paper, we initiate the theoretical study of identity testing for stochastic languages, bridging formal language theory with modern distribution property testing. We first propose a polynomial-time algorithm to verify if a finite state machine represents a stochastic language, and then prove that rational stochastic languages can approximate an arbitrary probability distribution. Building on these representations, we develop a truncation-based identity testing algorithm that distinguishes between a known and an unknown distribution with sample complexity $\widetilde{\Theta}\left( \frac{\sqrt{n}}{\varepsilon^2} + \frac{n}{\log n} \right)$ where $n$ is the size of the truncated support. Our approach leverages the exponential decay inherent in rational stochastic languages to bound truncation error, then applies classical finite-domain testers to the restricted problem. This work establishes the first identity testing framework for infinite discrete distributions, opening new directions in probabilistic formal methods and statistical analysis of structured data.
Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
N. Vanetik, Marina Litvak, Chaya Liebeskind
Offensive language detection in Arabic is a challenging task because of the unique linguistic and cultural characteristics of the Arabic language. This study introduces a high-quality annotated dataset for classifying offensive language in Arabic, based on a structured taxonomy, categorizing offensive content across seven levels, capturing both explicit and implicit expressions. Utilizing this taxonomy, we re-annotate the FARAD-500 dataset, creating reFarad-500, which provides fine-grained labels for offensive texts in Arabic. A thorough dataset analysis reveals key patterns in offensive language distribution, emphasizing the importance of target type, offense severity, and linguistic structures. Additionally, we assess text classification techniques to evaluate the dataset’s effectiveness, exploring the impact of sentiment analysis and emotion detection on classification performance. Our findings highlight the complexity of Arabic offensive language and underscore the necessity of extensive annotation frameworks for accurate detection. This paper advances Arabic natural language processing (NLP) in resource-constrained settings by enhancing the recognition of hate speech and fostering a deeper understanding of the linguistic and emotional dimensions of offensive language.
Nirvāṇa and the ten Dharmas in the Altun Yaruk Sudur text
Kübra, Hacer
Buddhism is a path based on understanding the fundamental truths of life and attaining
enlightenment. Nirvāṇa is the ultimate goal of this path. On this journey, the concept of nirvāṇa
and the ten dharmas associated with it hold significant importance. According to Buddhist
thought, the ten dharmas serve as a guide that enables a being to progress on the right path. In
the concept of nirvāṇa explained by the Tathāgatas (Buddhas), emphasis is placed on these ten
fundamental characteristics. These ten features explain the state of enlightenment of the
Buddhas and how they perceive the truth. They demonstrate how a being can attain absolute
reality, that is, nirvāṇa. They express how the Buddhas comprehend the nature of truth and
ultimate peace through enlightenment. Through these ten dharmas, Buddhas establish complete
harmony between individual and universal reality. Additionally, as enlightened beings, they aim
to guide all other beings through these ten dharmas. Nirvāṇa does not merely represent a state
but signifies the transformation of all existence and the complete realization of truth. It is the
key to liberation from the cycle of birth and death. It shows that it is possible for a being to be
freed from suffering. Furthermore, when a being attains nirvāṇa, they reach the peak by
achieving complete freedom. By embracing nirvāṇa and the ten dharmas associated with it, a
being not only transforms themselves but, according to the Mahāyāna school, also positively
influences the other beings around them. This contributes to leading a conscious, harmonious,
and meaningful life. As a result, the sense of responsibility of the being towards both
themselves and their surroundings increases. A being who fulfills these responsibilities attains
nirvāṇa, the state of peace.
In this study, the concept of nirvāṇa in the Altun Yaruk Sudur text and all the terms that
express the concept of nirvāṇa will be identified, and the semantic characteristics of these terms
will be analyzed in detail.
Language and Literature, Ural-Altaic languages
FIRA: Fine-Grained Graph-Based Code Change Representation for Automated Commit Message Generation
Jinhao Dong, Yiling Lou, Qihao Zhu
et al.
74 citations
en
Computer Science
It's Not Easy Being Green: On the Energy Efficiency of Programming Languages
Nicolas van Kempen, Hyuk-Je Kwon, Dung Tuan Nguyen
et al.
Does the choice of programming language affect energy consumption? Previous highly visible studies have established associations between certain programming languages and energy consumption. A causal misinterpretation of this work has led academics and industry leaders to use or support certain languages based on their claimed impact on energy consumption. This paper tackles this causal question directly: it develops a detailed causal model capturing the complex relationship between programming language choice and energy consumption. This model identifies and incorporates several critical but previously overlooked factors that affect energy usage. These factors, such as distinguishing programming languages from their implementations, the impact of the application implementations themselves, the number of active cores, and memory activity, can significantly skew energy consumption measurements if not accounted for. We show -- via empirical experiments, improved methodology, and careful examination of anomalies -- that when these factors are controlled for, notable discrepancies in prior work vanish. Our analysis suggests that the choice of programming language implementation has no significant impact on energy consumption beyond execution time.
Võro ja seto keelevahetus XX–XXI sajandil. Kas pöördumatu protsess?
Liina Lindström, Maarja-Liisa Pilvik, Helen Plado
et al.
"Võro and Seto language shift in the 20th–21st century: An irreversible process?" This article approaches the endangerment status of Võro and Seto from two main aspects: intergenerational language transmission and language teaching in schools. Despite the revitalisation efforts for Võro and Seto beginning 35 years ago, their overall situation has not seen significant improvement. Currently, there are very few children growing up in a Võro/Seto speaking home.
In our survey, we focused on the mechanisms of intergenerational language transmission. This included identifying from whom different generations acquired their Võro/Seto proficiency and how many have passed the language on to their children. We also investigated experiences and attitudes towards learning the Võro/Seto language in schools. A total of 660 individuals with Võro/Seto roots responded to the survey.
The results indicate a steady language shift from Võro/Seto to Estonian since the 1960s. Võro and Seto have been relegated to the status of “granny languages”, with inconsistent usage resulting in partial acquisition and insufficient transmission to the next generation. Consequently, the role of kindergartens and schools in language learning is becoming increasingly important. However, current efforts have not been adequate to ensure sufficient proficiency among learners.
Commentary on the survey revealed that the Võro/Seto language shift is mainly driven by a widespread negative attitude towards local languages, favouring the use of Standard Estonian. The shift to Estonian is for the most part a voluntary process, motivated by the speakers’ perceived inferiority (stemming from low prestige and limited domains of use for Võro/Seto) and a wish to provide a better future for their children. Reversing this process without state support is very difficult. Nonetheless, the majority of respondents expressed a need to protect the Võro/Seto language and supported its teaching in schools.
Other Finnic languages and dialects
Kogemuslugudes kirjeldatud kokkupuuted füüsiliselt agressiivsete vaimolenditega
Reet Hiiemäe
This article looks at contemporary accounts of encounters with physically aggressive supernatural beings, set against the backdrop of traditional personal experience narratives (collected up to the first half of the 20th century) describing similar encounters. Analyzing the impact of trends in modern spiritual teachings and global information dissemination, it becomes clear that the main focus of modern spiritual teachings is on protection and self-development, and communication with friendly protective spirits is rather expected, at least among those engaged in esoteric practices. However, the fear of attacks by supernatural beings persists, and some accounts describe experiences of strong physical sensations during contact with aggressive supernatural beings. Such encounters are typically perceived as unexpected and frightening. Often, the experiencer is not an active bearer of the respective narrative tradition and only after an intense personal experience begins to feel the need to interpret what was experienced or seek help. Thus, attempts at explanation (e.g., arguments about guilt and revenge) and strategies to achieve a sense of personal control are central to these personal experience narratives. The analysis shows that the interpretations combine heterogeneous information from family lore, parapsychological literature, media coverage of the respective beliefs, and internet keyword searches.
Other Finnic languages and dialects
Yeni Uygur Türkçesi ve Özbek Türkçesinin Söz Varlığındaki Sovyet Etkisi
Yusuf Ulusoy
Modern Uyghur and Uzbek belong to the South-Eastern (Turkestan/Uyghur) group of Turkic dialects. Regarded as continuations of Chagatai, these dialects later developed their own written languages. The vocabulary of a language is the totality of the words used in it, and it provides important information about the society that speaks it. Through various interactions with another society, a society may borrow words from it into its own vocabulary. In 1924, national borders were drawn in Turkestan. Supposedly separate nations such as the Kazakh, Kyrgyz, Uzbek, Tatar, and Bashkir were created, and an attempt was made to build a Russian culture on top of Turkic culture. As a result of this integrated and intertwined situation, Russian words are seen to have penetrated the Turkic dialects. Russian words also affected Modern Uyghur and Uzbek, which are among these dialects; that is, Russian influence is visible in their vocabulary. In this study, words that entered the vocabulary of Modern Uyghur and Uzbek from Chinese and especially from Russian under the influence of occupying powers are examined. In identifying these words, the dictionaries listed in the bibliography were used: Uyġur Tiliniñ İzahliq Lüġiti, Özbekçe-Türkçe Sözlük, and Uzbek-English Dictionary. The domain headings were determined by selecting the words most frequently used in everyday life.
Active voice in Turkish
Hacı İbrahim
The term "active voice" is controversial in the field of Turkish grammar. "Voice" is
considered as a term in Turkish grammar that expresses the realization of the verb formed with
various suffixes in different aspects depending on the subject and object; however, the term
"active voice" does not correspond to a specific category in grammar; rather, it is treated as
identical to the category expressed by the verb term, leading to an unnecessary naming.
In this study, the ambiguities and contradictions in the definition of the term "voice"
will be discussed. It will be shown that the active voice expresses the same meaning as terms
such as "active verb", "action verb" and "action noun", and that the use of these terms gives the
impression that they are different categories. It is not correct to consider the concept of voice
only as different forms of the verb; it will be emphasized that the actual semantic units that
enable the formation of the sentence should also be taken into consideration. It will be stated
that grammatical confusion has arisen from incomplete and unclear definitions of the term
"voice". The definition and handling of the term "active voice" will also be examined, and it
will be argued that the definition and use of this term cause confusion with other grammatical
concepts. In particular, it will be demonstrated that mixing different concepts such as "active
verb" and "active voice" makes accurate linguistic analysis difficult.
Based on these findings, it will be emphasized that the terms should be used correctly
and consistently in order to eliminate the ambiguity and definition difficulties of the terms in
linguistics.
Language and Literature, Ural-Altaic languages
Analyzing ChatGPT’s Mathematical Deficiencies: Insights and Contributions
Vincent Cheng, Yu Zhang
AMR-TST: Abstract Meaning Representation-based Text Style Transfer
Kaize Shi, Xueyao Sun, Li He
et al.
Abstract Meaning Representation (AMR) is a semantic representation that can enhance natural language generation (NLG) by providing a logical semantic input. In this paper, we propose AMR-TST, an AMR-based text style transfer (TST) technique. AMR-TST converts the source text to an AMR graph and generates the transferred text based on the AMR graph modified by a TST policy named style rewriting. Our method combines both the explainability and diversity of explicit and implicit TST methods. The experiments show that the proposed method achieves state-of-the-art results compared with other baseline models in automatic and human evaluations. The generated transferred text in qualitative evaluation shows that AMR-TST has significant advantages in keeping semantic features and reducing hallucinations. To the best of our knowledge, this work is the first to apply the AMR method focusing on node-level features to the TST task.
14 citations
en
Computer Science
Theonyms of the Highest Pantheon in Folklore Texts of the Altaians
N. Oynotkinova
This article reveals the semantics and origin of theonyms of the highest pantheon of the pagan tradition of the Altaians (Teleuts, Telengits, Altai-Kizhi, Chalkans, Tubalars, and Kumandins) on the basis of folklore material. The relevance of the study is associated with an increased interest in Tengrian religion among the modern peoples of the Altaic language family. The innovative character of the research lies in the application of the ethnolinguistic onomastic approach to the study of theonyms of the Altaians. The main research method is a descriptive method based on the contextual and semantic analysis of theonyms. Particular attention is paid to the internal form and origin of theonyms. The author concludes that the highest pantheon among the Altaians was formed on the basis of the cult of the sky and represents the personification of the heavenly elements, which is akin to many developed mythological systems of antiquity. Teheri ‘Sky’ stands out as the Supreme God belonging to the Upper world. The subject of deification in Altai folklore was the idea of natural elements and cosmic phenomena, which are a kind of force that can affect a person’s life. The personification of various qualities and parts of the sky was reflected in various names of celestial deities, the constituent parts of which contain the lexeme Teheri ‘Sky’. Some theonyms appeared because of the personification of celestial phenomena: thunder, lightning, and hail. The author notes that the process of identifying various deities embodying various attributive qualities and characteristics of the supreme god seems to be a historically later stage in the development of polytheism than the time of the appearance of the idea of the one god Teheri ‘Sky’. 
With the adoption of the ideas of other world religions (Buddhism, Islam, and Christianity) and shamanism with its polytheism in the culture of the Turkic-Mongolian peoples, the status of the deities of the Upper World was ranked, including the status of the supreme god Teheri ‘Sky’, which was replaced by other theonyms.
Language ideologies and beliefs about language in Estonia and Estonian language planning
Liina Lindström, Lydia Risberg, Helen Plado
Throughout Estonian history, the language ideologies prevailing in Europe have had great influence on Estonian language planning. Language planners, in turn, have influenced the views of Estonian society. In this paper we analyse how language ideologies have supported myths and beliefs throughout the history of cultivating Standard Estonian. The privileged status of Standard Estonian (compared to local dialects) strengthened considerably from the early 20th century. Although Estonian language planning has become more tolerant and democratic since the 1980s, a totalitarian understanding of the language still remains in the background. Using foreign words and mixing languages and registers are considered especially objectionable due to the ideology of a small nation, which has to defend itself and its language.
In the 2020s, discussions about language change and the principles of language planning re-emerged. These have been interesting, because language planning wishes to make the “top-down” language norms of Standard Estonian closer to actual language use, while language maintenance experts still see any potential changes as a threat to the Estonian language and even the nation.
Summary. Liina Lindström, Lydia Risberg, Helen Plado: Language ideologies and beliefs about language in Estonia and in Estonian language planning. The language ideologies prevailing in Europe have over time had a great influence on Estonian language planning. Language planners have in turn influenced the views prevailing in Estonian society. In the article we analyse how language ideologies have, since the 19th century, supported myths and beliefs about Standard Estonian. At the beginning of the 20th century the status of the standard language (compared to local dialects) strengthened sharply, and the attitude that there is a "right" and a "wrong" language also deepened. Although Estonian language planning became more lenient and democratic from the 1980s, a totalitarian understanding of language has nevertheless persisted in the background. Foreign words and the mixing of languages and registers are considered especially bad because of the ideology of Estonians as a small nation that has had to, and still must, constantly defend itself and its language.
The new discussions that once again emerged in the 2020s are interesting because language planning wishes to keep norms in line with actual language use, whereas language editors and teachers see possible changes as a threat both to the Estonian language and even to the nation. Thus, language ideologies that have spread over a long period are deeply rooted among Estonians.
Philology. Linguistics, Finnic. Baltic-Finnic
BBVD: A BERT-based Method for Vulnerability Detection
Weichang Huang, Shu-Tyng Lin, Chen Li
Software vulnerability detection is one of the key tasks in the field of software security. Detecting vulnerability in the source code in advance can effectively prevent malicious attacks. Traditional vulnerability detection methods are often ineffective and inefficient when dealing with large amounts of source code. In this paper, we present the BBVD approach, which treats high-level programming languages as another natural language and uses BERT-based models in the natural language processing domain to automate vulnerability detection. Our experimental results on both SARD and Big-Vul datasets demonstrate the good performance of the proposed BBVD in detecting software vulnerability.
Robustification of Multilingual Language Models to Real-world Noise with Robust Contrastive Pretraining
Asa Cooper Stickland, Sailik Sengupta, Jason Krone
et al.
4 citations
en
Computer Science
Black Holes and White Rabbits: Metaphor Identification with Visual Features
Ekaterina Shutova, Douwe Kiela, Jean Maillard
Metaphor is pervasive in our communication, which makes it an important problem for natural language processing (NLP). Numerous approaches to metaphor processing have thus been proposed, all of which relied on linguistic features and textual data to construct their models. Human metaphor comprehension is, however, known to rely on both our linguistic and perceptual experience, and vision can play a particularly important role when metaphorically projecting imagery across domains. In this paper, we present the first metaphor identification method that simultaneously draws knowledge from linguistic and visual data. Our results demonstrate that it outperforms linguistic and visual models in isolation, as well as being competitive with the best-performing metaphor identification methods that rely on hand-crafted knowledge about domains and perception.
203 citations
en
Computer Science