This thesis develops a system for automatically analyzing and improving dynamic programs, such as those that have driven progress in natural language processing and, more generally, computer science for decades. Finding a correct program with the optimal asymptotic runtime can be unintuitive, time-consuming, and error-prone. This thesis aims to automate this laborious process. To this end, we develop an approach based on (1) a high-level, domain-specific language called Dyna for concisely specifying dynamic programs; (2) a general-purpose solver to efficiently execute these programs; (3) a static analysis system that provides type analysis and worst-case time/space complexity analyses; (4) a rich collection of meaning-preserving program transformations, which systematizes the repeated insights of numerous authors when speeding up algorithms in the literature; and (5) a search algorithm for identifying a good sequence of transformations that reduces the runtime complexity, given an initial, correct program. We show that, in practice, automated search, like the mental search performed by human programmers, can find substantial improvements to the initial program. Empirically, we show that many speed-ups described in the NLP literature could have been discovered automatically by our system. We provide a freely available prototype system at https://github.com/timvieira/dyna-pi.
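As a rough illustration of what a meaning-preserving, complexity-reducing transformation can look like, the sketch below factors a double sum using distributivity. It is written in plain Python rather than Dyna, and the function names `total_naive` and `total_factored` are hypothetical; it is a textbook example, not a transformation taken from the thesis.

```python
from typing import Sequence


def total_naive(a: Sequence[int], b: Sequence[int]) -> int:
    """Sum of a[i] * b[j] over all pairs (i, j): O(len(a) * len(b)) time."""
    return sum(x * y for x in a for y in b)


def total_factored(a: Sequence[int], b: Sequence[int]) -> int:
    """Same value after factoring the double sum: O(len(a) + len(b)) time.

    Distributivity gives sum_i sum_j a[i] * b[j] == (sum_i a[i]) * (sum_j b[j]).
    """
    return sum(a) * sum(b)


a = [1, 2, 3]
b = [4, 5]
print(total_naive(a, b), total_factored(a, b))
```

Both functions return the same value on every input, but the second avoids enumerating all pairs. An automated search over transformations would aim to find rewrites of this kind without a human spotting them.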
The East Asia VLBI Network (EAVN) has recently enabled dual-polarization observations at $22$ and $43\,\mathrm{GHz}$. We present the first systematic verification of its polarimetric performance using EAVN observations of M87, 3C 279, 3C 273, and OJ 287, calibrated with the GPCAL pipeline and evaluated against near-contemporaneous VLBA images at comparable frequencies. Most stations show stable polarimetric leakages with amplitudes of $5$-$10\%$ over monthly timescales. While several VERA stations exhibit D-term phase variations between epochs, we attribute these to field-rotator (FR) offsets and demonstrate that phase stability is restored after applying the analytically derived FR corrections. The resulting linear-polarization morphologies and EVPAs broadly agree with the VLBA results within uncertainties; fractional polarization measured by the EAVN tends to be slightly higher near polarization peaks. Although exact one-to-one comparisons are limited by moderate frequency and epoch differences, the combined evidence indicates robust EAVN polarimetric calibration and imaging capabilities at $22$ and $43\,\mathrm{GHz}$. These results support the scientific capability of EAVN polarimetry and lay the groundwork for expanded, higher-fidelity polarimetric studies in East Asia.
This paper evaluates the performance of transformer-based language models on split-ergative case alignment in Georgian, a particularly rare system for assigning grammatical cases to mark argument roles. We focus on subject and object marking determined through various permutations of nominative, ergative, and dative noun forms. We implement a treebank-based approach for generating minimal pairs using the Grew query language. We create a dataset of 370 syntactic tests made up of seven tasks containing 50-70 samples each, where three noun forms are tested in any given sample. Five encoder-only and two decoder-only models are evaluated with word- and/or sentence-level accuracy metrics. Regardless of the specific syntactic makeup, models performed worst in assigning the ergative case correctly and best in assigning the nominative case correctly. Performance correlated with the overall frequency distribution of the three forms (NOM > DAT > ERG). Though data scarcity is a known issue for low-resource languages, we show that the highly specific role of the ergative, along with a lack of available training data, likely contributes to poor performance on this case. The dataset is made publicly available, and the methodology provides an avenue for future syntactic evaluations of languages where benchmarks are limited.
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over state-of-the-art interpretability methods, especially for morphologically rich, low-resource languages. Our results provide insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
Purpose: This study compares the hierarchical structure of scientific teams across countries and investigates factors associated with the observed cross-national differences. Design/methodology/approach: Drawing on 150,817 publications with author contribution statements, we focus on the 15 countries with the largest volume of scientific publications, examine cross-country variations in the proportion of tall teams, and analyze how this proportion correlates with other factors. Findings: Scientific output from East Asia is dominated by tall teams, which persist after controlling for team size, indicating that this pattern cannot be fully accounted for by the prevalence of larger teams in these countries. Cultural factors, measured by Power Distance, as well as the observed funding patterns of major basic science agencies, are associated with the dominance of tall teams in East Asia. Research limitations: This study is limited by its reliance on publications with author contribution statements, which may introduce selection bias; its focus on cultural and funding factors, while leaving other institutional contexts unexamined; and its use of a leadership concentration measure that does not capture other dimensions of hierarchy. Practical implications: Understanding cross-national differences in research team structures and their associated cultural and institutional factors can inform science policy and team management. Originality/value: This study provides a systematic cross-national comparison of team hierarchy and offers a mechanistic understanding of the dominance of tall teams in East Asia, highlighting associations with cultural and funding factors.
Language learning refers to the problem of inferring a mathematical model that accurately represents a formal language. Many language learning algorithms learn by asking certain types of queries about the language being modeled. Language learning is of practical interest in the field of cybersecurity, where it is used to model the language accepted by a program's input parser (also known as its input processor). In this setting, a learner can only query a string of its choice by executing the parser on it, which limits the language learning algorithms that can be used. Most practical parsers can indicate not only whether a string is valid, but also where parsing failed. This extra information can be leveraged to produce a type of query we call the prefix query. Notably, no existing language learning algorithms make use of prefix queries, though some ask membership queries, i.e., they ask whether or not a given string is valid. When these approaches are used to learn the language of a parser, the prefix information provided by the parser remains unused. In this work, we present PL*, the first known language learning algorithm to make use of the prefix query, and a novel modification of the classical L* algorithm. We show both theoretically and empirically that PL* is able to learn more efficiently than L* due to its ability to exploit the additional information given by prefix queries over membership queries. Furthermore, we show how PL* can be used to learn the language of a parser by adapting it to a more practical setting in which prefix queries are the only source of information available to it; that is, it does not have access to any labelled examples or any other types of queries. We demonstrate empirically that, even in this more constrained setting, PL* is still capable of accurately learning a range of languages of practical interest.
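The difference between a membership query and a prefix query can be sketched with a toy parser for balanced parentheses. This is an illustration only: the return format of `prefix_query` (an accept flag plus the length of the longest prefix that is still extendable to a valid string) is an assumption made here, and the precise definition used by PL* may differ.

```python
from typing import Tuple


def membership_query(s: str) -> bool:
    """Answer only whether s is a balanced string of parentheses."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        else:
            return False
    return depth == 0


def prefix_query(s: str) -> Tuple[bool, int]:
    """Answer whether s is accepted and how far the parser got.

    The second component is the length of the longest prefix of s that can
    still be extended to a balanced string (an assumed, illustrative format).
    """
    depth = 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        else:
            return False, i  # parsing fails at position i
    return depth == 0, len(s)


for s in ["()", "(()", "())("]:
    print(repr(s), membership_query(s), prefix_query(s))
```

Both `"(()"` and `"())("` are rejected by the membership query, but the prefix query distinguishes them: the first is a viable prefix in full, while the second fails at position 2. A single prefix query therefore carries information that would otherwise take several membership queries to recover.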
Peephole optimization is an essential class of compiler optimizations that targets small, inefficient instruction sequences within programs. By replacing such suboptimal instructions with more efficient sequences, these optimizations not only directly improve code size and performance, but also enable more transformations in the subsequent optimization pipeline. Despite their importance, discovering new and effective peephole optimizations remains challenging due to the complexity and breadth of instruction sets. Prior approaches either lack scalability or have significant restrictions on the peephole optimizations that they can find. This paper introduces LPO, a novel automated framework to discover missed peephole optimizations. Our key insight is that Large Language Models (LLMs) are effective at creative exploration but susceptible to hallucinations; conversely, formal verification techniques provide rigorous guarantees but struggle with creative discovery. By combining the strengths of LLMs and formal verifiers in a closed-loop feedback mechanism, LPO can effectively discover verified peephole optimizations that were previously missed. We comprehensively evaluated LPO within the LLVM ecosystem. Our evaluation shows that LPO can successfully identify up to 22 out of 25 previously reported missed optimizations in LLVM. In contrast, the recently proposed superoptimizers for LLVM, Souper and Minotaur, detected 15 and 3 of them, respectively. More importantly, within eleven months of development and intermittent testing, LPO found 62 missed peephole optimizations, of which 28 were confirmed and an additional 13 had already been fixed in LLVM. These results demonstrate LPO's strong potential to continuously uncover new optimizations as LLMs' reasoning improves.
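The propose-then-verify idea can be sketched in a few lines. Everything below is a simplified stand-in and not LPO's implementation: the `proposals` table is a hypothetical substitute for LLM output, and exhaustive enumeration over 8-bit values replaces a real formal verifier.

```python
from typing import Callable, Dict, Optional, Tuple

MASK = 0xFF  # model 8-bit unsigned arithmetic (an assumption for this sketch)

Rewrite = Tuple[Callable[[int], int], Callable[[int], int]]


def find_counterexample(original: Callable[[int], int],
                        candidate: Callable[[int], int]) -> Optional[int]:
    """Return an input where the rewrite changes the result, or None if none exists."""
    for x in range(256):
        if original(x) & MASK != candidate(x) & MASK:
            return x
    return None


# Hypothetical candidate rewrites, standing in for proposals from a language model.
proposals: Dict[str, Rewrite] = {
    "x * 2 -> x << 1": (lambda x: x * 2, lambda x: x << 1),
    "x | 1 -> x + 1": (lambda x: x | 1, lambda x: x + 1),
    "(x ^ 0xFF) + 1 -> 0 - x": (lambda x: (x ^ 0xFF) + 1, lambda x: 0 - x),
}

# A counterexample could be fed back to the proposer; None means the rewrite
# holds for all 8-bit inputs.
results = {name: find_counterexample(o, c) for name, (o, c) in proposals.items()}
for name, cex in results.items():
    print(name, "verified" if cex is None else f"rejected (x = {cex})")
```

The second proposal looks plausible but fails for odd inputs, which is the kind of hallucinated rewrite a verifier is meant to filter out.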
This article examines the strategy of loan translation of English terminology into Kiswahili and its effects, focusing on the fields of biology, physics, and chemistry through selected entries from the Kamusi ya Biolojia, Fizikia na Kemia (2012). The main aim of the investigation is to assess the contribution of loan translation of English vocabulary to the overall development of the Kiswahili language. According to this article, this area has not been given priority among the various translation topics that have already been studied, which motivated the researcher to address it. The data were collected in the field, in departments and institutions located in the city of Dar es Salaam, through interviews, questionnaires, and library sources. The departments and institutions involved in data collection were the Department of Foreign Languages and Linguistics, TATAKI of the University of Dar es Salaam, and BAKIZA. These departments and institutions are located in Ubungo District, Dar es Salaam Region, where the researcher administered questionnaires to translation experts and conducted face-to-face interviews with them. In addition, data were collected from various texts in library sources, for example dictionaries, theses, journals, and several books relevant to the topic of the article. The objectives of the article were pursued under the guidance of the Theory of Scientific Terminology, founded by Kiingi (1989) and developed further by Kiingi (1992), Kiingi (1998), and Mwaro-Were (2000, 2001). The data were analyzed using a descriptive approach. The findings show that Kiswahili has acquired a great many terms in biology, physics, and chemistry through the loan translation strategy. The article is important for linguists because it identifies various English terms in the fields of biology, physics, and chemistry that have entered Kiswahili through loan translation. It is recommended that language researchers investigate other borrowing methods that develop Kiswahili, such as adaptation (utohozi) and nativization (uasilishaji).
Thai reflexives have two forms: a bare form tuaʔeŋ ‘self’ and a compound form tuakhǎwʔeŋ ‘himself, herself’. They can be bound by either local or long-distance antecedents. This study investigates the extent to which the morphological form of the reflexive, locality constraints, and pragmatic factors determine the selection of antecedents for Thai reflexives. Results from a binary forced-choice judgment task show that contextual information is a crucial factor guiding the selection. When contextual information is insufficient (i.e., the context is unbiased/neutral), locality constraints are likely to be obeyed. The form of the reflexive, bare or compound, does not seem to influence the binding options of Thai reflexives, except in the long-distance biased context, where the compound form tuakhǎwʔeŋ is preferred. This study complements previous studies on Thai reflexives by addressing the role of context in the binding options for various forms of reflexives.
Languages and literature of Eastern Asia, Africa, Oceania
Namkyeong Lee, Siddhartha Laghuvarapu, Chanyoung Park
et al.
Recently, there has been growing interest among researchers in understanding molecules and their textual descriptions through molecule language models (MoLM). However, despite some promising early developments, the advancement of MoLM still trails significantly behind that of vision language models (VLM). This is because MoLM faces challenges that VLM does not: 1) a limited amount of molecule-text paired data, and 2) missing expertise, which arises because individual experts focus on their own specialized areas. To this end, we propose AMOLE, which 1) augments molecule-text pairs with a structural similarity preserving loss, and 2) transfers expertise between molecules. Specifically, AMOLE enriches molecule-text pairs by sharing descriptions among structurally similar molecules with a novel structural similarity preserving loss. Moreover, we propose an expertise reconstruction loss to transfer knowledge from molecules that have extensive expertise to those with less expertise. Extensive experiments on various downstream tasks demonstrate the superiority of AMOLE in comprehending molecules and their descriptions, highlighting its potential for application in real-world drug discovery. The source code for AMOLE is available at https://github.com/Namkyeong/AMOLE.
With growing application of machine learning (ML) technologies in healthcare, there have been calls for developing techniques to understand and mitigate biases these systems may exhibit. Fairness considerations in the development of ML-based solutions for health have particular implications for Africa, which already faces inequitable power imbalances between the Global North and South. This paper seeks to explore fairness for global health, with Africa as a case study. We conduct a scoping review to propose axes of disparities for fairness consideration in the African context and delineate where they may come into play in different ML-enabled medical modalities. We then conduct qualitative research studies with 672 general population study participants and 28 experts in ML, health, and policy focused on Africa to obtain corroborative evidence on the proposed axes of disparities. Our analysis focuses on colonialism as the attribute of interest and examines the interplay between artificial intelligence (AI), health, and colonialism. Among the pre-identified attributes, we found that colonial history, country of origin, and national income level were specific axes of disparities that participants believed would cause an AI system to be biased. However, there was also divergence of opinion between experts and general population participants. Whereas experts generally expressed a shared view about the relevance of colonial history for the development and implementation of AI technologies in Africa, the majority of the general population participants surveyed did not think there was a direct link between AI and colonialism. Based on these findings, we provide practical recommendations for developing fairness-aware ML solutions for health in Africa.
Masudul Hasan Masud Bhuiyan, Berk Çakar, Ethan H. Burmane
et al.
Regular Expression Denial of Service (ReDoS) is a vulnerability class that has become prominent in recent years. Attackers can weaponize such weaknesses as part of asymmetric cyberattacks that exploit the slow worst-case matching time of regular expression (regex) engines. In the past, problematic regexes have led to outages at Cloudflare and Stack Overflow, showing the severity of the problem. While ReDoS has drawn significant research attention, there has been no systematization of knowledge to delineate the state of the art and identify opportunities for further research. In this paper, we describe the existing knowledge on ReDoS. We first provide a systematic literature review, discussing approaches for detecting, preventing, and mitigating ReDoS vulnerabilities. Then, our engineering review surveys the latest regex engines to examine whether and how ReDoS defenses have been realized. Combining our findings, we observe that (1) in the literature, almost no studies evaluate whether and how ReDoS vulnerabilities can be weaponized against real systems, making it difficult to assess their real-world impact; and (2) from an engineering view, many mainstream regex engines have introduced partial or full ReDoS defenses, rendering many threat models obsolete. We conclude by highlighting avenues for future work. The open challenges in ReDoS research are to evaluate emerging defenses and support engineers in migrating to defended engines. We also highlight the parallel between performance bugs and asymmetric DoS, and we argue that future work should capitalize more on this similarity and adopt a more systematic view on ReDoS-like vulnerabilities.
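The "slow worst-case matching time" mentioned above can be made concrete with a small model of naive backtracking. The sketch below counts the attempts a simplified backtracking matcher makes for the pattern `(a+)+$` on the input `'a' * n + '!'`. It is a hand-written model with a hypothetical function name, not the algorithm of any particular regex engine.

```python
def backtrack_steps(n: int) -> int:
    """Count attempts of a naive backtracking matcher for (a+)+$ on 'a'*n + '!'."""
    text = "a" * n + "!"
    steps = 0

    def match_groups(pos: int) -> bool:
        nonlocal steps
        # Find how far a run of 'a' extends from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Try each non-empty run as one iteration of the inner a+, longest first.
        for stop in range(end, pos, -1):
            steps += 1
            if stop == len(text):  # the $ anchor is satisfied
                return True
            if match_groups(stop):  # otherwise try another iteration of the group
                return True
        return False

    match_groups(0)
    return steps


for n in (4, 8, 16):
    print(n, backtrack_steps(n))
```

Because the trailing `!` prevents any match, the matcher explores every way of splitting the run of `a` characters into groups, which gives `2**n - 1` attempts in this model. Each additional input character doubles the work, which is what makes the attack asymmetric.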
Languages can encode temporal subordination lexically, via subordinating conjunctions, and morphologically, by marking the relation on the predicate. Systematic cross-linguistic variation among the former can be studied using well-established token-based typological approaches to token-aligned parallel corpora. Variation among different morphological means is much harder to tackle and therefore more poorly understood, despite being predominant in several language groups. This paper explores variation in the expression of generic temporal subordination ('when'-clauses) among the languages of Latin America and the Caribbean, where morphological marking is particularly common. It presents probabilistic semantic maps computed on the basis of the languages of the region, thus avoiding bias towards the many languages of the world that exclusively use lexified connectors, and incorporating associations between character $n$-grams and English 'when'. The approach captures morphological clause-linkage devices in addition to lexified connectors, paving the way for larger-scale, strategy-agnostic analyses of typological variation in temporal subordination.
In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes. A common method for ZSC is to fine-tune a language model on a Natural Language Inference (NLI) dataset and then use it to infer the entailment between the input document and the target labels. However, this approach faces certain challenges, particularly for languages with limited resources. In this paper, we propose an alternative solution that leverages dictionaries as a source of data for ZSC. We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets based on a dictionary that provides various synonyms, word translations and example sentences. We evaluate the usability of our dataset and compare it with the NLI-based approach on two topic classification tasks in a zero-shot manner. Our results show that by using the dictionary-based dataset, the trained models outperform the ones following the NLI-based approach for ZSC. While we focus on a single low-resource language in this study, we believe that the efficacy of our approach can also transfer to other languages where such a dictionary is available.
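The NLI-based procedure described above can be sketched as follows. The scoring function is passed in as a parameter; `toy_entailment_score` is a hypothetical word-overlap stand-in so that the example runs without a model, and the hypothesis template is likewise an assumption rather than the one used in the paper. In practice the scorer would be a language model fine-tuned on an NLI dataset.

```python
from typing import Callable, Sequence


def zero_shot_classify(document: str,
                       labels: Sequence[str],
                       entailment_score: Callable[[str, str], float],
                       template: str = "This text is about {}.") -> str:
    """Pick the label whose hypothesis is most strongly entailed by the document."""
    scores = {
        label: entailment_score(document, template.format(label))
        for label in labels
    }
    return max(scores, key=scores.get)


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """HYPOTHETICAL stand-in for an NLI model: fraction of hypothesis words
    that also occur in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = [w.strip(".") for w in hypothesis.lower().split()]
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


doc = "the match ended after extra time and football fans celebrated"
print(zero_shot_classify(doc, ["football", "politics"], toy_entailment_score))
```

Keeping the scorer as a parameter separates the classification procedure from the model behind it, so an NLI-trained model and a model trained on dictionary-derived data could be compared under the same decision rule.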
Muhammad Zubair Khan, Oleg E. Peil, Apoorva Sharma
et al.
In the rapidly expanding field of two-dimensional materials, magnetic monolayers show great promise for future applications in nanoelectronics, data storage, and sensing. Research on intrinsically magnetic two-dimensional materials mainly focuses on synthetic iodide- and telluride-based compounds, which inherently suffer from a lack of ambient stability. So far, naturally occurring layered magnetic materials have been largely overlooked. These minerals offer a unique opportunity to explore air-stable complex layered systems with a high concentration of local-moment-bearing ions. We demonstrate magnetic ordering in iron-rich two-dimensional phyllosilicates, focusing on the mineral species minnesotaite, annite, and biotite. These are naturally occurring van der Waals magnetic materials that integrate local-moment-bearing ions of iron via magnesium/aluminium substitution in their octahedral sites. Due to inherent capping by silicate/aluminate tetrahedral groups, ultra-thin layers are air-stable. Chemical characterization, quantitative elemental analysis, and determination of iron oxidation states were carried out via Raman spectroscopy, wavelength-dispersive X-ray spectroscopy, X-ray absorption spectroscopy, and X-ray photoelectron spectroscopy. Superconducting quantum interference device magnetometry measurements were performed to examine the magnetic ordering. These layered materials exhibit paramagnetic or superparamagnetic characteristics at room temperature. At low temperature, ferrimagnetic or antiferromagnetic ordering occurs, with critical ordering temperatures of 38.7 K for minnesotaite, 36.1 K for annite, and 4.9 K for biotite. In-field magnetic force microscopy on iron-bearing phyllosilicates confirmed the paramagnetic response at room temperature, which persists down to monolayers.
Gbatsoron Anjande, Simeon T Asom, Ngutsav Ayila
et al.
This study examines the roles of government spending and money supply in alleviating poverty in Africa. The study used 48 Sub-Saharan African countries from 2001 to 2017. The study employed one-step and two-step system GMM and found that both procedures give similar results. Different specifications were employed, and the model selected was robust, with valid instruments and no second-order autocorrelation. The study revealed that government spending and foreign direct investment are significantly and negatively associated with poverty, while money supply is positively associated with the level of poverty in the region. The implication of this finding is that the monetary policy tool of money supply has no strong influence in combating poverty. The study therefore recommends that emphasis be placed on increasing government spending that improves the quality of life of people in the region through the multiplier effect, on improving the financial system for effective monetary policy, and on attracting foreign direct investment inflows through an enabling business environment in Africa.
At the current stage, the European Union and Japan are strategic partners, developing comprehensive bilateral interaction in nearly all aspects of international relations. This research analyzes the humanitarian areas of their partnership. The paper discusses EU-Japan interaction in healthcare, education, science, culture, and environmental protection, including such aspects as bilateral dialogues, people-to-people exchanges, joint education programs, joint research projects, and other initiatives. The authors conclude that the humanitarian track of the relations has developed considerably over the last few years. The most intense cooperation can be seen in science, which is driven by the economic needs of both partners and their desire to develop strategic industries. Cooperation in healthcare comes second in terms of intensity, driven by the pandemic and other common threats. Education and culture ties show fewer joint projects. However, implementation of the first region-to-country agreements between the EU and Japan has begun in these areas, too. Finally, environmental protection as an area of bilateral cooperation is mainly represented by the coordination of the parties' international policies. The research also demonstrates that the EU is the initiator and the more active party in the cooperation. Meanwhile, Japan accepts the initiatives of its partner and can use the situation to advance its own interests. At the same time, the cooperation is potentially beneficial for both partners, thanks to their comparable levels of economic development and common interests.