OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data
Tan Sang Nguyen, Muhammad Reza Qorib, Hwee Tou Ng
Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.
Determination of new national highpoints of five African and Asian countries, Saudi Arabia, Uzbekistan, Gambia, Guinea-Bissau, and Togo
Eric Gilbertson, Matthew Gilbertson
Not all nations on earth have previously been surveyed accurately enough to know for certain which peak is the national highpoint, the highest peak in the country. Knowledge of these peaks is important for understanding the physical geography of these countries in terms of natural resource availability, watershed management, and tourism potential. For this study, ground surveys were conducted between 2018 and 2025 with modern professional surveying equipment, including differential GPS units and Abney levels, to accurately determine the national highpoints in five African and Asian countries where uncertainty existed. New national highpoints were determined for Saudi Arabia (Jabal Ferwa), Uzbekistan (Alpomish), Gambia (Sare Firasu Hill), Guinea-Bissau (Mt Ronde), and Togo (Mt Atilakoutse). Elevations were measured with sub-meter vertical accuracy for candidate peaks in Saudi Arabia, Gambia, Guinea-Bissau, and Togo. Relative elevations were measured between contender peaks in Uzbekistan with sufficient accuracy to determine the highpoint.
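The abstract does not describe how relative elevations were reduced from the field readings. As a rough illustration only, the textbook relation for a sighting with an inclinometer such as an Abney level can be sketched as below. The function name, the refraction coefficient of 0.13, and the mean Earth radius are assumptions for this sketch, not values taken from the study.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)
REFRACTION_K = 0.13           # commonly assumed refraction coefficient (assumption)


def relative_elevation(horizontal_distance_m, inclination_deg, k=REFRACTION_K):
    """Height of a sighted target above the observer's eye, in metres.

    Combines the trigonometric term d * tan(theta) with the standard
    combined curvature-and-refraction correction (1 - k) * d^2 / (2R).
    """
    trig_term = horizontal_distance_m * math.tan(math.radians(inclination_deg))
    correction = (1.0 - k) * horizontal_distance_m ** 2 / (2.0 * EARTH_RADIUS_M)
    return trig_term + correction


# Hypothetical sighting: a peak 5 km away seen 0.4 degrees above level.
delta_h = relative_elevation(5_000.0, 0.4)
print(f"target is {delta_h:.2f} m above the observer")
```

A positive result means the sighted peak is higher than the observer's position. Over distances of several kilometres the correction term amounts to a metre or more, so it cannot be neglected when two contender peaks differ only slightly in height.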
Cultures of Expertise in the Eastern Mediterranean
Barbara Henning, Taisiya Leber, Ani Sargsyan
Indo-Iranian languages and literature, Literature (General)
The Jade Gateway to Exergaming: How Socio-Cultural Factors Shape Exergaming Among East Asian Older Adults
Reza Hadi Mogavi, Juhyung Son, Simin Yang
et al.
Exergaming, blending exercise and gaming, improves the physical and mental health of older adults. We currently do not fully know the factors that drive older adults to either engage in or abstain from exergaming. Large-scale studies investigating this are still scarce, particularly those studying East Asian older adults. To address this, we interviewed 64 older adults from China, Japan, and South Korea about their attitudes toward exergames. Most participants viewed exergames with a positive inquisitiveness. However, socio-cultural factors can obstruct this curiosity. Our study shows that perceptions of aging, lifestyle, the presence of support networks, and the cultural relevance of game mechanics are the crucial factors influencing their exergame engagement. Thus, we stress the value of socio-cultural sensitivity in game design and urge the HCI community to adopt more diverse design practices. We provide several design suggestions for creating more culturally approachable exergames.
Understanding Clinical Decision-Making in Traditional East Asian Medicine through Dimensionality Reduction: An Empirical Investigation
Hyojin Bae, Bongsu Kang, Chang-Eop Kim
This study examines the clinical decision-making processes in Traditional East Asian Medicine (TEAM) by reinterpreting pattern identification (PI) through the lens of dimensionality reduction. Focusing on the Eight Principle Pattern Identification (EPPI) system and utilizing empirical data from the Shang-Han-Lun, we explore the necessity and significance of prioritizing the Exterior-Interior pattern in diagnosis and treatment selection. We test three hypotheses: whether the Ext-Int pattern contains the most information about patient symptoms, represents the most abstract and generalizable symptom information, and facilitates the selection of appropriate herbal prescriptions. Employing quantitative measures such as the abstraction index, cross-conditional generalization performance, and decision tree regression, our results demonstrate that the Exterior-Interior pattern represents the most abstract and generalizable symptom information, contributing to the efficient mapping between symptom and herbal prescription spaces. This research provides an objective framework for understanding the cognitive processes underlying TEAM, bridging traditional medical practices with modern computational approaches. The findings offer insights into the development of AI-driven diagnostic tools in TEAM and conventional medicine, with the potential to advance clinical practice, education, and research.
Esports Debut as a Medal Event at 2023 Asian Games: Exploring Public Perceptions with BERTopic and GPT-4 Topic Fine-Tuning
Tyreal Yizhou Qian, Bo Yu, Weizhe Li
et al.
This study examined the public opinions of esports at the 2023 Asian Games and value co-creation during the event using an LLM-enhanced BERTopic modeling analysis. We identified five major themes representing public perceptions, as well as how major stakeholders co-created value within and beyond the esports ecosystem. Key findings highlighted the strategic use of social media marketing to influence public opinion and promote esports events and brands, emphasizing the importance of event logistics and infrastructure. Additionally, the study revealed the co-creation value contributed by stakeholders outside the traditional esports ecosystem, particularly in promoting national representation and performance. Our findings supported the ongoing efforts to legitimize esports as a sport, noting that mainstream recognition remains a challenge. The inclusion of esports as a medal event showcased broader acceptance and helped mitigate negative public perceptions. Moreover, contributions from non-traditional stakeholders underscored the value of cross-subcultural collaborations in esports.
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang, Hou Pong Chan, Yiran Zhao
et al.
Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.
Persian in the Lands of Rum: Texts, Translations and Courtly Patronage
A.C.S. Peacock
Persian texts composed in Anatolia both in the pre-Ottoman and Ottoman periods have received scant scholarly attention, and Persian remains perceived as an alien language. This article presents an overview of Persian in Anatolia and the Ottoman empire from the twelfth to nineteenth centuries. It argues that in the medieval period Persian texts were more widely disseminated than is often assumed, and goes on to analyse the changing fate of Persian over the fifteenth and sixteenth centuries, and the reasons for its eclipse by Turkish. Finally, it presents manuscript evidence to suggest that even in the later Ottoman period, Persian was far from being restricted to a source of literary models.
Indo-Iranian languages and literature, Literature (General)
Another Day at the Airport
Tenzing Wangdak
Traveling is a complicated endeavor. For some, it is as simple as picking up one’s passport and heading to the nearest airport. For others, it is a harrowing experience, one marked by uncertainty. For tourists, traveling represents a temporary detachment from the mundanity of everyday life. For many migrants, it is an incredibly significant decision that can define their very futures. Drawing from a personal encounter at the airport, this flash ethnographic piece aims to illustrate the dichotomy of traveling: ease and pleasure for some, and the uncertainty of refusal for others.
Asian. Oriental, History of Asia
Quantifying gender imbalance in East Asian academia: Research career and citation practice
Kazuki Nakajima, Ruodan Liu, Kazuyuki Shudo
et al.
Gender imbalance in academia has been confirmed in terms of a variety of indicators, and its magnitude often varies from country to country. Europe and North America, which cover a large fraction of the world's research workforce, have been the main geographical regions for research on gender imbalance in academia. However, academia in East Asia, which accounts for a substantial fraction of research, may be exposed to strong gender imbalance because Asia has been facing a persistent and stronger gender imbalance in society at large than Europe and North America. Here we use publication data from 1950 to 2020 to analyze gender imbalance in academia in China, Japan, and South Korea in terms of the number of researchers, their careers, and citation practice. We found that, compared to the average of the other countries, gender imbalance is larger in these three East Asian countries in terms of the number of researchers and their citation practice and additionally in Japan in terms of research career. Moreover, we found that Japan has been exposed to a larger gender imbalance than China and South Korea in terms of research career and citation practice.
Monitoring the Size and Flux Density of Sgr A* during the Active State in 2019 with East Asian VLBI Network
Xiaopeng Cheng, Ilje Cho, Tomohisa Kawashima
et al.
In this work, we studied the Galactic Center supermassive black hole (SMBH), Sagittarius A* (Sgr A*), with the KVN and VERA Array (KaVA)/East Asian VLBI Network (EAVN) monitoring observations. In particular, Sgr A* experienced an unprecedentedly bright near-infrared (NIR) flare on 13 May 2019, and we find a possible counterpart at 43 GHz (7 mm). As a result, a large temporal variation of the flux density at the level of 15.4%, with the highest flux density of 2.04 Jy, is found on 11 May 2019. Interestingly, the intrinsic sizes are also variable, and the area and major-axis size show a marginal correlation with flux density at >2σ. Thus, we interpret that the emission region at 43 GHz follows the larger-when-brighter relation in 2019. The possible origins are discussed in terms of the emergence of a weak jet/outflow component and a change over time in the position angle of the rotation axis of the accretion disk.
A Semi-automatic Oriental Ink Painting Framework for Robotic Drawing from 3D Models
Hao Jin, Minghui Lian, Shicheng Qiu
et al.
Creating visually pleasing stylized ink paintings from 3D models is a challenge in robotic manipulation. We propose a semi-automatic framework that can extract expressive strokes from 3D models and draw them in oriental ink painting styles by using a robotic arm. The framework consists of a simulation stage and a robotic drawing stage. In the simulation stage, geometrical contours were automatically extracted from a certain viewpoint and a neural network was employed to create simplified contours. Then, expressive digital strokes were generated after interactive editing according to the user's aesthetic understanding. In the robotic drawing stage, an optimization method was presented for drawing smooth and physically consistent strokes corresponding to the digital strokes, and two oriental ink painting styles termed Noutan (shade) and Kasure (scratchiness) were applied to the strokes by robotic control of a brush's translation, dipping and scraping. Unlike existing methods that concentrate on generating paintings from 2D images, our framework has the advantage of rendering stylized ink paintings from 3D models by using a consumer-grade robotic arm. We evaluate the proposed framework by taking 3 standard models and a user-defined model as examples. The results show that our framework is able to draw visually pleasing oriental ink paintings with expressive strokes.
BHASA: A Holistic Southeast Asian Linguistic and Cultural Evaluation Suite for Large Language Models
Wei Qi Leong, Jian Gang Ngui, Yosephine Susanto
et al.
The rapid development of Large Language Models (LLMs) and the emergence of novel abilities with scale have necessitated the construction of holistic, diverse and challenging benchmarks such as HELM and BIG-bench. However, at the moment, most of these benchmarks focus only on performance in English and evaluations that include Southeast Asian (SEA) languages are few in number. We therefore propose BHASA, a holistic linguistic and cultural evaluation suite for LLMs in SEA languages. It comprises three components: (1) an NLP benchmark covering eight tasks across Natural Language Understanding (NLU), Generation (NLG) and Reasoning (NLR) tasks, (2) LINDSEA, a linguistic diagnostic toolkit that spans the gamut of linguistic phenomena including syntax, semantics and pragmatics, and (3) a cultural diagnostics dataset that probes for both cultural representation and sensitivity. For this preliminary effort, we implement the NLP benchmark only for Indonesian, Vietnamese, Thai and Tamil, and we only include Indonesian and Tamil for LINDSEA and the cultural diagnostics dataset. As GPT-4 is purportedly one of the best-performing multilingual LLMs at the moment, we use it as a yardstick to gauge the capabilities of LLMs in the context of SEA languages. Our initial experiments on GPT-4 with BHASA find it lacking in various aspects of linguistic capabilities, cultural representation and sensitivity in the targeted SEA languages. BHASA is a work in progress and will continue to be improved and expanded in the future. The repository for this paper can be found at: https://github.com/aisingapore/BHASA
"What counts as Gaddi?"
Kriti Kapila
The afterword reflects on the generations of anthropological scholarship on the Gaddis of the Western Himalaya, and its import for anthropology itself. It takes up Roy Wagner's productive concept of "strategic relic" to understand the changing place of pastoralism and the enduring ideal of egalitarianism in the Gaddi worldview.
Asian. Oriental, History of Asia
EACELEB: An East Asian Language Speaking Celebrity Dataset for Speaker Recognition
Desmond Caulley, Yufeng Yang, David Anderson
Large datasets are very useful for training speaker recognition systems, and various research groups have constructed several over the years. VoxCeleb is a large dataset for speaker recognition that is extracted from YouTube videos. This paper presents an audio-visual method for acquiring audio data from YouTube given the speaker's name as input. The system follows a pipeline similar to that of the VoxCeleb data acquisition method. However, our work focuses on fast data acquisition by using face-tracking in subsequent frames once a face has been detected -- this is preferable over face detection for every frame considering its computational cost. We show that applying audio diarization to our data after acquiring it can yield equal error rates comparable to VoxCeleb. A secondary set of experiments showed that we could further decrease the error rate by fine-tuning a pre-trained x-vector system with the acquired data. Like VoxCeleb, the work here focuses primarily on developing audio for celebrities. However, unlike VoxCeleb, our target audio data is from celebrities in East Asian countries. Finally, we set up a speaker verification task to evaluate the accuracy of our acquired data. After diarization and fine-tuning, we achieved an equal error rate of approximately 4% across our entire dataset.
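For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false-acceptance rate equals the false-rejection rate. The following is a minimal sketch of how it can be estimated from verification scores, assuming that a higher score means "same speaker". The function name and the toy scores are illustrative assumptions and are not drawn from the paper.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Estimate the EER by sweeping every observed score as a threshold.

    Returns (eer, threshold), where eer is the mean of the false-acceptance
    and false-rejection rates at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor trial scored at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores (hypothetical): four same-speaker and four different-speaker trials.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

With only a handful of trials the two error curves rarely cross exactly, which is why the sketch reports the midpoint at the closest threshold. Evaluation toolkits typically interpolate between thresholds instead.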
EmoMent: An Emotion Annotated Mental Health Corpus from two South Asian Countries
Thushari Atapattu, Mahen Herath, Charitha Elvitigala
et al.
People often utilise online media (e.g., Facebook, Reddit) as a platform to express their psychological distress and seek support. State-of-the-art NLP techniques demonstrate strong potential to automatically detect mental health issues from text. Research suggests that mental health issues are reflected in emotions (e.g., sadness) indicated in a person's choice of language. Therefore, we developed a novel emotion-annotated mental health corpus (EmoMent), consisting of 2802 Facebook posts (14845 sentences) extracted from two South Asian countries - Sri Lanka and India. Three clinical psychology postgraduates were involved in annotating these posts into eight categories, including 'mental illness' (e.g., depression) and emotions (e.g., 'sadness', 'anger'). EmoMent corpus achieved 'very good' inter-annotator agreement of 98.3% (i.e. % with two or more agreement) and Fleiss' Kappa of 0.82. Our RoBERTa based models achieved an F1 score of 0.76 and a macro-averaged F1 score of 0.77 for the first task (i.e. predicting a mental health condition from a post) and the second task (i.e. extent of association of relevant posts with the categories defined in our taxonomy), respectively.
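The agreement statistic reported above, Fleiss' kappa, corrects the observed agreement among a fixed number of raters for the agreement expected by chance. Below is a minimal sketch of the standard formula. The function name and the toy count matrix are assumptions for illustration and do not reproduce the EmoMent annotations.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number
    of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(p_item) / n_items
    expected = sum(p * p for p in p_cat)
    return (observed - expected) / (1.0 - expected)


# Toy example (hypothetical): 4 posts, 3 annotators, 2 categories.
toy_counts = [[3, 0], [0, 3], [2, 1], [0, 3]]
print(round(fleiss_kappa(toy_counts), 3))
```

A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance.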
Modelling the effect of aerosol and greenhouse gas forcing on the South and East Asian monsoons with an intermediate complexity climate model
Lucy Recchia, Valerio Lucarini
The South and East Asian summer monsoons are globally significant meteorological features, creating a strongly seasonal pattern of precipitation. The stability of the monsoon is of extreme importance for a vast range of ecosystems and for the livelihoods of a large share of the world's population. Simulations are performed with an intermediate complexity climate model, PLASIM, to assess the future response of the monsoons to changing concentrations of aerosols and greenhouse gases. The aerosol loading consists of a mid-tropospheric warming and a surface cooling, which is applied to India, Southeast Asia and East China, both concurrently and independently. The primary effect of increased aerosol loading is a decrease in summer precipitation in the vicinity of the applied forcing, although the regional response varies significantly. The decrease in precipitation is only partially ascribable to a decrease in the precipitable water, and instead derives from a reduction of the precipitation efficiency, due to changes in the stratification of the atmosphere. When the aerosol loading is added in all regions simultaneously, precipitation in East China is most strongly affected, with a quite distinct transition to a low precipitation regime as the radiative forcing increases beyond 60 W/m^2. The response is less abrupt as we move westward, with precipitation in South India being least affected. This lower sensitivity in South India is attributed to aerosol forcing over East China. Additionally, the effect on precipitation is approximately linear with the forcing. The impact of doubling carbon dioxide levels is to increase precipitation over the regions and weaken the circulation. When the carbon dioxide and aerosol forcings are applied at the same time, the carbon dioxide forcing partially offsets the surface cooling and reduction in precipitation associated with the aerosol response.
physics.ao-ph, physics.comp-ph
Communication, Leadership, and Community-based Tourism Empowerment in Brunei Darussalam
Wei Lee Chin, Najib Noorashid
Effective communication plays a part in the success and the sustainability of tourism and hospitality management, including community-based tourism (CBT). In Brunei Darussalam, communication barriers have affected the growth of the CBT industry at both local and national levels. By analysing responses from 16 local CBT operators, this investigation focuses on aspects of communication and its channels in securing sustainability and empowerment of the CBT industry. This study found that the CBT ventures in Brunei include: (1) horizontal/lateral communication between CBT operators and surrounding communities; and (2) top-down communication and bottom-up accessibility between authorities, community leaders, and CBT owners. This paper further discusses how bureaucracy can impede the success of CBT operations, and how a collaborative approach between stakeholders has inspired the formulation of a new Interactional Model of Leadership and Empowerment among CBT stakeholders, which can be used to measure the efficacy of communication among stakeholders in the CBT industry.
Asian. Oriental, History of Asia
Caught on camera: Field imagery reveals the unexpected importance of vertebrates for biological control of the banana weevil (Cosmopolites sordidus Col. Curculionidae).
Paul Tresson, Philippe Tixier, William Puech
et al.
Understanding of ecological interactions is necessary for the application of biological control. Banana is the second most produced fruit worldwide and the banana weevil (Cosmopolites sordidus) is the most important pest of banana and plantain. Its biological control remains challenging because of the robustness and cryptic behaviour of the adult and the hidden development of larval stages. Researchers therefore tend to favour conservation biological control of this pest. The commonly used methods for measuring the effects of natural enemies on the regulation of this pest focus on invertebrates and may underestimate the role of vertebrates in biological control. Using cameras, we recorded the predation of sentinel adult weevils in banana plots in La Réunion island that differed in weevil infestation levels and in animal biodiversity. To facilitate image analysis, we used background subtraction to isolate moving parts of image sequences and thus detect predators and predation events. Our cameras recorded only vertebrates as predators of adult banana weevils. The most important predator appeared to be the Asian shrew (Suncus murinus), which was responsible for 67% of the predation events. Other predators included the house mouse (Mus musculus), the oriental garden lizard (Calotes versicolor), and the guttural toad (Sclerophrys gutturalis). The exact times of predation events were determined from the image metadata. It was thus possible to identify predator foraging periods that coincided with activity of adult weevils. Our results confirm that images provide useful information for biological and ecological studies. Along with other recent studies, our results suggest that the role of vertebrates in biological control may be underestimated.
Based on these results, we advocate for several management implications such as the installation of hedges, grasslands, and ponds to favour these vertebrate predators of the banana weevil, possibly also favouring other vertebrate and invertebrate natural enemies.
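The abstract does not say which background-subtraction variant was used. One simple form, a per-pixel median background followed by thresholded differencing, can be sketched as follows. The function names, the threshold values, and the tiny grayscale frames are all assumptions for illustration, not the authors' pipeline.

```python
from statistics import median


def median_background(frames):
    """Per-pixel median over a sequence of equally sized grayscale frames
    (each frame is a list of rows of intensity values)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(width)]
        for y in range(height)
    ]


def moving_mask(frame, background, threshold=30):
    """True where a pixel differs from the background by more than threshold."""
    return [
        [abs(pixel - bg) > threshold for pixel, bg in zip(row, bg_row)]
        for row, bg_row in zip(frame, background)
    ]


def has_motion(frame, background, threshold=30, min_pixels=2):
    """Flag a frame when enough pixels differ from the static background."""
    mask = moving_mask(frame, background, threshold)
    return sum(cell for row in mask for cell in row) >= min_pixels


# Toy sequence (hypothetical): five 3x3 frames, with a bright object in frame 2.
static = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
with_animal = [[10, 10, 10], [10, 200, 200], [10, 10, 10]]
frames = [static, static, with_animal, static, static]

background = median_background(frames)
flagged = [i for i, f in enumerate(frames) if has_motion(f, background)]
print(flagged)
```

Only the flagged frames would then need manual inspection to identify the animal and confirm a predation event. The median is used for the background because a briefly passing object does not shift it.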
Energetics of Arab Countries in the Light of the Green Economy Challenges
V. A. Isaev, A. O. Filonik
Renewable/alternative energy sources (RES), which are often referred to as “green energy” and are intended to create a new generation base, are now the object of close attention in almost all countries of the world. The growing effect of the use of these sources is noted in Europe, the USA, China, India and in a few other countries, but with varying degrees of success. The Arab region does not remain aloof from this direction, although, for obvious reasons, it is more committed to hydrocarbons and conventional energy than others. But persistent conservatism in relation to fossil fuels is inherent not only in Arab society. Even countries that do not have vast oil and gas fields find it difficult to part with traditional energy sources. And yet, the Arab world is gaining a very serious competitor in the form of green energy, which in the future will have to win its niche in electricity generation in the Arab part of the international economic space in a rather tough confrontation. Against this background, the alleged rivalry in the Arab region still looks more nominal than real and does not cause much concern in the Arab East, which is now experiencing much more complex political and socio-economic problems.