Santanu Bhadra, Charu Pundir, Maria Mukherjee et al.
Ethnopharmacology explores bioactive compounds rooted in traditional medical knowledge systems and holds immense promise for drug discovery, cultural preservation, and healthcare innovation. However, fragmented documentation, minimal digitization, and limited integration with biomedical frameworks remain major barriers. The advent of generative artificial intelligence (GenAI), including large language models (LLMs) and molecular generation algorithms, offers transformative solutions to these challenges. This narrative review critically examines the application of GenAI in ethnopharmacology and highlights its role in digitizing traditional knowledge, decoding polyherbal formulations, predicting herb-drug interactions, and accelerating phytopharmaceutical discovery. It synthesizes current literature on GenAI tools and methods relevant to ethnopharmacology, considering natural language processing, knowledge graph construction, molecular modeling, and multimodal data integration. A five-phase strategic framework is proposed for the ethical and effective implementation of GenAI. This review recounts real-world applications from Asian (Ayurvedic, Chinese, Japanese, Thai, Vietnamese), African, and Indigenous American medicine systems, demonstrating adaptability across cultures. Stakeholder-specific benefits, spanning academia, healthcare, industry, and indigenous communities, are also discussed, along with methodological innovations and ethical considerations. GenAI offers a significant transition in ethnopharmacology by integrating traditional knowledge systems with advanced computational tools to foster inclusive, data-driven innovation across global traditional medicine systems.
We introduce CSAR, an algorithm for inducing morphemes from emergent language corpora of parallel utterances and meanings. It is a greedy algorithm that (1) weights morphemes based on mutual information between forms and meanings, (2) selects the highest-weighted pair, (3) removes it from the corpus, and (4) repeats the process to induce further morphemes (i.e., Count, Select, Ablate, Repeat). The effectiveness of CSAR is first validated on procedurally generated datasets and compared against baselines for related tasks. Second, we validate CSAR's performance on human language data to show that the algorithm makes reasonable predictions in adjacent domains. Finally, we analyze a handful of emergent languages, quantifying linguistic characteristics like degree of synonymy and polysemy.
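The four-step Count-Select-Ablate-Repeat loop described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the pointwise-mutual-information scoring, the single-token form representation, and the stopping criterion are simplifying assumptions.

```python
import math
from collections import Counter

def csar(pairs, max_morphemes=10):
    """Minimal sketch of the Count-Select-Ablate-Repeat loop.

    `pairs` is a list of (utterance, meanings) where `utterance` is a
    list of form tokens and `meanings` is a set of meaning atoms.
    """
    pairs = [(list(u), set(m)) for u, m in pairs]
    morphemes = []
    for _ in range(max_morphemes):
        n = len(pairs)
        form_counts, meaning_counts, joint = Counter(), Counter(), Counter()
        for utt, mset in pairs:
            forms, meanings = sorted(set(utt)), sorted(mset)
            for f in forms:
                form_counts[f] += 1
            for m in meanings:
                meaning_counts[m] += 1
            for f in forms:
                for m in meanings:
                    joint[(f, m)] += 1
        # Count: score each (form, meaning) pair by pointwise mutual information.
        best, best_pmi = None, 0.0
        for (f, m), c in joint.items():
            pmi = math.log(c * n / (form_counts[f] * meaning_counts[m]))
            if pmi > best_pmi:
                best, best_pmi = (f, m), pmi
        if best is None:  # no pair carries positive mutual information
            break
        morphemes.append(best)  # Select the highest-weighted pair.
        f, m = best
        # Ablate: delete the induced form and meaning from the corpus, then Repeat.
        pairs = [([t for t in utt if t != f], mset - {m}) for utt, mset in pairs]
        pairs = [(u, s) for u, s in pairs if u and s]
    return morphemes
```

On a toy corpus where a token co-occurs consistently with one meaning, the loop recovers the form-meaning pairings one at a time.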
GenQuest is a generative text adventure game that leverages Large Language Models (LLMs) to facilitate second language learning through immersive, interactive storytelling. The system engages English as a Foreign Language (EFL) learners in a collaborative "choose-your-own-adventure" style narrative, dynamically generated in response to learner choices. Game mechanics such as branching decision points and story milestones are incorporated to maintain narrative coherence while allowing learner-driven plot development. Key pedagogical features include content generation tailored to each learner's proficiency level, and a vocabulary assistant that provides in-context explanations of learner-queried text strings, ranging from words and phrases to sentences. Findings from a pilot study with university EFL students in China indicate promising vocabulary gains and positive user perceptions. Also discussed are participants' suggestions regarding narrative length and quality, as well as requests for multimodal content such as illustrations.
Japanese graphic design works have distinctive typographic characteristics. The different writing systems of Japanese and Latin characters lead to differences in how typography is formed. This research was conducted to explore the differences in character and arrangement between typography for phonetic scripts, ideographic scripts, and the Latin alphabet. It was carried out as a comparative experiment, taking 5 font samples from each of 3 different letter forms. The results describe the differences in character formation and in the basic axis of the letters.
Document-level Relation Extraction (DocRE) is the task of extracting all semantic relationships from a document. While studies have been conducted on English DocRE, limited attention has been given to DocRE in non-English languages. This work delves into effectively utilizing existing English resources to promote DocRE studies in non-English languages, with Japanese as the representative case. As an initial attempt, we construct a dataset by transferring an English dataset to Japanese. However, models trained on such a dataset suffer from low recall. We investigate the error cases and attribute the failure to differences in surface structure and semantics between documents translated from English and those written by native speakers. We thus switch to exploring whether the transferred dataset can assist human annotation of Japanese documents. In our proposal, annotators edit relation predictions from a model trained on the transferred dataset. Quantitative analysis shows that relation recommendations suggested by the model help reduce approximately 50% of the human edit steps compared with the previous approach. Experiments quantify the performance of existing DocRE models on our collected dataset, portraying the challenges of Japanese and cross-lingual DocRE.
The progress language models have made in fulfilling almost all kinds of tasks has attracted the attention not only of researchers but also of society, and has enabled these models to become products. Commercially successful language models are available. However, users may prefer open-source language models due to cost, data privacy, or regulations. Yet, despite the increasing number of such models, there is no comprehensive comparison of their performance for Turkish. This study aims to fill this gap in the literature. Seven selected language models are compared on their in-context learning and question-answering abilities. Turkish datasets for in-context learning and question answering were prepared, and both automatic and human evaluations were conducted. The results show that, for question answering, continuing pretraining before fine-tuning with instruction datasets is more successful in adapting multilingual models to Turkish, and that in-context learning performance is not strongly related to question-answering performance.
GPT (Generative Pre-trained Transformer) language models are an artificial intelligence and natural language processing technology that enables automatic text generation. There is a growing interest in applying GPT language models to university teaching in various dimensions. From the perspective of innovation in student and teacher activities, they can provide support in understanding and generating content, problem-solving, as well as personalization and test correction, among others. From the dimension of internationalization, the misuse of these models represents a global problem that requires a series of common measures in universities across different geographical areas. In several countries, assessment tools have been reviewed to ensure that work is done by students and not by AI. To this end, we have conducted a detailed experiment in Software Engineering, a representative Computer Science subject, focused on evaluating the use of ChatGPT as an assistant in theory activities, exercises, and laboratory practices, and assessing its potential as a support tool for both students and teachers.
Kosuke Matsuzaki, Masaya Taniguchi, Kentaro Inui et al.
We introduce a Japanese morphology dataset, J-UniMorph, developed based on the UniMorph feature schema. This dataset addresses the unique and rich verb forms characteristic of the language's agglutinative nature. J-UniMorph distinguishes itself from the existing Japanese subset of UniMorph, which is automatically extracted from Wiktionary. On average, the Wiktionary Edition features around 12 inflected forms for each word and is primarily dominated by denominal verbs (i.e., [noun] +suru (do-PRS)). Morphologically, this form is equivalent to the verb suru (do). In contrast, J-UniMorph explores a much broader and more frequently used range of verb forms, offering 118 inflected forms for each word on average. It includes honorifics, a range of politeness levels, and other linguistic nuances, emphasizing the distinctive characteristics of the Japanese language. This paper presents detailed statistics and characteristics of J-UniMorph, comparing it with the Wiktionary Edition. We publicly release J-UniMorph and its interactive visualizer, aiming to support cross-linguistic research and various applications.
With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text utterances. We also introduce a new multi-pass augmented generative error correction (MPA GER) method that integrates multiple system hypotheses on the input side with corrections from multiple LLMs on the output side and then merges them. To the best of our knowledge, this is the first investigation of the use of LLMs for Japanese GER, which involves second-pass language modeling on the output transcriptions generated by the ASR system (e.g., N-best hypotheses). Our experiments demonstrated that the proposed methods improve ASR quality and generalization on both the SPREDS-U1-ja and CSJ data.
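The multi-pass idea can be illustrated with a short sketch. The hypothesis formatting and the majority-vote merge below are assumptions made for illustration; the paper's exact merging procedure may differ.

```python
from collections import Counter

def mpa_ger(hypotheses, correctors):
    """Sketch of multi-pass augmented GER: combine N-best ASR
    hypotheses on the input side, collect one correction from each
    LLM on the output side, and merge the corrections."""
    # Input side: present all system hypotheses together.
    prompt = "\n".join(f"hypothesis {i + 1}: {h}"
                       for i, h in enumerate(hypotheses))
    # Output side: one corrected transcription per corrector LLM.
    corrections = [correct(prompt) for correct in correctors]
    # Merge: keep the most frequently produced correction.
    merged, _ = Counter(corrections).most_common(1)[0]
    return merged
```

Here each element of `correctors` stands in for a call to a different LLM; in practice these would be API or model-inference calls rather than plain functions.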
We present a speech database and a phoneme-level language model of Polish. The database and model are designed for the analysis of prosodic and discourse factors and their impact on acoustic parameters in interaction with predictability effects. The database is also the first large, publicly available Polish speech corpus of excellent acoustic quality that can be used for phonetic analysis and training of multi-speaker speech technology systems. The speech in the database is processed in a pipeline that achieves a 90% degree of automation. It incorporates state-of-the-art, freely available tools enabling database expansion or adaptation to additional languages.
Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation rather than tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work first collected interaction trajectories between LLMs and environments, using only the trajectories that successfully finished the task to fine-tune smaller models; this makes fine-tuning data scarce and acquiring it both difficult and costly. Discarding failed trajectories also leads to significant wastage of data and resources and limits the possible optimization paths during fine-tuning. In this paper, we argue that unsuccessful trajectories offer valuable insights, and that LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies. By simply adding a prefix or suffix that tells the model whether to generate a successful trajectory during training, we improve model performance by a large margin on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. We further analyze the inference results and find that our method provides a better trade-off between valuable information and errors in unsuccessful trajectories. To our knowledge, we are the first to demonstrate the value of negative trajectories and their application in agent-tuning scenarios. Our findings offer guidance for developing better agent-tuning methods and low-resource data usage techniques.
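The prefix trick described above amounts to a small data-preparation step. The sketch below illustrates it; the [GOOD]/[BAD] markers and the dictionary schema are illustrative placeholders, not the paper's exact format.

```python
def label_trajectories(trajectories):
    """Prefix-conditioning sketch: keep failed trajectories instead of
    discarding them by marking each training example with a quality
    token. Each trajectory is a dict with 'prompt', 'response', and a
    boolean 'success' flag."""
    examples = []
    for t in trajectories:
        prefix = "[GOOD]" if t["success"] else "[BAD]"
        examples.append({"input": f"{prefix} {t['prompt']}",
                         "target": t["response"]})
    return examples

def inference_prompt(task):
    # At inference time, condition on the success marker so the model
    # generates in "successful trajectory" mode.
    return f"[GOOD] {task}"
```

Because failed trajectories are retained under a distinct marker, the model can learn from their content without being steered toward reproducing their errors at inference time.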
Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding of the extent to which these models internalize syntactic knowledge, and various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, these sentences are manually extracted from linguistics textbooks, handbooks, and journal articles, and split into in-domain data (86%; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14%; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese language models on JCoLA. The results demonstrate that several models could surpass human performance on the in-domain data, while no models were able to exceed human performance on the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
Large Language Models (LLMs) may hallucinate facts, while curated Knowledge Graphs (KGs) are typically factually reliable, especially for domain-specific knowledge. Measuring the alignment between KGs and LLMs can effectively probe factuality and identify the knowledge blind spots of LLMs. However, verifying LLMs over extensive KGs can be expensive. In this paper, we present KGLens, a Thompson-sampling-inspired framework aimed at effectively and efficiently measuring the alignment between KGs and LLMs. KGLens features a graph-guided question generator for converting KGs into natural language, along with a carefully designed importance sampling strategy based on parameterized KG structure to expedite KG traversal. Our simulation experiment compares the brute-force method with KGLens under six different sampling methods, demonstrating that our approach achieves superior probing efficiency. Leveraging KGLens, we conducted in-depth analyses of the factual accuracy of ten LLMs across three large domain-specific KGs from Wikidata, comprising over 19K edges, 700 relations, and 21K entities. Human evaluation results indicate that KGLens can assess LLMs with a level of accuracy nearly equivalent to that of human annotators, achieving an accuracy rate of 95.7%.
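The Thompson-sampling intuition behind this kind of probing can be sketched as follows. This is a simplified illustration, not KGLens's actual sampler: the per-edge Beta posterior, the greedy selection rule, and the `check_fn` interface are all assumptions made for the sketch.

```python
import random

def probe_kg(edges, check_fn, n_rounds=200, seed=0):
    """Thompson-sampling-style probing sketch. Each edge keeps a
    Beta posterior over its error rate; probing favors edges whose
    sampled error rate is highest, so likely blind spots are visited
    more often than facts the LLM already answers correctly.

    `edges` maps an edge id to a (subject, relation, object) triple;
    `check_fn(triple)` returns True when the (hypothetical) LLM
    answers the verbalized edge correctly."""
    rng = random.Random(seed)
    beta_params = {e: [1.0, 1.0] for e in edges}  # [errors+1, correct+1]
    error_counts = {e: 0 for e in edges}
    for _ in range(n_rounds):
        # Sample an error rate for every edge; probe the worst-looking one.
        sampled = {e: rng.betavariate(a, b) for e, (a, b) in beta_params.items()}
        e = max(sampled, key=sampled.get)
        correct = check_fn(edges[e])  # stands in for verbalize-and-query
        if correct:
            beta_params[e][1] += 1.0
        else:
            beta_params[e][0] += 1.0
            error_counts[e] += 1
    return error_counts
```

Edges the LLM keeps getting wrong accumulate probability mass on high error rates and are revisited, which is what makes this style of probing more budget-efficient than brute-force traversal.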
The objective of this research was to determine the effect of using Educandy learning media on Japanese-language learning outcomes for students at a state vocational high school in Jakarta in the 2020/2021 academic year. This study applied an experimental quantitative research method with a purposive sampling technique of 16 students. The research used a one-group Pre-test–Post-test design, and the collected data were analysed using descriptive and inferential statistics. In preparing the instrument, validity and reliability tests were conducted. The results showed that, based on the Shapiro-Wilk normality test, the data were normally distributed according to the significance value, and the Pre-test and Post-test data showed that students' learning outcomes increased after using Educandy. Moreover, the Levene homogeneity test showed a significance value greater than the significance level, indicating that the data were homogeneous, while, following the normality and homogeneity tests, the significance value for the difference between Pre-test and Post-test learning outcomes was lower than the significance level. Based on this analysis, it can be concluded that H0 is rejected and H1 is accepted, meaning there is a significant difference between Pre-test and Post-test student learning outcomes, so the Educandy learning media influences student learning outcomes in Japanese subjects.
Language is used to communicate with fellow human beings; through language we can understand what other people want to convey to us, whether spoken or written, formal or informal. With language we perform actions to convey meaning to a speech partner, often called the interlocutor, so that they can grasp what we say. Language contains speech acts, which are part of pragmatics. A speech act is the utterance of a sentence to make the speaker's purpose known to the interlocutor. A speech act has three main parts: locution, illocution, and perlocution, each with its own function in making an utterance easily understood by the interlocutor. This study aims to describe the forms of speech acts and the complaining strategies used in a Japanese animation. It uses a qualitative descriptive method, examining the utterances that occur in the data. The data were obtained from the anime Tanaka-kun wa Itsumo Kedaruge, which contains many complaints that serve as the research source. Data analysis was done by classifying the forms of speech acts and complaining strategies using existing theories. The results found 51 instances of speech acts: 22 direct speech acts and 29 indirect speech acts. For complaining strategies, 51 instances were found: implicit complaints, with 17 in the hint sub-category; expressions of annoyance/disapproval, with 31 in the annoyance sub-category and 2 in the ill-consequences sub-category; accusations, with 1 instance of direct accusation; and 2 instances of explicit blame (behavior).
This article focuses on the concept of space and two different structures of space in the mythological chronicles Kojiki and Nihon Shoki. It considers the two main invisible divine gods of space, probably connected to Chinese mythology, that appear first in the mythological chronicles: Ame-no-minaka-nushi and Kuni-no-tokotachi. It traces their evolution in history and also deals with how some Japanese gods, within the Buddhist worldview, acquired a fantastic appearance and a key role in cosmogenesis. It further examines the connection between the first verbal descriptions of the Japanese lands, seen from above as a living creature or a symbolic object, and the ritual of "viewing the realm" (kunimi), as well as a technique of Chinese-Japanese painting in which an object is portrayed as seen from above (for which purpose, in indoor scenes, a building is depicted without a roof), etc.