Results for "English language"

Showing 20 of ~4198901 results · from arXiv, DOAJ, Semantic Scholar

S2 Open Access 2019
How many words do we read per minute? A review and meta-analysis of reading rate

M. Brysbaert

Based on the analysis of 190 studies (18,573 participants), we estimate that the average silent reading rate for adults in English is 238 words per minute (wpm) for non-fiction and 260 wpm for fiction. The difference can be predicted by taking into account the length of the words, with longer words in non-fiction than in fiction. The estimates are lower than the numbers often cited in scientific and popular writings. The reasons for the overestimates are reviewed. The average oral reading rate (based on 77 studies and 5,965 participants) is 183 wpm. Reading rates are lower for children, old adults, and readers with English as second language. The reading rates are in line with maximum listening speed and do not require the assumption of reading-specific language processing. Within each group/task there are reliable individual differences, which are not yet fully understood. For silent reading of English non-fiction most adults fall in the range of 175 to 300 wpm; for fiction the range is 200 to 320 wpm. Reading rates in other languages can be predicted reasonably well by taking into account the number of words these languages require to convey the same message as in English.

429 citations en Psychology
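The reading-rate averages in the abstract above lend themselves to a quick arithmetic check. The sketch below turns a word count into an estimated reading time using the quoted averages (238 wpm for non-fiction, 260 wpm for fiction, 183 wpm for oral reading). The function name `reading_minutes` and the 5,000-word example are illustrative assumptions, not part of the paper.

```python
# Average adult reading rates in words per minute, as quoted in the abstract.
SILENT_WPM = {"non-fiction": 238, "fiction": 260}
ORAL_WPM = 183


def reading_minutes(word_count: int, wpm: float) -> float:
    """Estimated minutes needed to read `word_count` words at `wpm` words per minute."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return word_count / wpm


if __name__ == "__main__":
    words = 5000  # hypothetical text length, chosen only for illustration
    for genre, wpm in SILENT_WPM.items():
        print(f"silent, {genre}: {reading_minutes(words, wpm):.1f} min")
    print(f"oral: {reading_minutes(words, ORAL_WPM):.1f} min")
```

These are population averages; the abstract notes that most adults fall between 175 and 300 wpm for non-fiction, so any single reader's time may differ considerably.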
S2 Open Access 2013
Bilingual Word Embeddings for Phrase-Based Machine Translation

Will Y. Zou, R. Socher, Daniel Matthew Cer et al.

We introduce bilingual word embeddings: semantic embeddings associated across two languages in the context of neural language models. We propose a method to learn bilingual embeddings from a large unlabeled corpus, while utilizing MT word alignments to constrain translational equivalence. The new embeddings significantly out-perform baselines in word semantic similarity. A single semantic similarity feature induced with bilingual embeddings adds near half a BLEU point to the results of NIST08 Chinese-English machine translation task.

582 citations en Computer Science
S2 Open Access 2021
Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj et al.

We present Samanantar, the largest publicly available parallel corpora collection for Indic languages. The collection contains a total of 49.7 million sentence pairs between English and 11 Indic languages (from two language families). Specifically, we compile 12.4 million sentence pairs from existing, publicly available parallel corpora, and additionally mine 37.4 million sentence pairs from the Web, resulting in a 4× increase. We mine the parallel sentences from the Web by combining many corpora, tools, and methods: (a) Web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences. Human evaluation of samples from the newly mined corpora validates the high quality of the parallel sentences across 11 languages. Further, we extract 83.4 million sentence pairs between all 55 Indic language pairs from the English-centric parallel corpus using English as the pivot language. We trained multilingual NMT models spanning all these languages on Samanantar which outperform existing models and baselines on publicly available benchmarks, such as FLORES, establishing the utility of Samanantar. Our data and models are available publicly at Samanantar and we hope they will help advance research in NMT and multilingual NLP for Indic languages.

290 citations en Computer Science
arXiv Open Access 2025
AutoSign: Direct Pose-to-Text Translation for Continuous Sign Language Recognition

Samuel Ebimobowei Johnny, Blessed Guda, Andrew Blayama Stephen et al.

Continuously recognizing sign gestures and converting them to glosses plays a key role in bridging the gap between the hearing and hearing-impaired communities. This involves recognizing and interpreting the hands, face, and body gestures of the signer, which pose a challenge as it involves a combination of all these features. Continuous Sign Language Recognition (CSLR) methods rely on multi-stage pipelines that first extract visual features, then align variable-length sequences with target glosses using CTC or HMM-based approaches. However, these alignment-based methods suffer from error propagation across stages, overfitting, and struggle with vocabulary scalability due to the intermediate gloss representation bottleneck. To address these limitations, we propose AutoSign, an autoregressive decoder-only transformer that directly translates pose sequences to natural language text, bypassing traditional alignment mechanisms entirely. The use of this decoder-only approach allows the model to directly map between the features and the glosses without the need for CTC loss while also directly learning the textual dependencies in the glosses. Our approach incorporates a temporal compression module using 1D CNNs to efficiently process pose sequences, followed by AraGPT2, a pre-trained Arabic decoder, to generate text (glosses). Through comprehensive ablation studies, we demonstrate that hand and body gestures provide the most discriminative features for signer-independent CSLR. By eliminating the multi-stage pipeline, AutoSign achieves substantial improvements on the Isharah-1000 dataset, achieving an improvement of up to 6.1% in WER score compared to the best existing method.

en cs.CV, cs.AI
arXiv Open Access 2025
Adding Alignment Control to Language Models

Wenhong Zhu, Weinan Zhang, Rui Wang

Post-training alignment has increasingly become a crucial factor in enhancing the usability of language models (LMs). However, the strength of alignment varies depending on individual preferences. This paper proposes a method to incorporate alignment control into a single model, referred to as CLM. This approach adds one identity layer preceding the initial layers and performs preference learning only on this layer to map unaligned input token embeddings into the aligned space. Experimental results demonstrate that this efficient fine-tuning method performs comparably to full fine-tuning. During inference, the input embeddings are processed through the aligned and unaligned layers, which are then merged through the interpolation coefficient. By controlling this parameter, the alignment exhibits a clear interpolation and extrapolation phenomenon.

en cs.CL
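The abstract above describes merging aligned and unaligned representations through an interpolation coefficient but does not state the formula. As a minimal sketch, assuming a simple linear blend (an assumption, not confirmed by the abstract), the merge could look like the following. The name `merge_embeddings` and the toy vectors are hypothetical.

```python
from typing import List


def merge_embeddings(unaligned: List[float], aligned: List[float], lam: float) -> List[float]:
    """Linearly blend two embedding vectors (assumed form, not taken from the paper).

    lam = 0 returns the unaligned vector, lam = 1 returns the aligned vector,
    0 < lam < 1 interpolates between them, and lam > 1 extrapolates past
    the aligned vector.
    """
    if len(unaligned) != len(aligned):
        raise ValueError("vectors must have the same length")
    return [(1 - lam) * u + lam * a for u, a in zip(unaligned, aligned)]


if __name__ == "__main__":
    unaligned = [0.0, 2.0]  # toy values for illustration only
    aligned = [1.0, 4.0]
    for lam in (0.0, 0.5, 1.0, 2.0):
        print(lam, merge_embeddings(unaligned, aligned, lam))
```

Under this assumed form, a single scalar controls alignment strength at inference time, which matches the interpolation and extrapolation behaviour the abstract reports.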
arXiv Open Access 2025
Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

Lujun Li, Yewei Song, Lama Sleem et al.

Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.

en cs.CL
arXiv Open Access 2025
Safe at the Margins: A General Approach to Safety Alignment in Low-Resource English Languages -- A Singlish Case Study

Isaac Lim, Shaun Khoo, Roy Ka-Wei Lee et al.

Ensuring the safety of Large Language Models (LLMs) in diverse linguistic settings remains challenging, particularly for low-resource languages. Existing safety alignment methods are English-centric, limiting their effectiveness. We systematically compare Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO) for aligning SEA-Lion-v2.1-Instruct, a Llama 3-8B variant, to reduce toxicity in Singlish. Our results show that SFT+KTO achieves superior safety alignment with higher sample efficiency than DPO. Additionally, we introduce KTO-S, which enhances stability via improved KL divergence regularization. Our approach reduces Singlish toxicity by 99%, generalizes to TOXIGEN, and maintains strong performance on standard LLM benchmarks, providing a scalable framework for safer AI deployment in multilingual contexts.

en cs.CL, cs.AI
DOAJ Open Access 2025
Manifestations of Hybridization in the Novel Celestial Bodies by the Omani Novelist Jokha Alharthi

Samaneh Dehghani, Rouhollah Saiiadi nezhad

Sociolinguistics, by studying the reciprocal influence between language and society, seeks to analyze the role of language in human interactions. The novel, as a linguistic discourse encompassing a wide range of literary styles, serves as a window through which the reader can perceive social realities from a linguistic perspective. Hybridization is defined as the blending of language with various literary styles within a single literary work, allowing for an analysis that integrates diverse voices and perspectives. It also enables the study of a broad spectrum of individual and social voices. Celestial Bodies by Jokha Alharthi vividly embodies different forms of hybridization. This study aims to examine and analyze the manifestations of hybridization in this novel using a descriptive-analytical method. The findings indicate that hybridization appears in two main dimensions: unintentional and intentional. In the unintentional dimension, the author unconsciously incorporates borrowed vocabulary from Classical Arabic. In the intentional hybridization, Alharthi elevates the integration of her narrative language with various literary styles by employing intertextual references to the Quran and Hadith, incorporating proverbs and Omani colloquial dialect, and utilizing typographical hybridization. The author employs elements of hybridization to reflect class disparities in society, reinforce religious guidance and moral lessons, and affirm the cultural foundations of her community. This contributes to engaging the reader and enriching the novel, significantly influencing the representation of different societal voices and enabling the reader to grasp the contradictions embedded within its structure.
Keywords: Sociolinguistics, Hybridization, Celestial Bodies, Omani Novel.
Extended summary
Introduction. Language constitutes the fundamental material of the novel. It serves as a primary and prominent tool in embodying narrative and novelistic concepts, as it enables the writer to convey descriptions, emotions, concerns, thoughts, and critiques to the reader. Moreover, the novel functions as an effective medium for reflecting linguistic and literary issues and is capable of portraying social realities through language. Sociolinguistics examines the influences of society on language and vice versa, aiming to analyze the social and cultural roles of language in human interactions, individual and collective identity, social and political changes, and manifestations of power. Studying the impact of society on language, or vice versa, allows for a deeper understanding of characters, narrative realities, and the writer’s intent. It also reveals various aspects of the novel, such as social transformations and cultural exchanges. The study of novels that interlink language with social, cultural, and psychological issues holds significant importance. Research that delves deeply into linguistic interactions gains further value, as it allows for the exploration of a broad spectrum of individual and social voices. The study of the novel Celestial Bodies is of great significance, as it incorporates various forms of linguistic and literary blending. This analysis can provide insights into the author’s motivations and her purpose in utilizing these forms to reflect diverse social and cultural concepts. Examining this novel from a sociolinguistic critical perspective and investigating the relationship between language and different literary forms is crucial and highly valuable due to the importance of language and society as well as the role of language in reflecting social realities and individual and collective voices.
Materials & Methods. This study aims to apply the sociolinguistic approach to analyze the process of blending and hybridization between different linguistic and literary forms in Celestial Bodies by Jokha Alharthi. The goal is to highlight social and cultural diversity, reflect conflicts and differences within society, and convey valuable social messages to the audience. This research examines Celestial Bodies by Omani author Jokha Alharthi from the perspective of sociolinguistics and hybridization, employing a descriptive-analytical method. The study seeks to identify, present, and scientifically analyze the hybrid elements present in this novel.
Research findings. The author of Celestial Bodies, Jokha Alharthi, blends her narrative language with diverse literary techniques. This usage manifests in two dimensions: intentional and unintentional. In the unintentional dimension, the author utilizes borrowed words that have entered the Arabic language unconsciously. She also innovates her narrative language by unconsciously incorporating words from English or other languages. The use of these words enables her to convey her concepts and themes to the reader, enriching her narrative style. In the intentional dimension, the author employs various literary techniques to communicate the ideas she wants to express, such as Quranic and Hadith intertextuality, hybridization (in its various forms, including linguistic nesting), and the use of colloquial language, proverbs, and poetry. The author's deliberate use of hybridization elements, particularly her use of colloquial language, is evident. The use of multiple hybridization techniques in Alharthi's work and their integration with the narrative language demonstrates the author's intentionality and awareness. Through her more deliberate and conscious use of intentional hybridization, as opposed to unintentional hybridization, Alharthi sought to convey specific concepts and ideologies to her audience.
Discussion & Conclusion. Alharthi created a variety of main and secondary characters in her novel within the framework of linguistic division to showcase class differences through language, blending it with the languages of all social classes. In fact, hybridization in this novel appears in the relationships between the characters and their life experiences, with the author reflecting her use of different forms of literature and the language of the novel through the characters' interactions with the challenges they face. The multiple voices in the novel contribute to building a comprehensive picture of the social reality, revealing tensions and internal conflicts within the community. In fact, the author used various forms of literature within her language in this novel, successfully maintaining the narrative identity while presenting different aspects of society, such as social transformations, class distinctions, cultural exchanges, and strengthening the cultural foundations of her society. This element addresses deep social issues like poverty and exploitation, aiming to provoke the reader into critically contemplating these phenomena. It is clear that the author used colloquial Arabic to depict the village space in her novel, the linguistic division to present class differences within society, proverbs to deliver moral advice, and religious intertextuality with the Quran and Hadith to enhance religious teachings, combat ignorance and superstition, and strengthen the cultural foundations of her society. Overall, Alharthi connected various linguistic and literary tools to the language of her novel, enriching her work through the use of hybridization, such as proverbs, linguistic division, and colloquial language. This not only increased its appeal and diversity but also elevated its maturity and depiction. The author does not just narrate events but integrates literary and linguistic tools to create a text that reflects a complex social reality, highlighting contradictions, injustice, and discrimination within society, allowing the reader to reflect upon and engage with these phenomena.

Oriental languages and literatures
DOAJ Open Access 2025
Code-mixing between Arabic and English among Jordanians on social media

Asma Mohammad Hussein Aburqayiq, Abdel Rahman Mitib Altakhaineh, Anas Hashem Alsariera

This study aims to investigate the types and motivations of code-mixing between Arabic and English in Jordanian social media conversations. Employing both quantitative and qualitative methods, the research examines fifteen recorded videos from various social media platforms like Facebook, Instagram, and YouTube. Instances of code-mixing are categorized using Muysken’s classification, which includes insertional, congruent lexicalization, and alternational types. The findings indicate that insertional code-mixing is the most prevalent, accounting for 80% of instances. This is followed by congruent lexicalization at 18.8% and alternational code-mixing at 1.2%. The study reveals that Jordanians frequently mix English into Arabic conversations, influenced by several factors such as social status, prestige, globalization, rapid advances in technology and artificial intelligence (AI), and education level. The study recommends incorporating code-mixing awareness into educational programs and encourages further research to explore the long-term effects of code-mixing on language development.

Social Sciences
DOAJ Open Access 2025
Integrating Moral Values into EFL learning through a Multimodal Project: A Case Study

Dewi Puspitasari, Sri Wuli Fitriati, Widhiyanto et al.

Background/purpose. The integration of moral values into EFL teaching has been an important issue in Indonesia as multimodal literacy practices can be employed to foster both language learning and moral development among young learners. This study investigated how multimodal learning can be used to promote moral values alongside language skills among young learners in Indonesian primary schools. Materials/methods. This research employed qualitative methods. Data was collected through observations, interviews, and analysis of students’ artifacts. Activities such as storytelling, drawing, and collaborative projects were designed to encourage meaningful engagement with values such as kindness and responsibility. Kress and van Leeuwen’s multimodal framework, Unsworth’s literacy development cycle, and principles of critical literacy were used as analytical frameworks. Results. The findings reveal that multimodal literacy practices not only enhance linguistic competence but also nurture critical literacy and moral-social awareness. Students demonstrated increased autonomy, creativity, and a deeper understanding of moral values in these interactive activities. Conclusion. The study highlights the transformative potential of multimodal approaches in EFL contexts, emphasizing their capacity to develop young learners into autonomous, creative meaning-makers. However, further refinement of theoretical and methodological frameworks is needed to ensure broader applicability across diverse educational settings.

Education, Education (General)
arXiv Open Access 2024
Massively Multilingual Text Translation For Low-Resource Languages

Zhong Zhou

Translation into severely low-resource languages has both the cultural goal of saving and reviving those languages and the humanitarian goal of assisting the everyday needs of local communities that are accelerated by the recent COVID-19 pandemic. In many humanitarian efforts, translation into severely low-resource languages often does not require a universal translation engine, but a dedicated text-specific translation engine. For example, healthcare records, hygienic procedures, government communication, emergency procedures and religious texts are all limited texts. While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, low-resource languages may be possible and reduce human translation effort. We attempt to leverage translation resources from rich-resource languages to efficiently produce best possible translation quality for well known texts, which are available in multiple languages, in a new, low-resource language. To reach this goal, we argue that in translating a closed text into low-resource languages, generalization to out-of-domain texts is not necessary, but generalization to new languages is. Performance gain comes from massive source parallelism by careful choice of close-by language families, style-consistent corpus-level paraphrases within the same language and strategic adaptation of existing large pretrained multilingual models to the domain first and then to the language. Such performance gain makes it possible for machine translation systems to collaborate with human translators to expedite the translation process into new, low-resource languages.

en cs.CL
arXiv Open Access 2024
Attacks on Third-Party APIs of Large Language Models

Wanru Zhao, Vidit Khazanchi, Haodi Xing et al.

Large language model (LLM) services have recently begun offering a plugin ecosystem to interact with third-party API services. This innovation enhances the capabilities of LLMs, but it also introduces risks, as these plugins developed by various third parties cannot be easily trusted. This paper proposes a new attacking framework to examine security and safety vulnerabilities within LLM platforms that incorporate third-party services. Applying our framework specifically to widely used LLMs, we identify real-world malicious attacks across various domains on third-party APIs that can imperceptibly modify LLM outputs. The paper discusses the unique challenges posed by third-party API integration and offers strategic possibilities to improve the security and safety of LLM ecosystems moving forward. Our code is released at https://github.com/vk0812/Third-Party-Attacks-on-LLMs.

en cs.CR, cs.AI

Page 34 of 209946