Results for "North Germanic. Scandinavian"

arXiv Open Access 2025

Are All Spanish Doctors Male? Evaluating Gender Bias in German Machine Translation

Michelle Kappl

We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1, we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences that are balanced in regard to gender, as well as stereotype, which was annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available under https://github.com/michellekappl/mt_gender_german.

en cs.CL

arXiv Open Access 2025

taz2024full: Analysing German Newspapers for Gender Bias and Discrimination across Decades

Stefanie Urchs, Veronika Thurner, Matthias Aßenmacher et al.

Open-access corpora are essential for advancing natural language processing (NLP) and computational social science (CSS). However, large-scale resources for German remain limited, restricting research on linguistic trends and societal issues such as gender bias. We present taz2024full, the largest publicly available corpus of German newspaper articles to date, comprising over 1.8 million texts from taz, spanning 1980 to 2024. As a demonstration of the corpus's utility for bias and discrimination research, we analyse gender representation across four decades of reporting. We find a consistent overrepresentation of men, but also a gradual shift toward more balanced coverage in recent years. Using a scalable, structured analysis pipeline, we provide a foundation for studying actor mentions, sentiment, and linguistic framing in German journalistic texts. The corpus supports a wide range of applications, from diachronic language analysis to critical media studies, and is freely available to foster inclusive and reproducible research in German-language NLP.

en cs.CL

CrossRef Open Access 2024

The Sparse Inflection of Middle and Modern English

Joseph Emonds, Jan Terje Faarlund

en

CrossRef Open Access 2024

The Middle English Lexicon: Cultural Integration Creates Anglicized Norse

Joseph Emonds, Jan Terje Faarlund

en

arXiv Open Access 2024

Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation

Eyal Liron Dolev, Clemens Fidel Lutz, Noëmi Aepli

Whisper is a state-of-the-art automatic speech recognition (ASR) model (Radford et al., 2022). Although Swiss German dialects are allegedly not part of Whisper's training data, preliminary experiments showed that Whisper can transcribe Swiss German quite well, with the output being a speech translation into Standard German. To gain a better understanding of Whisper's performance on Swiss German, we systematically evaluate it using automatic, qualitative, and human evaluation. We test its performance on three existing test sets: SwissDial (Dogan-Schönberger et al., 2021), STT4SG-350 (Plüss et al., 2023), and Swiss Parliaments Corpus (Plüss et al., 2021). In addition, we create a new test set for this work, based on short mock clinical interviews. For automatic evaluation, we used word error rate (WER) and BLEU. In the qualitative analysis, we discuss Whisper's strengths and weaknesses and analyze some output examples. For the human evaluation, we conducted a survey with 28 participants who were asked to evaluate Whisper's performance. All of our evaluations suggest that Whisper is a viable ASR system for Swiss German, so long as the Standard German output is desired.

en cs.CL

arXiv Open Access 2024

EASSE-DE: Easier Automatic Sentence Simplification Evaluation for German

Regina Stodden

In this work, we propose EASSE-multi, a framework for easier automatic sentence evaluation for languages other than English. Compared to the original EASSE framework, EASSE-multi does not focus only on English. It contains tokenizers and versions of text simplification evaluation metrics which are suitable for multiple languages. In this paper, we exemplify the usage of EASSE-multi for German TS, resulting in EASSE-DE. Further, we compare text simplification results when evaluating with different language or tokenization settings of the metrics. Based on this, we formulate recommendations on how to make the evaluation of (German) TS models more transparent and better comparable. The code of EASSE-multi and its German specialisation (EASSE-DE) can be found at https://github.com/rstodden/easse-de.

en cs.CL

arXiv Open Access 2023

German BERT Model for Legal Named Entity Recognition

Harshil Darji, Jelena Mitrović, Michael Granitzer

The use of BERT, one of the most popular language models, has led to improvements in many Natural Language Processing (NLP) tasks. One such task is Named Entity Recognition (NER), i.e., automatic identification of named entities such as location, person, organization, etc. from a given text. It is also an important base step for many NLP tasks such as information extraction and argumentation mining. Even though there is much research done on NER using BERT and other popular language models, the same is not explored in detail when it comes to Legal NLP or Legal Tech. Legal NLP applies various NLP techniques such as sentence similarity or NER specifically on legal data. There are only a handful of models for NER tasks using BERT language models; however, none of these are aimed at legal documents in German. In this paper, we fine-tune a popular BERT language model trained on German data (German BERT) on a Legal Entity Recognition (LER) dataset. To make sure our model is not overfitting, we performed a stratified 10-fold cross-validation. The results we achieve by fine-tuning German BERT on the LER dataset outperform the BiLSTM-CRF+ model used by the authors of the same LER dataset. Finally, we make the model openly available via HuggingFace.

en cs.CL, cs.LG

arXiv Open Access 2023

Data and Approaches for German Text simplification -- towards an Accessibility-enhanced Communication

Thorben Schomacker, Michael Gille, Jörg von der Hülls et al.

This paper examines the current state-of-the-art of German text simplification, focusing on parallel and monolingual German corpora. It reviews neural language models for simplifying German texts and assesses their suitability for legal texts and accessibility requirements. Our findings highlight the need for additional training data and more appropriate approaches that consider the specific linguistic characteristics of German, as well as the importance of the needs and preferences of target groups with cognitive or language impairments. The authors launched the interdisciplinary OPEN-LS project in April 2023 to address these research gaps. The project aims to develop a framework for text formats tailored to individuals with low literacy levels, integrate legal texts, and enhance comprehensibility for those with linguistic or cognitive impairments. It will also explore cost-effective ways to enhance the data with audience-specific illustrations using image-generating AI. For more and up-to-date information, please visit our project homepage https://open-ls.entavis.com

en cs.CL, cs.AI

arXiv Open Access 2023

Spaiche: Extending State-of-the-Art ASR Models to Swiss German Dialects

Clement Sicard, Kajetan Pyszkowski, Victor Gillioz

Recent breakthroughs in NLP largely increased the presence of ASR systems in our daily lives. However, for many low-resource languages, ASR models still need to be improved due in part to the difficulty of acquiring pertinent data. This project aims to help advance research in ASR models for Swiss German dialects, by providing insights about the performance of state-of-the-art ASR models on recently published Swiss German speech datasets. We propose a novel loss that takes into account the semantic distance between the predicted and the ground-truth labels. We outperform current state-of-the-art results by fine-tuning OpenAI's Whisper model on Swiss-German datasets.

en cs.CL, cs.LG

arXiv Open Access 2022

The link between North Atlantic tropical cyclones and ENSO in seasonal forecasts

Robert Doane-Solomon, Daniel Befort, Joanne Camp et al.

This study assesses the ability of six European seasonal forecast models to simulate the observed teleconnection between ENSO and tropical cyclones (TCs) over the North Atlantic. While the models generally capture the basin-wide observed link, its magnitude is overestimated in all forecast models compared to reanalysis. Furthermore, the ENSO - TC relationship in the Caribbean is poorly simulated. It is shown that incorrect forecasting of wind shear appears to affect the representation of the teleconnection in some models, however it is not a completely sufficient explanation for the overestimation of the link.

en physics.ao-ph

CrossRef Open Access 2021

Palatalization and assibilation of /k/ in English and Scottish place-names

Stephen Laker

There are very few place-names with initial Ch- [tʃ-] in Scotland, Northern England and much of the East Midlands. Names that do exist are almost exclusively late formations and usually consist of French rather than Old English place-name elements. This article investigates why assibilation is present in some areas and absent from others. The results lead to a reassessment of several points, including: (1) the phonetic and phonological development of the voiceless velar in Early English in particular environments; (2) the extent to which external influence counteracted palatalization and assibilation in some areas; (3) the disparities between the place-name and dialectal evidence.

3 citations en

CrossRef Open Access 2021

A multifactorial account of analogical developments in Old English nominal paradigms

Elżbieta Adamczyk

The present paper focuses on the interaction of factors that conditioned analogical developments in Old English nominal paradigms. They include especially the absolute and relative frequency of occurrence, the salience of inflectional exponents, the formal inflectional overlap across paradigms, functional factors, semantics and syllable structure (stem weight). They could work in two opposite directions, namely towards retaining the etymological inflections or they could facilitate the adoption of analogical endings. The significance of individual factors for the reorganisation of nominal paradigms is investigated by employing a statistical analysis (multivariate logistic regression) which allows us to rank them. The analysis demonstrates that the attested inflectional patterns can largely be explained by an interaction of three factors, namely salience and frequency, which can be linked to the cognitive aspects of storage and retrieval of linguistic information, and the overlap of inflectional forms across paradigms, which is a manifestation of analogical pressure in the paradigms.

1 citation en

CrossRef Open Access 2021

Reference in Discourse and Models of Article Grammaticalization

Dominika Skrzypek, Alicja Piotrowska, Rafał Jaworski

en

CrossRef Open Access 2021

The Diachrony of (In)definiteness—A Quantitative Study

Dominika Skrzypek, Alicja Piotrowska, Rafał Jaworski

en

CrossRef Open Access 2020

Editorial introduction

en

arXiv Open Access 2020

A Massive Young Runaway Star in W49 North

Luis F. Rodriguez, Roberto Galvan-Madrid, Joel Sanchez-Bermudez et al.

We analyzed high angular resolution 45.5 GHz images of the W49 North massive star forming region obtained in 1998 and 2016 with the Very Large Array. Most of the ultracompact HII regions show no detectable changes over the time interval of the observations. However, subcomponents B1, B2, G2a and G2c have increased their peak flux densities by values in the range of 3.8 to 21.4%. Most interestingly, the cometary region C clearly shows proper motions that at the distance of the region are equivalent to a velocity of 76$\pm$6 km s$^{-1}$ in the plane of the sky. We interpret this region as the ionized bowshock produced by a runaway O6 ZAMS star that was ejected from the eastern edge of Welch's ring about 6,400 years ago.

en astro-ph.SR, astro-ph.GA

arXiv Open Access 2020

Distribution of interseismic coupling along the North and East Anatolian Faults inferred from InSAR and GPS data

Quentin Bletery, Olivier Cavalié, Jean-Mathieu Nocquet et al.

The North Anatolian Fault (NAF) has produced numerous major earthquakes. After decades of quiescence, the $\rm{M_w}$ 6.8 Elazığ earthquake (January 24, 2020) has recently reminded us that the East Anatolian Fault (EAF) is also capable of producing significant earthquakes. To better estimate the seismic hazard associated with these two faults, we jointly invert Interferometric Synthetic Aperture Radar (InSAR) and GPS data to image the spatial distribution of interseismic coupling along the eastern part of both the North and East Anatolian Faults. We perform the inversion in a Bayesian framework, enabling us to estimate uncertainties on both long-term relative plate motion and coupling. We find that coupling is high and deep (0-20 km) on the NAF and heterogeneous and superficial (0-5 km) on the EAF. Our model predicts that the Elazığ earthquake released between 200 and 250 years of accumulated moment, suggesting a bi-centennial recurrence time.

en physics.geo-ph

arXiv Open Access 2012

Early Cosmic-Ray Work Published in German

Joerg R. Hoerandel

The article gives an overview on early cosmic-ray work, published in German in the period from around 1910 to about 1940.

en physics.hist-ph, astro-ph.HE

arXiv Open Access 2012

On an Ethical Use of Neural Networks: A Case Study on a North Indian Raga

Ripunjai Kumar Shukla, Soubhik Chakraborty

The paper gives an artificial neural network (ANN) approach to time series modeling, the data being instance versus notes (characterized by pitch) depicting the structure of a North Indian raga, namely, Bageshree. Respecting the sentiments of the artists' community, the paper argues why it is more ethical to model a structure than try and "manufacture" an artist by training the neural network to copy performances of artists. Indian Classical Music centers on the ragas, where emotion and devotion are both important and neither can be substituted by such "calculated artistry" which the ANN generated copies are ultimately up to.

en cs.NE, cs.SD

CrossRef Open Access 2007

Rethinking Scandinavian verb movement

Anna-Lena Wiklund, Gunnar Hrafn Hrafnbjargarson, Kristine Bentzen et al.

54 citations en
