In this work, we introduce a German version of ColBERT, a late interaction multi-dense vector retrieval method, with a focus on RAG applications. We also present the main features of our package for ColBERT models, which supports both retrieval and fine-tuning workflows.
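As an illustration of the late interaction scoring that ColBERT-style retrievers are generally described as using, the sketch below sums, for each query token vector, its maximum cosine similarity over all document token vectors (often called MaxSim). It is a minimal sketch in plain Python and not the package's API: the function names `normalize` and `maxsim_score` and the toy vectors are assumptions introduced here.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: sum over query tokens of the best cosine match
    among the document tokens. Both inputs are non-empty lists of vectors."""
    q_unit = [normalize(v) for v in query_vecs]
    d_unit = [normalize(v) for v in doc_vecs]
    score = 0.0
    for q in q_unit:
        # Best-matching document token for this query token.
        score += max(sum(a * b for a, b in zip(q, d)) for d in d_unit)
    return score


# Toy token embeddings (hypothetical values, two dimensions for readability).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]  # one token aligned with each query token
doc_b = [[1.0, 1.0]]              # a single token partially aligned with both

# Rank documents by descending late-interaction score.
ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
```

Because each query token is matched independently, a document can score well by covering different query tokens with different passages, which is the property that separates this scheme from single-vector dense retrieval.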
Lukas Birkenmaier, Laureen Sieber, Felix Bergstein
This paper presents a novel dataset of public broadcast interviews featuring high-ranking German politicians. The interviews were sourced from YouTube, transcribed, processed for speaker identification, and stored in a tidy and open format. The dataset comprises 99 interviews with 33 different German politicians across five major interview formats, containing a total of 28,146 sentences. As the first of its kind, this dataset offers valuable opportunities for research on various aspects of political communication in (German) political contexts, such as agenda-setting, interviewer dynamics, or politicians' self-presentation.
Starting from the governing equations for geophysical flows, by means of a thin-shell approximation and a tangent plane approximation, we derive the equations describing, at leading order, the nonlinear ice-drift flow for regions centered around the North Pole. An exact solution is derived in the material/Lagrangian formalism, describing a superposition of oscillations, a mean Ekman flow and a geostrophic current.
One of the characteristic features of Old Norse is the loss of unstressed preverbs and hence a great reduction in the number of compound verbs inherited from Proto-Germanic. Yet interpretations that assume the existence of verb-particle constructions are often invoked by runologists suggesting the development of a new type of verbal formation in early North Germanic. The appearance of both phrasal and compound verbs in early runic inscriptions may represent evidence of a language where both object-verb and verb-object phenomena were common. This paper reassesses the early runic evidence for phrasal and compound verbs, particularly in light of a 1999 find from the Nydam bog and recent syntactic scholarship.
The radio-quiet quasar SDSS J1240+1455 lies at a redshift of z=3.11, is surrounded by a Ly-alpha blob (LAB), and is absorbed by a proximate damped Ly-alpha system. In order to better define the morphology of the blob and determine its emission mechanism, we gathered deep narrow-band images isolating the Ly-alpha line of this object in linearly polarized light. We provide a deep intensity image of the blob, showing a filamentary structure extending up to 16'' (or ~122 physical kpc) in diameter. No significant polarization signal could be extracted from the data, but 95% probability upper limits were defined through simulations. They vary between ~3% in the central 0.75'' disk (after subtraction of the unpolarized quasar continuum) and ~10% in the 3.8-5.5'' annulus. The low polarization suggests that the Ly-alpha photons are emitted mostly in situ, by recombination and de-excitation in a gas largely ionized by the quasar ultraviolet light, rather than by a central source and scattered subsequently by neutral hydrogen gas. This blob shows no detectable polarization signal, contrary to LAB1, a brighter and more extended blob that is not related to the nearby active galactic nucleus (AGN) in any obvious way, and where a significant polarization signal of about 18% was detected.
A measles outbreak occurs when the number of cases of measles in the population exceeds the typical level. Outbreaks that are not detected and managed early can increase mortality and morbidity and incur costs from activities responding to these events. The number of measles cases in the Province of North Cotabato, Philippines, was used in this study. Weekly reported cases of measles from January 2016 to December 2021 were provided by the Epidemiology and Surveillance Unit of the North Cotabato Provincial Health Office. Several integer-valued autoregressive (INAR) time series models were used to explore the possibility of detecting and identifying measles outbreaks in the province along with the classical ARIMA model. These models were evaluated based on goodness of fit, measles outbreak detection accuracy, and timeliness. The results of this study confirmed that INAR models have the conceptual advantage over ARIMA since the latter produces non-integer forecasts, which are not realistic for count data such as measles cases. Among the INAR models, the ZINGINAR (1) model was recommended for having a good model fit and timely and accurate detection of outbreaks. Furthermore, policymakers and decision-makers from relevant government agencies can use the ZINGINAR (1) model to improve disease surveillance and implement preventive measures against contagious diseases beforehand.
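To illustrate why INAR models yield integer-valued paths, the following sketch simulates a basic Poisson INAR(1) process, X_t = α∘X_{t-1} + ε_t, where ∘ denotes binomial thinning and ε_t is a Poisson innovation. This is the textbook Poisson variant, chosen here for brevity; it is not the ZINGINAR(1) model recommended in the study, and the parameter values and function names are assumptions.

```python
import math
import random


def binomial_thinning(count, alpha, rng):
    """Each of the `count` previous cases 'survives' independently with prob. alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_inar1(alpha, lam, n_steps, seed=0, x0=0):
    """Simulate X_t = alpha o X_{t-1} + eps_t with eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    series, prev = [], x0
    for _ in range(n_steps):
        prev = binomial_thinning(prev, alpha, rng) + poisson_draw(lam, rng)
        series.append(prev)
    return series


# Hypothetical parameters: stationary mean is lam / (1 - alpha) = 4 cases per week.
weekly_counts = simulate_inar1(alpha=0.5, lam=2.0, n_steps=200, seed=1)
```

Every simulated value is a non-negative integer by construction, whereas an ARIMA forecast of the same series would generally be real-valued.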
The advancement of natural language processing has paved the way for automated scoring systems in various languages, such as German (e.g., German BERT [G-BERT]). Automatically scoring written responses to science questions in German is a complex task and challenging for standard G-BERT, as it lacks contextual knowledge in the science domain and may be unaligned with student writing styles. This paper presents a contextualized German Science Education BERT (G-SciEdBERT), an innovative large language model tailored for scoring German-written responses to science tasks and beyond. Using G-BERT, we pre-trained G-SciEdBERT on a corpus of 30K German-written science responses with 3M tokens from the Programme for International Student Assessment (PISA) 2018. We fine-tuned G-SciEdBERT on an additional 20K student-written responses with 2M tokens and examined the scoring accuracy. We then compared its scoring performance with G-BERT. Our findings revealed a substantial improvement in scoring accuracy with G-SciEdBERT, demonstrating a 10.2% increase of quadratic weighted Kappa compared to G-BERT (mean difference = 0.1026, SD = 0.069). These insights underline the significance of specialized language models like G-SciEdBERT, which is trained to enhance the accuracy of contextualized automated scoring, offering a substantial contribution to the field of AI in education.
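For readers unfamiliar with the agreement metric named above, the sketch below computes quadratic weighted Kappa between two lists of integer scores. It follows the standard definition (one minus the ratio of weighted observed to weighted chance-expected disagreement); the function name, the assumption that labels run from 0 to `num_labels - 1`, and the toy scores are introduced here for illustration and do not come from the paper.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_labels):
    """Quadratic weighted Kappa for integer labels in 0..num_labels-1."""
    if num_labels < 2:
        raise ValueError("num_labels must be at least 2")
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must provide equally many (and some) scores")

    n = len(rater_a)
    # Observed confusion matrix: rows are rater A, columns are rater B.
    observed = [[0] * num_labels for _ in range(num_labels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0.0:
        return 1.0  # both raters used a single identical label throughout
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical human and model scores on a three-point scale.
human = [0, 1, 2, 2, 1, 0]
model = [0, 1, 2, 1, 1, 0]
kappa = quadratic_weighted_kappa(human, model, num_labels=3)
```

The quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which suits ordinal scoring rubrics.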
This paper investigates the properties of nominal phrases and demonstratives used as verbal anaphora in Norwegian, Danish, English, and Scots-English, e.g. English Can John make good curry? – That he can; Norwegian Anja ligger godt an, det samme gjør Madelène lit. ‘Anja is in a good position, Madelène does the same [thing]’. Following Lødrup (Proceedings of NELS 24, 1994), Houser et al. (Proceedings of WECOL 34, 2007), Bentzen et al. (J Comp Ger Linguist 16:91–125, 2013), these anaphoric expressions are argued to be surface anaphora and to conceal elided vPs. Contrary to previous analyses, the nominal phrases are argued to themselves be contributing meaning beyond the vPs they conceal; they are argued to be overt background arguments for an ellipsis-licensing head with semantics similar to Rooth’s ∼ operator (Nat Lang Semant 1(1):75–116, 1992). The paper also explores cross-linguistic variation in the discourse/antecedence conditions on such anaphora, and their fronting behavior. In Danish and (general) English, such anaphora must generally topicalize, whereas in Norwegian and Scots-English, they can more freely appear in situ (in post-auxiliary position). Developing Mikkelsen’s (J Linguist 51(3):595–643, 2015) analysis of Danish det, this behavior is encoded as a feature [uTop] which must be checked; Norwegian is argued to have more possibilities to check this feature in situ than Danish, while in Scots-English, that is argued to be a propositional anaphor, lacking the relevant feature.
This paper examines definiteness marking in American Norwegian (AmNo), a heritage variety of Norwegian spoken in the US. The description adds another language to the much-studied variation within Scandinavian nominal phrases. It builds on established syntactic analysis of Scandinavian and investigates aspects that are (un)like Norwegian spoken in the homeland. A central finding is that the core syntax of Norwegian noun phrases is retained in AmNo, while the morphophonological spell-out is sometimes different. Indefinite determiners, for example, are obligatory in AmNo, but some speakers produce them with non-homeland-like gender agreement. One systematic change is observed: double definiteness has been partially lost. The typical AmNo modified definite phrase lacks the prenominal determiner that is obligatory for varieties in Norway. I argue that this is a syntactic change which allows the realization of D to be optional. This is a pattern not found in the other Scandinavian languages. At the same time, this innovative structure in AmNo is not like English, the dominant language of the AmNo speakers. This demonstrates heritage language change that is distinct from both the homeland language and the dominant language.
Statistical modeling of high-dimensional matrix-valued data motivates the use of a low-rank representation that simultaneously summarizes key characteristics of the data and enables dimension reduction. Low-rank representations commonly factor the original data into the product of orthonormal basis functions and weights, where each basis function represents an independent feature of the data. However, the basis functions in these factorizations are typically computed using algorithmic methods that cannot quantify uncertainty or account for basis function correlation structure a priori. While there exist Bayesian methods that allow for a common correlation structure across basis functions, empirical examples motivate the need for basis function-specific dependence structure. We propose a prior distribution for orthonormal matrices that can explicitly model basis function-specific structure. The prior is used within a general probabilistic model for singular value decomposition to conduct posterior inference on the basis functions while accounting for measurement error and fixed effects. We discuss how the prior specification can be used for various scenarios and demonstrate favorable model properties through synthetic data examples. Finally, we apply our method to two-meter air temperature data from the Pacific Northwest, enhancing our understanding of the Earth system's internal variability.
In this paper, we apply transformer-based Natural Language Generation (NLG) techniques to the problem of text simplification. Currently, there are only a few German datasets available for text simplification, even fewer with larger and aligned documents, and not a single one with narrative texts. We explore to what degree modern NLG techniques can be applied to German narrative text simplifications. We use Longformer attention and a pre-trained mBART model. Our findings indicate that the existing approaches for German are not able to solve the task properly. We conclude with a few directions for future research to address this problem.
The Next Generation Very Large Array (ngVLA) is a planned radio interferometer providing unprecedented sensitivity at wavelengths between 21 cm and 3 mm. Its 263 antenna element array will be spatially distributed across North America to enable both superb low surface brightness recovery and sub-milliarcsecond angular resolution imaging. The project was developed by the international astronomy community under the lead of the National Radio Astronomy Observatory (NRAO), and is anticipated to be built between 2027 and 2037. Two workshops were held in 2022 and 2023 with the goal of discussing and consolidating the scientific interests in the ngVLA within the German astronomical community. This community paper constitutes a collection of 48 science ideas which the German community aims to pursue with the ngVLA in the 2030s. This is not a complete list and the ideas are not developed at the level of a "Science Book", such that the present document is mainly meant to provide a basis for further discussion within the community. As such, additional contributions are welcome, and will be considered for inclusion in future revisions.
Mitchell L. Krock, Adam H. Monahan, Michael L. Stein
Current models for spatial extremes are concerned with the joint upper (or lower) tail of the distribution at two or more locations. Such models cannot account for teleconnection patterns of two-meter surface air temperature ($T_{2m}$) in North America, where very low temperatures in the contiguous United States (CONUS) may coincide with very high temperatures in Alaska in the wintertime. This dependence between warm and cold extremes motivates the need for a model with opposite-tail dependence in spatial extremes. This work develops a statistical modeling framework which has flexible behavior in all four pairings of high and low extremes at pairs of locations. In particular, we use a mixture of rotations of common Archimedean copulas to capture various combinations of four-corner tail dependence. We study teleconnected $T_{2m}$ extremes using ERA5 reanalysis of daily average two-meter temperature during the boreal winter. The estimated mixture model quantifies the strength of opposite-tail dependence between warm temperatures in Alaska and cold temperatures in the midlatitudes of North America, as well as the reverse pattern. These dependence patterns are shown to correspond to blocked and zonal patterns of mid-tropospheric flow. This analysis extends the classical notion of correlation-based teleconnections to considering dependence in higher quantiles.
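The idea of mixing rotated Archimedean copulas can be sketched with a single family. The example below uses the Clayton copula, whose density concentrates in the lower-lower corner, and reflects its arguments to move that concentration to each of the other three corners before mixing. The choice of Clayton, the shared parameter `theta`, the corner names, and the weights are assumptions made for illustration; the paper's actual families and estimation procedure are not reproduced here.

```python
def clayton_density(u, v, theta):
    """Clayton copula density on (0, 1)^2 for theta > 0 (lower-lower tail dependence)."""
    return (
        (1.0 + theta)
        * (u * v) ** (-1.0 - theta)
        * (u ** -theta + v ** -theta - 1.0) ** (-2.0 - 1.0 / theta)
    )


# Each rotation reflects one or both arguments so that the Clayton lower-lower
# concentration lands in the named corner of the unit square (first word refers
# to u, second to v).
ROTATIONS = {
    "lower-lower": lambda u, v: (u, v),
    "upper-upper": lambda u, v: (1.0 - u, 1.0 - v),
    "upper-lower": lambda u, v: (1.0 - u, v),
    "lower-upper": lambda u, v: (u, 1.0 - v),
}


def mixture_density(u, v, weights, theta):
    """Density of a weighted mixture of the four rotated Clayton copulas."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for corner, weight in weights.items():
        ru, rv = ROTATIONS[corner](u, v)
        total += weight * clayton_density(ru, rv, theta)
    return total


# Hypothetical weights emphasizing opposite-tail dependence, e.g. u = warm
# extremes at one site coinciding with v = cold extremes at another.
weights = {"lower-lower": 0.1, "upper-upper": 0.1, "upper-lower": 0.6, "lower-upper": 0.2}
opposite = mixture_density(0.95, 0.05, weights, theta=2.0)
same = mixture_density(0.95, 0.95, weights, theta=2.0)
```

With these weights, `opposite` exceeds `same`, reflecting more mass where one variable is high while the other is low.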
Joshua S. North, Christopher K. Wikle, Erin M. Schliep
Many real-world scientific processes are governed by complex nonlinear dynamic systems that can be represented by differential equations. Recently, there has been increased interest in learning, or discovering, the forms of the equations driving these complex nonlinear dynamic systems using data-driven approaches. In this paper we review the current literature on data-driven discovery for dynamic systems. We provide a categorization of the different approaches for data-driven discovery and a unified mathematical framework to show the relationship between the approaches. Importantly, we discuss the role of statistics in the data-driven discovery field, describe a possible approach by which the problem can be cast in a statistical framework, and provide avenues for future work.
This paper deals with two runic inscriptions that are highly relevant to language history. 1. The runic sequences on the three Weser rune bones, which date to the first half of the 5th century, are not entirely clear. However, West Germanic (Pre-Old Saxon) linguistic features such as gemination by j (kunni ‘kin, clan’) and loss of -a < *-az (hari ‘army’) are obvious. By far the most interesting linguistic form is the preterite dede ‘did’ that reflects PGmc. *-dai (cf. PNorse talgi-dai ‘carved’ on the Nøvling fibula) as opposed to PGmc. *-dǣ(d) (PNorse -da, OHG -ta etc.). Apparently, we are dealing with two distinct endings, *-dai deriving from a PIE middle in *-(t)ói̯. 2. The legend ska2nomodu (a2 = ᚪ) on a solidus of unknown provenance (ca. 600) renders the dithematic anthroponym Skānɔmōdə̣ (or *Skānɵmōdə̣), presumably the name of the moneyer. It seems that medial o stands for [ɔ] or [ɵ], an allophonic variant of the linking element /a/ before a labial consonant; parallels can be found in Old Germanic naming. Two linguistic features, viz. ā < WGmc. *au and the nominative ending -ə̣ < WGmc. -a < PGmc. *-az, indicate that the language of the inscription is Pre-Old Frisian.
The walrus ivory trade constituted a cornerstone of Norse Greenlandic society, its exchange providing goods essential for the isolated settlement's survival until the fifteenth century. The journey to the walrus hunting grounds at Disko Bay drew European settlers into encounters with their Indigenous neighbours and the more-than-human world of Greenland. As walrus tusks moved through the North Atlantic trade sphere, from Disko Bay to the Scandinavian Greenland settlements and on over the North Atlantic to Europe, they materialized a particular kind of Nordic prestige, served as gifts between traders and kings, and were carved into objects of devotion as Scandinavia converted to Christianity. Tracing the journey of tusks along these trade routes through the case study of Gunhild's cross, a devotional cross likely carved in Denmark in the early twelfth century, reveals both the depth and complexity of object memory and the vibrancy and scope of Nordic trade in the medieval period.
Pelin Dogan-Schönberger, Julian Mäder, Thomas Hofmann
Swiss German is a dialect continuum whose natively acquired dialects significantly differ from the formal variety of the language. These dialects are mostly used for verbal communication and do not have a standard orthography. This has led to a lack of annotated datasets, rendering the use of many NLP methods infeasible. In this paper, we introduce the first annotated parallel corpus of spoken Swiss German across 8 major dialects, plus a Standard German reference. Our goal has been to create and make available a basic dataset for employing data-driven NLP applications in Swiss German. We present our data collection procedure in detail and validate the quality of our corpus by conducting experiments with recent neural models for speech synthesis.
Julia Pritzen, Michael Gref, Dietlind Zühlke, et al.
Anglicisms are a challenge in German speech recognition. Due to their irregular pronunciation compared to native German words, automatically generated pronunciation dictionaries often include faulty phoneme sequences for Anglicisms. In this work, we propose a multitask sequence-to-sequence approach for grapheme-to-phoneme conversion to improve the phonetization of Anglicisms. We extended a grapheme-to-phoneme model with a classifier to distinguish Anglicisms from native German words. With this approach, the model learns to generate pronunciations differently depending on the classification result. We used our model to create supplementary Anglicism pronunciation dictionaries that are added to an existing German speech recognition model. Tested on a dedicated Anglicism evaluation set, we improved the recognition of Anglicisms compared to a baseline model, reducing the word error rate by 1 % and the Anglicism error rate by 3 %. We show that multitask learning can help solve the challenge of Anglicisms in German speech recognition.
S Sidhartha Narayan, Malavika Ranjan, Madhumitha Raghuraman
This paper attempts to analyse policymaking in the field of Intellectual Property (IP) as an instrument of economic growth across the Global North and South. It begins by studying the links between economic growth and IP, followed by an understanding of Intellectual Property Rights (IPR) development in the US, a leading proponent of robust IPR protection internationally. The next section compares IPR in the Global North and South and undertakes an analysis of the diverse factors that result in these differences. The paper uses the case study of the Indian Pharmaceutical Industry to understand how IPR may differentially affect economies and concludes that there may not yet be a one-size-fits-all policy for the adoption of Intellectual Property Rights.