Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.
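The evaluation metrics BenGER integrates are only named in the abstract; as a minimal illustration of the "lexical" family, a token-level F1 between a model answer and a gold reference can be computed as below. The function name and whitespace tokenization are illustrative assumptions, not BenGER's actual implementation:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a model answer and a gold reference
    (one plausible lexical metric; tokenization is a simplification)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # multiset intersection counts overlapping tokens
    common = Counter(pred) & Counter(ref)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Semantic, factual, and judge-based metrics would replace the token overlap with embedding similarity, claim checking, or an LLM judge, respectively, behind the same interface.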
This study investigates political discourse in the German parliament, the Bundestag, by analyzing approximately 28,000 parliamentary speeches from the last five years. Two machine learning models for topic and sentiment classification were developed and trained on a manually labeled dataset. The models showed strong classification performance, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 for topic classification (averaged across topics) and 0.89 for sentiment classification. Both models were applied to assess topic trends and sentiment distributions across political parties and over time. The analysis reveals notable relationships between parties and their roles in parliament. In particular, a change in style can be observed for parties moving from government to opposition. While ideological positions matter, governing responsibilities also shape discourse. The analysis directly addresses key questions about the evolution of topics, sentiment dynamics, and party-specific discourse strategies in the Bundestag.
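AUROC figures like those above can be reproduced from raw classifier scores with the rank-based (Mann-Whitney) formulation; the sketch below is a generic pure-Python illustration, not the authors' evaluation code:

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation,
    with average ranks assigned to tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # find the run of tied scores starting at position i
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For the multi-topic case reported above, this would be computed one-vs-rest per topic and then averaged.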
Lorenz Wendlinger, Christian Braun, Abdullah Al Zubaer
et al.
We show that current open-source foundational LLMs possess instruction-following capability and German legal background knowledge sufficient for some legal analysis in an educational context. However, model capability breaks down on very specific tasks, such as the classification of "Gutachtenstil" appraisal-style components, or with complex contexts, such as complete legal opinions. Even with extended context and effective prompting strategies, they cannot match the Bag-of-Words baseline. To address this, we introduce a Retrieval-Augmented-Generation-based prompt example selection method that substantially improves predictions in high-data-availability scenarios. We further evaluate pre-trained LLMs on two standard tasks, argument mining and automated essay scoring, and find their performance there to be more adequate. Throughout, pre-trained LLMs improve upon the baseline in scenarios with little or no labeled data, with Chain-of-Thought prompting helping further in the zero-shot case.
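Retrieval-augmented example selection can be sketched generically: retrieve the labeled examples most similar to the query and place them in the prompt as few-shot demonstrations. The bag-of-words cosine retriever below is an illustrative stand-in for whatever embedding model the paper actually uses:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query, pool, k=2):
    """Pick the k labeled (text, label) examples most similar to the
    query, to be used as in-context demonstrations."""
    q = Counter(query.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: cosine(q, Counter(ex[0].lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, examples):
    """Assemble retrieved examples and the query into a few-shot prompt."""
    shots = "\n".join(f"Text: {t}\nLabel: {y}" for t, y in examples)
    return f"{shots}\nText: {query}\nLabel:"
```

This explains why the method helps mainly in high-data scenarios: a larger labeled pool makes it more likely that a closely matching demonstration exists for any given query.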
Rafael Ortiz, Rogier A. Windhorst, Seth H. Cohen
et al.
The first public 0.9-4.4μm NIRCam images of the North Ecliptic Pole (NEP) Time Domain Field (TDF) uncovered galaxies displaying point-source features in their cores as seen in the longer-wavelength filters. We visually identified a sample of 66 galaxies (~1 galaxy per arcmin^2) with point-like cores and have modeled their two-dimensional light profiles with GalFit, identifying 16 galactic nuclei with measurable point-source components. GalFit suggests the visual sample is a mix of both compact stellar bulge and point-source galaxy cores. This core classification is complemented by spectral energy distribution (SED) modeling to infer the sample's active galactic nucleus (AGN) and host-galaxy parameters. For galaxies with measurable point-source components, the median fractional AGN contribution to their 0.1-30.0μm flux is 0.44, and 14/16 are color-classified AGN. We conclude that near-infrared point-source galaxy cores are signatures of AGN. In addition, we define automated sample-selection criteria to identify these point-source features. These criteria can be used in other extant and future NIRCam images to streamline the search for galaxies with unresolved IR-luminous AGN. The James Webb Space Telescope's superb angular resolution and sensitivity at infrared wavelengths are resurrecting the morphological identification of AGN.
Tropical Cyclones (TCs) have devastating effects on several coastal regions worldwide. Precautionary knowledge about TC characteristics such as wind direction, wind speed, epicenter position, condensed vapor pressure measure, and radius of maximum wind can be highly valuable in disaster management and economic planning. Existing literature has focused on TC wind direction, intensity, cloud shape, and epicenter position, but there has been limited research on estimation of the Radius of Maximum Wind (RMW). Accurate estimation of the RMW is crucial, as errors can significantly impact wind and storm surge assessments and forecasts. In this study, our objective is to determine the RMW over the North Indian Ocean (NIO). We chose this region due to its location surrounded by the Bay of Bengal and the Arabian Sea, making it one of the six globally prominent areas prone to TCs. Our study focuses on the relationship between the center of the TC, the estimated pressure drop at the center, and the RMW, using historical observations and mathematical correlations. To address missing parameters in the best track database of the Indian Meteorological Department, we employ a local regression model. We validate the accuracy of our developed method using two statistical measures: error percentage and the T-test. Numerous TC cases over the NIO are discussed in the paper. Our findings indicate that the suggested method exhibits an error percentage ranging from approximately $-63\%$ to $50\%$ when compared to the best track data provided by the Indian Meteorological Department (IMD). In contrast, the error percentages for two other references \cite{bib21, bib22} with the same best track data range from approximately $-26\%$ to $200\%$. Moreover, the T-test results demonstrate that our method outperforms the other approaches in terms of statistical significance.
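The two validation measures can be made concrete. The abstract does not give the exact formulas, so the sketch below assumes a signed percentage error of each RMW estimate against the best-track value and a one-sample t statistic on those errors:

```python
import math

def error_percentage(estimated: float, observed: float) -> float:
    """Signed percentage error of an RMW estimate vs. best-track data."""
    return 100.0 * (estimated - observed) / observed

def t_statistic(errors):
    """One-sample t statistic testing whether the mean error differs
    from zero (smaller |t| means less systematic bias)."""
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

Under this reading, a method "outperforms" another when its error distribution is both tighter in percentage terms and statistically indistinguishable from zero mean.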
In this paper, we design novel interactive deep learning methods to improve semantic interactions in visual analytics applications. The ability of semantic interaction to infer analysts' precise intents during sensemaking is dependent on the quality of the underlying data representation. We propose the $\text{DeepSI}_{\text{finetune}}$ framework that integrates deep learning into the human-in-the-loop interactive sensemaking pipeline, with two important properties. First, deep learning extracts meaningful representations from raw data, which improves semantic interaction inference. Second, semantic interactions are exploited to fine-tune the deep learning representations, which then further improves semantic interaction inference. This feedback loop between human interaction and deep learning enables efficient learning of user- and task-specific representations. To evaluate the advantage of embedding the deep learning within the semantic interaction loop, we compare $\text{DeepSI}_{\text{finetune}}$ against a state-of-the-art but more basic use of deep learning as only a feature extractor pre-processed outside of the interactive loop. Results of two complementary studies, a human-centered qualitative case study and an algorithm-centered simulation-based quantitative experiment, show that $\text{DeepSI}_{\text{finetune}}$ more accurately captures users' complex mental models with fewer interactions.
This is a study of the semantics of definiteness marking and of its applicability to the Old East Scandinavian linguistic scenario. Contrary to the Modern Continental Scandinavian languages, Old East Scandinavian varieties did not possess fully-fledged definite articles, although all three demonstrative systems (hinn, -inn, sá/þænn, and sjá/þænni) show some evidence of being used as markers of definiteness. A semantic analysis of these forms in extracts from the Scanian Law and Guta Lag reveals the differentiated intermediate stages along the cline of grammaticalisation these definiteness markers found themselves in during this linguistic period. This confirms, in turn, that despite not being employed as definite articles proper, hinn, -inn and sá/þænn show some degree of semantic bleaching.
Forecasts by the European Centre for Medium-Range Weather Forecasts (ECMWF; EC for short) can provide a basis for the establishment of maritime-disaster warning systems, but they contain some systematic biases. The fifth-generation EC atmospheric reanalysis (ERA5) data have high accuracy, but are delayed by about 5 days. To overcome this issue, a spatiotemporal deep-learning method could be used for nonlinear mapping between EC and ERA5 data, which would improve the quality of EC wind forecast data in real time. In this study, we developed the Multi-Task-Double Encoder Trajectory Gated Recurrent Unit (MT-DETrajGRU) model, which uses an improved double-encoder forecaster architecture to model the spatiotemporal sequence of the U and V components of the wind field; we designed a multi-task learning loss function to correct wind speed and wind direction simultaneously using only one model. The study area was the western North Pacific (WNP), and real-time rolling bias corrections were made for 10-day wind-field forecasts released by the EC between December 2020 and November 2021, divided into four seasons. Compared with the original EC forecasts, after correction using the MT-DETrajGRU model the wind speed and wind direction biases in the four seasons were reduced by 8-11% and 9-14%, respectively. In addition, the proposed method modelled the data uniformly under different weather conditions. The correction performance under normal and typhoon conditions was comparable, indicating that the data-driven mode constructed here is robust and generalizable.
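A multi-task loss that corrects speed and direction with one model can be sketched from the U and V wind components; the equal weighting and the angle-wrapping choice below are assumptions for illustration, not the paper's exact formulation:

```python
import math

def speed_dir(u: float, v: float):
    """Wind speed and direction (radians) from U and V components."""
    return math.hypot(u, v), math.atan2(v, u)

def multitask_loss(pred_uv, true_uv, w_speed=1.0, w_dir=1.0):
    """Weighted sum of speed error and wrapped direction error, so a
    single model is penalized on both quantities at once."""
    total = 0.0
    for (pu, pv), (tu, tv) in zip(pred_uv, true_uv):
        ps, pd = speed_dir(pu, pv)
        ts, td = speed_dir(tu, tv)
        # wrap the angular difference into [-pi, pi] before taking |.|
        d_err = abs(math.atan2(math.sin(pd - td), math.cos(pd - td)))
        total += w_speed * abs(ps - ts) + w_dir * d_err
    return total / len(pred_uv)
```

The wrapping step matters: without it, a prediction of 359° against a truth of 1° would be penalized as a 358° error instead of 2°.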
Nico Blokker, André Blessing, Erenay Dayanik
et al.
Newspaper reports provide a rich source of information on the unfolding of public debate on specific policy fields that can serve as a basis for inquiry in political science. Such debates are often triggered by critical events, which attract public attention and incite the reactions of political actors: crisis sparks the debate. However, due to the challenges of reliable annotation and modeling, few large-scale datasets with high-quality annotation are available. This paper introduces DebateNet2.0, which traces the political discourse on the European refugee crisis in the German quality newspaper taz during the year 2015. The core units of our annotation are political claims (requests for specific actions to be taken within the policy field) and the actors who make them (politicians, parties, etc.). The contribution of this paper is twofold. First, we document and release DebateNet2.0 along with its companion R package, mardyR, guiding the reader through the practical and conceptual issues related to the annotation of policy debates in newspapers. Second, we outline and apply a Discourse Network Analysis (DNA) to DebateNet2.0, comparing two crucial moments of the policy debate on the 'refugee crisis': the migration flux through the Mediterranean in April/May and the one along the Balkan route in September/October. Besides the released resources and the case study, our contribution is also methodological: we talk the reader through the steps from a newspaper article to a discourse network, demonstrating that there is not just one discourse network for the German migration debate, but multiple ones, depending on the topic of interest (political actors, policy fields, time spans).
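The core step of Discourse Network Analysis, projecting actor-claim annotations onto an actor-actor network, can be sketched as follows. The flat (actor, claim, stance) schema is a simplification of DebateNet's annotation, and the example data in the usage note are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def actor_network(annotations):
    """Project (actor, claim, stance) annotations onto an actor-actor
    network: two actors are linked by an edge weighted by the number of
    claims on which they take the same stance."""
    claims_by_actor = defaultdict(set)
    for actor, claim, stance in annotations:
        claims_by_actor[actor].add((claim, stance))
    edges = {}
    for a, b in combinations(sorted(claims_by_actor), 2):
        shared = len(claims_by_actor[a] & claims_by_actor[b])
        if shared:
            edges[(a, b)] = shared
    return edges
```

Filtering the input annotations by actor type, policy field, or time span before the projection yields the different discourse networks the paper describes for the same debate.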
The processes which regulate star formation within molecular clouds are still not well understood. Various star-formation scaling relations have been proposed to explain this issue by formulating a relation between the star-formation rate surface density ($\Sigma_{\rm SFR}$) and the underlying gas surface density ($\Sigma_{\rm gas}$). In this work, we test various star-formation scaling relations, such as the Kennicutt-Schmidt relation, the volumetric star-formation relation, the orbital time model, the crossing time model, and the multi free-fall time scale model, towards the North American and Pelican Nebulae complexes and in cold clumps associated with them. Measuring stellar mass from young stellar objects and gaseous mass from CO measurements, we estimated the mean $\Sigma_{\rm SFR}$, star formation rate per free-fall time, and star formation efficiency (SFE) for clumps to be 1.5 $\rm M_{\odot}~yr^{-1}~kpc^{-2}$, 0.009, and 2.0$\%$, respectively, while for the entire NAN complex the values are 0.6 $\rm M_{\odot}~yr^{-1}~kpc^{-2}$, 0.0003, and 1.6$\%$, respectively. For clumps, we notice that the observed properties are in line with the correlations obtained between $\Sigma_{\rm SFR}$ and $\Sigma_{\rm gas}$, and between $\Sigma_{\rm SFR}$ and $\Sigma_{\rm gas}$ per free-fall time and orbital time for Galactic clouds. At the same time, we do not observe any correlation with $\Sigma_{\rm gas}$ per crossing time or per multi free-fall time. Even though we see correlations in the former cases, all models agree with each other within a factor of 0.5 dex, and discriminating between these models is not possible given the current uncertainties in the input observables. We also test the variation of $\Sigma_{\rm SFR}$ versus the dense gas, but only a weak correlation is seen in our analysis, likely due to low statistics.
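The quoted quantities follow from standard bookkeeping; the sketch below shows the arithmetic under common definitions (an SFR from counting young stellar objects of a mean mass over a mean age, and SFE as stellar mass over total mass), with illustrative input values rather than the paper's measurements:

```python
def sigma_sfr(n_yso: int, mean_mass: float, age_myr: float, area_kpc2: float) -> float:
    """SFR surface density in Msun/yr/kpc^2: total YSO mass formed over
    the mean YSO age, spread over the region's area."""
    sfr = n_yso * mean_mass / (age_myr * 1e6)  # Msun per yr
    return sfr / area_kpc2

def sfe(m_star: float, m_gas: float) -> float:
    """Star formation efficiency: stellar mass over stars + gas."""
    return m_star / (m_star + m_gas)
```

With the same inputs, the SFR per free-fall time is the analogous ratio computed over a free-fall timescale instead of the YSO age, which is why the clump and complex values can differ by an order of magnitude while the SFEs stay similar.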
Severe local storm (SLS) activity is known to occur within specific thermodynamic and kinematic environments. These environments are commonly associated with key synoptic-scale features--including southerly Great Plains low-level jets, drylines, elevated mixed layers, and extratropical cyclones--that link the large-scale climate to SLS environments. This work analyzes spatiotemporal distributions of both the environmental parameters and synoptic-scale features in ERA5 reanalysis and in Community Atmosphere Model version 6 (CAM6) during 1980--2014 over North America. Compared to radiosondes, ERA5 successfully reproduces SLS environments, with strong spatiotemporal correlations and low biases, especially over the Great Plains. Both ERA5 and CAM6 reproduce the climatology of SLS environments over the central United States as well as its strong seasonal and diurnal cycles. ERA5 and CAM6 also reproduce the climatological occurrence of the synoptic-scale features, with the distribution pattern similar to that of SLS environments. Compared to ERA5, CAM6 exhibits a high bias in Convective Available Potential Energy over the eastern United States primarily due to a high bias in surface moisture, and to a lesser extent, storm-relative helicity due to enhanced low-level winds. Composite analysis indicates consistent synoptic anomaly patterns favorable for significant SLS environments over much of the eastern half of the United States in both ERA5 and CAM6, though the pattern differs for the southeastern United States. Overall, results indicate that both ERA5 and CAM6 are capable of reproducing SLS environments as well as the synoptic-scale features and transient events that generate them.
This paper explores the impact of the Covid-19 crisis at the level of individual information behavior, based on an online survey among 308 participants from the German-speaking countries Austria, Germany and Switzerland in April and May 2020. The results show first that the Covid-19 crisis has led to an increased demand for reliable information. This goes alongside a significantly increased use of public broadcasting, newspapers and information provided by public organizations. Second, the majority (84%) of the participants reported being satisfied with the information supply during the Covid-19 crisis. Participants who were less satisfied with the information supply used reliable sources significantly less frequently, specifically public television, national newspapers and information provided by public organizations. Third, the amount of Covid-19-related information led some participants to a feeling of information overload, which resulted in a reduction of information seeking and media use.
Vivien Macketanz, Eleftherios Avramidis, Aljoscha Burchardt
et al.
We present an analysis of 16 state-of-the-art MT systems on German-English based on a linguistically-motivated test suite. The test suite has been devised manually by a team of language professionals in order to cover a broad variety of linguistic phenomena that MT often fails to translate properly. It contains 5,000 test sentences covering 106 linguistic phenomena in 14 categories, with an increased focus on verb tenses, aspects and moods. The MT outputs are evaluated in a semi-automatic way through regular expressions that focus only on the part of the sentence that is relevant to each phenomenon. Through our analysis, we are able to compare systems based on their performance on these categories. Additionally, we reveal strengths and weaknesses of particular systems and we identify grammatical phenomena where the overall performance of MT is relatively low.
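The semi-automatic evaluation can be sketched as per-phenomenon regex checks against each system's output. The suite entry below is a hypothetical example constructed for illustration, not taken from the actual test suite:

```python
import re

# Hypothetical test-suite entry: a source sentence paired with patterns
# an acceptable translation must match ("accept") or must not ("reject").
SUITE = [
    {
        "phenomenon": "negation",
        "source": "Er kommt nicht.",
        "accept": [r"\bnot\b|n't\b"],  # the translation must be negated
        "reject": [],
    },
]

def score_output(entry: dict, hypothesis: str) -> bool:
    """Pass iff every 'accept' pattern matches the hypothesis and no
    'reject' pattern does, checking only the phenomenon-relevant part
    of the sentence rather than the full translation."""
    ok = all(re.search(p, hypothesis) for p in entry["accept"])
    bad = any(re.search(p, hypothesis) for p in entry["reject"])
    return ok and not bad
```

Aggregating pass rates per category then yields the per-phenomenon system comparison the abstract describes.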
Animation is ubiquitous in visualization systems, and a common technique for creating these animations is the transition. In the transition approach, animations are created by smoothly interpolating a visual attribute between a start and an end value, reaching the end value after a specified duration. This approach works well when each transition for an attribute is allowed to finish before the next is triggered, but performs poorly when a new transition is triggered before the current one has finished. In particular, interruptions introduce velocity discontinuities, and frequent interruptions can slow down the resulting animation. To solve these problems, we frame animation as a signal-processing problem. In our technique, animations are produced by transformations of signals, or functions over time. In particular, an animation is produced by transforming an input signal, a function from time to target attribute value, into an output signal, a function from time to displayed attribute value. We show that well-known signal-processing techniques can be applied to produce animations that are free from velocity discontinuities even when interrupted.
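The signal-processing view can be illustrated with the simplest smoothing filter, a trailing box average over the input signal: a step change in the target produces a ramp in the displayed value, and retriggering the input mid-transition never causes a jump in position. (A single box filter only guarantees position continuity; removing velocity discontinuities as well requires a smoother kernel, e.g. applying the box filter twice. This sketch is a generic illustration, not the paper's technique.)

```python
def smoothed(input_signal, t: float, window: float) -> float:
    """Displayed attribute value at time t: the average of the input
    (target) signal over the trailing window [t - window, t], i.e. the
    input convolved with a box kernel, integrated numerically."""
    n = 100  # numerical integration steps
    total = 0.0
    for i in range(n):
        total += input_signal(t - window * (i + 0.5) / n)
    return total / n

def step(t: float) -> float:
    """Input signal: the attribute target jumps from 0 to 1 at time 0."""
    return 1.0 if t >= 0 else 0.0
```

Because the output at any instant depends only on the recent history of the input, interrupting an animation just means the input signal changes again; no per-transition bookkeeping is needed.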
We find numerical and empirical evidence for dynamical, structural and topological phase transitions on the (German) Frankfurt Stock Exchange (FSE) in the temporal vicinity of the worldwide financial crash. Using the Minimal Spanning Tree (MST) technique, a particularly useful canonical tool of graph theory, two transitions in the topology of a complex network representing the FSE were found. The first transition is from a hierarchical scale-free MST representing the stock market before the recent worldwide financial crash to a superstar-like MST decorated by a scale-free hierarchy of trees representing the market's state for the period containing the crash. Subsequently, a transition is observed from this transient, (meta)stable state of the crash to a hierarchical scale-free MST decorated by several star-like trees after the worldwide financial crash. The phase transitions observed are analogous to the ones we obtained earlier for the Warsaw Stock Exchange and more pronounced than those found by Onnela-Chakraborti-Kaski-Kertész for the S&P 500 index in the vicinity of Black Monday (October 19, 1987) and in the vicinity of January 1, 1998. Our results provide an empirical foundation for a future theory of dynamical, structural and topological phase transitions on financial markets.
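Building an MST from stock correlations is a standard construction: convert each correlation coefficient into Mantegna's metric distance $d_{ij} = \sqrt{2(1 - \rho_{ij})}$ and run a spanning-tree algorithm. The sketch below uses Prim's algorithm and is a generic illustration, not the authors' code:

```python
import math

def mst_edges(corr):
    """Minimal spanning tree (Prim's algorithm) of a correlation matrix,
    using the metric distance d_ij = sqrt(2 * (1 - rho_ij)).
    Returns the tree as a list of (in-tree node, new node) edges."""
    n = len(corr)
    dist = [[math.sqrt(2 * (1 - corr[i][j])) for j in range(n)] for i in range(n)]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # cheapest edge crossing the cut between tree and non-tree nodes
        i, j = min(
            ((a, b) for a in in_tree for b in range(n) if b not in in_tree),
            key=lambda e: dist[e[0]][e[1]],
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges
```

The topological transitions the abstract describes then show up as changes in this tree's degree distribution over sliding time windows, e.g. one node absorbing most edges in the superstar-like crash state.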
Stephen Serjeant, Veronique Buat, Denis Burgarella
et al.
We propose a co-ordinated multi-observatory survey at the North Ecliptic Pole. This field is the natural extragalactic deep field location for most space observatories (e.g. containing the deepest Planck, WISE and eROSITA data), is in the continuous viewing zones for e.g. Herschel, HST, JWST, and is a natural high-visibility field for the L2 halo orbit of SPICA with deep and wide-field legacy surveys already planned. The field is also a likely deep survey location for the forthcoming Euclid mission. It is already a multi-wavelength legacy field in its own right (e.g. AKARI, LOFAR, SCUBA-2): the outstanding and unparalleled continuous mid-IR photometric coverage in this field and nowhere else enables a wide range of galaxy evolution diagnostics unachievable in any other survey field, by spanning the wavelengths of redshifted PAH and silicate features and the peak energy output of AGN hot dust. We argue from the science needs of Euclid and JWST, and from the comparative multiwavelength depths, that the logical approach is (1) a deep (H-UDF) UV/optical tile in the NEP over ~10 square arcminutes, and (2) an overlapping wide-field UV/optical HST survey tier covering >100 square arcminutes, with co-ordinated submm SPIRE mapping up to or beyond the submm point source confusion limit over a wider area and PACS data over the shallower HST tier.