Results for "Newspapers"

Showing 20 of ~225678 results · from arXiv, DOAJ, CrossRef, Semantic Scholar

arXiv Open Access 2025
Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models

Jonathan Bourne

Oscar Wilde said, "The difference between literature and journalism is that journalism is unreadable, and literature is not read." Unfortunately, the digitally archived journalism of Oscar Wilde's 19th century often has poor-quality Optical Character Recognition (OCR) or none at all, reducing the accessibility of these archives and making them unreadable both figuratively and literally. This paper helps address the issue by performing OCR on "The Nineteenth Century Serials Edition" (NCSE), an 84k-page collection of 19th-century English newspapers and periodicals, using Pixtral 12B, a pre-trained image-to-text language model. The OCR capability of Pixtral was compared to 4 other OCR approaches, achieving a median character error rate of 1%, 5x lower than the next best model. The resulting NCSE v2.0 dataset features improved article identification, high-quality OCR, and text classified into four types and seventeen topics. The dataset contains 1.4 million entries and 321 million words. Example use cases demonstrate analysis of topic similarity, readability, and event tracking. NCSE v2.0 is freely available to encourage historical and sociological research. As a result, 21st-century readers can now share Oscar Wilde's disappointment with 19th-century journalistic standards, reading the unreadable from the comfort of their own computers.

en cs.CL, cs.DL
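
Note on the metric in the entry above: a character error rate (CER) is conventionally the character-level edit distance between an OCR output and a reference transcription, divided by the length of the reference. The sketch below illustrates that conventional definition only. It is an assumption that the paper computes CER this way, and the function names and sample strings are hypothetical, not taken from the NCSE dataset.

```python
from statistics import median


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


def median_cer(pairs: list[tuple[str, str]]) -> float:
    """Median CER over (reference, OCR output) pairs."""
    return median(character_error_rate(ref, hyp) for ref, hyp in pairs)


if __name__ == "__main__":
    # Hypothetical page snippets: (ground truth, OCR output).
    sample = [
        ("The Northern Star", "The Northern Star"),
        ("and Leeds General", "and Lecds General"),
        ("Advertiser", "Advertizer"),
    ]
    print(f"median CER: {median_cer(sample):.3f}")
```

Reporting the median rather than the mean keeps a handful of badly degraded pages from dominating the summary figure.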
arXiv Open Access 2025
Automating Historical Insight Extraction from Large-Scale Newspaper Archives via Neural Topic Modeling

Keerthana Murugaraj, Salima Lamsiyah, Marten During et al.

Extracting coherent and human-understandable themes from large collections of unstructured historical newspaper archives presents significant challenges due to topic evolution, Optical Character Recognition (OCR) noise, and the sheer volume of text. Traditional topic-modeling methods, such as Latent Dirichlet Allocation (LDA), often fall short in capturing the complexity and dynamic nature of discourse in historical texts. To address these limitations, we employ BERTopic. This neural topic-modeling approach leverages transformer-based embeddings to extract and classify topics and, despite its growing popularity, remains underused in historical research. Our study focuses on articles published between 1955 and 2018, specifically examining discourse on nuclear power and nuclear safety. We analyze various topic distributions across the corpus and trace their temporal evolution to uncover long-term trends and shifts in public discourse. This enables us to more accurately explore patterns in public discourse, including the co-occurrence of themes related to nuclear power and nuclear weapons and their shifts in topic importance over time. Our study demonstrates the scalability and contextual sensitivity of BERTopic as an alternative to traditional approaches, offering richer insights into historical discourses extracted from newspaper archives. These findings contribute to historical, nuclear, and social-science research while reflecting on current limitations and proposing potential directions for future work.

en cs.CL, cs.AI
DOAJ Open Access 2025
Η μεσοπολεμική Θεσσαλονίκη, μια πόλη σε αναβρασμό: Από τη Μεγάλη πλατεία του Νίκου Μπακόλα στον Ήλιο με ξιφολόγχες του Γιώργου Σκαμπαρδώνη

Βενετία Αποστολίδου

Interwar Thessaloniki has already been the subject of literary representation, mainly by Nikos Bakolas and George Ioannou, in the context of their general relationship with the city and its imprint on their prose. Skabardonis's latest book, Sun with Bayonets (2023), based on thorough research in the newspapers of the period and, of course, in historical books, focuses exclusively on interwar Thessaloniki and, indeed, on a specific point in time, the first half of 1931. The paper approaches this book as a development, evolution and deepening of the image of the interwar city, making the necessary connections with the author's earlier works.

Language and Literature
DOAJ Open Access 2025
Atitudinea ministrului afacerilor străine, Alexandru Marghiloman, relatată de presa românească și internațională în cazul complotului împotriva regelui Carol I și asasinatele de tip terorist care au avut loc în București, în anul 1900

Daniel Silviu NICULAE

The assassinations that took place in Bucharest in mid-1900, at the initiative of the Supreme Macedonian Committee based in Sofia, made a strong impression on Romanian public opinion, whose primary source of information was the newspapers, which reported the unfolding events in detail. Strong emotion, however, was stirred by news reports that the same Bulgarian anarchist organization had planned and organized an attack against King Carol I. In this context, Romanian-Bulgarian bilateral relations, already tense because of the dispute over the Danube islets (ostrovs) that formed or disappeared as a result of alluvial deposits and water flow, thereby altering the thalweg and implicitly the river border between Romania and Bulgaria, deteriorated so badly that by the end of 1900 an armed confrontation seemed impossible to avoid.

Literature (General)
arXiv Open Access 2024
Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction

Laura Manrique-Gómez, Tony Montes, Arturo Rodríguez-Herrera et al.

This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.

en cs.CL, cs.DL
arXiv Open Access 2024
Quantifying Memorization and Detecting Training Data of Pre-trained Language Models using Japanese Newspaper

Shotaro Ishihara, Hiromu Takahashi

Dominant pre-trained language models (PLMs) have demonstrated the potential risk of memorizing and outputting the training data. While this concern has been discussed mainly in English, it is also practically important to focus on domain-specific PLMs. In this study, we pre-trained domain-specific GPT-2 models using a limited corpus of Japanese newspaper articles and evaluated their behavior. Experiments replicated, for Japanese, the empirical finding from previous English studies that memorization in PLMs is related to duplication in the training data, model size, and prompt length. Furthermore, we attempted membership inference attacks, demonstrating that the training data can be detected even in Japanese, which is the same trend as in English. The study warns that domain-specific PLMs, sometimes trained with valuable private data, can "copy and paste" on a large scale.

en cs.CL
arXiv Open Access 2024
Detecting Sexism in German Online Newspaper Comments with Open-Source Text Embeddings (Team GDA, GermEval2024 Shared Task 1: GerMS-Detect, Subtasks 1 and 2, Closed Track)

Florian Bremm, Patrick Gustav Blaneck, Tobias Bornheim et al.

Sexism in online media comments is a pervasive challenge that often manifests subtly, complicating moderation efforts as interpretations of what constitutes sexism can vary among individuals. We study monolingual and multilingual open-source text embeddings to reliably detect sexism and misogyny in German-language online comments from an Austrian newspaper. We observed classifiers trained on text embeddings to mimic closely the individual judgements of human annotators. Our method showed robust performance in the GermEval 2024 GerMS-Detect Subtask 1 challenge, achieving an average macro F1 score of 0.597 (4th place, as reported on Codabench). It also accurately predicted the distribution of human annotations in GerMS-Detect Subtask 2, with an average Jensen-Shannon distance of 0.301 (2nd place). The computational efficiency of our approach suggests potential for scalable applications across various languages and linguistic contexts.

en cs.CL
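
Note on the metric in the entry above: the Jensen-Shannon distance used to score Subtask 2 is conventionally the square root of the Jensen-Shannon divergence between a predicted and an observed label distribution. The sketch below shows that standard definition with a base-2 logarithm, which bounds the distance between 0 and 1. The logarithm base, the function names, and the example distributions are assumptions for illustration, not details taken from the shared task.

```python
import math


def _normalise(weights: list[float]) -> list[float]:
    """Scale non-negative weights (e.g. annotator vote counts) to sum to 1."""
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    return [w / total for w in weights]


def _kl_divergence(p: list[float], q: list[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p: list[float], q: list[float]) -> float:
    """Square root of the Jensen-Shannon divergence (base 2, range 0 to 1)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    p, q = _normalise(p), _normalise(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    divergence = 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


if __name__ == "__main__":
    # Hypothetical example: share of annotators per label vs. a model's prediction.
    annotators = [0.6, 0.2, 0.1, 0.1, 0.0]
    predicted = [0.5, 0.3, 0.1, 0.05, 0.05]
    print(f"JS distance: {js_distance(annotators, predicted):.3f}")
```

Because the mixture distribution is non-zero wherever either input is, the measure stays finite even when a model assigns zero probability to a label that some annotators chose.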
DOAJ Open Access 2024
Cambiar algo para que nada cambie. La reforma decorativa de las elecciones primarias obligatorias en Perú

Ignacio Santoro

Objective/context: In 2019, promoted by President Martín Vizcarra, the Congress of the Republic of Peru sanctioned a law on mandatory primary elections for voters and political parties. However, the regulation was not implemented, given that it was suspended by the Legislative Branch for the general elections in 2021 and the regional elections in 2022. Methodology: This paper studies the reform process of mandatory, simultaneous, and open primaries (PASO, for its acronym in Spanish) in Peru and its counter-reform based on its suspension on two occasions (2021 and 2022). Twenty-three key actors were interviewed for this research, including congresspersons, ministers, electoral officials, and researchers, in addition to the revision of secondary sources and newspapers of legislative sessions. Conclusions: The paper concludes that PASO were a “decorative” reform, i.e., a law passed without the intention of implementing it. Its sanction was the product of a reformist coalition between civil society and the President of the nation, where the objectives of increasing popular support for President Vizcarra and the need to improve the legitimacy of the political system converged. The reform was resisted by congresspersons who formed two counter-reformist coalitions that suspended the PASO on two occasions. Originality: The research shows that these counter-reform coalitions are intrinsically linked to the atomization and personalization of the Peruvian party system, characterized by hyper-personalism and a lack of party structures. In addition, it generates theoretical contributions to studying the processes of reforms and counter-reforms in the context of institutional weakness.

International relations, Political science (General)
DOAJ Open Access 2024
A Generalised Additive Model and Deep Learning Method for Cross-Validating the North Atlantic Oscillation Index

Md Wahiduzzaman, Alea Yeasmin

This study introduces an innovative analytical methodology for examining the interconnections among the atmosphere, ocean, and society. The primary area of interest pertains to the North Atlantic Oscillation (NAO), a notable phenomenon characterised by daily to decadal fluctuations in atmospheric conditions over the Northern Hemisphere. The NAO has a prominent impact on winter weather patterns in North America, Europe, and to some extent, Asia. This impact has significant ramifications for civilization, as well as for marine, freshwater, and terrestrial ecosystems, and food chains. Accurate predictions of the surface NAO hold significant importance for society in terms of energy consumption planning and adaptation to severe winter conditions, such as winter wind and snowstorms, which can result in property damage and disruptions to transportation networks. Moreover, it is crucial to improve climate forecasts in order to bolster the resilience of food systems. This would enable producers to quickly respond to expected changes and make the required modifications, such as adjusting their food output or expanding their product range, in order to reduce potential hazards. The forecast centres prioritise and actively research the predictability and variability of the NAO. Nevertheless, it is increasingly evident that conventional analytical methods and prediction models that rely solely on scientific methodologies are inadequate in comprehensively addressing the transdisciplinary dimension of NAO variability. This includes a comprehensive view of research, forecasting, and social ramifications. This study introduces a new framework that combines sophisticated Big Data analytic techniques and forecasting tools using a generalised additive model to investigate the fluctuations of the NAO and the interplay between the ocean and atmosphere. 
Additionally, it explores innovative approaches to analyzing the socio-economic response associated with these phenomena using text mining tools, specifically modern deep learning techniques. The analysis is conducted on an extensive corpus of free-text information sourced from media outlets, public companies, government reports, and newspapers. Overall, the results show that the NAO index is reproduced well by the Deep-NAO model, with a correlation coefficient of 0.74.

Meteorology. Climatology
arXiv Open Access 2023
Aspect-Driven Structuring of Historical Dutch Newspaper Archives

Hermann Kroll, Christin Katharina Kreutz, Mirjam Cuper et al.

Digital libraries oftentimes provide access to historical newspaper archives via keyword-based search. Historical figures and their roles are particularly interesting cognitive access points in historical research. Structuring and clustering news articles would allow more sophisticated access for users to explore such information. However, real-world limitations such as the lack of training data, licensing restrictions and non-English text with OCR errors make the composition of such a system difficult and cost-intensive in practice. In this work we tackle these issues with the showcase of the National Library of the Netherlands by introducing a role-based interface that structures news articles on historical persons. In-depth, component-wise evaluations and interviews with domain experts highlighted our prototype's effectiveness and appropriateness for a real-world digital library collection.

en cs.DL
arXiv Open Access 2023
Newspaper Story Problems and Other Tasks for Context Based Physics Education: A Research Based report on classroom practice

Jochen Kuhn, Andreas Müller, Patrik Vogt

Background: Context Based Science Education (CBSE) has a long-standing tradition and is discussed as a highly promising approach in science education. It is supposed that CBSE can foster pupils' engagement and learning. However, classroom implementations of CBSE based on solid empirical evidence are surprisingly scarce. Purpose: The present research-based report of practice seeks to bridge this theory-practice gap for some specific forms of CBSE. We examine the use of science problems based on newspaper articles and the real-life contexts they provide (newspaper story problems, NSP). Design and Methods: While the research background has been reported elsewhere, the main objective of the present contribution is to provide a detailed account of the practical aspects of the approach. Two concrete, curriculum-relevant classroom teaching experiments based on newspaper story problems are reported, combined with a quasi-experimental study comparing NSPs against conventional textbook problems. The implementation of the teaching-learning sequence in classroom practice is described in detail. Additionally, task types similar to NSPs, but using other ways of contextualisation (e.g. advertisements), are discussed as an outlook. Results: A considerable improvement in motivation was found, which proved stable at least in the medium term. Learning also turned out to be fostered to a sizeable extent, including the educationally important issue of transfer. Conclusions: The use of NSPs as a form of CBSE can have large positive, robust, and sustainable effects on both motivation and learning. Being flexible and practical to implement, they thus appear highly suited to classroom application. As an outlook, a series of similar forms of tasks implementing CBSE is presented, such as those based on experimental or aesthetic contexts.

en physics.ed-ph
arXiv Open Access 2023
Multilingual Event Extraction from Historical Newspaper Adverts

Nadav Borenstein, Natalia da Silva Perez, Isabelle Augenstein

NLP methods can aid historians in analyzing textual materials in greater volumes than manually feasible. Developing such methods poses substantial challenges though. First, acquiring large, annotated historical datasets is difficult, as only domain experts can reliably label them. Second, most available off-the-shelf NLP models are trained on modern language texts, rendering them significantly less effective when applied to historical corpora. This is particularly problematic for less well studied tasks, and for languages other than English. This paper addresses these challenges while focusing on the under-explored task of event extraction from a novel domain of historical texts. We introduce a new multilingual dataset in English, French, and Dutch composed of newspaper ads from the early modern colonial period reporting on enslaved people who liberated themselves from enslavement. We find that: 1) even with scarce annotated data, it is possible to achieve surprisingly good results by formulating the problem as an extractive QA task and leveraging existing datasets and models for modern languages; and 2) cross-lingual low-resource learning for historical languages is highly challenging, and machine translation of the historical datasets to the considered target languages is, in practice, often the best-performing solution.

en cs.CL, cs.LG
arXiv Open Access 2023
Topic Modelling of Swedish Newspaper Articles about Coronavirus: a Case Study using Latent Dirichlet Allocation Method

Bernadeta Griciūtė, Lifeng Han, Goran Nenadic

Topic Modelling (TM) is a research branch of natural language understanding (NLU) and natural language processing (NLP) that facilitates insightful analysis of large documents and datasets, such as summarising the main topics and tracking topic changes. This kind of discovery is becoming more popular in real-life applications due to its impact on big data analytics. In this study, from the social-media and healthcare domain, we apply popular Latent Dirichlet Allocation (LDA) methods to model the topic changes in Swedish newspaper articles about Coronavirus. We describe the corpus we created, comprising 6515 articles, the methods applied, and statistics on topic changes over a period of approximately one year and two months, from 17th January 2020 to 13th March 2021. We hope this work can be an asset for grounding applications of topic modelling and can inspire similar case studies in an era of pandemics, to support socio-economic impact research as well as clinical and healthcare analytics. Our data and source code are openly available at https://github.com/poethan/Swed_Covid_TM. Keywords: Latent Dirichlet Allocation (LDA); Topic Modelling; Coronavirus; Pandemics; Natural Language Understanding; BERT-topic

en cs.CL, cs.SI
DOAJ Open Access 2023
Modèle et contre-modèles olympiques dans l’entre-deux-guerres.

Karen Bretin-Maffiuletti, Benoît Caritey

This article deals with the history of the international workers’ games ("International Workers’ Olympiads" and "Spartakiads"), set up during the interwar period by the Socialist Workers’ Sport International and the Red Sport International in order to counter the growing hegemony of the Olympic Games. These workers’ sporting events were an undeniable success, despite the hostility of the authorities and the reluctance of the main newspapers (specialized in sport or not) to inform their readers about them. The international workers’ sport federations first used these games to promote an alternative sport culture, focused on mass practice and international fraternity. Then, these workers’ organizations became defenders of an Olympic ideal that they considered to have gone astray. The evolution of both the criticism of the Olympic Games and the celebration of the workers’ events in the socialist and communist press is linked to this shift.

History (General) and history of Europe
arXiv Open Access 2022
Optical character recognition quality affects perceived usefulness of historical newspaper clippings

Kimmo Kettunen, Heikki Keskustalo, Sanna Kumpulainen et al.

Introduction. We study the effect of optical character recognition of differing quality on interactive information retrieval, using a collection of one digitized historical Finnish newspaper. Method. This study is based on the simulated interactive information retrieval work task model. Thirty-two users searched an article collection of the Finnish newspaper Uusi Suometar 1869-1918 with ca. 1.45 million automatically segmented articles. Our article search database had two versions of each article with optical character recognition of different quality. Each user performed six pre-formulated and six self-formulated short queries and subjectively evaluated the top-10 results using a graded relevance scale of 0-3, without knowing about the optical character recognition quality differences of the otherwise identical articles. Analysis. Analysis of the user evaluations was performed by comparing mean averages of evaluation scores in user sessions. Differences in query results were detected by analysing the lengths of returned articles in pre-formulated and self-formulated queries and the number of different documents retrieved overall in these two sessions. Results. The main result of the study is that improved optical character recognition quality affects perceived usefulness of historical newspaper articles positively. Conclusions. We were able to show that improvement in the optical character recognition quality of documents leads to higher mean relevance evaluation scores of query results in our historical newspaper collection. To the best of our knowledge, this simulated interactive user task is the first one showing empirically that users' subjective relevance assessments are affected by a change in the quality of optically read text.

en cs.CL
arXiv Open Access 2022
Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Nicolas Gutehrlé, Iana Atanassova

Background. In recent years, libraries and archives led important digitisation campaigns that opened the access to vast collections of historical documents. While such documents are often available as XML ALTO documents, they lack information about their logical structure. In this paper, we address the problem of Logical Layout Analysis applied to historical documents in French. We propose a rule-based method, that we evaluate and compare with two Machine-Learning models, namely RIPPER and Gradient Boosting. Our data set contains French newspapers, periodicals and magazines, published in the first half of the twentieth century in the Franche-Comté Region. Results. Our rule-based system outperforms the two other models in nearly all evaluations. It has especially better Recall results, indicating that our system covers more types of every logical label than the other two models. When comparing RIPPER with Gradient Boosting, we can observe that Gradient Boosting has better Precision scores but RIPPER has better Recall scores. Conclusions. The evaluation shows that our system outperforms the two Machine Learning models, and provides significantly higher Recall. It also confirms that our system can be used to produce annotated data sets that are large enough to envisage Machine Learning or Deep Learning approaches for the task of Logical Layout Analysis. Combining rules and Machine Learning models into hybrid systems could potentially provide even better performances. Furthermore, as the layout in historical documents evolves rapidly, one possible solution to overcome this problem would be to apply Rule Learning algorithms to bootstrap rule sets adapted to different publication periods.

arXiv Open Access 2022
OCR quality affects perceived usefulness of historical newspaper clippings -- a user study

Kimmo Kettunen, Heikki Keskustalo, Sanna Kumpulainen et al.

Effects of Optical Character Recognition (OCR) quality on historical information retrieval have so far been studied in data-oriented scenarios regarding the effectiveness of retrieval results. Such studies have either focused on the effects of artificially degraded OCR quality (see, e.g., [1-2]) or utilized test collections containing texts based on authentic low-quality OCR data (see, e.g., [3]). In this paper the effects of OCR quality are studied in a user-oriented information retrieval setting. Thirty-two users subjectively evaluated the query results of six topics each (out of 30 topics) based on pre-formulated queries using a simulated work task setting. To the best of our knowledge, our simulated work task experiment is the first one showing empirically that users' subjective relevance assessments of retrieved documents are affected by a change in the quality of optically read text. Users of historical newspaper collections have so far commented on the effects of OCR data quality mainly in impressionistic ways, and controlled user environments for studying the effects of OCR quality on users' relevance assessments of retrieval results have so far been missing. To remedy this, the National Library of Finland (NLF) set up an experimental query environment for the contents of one Finnish historical newspaper, Uusi Suometar 1869-1918, to be able to compare users' evaluations of search results of two different OCR qualities for digitized newspaper articles. The query interface presented the same underlying document to the user in one of two versions, based either on the lower OCR quality or on the higher OCR quality, and the choice was randomized. The users did not know about quality differences in the article texts they evaluated. The main result of the study is that improved optical character recognition quality affects perceived usefulness of historical newspaper articles significantly. 
The mean average evaluation score for the improved OCR results was 7.94% higher than the mean average evaluation score of the old OCR results.

en cs.IR, cs.CL
