Results for "Biography"

Showing 20 of ~345,451 results · from CrossRef, Semantic Scholar, DOAJ, arXiv

DOAJ Open Access 2025
The Scholarly Pathfinder: Andō Hideharu’s “Weba kikō” (Max Weber: A Travelogue, 1972)

Wolfgang Schwentker

In 1969/70, Andō Hideharu (1921–1998), a Japanese historian of ideas, was a visiting professor at the Max Weber-Institute in Munich, Germany, for a period of one year. He was a harsh critic of Marianne Weber’s 1926 biography of her husband. During his tenure, he travelled to a number of places associated with Max Weber, with the aim of reconstructing his personal history. Andō literally followed Weber’s path from the cradle to the grave, though not necessarily in a chronological order. In a travelogue published in 1972, Andō recounted his experiences in Germany, Switzerland, Austria, and France, with a particular emphasis on interviews conducted with contemporaries of Max Weber who were still alive in 1969/70. Andō’s primary concern was in the personality of Max Weber, with a secondary focus on Weber’s work and the adaptation of Weber’s sociology for the study of Japanese modernity. The following article reconstructs Andō’s travel experiences in Europe by analysing his “Weber Travelogue”. It then discusses a bitter controversy that arose between Andō and certain colleagues in the context of Japanese Weber studies upon his return to Japan. Finally, the article assesses the merits and limitations of Andō’s “time travel” into Max Weber’s life through the lens of “Motivenforschung” (study of motives), a concept he drew from Weber’s methodological writings.

Sociology (General)
arXiv Open Access 2025
StorySage: Conversational Autobiography Writing Powered by a Multi-Agent Framework

Shayan Talaei, Meijin Li, Kanu Grover et al.

Every individual carries a unique and personal life story shaped by their memories and experiences. However, these memories are often scattered and difficult to organize into a coherent narrative, a challenge that defines the task of autobiography writing. Existing conversational writing assistants tend to rely on generic user interactions and pre-defined guidelines, making it difficult for these systems to capture personal memories and develop a complete biography over time. We introduce StorySage, a user-driven software system designed to meet the needs of a diverse group of users that supports a flexible conversation and a structured approach to autobiography writing. Powered by a multi-agent framework composed of an Interviewer, Session Scribe, Planner, Section Writer, and Session Coordinator, our system iteratively collects user memories, updates their autobiography, and plans for future conversations. In experimental simulations, StorySage demonstrates its ability to navigate multiple sessions and capture user memories across many conversations. User studies (N=28) highlight how StorySage maintains improved conversational flow, narrative completeness, and higher user satisfaction when compared to a baseline. In summary, StorySage contributes both a novel architecture for autobiography writing and insights into how multi-agent systems can enhance human-AI creative partnerships.

en cs.HC, cs.AI
arXiv Open Access 2025
When Life Paths Cross: Extracting Human Interactions in Time and Space from Wikipedia

Zhongyang Liu, Ying Zhang, Xiangyi Xiao et al.

Interactions among notable individuals -- whether examined individually, in groups, or as networks -- often convey significant messages across cultural, economic, political, scientific, and historical perspectives. By analyzing the times and locations of these interactions, we can observe how dynamics unfold across regions over time. However, relevant studies are often constrained by data scarcity, particularly concerning the availability of specific location and time information. To address this issue, we mine millions of biography pages from Wikipedia, extracting 685,966 interaction records in the form of (Person1, Person2, Time, Location) interaction quadruplets. The key elements of these interactions are often scattered throughout the heterogeneous crowd-sourced text and may be loosely or indirectly associated. We overcome this challenge by designing a model that integrates attention mechanisms, multi-task learning, and feature transfer methods, achieving an F1 score of 86.51%, which outperforms baseline models. We further conduct an empirical analysis of intra- and inter-party interactions among political figures to examine political polarization in the US, showcasing the potential of the extracted data from a perspective that may not be possible without this data. We make our code, the extracted interaction data, and the WikiInteraction dataset of 4,507 labeled interaction quadruplets publicly available.

en cs.SI, cs.CY
arXiv Open Access 2025
On Optimal Steering to Achieve Exact Fairness

Mohit Sharma, Amit Jayant Deshpande, Chiranjib Bhattacharyya et al.

To fix the 'bias in, bias out' problem in fair machine learning, it is important to steer feature distributions of data or internal representations of Large Language Models (LLMs) to ideal ones that guarantee group-fair outcomes. Previous work on fair generative models and representation steering could greatly benefit from provable fairness guarantees on the model output. We define a distribution as ideal if the minimizer of any cost-sensitive risk on it is guaranteed to have exact group-fair outcomes (e.g., demographic parity, equal opportunity); in other words, it has no fairness-utility trade-off. We formulate an optimization program for optimal steering by finding the nearest ideal distribution in KL-divergence, and provide efficient algorithms for it when the underlying distributions come from well-known parametric families (e.g., normal, log-normal). Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility (and sometimes even improve utility). We demonstrate affine steering of LLM representations to reduce bias in multi-class classification, e.g., occupation prediction from a short biography in the Bios dataset (De-Arteaga et al.). Furthermore, we steer internal representations of LLMs towards desired outputs so that it works equally well across different groups.

en cs.LG, cs.AI
DOAJ Open Access 2024
Un centenar între adevăruri și mistificări: Eugen Barbu la judecata de apoi a posterității

Alexandru Dumitriu

This study concerns the controversy surrounding the life and work of Eugen Barbu (1924–1993), a writer at the forefront of Romanian literature after World War II, whose birth centennial is this year. On this occasion, the “Eugen Barbu case” has once again been in the centre of attention, causing different reactions, in which the mistruth has overlapped the truth. In the first part, the study analyzes the accusations of plagiarism against one of Barbu’s benchmark novels. In 2016, the “România literară” journal published a file to let it be understood that the novel Groapa (1957) would have encapsulated the micro-novel Hotel Maidan (1935) signed by Stoian Gh. Tudor (1911–1941). Dissected point by point, the evidence on which the demonstration from “România literară” relies, which has been embraced – without a minimal verification – by respectable intellectual public figures on Eugen Barbu’s centennial, borders on the ridiculous. At the same time, the paper discusses, through documentation recourse, the plagiarism suspicions surrounding other works by the author of Groapa. The same perspective is employed for reviewing the scandal triggered by the third volume (1978) of the Incognito tetralogy (1976–1980), for which the author used, as he had specified in the first volume of the series, sequences from books of other foreign and Romanian writers. In its second part, the paper examines some of Eugen Barbu’s radical attitudes by recourse to important moments in his biography, using entries from both the writer’s original diaries and his wife’s diaries, actress Marga Barbu (1929–2009), documents in the collection of the author of this investigation. The research reveals significant nuances, so far overlooked in the discussion about the “Eugen Barbu case”.

Language and Literature
DOAJ Open Access 2024
Historical and biographical essay. Memories of life of Dovlet Pozadovich Achegu, a surgeon of a superior expert category (1919–1963)

S. T. Chamokova, E. V. Chamokov

The relevance of the research is due to the growing interest of researchers and readers in the experience of the Soviet period of history, through life paths, biographies, and the results of the activities of ordinary people. The problem of the research is to analyze sociocultural processes through sources of personal origin (documents, photos, memories, etc.). The purpose of the research is to use the biography of Achegu D.P. to understand social, cultural, professional and psychological aspects of the Soviet period. The historical and biographical method became the basis of the research methodology for examining the life and work of Dovlet Pozadovich Achegu (1919–1963), a surgeon of a superior expert category and the first head of the surgical department of the Adyghe Regional Hospital. The methods of scientific analysis and generalization were also used. The research results. The description and analysis of the life story of Dovlet Pozadovich allows us to generalize and understand social, cultural and psychological phenomena of a certain historical period. The fate of Achegu included the main milestones of the Soviet period. He experienced all the joys and hardships of difficult military-political and ideological processes. All significant historical events of those years run like a red thread through his life. Undoubtedly, Dovlet was not unique; he was formed as a highly qualified specialist and an interesting person under the influence of the system of Soviet culture, education, norms of behavior and attitude towards the profession. Key findings. The younger generation of doctors and medical students need to know such stories and biographies as an example of service to the profession and duty of a doctor. The life and work of Achegu D.P. can serve as an example for the formation of personal meanings and value guidelines for modern youth.
Such “microhistories”, which belong to the humanistic direction of the study of history, analyze the cultural, professional, and family values of people. They describe small elements that are important for knowledge and understanding of a certain era, which help answer big questions and build the continuity of generations.

Special aspects of education
arXiv Open Access 2024
Unfamiliar Finetuning Examples Control How Language Models Hallucinate

Katie Kang, Eric Wallace, Claire Tomlin et al.

Large language models are known to hallucinate when faced with unfamiliar queries, but the underlying mechanisms that govern how models hallucinate are not yet fully understood. In this work, we find that unfamiliar examples in the models' finetuning data -- those that introduce concepts beyond the base model's scope of knowledge -- are crucial in shaping these errors. In particular, we find that an LLM's hallucinated predictions tend to mirror the responses associated with its unfamiliar finetuning examples. This suggests that by modifying how unfamiliar finetuning examples are supervised, we can influence a model's responses to unfamiliar queries (e.g., say "I don't know"). We empirically validate this observation in a series of controlled experiments involving SFT, RL, and reward model finetuning on TriviaQA and MMLU. Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations. We find that, while hallucinations from the reward model can significantly undermine the effectiveness of RL factuality finetuning, strategically controlling how reward models hallucinate can minimize these negative effects. Leveraging our previous observations on controlling hallucinations, we propose an approach for learning more reliable reward models, and show that they improve the efficacy of RL factuality finetuning in long-form biography and book/movie plot generation tasks.

en cs.LG, cs.AI
arXiv Open Access 2024
Multi-group Uncertainty Quantification for Long-form Text Generation

Terrance Liu, Zhiwei Steven Wu

While past works have shown how uncertainty quantification can be applied to large language model (LLM) outputs, the question of whether resulting uncertainty guarantees still hold within sub-groupings of data remains open. In our work, given some long-form text generated by an LLM, we study uncertainty at both the level of individual claims contained within the output (via calibration) and across the entire output itself (via conformal prediction). Using biography generation as a testbed for this study, we derive a set of (demographic) attributes (e.g., whether some text describes a man or woman) for each generation to form such "subgroups" of data. We find that although canonical methods for both types of uncertainty quantification perform well when measuring across the entire dataset, such guarantees break down when examining particular subgroups. Having established this issue, we invoke group-conditional methods for uncertainty quantification -- multicalibration and multivalid conformal prediction -- and find that across a variety of approaches, additional subgroup information consistently improves calibration and conformal prediction within subgroups (while crucially retaining guarantees across the entire dataset). As the problems of calibration, conformal prediction, and their multi-group counterparts have not been extensively explored in the context of long-form text generation, we consider these results to form a benchmark for this setting.

en cs.CL, cs.AI
arXiv Open Access 2024
Jan Veth's paintings of Jacobus Kapteyn

Pieter C. van der Kruit

Jacobus C. Kapteyn is regarded as one of the coryphaei of the University of Groningen. Part of his legacy is two paintings of him by Dutch painter Jan Pieter Veth. One, showing him at his desk, decorates the Kapteyn Room in the Kapteyn Astronomical Institute, and the other one, displaying him in academic attire, is in the University's gallery of professors in the central Academy Building. The first was offered to the Kapteyns on the occasion of his 40th anniversary as professor in 1918 and the second to the University after his retirement in 1921. It has been suggested that there must have been a third portrait that now is lost. Former director Adriaan Blaauw has proposed that the one in the Academy Building actually was first offered in 1918, but at Mrs. Kapteyn's request replaced by the one now in the Kapteyn Room. The first version was then later adapted to the requirements of the gallery of professors by Veth himself by overpainting it with academic attire. A preliminary trial version by Veth, in the possession of Kapteyn's great-grandson, shows what it would have looked like before the adaptation. The following reports on new evidence: the biography of Jan Veth that historian Johan Huizinga, friend of Veth, wrote, and letters Veth wrote to his wife while he was working on these paintings. This provides strong support for Blaauw's sequence of events with a few modifications. No third painting has ever been produced.

en physics.hist-ph
arXiv Open Access 2024
Linguistic Calibration of Long-Form Generations

Neil Band, Xuechen Li, Tengyu Ma et al.

Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.

en cs.LG, cs.AI
arXiv Open Access 2024
CrossCheckGPT: Universal Hallucination Ranking for Multimodal Foundation Models

Guangzhi Sun, Potsawee Manakul, Adian Liusie et al.

Multimodal foundation models are prone to hallucination, generating outputs that either contradict the input or are not grounded by factual information. Given the diversity in architectures, training data and instruction tuning techniques, there can be large variations in systems' susceptibility to hallucinations. To assess system hallucination robustness, hallucination ranking approaches have been developed for specific tasks such as image captioning, question answering, summarization, or biography generation. However, these approaches typically compare model outputs to gold-standard references or labels, limiting hallucination benchmarking for new domains. This work proposes "CrossCheckGPT", a reference-free universal hallucination ranking for multimodal foundation models. The core idea of CrossCheckGPT is that the same hallucinated content is unlikely to be generated by different independent systems, hence cross-system consistency can provide meaningful and accurate hallucination assessment scores. CrossCheckGPT can be applied to any model or task, provided that the information consistency between outputs can be measured through an appropriate distance metric. Focusing on multimodal large language models that generate text, we explore two information consistency measures: CrossCheck-explicit and CrossCheck-implicit. We showcase the applicability of our method for hallucination ranking across various modalities, namely the text, image, and audio-visual domains. Further, we propose the first audio-visual hallucination benchmark, "AVHalluBench", and illustrate the effectiveness of CrossCheckGPT, achieving correlations of 98% and 89% with human judgements on MHaluBench and AVHalluBench, respectively.

en cs.CL
DOAJ Open Access 2023
Gender Gaps in Mode Usage, Vehicle Ownership, and Spatial Mobility When Entering Parenthood: A Life Course Perspective

Hung-Chia Yang, Ling Jin, Alina Lazar et al.

Entry into parenthood is a major disruptive event to travel behavior, and gender gaps in mobility choices are often widened during parenthood. The exact timing of gender gap formation and their long-term effects on different subpopulations are less studied in the literature. Leveraging a longitudinal dataset from the 2018 WholeTraveler Study, this paper examines the effects of parenthood on a diverse set of short- to long-term outcomes related to the three hierarchical domains of mobility biography: mode choice, vehicle ownership, spatial mobility, and career decisions. The progress of the effects is evaluated over a sequential set of parenting stages and differentiated across three subpopulations. We find that individuals classified as “Have-it-alls”, who start their careers, partner up, and have children concurrently and early, significantly increase their car use two years prior to childbirth (“nesting period”), and they then relocate to less transit-accessible areas and consequently reduce their reliance on public transportation while they have children in the household. In contrast, individuals categorized as “Couples”, who start careers and partnerships early but delay parenthood, and “Singles”, who postpone partnership and parenthood, have less pronounced changes in travel behavior throughout the parenting stages. The cohort-level effects are found to be driven primarily by women, whose career development is on average more negatively impacted by parenting events than men's, regardless of their life course trajectory. Early career decisions made by women upon entering parenthood contribute to gender gaps in mid- to longer-term mobility decisions, signifying the importance of early intervention.

Systems engineering, Technology (General)
DOAJ Open Access 2023
Navigating Second Generation Memory and Auto/biography in Home Video. A Video Collection of Hojda Stojka, Son of Artist and Survivor of the Porajmos Ceija Stojka

Renée Winter

This paper investigates home videos made by Hojda Stojka, the son of Ceija Stojka (1933-2013), artist and survivor of Auschwitz, Ravensbrück, and Bergen-Belsen. Based on a close reading of significant video sequences, a narrative interview with the videographer, and Ceija Stojka’s publications and films, it analyses how the auto/biographical videos relate to the persecution of the parents’ generation. The paper focuses on the importance of spaces like kitchens, cars and stages, on the value assigned to auto/biographical audiovisual recordings, and the recontextualization and integration of photographs and television recordings into the family memory.

History of Austria. Liechtenstein. Hungary. Czechoslovakia
arXiv Open Access 2023
Are fairness metric scores enough to assess discrimination biases in machine learning?

Fanny Jourdan, Laurent Risser, Jean-Michel Loubes et al.

This paper presents novel experiments shedding light on the shortcomings of current metrics for assessing biases of gender discrimination made by machine learning algorithms on textual data. We focus on the Bios dataset, and our learning task is to predict the occupation of individuals, based on their biography. Such prediction tasks are common in commercial Natural Language Processing (NLP) applications such as automatic job recommendations. We address an important limitation of theoretical discussions dealing with group-wise fairness metrics: they focus on large datasets, although the norm in many industrial NLP applications is to use small to reasonably large linguistic datasets for which the main practical constraint is to get a good prediction accuracy. We then question how reliable are different popular measures of bias when the size of the training set is simply sufficient to learn reasonably accurate predictions. Our experiments sample the Bios dataset and learn more than 200 models on different sample sizes. This allows us to statistically study our results and to confirm that common gender bias indices provide diverging and sometimes unreliable results when applied to relatively small training and test samples. This highlights the crucial importance of variance calculations for providing sound results in this field.

en cs.CL, cs.CY
arXiv Open Access 2023
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

Zeyuan Allen-Zhu, Yuanzhi Li

Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, do they answer such questions based on exposure to similar questions during training (i.e., cheating), or by genuinely learning to extract knowledge from sources like Wikipedia? In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. Essentially, for knowledge to be reliably extracted, it must be sufficiently augmented (e.g., through paraphrasing, sentence shuffling, translations) during pretraining. Without such augmentation, knowledge may be memorized but not extractable, leading to 0% accuracy, regardless of subsequent instruction fine-tuning. To understand why this occurs, we employ (nearly) linear probing to demonstrate a strong connection between the observed correlation and how the model internally encodes knowledge -- whether it is linearly encoded in the hidden embeddings of entity names or distributed across other token embeddings in the training text. This paper provides several key recommendations for LLM pretraining in the industry: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.

en cs.CL, cs.AI
DOAJ Open Access 2022
How "Madame Bovary" is written (Flaubert and his narrator) Part 1

Vasily Tolmatchoff

In a paper, written in polemics with mythology of the Flaubertian studies of the 20th century, particularly of the Soviet origin (the so-called realism or new realism, objectivism, impersonality, rationality of Flaubertian manner in “Madame Bovary”, identification of the author and his skills with the narrator, etc.), a figure of the narrator in this novel is discussed in full detail for the first time. This figure, a concrete person and a creator of narrative, is corresponded with Flaubert on various levels (biographical, gender, psychological, rationally creative, intertextual, subconscious) as well as with the characters (including the narrator on his own, his self-reflexion, psychological and psychic complexes, aims, methods and stylistics of narration). Antifeminine dimension of the narrative. “We” of the narrative is interpreted as a controversial sum of “I”, “non-I”, “other I-ies”, shadow projections of “I”, theatrical metamorphoses of “I”. In context of a special position of the narrator bourgeoisness, literatureness of the social consciousness, romanticism, palimpsest of the narrative, the double ending, and also the motives of story-telling and justification of creative efforts are analyzed. Symbolically the main events of the narrator’s world are death and positioning of himself as an inventor of fictions (“the lies”), a highly personal narrative the roots of which are in his school childhood, his mania of additional details and of endless improvement of his text. The narrator as a madman and an author of the madman’s diaries. The characters of the novel as artists. A study of the poetics of repetitions, mirror scenes, colours (blue, red, green), erotics, nature, historical details permits V. M. Tolmatchoff to introduce a rather new interpretation of Flaubert’s work as belonging to romanticism (partially baroque, partially classical).
The paper reconstructs chronology of events, age of the characters and poses a problem of Flaubert as inventor, of a meaning of non-correspondence in his novel between purely fictional time-space and strict historical details.

Philology. Linguistics, Literature (General)
DOAJ Open Access 2022
Petr Vasilievich Polezhaev: lawyer, philanthropist, writer

M.I. Rodnov

Background. The most important problem of Russian historiography is the reconstruction of authentic biographical data of famous public figures of the Russian Empire. The purpose of the work is to study the administrative (legal), charitable and literary activities of the Actual Privy Councilor Petr Vasilievich Polezhaev, who lived in Penza, Ufa and Saint Petersburg. Materials and methods. The implementation of the project became possible on the basis of a comprehensive involvement of a wide range of sources, their comparison and critical analysis, which made it possible to superimpose genealogical information on information from archives and periodicals. This made it possible to understand the reasons and logic behind the career of P.V. Polezhaev. Results. Until now, it has not been possible to create a scientific biography of one of the most popular writers of the turn of the 19th–20th centuries, primarily because of the peculiarities of the service of P.V. Polezhaev, information on which is scattered in various sources. The main, most fruitful period of activity of the lawyer P.V. Polezhaev fell on the time the family lived in Ufa (1858–1874). But almost the entire literary career took place in Saint Petersburg. Reconstruction of the overall picture made it possible to see the continuity and mutual conditioning of all stages of Polezhaev’s administrative and creative activities. Conclusions. A native of Penza province, Petr Vasilievich Polezhaev received a good education and began his career as a lawyer in his homeland. Then the family left for Ufa, where almost all children were born and raised, and Polezhaev made a successful career, holding the positions of provincial prosecutor and head of the highest judicial body of the Ufa province – the Ufa Chamber of Criminal and Civil Court. The broad public interests of Polezhaev led him to the leadership of the oldest charitable structure in Ufa – the Ufa Trustee Committee for the Poor.
Polezhaev's charitable service caused a wide resonance, he successfully implemented several projects, and in the memory of the inhabitants of Ufa, his name still remains as the creator of the Polezhaev shelter for poor students of the local male gymnasium. Having left Ufa due to problems in the service, Polezhaev in Saint Petersburg unexpectedly takes a great interest in literary creativity, his historical novels withstand three editions, disperse throughout the country. The artistic heritage of P.V. Polezhaev influenced subsequent generations of novelists, including modern cinema.

History (General)
DOAJ Open Access 2022
Character Education: Comparison Analysis Between The Thinking of Ki Hajar Dewantara and Abdullah Nasih Ulwan

Kasmuri Selamat, Adripen Adripen, Jamaluddin Jamaluddin

This research was inspired by the fact that, of the three domains (cognitive, affective, and psychomotor), the affective domain has not become a priority or a serious concern in the teaching and learning process in schools. The purpose of this research was to describe the biography, work, and character education of Ki Hajar Dewantara and Abdullah Nasih Ulwan. This research was library research. The data was collected by finding some relevant sources. The collected data was analyzed by comparing the sources concerning the three issues. In conclusion, both figures had different biographies, work, and character education, but they also shared the same way of thinking concerning character education.

Education (General)
arXiv Open Access 2022
Guidelines and a Corpus for Extracting Biographical Events

Marco Antonio Stranisci, Enrico Mensa, Ousmane Diakite et al.

Although biographies are widespread within the Semantic Web, resources and approaches to automatically extract biographical events are limited. Such limitation reduces the amount of structured, machine-readable biographical information, especially about people belonging to underrepresented groups. Our work challenges this limitation by providing a set of guidelines for the semantic annotation of life events. The guidelines are designed to be interoperable with existing ISO standards for semantic annotation: ISO-TimeML (ISO-24617-1), and SemAF (ISO-24617-4). Guidelines were tested through an annotation task of Wikipedia biographies of underrepresented writers, namely authors born in non-Western countries, migrants, or belonging to ethnic minorities. 1,000 sentences were annotated by 4 annotators with an average Inter-Annotator Agreement of 0.825. The resulting corpus was mapped onto OntoNotes. Such mapping allowed us to expand our corpus, showing that already existing resources may be exploited for the biographical event extraction task.

en cs.CL, cs.AI
arXiv Open Access 2022
IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

Aman Kumar, Himani Shrotriya, Prachi Sahu et al.

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our dataset for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models are publicly available at https://ai4bharat.iitm.ac.in/indicnlg-suite.

en cs.CL, cs.AI
