Results for "Encyclopedias"

S2 Open Access 2020

Nicola De Cao, Gautier Izacard, Sebastian Riedel et al.

Entities are at the center of how we represent and aggregate knowledge. For instance, encyclopedias such as Wikipedia are structured by entities (e.g., one per article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. One way to understand current approaches is as classifiers among atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity information such as descriptions. This approach leads to several shortcomings: i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions between the two; ii) a large memory footprint is needed to store dense representations when considering large entity sets; iii) an appropriately hard set of negative data has to be subsampled at training time. We propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token-by-token in an autoregressive fashion, and conditioned on the context. This mitigates the aforementioned technical issues: i) the autoregressive formulation allows us to directly capture relations between context and entity name, effectively cross encoding both; ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; iii) the exact softmax loss can be efficiently computed without the need to subsample negative data. We show the efficacy of the approach with more than 20 datasets on entity disambiguation, end-to-end entity linking and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their unambiguous name.

525 citations en Computer Science, Mathematics
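The abstract above says entities are retrieved by generating their names token by token, but it does not spell out the decoding mechanics. The following is a minimal sketch of one common way to restrict generation to valid names, a prefix trie, assuming names are already tokenized. The helper names (`build_trie`, `allowed_next`, `greedy_constrained_decode`), the `<eos>` marker, and the toy `score` function are hypothetical, and greedy decoding is swapped in here for simplicity; it is not claimed to be the authors' implementation.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-name marker


def build_trie(names: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix trie over tokenized entity names."""
    root: Dict = {}
    for tokens in names:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that may follow `prefix` while staying on a valid name."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []
    return sorted(node.keys())


def greedy_constrained_decode(
    trie: Dict, score: Callable[[List[str], str], float]
) -> List[str]:
    """Pick the best-scoring allowed token at each step until END is chosen."""
    prefix: List[str] = []
    while True:
        options = allowed_next(trie, prefix)
        if not options:
            raise ValueError("empty trie: no entity names to generate")
        best = max(options, key=lambda tok: score(prefix, tok))
        if best == END:
            return prefix
        prefix.append(best)


# Toy usage: a stand-in scorer replaces the model's next-token probabilities.
names = [["New", "York"], ["New", "York", "City"], ["Newark"]]
trie = build_trie(names)
toy_scores = {"New": 2.0, "York": 1.0, "City": 3.0}
decoded = greedy_constrained_decode(trie, lambda prefix, tok: toy_scores.get(tok, 0.0))
```

Because every step only considers tokens that continue some stored name, the output is always a complete entity name, and adding a new entity amounts to inserting its name into the trie.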

S2 Open Access 2023

From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models

Shangbin Feng, Chan Young Park, Yuhan Liu et al.

Language models (LMs) are pretrained on diverse data sources—news, discussion forums, books, online encyclopedias. A significant portion of this data includes facts and opinions which, on one hand, celebrate democracy and diversity of ideas, and on the other hand are inherently socially biased. Our work develops new methods to (1) measure media biases in LMs trained on such corpora, along social and economic axes, and (2) measure the fairness of downstream NLP models trained on top of politically biased LMs. We focus on hate speech and misinformation detection, aiming to empirically quantify the effects of political (social, economic) biases in pretraining data on the fairness of high-stakes social-oriented tasks. Our findings reveal that pretrained LMs do have political leanings which reinforce the polarization present in pretraining corpora, propagating social biases into hate speech predictions and media biases into misinformation detectors. We discuss the implications of our findings for NLP research and propose future directions to mitigate unfairness.

351 citations en Computer Science

arXiv Open Access 2026

Is Grokipedia Right-Leaning? Comparing Political Framing in Wikipedia and Grokipedia on Controversial Topics

Philipp Eibl, Erica Coppolillo, Simone Mungari et al.

Online encyclopedias are central to contemporary information infrastructures and have become focal points of debates over ideological bias. Wikipedia, in particular, has long been accused of left-leaning bias, while Grokipedia, an AI-generated encyclopedia launched by xAI, has been framed as a right-leaning alternative. This paper presents a comparative analysis of Wikipedia and Grokipedia on well-established politically contested topics. Specifically, we examine differences in semantic framing, political orientation, and content prioritization. We find that semantic similarity between the two platforms decays across article sections and diverges more strongly on controversial topics than on randomly sampled ones. Additionally, we show that both encyclopedias predominantly exhibit left-leaning framings, although Grokipedia exhibits a more bimodal distribution with increased prominence of right-leaning content. The experimental code is publicly available.

en cs.IR, cs.AI

arXiv Open Access 2026

Wikipedia and Grokipedia: A Comparison of Human and Generative Encyclopedias

Ortal Hadad, Edoardo Loru, Jacopo Nudo et al.

We present a comparative analysis of Wikipedia and Grokipedia to examine how generative mediation alters content selection, textual rewriting, narrative structure, and evaluative framing in encyclopedic content. We model page inclusion in Grokipedia as a function of Wikipedia page popularity, density of reference, and recent editorial activity. Inclusion is non-uniform: pages with higher visibility and greater editorial conflict in Wikipedia are more likely to appear in Grokipedia. For included pages, we distinguish between verbatim reproduction and generative rewriting. Rewriting is more frequent for pages with higher reference density and recent controversy, while highly popular pages are more often reproduced without modification. We compare editing activity across the two platforms and estimate page complexity using a fitness-complexity framework to assess whether generative mediation alters patterns of editorial participation. To assess narrative organization, we construct actor-relation networks from article texts using abstract meaning representation. Across multiple topical domains, including U.S. politics, geopolitics, and conspiracy-related narratives, narrative structure remains largely consistent between the two sources. Analysis of lead sections shows broadly correlated framing, with localized shifts in laudatory and conflict-oriented language for some topics in Grokipedia. Overall, generative systems preserve the main structural organization of encyclopedic content, while affecting how content is selected, rewritten, and framed.

en cs.CY, cs.SI

CrossRef Open Access 2025

Dictionaries and Encyclopedias: Relationship

Keith Allan

en

DOAJ Open Access 2025

The spatial representation of the Croatian Encyclopedia of Technology: from idea to fruition

Jasmina Tolj Smolčić

Highly informative content, objectivity, credibility, and the organisation of knowledge are just some of the characteristics that make encyclopedias and online encyclopedic projects reliable works that provide users with efficient access to information related to their area of interest. As valuable sources for acquiring new knowledge, it is essential that they keep pace with today’s websites not only in terms of quality, but also in presentation. One form of presentation is spatial representation, which should be further explored in the encyclopedic context. This practice is already present in various projects, and such a new way of presenting online encyclopedic content would allow for more effective access to information and improved navigation when retrieving content. In encyclopedistics, spatial indicators, as the basis of spatial representation, are not in systematic use, and their full searching and browsing options are not fully enabled. Associating spatial indicators with encyclopedic content would make the creation of searchable and browsable virtual cartographic representations of encyclopedic knowledge possible. To this end, the content of the Croatian Technical Encyclopedia was used for a qualitative analysis of its articles, through which the types of data with spatial attributes typical for each article category were determined, followed by the identification of which types of data should be recorded for each category. The research established that article categories are indeed the key to creating a future model for spatial tagging and representation of encyclopedic knowledge. Based on the creation of unified spatial indicators (geotags) for individual article categories, a set of metadata was proposed, and a model for the spatial representation of encyclopedic knowledge was developed. Based on this model, the Atlas of Croatian Technology Heritage was created as a spatial representation of the knowledge found in the Croatian Encyclopedia of Technology.

Lexicography

DOAJ Open Access 2025

Explanation of Avicenna’s Personality and Works from the Point of View of His Unknown Student, Abu Muḥammad ʿAbdullāh al-ʾAzdī

Moallemi Moallemi, Morteza Darabinia, Hossein Jalahi

Avicenna is well known around the world as a famous physician and philosopher. After the publication of the book Al-Mā’a by Abu Muḥammad al-ʾAzdī (d. 456 AH), it became clear that Abu Muḥammad al-ʾAzdī was one of Avicenna’s students. In his book, ʾAzdī mentions Avicenna nearly a hundred times as his teacher in medicine. This study examines Avicenna from ʾAzdī’s point of view. This research employs a content analysis method with a historical-descriptive and analytical approach, utilizing library resources for data collection. Data collection focuses primarily on the book Al-Mā’a (three volumes). The study was carried out in multiple stages: finding keywords; searching the text of Kitab al-Mā’a, historical books, and Islamic encyclopedias as supplementary sources; searching Persian and Latin electronic databases; and categorizing, sorting, and analyzing the content. There are numerous quotations from Avicenna in ʾAzdī’s Al-Mā’a. The book yields new data that deserve more attention. ʾAzdī interpreted the medical approach of his master. His work contains unique medical insights attributed to Avicenna that do not appear in Avicenna’s extant works, such as the Canon of Medicine. By migrating to Andalusia and staying in Balansīya (now Valencia), ʾAzdī introduced Avicenna, his Canon of Medicine, and his other works there.

Medicine, History of medicine. Medical expeditions

DOAJ Open Access 2025

Comparative Evaluation of AI-based Systems for Tinnitus

Abdulaziz Yalınkılıç, Mehmet Zeki Erdem

INTRODUCTION: Today, with the development of technology, the variety of information sources has increased. It is now possible to access information obtained from encyclopedias in seconds with a few clicks of a button. Rapid developments in artificial intelligence (AI) and the widespread use of large language models (LLMs) such as ChatGPT, Gemini, and Perplexity have revolutionized access to medical information. However, the accuracy and readability of the answers provided by these models are critical, especially in the healthcare domain. This study evaluates the performance of ChatGPT, Gemini, and Perplexity in addressing frequently asked questions about tinnitus, a common symptom in otolaryngology practice. METHODS: Twenty frequently asked questions about tinnitus were posed to the models and their responses were evaluated by two otolaryngologists using global quality (GQS) and Likert scales for accuracy and reliability and the Gunning-Fog Index (GFI) for readability. RESULTS: The findings reveal no significant difference in the reliability and quality of information between the models (median: 5.00, p=0.159), but it was observed that Gemini came out ahead in readability (p=0.001) and ChatGPT in accuracy (p=0.007). However, Perplexity lagged in both metrics. These results highlight the varying strengths and weaknesses of LLMs, emphasizing the importance of model selection based on user needs. For example, ChatGPT is ideal for complex medical information, while Gemini is more accessible to wider audiences. DISCUSSION AND CONCLUSION: This study demonstrates the potential of AI-enabled systems in healthcare; however, we suggest that future improvements should increase both accuracy and accessibility.

Medicine
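The study above measures readability with the Gunning-Fog Index. As a rough illustration, the standard formula is 0.4 × (words per sentence + 100 × share of complex words), where complex words have three or more syllables. The sketch below assumes a crude vowel-group syllable counter and simple regex tokenization; both are simplifications of my own, not the procedure the authors used, and dedicated readability tools will give somewhat different values.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (a rough heuristic, not a dictionary lookup)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def gunning_fog(text: str) -> float:
    """Gunning-Fog Index: 0.4 * (words/sentences + 100 * complex_words/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    # Complex words are those with three or more (estimated) syllables.
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


# Two short sentences with no complex words give a low score.
simple_score = gunning_fog("The cat sat. The dog ran.")
```

Lower values indicate text that is easier to read, which is the sense in which one model can "come out ahead" on readability.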

arXiv Open Access 2025

The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Seongyun Lee, Seungone Kim, Minju Seo et al.

Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

en cs.CL, cs.AI

arXiv Open Access 2025

Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base

Yu Li, Yuan Huang, Tao Wang et al.

Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search -- retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.

en cs.AI, cs.LG
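The abstract above mentions filtering generated chains by cross-model answer consensus but gives no details. A minimal sketch of one possible reading, majority agreement on the final answer, is shown below. The function name `consensus_answer`, the `min_agree` threshold, and the lowercase/strip normalization are all assumptions for illustration, not the paper's actual filter.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_answer(answers: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the most common normalized answer if enough solvers agree, else None."""
    # Hypothetical normalization: ignore case and surrounding whitespace.
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return None
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count >= min_agree else None


# Three hypothetical solver outputs for one question; two of them agree.
kept = consensus_answer(["42", " 42 ", "41"])
dropped = consensus_answer(["a", "b", "c"])
```

A question whose solvers disagree returns `None` and would be discarded, so only chains ending in an agreed-upon answer remain in the corpus.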

arXiv Open Access 2025

Data-driven Methods of Extracting Text Structure and Information Transfer

Shinichi Honna, Taichi Murayama, Akira Matsui

The Anna Karenina Principle (AKP) holds that success requires satisfying a small set of essential conditions, whereas failure takes diverse forms. We test AKP, its reverse, and two further patterns described as ordered and noisy across novels, online encyclopedias, research papers, and movies. Texts are represented as sequences of functional blocks, and convergence is assessed in transition order and position. Results show that structural principles vary by medium: novels follow reverse AKP in order, Wikipedia combines AKP with ordered patterns, academic papers display reverse AKP in order but remain noisy in position, and movies diverge by genre. Success therefore depends on structural constraints that are specific to each medium, while failure assumes different shapes across domains.

en cs.AI, cs.LG

DOAJ Open Access 2024

The Ethnic Content of the Values Son and Daughter in the Kazakh Culture: Ethnopsycholinguistic Study of Family and Relative Values

Гульжана Кузембаєва

Purpose. The study is aimed at exploring the ethnic content of values ұл / son and қыз / daughter in the Kazakh language and culture bearers’ linguistic consciousness and determining their place in the system of family and relative values in the Kazakh culture at the level of the state. The problem under study is especially relevant in the context of transformation of Kazakhstani society, heterogeneity of language situation and cultural diversity across regions. Methods. The study employed an ethnographic research design, which included a component analysis, an association experiment, and a ranking of values ұл / son and қыз / daughter by the representatives of the Kazakh nation. The study materials comprised linguistic data, obtained from dictionaries and encyclopedias of the Kazakh language, Kazakh paremias, and data of the association experiment. Semantic components of the values were defined and modelling of the association fields was conducted with further categorization of semantic gestalts. Results. The analysis of Kazakh dictionaries, encyclopedias and paremias revealed the importance of raising children based on good examples in Kazakh families. A son in Kazakh culture is considered great happiness, luck, and wealth, as he is the successor of his father and the heir of the family house in the future, whereas a daughter is raised in especially strict conditions, though treated as a guest, considered the prosperity in the family, pampered and valued for her beauty and girlish behaviour. The association experiment’s results were in line with the analysis of proverbs, showing the common and different components in the association fields and semantic zones of the values. The ranking of values ұл / son and қыз / daughter confirmed our hypothesis. Conclusions. Children are of utmost importance in the Kazakh culture, though the treatment and relation to a son and a daughter, as well as their place in the family differs because of their different roles and responsibilities.

Language. Linguistic theory. Comparative grammar

DOAJ Open Access 2024

Online Resource Usage by Agricultural Science Lecturers in Nasarawa State, Nigeria

Fadlullah Olayiwola ISSA, Emmanuel Tomiwa Omisope, Jamila Rabe Mani et al.

The study assessed the use of online resources by agricultural science lecturers in Nasarawa State, Nigeria. Sixty-six lecturers were randomly sampled from two universities and one college of agriculture. A structured questionnaire was adopted to elicit relevant data for the study. Frequency counts, percentages, mean scores, standard deviations, regression analysis, and analysis of variance (ANOVA) were used for data analysis. The main types of online resources used included electronic journals, search engines, and video/picture files, with a grand mean of 2.0 (SD=0.76). Online resources that were perceived to be useful to lecturers included search engines, electronic journals, and encyclopedias, with a grand mean of 2.15 (SD=0.90). Socio-institutional factors that significantly influenced the use of online resources were age (t=-8.560) and education (t=1.790). Major constraints in the use of online resources by the lecturers included the high cost of access to online resources (x̄=3.36), non-subscription for relevant online resources by institutions (x̄=3.34), and connectivity problems (x̄=3.33). This study concludes that the use of online resources is determined by the age and level of education of the lecturers. Tertiary institutions should ensure reliable internet connectivity and an adequate subscription to relevant online resources to support teaching, learning, and research.

Agriculture (General)

DOAJ Open Access 2023

Electronic dictionaries in the information age: Examples of solutions in dictionaries of Croatian and other languages

Ivana Filipović Petrović, Jelena Parizoska

Electronic lexicography is a branch of lexicography that deals with theoretical and practical aspects of online dictionaries, additionally encompassing encyclopedias, lexicons, glossaries, terminological repositories, and other online information tools. From the user’s perspective, electronic dictionaries offer fast access, entries enriched with new types of information such as sounds and images as well as links to other software featuring embedded teaching and learning, games, etc. From the lexicographic perspective, e-dictionaries overcome limitations of alphabetical ordering as well as those of linear text by supplying entries with links and the show less/more functionality. However, despite these advantages, not all national lexicographies offer a wide range of born-digital dictionaries. In response to this, contemporary e-lexicography seeks to offer solutions to the challenges arising from the seemingly ideal digital conditions, including the danger of overwhelming the user with an abundance of unstructured data leading to the loss of information. E-lexicography also seeks to answer the question of what lexicographers can do to induce users to obtain information from dictionaries rather than from online databases or fast search engines. This article provides an overview of the current state-of-the-art in electronic lexicography from a theoretical and technological perspective. It starts by presenting the main theoretical postulates related to the features of e-dictionaries and electronic lexicography theory. On the one hand, electronic lexicography is based on existing lexicographic theory with its linguistic underpinnings, where dictionary microstructure crucially depends on the lexicographer’s practice. On the other hand, it involves technological, functional and aesthetic aspects of creating a digital dictionary. The overview is followed by examples from electronic dictionaries of Croatian and other languages illustrating how the challenges of the information age and of balancing all the features of e-lexicography with user needs may be addressed. The article concludes that current lexicographic practice in creating dictionaries and other lexicographic tools is turned towards keeping up with rapid technological developments and social needs.

Lexicography

DOAJ Open Access 2023

The Shevchenko Scientific Society Encyclopedia in view of describing the history of Ukrainian medicine

Lesya Mateshuk-Vatseba

This communication presents a brief overview of the first five volumes of the Shevchenko Scientific Society Encyclopedia (2012–2022). The author, a professor of medicine, naturally focuses on highlighting the medical themes within the encyclopedia in a retrospective dimension, emphasizing the significant importance of each published volume for the history of Ukrainian science.

Encyclopedias

DOAJ Open Access 2023

Participation of Natives of the Belarus in the Battle of Stalingrad

Aliaksey Litwin

Introduction. The main information about our compatriots who distinguished themselves in the Battle of Stalingrad is contained in the award documents, which have not yet become the subject of study by Belarusian researchers. Individual facts appeared in the press while the hostilities were still under way. For example, the feat of the Red Army soldier Alexey Vashchenko, who closed the embrasure of the enemy’s gates with his chest on September 5, 1942, was the subject of a publication in the division newspaper and a leaflet issued by the political department. However, the main information on the topic emerged only in the post-war period. Methods and Materials. The article is based on information from scientific and educational literature, encyclopedias and reference books, the memoirs of Soviet military leaders who took part in the Battle of Stalingrad, publications in the regional historical collections “Memory”, and Belarusian central and local periodicals. The goal is to identify information about the participation of natives of Belarus in the Battle of Stalingrad. Results. The study showed that Belarusians were represented among both privates and command staff, in almost all types and branches of the military. Many of them were awarded high state decorations for military distinctions during the Battle of Stalingrad.

History of Russia. Soviet Union. Former Soviet Republics, International relations

arXiv Open Access 2023

Wiki-based Communities of Interest: Demographics and Outliers

Hiba Arnaout, Simon Razniewski, Jeff Z. Pan

In this paper, we release data about demographic information and outliers of communities of interest. Identified from Wiki-based sources, mainly Wikidata, the data covers 7.5k communities, such as members of the White House Coronavirus Task Force, and 345k subjects, e.g., Deborah Birx. We describe the statistical inference methodology adopted to mine such data. We release subject-centric and group-centric datasets in JSON format, as well as a browsing interface. Finally, we foresee three areas this research can have an impact on: in social sciences research, it provides a resource for demographic analyses; in web-scale collaborative encyclopedias, it serves as an edit recommender to fill knowledge gaps; and in web search, it offers lists of salient statements about queried subjects for higher user engagement.

en cs.SI, cs.CY

DOAJ Open Access 2022

Cognition as Morphological/Morphogenetic Embodied Computation In Vivo

Gordana Dodig-Crnkovic

Cognition, historically considered a uniquely human capacity, has recently been found to be an ability of all living organisms, from single cells up. This study approaches cognition from an info-computational stance, in which structures in nature are seen as information, and processes (information dynamics) are seen as computation, from the perspective of a cognizing agent. Cognition is understood as a network of concurrent morphological/morphogenetic computations unfolding as a result of self-assembly, self-organization, and autopoiesis of physical, chemical, and biological agents. The present-day human-centric view of cognition still prevailing in major encyclopedias has a variety of open problems. This article considers recent research about morphological computation, morphogenesis, agency, basal cognition, extended evolutionary synthesis, free energy principle, cognition as Bayesian learning, active inference, and related topics, offering new theoretical and practical perspectives on problems inherent to the old computationalist cognitive models, which were based on abstract symbol processing and unaware of the actual physical constraints and affordances of the embodiment of cognizing agents. A better understanding of cognition is centrally important for future artificial intelligence, robotics, medicine, and related fields.

Science, Astrophysics

arXiv Open Access 2022

Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching

Nicola Messina, Davide Alessandro Coccomini, Andrea Esuli et al.

With the increased accessibility of web and online encyclopedias, the amount of data to manage is constantly increasing. In Wikipedia, for example, there are millions of pages written in multiple languages. These pages contain images that often lack textual context, remaining conceptually floating and therefore harder to find and manage. In this work, we present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle, whose objective is to use data associated with images (URLs and visual data) to find the correct caption among a large pool of available ones. A system able to perform this task would improve the accessibility and completeness of multimedia content on large online encyclopedias. Specifically, we propose a cascade of two models, both powered by the recent Transformer model, able to efficiently and effectively infer a relevance score between the query image data and the captions. We verify through extensive experimentation that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity bounded at inference time. Our approach achieves remarkable results, obtaining a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.

en cs.CV
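The result above is reported as normalized Discounted Cumulative Gain (nDCG). As a reminder of what that number measures, here is a minimal sketch of the textbook definition: each ranked item's relevance is discounted by log2 of its position, and the sum is divided by the score of the ideal ordering. The optional cutoff `k` and the binary relevance in the usage lines are assumptions for illustration; the exact variant used on the Kaggle leaderboard is not stated in the abstract.

```python
import math
from typing import Optional, Sequence


def dcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """Discounted cumulative gain over the top-k ranked relevances."""
    ranked = relevances[:k] if k is not None else relevances
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# One correct caption per image: ranking it first scores 1.0, second scores about 0.63.
first_place = ndcg([1, 0, 0])
second_place = ndcg([0, 1, 0])
```

With a single relevant caption per query, the score depends only on the rank at which the correct caption appears, so an average around 0.5 suggests the correct caption typically lands near the top of the list but not always first.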

arXiv Open Access 2022

A Library Perspective on Nearly-Unsupervised Information Extraction Workflows in Digital Libraries

Hermann Kroll, Jan Pirklbauer, Florian Plötzky et al.

Information extraction can support novel and effective access paths for digital libraries. Nevertheless, designing reliable extraction workflows can be cost-intensive in practice. On the one hand, suitable extraction methods rely on domain-specific training data. On the other hand, unsupervised and open extraction methods usually produce non-canonicalized extraction results. This paper tackles the question of how digital libraries can handle such extractions and whether their quality is sufficient in practice. We focus on unsupervised extraction workflows by analyzing them in case studies in the domains of encyclopedias (Wikipedia), pharmacy and political sciences. We report on opportunities and limitations. Finally, we discuss best practices for unsupervised extraction workflows.

en cs.CL, cs.DL
