Genomics is a Big Data science and is going to get much bigger, very soon, but it is not known whether the needs of genomics will exceed other Big Data domains. Projecting to the year 2025, we compared genomics with three other major generators of Big Data: astronomy, YouTube, and Twitter. Our estimates show that genomics is a “four-headed beast”—it is either on par with or the most demanding of the domains analyzed here in terms of data acquisition, storage, distribution, and analysis. We discuss aspects of new technologies that will need to be developed to rise up and meet the computational challenges that genomics poses for the near future. Now is the time for concerted, community-wide planning for the “genomical” challenges of the next decade.
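The scale of such a projection can be illustrated with a simple compound-growth sketch. The baseline volume (25 PB in 2015) and the 12-month doubling time below are illustrative assumptions, not figures taken from the abstract:

```python
# Illustrative compound-growth projection of genomic data volume.
# Assumptions (not from the text): a 2015 baseline of ~25 petabytes
# of sequence data and a doubling time of 12 months.

def projected_volume_pb(baseline_pb: float, doubling_months: float,
                        months_ahead: float) -> float:
    """Volume after `months_ahead` months of exponential growth."""
    return baseline_pb * 2 ** (months_ahead / doubling_months)

# Project from 2015 to 2025 (120 months) under the assumed doubling time.
volume_2025 = projected_volume_pb(25.0, 12.0, 120.0)
print(f"{volume_2025:.0f} PB")  # 25600 PB, i.e. roughly 25 exabytes
```

Ten doublings multiply the baseline by 1,024, which is why even modest doubling times push storage demands from petabytes into exabytes within a decade.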
Many sciences have made significant breakthroughs by adopting online tools that help organize, structure and mine information that is too detailed to be printed in journals. In this paper, we introduce OpenML, a place for machine learning researchers to share and organize data in fine detail, so that they can work more effectively, be more visible, and collaborate with others to tackle harder problems. We discuss how OpenML relates to other examples of networked science and what benefits it brings for machine learning research, individual scientists, as well as students and practitioners.
Context The concept of citizen science (CS) is currently invoked by many actors inside and outside science and research. Several descriptions of this purportedly new approach to science are often heard in connection with large datasets and the possibilities of mobilizing crowds outside science to assist with observations and classifications. However, other accounts refer to CS as a way of democratizing science, aiding concerned communities in creating data to influence policy, and as a way of promoting political decision processes involving environment and health. Objective In this study we analyse two datasets (N = 1935, N = 633) retrieved from the Web of Science (WoS) with the aim of giving a scientometric description of what the concept of CS entails. We account for its development over time, identify which strands of research have adopted CS, and assess the scientific output of CS-related projects. To attain this, scientometric methods have been combined with qualitative approaches to render more precise search terms. Results The results indicate three main focal points of CS. The largest is composed of research on biology, conservation and ecology, and utilizes CS mainly as a methodology for collecting and classifying data. A second strand of research has emerged through geographic information research, where citizens participate in the collection of geographic data. Thirdly, there is a line of research relating to the social sciences and epidemiology, which studies and facilitates public participation in relation to environmental issues and health. In terms of scientific output, the largest body of articles is found in biology and conservation research. In absolute numbers, the number of publications generated by CS is low (N = 1935), but over the past decade a new and very productive line of CS based on digital platforms has emerged for the collection and classification of data.
This paper describes the organization of the database and the catalog data products from the Pan-STARRS1 3π Steradian Survey. The catalog data products are available in the form of an SQL-based relational database from MAST, the Mikulski Archive for Space Telescopes at STScI. The database is described in detail, including the construction of the database, the provenance of the data, the schema, and how the database tables are related. Examples of queries for a range of science goals are included.
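The kind of query such a relational catalog supports can be sketched with an in-memory SQLite database. The table and column names (`objects`, `obj_id`, `ra`, `dec`, `mag_r`) are illustrative stand-ins, not the actual Pan-STARRS1 schema:

```python
import sqlite3

# Minimal stand-in for a survey catalog table; names are invented,
# not the real Pan-STARRS1 schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (obj_id INTEGER, ra REAL, dec REAL, mag_r REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?, ?)",
    [(1, 150.01, 2.21, 18.4), (2, 150.02, 2.19, 21.7), (3, 210.5, -11.0, 17.2)],
)

# A typical science query: bright sources inside a small rectangular
# region of sky (a box search; real cone searches use spherical geometry).
rows = conn.execute(
    "SELECT obj_id, mag_r FROM objects "
    "WHERE ra BETWEEN 149.9 AND 150.1 AND dec BETWEEN 2.1 AND 2.3 "
    "AND mag_r < 20 ORDER BY mag_r"
).fetchall()
print(rows)  # [(1, 18.4)]
```

The point of the relational design is exactly this: positional, photometric, and provenance constraints compose into a single declarative query rather than bespoke file scans.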
Data sharing is one of the cornerstones of modern science that enables large-scale analyses and reproducibility. We evaluated data availability in research articles across nine disciplines in Nature and Science magazines and recorded corresponding authors’ concerns, requests and reasons for declining data sharing. Although data sharing has improved in the last decade and particularly in recent years, data availability and willingness to share data still differ greatly among disciplines. We observed that statements of data availability upon (reasonable) request are inefficient and should not be allowed by journals. To improve data sharing at the time of manuscript acceptance, researchers should be better motivated to release their data with real benefits such as recognition, or bonus points in grant and job applications. We recommend that data management costs should be covered by funding agencies; publicly available research data ought to be included in the evaluation of applications; and surveillance of data sharing should be enforced by both academic publishers and funders. These cross-discipline survey data are available from the PlutoF repository.
J. M. Heberling, Joseph T. Miller, Daniel Noesgaard, et al.
Significance As anthropogenic impacts to Earth systems accelerate, biodiversity knowledge integration is urgently required to support responses that underpin a sustainable future. Consolidating information from disparate sources (e.g., community science programs, museums) and data types (e.g., environmental, biological) can connect the biological sciences across taxonomic, disciplinary, geographical, and socioeconomic boundaries. In an analysis of the research uses of the world’s largest cross-taxon biodiversity data network, we report the emerging roles of open-access data aggregation in the development of increasingly diverse, global research. These results indicate a new biodiversity science landscape centered on big data integration, informing ongoing initiatives and the strategic prioritization of biodiversity data aggregation across diverse knowledge domains, including environmental sciences and policy, evolutionary biology, conservation, and human health. The accessibility of global biodiversity information has surged in the past two decades, notably through widespread funding initiatives for museum specimen digitization and the emergence of large-scale public participation in community science. Effective use of these data requires the integration of disconnected datasets, but the scientific impacts of consolidated biodiversity data networks have not yet been quantified. To determine whether data integration enables novel research, we carried out a quantitative text analysis and bibliographic synthesis of >4,000 studies published from 2003 to 2019 that use data mediated by the world’s largest biodiversity data network, the Global Biodiversity Information Facility (GBIF). Data available through GBIF increased 12-fold since 2007, a trend matched by global data use, with roughly two publications using GBIF-mediated data per day in 2019. Data-use patterns were diverse by authorship, geographic extent, taxonomic group, and dataset type.
Despite facilitating global authorship, legacies of colonial science remain. Studies involving species distribution modeling were most prevalent (31% of literature surveyed) but recently shifted in focus from theory to application. Topic prevalence was stable across the 17-y period for some research areas (e.g., macroecology), yet other topics proportionately declined (e.g., taxonomy) or increased (e.g., species interactions, disease). Although centered on biological subfields, GBIF-enabled research extends surprisingly across all major scientific disciplines. Biodiversity data mobilization through global data aggregation has enabled basic and applied research use at temporal, spatial, and taxonomic scales otherwise not possible, launching biodiversity sciences into a new era.
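At its simplest, a quantitative text analysis of topic prevalence amounts to tagging each study with topic keywords and counting. The keyword lists and example titles below are invented for illustration; the actual study used a richer text-analysis pipeline:

```python
from collections import Counter

# Toy keyword lists per topic (illustrative; not the study's actual terms).
TOPICS = {
    "species distribution modeling": ["distribution model", "sdm", "niche"],
    "taxonomy": ["new species", "taxonomic revision"],
    "disease": ["pathogen", "zoonotic", "disease"],
}

def tag_topics(abstract: str) -> set:
    """Return the set of topics whose keywords appear in the text."""
    text = abstract.lower()
    return {topic for topic, words in TOPICS.items()
            if any(w in text for w in words)}

abstracts = [
    "A niche-based distribution model for an invasive beetle",
    "Taxonomic revision of the genus Carex",
    "Zoonotic disease risk mapped with species distribution models",
]
# Count how many studies touch each topic (one study can hit several).
prevalence = Counter(t for a in abstracts for t in tag_topics(a))
print(dict(prevalence))
```

Tracking these counts per publication year is what turns keyword tagging into the prevalence trends the study reports (e.g., taxonomy declining, disease rising).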
Policy‐makers increasingly demand robust measures of biodiversity change over short time periods. Long‐term monitoring schemes provide high‐quality data, often on an annual basis, but are taxonomically and geographically restricted. By contrast, opportunistic biological records are relatively unstructured but vast in quantity. Recently, these data have been applied to increasingly elaborate science and policy questions, using a range of methods. At present, we lack a firm understanding of which methods, if any, are capable of delivering unbiased trend estimates on policy‐relevant time‐scales. We identified a set of candidate methods that employ data filtering criteria and/or correction factors to deal with variation in recorder activity. We designed a computer simulation to compare the statistical properties of these methods under a suite of realistic data collection scenarios. We measured the Type I error rates of each method–scenario combination, as well as the power to detect genuine trends. We found that simple methods produce biased trend estimates, and/or had low power. Most methods are robust to variation in sampling effort, but biases in spatial coverage, sampling effort per visit, and detectability, as well as turnover in community composition, all induced some methods to fail. No method was wholly unaffected by all forms of variation in recorder activity, although some performed well enough to be useful. We warn against the use of simple methods. Sophisticated methods that model the data collection process offer the greatest potential to estimate timely trends, notably Frescalo and occupancy–detection models. The potential of these methods and the value of opportunistic data would be further enhanced by assessing the validity of model assumptions and by capturing small amounts of information about sampling intensity at the point of data collection.
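The simulation logic can be sketched in miniature: generate opportunistic records under a no-trend null with year-to-year variation in recorder effort, fit a naive trend, and check that estimates scatter around zero. This is a toy version under stated assumptions (constant detection probability, random effort); the study itself simulated far richer data-collection scenarios:

```python
import random
import statistics

def simulate_reporting_rate(n_years=10, p_detect=0.5):
    """Yearly fraction of visits recording the focal species, with no true
    trend but randomly varying recorder effort (visits per year)."""
    visits_per_year = [random.randint(50, 200) for _ in range(n_years)]
    return [sum(random.random() < p_detect for _ in range(v)) / v
            for v in visits_per_year]

def naive_trend(rates):
    """Slope of a least-squares line through the yearly reporting rates."""
    n = len(rates)
    x_mean, y_mean = (n - 1) / 2, statistics.mean(rates)
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(rates))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

random.seed(1)
slopes = [naive_trend(simulate_reporting_rate()) for _ in range(200)]
# Under the null, slopes should scatter around zero; systematic deviation
# would indicate a biased method, and the spread calibrates Type I error.
print(round(statistics.mean(slopes), 4))
```

A full comparison would repeat this for each candidate method and each scenario (biased spatial coverage, varying detectability, community turnover), recording both false-positive rates and power.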
Abstract Purpose The purpose of this study is to assess the coverage of the scientific literature in Scopus and Web of Science from the perspective of research evaluation. Design/methodology/approach The academic communities of Norway have agreed on certain criteria for what should be included as original research publications in research evaluation and funding contexts. These criteria have been applied since 2004 in a comprehensive bibliographic database called the Norwegian Science Index (NSI). The relative coverages of Scopus and Web of Science are compared with regard to publication type, field of research and language. Findings Our results show that Scopus covers 72 percent of the total Norwegian scientific and scholarly publication output in 2015 and 2016, while the corresponding figure for Web of Science Core Collection is 69 percent. The coverages are most comprehensive in medicine and health (89 and 87 percent) and in the natural sciences and technology (85 and 84 percent). The social sciences (48 percent in Scopus and 40 percent in Web of Science Core Collection) and particularly the humanities (27 and 23 percent) are much less covered in the two international data sources. Research limitations The comparison draws on data from only one country, which is a limitation of the study, but the criteria used to define a country’s scientific output, as well as the identification of patterns of field-dependent partial representation in Scopus and Web of Science, should be recognizable and useful for other countries as well. Originality/value The novelty of this study is the criteria-based approach to studying coverage problems in the two data sources.
Given the emergence of data science and machine learning throughout all aspects of society, but particularly in the scientific domain, there is increased importance placed on obtaining data. Data in materials science are particularly heterogeneous, based on the significant range in materials classes that are explored and the variety of materials properties that are of interest. This leads to data that span many orders of magnitude, and these data may manifest as numerical text or image-based information, which requires quantitative interpretation. The ability to automatically consume and codify the scientific literature across domains, enabled by techniques adapted from the field of natural language processing, therefore has immense potential to unlock and generate the rich datasets necessary for data science and machine learning. This review focuses on the progress and practices of natural language processing and text mining of materials science literature and highlights opportunities for extracting additional information beyond text contained in figures and tables in articles. We discuss and provide examples for several motivations for pursuing natural language processing for materials, including data compilation, hypothesis development, and understanding the trends within and across fields. Current and emerging natural language processing methods, along with their applications to materials science, are detailed. We then discuss natural language processing and data challenges within the materials science domain where future directions may prove valuable.
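At its simplest, text mining a materials abstract means pulling structured (value, unit) records out of free text. A minimal regex sketch is shown below; the sentence and the unit list are invented for illustration, and production systems use trained NLP models rather than hand-written patterns:

```python
import re

sentence = ("The perovskite film showed a band gap of 1.55 eV and a "
            "carrier mobility of 24.3 cm2/Vs after annealing at 150 C.")

# Match "<number> <unit>" pairs for a few illustrative units.
PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*(eV|cm2/Vs|C)\b")

records = [(float(value), unit) for value, unit in PATTERN.findall(sentence)]
print(records)  # [(1.55, 'eV'), (24.3, 'cm2/Vs'), (150.0, 'C')]
```

The hard parts the review surveys start exactly where this sketch stops: linking each value to the right material and property, normalizing units, and handling information locked in figures and tables.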
Story slides: W3C eGovernment Community: Data Science; Agenda; The Changing Landscape of Federal Information Technology; Cloud: SOA, Semantic, and Data Science (September 10–11); Opportunities for Data Science; Discussions 1–5. Spotfire Dashboard. Research notes draw on Joshua Tauberer’s blog posts “Open Government Data” and “Big Data Meets Open Government”, with figures showing the new Federal Register 2.0, an animated visualization of live wind speeds and directions, “Data is like refrigerator poetry”, http://www.GovTrack.us, and John Oliver’s parody of Schoolhouse Rock’s “I’m Just a Bill”.
Matthias Scheffler, M. Aeschlimann, M. Albrecht, et al.
The prosperity and lifestyle of our society are very much governed by achievements in condensed matter physics, chemistry and materials science, because new products for sectors such as energy, the environment, health, mobility and information technology (IT) rely largely on improved or even new materials. Examples include solid-state lighting, touchscreens, batteries, implants, drug delivery and many more. The enormous amount of research data produced every day in these fields represents a gold mine of the twenty-first century. This gold mine is, however, of little value if these data are not comprehensively characterized and made available. How can we refine this feedstock; that is, turn data into knowledge and value? For this, a FAIR (findable, accessible, interoperable and reusable) data infrastructure is a must. Only then can data be readily shared and explored using data analytics and artificial intelligence (AI) methods. Making data 'findable and AI ready' (a forward-looking interpretation of the acronym) will change the way in which science is carried out today. In this Perspective, we discuss how we can prepare to make this happen for the field of materials science.
Financial fraudsters now exploit generative AI, digital channels, global networks, and synthetic identities, making fraudulent activity difficult to identify. Traditional rule-based systems fail to detect frauds that route transactions through multiple steps, across multiple institutions, and over borders. Graph databases based on labelled property graphs represent customers, accounts, and transactions as interconnected nodes and edges. By ingesting live transaction data, they apply pattern matching and community detection to expose suspicious subgraphs, such as money-laundering rings or collusive clusters, and let investigators trace multi-hop links to “hub” accounts with clear visual audit trails. Machine learning models are trained on vast historical datasets using supervised classifiers (e.g., gradient boosting) and unsupervised anomaly detectors. Features such as transaction amounts, geolocation consistency, device fingerprints, and temporal sequences feed these models, while recurrent architectures capture evolving fraud tactics. Yet these models often suffer from concept drift, require extensive labelled data, underperform on imbalanced classes, and behave as opaque black boxes, generating false positives and hampering trust. A hybrid framework that combines relational graph insights with statistical scoring boosts detection accuracy, reduces false alarms, and enhances investigators’ confidence in fraud detection and prevention.
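The multi-hop tracing idea can be sketched with a plain adjacency-list graph: follow transaction edges outward from a flagged account and rank reached accounts by in-degree to surface “hub” accounts. This is a toy sketch using only the standard library, with invented account names; real deployments use a graph database, richer scoring, and streaming ingestion:

```python
from collections import defaultdict, deque

# Toy transaction edges: (sender, receiver). Account names are invented.
transactions = [
    ("acct_A", "acct_B"), ("acct_B", "acct_H"), ("acct_C", "acct_H"),
    ("acct_A", "acct_C"), ("acct_D", "acct_H"), ("acct_H", "acct_X"),
]

graph = defaultdict(list)
in_degree = defaultdict(int)
for sender, receiver in transactions:
    graph[sender].append(receiver)
    in_degree[receiver] += 1

def reachable_within(start, max_hops):
    """Accounts reachable from `start` within `max_hops` transaction hops (BFS)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return seen - {start}

# Trace 2 hops out from a flagged account; rank by in-degree to find hubs.
reached = reachable_within("acct_A", 2)
hubs = sorted(reached, key=lambda a: in_degree[a], reverse=True)
print(hubs[0])  # acct_H, which receives funds from three distinct accounts
```

A production system would layer anomaly scores from the machine learning side onto each traced account, which is the hybrid combination the abstract argues for.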
Background In recent years, although numerous meta-analyses/systematic reviews (MAs/SRs) have explored the therapeutic effect of noninvasive brain stimulation (NIBS) on poststroke depression (PSD), the reliability and quality of the clinical evidence remain uncertain. This study therefore conducts an overview of systematic reviews to evaluate the effectiveness and safety of NIBS for PSD, providing evidence-based support for clinical decision-making. Methods A comprehensive search of multiple databases, including PubMed, EMBASE, the Cochrane Library, Web of Science, CNKI, VIP, Wan Fang, and CBM, was conducted to identify systematic reviews and meta-analyses of NIBS for PSD. Following literature screening and data extraction, the quality of the included studies was assessed using the PRISMA statement for reporting quality, the AMSTAR-2 tool for methodological quality, and the GRADE system for evidence quality. We extracted the main outcome indicator of depressive symptoms and the secondary outcome indicators of neurological function, cognitive function, daily living ability, anxiety symptoms, clinical efficiency and adverse reactions for analysis. Results A total of 20 MAs/SRs were included. According to the PRISMA statement, 6 (30%) reports were relatively complete, 13 (65%) had problems, and 1 (5%) had serious information deficiencies. The AMSTAR-2 evaluation revealed that 3 articles (15%) were of moderate quality, 5 (25%) were of low quality, and 12 (60%) were of critically low quality; the overall quality was thus not high. The GRADE evaluation covered a total of 66 bodies of evidence: 9 (13.6%) provided moderate-level evidence, 15 (22.8%) low-level evidence, and 42 (63.6%) very low-level evidence, with no high-quality outcome indicator identified. Conclusion The included studies revealed that NIBS is an effective and safe treatment for PSD.
However, because the methodology and results of the MAs/SRs were generally not of high quality, the reliability of the conclusions is limited to a certain extent. Future research should focus on conducting more high-quality, large-sample, multicenter follow-up studies to further verify the value of applying NIBS in PSD treatment. Systematic review registration: https://www.crd.york.ac.uk/PROSPERO/view/CRD42025633044, identifier CRD42025633044.