Metadata management for distributed data sources is a long-standing but ever-growing problem. To address this challenge in a research-data and library-oriented setting, this work constructs a data architecture derived from the data-lake: the metadata-lake. A proof-of-concept implementation of the proposed metadata aggregator is also presented and evaluated.
We draw a connection between data modeling and visualization, namely that a visualization specification defines a mapping from database constraints to visual representations of those constraints. Using this formalism, we show how many visualization design decisions are, in fact, data modeling choices, and we extend data visualization from single-dataset visualizations to database visualization.
Query determinacy is decidable for project-select views and a project-select-join query with no self joins, as long as the selection predicates are in a first-order theory for which satisfiability is decidable.
We define an edit distance for hierarchically structured data that is compatible with the hierarchical multi-instance learning paradigm. An example of such data is a dataset represented in JSON format, where inner Array objects are interpreted as unordered bags of elements. We prove analytical properties of the defined distance.
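To make the idea of treating JSON arrays as unordered bags concrete, here is a minimal sketch of one possible distance of this kind. It is not the definition from the work summarized above: the unit costs, the subtree-size insertion and deletion cost, the brute-force matching, and the names `size`, `distance`, and `bag_distance` are all assumptions chosen for illustration.

```python
from itertools import permutations

# Hypothetical sentinel meaning "no partner" when two bags differ in size.
_MISSING = object()


def size(x):
    """Assumed cost of inserting or deleting a whole subtree: its node count."""
    if isinstance(x, dict):
        return 1 + sum(size(v) for v in x.values())
    if isinstance(x, list):
        return 1 + sum(size(v) for v in x)
    return 1


def distance(x, y):
    """Illustrative edit distance between two parsed JSON values."""
    if isinstance(x, dict) and isinstance(y, dict):
        total = 0
        for key in x.keys() | y.keys():
            if key in x and key in y:
                total += distance(x[key], y[key])
            else:
                # Key present on one side only: pay for the whole subtree.
                total += size(x[key] if key in x else y[key])
        return total
    if isinstance(x, list) and isinstance(y, list):
        return bag_distance(x, y)
    if isinstance(x, (dict, list)) or isinstance(y, (dict, list)):
        # Different kinds of node: delete one subtree and insert the other.
        return size(x) + size(y)
    # Two leaves: unit substitution cost (an assumption).
    return 0 if x == y else 1


def bag_distance(a, b):
    """Cheapest pairing of two unordered bags; brute force, small inputs only."""
    n = max(len(a), len(b))
    pa = list(a) + [_MISSING] * (n - len(a))
    pb = list(b) + [_MISSING] * (n - len(b))
    best = None
    for perm in permutations(range(n)):
        cost = 0
        for i, j in enumerate(perm):
            u, v = pa[i], pb[j]
            if u is _MISSING:
                cost += size(v)
            elif v is _MISSING:
                cost += size(u)
            else:
                cost += distance(u, v)
        if best is None or cost < best:
            best = cost
    return best
```

Under these assumptions, reordering the elements of an array costs nothing, for example `distance({"a": [1, 2, 3]}, {"a": [3, 1, 2]})` is zero. The factorial-time matching is only meant to show the bag semantics; a practical version would use an assignment solver.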
In this report the conceptual query language ConQuer-92 is introduced. This query language serves as the backbone of InfoAssistant's query facilities. Furthermore, this language can also be used for the specification of derivation rules (e.g. subtype defining rules) and textual constraints in InfoModeler. This report is solely concerned with a formal definition, and the explanation thereof, of ConQuer-92. The implementation of ConQuer-92 in SQL-92 will be treated in a separate report.
This report is part of the DataflowOpt project on optimization of modern dataflows and aims to introduce a data quality-aware cost model that covers the following aspects in combination: (1) heterogeneity in compute nodes, (2) geo-distribution, (3) massive parallelism, (4) complex DAGs, and (5) streaming applications. Such a cost model can then be leveraged to devise cost-based optimization solutions that deal with task placement and operator configuration.
This paper presents TXSC, a framework that provides smart contract developers with transaction primitives. These primitives allow developers to write smart contracts without the need to reason about the anomalies that can arise due to concurrent smart contract function executions.
In this technical report, we describe the historic development of the zooarchaeological database OssoBook and the resulting framework xBook, a generic infrastructure for distributed, relational data management that is mainly designed for the needs of scientific data. We describe the concepts of the architecture and its most important features. We especially point out the Server-Client architecture, the synchronization process, the Launcher application, and the structure and features of the application.
The aim of this article is to present an overview of the major families of state-of-the-art index and materialized view selection methods, and to discuss the issues and future trends in data warehouse performance optimization. We particularly focus on data mining-based heuristics we developed to reduce the selection problem complexity and target the most pertinent candidate indexes and materialized views.
RDF is increasingly being used to encode data for the semantic web and for data exchange. There have been a large number of works that address RDF data management. In this paper we provide an overview of these works.
A novel fast algorithm for finding quasi-identifiers in large datasets is presented. Performance measurements on a broad range of datasets demonstrate substantial reductions in run-time relative to the state of the art, and show that the algorithm scales to realistically sized datasets of up to several million records.
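For readers unfamiliar with the term, the following sketch shows a naive brute-force baseline for finding quasi-identifiers, not the fast algorithm described above. It assumes a quasi-identifier is a minimal set of columns whose combined values single out at least a given fraction of records; the function names, the `threshold` parameter, and the sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations


def unique_fraction(rows, columns):
    """Fraction of rows whose values on `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(rows)


def minimal_quasi_identifiers(rows, threshold=1.0):
    """Enumerate column subsets by increasing size (exponential in columns)."""
    columns = list(rows[0].keys())
    found = []
    for r in range(1, len(columns) + 1):
        for combo in combinations(columns, r):
            # Skip supersets of an already reported quasi-identifier.
            if any(set(f) <= set(combo) for f in found):
                continue
            if unique_fraction(rows, combo) >= threshold:
                found.append(combo)
    return found


# Made-up sample records, for illustration only.
rows = [
    {"zip": "10001", "age": 30, "sex": "F"},
    {"zip": "10001", "age": 41, "sex": "M"},
    {"zip": "10002", "age": 30, "sex": "M"},
    {"zip": "10002", "age": 41, "sex": "M"},
]
result = minimal_quasi_identifiers(rows)
```

On these four sample rows no single column is unique for every record, but the pair `("zip", "age")` is, so it is the only minimal quasi-identifier reported. The exhaustive enumeration is what makes this baseline impractical at the scale of millions of records.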
The Bounded Derivation Depth property (BDD) and Finite Controllability (FC) are two properties of sets of datalog rules and tuple generating dependencies (known as Datalog +/- programs), which have recently attracted some attention. We conjecture that the first of these properties implies the second, and support this conjecture with evidence, proving, among other results, that it holds for all theories over a binary signature.
Our earlier work, "Win-move is Coordination-Free (Sometimes)", showed that the classes of queries that can be computed in a distributed, coordination-free manner form a strict hierarchy depending on the assumptions of the model for distributed computations. In this paper, we further characterize these classes by revealing a tight relationship between them and novel weakened forms of monotonicity.
Releasing Web query logs, which contain valuable information for research or marketing, can breach the privacy of search engine users. Therefore, transforming query logs to limit the linking of a query to an individual, while preserving the usefulness of the data for analysis, is an important research problem. This survey provides an overview and discussion of recent studies in this direction.
A systematic study of the probability distribution of superimposed random codes is presented through the use of generating functions. Special attention is paid to two cases: bit structures that are uniformly distributed but not necessarily independent, and bit structures that are non-uniform but independent. Recommendations for optimal coding strategies are derived.
This survey paper examines the effective model theory obtained with the BSS model of real number computation. It treats the following topics: computable ordinals, satisfaction of computable infinitary formulas, forcing as a construction technique, effective categoricity, effective topology, and relations with other models for the effective theory of uncountable structures.
A $2k computer can execute about 8k transactions per second. This is 80x more than the 1970s traffic of one of the largest US banks, and it approximates the total US financial transaction volume of the 1970s. Very modest modern computers can easily solve yesterday's problems.
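The figures above imply a per-second rate for the 1970s bank, which the short calculation below makes explicit. It reads "80x more" as a ratio of 80 and assumes a sustained rate over a full 24-hour day; both are interpretive assumptions, and the variable names are illustrative.

```python
# Figures taken from the text above.
modern_tps = 8_000   # transactions per second on a $2k computer
ratio = 80           # "80x more", read here as a ratio (assumption)

# Implied 1970s traffic of the large US bank, in transactions per second.
bank_1970s_tps = modern_tps // ratio

# Daily volume if the modern rate were sustained for 24 hours (assumption).
seconds_per_day = 24 * 60 * 60
modern_per_day = modern_tps * seconds_per_day

print(bank_1970s_tps)   # 100
print(modern_per_day)   # 691200000
```

So, under these assumptions, the bank handled roughly 100 transactions per second, while the modern machine could process about 691 million transactions per day.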