arXiv Open Access 2021

DomainNet: Homograph Detection for Data Lake Disambiguation

Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.
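The graph construction described in the abstract can be illustrated with a minimal sketch. It builds a bipartite graph with one node per attribute and one node per distinct value, then ranks values by a centrality score. Several things here are assumptions and do not come from the abstract: the choice of betweenness centrality as the measure (the abstract only says "network-centrality measures"), the case-insensitive value normalization, the function names, and the toy data lake. The centrality routine is a plain implementation of Brandes' algorithm, not the authors' code.

```python
from collections import defaultdict, deque


def build_bipartite_graph(columns):
    """Build an undirected bipartite graph from {attribute_name: [cell values]}.

    Nodes are tagged tuples so attribute and value names cannot collide.
    Assumption: values are compared after trimming and upper-casing.
    """
    adj = defaultdict(set)
    for attr, values in columns.items():
        attr_node = ("attr", attr)
        for value in values:
            value_node = ("val", str(value).strip().upper())
            adj[attr_node].add(value_node)
            adj[value_node].add(attr_node)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted graph)."""
    score = dict.fromkeys(adj, 0.0)
    for source in adj:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = defaultdict(list)
        n_paths = dict.fromkeys(adj, 0)
        n_paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    n_paths[w] += n_paths[v]
                    preds[w].append(v)
        # Accumulate dependencies, visiting nodes farthest from `source` first.
        dep = dict.fromkeys(adj, 0.0)
        while order:
            w = order.pop()
            for v in preds[w]:
                dep[v] += n_paths[v] / n_paths[w] * (1.0 + dep[w])
            if w != source:
                score[w] += dep[w]
    # Each unordered pair was counted from both endpoints.
    return {node: s / 2.0 for node, s in score.items()}


def rank_values(columns):
    """Return (value, score) pairs for value nodes, highest centrality first."""
    scores = betweenness(build_bipartite_graph(columns))
    values = [(node[1], s) for node, s in scores.items() if node[0] == "val"]
    return sorted(values, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy lake: "Jaguar" appears as an animal and as a car make.
lake = {
    "zoo.animal": ["Jaguar", "Puma", "Lion"],
    "zoo2.species": ["Puma", "Lion", "Panda"],
    "cars.make": ["Jaguar", "Toyota", "Fiat"],
    "cars2.brand": ["Toyota", "Fiat", "Volvo"],
}
ranked = rank_values(lake)
print(ranked[:3])
```

In this toy lake the value "Jaguar" is the only link between the animal columns and the car columns, so every shortest path between the two groups passes through it and it ranks first. On a real lake a score threshold or top-k cut-off would still have to be chosen; the abstract reports top-200 precision but does not describe the selection rule.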

Topics & Keywords

cs.DB

Authors (5)

Aristotelis Leventidis

Laura Di Rocco

Wolfgang Gatterbauer

Renée J. Miller

Mirek Riedewald

Citation Format

Leventidis, A., Di Rocco, L., Gatterbauer, W., Miller, R. J., & Riedewald, M. (2021). DomainNet: Homograph Detection for Data Lake Disambiguation. https://arxiv.org/abs/2103.09940

Journal Information

Publication Year: 2021
Language: en
Database Source: arXiv
Access: Open Access ✓