arXiv Open Access 2021

DomainNet: Homograph Detection for Data Lake Disambiguation

Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract

Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, on a synthetic benchmark, using a domain discovery method to identify homographs yields a precision and a recall of 38%, versus 69% with our method. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake, our top-200 precision is 89%.
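
The idea of scoring values by centrality on a value-attribute bipartite graph can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy data lake, the function names, and the choice of betweenness centrality (computed with Brandes' algorithm) as the centrality measure, since the abstract does not name a specific one. It is not the authors' implementation.

```python
from collections import defaultdict, deque

def build_bipartite_graph(lake):
    """Link each attribute node to the value nodes that occur in it."""
    graph = defaultdict(set)
    for attribute, values in lake.items():
        attr_node = ("attr", attribute)
        for value in values:
            val_node = ("val", value.strip().lower())
            graph[attr_node].add(val_node)
            graph[val_node].add(attr_node)
    return graph

def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack = []
        preds = {node: [] for node in graph}
        sigma = dict.fromkeys(graph, 0)   # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:                      # breadth-first search from source
            node = queue.popleft()
            stack.append(node)
            for nb in graph[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
                if dist[nb] == dist[node] + 1:
                    sigma[nb] += sigma[node]
                    preds[nb].append(node)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                      # accumulate dependencies backwards
            node = stack.pop()
            for p in preds[node]:
                delta[p] += sigma[p] / sigma[node] * (1 + delta[node])
            if node != source:
                score[node] += delta[node]
    # Each unordered pair was counted from both endpoints.
    return {node: s / 2 for node, s in score.items()}

def rank_values(lake):
    """Return (value, centrality) pairs, most central first."""
    scores = betweenness_centrality(build_bipartite_graph(lake))
    values = [(name, s) for (kind, name), s in scores.items() if kind == "val"]
    return sorted(values, key=lambda item: item[1], reverse=True)

# Hypothetical toy lake: "jaguar" appears as an animal and as a car make.
lake = {
    "zoo.animal": ["jaguar", "leopard", "lion", "panda"],
    "wildlife.species": ["leopard", "lion", "panda", "tiger"],
    "cars.make": ["jaguar", "toyota", "fiat", "volvo"],
    "dealers.brand": ["toyota", "fiat", "volvo", "honda"],
}

ranking = rank_values(lake)
for value, centrality in ranking[:3]:
    print(f"{value}: {centrality:.2f}")
```

In this toy lake, "jaguar" is the only value connecting the animal attributes to the car attributes, so it ranks first among the value nodes. Values with a single meaning stay inside one cluster of attributes and score much lower. No threshold for declaring a value a homograph is given in the abstract, so the sketch only ranks.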

Citation Format

Leventidis, A., Di Rocco, L., Gatterbauer, W., Miller, R.J., Riedewald, M. (2021). DomainNet: Homograph Detection for Data Lake Disambiguation. https://arxiv.org/abs/2103.09940
