arXiv Open Access 2022

SANTOS: Relationship-based Semantic Table Union Search

Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald

Abstract

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns. The first uses an existing knowledge base (KB); the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
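
The idea of scoring unionability by relationships between column pairs, rather than by columns alone, can be illustrated with a small sketch. This is not the SANTOS algorithm: the toy knowledge base `TOY_KB`, the helper names (`relationship_of`, `relationships`, `unionability`), and the scoring formula are all hypothetical and chosen only for illustration. Tables are assumed to be dictionaries mapping a column name to a list of values.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy KB: (subject, object) value pairs mapped to a relationship label.
TOY_KB = {
    ("Ottawa", "Canada"): "capitalOf",
    ("Paris", "France"): "capitalOf",
    ("Tokyo", "Japan"): "capitalOf",
    ("Toronto", "Canada"): "locatedIn",
    ("Lyon", "France"): "locatedIn",
}


def relationship_of(table, col_a, col_b, kb):
    """Return (label, support) for the most frequent KB relationship
    between two columns, or None if no row pair is found in the KB."""
    rows = list(zip(table[col_a], table[col_b]))
    votes = Counter()
    for pair in rows:
        label = kb.get(pair)
        if label is not None:
            votes[label] += 1
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label, count / len(rows)


def relationships(table, kb):
    """Annotate every ordered column pair that has a KB relationship."""
    found = {}
    for col_a, col_b in permutations(table, 2):
        rel = relationship_of(table, col_a, col_b, kb)
        if rel is not None:
            found[(col_a, col_b)] = rel
    return found


def unionability(query, candidate, kb):
    """Illustrative score: average, over the query's relationships, of the
    best-supported candidate relationship carrying the same label."""
    q_rels = relationships(query, kb)
    c_rels = relationships(candidate, kb)
    if not q_rels:
        return 0.0
    score = 0.0
    for label, q_support in q_rels.values():
        best = max(
            (c_sup for c_label, c_sup in c_rels.values() if c_label == label),
            default=0.0,
        )
        score += q_support * best
    return score / len(q_rels)


query = {"city": ["Ottawa", "Paris"], "country": ["Canada", "France"]}
capitals = {"capital": ["Tokyo", "Paris"], "nation": ["Japan", "France"]}
towns = {"town": ["Toronto", "Lyon"], "nation": ["Canada", "France"]}

print(unionability(query, capitals, TOY_KB))
print(unionability(query, towns, TOY_KB))
```

In this toy setting, both candidate tables contain a city column and a country column, so a purely column-based view would treat them alike. Only `capitals` shares the query's city-to-country relationship, which is the kind of distinction the abstract attributes to relationship semantics.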

Authors (7)

Aamod Khatiwada
Grace Fan
Roee Shraga
Zixuan Chen
Wolfgang Gatterbauer
Renée J. Miller
Mirek Riedewald

Citation Format

Khatiwada, A., Fan, G., Shraga, R., Chen, Z., Gatterbauer, W., Miller, R.J. et al. (2022). SANTOS: Relationship-based Semantic Table Union Search. https://arxiv.org/abs/2209.13589

Journal Information
Publication Year: 2022
Language: en
Source Database: arXiv
Access: Open Access