arXiv Open Access 2023

Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code

Yiming Sun Daniel M. German Stefano Zacchiroli
Lihat Sumber

Abstrak

We consider the problem of identifying the provenance of free/open source software (FOSS) and specifically the need of identifying where reused source code has been copied from. We propose a lightweight approach to solve the problem based on software identifiers-such as the names of variables, classes, and functions chosen by programmers. The proposed approach is able to efficiently narrow down to a small set of candidate origin products, to be further analyzed with more expensive techniques to make a final provenance determination.By analyzing the PyPI (Python Packaging Index) open source ecosystem we find that globally defined identifiers are very distinct. Across PyPI's 244 K packages we found 11.2 M different global identifiers (classes and method/function names-with only 0.6% of identifiers shared among the two types of entities); 76% of identifiers were used only in one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to a maximum of 3 products within 89% of the cases.We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages; this approach uses at most five trials, where each trial uses three randomly chosen global identifiers from a randomly chosen python file of the subject software package, then ranks results using a popularity index and requires to inspect only the top result. In our experiments, this method is effective at finding the true origin of a project with a recall of 0.9 and precision of 0.77.

Topik & Kata Kunci

Penulis (3)

Y

Yiming Sun

D

Daniel M. German

S

Stefano Zacchiroli

Format Sitasi

Sun, Y., German, D.M., Zacchiroli, S. (2023). Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code. https://arxiv.org/abs/2305.14837

Akses Cepat

Lihat di Sumber
Informasi Jurnal
Tahun Terbit
2023
Bahasa
en
Sumber Database
arXiv
Akses
Open Access ✓