Semantic Scholar Open Access 2015

Relation: The Missing Container

Scott James J. Larkin

Abstrak

The humble mathematical relation1, a fundamental (if implicit) component in computational algorithms, is conspicuously absent in most standard container collections, including Python’s. In this paper, we present the basics of a relation container, and why you might use it instead of other methods. The concept is simple to implement and easy to use. We will walk through with code examples using our implementation of a relation (https://pypi.python.org/pypi/relate) Background: It’s the Little Things In our work in surface and aviation traffic simulation we deal with many moving pieces, terabytes of streaming information. Managing this much information pieces requires, unsurprisingly, some significant computational machinery: clusters of multiprocessors; different interworking database topologies: HDF5, NoSQL and SQL; compiled code, scripted code; COTS tools, commercial and open source code libraries. For the Python components of our work, we are fortunate to have data crunching libraries: numpy, pandas etc... However, we kept finding that, despite this wealth of machinery, we would get caught up on the little things. There may be thousands of flights in the air at any one time, but there are far fewer types of aircraft. There may be millions of vehicles on the road, but only a handful of vehicle categories. Whereas we could place these mini-databases into our data crunching tools as auxiliary tables, we didn’t. It didn’t make sense to perform a table merge with streaming data when we could do a quick lookup, on-the-fly, when we needed to. We didn’t want to create a table with ten rows and two columns when we could easily put that information into a dictionary, or a list. We didn’t want to implement our transient, sparse table with a graph database or create tables with an ’other’ column which we would then have to parse anyhow. And besides the traffic specific information, there were all those other pesky details: file tags, user aliases, color maps. Instead we cobbled together our mini-databases with what we had within easy mental reach: lists, sets and dictionaries. And when we needed to do a search, or invert keys/values, or assure uniqueness of mappings, we would create a loop, a list comprehension or a helper class. * Corresponding author: scott.james@noblis.org ‡ Noblis Copyright © 2015 Scott James et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. After some time it occurred to us that what we were really doing with our less-than-big data was reinventing a mathematical relation, ... over and over again. Once we realized that, we replaced the bookkeeping code managing our mini-databases with relation instances. This resulted in a variety of good things: reduced coding overhead, increased clarity of purpose and, oddly, improved computational efficiency. What is a relation and what is it good for? A relation a simply a pairing of elements of one set, the domain, with another, the range. Rephrasing more formally, a relation is a collection of tuples (x,y) where x is in the domain and y is in the range. A relation, implemented as code, can perform a variety of common tasks: • Inversion: quickly find the values(range) associated with a key(domain) • Partitioning: group values into unique buckets • Aliasing: maintain a unique pairing between keys and values • Tagging: associate two sets in an arbitrary manner These roughly correspond to the four cardinalities of a relation: • Many-to-one (M:1): a function, each range value having possibly multiple values in the domain • One-to-many (1:M): a categorization, where each element in the domain is associated with a unique group of values in the range • One-to-one (1:1): an isomorphism, where each element in the domain is uniquely identified with a single range value • Many-to-many (M:N): an unrestricted pairing of domain and range What is it not good for? The relation, at least as we have implemented it, is a chisel, not a jack-hammer. It is meant for the less-than-big data not the actuallybig data. When computational data is well-structured, vectorized or large enough to be concerned about storage, we use existing computational and relational libraries. A relation, by contrast, is useful when the data is loosely structured, transient, and in no real danger of overloading memory. The API Using a relation should be easy, as easy as using any fundamental container. It should involve as little programming friction as possible. It should feel natural and familiar. To accomplish these 172 PROC. OF THE 14th PYTHON IN SCIENCE CONF. (SCIPY 2015) Method Comment __init__ establish the cardinality and ordering of a Relation __setitem__ assign a range element to a domain element __getitem__ retrieve range element(s) for a domain element __delitem__ remove a domain element and all associated range pairings. If the range element has no remaining pairings, delete it. extend combine two Relation objects values return the domain keys returns list of domains __invert__swap domain and range

Penulis (2)

S

Scott James

J

J. Larkin

Format Sitasi

James, S., Larkin, J. (2015). Relation: The Missing Container. https://doi.org/10.25080/majora-7b98e3ed-01a

Akses Cepat

Informasi Jurnal
Tahun Terbit
2015
Bahasa
en
Sumber Database
Semantic Scholar
DOI
10.25080/majora-7b98e3ed-01a
Akses
Open Access ✓