Data model fever and the case of the missing applications

The production of new data models, and data represented in them, is often treated as an end in itself, but data and data models are only as good as what you can do with them.

According to Noy and McGuinness, "the best solution [to modeling a domain] almost always depends on the application that you have in mind and the extensions that you anticipate."[1] Yet applications are often treated as an afterthought in the data modeling process, on the unspoken assumption that a sufficiently rigorous model will serve any purpose. Novice data modelers in particular have a tendency to become overzealous about the conceptual aspects of a model, at the expense of more practical concerns. Even experienced ontology engineers are often unable or unwilling to eat their own dog food by developing applications, or collaborating with application developers, to road-test new models on specific tasks.

Data models designed in isolation from applications may impress on paper but prove challenging to adopt in practice. In the cultural heritage space alone there are a number of mature data models that fit that description, such as VRA Core, "a data standard for the description of works of visual culture as well as the images that document them."

As Andrew Tanenbaum once quipped, "the nice thing about standards is that you have so many to choose from." Thousands of person-hours have been spent producing VRA Core and other comprehensive models in the cultural heritage space, yet they've seen limited use beyond a smattering of one-off, grant-funded prototypes and proprietary systems.

There is a widespread assumption that the use of technologies such as RDF and JSON-LD and adherence to the Linked Data and FAIR principles obviate the need for additional tooling to support a data model, such as software libraries. The implication is that data and data models that embrace these technologies and principles will automatically be useful to existing applications and libraries. However, it is perfectly possible, and common, to create data silos with data represented in RDF.
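To make that concrete, here is a minimal sketch using the rdflib Python library; the records and the `my:` vocabulary below are invented for illustration. Both graphs are perfectly valid RDF, but a query written against shared schema.org terms can only see one of them:

```python
# A minimal sketch (invented data and vocabulary) of an RDF "silo":
# both graphs are valid RDF, but a query written against shared
# schema.org terms only works on the graph that uses those terms.
from rdflib import Graph

SHARED = """
@prefix schema: <https://schema.org/> .
<https://example.org/work/1> a schema:Painting ;
    schema:name "The Milkmaid" ;
    schema:creator "Johannes Vermeer" .
"""

SILOED = """
@prefix my: <https://example.org/my-vocab#> .
<https://example.org/work/1> a my:ArtObject ;
    my:title "The Milkmaid" ;
    my:maker "Johannes Vermeer" .
"""

QUERY = """
PREFIX schema: <https://schema.org/>
SELECT ?name ?creator WHERE {
    ?work schema:name ?name ; schema:creator ?creator .
}
"""

for label, data in [("shared", SHARED), ("siloed", SILOED)]:
    g = Graph()
    g.parse(data=data, format="turtle")
    print(label, list(g.query(QUERY)))  # "siloed" prints an empty list
```

Both graphs pass every syntactic test for Linked Data; only the first one participates in a shared ecosystem.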

The absence of a supporting software ecosystem around data models like VRA Core exacerbates the very problems of integration, interoperability, and reuse that the models are ostensibly trying to solve. Few publicly available applications and software libraries appear to have adopted these data models. A software developer who wishes to use one of the models has to write new code for that purpose. The same developer might have to create new data as well, since there are few publicly available datasets for any given data model. These are common challenges for emerging standards, which is why established standards bodies such as the W3C and the IETF insist on multiple independent implementations, test suites, and other processes that demonstrate a standard's viability.
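The one-off glue code that results looks something like the sketch below. The element names echo VRA Core's work/title/agent structure, but the namespace URI and record layout here are illustrative assumptions, not the normative schema:

```python
# A sketch of the bespoke mapping code a developer must write to consume
# a model with no software ecosystem. Element names echo VRA Core's
# work/title/agent structure, but the namespace URI and exact layout
# are illustrative assumptions, not the normative schema.
import xml.etree.ElementTree as ET

VRA_NS = {"vra": "http://example.org/vracore"}  # placeholder namespace

RECORD = """
<vra:work xmlns:vra="http://example.org/vracore">
  <vra:titleSet><vra:title pref="true">The Milkmaid</vra:title></vra:titleSet>
  <vra:agentSet><vra:agent><vra:name>Johannes Vermeer</vra:name></vra:agent></vra:agentSet>
</vra:work>
"""

def parse_work(xml_text: str) -> dict:
    """Map one record into a plain dict the rest of an application can use."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("vra:titleSet/vra:title", namespaces=VRA_NS),
        "agents": [el.text for el in root.findall("vra:agentSet/vra:agent/vra:name", VRA_NS)],
    }

print(parse_work(RECORD))  # {'title': 'The Milkmaid', 'agents': ['Johannes Vermeer']}
```

Every developer who adopts the model ends up rewriting some version of this mapping, because there is no shared library to do it for them.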

Contrast this with life sciences data models such as PubChem and the BioPortal ontologies, which are embedded in rich ecosystems of software libraries and applications. Wikidata and schema.org are even more ambitious examples. The data models for these systems were not developed in isolation; they were driven by the needs of users and of task-specific applications that do more than browse and search the data with relatively little interpretation.
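As a sketch of what that ecosystem buys a developer: a single HTTP request to PubChem's public PUG REST API returns structured, immediately usable data, with no bespoke mapping code required (error handling omitted; assumes the third-party requests package):

```python
# One HTTP call to PubChem's public PUG REST API: structured data,
# ready to use, backed by a documented interface and shared tooling.
import requests

url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin"
       "/property/MolecularFormula,MolecularWeight/JSON")
props = requests.get(url, timeout=10).json()["PropertyTable"]["Properties"][0]
print(props["MolecularFormula"], props["MolecularWeight"])
```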

Conceiving of collections as data and representing them as Linked Data are good first steps for the cultural heritage sector, but collection data is only as useful as what you can do with it. Galleries, libraries, archives, and museums are rich in data (and data models) but poor in engaging uses of them. No one needs another siloed collections management system with an incompatible data model, no matter how nice the interface looks. Instead, we should be answering the question: what can we do with collection data integrated from multiple sources that we can't do with any one of those sources alone? Creating another union catalog is an answer, but not a good one; the best applications need richer data models than the lowest common denominator that union catalogs provide. We can do better than that.
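One hypothetical sketch of what such a cross-source question might look like in practice, again using rdflib with invented records: a museum description and an authority file that share an identifier can, once merged, answer a question neither source can answer alone.

```python
# A sketch (invented data) of integration paying off: neither source
# alone connects a painting to its creator's birthplace, but the merged
# graph can, because both use the same identifier for the artist.
from rdflib import Graph

MUSEUM = """
@prefix schema: <https://schema.org/> .
<https://example.org/work/1> schema:name "The Milkmaid" ;
    schema:creator <http://www.wikidata.org/entity/Q41264> .
"""

AUTHORITY = """
@prefix schema: <https://schema.org/> .
<http://www.wikidata.org/entity/Q41264> schema:name "Johannes Vermeer" ;
    schema:birthPlace "Delft" .
"""

g = Graph()
g.parse(data=MUSEUM, format="turtle")
g.parse(data=AUTHORITY, format="turtle")  # merging is just parsing into one graph

QUERY = """
PREFIX schema: <https://schema.org/>
SELECT ?work ?artist ?place WHERE {
    ?w schema:name ?work ; schema:creator ?c .
    ?c schema:name ?artist ; schema:birthPlace ?place .
}
"""
for row in g.query(QUERY):
    print(*row)  # The Milkmaid Johannes Vermeer Delft
```

The merged graph links a painting to its creator's birthplace even though neither source holds both facts, and that kind of payoff, not the catalog itself, is what integration should be measured by.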