Metadata
As if the weight of big data isn’t staggering enough, it is also necessary to create data about data! This is an avenue that has long been a niche for librarians. We speak in controlled vocabularies and cataloging systems and are often driven by a desire to organize and categorize information in all of its forms. It makes sense, then, that as information and resources shift and evolve, librarians will apply their knowledge to metadata, which, simply put, is data about data. Just as we use standardized classification systems, e.g. Dewey Decimal and Library of Congress, so too are there standard schema to apply to metadata. While you can develop your own schema, and people, particularly researchers, often do, a homegrown schema makes the data harder to share at a later date.
“In general, the fewer metadata schemas, the better. We use standards to improve interoperability and to reduce unnecessary variation. It is better and easier to adopt something that already exists, is well modelled, and comprehensively supported. If you build one, then you will also have manage and support it for the life span of the records. This includes updates, backwards and forwards compatibility, metadata about the metadata schema, registry and other infrastructure to support its implementation, etc.”
In other words, a big headache, and one that almost ensures that anyone trying to access your data and its unique metadata will also end up with a big headache.
Anyone familiar with library cataloging has experience with MARC, whose formats are standards for the representation and communication of bibliographic and related information in machine-readable form. There are also several schema that are specific to metadata generated from different disciplines.
The following map highlights the most widely used schema for varying disciplines.
Dublin Core
Many of these schema, especially in the sciences, have been adapted from Dublin Core, which was developed in the mid-1990s and seems to be the de facto metadata schema. It contains 15 distinct elements.
| Dublin Core Elements | |
|---|---|
| title | format |
| creator | identifier |
| subject | source |
| description | language |
| publisher | relation |
| contributor | coverage |
| date | rights |
| type | |
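To make the fifteen elements a little more concrete, here is a minimal Python sketch of a simple Dublin Core record written out as XML. The `build_record` helper and the sample values are hypothetical and invented for illustration; the namespace URI is the one published for the Dublin Core element set, version 1.1.

```python
import xml.etree.ElementTree as ET

# Namespace for the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen simple Dublin Core elements.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_record(fields):
    """Return a <metadata> element with one child per supplied field.

    `fields` maps element names to string values. Elements that are not
    supplied are skipped; names outside the fifteen raise ValueError.
    (Hypothetical helper, not part of any Dublin Core tooling.)
    """
    unknown = set(fields) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name in DC_ELEMENTS:
        if name in fields:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = fields[name]
    return root


# Sample values are invented for illustration only.
record = build_record({
    "title": "Stream temperature readings, 2012",
    "creator": "Vaughan, Amelia",
    "date": "2012-06-01",
    "type": "Dataset",
    "language": "en",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Rejecting unknown element names is a design choice of this sketch, not a Dublin Core requirement; it simply keeps a record from drifting into a private schema by accident.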
There are some disadvantages to Dublin Core, chiefly that there are no cataloging rules that determine how data will be entered in the fields. “creator = Amelia Vaughan” and “creator = Vaughan, Amelia” are both accepted. This allows people who adopt Dublin Core to use whatever rules are common in their community, but it also means there is no consistency across different uses of Dublin Core. On the whole, however, it is a widely used and understood system that has been refined and updated over the course of its almost 20-year life span. Even so, the simple version of DC has proven inadequate for addressing the complexities of scientific data, and as such new schema have been developed that greatly expand on the core elements of DC.
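The creator inconsistency is easy to see in practice. The following Python sketch shows one way a community might normalize names before sharing records. The `normalize_creator` function is hypothetical, and it assumes the last word of an uninverted name is the family name, which will not hold for every name.

```python
def normalize_creator(name):
    """Rewrite 'Given Family' as 'Family, Given'.

    Assumption: the last whitespace-separated token is the family name.
    This is a simplification; it fails for multi-word family names and
    for organizations, so treat it as a sketch only.
    """
    name = " ".join(name.split())  # collapse stray whitespace
    if "," in name or " " not in name:
        return name  # already inverted, or a single-word name
    given, family = name.rsplit(" ", 1)
    return f"{family}, {given}"


# Both forms are valid Dublin Core, but they do not match as strings.
records = [
    {"creator": "Amelia Vaughan"},
    {"creator": "Vaughan, Amelia"},
]
normalized = {normalize_creator(r["creator"]) for r in records}
print(normalized)  # {'Vaughan, Amelia'}
```

Without a shared rule like this, a search for one form of the name silently misses records entered in the other form.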
Darwin Core
One such schema is Darwin Core, whose primary purpose is to create a common language for sharing biodiversity data that is complementary to and reuses metadata standards from other domains wherever possible.
An article in PLOS ONE states that “creating this common language can be particularly challenging, since natural history data curation practices have been developed locally and organically over hundreds of years, have varied between disciplines as well as institutions, and have had limited culture of data sharing.” A key component of creating an open access scientific data community will be changing the culture of research to include a plan for managing data in such a way that it can be shared for years to come with those outside of the project.
But I digress! Back to Darwin Core… which hypothetically “greatly increases the value and re-use of freely available and accessible biodiversity data so that they can be effectively mobilized, integrated, and incorporated into other “grand challenge” scientific endeavors.”
Here is an example of some of the added elements in Darwin Core.
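One way to picture these additions is a small Python sketch of a single specimen record. The `dwc:` names are Darwin Core terms and the `dcterms:` names are Dublin Core terms that Darwin Core reuses; the prefixes, the sample values, and the `group_by_vocabulary` helper are assumptions made for illustration, not a complete or authoritative record.

```python
from collections import defaultdict

# Hypothetical occurrence record. All values are invented.
occurrence = {
    "dcterms:type": "PhysicalObject",
    "dcterms:language": "en",
    "dwc:basisOfRecord": "PreservedSpecimen",
    "dwc:scientificName": "Puma concolor",
    "dwc:eventDate": "2011-08-14",
    "dwc:country": "United States",
    "dwc:decimalLatitude": "29.9511",
    "dwc:decimalLongitude": "-90.0715",
}


def group_by_vocabulary(record):
    """Group term names by prefix (the part before the colon)."""
    groups = defaultdict(list)
    for key in record:
        prefix, term = key.split(":", 1)
        groups[prefix].append(term)
    return dict(groups)


groups = group_by_vocabulary(occurrence)
print(groups["dcterms"])  # terms borrowed from Dublin Core
print(groups["dwc"])      # terms added for biodiversity data
```

Grouping the terms this way shows the pattern described above: general description stays with Dublin Core, while the taxonomic, temporal, and geographic detail comes from the Darwin Core additions.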
Other Scientific Data Schema Include:
Access to Biological Collection Data (ABCD), a Dublin Core derivative that is widely used along with Darwin Core. The schema was extended in 2006 to include “Extensions For Geo sciences data”.
The National Science Digital Library Metadata Guidelines (NSDL_DC), which are also based on Dublin Core.
As data sets continue to grow exponentially and the push for open access becomes more forceful, librarians can use their natural skills as outreach and cataloging specialists to lend support and guidance towards implementing metadata schema that will allow data to be shared efficiently and easily for years to come.
Helpful Links
-A guide to metadata schema and standards
http://www.kcoyle.net/meta_purpose.html
-A succinct explanation of metadata and how it applies to libraries
http://researchers.tulane.edu/metadata
-A cohesive list of commonly used metadata schema for various disciplines
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0029715
-Rationale for continuing to develop and implement Darwin Core