Friday, December 9, 2011

Managing metadata in data.fao.org

One day, I was interviewing a data manager when I arrived at the question, “How do you manage metadata in your system?” He looked at me with the smile of a boy caught with his hands in the jam. “Well, we wanted to store metadata, but we didn’t have enough time to spend on the issue, so we don’t handle it at all.”

When people think about metadata, two general ideas come to mind. The first is that metadata is beautiful, valuable and crucial for a good design. The second is that metadata is difficult to handle, boring, and not really necessary. The second idea is particularly apparent in statistical data systems, where managers are driven by the urgency of producing and disseminating large quantities of numbers as fast as possible.

The good news is that most people handle metadata even when they think they do not. Traditionally, metadata is defined as “data about data”. The need for this extra information arises from the different purposes that metadata serves. How can a specific resource be found in a catalog? How can a number be interpreted? How has this number been produced? Can I distribute it to other people? How reliable is it? These are all examples of questions that metadata is able to answer. Generally speaking, metadata should help managers and users to:

  • Retrieve a resource from a catalog.
  • Explain how the resource has been produced and the concepts it refers to.
  • Describe the quality and limits of the resource.
  • Define copyrights and terms of use.

Because metadata serves so many functions, there have been several attempts to classify it into types. A first distinction can be made between “structural” metadata and “reference” metadata. Structural metadata refers to concepts used in the identification of a resource. Reference metadata, on the other hand, describes, explains, and qualifies resources more generally. A very popular type of metadata that is emerging within social networks is the social metadata added by users, such as tags, ratings, votes and comments. Social metadata can help other users browse data and classify it into categories, find current topics, and assess data quality.

Data.fao.org manages different types of resources, such as statistical data, maps and various digital assets. This requires a different metadata structure for each resource type, because a given metadata concept may not apply to all resources, while other concepts may apply at several levels of a resource. Statistical data, for instance, is usually organized into databases, tables, fields and rows, each carrying specific metadata concepts. In the multidimensional environment of Data.fao.org this complexity can reach even higher levels of organization. Each level of organization of a resource therefore needs its own metadata structure; the level at which a metadata concept applies is known as its "attachment level". The Statistical Data and Metadata Exchange initiative (SDMX) defines the attachment level as “a property of attributes defining the object to which data or metadata are linked.”
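
The idea of an attachment level can be illustrated with a short sketch. This is not how data.fao.org is implemented; the class names, the two example concepts and the rules about where each concept may be attached are all assumptions made for illustration. Only the level names come from this post.

```python
from dataclasses import dataclass, field
from enum import Enum


class AttachmentLevel(Enum):
    """Levels of organization named in this post for statistical data."""
    DATABASE = "database"
    DATASET = "dataset"
    DIMENSION = "dimension"
    MEASURE = "measure"
    CONSTITUENT = "constituent"


@dataclass(frozen=True)
class MetadataConcept:
    name: str
    allowed_levels: frozenset  # levels at which the concept may be attached


@dataclass
class Resource:
    identifier: str
    level: AttachmentLevel
    metadata: dict = field(default_factory=dict)

    def attach(self, concept: MetadataConcept, value: str) -> None:
        """Store a metadata value, rejecting concepts not valid at this level."""
        if self.level not in concept.allowed_levels:
            raise ValueError(
                f"{concept.name!r} cannot be attached at level {self.level.value!r}"
            )
        self.metadata[concept.name] = value


# Hypothetical concepts and attachment rules, for illustration only.
RELEASE_SCHEDULE = MetadataConcept(
    "release_schedule", frozenset({AttachmentLevel.DATASET})
)
UNIT_OF_MEASURE = MetadataConcept(
    "unit_of_measure", frozenset({AttachmentLevel.MEASURE})
)

trade = Resource("trade", AttachmentLevel.DATASET)
trade.attach(RELEASE_SCHEDULE, "monthly")  # accepted: valid at dataset level

try:
    trade.attach(UNIT_OF_MEASURE, "tonnes")  # rejected: measure-level concept
    rejected = False
except ValueError:
    rejected = True
```

Keeping the allowed levels on the concept, rather than on the resource, means each concept states once where it is meaningful, and every resource is checked against the same rule.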

Managing different resources and metadata structures is a two-step process. First, an inventory of existing resource types is carried out. Then, existing metadata standards are identified that can accommodate the variability of the information held in data.fao.org across its various domains and, at the same time, permit easy communication and sharing with external data sources.

The following table shows the resource types, their attachment levels and the existing metadata standards used.

| RESOURCE TYPE | DESCRIPTION | METADATA STANDARD |
|---|---|---|
| Statistical data | | |
| Database | A thematically organized set of data, normally disseminated to users through a single point of dissemination (for example: website, publication). It is the highest level of metadata information. Technically, a database has a corresponding predefined working environment (workspace) and namespace. It has one or more datasets. | SDMX concepts |
| Dataset | A container for statistics sharing a common structure, workflow, visibility and release schedule. Approximately equating to a single OLAP cube. A dataset may be updated through a scheduled release or ongoing update. A dataset has a defined number of dimensions and measures. | SDMX concepts |
| Dimension | An element that categorizes datasets into non-overlapping regions. Dimensions provide structured labeling information to otherwise unordered numeric measures. For example, "Country", "Date", and "Products" are all dimensions that could be applied meaningfully to a trade dataset. It is similar to a categorical variable in statistics. Dimension items are the possible values of the dimension. | SDMX concepts |
| Measure | A property on which calculations (for example: sum, count, average, minimum, maximum) can be made using pre-computed aggregates. | SDMX concepts |
| Constituent | A set of facts (one or more) that has a structure that allows it to be uploaded into a specific dataset. A constituent has a defined data source which can act as data producer or data disseminator. While a dataset defines the structure where statistical data is stored, a constituent defines the content of this data. Data source varies according to the data collection methodology of a statistical system. It can be a national statistical office, an expert in the field, a specific publication or document. | SDMX concepts |
| Digital assets | | |
| Documents | | Dublin Core, MODS |
| Photographs | | IPTC |
| Video | | |
| GeoLayers | All vector and raster geospatial information. | ISO 19115 |
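
As a minimal sketch, the table can also be written as a lookup from resource type to metadata standard. The dictionary and function names are hypothetical; the entries simply mirror the table, with an empty tuple where the table names no standard and GeoLayers read as using ISO 19115.

```python
# Standards as listed in the table; an empty tuple means the table
# names no standard for that resource type.
STANDARDS_BY_RESOURCE_TYPE = {
    "Database": ("SDMX concepts",),
    "Dataset": ("SDMX concepts",),
    "Dimension": ("SDMX concepts",),
    "Measure": ("SDMX concepts",),
    "Constituent": ("SDMX concepts",),
    "Documents": ("Dublin Core", "MODS"),
    "Photographs": ("IPTC",),
    "Video": (),
    "GeoLayers": ("ISO 19115",),
}


def standards_for(resource_type: str) -> tuple:
    """Return the metadata standards listed for a resource type."""
    try:
        return STANDARDS_BY_RESOURCE_TYPE[resource_type]
    except KeyError:
        # Fail loudly on types that are not in the inventory.
        raise KeyError(f"unknown resource type: {resource_type!r}") from None


document_standards = standards_for("Documents")
video_standards = standards_for("Video")
```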

Author: Gianluca Franceschini
