Friday, December 9, 2011

Managing metadata in data.fao.org

One day, I was interviewing a data manager when I arrived at the question, “How do you manage metadata in your system?” He looked at me with the smile of a boy caught with his hands in the jam. “Well, we wanted to store metadata, but we didn’t have enough time to spend on the issue, so we don’t handle it at all.”

When people think about metadata, two general ideas come to mind. The first is that metadata is beautiful, valuable and crucial for a nice design. The second is that metadata is difficult to handle, boring, and not really necessary. This second idea is particularly apparent in statistical data systems, where managers are driven by the urgency of producing and disseminating a great deal of numbers as fast as possible.

The good news is that most people handle metadata even when they think they do not. Traditionally, metadata is defined as “data about data”. The need to record more information about data stems from the different purposes that metadata serves. How can a specific resource be found in a catalog? How can a number be interpreted? How has this number been produced? Can I distribute it to other people? How reliable is it? These are all examples of questions that metadata is able to answer. Generally speaking, metadata should help managers and users to:

  • Retrieve a resource from a catalog.
  • Explain how the resource has been produced and all the concepts it refers to.
  • Describe the quality and limits of the resource.
  • Define copyrights and terms of use.

Because metadata serves so many functions, there have been several attempts to classify it into different types. A first distinction can be made between “structural” metadata and “reference” metadata. Structural metadata refers to concepts used in the identification of a resource. Reference metadata, on the other hand, describes, explains, and qualifies resources more generally. A very popular type of metadata that is emerging within social networks is the social metadata added by users, such as tags, ratings, votes and comments. Social metadata can help other users browse data and classify it into categories, find current topics, and assess data quality.

Data.fao.org manages different types of resources such as statistical data, maps and various digital assets. This requires a different metadata structure for each resource type, because a given metadata concept may not apply to all resources, while other concepts may apply at several levels of a resource. Statistical data, for instance, is usually organized into a set of databases, tables, fields and rows, each of them carrying specific metadata concepts. In the multidimensional environment of Data.fao.org this complexity can reach even higher levels of organization. This implies the use of a different metadata structure for each level of organization of a resource, known as the "attachment level" of a metadata concept. The Statistical Data and Metadata Exchange initiative (SDMX) defines the attachment level as “a property of attributes defining the object to which data or metadata are linked.”

Managing different resources and metadata structures is a two-step process. First, an inventory of existing resource types is carried out. Next, existing metadata standards are identified that can accommodate the variability of information in data.fao.org across the various domains and, at the same time, permit easy communication and sharing with external data sources.

The following table shows the resource types, their attachment levels and the metadata standards used.

RESOURCE TYPE | DESCRIPTION | METADATA STANDARD

Statistical data
Database | A thematically organized set of data, normally disseminated to users through a single point of dissemination (for example: website, publication). It is the highest level of metadata information. Technically, a database has a corresponding predefined working environment (workspace) and namespace. It has one or more datasets. | SDMX concepts
Dataset | A container for statistics sharing a common structure, workflow, visibility and release schedule. Approximately equating to a single OLAP cube. A dataset may be updated through a scheduled release or ongoing update. A dataset has a defined number of dimensions and measures. | SDMX concepts
Dimension | An element that categorizes datasets into non-overlapping regions. Dimensions provide structured labeling information to otherwise unordered numeric measures. For example, "Country", "Date", and "Products" are all dimensions that could be applied meaningfully to a trade dataset. It is similar to a categorical variable in statistics. Dimension items are the possible values of the dimension. | SDMX concepts
Measure | A property on which calculations (for example: sum, count, average, minimum, maximum) can be made using pre-computed aggregates. | SDMX concepts
Constituent | A set of facts (one or more) that has a structure that allows it to be uploaded into a specific dataset. A constituent has a defined data source which can act as data producer or data disseminator. While a dataset defines the structure where statistical data is stored, a constituent defines the content of this data. Data source varies according to the data collection methodology of a statistical system. It can be a national statistical office, an expert in the field, a specific publication or document. | SDMX concepts

Digital assets
Documents | | Dublin Core, MODS
Photographs | | IPTC
Video | |
GeoLayers | All vector and raster geospatial information. | ISO 19115
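
The idea of an attachment level can be illustrated with a small sketch. This is only an illustration in Python, not the data.fao.org implementation: the class names, the concept name "unit_of_measure" and the rule that each concept lists the levels it may be attached to are all assumptions made for the example. Only the five level names come from the table above.

```python
from dataclasses import dataclass, field

# Levels of organization of statistical data, taken from the table above.
ATTACHMENT_LEVELS = ("database", "dataset", "dimension", "measure", "constituent")


@dataclass(frozen=True)
class MetadataConcept:
    """A metadata concept and the levels it may be attached to (hypothetical model)."""
    name: str
    allowed_levels: frozenset


@dataclass
class Node:
    """One object in the statistical hierarchy, e.g. a database or a measure."""
    name: str
    level: str
    metadata: dict = field(default_factory=dict)

    def attach(self, concept, value):
        # Reject unknown levels and concepts that do not apply at this level.
        if self.level not in ATTACHMENT_LEVELS:
            raise ValueError("unknown attachment level: " + self.level)
        if self.level not in concept.allowed_levels:
            raise ValueError(concept.name + " cannot be attached at level " + self.level)
        self.metadata[concept.name] = value


# Hypothetical concept that only makes sense for a measure.
unit = MetadataConcept("unit_of_measure", frozenset({"measure"}))
production = Node("production", "measure")
production.attach(unit, "tonnes")
```

Attaching the same concept to a database node would raise a ValueError, which is the kind of check an attachment level makes possible.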

Author: Gianluca Franceschini

Thursday, December 1, 2011

data.fao.org API implementations


The data.fao.org website is being designed and developed following the Service Oriented Architecture (SOA) approach that includes a clear decoupling between the presentation and the process layer (see previous article for more information). This means that many resources such as information, maps, documents, photos, and more, can be easily shown in data.fao.org. It also means that operations performed by the website will be provided by a service. These processes will also be accessible directly through data.fao.org API implementations, without the intermediation of the Web interface.

The objective of the project is to provide useful APIs that help achieve our goal of bringing together data from many different FAO sources. There are two kinds of APIs provided by data.fao.org:

  • Resource management APIs
  • Process APIs

Resource Management APIs

The resource management APIs will allow you to add, retrieve, update, delete, browse and search resources inside the data.fao.org catalog, through Web services in a RESTful style (http://en.wikipedia.org/wiki/RESTful).

Resource management APIs provide the following two ways to access a resource:

  • By a Universally Unique Identifier (UUID). For example: 22fcb0de-64bc-4ae0-a8f1-f2a190c79c1b
  • By a “friendly name”, composed of the resource type collection name (for example: datasets, images, photos, documents, etc.), the database name (for example: glipha, countryprofiles, aquastat, etc.) and the resource name (for example: livestock-production, flag-argentina, etc.).

The UUID is more suitable for application-to-application communication because it is efficient, compact and immutable: unlike the “friendly name”, it never changes, even if the resource is renamed or moved to another database.

The “friendly name”, by contrast, is easier for a human user because it is a meaningful identifier, although it is unique only within the data.fao.org context. The UUID is a “machine” code and therefore hard to remember.

In data.fao.org, a resource is a complex object that can be composed of more than one component. For example, a photo is a resource composed of a set of metadata and one or more binary streams containing the image files at different resolutions, while a statistical dataset can be composed of metadata, statistical data and a data structure. The resource components in data.fao.org are called “datastreams” and can be managed through the resource management APIs.

The following URL patterns are used to access the resource descriptors and the datastreams:

  • http://data.fao.org/uri/<resource-uuid>[/<datastream>]?<parameters>
  • http://data.fao.org/resources/<resource-type-collection-name>/<database>/<resource-name>[/<datastream>]?<parameters>
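
As a rough illustration, the two URL patterns above could be assembled by a client as follows. This is a minimal Python sketch, not an official client: the helper names are invented for the example, and the datastream name "metadata" and the "lang" parameter in the usage lines are used only as placeholders.

```python
from urllib.parse import quote, urlencode
from uuid import UUID

BASE_URL = "http://data.fao.org"


def _finish(path, datastream=None, params=None):
    """Append the optional datastream segment and the query string."""
    if datastream:
        path += "/" + quote(datastream, safe="")
    query = urlencode(params or {})
    return BASE_URL + path + ("?" + query if query else "")


def url_by_uuid(resource_uuid, datastream=None, params=None):
    """Build http://data.fao.org/uri/<resource-uuid>[/<datastream>]?<parameters>."""
    canonical = str(UUID(resource_uuid))  # raises ValueError if the UUID is malformed
    return _finish("/uri/" + canonical, datastream, params)


def url_by_friendly_name(collection, database, name, datastream=None, params=None):
    """Build the /resources/<collection>/<database>/<resource-name> form."""
    segments = (collection, database, name)
    path = "/resources/" + "/".join(quote(s, safe="") for s in segments)
    return _finish(path, datastream, params)


# Usage with the identifiers quoted in the text; "metadata" is a placeholder datastream.
by_uuid = url_by_uuid("22fcb0de-64bc-4ae0-a8f1-f2a190c79c1b")
by_name = url_by_friendly_name("datasets", "glipha", "livestock-production",
                               datastream="metadata", params={"lang": "en"})
```

Validating the UUID before building the URL is a design choice of the sketch; it simply catches malformed identifiers on the client side.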

Process APIs

Process APIs provide additional processes not strictly related to the CRUD (create, read, update and delete) operations on the data.fao.org resources.

Some examples of process APIs are:

  • Multidimensional query - executes a multidimensional query on the data.fao.org data warehouse and returns the result in different formats (for example: XML, JSON, CSV, etc.).
  • Document conversion - performs the conversion of a document from one format to another (for example: PDF, HTML, DOC, ODT, etc.).
  • Text translation - automatically translates a text fragment from one of the six official FAO languages to another, using FAO specialized vocabulary.
  • Map rendering - renders a map image using static geospatial layers and statistical information.
  • And so on.

The following URL pattern can be used to access the process APIs: http://data.fao.org/api/<api-category>/<api-name>?<parameters>
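
A process API URL following this pattern could be built as sketched below, in Python. The category "text", the API name "translate" and the parameters "from" and "to" are hypothetical placeholders; the real names are not given here and would come from the API technical documentation.

```python
from urllib.parse import quote, urlencode


def process_api_url(category, name, params=None, base_url="http://data.fao.org"):
    """Build <base>/api/<api-category>/<api-name>?<parameters>."""
    path = "/api/{}/{}".format(quote(category, safe=""), quote(name, safe=""))
    query = urlencode(params or {})
    return base_url + path + ("?" + query if query else "")


# Hypothetical category, API name and parameters, for illustration only.
translate_url = process_api_url("text", "translate", {"from": "en", "to": "fr"})
```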

Invoking APIs

To invoke resource management and process APIs, it is necessary to specify an API key called “authKey” (for example, c6fe9f2e-43ec-4fbe-b849-7ec361975d6c), which can be obtained through a self-registration procedure directly on the data.fao.org website. This key can be specified either as a query string parameter or as an HTTP header, and it allows the human user or the client application to access data.fao.org information and processes. In the case of a client application, the key is bound to an IP address or web domain, and the security layer of data.fao.org rejects any call coming from a different machine.
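
The two ways of passing the key might look like this in a Python client. This is a sketch under assumptions: the query parameter is named "authKey" as in the text, but the name of the HTTP header is not stated, so the sketch assumes it is also "authKey". No request is actually sent.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request


def with_auth_query(url, auth_key):
    """Return the URL with the key added as the "authKey" query parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("authKey", auth_key))
    return urlunsplit(parts._replace(query=urlencode(query)))


def with_auth_header(url, auth_key):
    """Return a Request carrying the key in a header (header name is an assumption)."""
    return Request(url, headers={"authKey": auth_key})


KEY = "c6fe9f2e-43ec-4fbe-b849-7ec361975d6c"  # example key quoted in the text
RESOURCE = "http://data.fao.org/resources/datasets/glipha/livestock-production?lang=en"

query_url = with_auth_query(RESOURCE, KEY)
header_request = with_auth_header(RESOURCE, KEY)
# urllib stores header names capitalized, so the key is read back as "Authkey".
stored_key = header_request.get_header("Authkey")
```

Passing the key in a header keeps it out of URLs that may end up in logs or browser history, which is one reason a client application might prefer that option.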

The requests and the responses of the APIs can be localized in different languages. To obtain a result in a specific language it is necessary to specify the query string parameter “lang” or the HTTP header “Accept-Language” (for example: en, fr, ar, etc.).

Moreover, it is also possible to obtain the result of a data.fao.org API call in different formats, either by specifying the required format in the “output” query string parameter (for example: XML, JSON, JSONP, CSV, etc.) or by specifying the corresponding MIME type in the “Accept” HTTP header (for example: application/xml, application/json, text/csv, etc.).
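
Language and format selection, as described above, can be expressed either in the query string or in HTTP headers. The Python sketch below builds both variants without sending them. The helper name is invented, and the exact casing the service expects for the "output" value is an assumption (the sketch passes the value through as written).

```python
from urllib.parse import urlencode
from urllib.request import Request

# MIME types listed in the text for the "Accept" header.
MIME_TYPES = {
    "XML": "application/xml",
    "JSON": "application/json",
    "CSV": "text/csv",
}


def localized_request(url, lang="en", output="JSON", use_headers=False):
    """Build a request asking for a given language and output format."""
    if use_headers:
        headers = {"Accept-Language": lang, "Accept": MIME_TYPES[output.upper()]}
        return Request(url, headers=headers)
    separator = "&" if "?" in url else "?"
    return Request(url + separator + urlencode({"lang": lang, "output": output}))


RESOURCE = "http://data.fao.org/resources/datasets/glipha/livestock-production"

via_query = localized_request(RESOURCE, lang="fr", output="CSV")
via_headers = localized_request(RESOURCE, lang="fr", output="CSV", use_headers=True)
```

The header variant leaves the URL unchanged, which can be convenient when the same resource URL is shared between clients that want different formats.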

Other parameters and HTTP headers can be specified when calling a data.fao.org API. These are described in the API technical documentation.

Author: Dario Dentale