Monday, February 13, 2012

Statistical Data and Metadata Exchange



The Statistical Data and Metadata Exchange (SDMX) initiative (http://www.sdmx.org) sets standards that facilitate the exchange of statistical data and metadata. SDMX is an ISO standard, currently used mainly by large public institutions (BIS, ECB, EUROSTAT, IMF, OECD, UN, World Bank, and so on), many national statistical organizations, and individual researchers.

SDMX is important because it offers a more complete solution for handling statistical data. There is a need to move away from formats like CSV and Microsoft Excel, and from ad hoc and proprietary approaches that are costly, non-standardized and not user friendly. What is needed instead is an infrastructure that enables the automation and harmonization of the statistical process. SDMX provides a model for meeting this need, both conceptually and technically.

SDMX and the semantic web

SDMX is part of the attempt to create a semantic web that is machine readable. Conceptually, SDMX follows the same principles as other semantic web technologies. The semantic web is all about providing information in the most well-formed way possible. The ontology of SDMX is the extremely valuable SDMX Information Model, which is part of the overall SDMX specification. The SDMX Information Model determines how data and metadata are exchanged.

One difference is that SDMX is not expressed in RDF, a common format used in semantic web technologies. The purpose of SDMX is to exchange information in an automated and formal way. SDMX works with ‘contracts’ and versions in order to be unambiguous. The contracts are, for example, the so-called data structure definitions. SDMX is extremely reliable and unambiguous for what it covers but is less contextualized than RDF. SDMX and RDF may nevertheless coexist in the future: one current initiative is trying to link SDMX artifacts to RDF artifacts and vice versa, so that SDMX artifacts could be enriched with RDF metadata.

More information about RDF and SDMX can be found at the following website: http://publishing-statistical-data.googlecode.com/svn/trunk/specs/src/main/html/index.html

Implementing SDMX

There are some challenges that developers face when implementing SDMX in any environment. For example, the SDMX specifications are difficult and frustrating to read for developers who are new to them. Despite the challenges, adopting SDMX means bringing highly valuable knowledge and practices into your organization and following a real, mature standard, because SDMX is developed by senior experts in the field of statistical data.

If you have data and want to use SDMX, you can start by designing the SDMX DataStructure, also referred to as the Data Structure Definition (DSD). In the DataStructure you define your dimensions, attributes and measures. All of them are so-called SDMX concepts. Every concept (dimension, attribute or measure) can be associated with a CodeList. Respecting this SDMX DataStructure, you can then generate an SDMX DataSet. You can publish the SDMX DataStructure, CodeList and DataSet through an SDMX web service. Start with the SDMX CodeList, DataStructure and DataSet and forget the rest until you feel you need it.
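The relationship between CodeList, DataStructure and DataSet can be sketched in a few lines of Python. This is only an illustrative model, not the SDMX Information Model itself and not an SDMX library: the class names, the `validate` helper, and the identifiers `CL_AREA`, `CL_UNIT` and `DSD_PRODUCTION` are all assumptions made up for this example. It also assumes, for simplicity, that attributes are optional while dimensions and measures are required.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CodeList:
    """A named set of allowed codes (hypothetical, simplified)."""
    id: str
    codes: Dict[str, str]  # code -> human-readable label


@dataclass
class Concept:
    """A dimension, attribute or measure, optionally bound to a CodeList."""
    id: str
    role: str  # "dimension", "attribute" or "measure"
    codelist: Optional[CodeList] = None


@dataclass
class DataStructure:
    """The 'contract' that every observation in a DataSet must respect."""
    id: str
    concepts: List[Concept] = field(default_factory=list)

    def validate(self, observation: Dict[str, str]) -> List[str]:
        """Return a list of problems; an empty list means the observation conforms."""
        problems = []
        for concept in self.concepts:
            if concept.id not in observation:
                # Assumption of this sketch: only attributes may be omitted.
                if concept.role != "attribute":
                    problems.append(f"missing {concept.role} {concept.id}")
                continue
            value = observation[concept.id]
            if concept.codelist is not None and value not in concept.codelist.codes:
                problems.append(
                    f"{concept.id}={value} not in code list {concept.codelist.id}"
                )
        return problems


# Hypothetical code lists and data structure for a small production data set.
CL_AREA = CodeList("CL_AREA", {"IT": "Italy", "NL": "Netherlands"})
CL_UNIT = CodeList("CL_UNIT", {"T": "Tonnes"})

DSD = DataStructure("DSD_PRODUCTION", [
    Concept("AREA", "dimension", CL_AREA),
    Concept("TIME_PERIOD", "dimension"),
    Concept("UNIT", "attribute", CL_UNIT),
    Concept("OBS_VALUE", "measure"),
])

# A DataSet is then a collection of observations that respect the DataStructure.
dataset = [
    {"AREA": "IT", "TIME_PERIOD": "2010", "UNIT": "T", "OBS_VALUE": "12.5"},
    {"AREA": "NL", "TIME_PERIOD": "2010", "OBS_VALUE": "7.0"},
]

for obs in dataset:
    print(obs, DSD.validate(obs))
```

The point of the sketch is the order of work described above: define the code lists, bind them to concepts in the DataStructure, and only then produce observations that can be checked against that structure.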

Some things are still missing from SDMX from a development standpoint, although initiatives are under way to solve most of these issues. The following list highlights some drawbacks of developing an SDMX solution:

  • SDMX can be verbose and therefore voluminous. Initiatives such as SDMX-JSON are under way to make it more compact.
  • Support is missing in statistical analysis tools such as R, Excel, Matlab, SAS, and so on.
  • More open source tools are needed to make it easier to use and implement SDMX in different contexts.
  • A real web community is missing. SDMX was developed behind closed doors by its sponsors. Recently the SDMX community has been opening up considerably, but this process should be accelerated. The SDMX forum has only a few posts per month and other communications are hidden. It would be good for SDMX to be influenced much more by open data, open source software, and computer geeks, adopting a Google-like culture.

For more information, the SDMX Dummies website is a helpful resource.

SDMX in data.fao.org

One of the goals of data.fao.org is to offer users a choice of data formats to suit their needs. SDMX is one way for a website to provide this functionality, and it is the natural choice for data.fao.org because it is already used by FAO and FAO’s partners. Looking specifically at the benefits for different user groups, SDMX resources in data.fao.org will be helpful for institutions that want to consume FAO data and metadata in an automated and standardized way. They will also help individual researchers, who can download the data in a uniform way, especially when their tools are SDMX-enabled.

In terms of SDMX versions, the most widely used is Version 2.0. Version 2.1 came out in May 2011 and is a good improvement over the previous version. We currently use V2.0 for data.fao.org because this is the version used by FAO partners. In the future, we may migrate data.fao.org to V2.1 or support both versions.

Web services in SDMX come in two flavors, RESTful and SOAP. The data.fao.org website supports only the SDMX REST specifications, because REST is more straightforward. The SDMX REST specifications were introduced with SDMX V2.1.
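To give a feel for why REST is the more straightforward option, here is a minimal sketch of how SDMX REST query URLs are put together, following the path conventions of the SDMX 2.1 RESTful specification (`/data/{flow}/{key}/{provider}` for data, `/{resource}/{agency}/{id}/{version}` for structures). The base URL, the flow name `PRODUCTION`, the agency `EXAMPLE` and the helper function names are assumptions for illustration only; they are not the actual data.fao.org endpoints.

```python
from typing import List, Optional
from urllib.parse import urlencode

# Hypothetical base URL; replace with the address of a real SDMX web service.
BASE_URL = "https://example.org/sdmx/rest"


def data_url(flow: str,
             key: List[List[str]],
             provider: str = "all",
             start_period: Optional[str] = None,
             end_period: Optional[str] = None,
             base_url: str = BASE_URL) -> str:
    """Build a data query URL.

    `key` holds one list per dimension, in DataStructure order:
    several codes are OR-ed with '+', an empty list is a wildcard.
    """
    key_part = ".".join("+".join(values) for values in key) if key else "all"
    url = f"{base_url}/data/{flow}/{key_part}/{provider}"
    params = {}
    if start_period:
        params["startPeriod"] = start_period
    if end_period:
        params["endPeriod"] = end_period
    if params:
        url += "?" + urlencode(params)
    return url


def structure_url(resource: str,
                  agency: str,
                  resource_id: str,
                  version: str = "latest",
                  base_url: str = BASE_URL) -> str:
    """Build a structure query URL, e.g. for a codelist or a datastructure."""
    return f"{base_url}/{resource}/{agency}/{resource_id}/{version}"


# Italy or the Netherlands, any value for the second dimension, unit T, from 2005.
print(data_url("PRODUCTION", [["IT", "NL"], [], ["T"]], start_period="2005"))
# The latest version of a (hypothetical) code list.
print(structure_url("codelist", "EXAMPLE", "CL_AREA"))
```

A client only needs an HTTP GET on such a URL to retrieve a DataSet, CodeList or DataStructure, whereas the SOAP flavor requires building and parsing an XML envelope for every request.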

Author: Erik van Ingen
