Tuesday, January 10, 2012

Data harmonization using dense and explicit data relationships


When I first started working at the statistical division of FAO, I was faced with the challenge of proposing an alternative, feasible approach to data harmonization. Data harmonization is the preparatory step for merging, and later publishing, data through models that facilitate exchange inside and outside the Organization. I am currently working on data harmonization for data.fao.org.

Although data exchange can be tackled at almost every layer of the ISO/OSI seven-layer architecture, the solution has to stay consistently within the scope and boundary of data representation. That makes a good premise for applying the principles of ontology modeling and Linked Data. Halfway through prototyping the first version of the divisional semantic knowledge base, I was encouraged that I had taken the right direction when I found out that a Linked Data layer was to be added to the ISO/OSI conceptual model.

Case Study: The Fisheries Linked Open Data project

The Fisheries Linked Open Data (FLOD) project was used to test the approach of data harmonization with semantic technologies. Inside FLOD more than 10 coding systems co-exist consistently. No coding system predominates over any other in the same domain, and their co-existence does not generate conflicts in the available information. They classify entities from the following domains: land and marine geography, land and marine geo-politics, fishery legislation, and fishery techniques, with others planned for connection in the future. FLOD is used as an integrated FIPS data source in geospatial applications and as a data aggregator in the FIGIS web portal, while new use cases are establishing it as the glue among pre-existing FI information systems.

The data harmonization approach taken with FLOD involved repeatedly and extensively applying the following three general steps:

  1. Analyze the data and domain of the data sources that will be harmonized.
  2. Design an ontology module for the data content and domain.
  3. Instantiate explicit relationships that exist in the available data from the included data sources.

In practical terms, the harmonization was accomplished by making sense of the data. For instance, three concurrent codes exist for Denmark: ISO3:DNK, GAUL:69, and UN:208. They are related as equivalent, which means that information systems aware of the relationships can use them interchangeably, for instance to aggregate statistical data referenced by any of these codes. Another example is how Denmark's sovereignty over its Exclusive Economic Zone, and its rights to exploit fishing areas beyond it, are specified by setting relationships between Denmark and these entities. The Denmark FLOD resource and its relationships are distributed openly through the following URL: http://www.fao.org/figis/flod/entities/page/codedentity/00574d13-efe0-45ac-84ec-82630506be21.

This URL is also one of the access protocols for consuming FLOD content, and it is designed very similarly to the URL for the data.fao.org API.
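To make the idea of equivalent codes concrete, here is a minimal sketch in plain Python of how a system aware of the equivalences could aggregate statistics referenced by different codes. It is not how FLOD is implemented: the three Denmark codes come from the example above, while the function names, the ISO3:NOR record, and all figures are invented for illustration.

```python
from collections import defaultdict

# Equivalence statements between codes. The Denmark codes are the ones
# quoted in the text; the way they are stored here is an assumption.
EQUIVALENT = [
    ("ISO3:DNK", "GAUL:69"),
    ("GAUL:69", "UN:208"),
]


def build_canonical(pairs):
    """Map every code to one representative of its equivalence class."""
    parent = {}

    def find(code):
        parent.setdefault(code, code)
        while parent[code] != code:
            parent[code] = parent[parent[code]]  # shorten the chain as we walk
            code = parent[code]
        return code

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    return {code: find(code) for code in list(parent)}


def aggregate(records, canonical):
    """Sum values per entity, treating equivalent codes as the same entity."""
    totals = defaultdict(float)
    for code, value in records:
        totals[canonical.get(code, code)] += value
    return dict(totals)


# Invented figures, referenced by different coding systems.
records = [("ISO3:DNK", 100.0), ("UN:208", 50.0), ("ISO3:NOR", 7.0)]
canonical = build_canonical(EQUIVALENT)
totals = aggregate(records, canonical)
print(totals)
```

The two Denmark records end up under a single entity even though they were filed under different coding systems, and no coding system has to be declared the master: the representative is just whichever code happens to be the root of the class.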

Data Harmonization and data.fao.org

The results obtained from the experience of developing FLOD will be transferred to the design of the metadata catalogue of data.fao.org. This will make the catalogue compatible with the ongoing initiatives to open data silos and share data, and the catalogue is a crucial resource to support the data retrieval and navigation capacities of data.fao.org. The data.fao.org web portal is currently being considered for selection by the PUBLINK consultancy program, promoted by the LOD2 project. The intention is to achieve sustainable results for the metadata catalogue by operating with the awareness and cooperation of leading open data initiatives.

By adopting this data harmonization, data.fao.org gains:

  • The merger of multiple data sources in a network of relationships.
  • Readily available divisional knowledge that is openly accessible through a publishing platform, implementing the principles of Linked Data suggested by Tim Berners-Lee.
  • Empowered users who can formulate complex requests and leverage cross-domain connections. For example: How much fish was caught in 2008 in the Danish Exclusive Economic Zone by vessels that practice fishing with traps and are involved in fishing agreements with the United Kingdom?
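The example question in the last point can be read as a chain of explicit relationships followed across domains. The sketch below uses plain Python with an in-memory set of triples; it is only an illustration of the idea, not the data.fao.org or FLOD query interface. Every identifier, predicate name, and figure in it is invented.

```python
# Explicit relationships as (subject, predicate, object) triples.
# All names here are hypothetical, chosen only to mirror the example question.
TRIPLES = {
    ("eez:DNK", "underSovereigntyOf", "ISO3:DNK"),
    ("vessel:A", "usesGear", "gear:traps"),
    ("vessel:B", "usesGear", "gear:trawl"),
    ("vessel:C", "usesGear", "gear:traps"),
    ("vessel:A", "coveredBy", "agreement:1"),
    ("vessel:B", "coveredBy", "agreement:1"),
    ("agreement:1", "hasParty", "ISO3:GBR"),
}

# Invented catch records (tonnes) attached to the entities above.
CATCHES = [
    {"vessel": "vessel:A", "area": "eez:DNK", "year": 2008, "tonnes": 12.0},
    {"vessel": "vessel:A", "area": "eez:DNK", "year": 2007, "tonnes": 5.0},
    {"vessel": "vessel:B", "area": "eez:DNK", "year": 2008, "tonnes": 30.0},
    {"vessel": "vessel:C", "area": "eez:DNK", "year": 2008, "tonnes": 8.0},
]


def subjects(predicate, obj):
    """Return every subject linked to obj through predicate."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def catch_total(year, sovereign, gear, partner):
    """Total catch in areas under `sovereign`, by vessels using `gear`
    that are covered by an agreement to which `partner` is a party."""
    areas = subjects("underSovereigntyOf", sovereign)
    gear_vessels = subjects("usesGear", gear)
    agreements = subjects("hasParty", partner)
    agreement_vessels = {v for a in agreements for v in subjects("coveredBy", a)}
    vessels = gear_vessels & agreement_vessels
    return sum(
        c["tonnes"]
        for c in CATCHES
        if c["year"] == year and c["area"] in areas and c["vessel"] in vessels
    )


total = catch_total(2008, "ISO3:DNK", "gear:traps", "ISO3:GBR")
print(total)
```

Each filter in the question (geography, fishing technique, legislation) corresponds to one relationship lookup, which is what lets a single request cross domain boundaries. In a real Linked Data setting the same chain would more likely be written as a query against the published graph.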

A final thought…

I would like to leave the reader with one of the questions that typically comes up when adopting semantic technologies, a question that leads to the feeling that sustainable data harmonization with linked data is as promising as it is challenging. Some types of metadata are quite time sensitive, particularly geo-political entities, and this affects the data attached to such metadata. What special requirements are generated by this type of data, and how does linked data propose to accommodate those requirements?

Author: Claudio Baldassarre
