Tuesday, January 24, 2012

Data citations in a virtual world


Citations play an important role in the accuracy and transparency of data reuse. Whenever data is reproduced, it is vital that the author provide access to the original source so that readers can verify its accuracy. Statements and facts that cannot be verified carry less weight. In addition, authors have to pass peer review, and for their research to be taken seriously, the data that they quote from databases has to be reproducible.

A researcher needs to be able to cite FAO statistical databases. Beyond the reasons given above, citing data found within FAO databases also protects copyright and provides the means to track how much and where the data is being used.

Therefore, it is important that data is reproducible and retrievable, not just in the short term but in the future as well.

Citation challenges

The traditional methods of creating citations, based on titles, page numbers, and so on, work well for books and journals but are of limited use in today’s virtual information world. Take, for example, a printed document that contains citations. No matter how long ago the document or the citations were created (e.g. 2 days or 200 years ago), the citations remain constant and valid.

On the other hand, citations that point to virtual data are difficult to reproduce and to maintain; URLs and website titles change constantly for many different reasons, especially when it comes to queries. As a result, a URL given in a citation often leads the reader to a page that no longer exists.

It is clear that new methods are needed for creating citations of virtual resources. This is especially true for results from online statistical databases, which can be produced by highly complex queries. How does someone cite the results of a query in a way that is useful both to humans and machines? And not only today but also in the future? The challenge is to create a standardized form of citation that remains retrievable over time while being easy for the author to create.

The data.fao.org solution

To address these issues, data.fao.org is implementing a solution that will provide a way for the data to be retrievable in a constant and reliable way now and in the future. The data.fao.org citation solution includes the following aspects:

  • Permanent hyperlinks that will retrieve the exact information that the author originally cited.
  • An automated way for the author to create citations. Many different standard citation formats will be available, such as MLA, APA, and so on. This adds to the usability of the site itself.
  • Citations will include a human-readable text part as well as a machine-readable part (a persistent URI).
  • A way to cite data at any level of granularity, for example, an entire database, a dataset, a table, or a single piece of data. Authors, therefore, can refer to all different levels of the site.
  • The option to export the citation into the author’s citation management software, when applicable.
  • A way to reproduce queries, which presents another challenge. Queries can be very complex and long, so putting them into words isn’t efficient. The solution, therefore, is to not include all query parameters as human-readable words in the citation. Instead, the citation should use basic words describing the query, such as those used to cite a table, without listing all the specific parameters that were chosen. More importantly, the URI will include the query and be able to reproduce it.
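
As a rough illustration of the points above, the following Python sketch pairs a short human-readable citation text with a URI that carries the full query. It is only a sketch under stated assumptions: the base address, the resource path layout, the parameter names, and the citation wording are all hypothetical and do not describe the actual data.fao.org implementation.

```python
from urllib.parse import urlencode

# Placeholder base address; NOT a real data.fao.org endpoint.
BASE_URI = "https://example.org/citation"


def persistent_uri(resource_path, query=None):
    """Build a URI for a resource at any granularity (database, dataset, table).

    Query parameters are sorted so that the same query always yields
    the same URI, regardless of the order in which they were chosen.
    """
    uri = f"{BASE_URI}/{resource_path}"
    if query:
        uri += "?" + urlencode(sorted(query.items()))
    return uri


def format_citation(publisher, year, title, accessed, resource_path, query=None):
    """Return a citation whose text stays short while the URI holds the query."""
    uri = persistent_uri(resource_path, query)
    return f"{publisher} ({year}). {title}. Retrieved {accessed}, from {uri}"


# Hypothetical usage: cite a table together with the query that produced it.
citation = format_citation(
    publisher="FAO",
    year=2012,
    title="Capture production (table)",
    accessed="2012-01-24",
    resource_path="fishery/capture/table",
    query={"year": 2008, "area": "DNK"},
)
print(citation)
```

The human-readable part names only the table, while the specific parameters appear solely in the URI, which is the part a machine would use to reproduce the result.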

    Once these solutions are implemented, data.fao.org will be easier and more convenient to use. Additional benefits include allowing FAO to track the impact of its statistical data, understand where and how it is being used, and improve how it is acknowledged. The solution that we are developing could be applied to other types of databases as well, and may even be of use to organizations outside of FAO.

    Author: James Weinheimer

    Tuesday, January 10, 2012

    Data harmonization using dense and explicit data relationships


    When I first started working at the statistical division of FAO, I was faced with the challenge of identifying an alternative and feasible approach to data harmonization. Data harmonization is the preparatory step for merging, and later publishing, data through models that facilitate exchange inside and outside of the Organization. I am currently working on data harmonization for data.fao.org.

    Although data exchange can be tackled at almost all of the technological layers of the ISO/OSI 7-Layer architecture, the solution has to stay consistently within the scope and boundary of data representation. The conditions for applying the principles of Ontology Modeling and Linked Data are good. Halfway through prototyping the first version of the divisional semantic knowledge base, I was encouraged that I had taken the right direction when I found out that the Linked Data Layer was to be added to the ISO/OSI conceptual model.

    Case Study: The Fisheries Linked Open Data project

    The Fisheries Linked Open Data (FLOD) project was used to test the approach of data harmonization with semantic technologies. Inside FLOD more than 10 coding systems consistently co-exist. No coding system predominates over any other in the same domain, and their co-existence does not generate conflicts in the available information. They classify entities from the following domains: land and marine geography, land and marine geo-politics, fishery legislation, and fishery techniques, with others planned for connection in the future. FLOD is used as an integrated FIPS data source in geospatial applications and as a data aggregator in the FIGIS web portal, while new use cases are establishing it as the glue among pre-existing FI information systems.

    The data harmonization approach taken with FLOD involved repeatedly and extensively applying the following three general steps:

    1. Analyze the data and domain of the data sources that will be harmonized.
    2. Design an ontology module for the data content and domain.
    3. Instantiate explicit relationships that exist in the available data from the included data sources.

    In practical terms the harmonization was accomplished by making sense of the data. For instance, three concurrent codes exist for Denmark (ISO3:DNK, GAUL:69, and UN:208). They are related as being equivalent, which means that they can be used interchangeably by information systems aware of the relationships, for instance, to aggregate statistical data referenced by any of these codes. Another example is how Denmark's sovereignty over its Exclusive Economic Zone and its rights to exploit fishing areas beyond that zone are specified by setting relationships between Denmark and these entities. Denmark's FLOD resources and their relationships are distributed openly through the following URL: http://www.fao.org/figis/flod/entities/page/codedentity/00574d13-efe0-45ac-84ec-82630506be21.
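
To make the idea of explicit equivalence relationships more concrete, here is a minimal Python sketch of how an information system might use them to aggregate data referenced by different codes. The three Denmark codes come from the text above; the function names, the choice of a union-find structure, and the numeric values are assumptions made purely for illustration and are not FLOD's actual implementation or data.

```python
from collections import defaultdict

# Equivalence relationships taken from the text: all three codes denote Denmark.
EQUIVALENT = [("ISO3:DNK", "GAUL:69"), ("GAUL:69", "UN:208")]


def build_canonical(pairs):
    """Map every code to one representative of its equivalence class (union-find)."""
    parent = {}

    def find(code):
        parent.setdefault(code, code)
        while parent[code] != code:
            parent[code] = parent[parent[code]]  # path compression
            code = parent[code]
        return code

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the alphabetically smaller code as the representative.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {code: find(code) for code in list(parent)}


def aggregate(records, canonical):
    """Sum values per entity, treating equivalent codes as the same entity."""
    totals = defaultdict(float)
    for code, value in records:
        totals[canonical.get(code, code)] += value
    return dict(totals)


canonical = build_canonical(EQUIVALENT)

# Placeholder values only; these are not real statistics.
records = [("ISO3:DNK", 10.0), ("UN:208", 5.0), ("ISO3:NOR", 1.0)]
totals = aggregate(records, canonical)
```

Records filed under ISO3:DNK and UN:208 end up in a single total because both codes resolve to the same representative, while a code with no known equivalents is left as it is.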

    This URL is also one of the access points for consuming FLOD content, and it is designed in much the same way as the URLs of the data.fao.org API.

    Data Harmonization and data.fao.org

    The results obtained from the experience of developing FLOD will be transferred to the design of the metadata catalogue of data.fao.org. This will make the catalogue compatible with the ongoing initiatives to open data silos and share data, and the catalogue is a crucial resource to support data.fao.org data retrieval and navigation capacities. The data.fao.org web portal is currently being considered for selection by the PUBLINK consultancy program, promoted by the LOD2 project. The intention is to achieve sustainable results for the metadata catalogue by operating with the awareness and cooperation of leading open data initiatives.

    By adopting this data harmonization, data.fao.org gains:

    • The merger of multiple data sources in a network of relationships.
    • Readily available divisional knowledge that is openly accessible through a publishing platform that implements the principles of Linked Data suggested by Tim Berners-Lee.
    • Empowered users who can formulate complex requests and leverage cross-domain connections. For example: How much fish was caught in 2008 in the Danish Exclusive Economic Zone by vessels that practice fishing with traps and are involved in fishing agreements with the United Kingdom?
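
The example question in the last point can be read as a join across several domains (gear, agreements, areas, and catch statistics). The Python sketch below shows that reading on a tiny in-memory set of relationships. Every identifier, predicate name, and quantity in it is an invented placeholder rather than FLOD content, and a real deployment would more likely run a query against the linked data store than filter Python collections.

```python
# All identifiers, predicates, and quantities are invented placeholders.
TRIPLES = {
    ("vessel:A", "usesGear", "gear:traps"),
    ("vessel:A", "partyToAgreementWith", "country:GBR"),
    ("vessel:B", "usesGear", "gear:trawl"),
    ("vessel:B", "partyToAgreementWith", "country:GBR"),
    ("vessel:C", "usesGear", "gear:traps"),
}

# (vessel, area, year, quantity in arbitrary units)
CATCHES = [
    ("vessel:A", "eez:DNK", 2008, 3.0),
    ("vessel:A", "eez:DNK", 2009, 4.0),
    ("vessel:B", "eez:DNK", 2008, 7.0),
    ("vessel:C", "eez:DNK", 2008, 2.0),
]


def subjects(predicate, obj):
    """Return every subject linked to `obj` through `predicate`."""
    return {s for s, p, o in TRIPLES if p == predicate and o == obj}


def total_catch(area, year, gear, partner):
    """Sum catches in an area and year by vessels matching both relationships."""
    vessels = subjects("usesGear", gear) & subjects("partyToAgreementWith", partner)
    return sum(
        quantity
        for vessel, catch_area, catch_year, quantity in CATCHES
        if vessel in vessels and catch_area == area and catch_year == year
    )


result = total_catch("eez:DNK", 2008, "gear:traps", "country:GBR")
```

Only the vessel that satisfies both the gear and the agreement relationship contributes to the total, which is the kind of cross-domain connection the explicit relationships are meant to enable.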

    A final thought…

    I would like to leave the reader with one of the questions that typically come up when adopting semantic technologies, a question that suggests sustainable data harmonization with linked data is as promising as it is challenging. Some types of metadata are quite time sensitive, particularly geo-political entities. This affects the data attached to such metadata. What special requirements are generated by this type of data, and how does linked data propose to accommodate those requirements?

    Author: Claudio Baldassarre