Tuesday, January 24, 2012

Data citations in a virtual world


Citations play an important role in the accuracy and transparency of data reuse. Whenever data is reproduced, it is vital that the author provide access to the original source so that readers can verify its accuracy. Statements and facts that cannot be verified carry less weight. In addition, authors have to pass peer review, and for their research to be taken seriously, the data that they quote from databases has to be reproducible.

A researcher needs to be able to cite FAO statistical databases. In addition to the reasons given above, citing data found within FAO databases also protects intellectual property rights and provides the means to track how much and where the data is being used.

Therefore, it is important that data is reproducible and retrievable, not just in the short term but in the future as well.

Citation challenges

The traditional methods of creating citations based on titles, page numbers, and so on, work well for books and journals, but are of limited use in today’s virtual information world. Take, for example, a printed document that contains citations. It doesn’t matter how long ago the document or the citations were created (e.g. 2 days or 200 years ago); the citations will always remain constant and valid.

On the other hand, citations that focus on virtual data are difficult to reproduce and to maintain; URLs and website titles change constantly for many different reasons, especially when it comes to queries. As a result, URLs provided in citations often lead the user to a page that no longer exists.

It is clear that new methods are needed for creating citations of virtual resources. This is especially true for results from online statistical databases, which can be produced by highly complex queries. How does someone cite the results of a query in a way that is useful both to humans and machines? And not only today but also in the future? The challenge is to create a standardized form of citation that remains retrievable over time while being easy for the author to create.

The data.fao.org solution

To address these issues, data.fao.org is implementing a solution that will make the data retrievable in a consistent and reliable way now and in the future. The data.fao.org citation solution includes the following aspects:

  • Permanent hyperlinks that will retrieve the exact information that the author originally cited.
  • An automated way for the author to create citations. Many different standard citation formats will be available, such as MLA, APA, and so on. This adds to the usability of the site itself.
  • Citations will include a human-readable text part as well as a machine-readable part (a persistent URI).
  • A way to cite data at any level of granularity, for example, an entire database, a dataset, a table, or a single piece of data. Authors, therefore, can refer to all different levels of the site.
  • The option to export the citation into the author’s citation management software, when applicable.
  • Reproducing queries presents another challenging question. Queries can be very complex and long, so putting them into words isn’t efficient. The solution, therefore, is not to include all query parameters as human-readable words in the citation. Instead, the citation should use basic words describing the query, such as those used to cite a table, without including all the specific parameters that were chosen. More importantly, the URI will include the query and be able to reproduce it.
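
The split between a human-readable text part and a machine-readable URI that carries the full query can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the data.fao.org implementation: the base address, the class and function names, the query parameters, and the text layout (which is neither MLA nor APA) are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical base address for persistent identifiers (an assumption,
# not a real data.fao.org endpoint).
BASE_URI = "https://data.example.org/cite"


@dataclass
class DataCitation:
    publisher: str
    title: str      # basic words describing the query, e.g. a table name
    year: int
    accessed: str   # ISO date on which the data was retrieved
    query: dict = field(default_factory=dict)  # every parameter of the query

    def uri(self) -> str:
        """Machine-readable part: the URI carries all query parameters."""
        # Sorting the parameters keeps the URI identical regardless of
        # the order in which they were chosen.
        return f"{BASE_URI}?{urlencode(sorted(self.query.items()))}"

    def text(self) -> str:
        """Human-readable part: describes the query without spelling out its parameters."""
        return (
            f"{self.publisher}. {self.year}. {self.title}. "
            f"Accessed {self.accessed}. {self.uri()}"
        )


def query_from_uri(uri: str) -> dict:
    """Recover the query parameters from the URI so the query can be re-run."""
    return dict(parse_qsl(urlsplit(uri).query))


# Made-up example values, for illustration only.
citation = DataCitation(
    publisher="FAO",
    title="Wheat production by country",
    year=2012,
    accessed="2012-01-24",
    query={"item": "wheat", "area": "IT", "year": "2010"},
)

print(citation.text())
print(query_from_uri(citation.uri()))
```

In this sketch the reader sees only a short description of the table, while the parameters travel inside the URI and can be read back from it. Whether re-running the query returns the same figures years later is a separate question that the sketch does not address.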

    Once the solutions are implemented, data.fao.org will be easier and more convenient to use. Additional benefits include allowing FAO to track the impact of its statistical data, understand where and how it is being used, and improve how it is acknowledged. The solution that we are developing could be applied to other types of databases as well, and may even be of use to organizations outside of FAO.

    Author: James Weinheimer

    4 comments:

    1. The actual problem to me is being able to produce the exact cited query result over time. Many publications (online or paper) cite FAO (statistical) data and might make use of the described citing mechanism: how do we ensure the query will produce the same results after years? In Fishery, statistical datasets are reviewed regularly, potentially updating previously published data (because of updates in classifications, countries merging/splitting/disappearing, ...), meaning the same query could produce different results. Linking the cited persistent URI to a data 'snapshot' might guarantee the result to be invariant.

      Francesco Calderini

      1. Dear Francesco,

        Thank you for your comment. You make an excellent point and it is one of the main issues I want to address. It is vital that the information cited in a paper be subject to verification. Although the information itself may eventually turn out to be untrue or obsolete, it is important that the author and others be able to point to the original information for various reasons; for the authors, they need to demonstrate that they were doing their jobs correctly, but additionally, the incorrect information may explain quite a bit too. Let me give an example that I remember happening not all that long ago...

        In the U.S. before the economic crash, there was a long, ongoing debate among the political pundits about whether the U.S. was in a recession or not. So, the U.S. statistics were cited over and over again, and everybody talked about what economists agreed defined a recession. It went on and on until the U.S. Dept. of Commerce (or whatever agency) said that the statistics for the previous year or so had been incorrectly done, and everything had been updated. Suddenly there was no doubt for anyone that the U.S. had been in a recession for quite some time!

        If those original numbers just disappeared, then a major political debate in the U.S. for that period of time would not make any sense. I can certainly understand that a researcher today or in the future would want to recreate those statistics so that they could better determine certain decisions and arguments.

        Therefore, I think it is absolutely vital to somehow retain the original numbers but, at the same time, it is just as important to let people know when they look at the original numbers that those same numbers may have been updated from the time they first appeared. I think there are several methods to achieve this. Snapshots are one method, freezing and archiving the tables is another, and I am sure there are more ways as well. Adding in maps or graphics will increase the complexity. These are some of the decisions that need to be made.

        Best regards,
        James Weinheimer

    2. James - thanks for a thoughtful post, looking forward to seeing this citation system in action.

      Out of interest - will you expect users to "require citation" in the form above as part of the terms of use of the site? Or will it be optional, with a more general attribution such as "data from data.fao.org" still acceptable?

      1. Dear Tariq,

        Thanks for your comment. This is another good point. The problem with "data from data.fao.org" is similar to saying, "this information is taken from the Financial Times" without giving the page number, section or date of publication. Such a citation is essentially useless because someone who wants to verify the information or do further research has nowhere to even begin looking for it. I would hope that if someone used such a citation for the Financial Times for an article and submitted it for publication, the editor or peer-reviewers would demand additional information. "Data from data.fao.org" seems to be the same thing.

        Requiring the citation in the terms of use is another matter. Of course, that would be a question for the Organization itself to decide, but my own opinion is that it would be much better to make it as easy and user-friendly as possible for someone to add the citation to FAO material. An actual requirement that says "you must use this citation or you cannot use our material," I believe, may be too much to expect from some people and in fact, may even deter some from citing or using FAO materials. That is not the purpose of citations.

        But that is my own opinion.

        Best regards,
        James Weinheimer
