Tuesday, March 27, 2012

Metadata Catalogue Life Cycle: Capturing requirements collectively


In my previous post about data harmonization, I mentioned that data.fao.org was shortlisted by the Publink consultancy program offered by LOD2. The request was evaluated and selected: FAO earned the opportunity to develop the open version of the metadata catalogue in the context of other Open Data initiatives (co-)patronized by the LOD2 project. Because of both the massive amount of data and its commitment to achieving good results, FAO was granted double the amount of consultancy time by LOD2.

An interesting result was achieved while preparing for the first meeting with the LOD2 consultants: a list of questions and requirements in the scope of the Metadata Catalogue Life Cycle. This is a live and evolving list which will be updated, especially because the catalogue will ingest data from different units. It is an asset shared with other activities, such as Master Data Management, which involves data at institutional levels.

data.fao.org needs RDF and Linked Data

data.fao.org is the showcase for a massive amount of institutional data that is exposed and organized for users in a data web portal. The web portal organizes the information along multiple data lenses, for example, landing pages. A landing page is centered on a data subject, for example, “by country”. To achieve this kind of data aggregation, the portal needs to rely on a network of relationships that originate from the data subject as well as branch through it. The web of relationships is instantiated in the metadata catalogue and modeled as RDF graphs, a modeling choice that opens up many other possibilities, such as generating linked data that connects to existing RDF catalogues from other European or international institutions (e.g. Eurostat, World Bank).
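To make the idea of a landing page built from a web of relationships more concrete, here is a minimal sketch in plain Python. It is not the data.fao.org implementation: the triples, the identifiers, and the `about:country` predicate are all hypothetical, and a real catalogue would use an RDF store rather than a list of tuples.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; every identifier
# below is illustrative and does not come from the real catalogue.
TRIPLES = [
    ("dataset:crop-production", "about:country", "country:KEN"),
    ("map:cropland", "about:country", "country:KEN"),
    ("doc:land-tenure", "about:country", "country:KEN"),
    ("dataset:fish-trade", "about:country", "country:ITA"),
    ("dataset:crop-production", "rdf:type", "type:Statistics"),
    ("map:cropland", "rdf:type", "type:Map"),
    ("doc:land-tenure", "rdf:type", "type:Document"),
    ("dataset:fish-trade", "rdf:type", "type:Statistics"),
]


def landing_page(triples, subject_node):
    """Group every resource that points at subject_node by its rdf:type."""
    # Step 1: find the resources related to the data subject.
    related = {s for s, _p, o in triples if o == subject_node}
    # Step 2: branch through each resource to learn what kind of thing it is.
    page = defaultdict(list)
    for s, p, o in triples:
        if s in related and p == "rdf:type":
            page[o].append(s)
    return {kind: sorted(items) for kind, items in page.items()}


if __name__ == "__main__":
    for kind, items in landing_page(TRIPLES, "country:KEN").items():
        print(kind, items)
```

The point of the sketch is that the “by country” lens is just a query that starts at one node and follows relationships outward; adding a new lens means choosing a different starting node, not restructuring the data.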

“FAO questions about Linked Data Life Cycle”

In her post discussing data.fao.org from a high-level point of view, Lorrie Barber uses a simple example to describe “the power of data.fao.org and its capacity to pull data together in one centralized location.” The metadata catalogue is the core source of this power, and we need to control it carefully. We have to control the process of generation, update, interlinking, evolution, and quality; in other words, its life cycle. Three elements are required to implement the metadata life cycle: a good framework of operations, a good set of requirements, and the right technologies in place.

The image provides the LOD2 theoretical framework for the linked data life cycle, created by the groups of experts who are behind each of the phases. For their part, some FAO data experts gathered a first list of relevant questions for each phase:

  • Data extraction (from non-RDF to RDF format)
    • When it comes to converting statistical data in RDF, what is a good level of data resolution to have?
    • What is the best way to proceed when selecting among the vocabularies (terms and properties) to describe data in RDF?
  • Data storage (dataset organization and triplestore solutions)
  • Data refinement (detailed data revision and authoring)
    • What are the tools that are able to implement the best metaphors to present casual users with manual RDF data maintenance tasks?
    • What are the approaches and tools for granting distinct user rights for a specific subset of ontologies?
  • Data interlinking and mashups
    • What are the steps to interlink distributed RDF graphs?
    • How can RDF graphs be queried in distributed datasets or distributed endpoints?
    • How does the existence of distinct linked RDF models interfere with such query capacities?
    • What is the best way to reduce development and maintenance costs of interlinking RDF datasets?
  • Classification and enrichment (of information content with RDF data)
    • What tools can be used to evaluate the proximity of information resources (for example, documents) on the basis of recurrence and proximity of terms in an RDF network?
    • How can semantically marked up web pages be produced from a repository of RDF datasets?
  • Quality analysis (of information content with RDF data)
  • Evolution and repair (data maintenance)
    • How can we ensure that recurrent data updates are as programmatic as possible in the life cycle of linked data?
    • How is versioning handled and exploited in the scenario of recurrent data updates?
  • Search, browse and explore RDF data
    • What are the most user-friendly metaphors that can be used to navigate RDF graphs for graphical interfaces that are intended for casual users?

A call for collaboration

This article has two main purposes: the first one is to share this list, and the second one is to stimulate additional discussion from other FAO adopters of linked data and ontologies, with the understanding that the result will be submitted to semantic domain experts.

Technology-wise, the LOD2 project proposes a technology stack to be implemented in each of the life cycle phases as well as a long list of alternatives that can be selected online.

A big ‘thanks’ in advance to the people who will participate in the discussion to capture requirements. You are welcome and encouraged to post questions and comments here on this blog.

Author: Claudio Baldassarre

Monday, March 19, 2012

Continuous Integration


The best definition of continuous integration comes from Martin Fowler: “Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.”

The data.fao.org implementation

Given that we have a globally dispersed development team that checks code into a version-controlled repository on a daily basis, the continuous integration methodology is the best solution for the development of data.fao.org. The process begins with developers creating their projects from Maven Archetypes that I customize.

A Jenkins-CI server, running on a separate virtual server, polls the Subversion (SVN) repository, runs an automated build, and sends feedback to the development team via email. Every time the code is changed, the following steps are performed on the Jenkins-CI server:

  1. Check out code from SVN.
  2. Compile source code.
  3. Run tests.
  4. Run inspections.
  5. Jenkins stores the produced .war files in Artifactory.
  6. Jenkins deploys the produced .war file on the remote server.
  7. Jenkins runs a downstream project, if present, that performs functional or black box tests.
  8. Integrate database deployment software in Artifactory and then in the test environment.
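The fail-fast character of the steps above can be sketched in a few lines of Python. This is only an illustration of the control flow, not how Jenkins is configured: the step names are shortened, and each step is a placeholder function that stands in for the real checkout, compile, test, and deploy actions.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first failure and report what ran."""
    log: List[str] = []
    for name, action in steps:
        ok = action()
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            # Later steps (store, deploy) must not run on a broken build.
            return False, log
    return True, log


def example_steps(tests_pass: bool) -> List[Step]:
    """Placeholder steps; real ones would invoke SVN, Maven, and so on."""
    return [
        ("checkout", lambda: True),
        ("compile", lambda: True),
        ("test", lambda: tests_pass),
        ("inspect", lambda: True),
        ("store artifact", lambda: True),
        ("deploy", lambda: True),
    ]


if __name__ == "__main__":
    success, log = run_pipeline(example_steps(tests_pass=False))
    print("\n".join(log))
    print("build succeeded" if success else "build failed, team notified")
```

The ordering matters: because a failed test stops the run, a broken .war file is never stored or deployed, and the email feedback can point at the exact step that failed.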

Each night, the Jenkins-CI server runs a job to execute smoke tests, or black box tests, against the test environment. Each build is stored in Artifactory and is available to our developers.

Jenkins and the branching strategy

When a project comes closer to the deadline, and the build is stable, we make an SVN branch, usually named “RC-#version_number”. This strategy gives developers the freedom to continue adding new features in the /trunk directory and, in the meantime, fix the bugs that affect only the release candidate on the branch. When the builds on the RC branch are stable according to the metrics and tests we run with Jenkins-CI, we make an SVN tag from the branch. So for each project, we have one job for the /trunk directory that deploys to the development environment and one for the branch that deploys to the test/qa environment.

On the day of the release, the job that builds from the RC branch is reused to build from the tag, and it deploys to the production environment.
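The relationship between where the code lives in SVN and where it gets deployed can be summarized as a small lookup. The sketch below assumes the conventional SVN layout with /branches and /tags directories next to /trunk; the post only names /trunk and the “RC-” prefix, so the other path names and the function itself are hypothetical.

```python
def target_environment(svn_path: str) -> str:
    """Map an SVN path to the environment its Jenkins job deploys to."""
    parts = [p for p in svn_path.split("/") if p]
    if parts and parts[0] == "trunk":
        # Ongoing feature work goes to the development environment.
        return "development"
    if len(parts) > 1 and parts[0] == "branches" and parts[1].startswith("RC-"):
        # Release candidates are stabilized in test/qa.
        return "test/qa"
    if len(parts) > 1 and parts[0] == "tags":
        # A tag is cut from a stable RC branch and released.
        return "production"
    raise ValueError(f"no deployment rule for SVN path: {svn_path!r}")


if __name__ == "__main__":
    for path in ("/trunk", "/branches/RC-1.0", "/tags/1.0"):
        print(path, "->", target_environment(path))
```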

Maven

To handle builds in multiple environments while managing a large amount of resources that can occasionally vary, we need a standardized structure. Thanks to the Archetype and the Maven Profiles, we solved the issue of organizing information in only a few places. Our archetypes contain a default Maven Profile named env-dev. This profile is for developers and it is where they can put the resources used by applications such as the database connection.

There are two other profiles, named env-test and env-prod, for the test/qa and production environments respectively. Each is defined in its own settings.xml file. These settings.xml files are not included in the archetypes but are stored on a secure path.

When Jenkins-CI builds the package, it uses the profile that matches the environment targeted by the deployment. There are four environments: development, test, QA, and production.
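As a rough illustration of how a build job could pick the profile for its target environment, the sketch below assembles a Maven command line. The profile names come from the text above; the settings file paths are hypothetical placeholders, and mapping both test and QA to env-test is an assumption based on the single test/qa profile. The `-P` (activate profile) and `-s` (alternate user settings file) options are standard Maven command-line flags.

```python
from typing import Dict, List, Optional, Tuple

# environment -> (Maven profile, settings.xml path or None).
# The paths are placeholders; the real files live on a secure path.
PROFILES: Dict[str, Tuple[str, Optional[str]]] = {
    "development": ("env-dev", None),  # profile ships inside the archetype
    "test": ("env-test", "/secure/path/test/settings.xml"),
    "qa": ("env-test", "/secure/path/test/settings.xml"),  # assumed shared
    "production": ("env-prod", "/secure/path/prod/settings.xml"),
}


def maven_command(environment: str) -> List[str]:
    """Build the Maven invocation for the given target environment."""
    profile, settings = PROFILES[environment]
    command = ["mvn", "clean", "package", "-P", profile]
    if settings is not None:
        # Point Maven at the settings file that defines this profile.
        command += ["-s", settings]
    return command


if __name__ == "__main__":
    for env in PROFILES:
        print(env, "->", " ".join(maven_command(env)))
```

Keeping the mapping in one table mirrors the goal stated above: environment-specific information is organized in only a few places, and developers never need the test or production settings on their machines.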

Matter of practice

Continuous integration is a good investment for the data.fao.org development team because it takes care of all the tasks that make life hard for developers. It also matters because behind all technology, including the most viable and promising, there are always people.

Author: Federico Paolantoni

Monday, March 5, 2012

data.fao.org: A bird's-eye view


The data.fao.org website is on the cutting edge of a larger trend – other organizations, both in the United Nations system and in the private sector, are working to make their data more consolidated and more openly available to the public, whether to researchers, journalists, other organizations, or simply to interested individuals. With data.fao.org, FAO is looking towards the future of data, which is about transparency as well as technology.

To get a sneak peek of data.fao.org, take a look at a brief video:

As a knowledge organization, one of FAO’s most appreciated resources is the comprehensive data it produces, which covers everything from food prices to water use to animal disease spread. Until now this information has been housed in a number of systems and websites, making it difficult to find data without having to access multiple websites simultaneously. The data.fao.org website will bring together statistics, maps, pictures, and documents on food and agriculture from throughout the FAO organization in one convenient location.

FAO has some really valuable sources of information that include some databases that have been around for years. With data.fao.org we’re making this information much more widely available, as well as easier to use and share. One of FAO’s current initiatives is to open its doors to society. data.fao.org is meeting that goal from a technological standpoint by opening up the Organization’s data to the world.

Another challenge that data users face is that they are constantly having to chase down information in multiple locations as well as having the burden of accessing data in many different formats. The beauty of data.fao.org is that everything you need is in one place and in the most convenient format of your choice.

So, what does that really mean? Let’s take a look at an example. Let’s say that we are trying to put together the information required to write a report about crop production in Kenya and we need to collect the corresponding statistics, maps, photographs, and documents. We go to data.fao.org, perform a simple search, and we find what we are looking for: a worldwide map of cropland distribution, a database on national crop trade, another database on sub-national production data, a policy paper on land tenure, and a FAO terminology website to quote the official country name. This simple example shows the power of data.fao.org and its capacity to pull data together in one centralized location.

In addition to housing data from many different FAO sources all in one location, data.fao.org includes tools that help you interpret the data, share it, or incorporate it into your own website as well as a powerful search engine to help you find data. More than just providing raw data, however, the site will also provide important details about data quality, how it was gathered, and how it complies with norms and standards.

How does our data stack up?

When data.fao.org goes live, it will contain an immense amount of resources. The site already has over 313,300 maps and 16,729,870 statistics uploaded. We expect this to rapidly grow to 30 terabytes of data and beyond, including photographs, statistics, maps, and documents.

To put the numbers into perspective, let’s think about them in terms of things that we can visualize. For example, if we wanted to print all of the data from data.fao.org, we would have to destroy 20,480,000 trees to make enough paper. Lined up, the paper would stretch 13 football fields long or, stacked up, would be three times the height of the Empire State Building. The printed data would weigh as much as 953 adult elephants.


Author: Lorrie Barber