Tuesday, January 15, 2013

Country code harmonization in data.fao.org


Most of the data disseminated through the data.fao.org portal and its APIs is country-based information such as agricultural land-use data, production, trade and consumption data, water and agriculture information, and so on. This huge amount of data comes from heterogeneous databases and different FAO applications. One of the goals of the data.fao.org project is to facilitate the retrieval, comparability, and exchange of these country-based datasets. In this context, adopting standard coding systems becomes a core requirement for data interoperability because it defines a common way to identify countries and territories. The goal is to adopt a 'common language' for both the data provider and the data consumer.

In a nutshell, these coding systems provide short alphabetic or numeric geographical codes that represent countries and dependent areas for use in data processing and communications. Several coding systems have been adopted to achieve this goal, both international and FAO-specific.

International coding systems:

  • The ISO 3166-1 standard defines three sets of codes:
    • The alpha-2 - a two-letter code.
    • The alpha-3 - a three-letter code.
    • The numeric - a three-digit numeric code.
  • Many other common international coding systems can be used to identify countries, geographical entities or areas, for example, UNDP codes and DBpedia.

FAO coding systems:

  • The Global Administrative Unit Layers (GAUL) project - an FAO initiative that aims to provide reliable and standardized geographic information on national and sub-national administrative units for all countries in the world. It also provides a mechanism to track boundary changes over time and to maintain a consistent coding system throughout the layers. The GAUL codes are numeric. An updated version of GAUL is currently planned.
  • FAOSTAT - bases its regional classification on the M49 UN classification. FAOSTAT codes are numeric.
  • AGROVOC - a dictionary that identifies and codifies concepts related to food, nutrition, agriculture, fisheries, forestry, environment, and other related domains, as well as country and geographic terms. AGROVOC codes are numeric.

Agrovoc search terms page

Country identification in data.fao.org

The work of identifying countries in data.fao.org was done in collaboration with other FAO projects, such as Name of Countries (NOCs) and Country Profiles. The goal of the NOCs project is to maintain the official country and territory names. The project is currently expanding to provide additional information and to manage country codes. The NOCs and data.fao.org teams have made a joint effort to update and maintain country codes in order to obtain a complete and official reference. The geopolitical ontology reflected in the Country Profiles project gives FAO a master reference for geopolitical information: it maps standard coding systems (UN, ISO, FAOSTAT, AGROVOC, etc.), provides relationships among territories (land borders, group membership, etc.), and tracks historical changes.

In practice, identifying a country in data.fao.org starts with collecting its codes under the different standards. To achieve a consistent mapping of the different country codes, we extracted the codes from the NOCs source for all of the countries and territories of interest. We then inventoried the country codes and other geographical entities coming from the official sources and built a master table from them. Finally, we compared the information from the NOCs source with that from the official standard sources. Where there were differences or gaps, we investigated further in order to identify the official codes and insert the right mappings.
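
The comparison step can be pictured with a short Python sketch. It is only an illustration and not the procedure the team actually ran: the helper `find_discrepancies` and the two dictionaries are hypothetical, and the "source" extract is deliberately wrong and incomplete so that both kinds of problem show up.

```python
def find_discrepancies(source, official):
    """Return entries that are missing from, or differ from, the official reference."""
    issues = {}
    for country, official_code in official.items():
        source_code = source.get(country)
        if source_code is None:
            issues[country] = ("missing", None, official_code)
        elif source_code != official_code:
            issues[country] = ("mismatch", source_code, official_code)
    return issues


# Hypothetical extracts: ISO 3166-1 alpha-3 codes keyed by country name.
official_codes = {"Bangladesh": "BGD", "Italy": "ITA", "France": "FRA"}
source_codes = {"Bangladesh": "BGD", "Italy": "ITL"}  # deliberately wrong and incomplete

discrepancies = find_discrepancies(source_codes, official_codes)
for country, (kind, found, expected) in sorted(discrepancies.items()):
    print(f"{country}: {kind} (found {found!r}, expected {expected!r})")
```

Each reported entry is a case that would need further investigation before the mapping goes into the master table.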

In the framework of data.fao.org, the focus was on the ISO3, ISO2, UN, GAUL, UNDP and FAOSTAT country codes that appear on the country profile page of the portal. For example:

Country name (en): Bangladesh
ISO3: BGD
ISO2: BD
UN: 50
GAUL: 23
UNDP: BGD
FAOSTAT: 16
AGROVOC: 810
UNEP: 76
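
As a rough sketch of how such a master table can be used, the Python snippet below stores the Bangladesh row shown above and translates a code from one system to another. The `CountryCodes` class, the `MASTER_TABLE` list and the `translate` helper are hypothetical names chosen for the example; they do not describe the portal's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountryCodes:
    name_en: str
    iso3: str
    iso2: str
    un: int
    gaul: int
    undp: str
    faostat: int
    agrovoc: int
    unep: int


# One row of a hypothetical master table, using the Bangladesh values listed above.
MASTER_TABLE = [
    CountryCodes("Bangladesh", "BGD", "BD", 50, 23, "BGD", 16, 810, 76),
]


def translate(value, from_system, to_system):
    """Translate a code from one coding system to another; None if unknown."""
    for row in MASTER_TABLE:
        if getattr(row, from_system) == value:
            return getattr(row, to_system)
    return None


print(translate("BD", "iso2", "faostat"))
print(translate(23, "gaul", "iso3"))
```

With one row per country, a data provider can submit data under any of the supported codes and a consumer can request it under any other.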

In data.fao.org, the country codes for each country are displayed on the landing page for the country with the various standards mentioned above.

Country landing page for Bangladesh in data.fao.org

The mapping aspect is crucial and there is a need for good communication with different stakeholders. In fact, the issue cuts across a number of heterogeneous roles: project management, legal, data management, and IT. Country codes depend on governance, international standards, issues related to management, and so on.

In conclusion, this work contributes to the harmonization of country codes at the FAO level, according to international and FAO standards. It also reinforces the metadata of the data that the portal provides by geographical entity and allows us to deliver high-quality work to the end user. This quality is reflected at two levels. First, with a common country code we speak the same language as our providers and can receive integrated data from numerous sources (for instance, the World Bank, countries and so on). Second, and reciprocally, adopting a common language to identify a country allows us to disseminate information through data.fao.org accurately and in a relevant form.

Author: Stéphanie Petit

Wednesday, December 12, 2012

Beta Release One - The foundation for uniting our data

The BIG day has arrived, 12.12.12, and we are pleased to offer the first glimpse of our new data portal, http://data.fao.org.

http://data.fao.org is an innovative web-based platform that brings together statistics, maps and pictures (and soon documents) on nutrition, food and agriculture from throughout FAO, providing easy access, a powerful search engine and data visualizations all from one convenient location.

In this foundation release we are including just a fraction of the data available at FAO and we are well on our way towards data harmonization across the Organization. We have already linked a substantial amount of data:

  • 80 million statistical observations in 45 datasets
  • 230 thousand map layers
  • 500 pictures

This release includes a high level of functionality, which is accessible from most modern web browsers, including those on smartphones and tablets:

  • Search and browse
  • Enrich and share
  • View, cite and download
  • Application Programming Interfaces (APIs) and Embeddable Widgets

Upcoming releases will include:

  • Documents (included with the launch of the new FAO Document Repository).
  • More of our data - there are billions more numbers to release.
  • Type-specific advanced search.
  • Direct data contribution.
  • Language translations (The platform will remain English only until it is more complete).
  • A SPARQL endpoint.

We are very much looking forward to your input, feedback and suggestions. It is only through your on-going participation that we can build on this foundation release.

Author: The FAOdata team

Wednesday, November 14, 2012

Uniting FAO’s master and reference data


‘Master data’ can be loosely defined as any data that is critical for the operations and practices of an organization. This critical data, however, is often stored in many different and dissociated locations throughout an organization. Lack of data interconnectivity often leads to efficiency problems in terms of data management, ownership, and access.

In the article “Gartner says Master Data Management is Critical to Achieving Effective Information Governance”, Gartner Newsroom has more thoroughly defined master data management as “… a technology-enabled business discipline in which business and IT organizations work together to ensure the governance, uniformity, accuracy, stewardship, semantic consistency and accountability of the organization's official, shared master data assets.”

Master and Reference Data Management (MDM) uses technology to synchronize dispersed organizational data in order to make it more easily accessible. It also helps to create a more complete picture of the data itself. The driving force behind creating a solution to effectively manage master and reference data is to make the data more consistent within an organization by creating common frameworks and data definitions.

The following image highlights master data management and business strategy:

From: “The Promise of Master Data Management”, Information Management

The benefits of MDM show up in the daily operations of the organization. De-duplication and de-fragmentation lower financial costs. MDM users can also rely on systems that are already in place, saving the time it would take to build similar solutions from scratch. More specifically, MDM helps to improve cross-functional team collaboration and provides a clearer picture of data ownership and governance.

To read about MDM in more depth, including its function, architecture and principles, take a look at the following resources:

Data management at FAO

Knowledge information systems that are developed at FAO rely on, and make heavy use of, taxonomies, classifications, geographical references, document-tagging, statistical code lists and quality indicators. On the other hand, operational systems (ERP) are built upon similar classifications and taxonomies in domains such as human resources, finance, accounting, and so on.

Every business unit in FAO has approached the issue of master and reference data from a slightly different angle, focusing on different aspects and implementing specific workflows tailored to its own business domain, because there is no corporate data governance and no system that enables it. This is often the cause of inconsistencies, duplications, poor quality, mutual dependencies, and extremely complex management and governance workflows across FAO.

Strengths and weaknesses vary widely, so what might be a great choice for a group that needs to manage a straightforward thesaurus will not be the best option for other users with very specialized needs. Some examples include:

  • Data warehouse systems.
  • Statistical systems such as indicators, attributes, dimensions, and so on.
  • Geographical reference management and mapping systems.
  • Documents or records management, classification, and tagging.
  • Countries, territories, water areas, agro-ecologic zones classifications.
  • Geo-political ontologies.
  • Aquatic species, tree species, and taxonomies.
  • Commodities, livestock, and inventories.
  • Terminologies, vocabularies and glossaries.
  • Organizations, institutes, contacts and contact list management.
  • Enterprise searches.
  • And so on.

One way or another, every data domain at FAO deals with master and reference data (also referred to as metadata, depending on the perspective), one of the most valuable assets of the Organization, and every domain faces similar reference data management issues. Each data domain has its own business focus, for example, statistical, geographical, terminology, referential, organizational or content oriented.

Master and Reference Data Management clearly represents a cross-functional initiative, with significant interest already expressed by many different functional units across FAO, and with clear potential for the whole organization. By nature, MDM would therefore fit perfectly with the “Uniting our data” mantra of the data.fao.org initiative.

In the scope of MDM, there are several features that must be considered a “must have” for the Organization including the following:

  • The ability to manage multiple vocabularies or classifications and their cross-mappings.
  • Services that support multiple languages.
  • Governance that includes data ownership and updated workflows.
  • Import or export routines supporting various formats (RDF, XML, SDMX, CSV, and so on).
  • Integration with other tools and systems, including real-time lookups through open APIs or web services.

Recommendations

Implementing MDM is an essential step to raise the profile of FAO as an authoritative source of controlled terminologies and classifications. In building data exchange mechanisms at regional and national levels, there is a critical need to access international classifications and to understand in which context they are authoritative. MDM in the context of open data access policies through data.fao.org would boost FAO’s capacity to respond to such a need.

MDM would serve member countries by meeting their expectations of FAO’s role as a facilitator of knowledge exchange. It would be a key instrument for fostering interoperability and data exchange between FAO and external organizations. By facilitating open access to authoritative terminology and classification sources, it would enhance recognition of FAO sources by external systems. As a result, integrating FAO references in these systems would facilitate content interoperability and exchange between FAO and other agencies.

In conclusion, in order to realize the full benefits of this initiative, it is necessary for FAO to centrally support MDM functions through the implementation of a corporate Master and Reference Data Management solution that is positioned at the core of the enterprise architecture. It would thereby support the data.fao.org initiative as well as overall master data management and governance at FAO, possibly encompassing or linking both knowledge and operational systems.

The focus should be on the following activities, as further laid out in the article “The What, Why, and How of Master Data Management” by Roger Wolter and Kirk Haselden:

  • Identify sources of master data.
  • Identify producers and consumers of master data.
  • Analyze the metadata that is associated with the master data.
  • Develop a model for the master data.
  • Choose the software and tools.
  • Modify the systems for producing and consuming.
  • Implement a program of governance and maintenance for the data.

Author: Francesco Calderini

Wednesday, October 3, 2012

Clustering in the data.fao.org environment - Part 2



Quorum, split-brain, fencing

The key to cluster reliability is simplicity: the simpler the cluster, the more stable it will be. Here “simplicity” does not mean that a cluster is a simple object, but that all of its processes and checks have to be easy to understand and fast to execute.

Under normal conditions, a cluster performs tasks that are very linear. Mainly it starts or stops processes and checks that all of the nodes are up and healthy. More modular pieces can be used for specific tasks, as we'll see with the GFS2 file system; however, the basic goal of cluster software is to monitor node status and health.

The first and most dangerous situation that can occur is the loss of node connectivity. The problem with losing network connectivity is that a node cannot tell whether the other nodes are hanging or simply unreachable. This difference is crucial because, while a blocked node cannot do harm, a disconnected node can. When one or more nodes lose connection with the rest of the cluster, it is called 'partitioning'. In cluster software terms this is a 'split-brain' situation: because of the lack of connectivity, each partition believes that it is the main cluster and that the other nodes are not working or have crashed. In this situation each partition tries to start the cluster services. In the following image the cluster is split into two partitions: partition 1 includes nodes A and C and partition 2 includes node B.

The example in the images highlights the danger involved in this situation. Suppose that the cluster uses a disk as a shared resource. In a healthy cluster, access to the disk is controlled by cluster software and each node accesses the disk in an ordered way because the cluster software is aware of all disk access. In the case of a split cluster, each partition wants to be the main cluster. So in partition 1 the shared disk access is ordered between nodes A and C but in partition 2 the disk access is performed by node B only, completely unaware that the same access is being performed by nodes A and C. In this scenario it is certain that some data will be lost.

Without communication it is impossible to know what node B is doing, so cluster software developers have built a clever solution: make sure that node B cannot write to the disk. This is referred to as 'fencing'. Fencing is a general-purpose process that blocks access to a shared resource. Typically, fencing needs some special hardware facility. For example, nodes reach storage area network (SAN) disks through special cards and switches, and a fencing process can communicate directly with the disk switches to block all erratic nodes. Another option is to hard power off the erratic node using special power sockets that the software can switch off. This latter case is called STONITH (a funny acronym that means Shoot The Other Node In The Head). STONITH is often the last, if rude, defense against disaster.

Fencing and STONITH are not enough, however. Partition 2, which includes only node B, also thinks that it is the main cluster and tries to fence partition 1; if it is faster, it succeeds. We cannot allow a single node to survive instead of two, because a lone node is a single point of failure (see part 1 of the clustering in the data.fao.org environment article).

One way to solve the mutual fencing issue is to use a democratic solution known as the 'quorum'. Choosing the right partition without communication means that every partition must agree on a simple rule: the partition that contains more than half of the total number of nodes wins. In our example, partition 1 is the winner because it has two nodes out of three possible votes, so partition 2 loses and deactivates itself. This is a smart solution that works every time, or at least every time a cluster is made up of an odd number of nodes.
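
The majority rule is small enough to write down directly. The following Python sketch is only an illustration of the rule as described here, not code from any particular cluster stack; the function name `has_quorum` is made up for the example.

```python
def has_quorum(partition_size: int, total_nodes: int) -> bool:
    """A partition wins only if it holds more than half of all nodes."""
    return 2 * partition_size > total_nodes


# The three-node example: nodes A and C in partition 1, node B in partition 2.
total = 3
partitions = {"partition 1": ["A", "C"], "partition 2": ["B"]}
for name, nodes in partitions.items():
    state = "keeps running" if has_quorum(len(nodes), total) else "deactivates itself"
    print(f"{name} {nodes}: {state}")
```

Because each partition can evaluate the rule with nothing but its own node count and the configured cluster size, no communication across the split is needed.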

What about even numbers? For example, when a two-node cluster splits, each partition has the same number of nodes, so there is no winning partition. To solve this issue, a third-party arbitrator is typically used. It can be hardware or software, and it runs a quorum process whose main purpose is to break ties using some hardware facility (remember the first rule of clustering: hardware redundancy).
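
One simple way to picture the arbitrator is as an extra vote that only one partition can reach. The Python sketch below makes that assumption explicit; real arbitrators differ in how they decide, and `partition_wins` is a hypothetical helper written for this example.

```python
def partition_wins(partition_size: int, total_nodes: int, reaches_arbitrator: bool) -> bool:
    """Majority check in which a hypothetical arbitrator holds one extra vote."""
    votes = partition_size + (1 if reaches_arbitrator else 0)
    total_votes = total_nodes + 1  # every node plus the arbitrator
    return 2 * votes > total_votes


# A two-node cluster split into two one-node partitions.
node_a_wins = partition_wins(1, 2, reaches_arbitrator=True)
node_b_wins = partition_wins(1, 2, reaches_arbitrator=False)
print(node_a_wins, node_b_wins)
```

With the extra vote the total becomes odd again, so exactly one side of an even split can reach a majority.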

GFS2 Clustered File system

In all modern multitasking systems, file access is controlled by the operating system, which ensures that each program has ordered access to disks. If one program tries to open a file that is already open, the operating system typically replies with an “Access Denied” error or opens the file in read-only mode, to avoid the danger of allowing multiple writers on the same data file. In a multiple-computer environment, an operating system that is not cluster-aware cannot know whether other computers are accessing the disk, so simultaneous writes to the disk can occur. From the point of view of each operating system, the shared disk is a normal disk and therefore freely accessible. To avoid write conflicts, a cluster-aware operating system has to use a distributed lock manager (DLM): specific software that uses network communication to register requests for the shared disk and distribute them to all other cluster nodes.

The image on the right shows a process asking the operating system for exclusive access to file “a”:

  • The operating system uses the DLM to register the request in cache memory.
  • The DLM communicates the request to all other DLMs in the network.
  • Each receiving DLM registers the request in its memory cache.
  • The process is granted exclusive access to the shared disk.

If a process on a different node requests the same file on the shared disk, the operating system puts that process to sleep until the DLM frees the lock. In this way, ordered disk access can be performed.
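
The grant-or-sleep behaviour can be mimicked with a toy lock table in Python. This is a single-process stand-in with no networking, so it leaves out the step where DLMs replicate the request to each other; the class `ToyLockManager` and the process names are invented for the illustration and are unrelated to the real DLM interface.

```python
from collections import defaultdict, deque


class ToyLockManager:
    """Single-process stand-in for a DLM's shared lock table (no networking)."""

    def __init__(self):
        self.holder = {}                   # file name -> process holding the lock
        self.waiting = defaultdict(deque)  # file name -> processes put to sleep

    def request(self, process, file_name):
        """Grant the lock if it is free; otherwise queue the process."""
        if file_name not in self.holder:
            self.holder[file_name] = process
            return True
        self.waiting[file_name].append(process)
        return False

    def release(self, process, file_name):
        """Free the lock and hand it to the next waiting process, if any."""
        if self.holder.get(file_name) != process:
            raise ValueError(f"{process} does not hold the lock on {file_name}")
        if self.waiting[file_name]:
            next_process = self.waiting[file_name].popleft()
            self.holder[file_name] = next_process
            return next_process
        del self.holder[file_name]
        return None


locks = ToyLockManager()
first = locks.request("node1:proc", "a")    # lock is free, so it is granted
second = locks.request("node2:proc", "a")   # lock is taken, so the process waits
woken = locks.release("node1:proc", "a")    # the waiting process now gets the lock
print(first, second, woken)
```

In a real cluster every node keeps a copy of this table in step over the network, which is what lets each operating system treat the shared disk as if it were local.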

The data.fao.org architecture

The data.fao.org project runs cluster software on four nodes together with a GFS2 clustered file system. In this way the application can efficiently load the GeoTIFF files that are stored on a shared disk. From the application's point of view, the shared disk is a local disk. The network locking mechanism is handled by the cluster software and the DLM. As a result, data.fao.org programmers do not have to worry about complex locking procedures and can access files in the normal, easy way, as if they were local files.

Author: Damiano Scaramuzza

Thursday, August 30, 2012

Clustering in the data.fao.org environment - Part 1


At FAO, as in all big organizations that manage large amounts of data, the need for high performance and reliability keeps increasing. When a web service has to search, read and process multiple terabytes of data, a single computer, however powerful, is not enough.

Complex architectures and frameworks often slow systems down, but they are necessary. We need more than a single computer to ensure so-called "high availability", a fact that has held for the last few decades of computer technology. The idea behind clustering is quite simple: multiple computers working together do a better job than just one. Sometimes a simple idea rules a complex world; in this case, the complex world of cluster computing.

Wikipedia states that "a computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system." The "loosely connected" computers in this set are referred to as nodes in the cluster world. There are two main, oversimplified classifications for a cluster:

  • High Availability (HA) clusters - A set of nodes that work together to assure the continuity of a service. If one node fails, another node takes charge of its work. In other words, they work to maximize reliability.
  • High Performance Computing (HPC) clusters - A set of nodes that divide a large job into smaller pieces that run simultaneously. In other words, they work together to maximize performance.

High performance and high availability are not mutually exclusive: both can be found in a single cluster, typically in the case of a large complex system.

In order to increase the reliability of a service, we have to increase the likelihood that the service stays available even when a node fails or becomes unavailable. To do this, we add redundancy. Adding a new node is the easiest form of redundancy.

However, in a cluster, there are many pieces of hardware that can fail independently, such as network cards, switches, power, disks and so on. Therefore, we try to duplicate everything. For example, we could triplicate disk storage or duplicate network cards, cables or power sources, but this process can never be fully completed. Sooner or later we face a single point of failure (SPOF), which in practice means a failure that stops the whole cluster. Our goal is to avoid single points of failure as much as possible.

Obtaining 100% reliability is ideal but it comes with an infinite cost. With redundancy we can reach up to 99.999% reliability, but each additional "nine" after 99% significantly increases operational costs.
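
To make these percentages concrete, the Python sketch below converts an availability figure into expected downtime per year and shows how redundancy raises availability. It rests on two simplifying assumptions that are not from the text above: a year of 365 days, and redundant components that fail independently of each other. The function names are made up for the example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant components, assuming independent failures."""
    return 1 - (1 - single) ** copies


for availability in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_hours_per_year(availability)
    print(f"{availability:.3%} available -> {hours:.2f} hours down per year")

print(f"two 99% components together: {combined_availability(0.99, 2):.4%}")
```

Under these assumptions, 99% availability allows roughly 87.6 hours of downtime per year, while 99.999% allows only about five minutes, which is why every extra "nine" is so much harder to buy.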

High Availability cluster types

The server environment used for data.fao.org focuses on High Availability (HA) clusters with services that run simultaneously. The simplest HA cluster is a two-node cluster. The two nodes can be configured as ‘Active-Passive’ or ‘Active-Active’.

An active-passive cluster configuration involves an active node (also referred to as the master node) and one or more dormant nodes (also referred to as slave nodes). The active node serves the services. The dormant node or nodes are inactive and wait for a failure of the active node. The active and dormant nodes exchange a special network message, referred to as a heartbeat, which signals whether the active node is still functional. If the active node becomes unavailable, the heartbeat vanishes and a dormant node is automatically promoted to take over the active node's place and its services. If the previously active node recovers from failure, it becomes a dormant node and waits in case the new active node fails.
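
The promotion logic can be sketched in a few lines of Python. This is a simplified model rather than the behaviour of any specific cluster software: the class `PassiveNode`, the three-second timeout and the explicit `now` timestamps are all assumptions made so that the example is deterministic.

```python
class PassiveNode:
    """Dormant node that promotes itself when heartbeats stop arriving."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = 0.0
        self.active = False

    def on_heartbeat(self, now: float) -> None:
        """Record the time at which the active node was last heard from."""
        self.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Promote this node if the last heartbeat is older than the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout_seconds:
            self.active = True
        return self.active


standby = PassiveNode(timeout_seconds=3.0)
standby.on_heartbeat(now=10.0)
print(standby.check(now=12.0))  # heartbeat is 2 s old: stay dormant
print(standby.check(now=14.5))  # heartbeat is 4.5 s old: take over
```

A missing heartbeat alone cannot tell a crashed node from an unreachable one, so real clusters combine this check with further safeguards.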

The following image (Fig. 1) is an example of the Active-Passive mode:

Fig. 1 Active-Passive

An Active-Active cluster configuration is similar to an Active-Passive cluster with the main difference being that both nodes are always fully functional and serving services as active nodes. This configuration uses the full power of both nodes and, in some cases, can include the same service in both nodes, doubling the performance of that service. In case of failure, the services that are running on the failed node are automatically started on the peer node. When the failed node comes back to life, the cluster can choose to stop the service, limiting resource consumption, or leave it running on both nodes.

The following image (Fig. 2) is an example of the Active-Active mode:

Fig. 2 Active-Active

The Active-Active configuration is more attractive than Active-Passive but it is typically more complex to configure.

In a follow-up article, I will discuss node management for disks that are accessed by multiple services at the same time, specifically known as a clustered file system. Stay tuned!

Author: Damiano Scaramuzza

Monday, August 13, 2012

data.fao.org testimonials


Ekaterina Dorodnykh

"As a research consultant and PhD student, my job consists of working with different types of data and statistics, frequently utilizing multiple databases as well as international websites. From my years of research experience, I have come to understand the importance of data availability, accurate statistics, and powerful database search engines, and how they are imperative to successful research development.

Currently, I am working with the IT development team at FAO, assisting them with the development and testing of the data.fao.org website, a unique tool that pulls together statistics, maps, documents and pictures on food and agriculture. I was first introduced to the site a month ago and I immediately noticed the convenience of being able to access many different datasets and databases from the same source. Moreover, for the first time, a web-based information technology system is available that allows you to browse, compare and download FAO data from the same location, data that was previously only available through separate databases. I like how technology is being used to improve access to FAO data and to produce visual representations of data through images, graphs and maps. Thus, even external users without a specific background in statistics can find and understand data in a very simple way. At the same time, experienced users have a lot of additional options to help validate and analyze the data by choosing different filters.

In my opinion, the data.fao.org website will soon become an important tool for policy makers, academics and others who are interested in global agricultural issues. The free open access, powerful search engine and comprehensive data visualizations of data.fao.org make it a crucial tool in helping FAO with one of its principal activities: serving as a knowledge network. I am very proud that I can contribute and participate in the development and quality review of data.fao.org."


Vittoria Papa

"As soon as I started interacting with the IT development team at FAO, and had the chance to test the data.fao.org website, I was impressed by the massive amount of work that was behind its development. As a student, I find it really important to have access to websites that can not only provide me with statistical data for my research and presentations but also allow me to share what I find with colleagues and friends with just a couple of clicks, an aspect that I didn’t know was possible until looking at this site. I can “like” the pages that interest me the most, easily ‘follow’ data to receive automatic updates when changes are made to it, or add information to Google + and Pinterest. Also, having the option to directly add comments or suggestions via the feedback feature make the site more personal and accurate.

But what I found really amazing behind the idea of this website is that it transforms a typical “data warehouse” into something more: you can have a quick look at the basic information of a country, read documents about the topics that interest you most, and look for pictures or maps useful to your personal work. If you are just a person curious about the topic of food and agriculture, you will quickly find yourself comparing country statistics and be amazed by what you can learn, as I was. I think it is important to bring this type of information much closer to people. The more data is accessible to people in general, the more we are able to discuss it.

I enjoy watching data.fao.org transform and develop continuously towards its goal and having the opportunity to participate in its development. It’s really stimulating to work on something that you can actually see growing and evolving under your eyes and knowing how useful it will be when it is finished."

Thursday, July 19, 2012

Triplestore and the data.fao.org catalogue

A previous post, Metadata Catalogue Life Cycle: Capturing requirements collectively, described how one of the core components of the data.fao.org architecture is the metadata catalogue where the descriptors of each resource (such as databases, datasets, statistical dimensions, measures, digital assets, etc.) and their relationships are stored as RDF graphs.

With its native RDF support and the power of the SPARQL language, which is specifically designed to query RDF data, the triplestore technology provides a schema-free environment where data can be easily managed, even if its structure cannot be forced into a single normalized schema.

Therefore, from a technical standpoint, the triplestore technology is the most convenient way to store and query information that has the following characteristics:

  • A data schema that is not coherently defined.
  • A wide range of different types of entities to store.
  • An evolving set of relationships between the different entities.
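
To illustrate why a schema-free triple model suits this kind of data, here is a minimal pure-Python sketch of storing and pattern-matching triples. It is a toy and says nothing about how a real triplestore or SPARQL engine works internally; every identifier in it (the `ex:`, `dataset:`, `dim:` and `photo:` names and the `match` helper) is invented for the example.

```python
# Each fact is a (subject, predicate, object) triple; all identifiers are hypothetical.
triples = {
    ("dataset:crops", "rdf:type", "ex:Dataset"),
    ("dataset:crops", "ex:hasDimension", "dim:country"),
    ("dataset:crops", "ex:hasMeasure", "measure:production"),
    ("photo:42", "rdf:type", "ex:Picture"),
    ("photo:42", "ex:depicts", "dataset:crops"),
}


def match(pattern, store):
    """Return the sorted triples matching (s, p, o); None acts as a wildcard."""
    return sorted(
        triple
        for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    )


# Which resources have a declared type?
for subject, _, rdf_type in match((None, "rdf:type", None), triples):
    print(subject, "is a", rdf_type)

# A new kind of relationship needs no schema change: just add another triple.
triples.add(("photo:42", "ex:takenIn", "country:BGD"))
print(match(("photo:42", None, None), triples))
```

Different entity types and newly invented relationships sit side by side in the same structure, which is the property the three characteristics above call for.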

OWLIM 5: The choice for data.fao.org

In terms of a triplestore solution for data.fao.org, we needed one that had the following features:

  • High-performance reasoning and retraction
  • Scalable replication cluster
  • Full-text search
  • Geo-spatial queries

After evaluating several triplestore options, both commercial and open source, we decided to use OWLIM 5, a native, pure-Java RDF database engine with extensive reasoning support. OWLIM 5 has made the development of the data.fao.org catalogue component easier because it offers full compliance with the de facto standard API model provided by the open source OpenRDF Sesame framework, as well as support for the latest SPARQL 1.1 and SPARQL Update languages and protocols.

The following image demonstrates the high level architecture of the data.fao.org catalogue component:

Author: Luciano Blasetti

Wednesday, April 11, 2012

Managing data.fao.org with Chef


The data.fao.org project is ambitious. It is challenging both for the software developers to implement as well as for the operations staff to support. Highly available and highly distributed systems with multiple entry points offer many advantages but they also come at the cost of creating greater complexity with higher maintenance requirements. Just as the distributed architecture for data.fao.org forces its developers to change their mindset, it also forces the operations team to work in a different way. We can no longer treat infrastructure as a set of static artifacts but we must literally treat our “infrastructure as code”, as best described by Stephen Nelson-Smith.

Past Approaches

In the past, the operations team at FAO managed servers with a collection of Bash and Perl scripts. In my time here at FAO, I have written a provisioning script that totals over 900 lines of Bash. This, along with other scripts, allowed us to set up new servers fairly quickly. However, these scripts were brittle, inflexible, and, critically, not idempotent. For this reason, they could only be used for initial setup and could not be used to keep our servers in a known state.

This meant that our servers were subject to what is called 'configuration drift': our machines would begin mutating into uniquely special artifacts, or snowflakes, due to small, individual, and undocumented changes. These snowflakes consumed more and more time because they continued to mutate. Aside from the issue of configuration drift, automating application configuration with shell scripts is a lonely process. There is no ecosystem of shell scripts for automating system administration. This is because Bash does not have any abstraction capabilities upon which to build a framework or reusable modular components. Perl, in my humble experience, is only slightly better in these regards. A Bash script is instantaneous technical debt.

It was painfully obvious to me that I needed to seriously upgrade our tooling. The new tools needed to have the following characteristics:

  • Maintain servers in a known state. No snowflakes!
  • Be part of a larger community that shares code and best practices in order to minimize technical debt.
  • Be easy to get started with so that new staff members could quickly come on board.
  • Be powerful enough to meet our configuration needs.

With these requirements in mind, I looked at three different configuration management tools: Puppet, Chef, and Fabric.

Puppet and Chef are true configuration management systems while Fabric is really a tool for remote job execution. I had used Fabric in the past to make small settings changes to hundreds of servers at the same time. For example, I once used it to update the NTP configuration on all of our servers and then to restart the NTP daemon. I quickly found that Fabric was not sufficient for my needs. While it is great for making ad hoc changes, it was designed for, and is primarily used for, deploying Django applications, and its community is focused accordingly.

I then spent several months working with Puppet. I found that it was very easy to get started with the Puppet DSL, with its easy-to-use primitives for working with system resources. The following is an example of a Puppet 'manifest' for configuring SSH:

These resources are idempotent, meaning that they achieve the same effect no matter how many times they are applied. Another big bonus of using Puppet is that there is a large and supportive community around the project. Help was always prompt and available on IRC and the Puppet mailing list.
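As a rough illustration of what idempotence means in practice, the following plain Ruby sketch ensures that a line is present in a configuration file. It is neither Puppet nor Chef code; the helper name `ensure_line` and the `PermitRootLogin no` setting are only examples chosen for this sketch.

```ruby
require "tempfile"

# Ensure that `line` is present in the file at `path`.
# Returns true if the file had to be changed, false if it was already
# in the desired state. Running it again therefore changes nothing.
def ensure_line(path, line)
  existing = File.exist?(path) ? File.readlines(path, chomp: true) : []
  return false if existing.include?(line)

  File.open(path, "a") { |f| f.puts(line) }
  true
end

# Apply the same "resource" three times against a scratch file.
Tempfile.create("sshd_config") do |file|
  file.close
  3.times { ensure_line(file.path, "PermitRootLogin no") }
  puts File.read(file.path) # the line is present only once
end
```

A naive shell script that simply appends the line would add it again on every run, which is exactly why such scripts are only safe for initial setup.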

While it was easy for me to get started with Puppet, I soon ran into pain points. Puppet has its own configuration language, the Puppet DSL. To extend this DSL beyond the core resource types, you have to write your own Ruby modules. This meant that to do anything moderately complicated I had to learn two distinct languages. Jumping between the two was a serious impedance mismatch and I frequently found myself confused. To make matters worse, the Puppet DSL does not have an interactive Read-Eval-Print-Loop (REPL) like the one provided by Bash, Perl, and Python.

Chef has an internal DSL that provides the best of both worlds. The following Chef code is in a Domain Specific Language (DSL) and it is also valid Ruby code:

The simple fact that the Chef DSL is pure Ruby code provides innumerable benefits, mainly stemming from the fact that the DSL can reuse all of Ruby's tooling. My favorite example of this is Chef's interactive shell, the unfortunately named Shef (Chef’s Shell). I have used it to debug countless issues. Here is an example session:

The ability to use pure Ruby in Chef 'recipes' is one of several features that Chef provides and Puppet does not. It must be noted that Chef has a substantially steeper learning curve than Puppet, but in my opinion it is not so steep as to outweigh its benefits. Additionally, you should keep in mind that Chef is a much younger project than Puppet and has a substantially smaller user base. Despite these concerns, I chose Chef for managing our system configurations because I believe that it is the more productive platform.
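For readers unfamiliar with the term, the toy below sketches what an "internal DSL" is: the vocabulary consists of ordinary Ruby methods, so anything written in it is also plain Ruby. This is not how Chef itself is implemented, and the names `ToyRecipe` and `toy_recipe` are invented for this sketch.

```ruby
# A toy internal DSL. "package" and "service" are ordinary Ruby methods,
# so the block passed to toy_recipe is plain Ruby and can freely use
# loops, variables, and conditionals.
class ToyRecipe
  attr_reader :resources

  def initialize
    @resources = []
  end

  # Record that a package should be installed (nothing is really installed).
  def package(name)
    @resources << [:package, name]
  end

  # Record an action to take on a service.
  def service(name, action:)
    @resources << [:service, name, action]
  end
end

# Evaluate the block with a ToyRecipe as self and return what it declared.
def toy_recipe(&block)
  recipe = ToyRecipe.new
  recipe.instance_eval(&block)
  recipe.resources
end

declared = toy_recipe do
  %w[openssh-server ntp].each { |pkg| package pkg }
  service "sshd", action: :restart
end

p declared
```

Because the block is just Ruby, the usual Ruby tooling (an interactive shell, a debugger, a test framework) works on it without any extra effort, which is the benefit described above.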

Data-Driven Infrastructure

There are many excellent introductory articles on Chef, so I will only highlight some of its more novel features here, particularly the ability to let data drive the configuration of your infrastructure. Chef has two components, data bags and search, that make Chef recipes highly dynamic and remove the need to hardcode information such as IP addresses, host names, database connection values, JVM heap sizes, and so on into recipes. Data bags can be thought of as global variables for your infrastructure.
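The idea of letting data drive configuration can be sketched in plain Ruby as a JSON document keyed by environment and a lookup function. This does not use Chef's data bag API; the keys (`db_host`, `jvm_heap_mb`), host names, and values are hypothetical.

```ruby
require "json"

# Hypothetical item in the spirit of a data bag: one JSON document
# holding the settings for each application environment.
ESB_ITEM = JSON.parse(<<~ITEM)
  {
    "id": "enterprise_service_bus",
    "production":  { "db_host": "db.prod.example.org", "jvm_heap_mb": 4096 },
    "development": { "db_host": "db.dev.example.org",  "jvm_heap_mb": 512 }
  }
ITEM

# Return the settings for one environment, failing loudly if none exist.
def settings_for(item, environment)
  item.fetch(environment) do
    raise KeyError, "no settings for environment #{environment.inspect}"
  end
end

settings = settings_for(ESB_ITEM, "production")
puts "heap: #{settings["jvm_heap_mb"]} MB, database: #{settings["db_host"]}"
```

The code that consumes the settings never mentions a concrete host name or heap size; changing an environment means editing the data, not the logic.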

Let’s start by specifying which server is a ‘production’ server and which one is a development server:

roles/production.rb

roles/development.rb

I didn’t cover roles earlier but the following is a good description that comes from the Chef wiki:

“A role provides a means of grouping similar features of similar nodes, providing a mechanism for easily composing sets of functionality. At web scale, you almost never have just one of something, so you use roles to express the parts of the configuration that are shared by a group of Nodes.”

After these roles are created, they must be applied to the 'nodes', or individual servers. I will not cover this step in this article. The following data bag specifies the values according to the application environment:

data_bags/applications/enterprise_service_bus.json

jboss/recipes/enterprise_service_bus.rb

The search feature is exciting because it allows components to dynamically find each other. Essentially, search allows a node to query the configurations of other nodes. The Infochimps have taken this to the extreme with their silverware cookbook, which allows components to essentially wire themselves together. They use Chef to provision and configure Hadoop clusters. The following small example shows how search can be used to configure a master PostgreSQL server to forward its write-ahead log to slave servers:

First configure the roles…

roles/postgresql_master.rb

roles/postgresql_slave.rb

postgresql/recipes/master.rb

Using this technique, the postgresql master will begin replicating its data to any slave servers added since the last Chef run.
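The principle behind search-driven configuration can be sketched in plain Ruby as a filter over a node inventory. This is not Chef's search API; the inventory, node names, and IP addresses are hypothetical, and only the role names come from the role files mentioned above.

```ruby
# Hypothetical node inventory, standing in for what a Chef server would index.
NODES = [
  { name: "pg1",  roles: ["postgresql_master"], ip: "10.0.0.10" },
  { name: "pg2",  roles: ["postgresql_slave"],  ip: "10.0.0.11" },
  { name: "pg3",  roles: ["postgresql_slave"],  ip: "10.0.0.12" },
  { name: "web1", roles: ["webserver"],         ip: "10.0.0.20" }
].freeze

# Find every node that carries the given role.
def nodes_with_role(nodes, role)
  nodes.select { |node| node[:roles].include?(role) }
end

# Build one "name ip" line per slave. A real recipe would feed this list
# into a configuration template on the master.
def replication_targets(nodes)
  nodes_with_role(nodes, "postgresql_slave").map { |n| "#{n[:name]} #{n[:ip]}" }
end

puts replication_targets(NODES)
```

Because the list of slaves is computed on every run rather than written down, a newly registered slave is picked up automatically the next time the master's configuration is generated.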

Image courtesy of Warwick Poole http://warwickp.com/

Java's Special Challenges

Managing Java applications presents special challenges. Most popular applications can be installed from system packages like .rpm or .deb. For whatever reason, the RPM and Debian packages available for Java-related software are long out of date. For example, the main provider of Java-related RPMs is jpackage.org, which has no packages available for Tomcat 7. The latest JBoss package was uploaded in May of 2009. There is a pronounced disconnect between the Java developer community and the Linux distribution packages.

To make matters worse, no version of the Oracle HotSpot JDK is available as a system package. Oracle forces you to download it directly from their website using a browser. You cannot automate this process without violating Oracle’s legal terms. Many Linux systems administrators create their own RPM and Debian packages in private repositories using tools like fpm.

I have not taken this route as I do not want to have to maintain my own private repository. Also, any Chef recipe that relies on a private repository is hard for other people to reuse. Installing a Java application is actually far less complicated than installing an application that must be compiled. Instead I created the "ark" resource that downloads a compressed file from a given URL, unpacks the file to a given location, and optionally updates the system PATH variable. The ark resource is explained in great detail on the developsanywhere blog. The following quick example shows how I use ark to install the Java development kit (JDK):

When you run this resource, it downloads a tarball, unpacks it to the /usr/local/jdk directory and creates the following symbolic link: /usr/local/bin/java -> /usr/local/jdk/bin/java

Take note that I specified a dummy URL, http://www.example.com, for the Oracle JDK. I actually use a private webserver to serve the Java tarball that I downloaded earlier using a desktop machine.

The next issue I encountered was how to download dependencies such as JDBC connectors, EJBs, and other artifacts. At the FOSDEM 2012 conference in Brussels, I was lucky enough to meet Carlos Sanchez, whose puppet-maven module solves this issue elegantly by sourcing Java artifacts from public or private Maven repositories.

I have ported most of his puppet-maven module to Chef (https://github.com/bryanwb/chef-maven), building on top of the existing maven cookbook. Here is an example of how I use it to source the JDBC connector for PostgreSQL:

Now I will take some time to walk you through the configuration of a basic JBoss application. This is a simplified example. You can see the complete recipe on github.com: https://github.com/bryanwb/chef-jboss/blob/master/recipes/standalone_jdbc.rb

The default values for all JBoss applications are in the jboss/attributes/default.rb file. However, I override them with the values that are specific to the ESB application. I put those values in the roles/esb.rb file:

Next, I create a data bag that holds the values specific to each application environment:

data_bags/applications/esb.json

Here is a simplified subset of the actual recipe code:

cookbooks/jboss/recipes/standalone_jdbc.rb

This may seem like a massive amount of custom code but you should consider how little technical debt it contains. This JBoss recipe is built on established patterns and tooling from the Chef community. Anyone with Chef experience can come in and understand these recipes in a very short period. The same cannot be said of homegrown Bash and Perl scripts.

Future Plans

Image courtesy of Warwick Poole http://warwickp.com/

I hope this article has given you a sense of how we use Chef to support the data.fao.org project. It is a basic overview and does not cover the full extent of how we intend to use Chef. In the future, I will implement high-availability and load-balancing configurations. Furthermore, I plan to use cucumber to test cookbooks with tools like minitest, simple_cuke or cuken.

Additional Resources

Author: Bryan Berry

Tuesday, March 27, 2012

Metadata Catalogue Life Cycle: Capturing requirements collectively


In my previous post about data harmonization, I mentioned that data.fao.org was shortlisted for the Publink consultancy program offered by LOD2. The request was evaluated and selected: FAO earned the opportunity to develop the open version of the metadata catalogue in the context of other Open Data initiatives (co-)patronized by the LOD2 project. Because of both the massive amount of data and the commitment to achieving good results, FAO was granted double the amount of consultancy time by LOD2.

An interesting result was achieved while preparing for the first meeting with the LOD2 consultants: a list of questions and requirements in the scope of the Metadata Catalogue Life Cycle. This is a live and evolving list which will be updated, especially because the catalogue will ingest data from different units. It is an asset shared with other activities, such as Master Data Management, which involves data at institutional levels.

data.fao.org needs RDF and Linked Data

data.fao.org is the showcase for a massive amount of institutional data that is exposed and organized for users in a data web portal. The web portal organizes the information along multiple data lenses, for example, landing pages. A landing page is centered on a data subject, for example, “by country”. To achieve such data aggregation effects, the portal needs to rely on a network of relationships that originate from the data subject as well as branch through it. This web of relationships is instantiated in the metadata catalogue and modeled as RDF graphs, a modeling choice that opens up many other possibilities, such as generating linked data that connects to existing RDF catalogues from other European or international institutions (e.g. Eurostat, World Bank).

“FAO questions about Linked Data Life Cycle”

In her post discussing data.fao.org from a high-level point of view, Lorrie Barber uses a simple example to describe “the power of data.fao.org and its capacity to pull data together in one centralized location.” The metadata catalogue is the core source of this power and we need to control it carefully. We have to control the process of generation, update, interlinking, evolution, and quality; in other words, its life cycle. Three elements are required to implement the metadata life cycle: a good framework of operations, a good set of requirements, and the right technologies in place.

The image provides the LOD2 theoretical framework for the linked data life cycle, created by groups of experts who are behind each of the phases. On the other hand, some FAO data experts gathered the first list of relevant questions per phase:

  • Data extraction (from non-RDF to RDF format)
    • When it comes to converting statistical data into RDF, what is a good level of data resolution to have?
    • What is the best way to proceed when selecting among the vocabularies (terms and properties) to describe data in RDF?
  • Data storage (dataset organization and triplestore solutions)
  • Data refinement (detailed data revision and authoring)
    • What are the tools that are able to implement the best metaphors to present casual users with manual RDF data maintenance tasks?
    • What are the approaches and tools for granting distinct user rights for a specific subset of ontologies?
  • Data interlinking and mashups
    • What are the steps to interlink distributed RDF graphs?
    • How can RDF graphs be queried in distributed datasets or distributed endpoints?
    • How does the existence of distinct linked RDF models interfere with such query capacities?
    • What is the best way to reduce development and maintenance costs of interlinking RDF datasets?
  • Classification and enrichment (of information content with RDF data)
    • What tools can be used to evaluate the proximity of information resources (for example, documents) on the basis of recurrence and proximity of terms in an RDF network?
    • How can semantically marked up web pages be produced from a repository of RDF datasets?
  • Quality analysis (of information content with RDF data)
  • Evolution and repair (data maintenance)
    • How can we ensure that recurrent data updates are as programmatic as possible in the life cycle of linked data?
    • How is versioning handled and exploited in the scenario of recurrent data updates?
  • Search, browse and explore RDF data
    • What are the most user-friendly metaphors that can be used to navigate RDF graphs for graphical interfaces that are intended for casual users?

A call for collaboration

This article has two main purposes: the first one is to share this list, and the second one is to stimulate additional discussion from other FAO adopters of linked data and ontologies, with the understanding that the result will be submitted to semantic domain experts.

Technology-wise, the LOD2 project proposes a technology stack to be implemented in each of the life cycle phases as well as a long list of alternatives that can be selected online.

A big ‘thanks’ in advance to the people who will participate in the discussion to capture requirements. You are welcome and encouraged to post questions and comments here on this blog.

Author: Claudio Baldassarre

Monday, March 19, 2012

Continuous Integration


The best definition of continuous integration comes from Martin Fowler: “Continuous Integration is a software development practice where members of a team integrate their work frequently, usually each person integrates at least daily - leading to multiple integrations per day. Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible. Many teams find that this approach leads to significantly reduced integration problems and allows a team to develop cohesive software more rapidly.”

The data.fao.org implementation

Given that we have a globally dispersed development team that checks code into a version-controlled repository on a daily basis, the continuous integration methodology is the best solution for the development of data.fao.org. The process begins with developers writing their projects based on some Maven Archetypes that I customize.

A separate virtual server running Jenkins-CI polls the Subversion (SVN) repository, runs an automated build, and sends feedback to the development team via email. Every time the code is changed, the following steps are performed on the Jenkins-CI server:

  1. Check out code from the SVN repository.
  2. Compile source code.
  3. Run tests.
  4. Run inspections.
  5. Jenkins stores the produced .war files in Artifactory.
  6. Jenkins deploys the produced .war file on the remote server.
  7. Jenkins runs a downstream project, if present, that performs functional or black box tests.
  8. Integrate database deployment software in Artifactory and then in the test environment.

Each night, the Jenkins-CI server runs a job to execute smoke tests, or black box tests, against the test environment. Each build is stored on our Artifactory and is available for our developers.

Jenkins and the branching strategy

When a project comes closer to the deadline, and the build is stable, we make an SVN branch, usually named “RC-#version_number”. This strategy gives developers the freedom to continue adding new features in the /trunk directory and, in the meantime, fix the bugs that affect only the release candidate on the branch. When the builds on the RC branch are stable according to the metrics and tests we run with Jenkins-CI, we make an SVN tag from the branch. So for each project, we have a job for the /trunk directory that deploys in the development environment and one for the branch that deploys in the test/qa environment.

On the day of the release, the job that builds from the RC branch is used to build from the tag, and the result is deployed to the production environment.

Maven

To handle builds in multiple environments while managing a large amount of resources that can occasionally vary, we need a standardized structure. Thanks to the Archetype and the Maven Profiles, we solved the issue of organizing information in only a few places. Our archetypes contain a default Maven Profile named env-dev. This profile is for developers and it is where they can put the resources used by applications such as the database connection.

There are two other profiles, env-test and env-prod, each contained in its own settings.xml file: one for test/qa and the other for production. These settings.xml files are not included in the archetypes but are stored on a secure path.

When Jenkins-CI builds the package, it uses the profile that matches the environment where the package will be deployed. There are four environments: development, test, QA, and production.
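The relationship between the SVN location, the target environment, and the Maven profile can be summarized in a short plain Ruby sketch. The profile names come from the text above; the /branches and /tags layout, the assumption that test and QA share env-test, and the exact Maven goals are assumptions made for this illustration.

```ruby
# Profile used for each environment. Assumption: test and QA share env-test.
PROFILE_FOR = {
  "development" => "env-dev",
  "test"        => "env-test",
  "qa"          => "env-test",
  "production"  => "env-prod"
}.freeze

# Map an SVN path to its target environment: trunk builds go to development,
# release candidate branches to test, and tags to production.
# Assumption: branches live under /branches and tags under /tags.
def environment_for(svn_path)
  case svn_path
  when %r{\A/trunk(/|\z)}       then "development"
  when %r{\A/branches/RC-[^/]+} then "test"
  when %r{\A/tags/[^/]+}        then "production"
  else raise ArgumentError, "unrecognised SVN path: #{svn_path}"
  end
end

# Build the Maven command line that activates the matching profile.
def maven_command(svn_path)
  "mvn clean package -P#{PROFILE_FOR.fetch(environment_for(svn_path))}"
end

puts maven_command("/branches/RC-1.2")
```

Keeping this mapping in one place is the same idea as keeping environment-specific resources in profiles: the build logic stays identical and only the selected profile changes.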

Matter of practice

Continuous integration is a good investment for the data.fao.org development team because it takes care of all the tasks that make life hard for developers. It is also worthwhile because behind all technology, including the most viable and promising, there are always people.

Author: Federico Paolantoni