Metadata Catalogue Life Cycle: Capturing requirements collectively
In my previous post about data harmonization, I mentioned that data.fao.org was shortlisted by the Publink consultancy program offered by LOD2. The request was evaluated and selected: FAO earned the opportunity to develop the open version of the metadata catalogue in the context of other Open Data initiatives (co-)patronized by the LOD2 project. Because of both the massive amount of data and the commitment to achieving good results, LOD2 granted FAO double the amount of consultancy time.
An interesting result was achieved while preparing for the first meeting with the LOD2 consultants: a list of questions and requirements in the scope of the Metadata Catalogue Life Cycle. This is a live and evolving list which will be updated, especially because the catalogue will ingest data from different units. It is an asset shared with other activities, such as Master Data Management, which involves data at institutional levels.
data.fao.org needs RDF and Linked Data
data.fao.org is the showcase for a massive amount of institutional data that is exposed and organized for users in a data web portal. The web portal organizes the information along multiple data lenses, for example, landing pages. A landing page is centered on a data subject, for example, “by country”. To achieve this kind of data aggregation, the portal needs to rely on a network of relationships that originate from the data subject as well as branch through it. The web of relationships is instantiated in the metadata catalogue and modeled as RDF graphs, a modeling choice that opens up many other possibilities, for example, generating linked data that connects to existing RDF catalogues from other European or international institutions (e.g. Eurostat, World Bank).
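As a rough illustration of how a "by country" landing page could be assembled from such a web of relationships, here is a minimal sketch in plain Python. It is not the data.fao.org model: the `example.org` namespace, the property names and the `landing_page` helper are all assumptions made up for this example, and triples are held as simple tuples rather than in a triplestore.

```python
from collections import defaultdict

# Hypothetical namespace and property names, for illustration only.
BASE = "http://example.org/catalogue/"

# Each triple is (subject, predicate, object).
triples = [
    (BASE + "dataset/crop-production", BASE + "prop/coversCountry", BASE + "country/ITA"),
    (BASE + "dataset/fishery-catch", BASE + "prop/coversCountry", BASE + "country/ITA"),
    (BASE + "dataset/fishery-catch", BASE + "prop/coversCountry", BASE + "country/KEN"),
    (BASE + "country/ITA", BASE + "prop/label", "Italy"),
    (BASE + "country/KEN", BASE + "prop/label", "Kenya"),
]


def landing_page(subject, graph):
    """Group the relationships that start at the subject (outgoing)
    and those that point to it from other resources (incoming)."""
    page = {"outgoing": defaultdict(list), "incoming": defaultdict(list)}
    for s, p, o in graph:
        if s == subject:
            page["outgoing"][p].append(o)
        if o == subject:
            page["incoming"][p].append(s)
    return page


italy = landing_page(BASE + "country/ITA", triples)
for predicate, subjects in italy["incoming"].items():
    print(predicate, "<-", subjects)
```

The point of the sketch is only that the landing page is a view computed from the graph: adding a new dataset that covers a country makes it appear on that country's page without any change to the page itself.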
“FAO questions about Linked Data Life Cycle”
In her post discussing data.fao.org from a high-level point of view, Lorrie Barber uses a simple example to describe “the power of data.fao.org and its capacity to pull data together in one centralized location.” The metadata catalogue is the core source of this power and we need to control it carefully. We have to control the process of generation, update, interlinking, evolution, and quality; in other words, its life cycle. Three elements are required to implement the metadata life cycle: a good framework of operations, a good set of requirements, and the right technologies in place.
The image shows the LOD2 theoretical framework for the linked data life cycle, created by the groups of experts behind each of the phases. For their part, some FAO data experts gathered a first list of relevant questions per phase:
- Data extraction (from non-RDF to RDF format)
  - When it comes to converting statistical data to RDF, what is a good level of data resolution to have?
  - What is the best way to proceed when selecting among the vocabularies (terms and properties) to describe data in RDF?
- Data storage (dataset organization and triplestore solutions)
- Data refinement (detailed data revision and authoring)
  - Which tools implement the best metaphors for presenting manual RDF data maintenance tasks to casual users?
  - What are the approaches and tools for granting distinct user rights for a specific subset of ontologies?
- Data interlinking and mashups
  - What are the steps to interlink distributed RDF graphs?
  - How can RDF graphs be queried in distributed datasets or distributed endpoints?
  - How does the existence of distinct linked RDF models interfere with such query capacities?
  - What is the best way to reduce development and maintenance costs of interlinking RDF datasets?
- Classification and enrichment (of information content with RDF data)
  - What tools can be used to evaluate the proximity of information resources (for example, documents) on the basis of recurrence and proximity of terms in an RDF network?
  - How can semantically marked up web pages be produced from a repository of RDF datasets?
- Quality analysis (of information content with RDF data)
- Evolution and repair (data maintenance)
  - How can we ensure that recurrent data updates are as programmatic as possible in the life cycle of linked data?
  - How is versioning handled and exploited in the scenario of recurrent data updates?
- Search, browse and explore RDF data
  - What are the most user-friendly metaphors for navigating RDF graphs in graphical interfaces intended for casual users?
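To make the first phase in the list (data extraction from non-RDF to RDF) a little more concrete, the sketch below turns a small CSV table into N-Triples using only the Python standard library. It is one possible answer to the "data resolution" question, assuming one RDF resource per statistical observation (one per CSV row). The `example.org` namespace, the property names, the column names and the sample values are all invented for illustration; a real conversion would reuse terms from published vocabularies.

```python
import csv
import io

# Hypothetical namespace; the properties below are ad hoc, not from a real vocabulary.
BASE = "http://example.org/stats/"
XSD = "http://www.w3.org/2001/XMLSchema#"

# Made-up sample values, for illustration only.
SAMPLE_CSV = """country,year,item,value
ITA,2010,wheat,6900000
KEN,2010,wheat,512000
"""


def iri(path):
    """Wrap a path under BASE as an N-Triples IRI term."""
    return "<" + BASE + path + ">"


def literal(value, datatype):
    """Build a typed N-Triples literal (values here need no escaping)."""
    return '"' + value + '"^^<' + XSD + datatype + ">"


def rows_to_ntriples(csv_text):
    """Emit one observation resource per CSV row, as N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        obs = iri("obs/{country}-{year}-{item}".format(**row))
        lines.append(f"{obs} {iri('prop/country')} {iri('country/' + row['country'])} .")
        lines.append(f"{obs} {iri('prop/year')} {literal(row['year'], 'gYear')} .")
        lines.append(f"{obs} {iri('prop/item')} {iri('item/' + row['item'])} .")
        lines.append(f"{obs} {iri('prop/value')} {literal(row['value'], 'integer')} .")
    return lines


ntriples = rows_to_ntriples(SAMPLE_CSV)
print("\n".join(ntriples))
```

Choosing one resource per observation keeps every figure individually addressable and linkable, at the cost of four triples per row; a coarser resolution (for example, one resource per dataset) would produce far fewer triples but push the figures out of the graph.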
A call for collaboration
This article has two main purposes: the first one is to share this list, and the second one is to stimulate additional discussion from other FAO adopters of linked data and ontologies, with the understanding that the result will be submitted to semantic domain experts.
Technology-wise, the LOD2 project proposes a technology stack to be implemented in each of the life cycle phases as well as a long list of alternatives that can be selected online.
A big ‘thanks’ in advance to the people who will participate in the discussion to capture requirements. You are welcome and encouraged to post questions and comments here on this blog.
Author: Claudio Baldassarre

Excellent post, Claudio. I would like to comment on the "Classification and enrichment" phase of the life cycle you reference, specifically focusing on the question:
* "How can semantically marked up web pages be produced from a repository of RDF datasets?"
Although not a true RDF dataset, Google's Freebase project (www.freebase.com) offers interesting examples of how to promote the metadata it captures to a wider audience. These examples include:
* Widgets - Such as Freebase Suggest, which allows web page designers to quickly embed search facilities bound to specific metadata domains in the Freebase dataset.
* Development APIs - including the Freebase Acre API (http://www.freebase.com/docs/acre_api) or query editor http://www.freebase.com/queryeditor
* Development environments - Including the Acre App Editor, which allows users to build and publish web pages that incorporate its vast metadata repository using a JavaScript-based development environment.
Looking forward to comments and suggestions from other readers.
Thank you for your contribution, Sergio.
As you may know, Google has announced the Google Knowledge Graph (GKG) project, a source of semantically related data modeled using "schema.org" (from Google, Bing, and Yahoo). Many say that GKG is the evolution of Freebase (which was also available in RDF for a period), and indeed it is a good source for metadata enrichment of content.
Not only will Freebase/GKG help you tag your content by serving their data through the tools you referenced, but each data item will also be connected to other data in the graph that is potentially relevant to your information context.
To embed external metadata in your web page, one technology in use is RDFa. RDFa lets you insert data references (i.e. URLs) from sources like Freebase or GKG. The injected references become part of the page's HTML, so they are transparent to the end user but processable by machines. By following the RDFa references, a clever process can load the remote content (e.g. images, publication titles, related articles) and display it to the reader in a user-friendly way.
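As a small sketch of what this could look like, the Python function below renders an HTML fragment whose visible text is ordinary markup, while the RDFa attributes (`vocab`, `typeof`, `resource`, `property`) carry the machine-readable references. The function name and both URIs are hypothetical and chosen only for illustration; `Country`, `name` and `sameAs` are schema.org terms.

```python
from html import escape


def rdfa_fragment(resource_uri, name, same_as_uri):
    """Render an HTML fragment annotated with RDFa 1.1 attributes.

    The reader sees only the name and a link; a machine can also read
    which resource is described and which external record it matches.
    """
    return (
        '<div vocab="http://schema.org/" typeof="Country" '
        f'resource="{escape(resource_uri)}">'
        f'<span property="name">{escape(name)}</span> '
        f'(<a property="sameAs" href="{escape(same_as_uri)}">external record</a>)'
        "</div>"
    )


# Hypothetical URIs, for illustration only.
html_snippet = rdfa_fragment(
    "http://example.org/catalogue/country/ITA",
    "Italy",
    "http://example.org/external/italy",
)
print(html_snippet)
```

A page template could call such a helper wherever a catalogue entity is mentioned, so the references travel with the published HTML rather than living in a separate file.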
In my opinion, once the data.fao.org catalog is ready and publicly exposed, the editorial workflow of FAO web pages should consider embedding the catalog references, and so allow for clever behavior from other users inside and outside the organization.