This user manual provides an overview of the information contained in the ToscanaOpenResearch dashboard and aims to present to an audience without specific expertise how to consult and access data on Research, Innovation and Higher Education in Tuscany.
With the aim of increasing the usefulness and extending the use of ToscanaOpenResearch, this document describes the project behind this portal, highlights the richness and variety of the data it contains and illustrates how the system can be interrogated to obtain the desired information.
For any comments or questions regarding ToscanaOpenResearch, please send an e-mail to the following address: staff@toscanaopenresearch.it.
In the context of its Regional Observatory for Research and Innovation, established by Regional Law 20/2009 in order to have a qualified knowledge base and an information tool to support policies, the Region of Tuscany has created ToscanaOpenResearch (TOR) with a dual purpose:
TOR therefore represents the Observatory's information tool, and is made up of different components:
The ToscanaOpenResearch components are presented schematically in the figure below:
The TOR dashboard, the “Explore The Data” section that hosts the visualizations, is built with open source technologies.
The views are based on data directly retrieved through the SPARQL Endpoint. Therefore, each time the data is updated, the views will also update automatically.
The views that appear on ToscanaOpenResearch in the “Explore The Data” section have been created to answer the questions produced by the Regional Conference for Research and Innovation through a co-design process developed in 3 sessions between 2017 and 2018.
The Regional Conference for Research and Innovation is a permanent collegial structure with advisory functions, that gathers representatives of universities, research centres, science and technology parks, companies and trade unions. In this regard, during the initial phase of the Pilot Dashboard, several questions had emerged.
Among the questions posed by the Regional Conference, which were answered through interactive visualizations in the portal, we find:
Mapping of the PhD courses
An overview of PhD courses in Tuscany is outlined by these two views: the first one shows the number of students enrolled and graduates to PhD courses in Tuscany, by year and by university; the second one, provides information about students enrolled and graduates to PhD courses, by year, university, gender and title of the PhD.
Mapping of Master's courses
To answer this question, ToscanaOpenResearch includes two views: the first one the first one shows the number of students enrolled and graduated in Masters in Tuscany by year and by university, taking into account both first and second level Masters; while the second one shows the number of students enrolled and graduated in Masters in Tuscany by university, gender and title of the Master.
Overview of public research spin-offs in Tuscany
This visualization has been created, which shows the collaborations of Tuscan actors on research and innovation projects with European funding, both with Italian and international collaborators. The map allows to filter by type of program, by country collaborator and by type of activity.
A general overview of the collaboration networks that emerge from the analysis of European-funded research and innovation projects
To answer this question, a visualization has been created, which shows the collaborations of Tuscan actors on research and innovation projects with European funding, both with Italian and international collaborators. The map allows to filter by type of program, by country collaborator and by type of activity.
For each of the above mentioned visualizations, as for all those present inside the portal, it is possible to download directly the data represented inside the visualization, in csv format, by clicking on the red arrow just below the representation.
The open data on ToscanaOpenResearch is the result of the integration of numerous databases, memorandum of understanding and data from bibliometric databases.
In particular,
The system currently focuses on the Tuscan perimeter, but for most of the data the national perimeter has already been integrated, and it is therefore possible to carry out context analyses at national or even European level (in the case of research projects).
In this regard, the following "mind maps" offer a cross-section of the information available depending on the subject or topic of interest.
The idea of the “mind maps“ represented below is to outline a general picture of the data available on ToscanaOpenResearch, by topic.
Indeed, the word in the middle of each map (inside the box with the rounded border) represents the subject of interest, the subject from which it is possible to obtain the information indicated by each arrow.
The source of the data is reported in brackets, with the exception of the CTN/PRIN 2012 data, which were obtained through a memorandum of understanding between the MIUR and the Region of Tuscany.
However, the mind maps above offer only a general idea of the data available for each concept and/or subject. We will now see how to analyze in more detail the available data and how to consult them through an ontology, that is a formal, shared and explicit representation of a conceptualization of a domain of interest.
This section provides an introduction to the SPARQL Endpoint, the technology used to query TOR data, and to the ontology charts - graphical representations that allow to describe entities (objects, concepts, etc.) and their relations in a given knowledge domain - in order to make the reader autonomous in querying the data of the Observatory.
The SPARQL endpoint allows users to retrieve data using the SPARQL query language, which is a standard defined by the World Wide Web Consortium (W3C). The SPARQL endpoint can be used by both human users and machines. Humans interact with it through a graphical user interface, while machines (scripts, applications, etc.) communicate with the endpoint using the standard SPARQL protocol defined by W3C.
Data retrieved from a SPARQL endpoint complies with the Linked Data Linked Data approach to publish data in the Semantic Web context. This means that the data format is RDF (Resource Description Framework), (Resource Description Framework), where the information is represented as resources. Each resource is identified by a Uniform Resource Identifier (URI), which in most cases resembles a web address (e.g. http://unics.cloud/ontology#Project-123). Resources are characterized by properties. Some properties attribute data values to a resource, such as the acronym of a project or its title, while others link a resource with another resource, such as a project with one of its participants. Properties that attribute data values are called "data properties", while properties that link resources are called "object properties". Properties, like resources, are identified by URIs (e.g. http://unics.cloud/ontology#acronym).
Conceptually, Conceptually, data formatted in RDF is a graph, where nodes are either resources or data values, and links between nodes are either data properties or object properties. In practical terms, however, RDF-formatted data is normally represented as a list of triples. A triple is a phrase in the form of a "predicate object subject". These triples are used to list links in the graph, i.e. the subject of a triple is always a resource, the predicate is a property, and the object is either a data value or another resource. As an example, a data set containing two projects of the same call with their title and abbreviation would look like this (URIs are often written in angular brackets):
<http://unics.cloud/ontology#Project-1> <http://unics.cloud/ontology#acronym> "Acronym 1"
<http://unics.cloud/ontology#Project-1> <http://unics.cloud/ontology#title> "Title 1"
<http://unics.cloud/ontology#Project-1> <http://unics.cloud/ontology#call> <http://unics.cloud/ontology#Call-A>
<http://unics.cloud/ontology#Project-2> <http://unics.cloud/ontology#acronym> "Acronym 2"
<http://unics.cloud/ontology#Project-2> <http://unics.cloud/ontology#title> "Title 2"
<http://unics.cloud/ontology#Project-2> <http://unics.cloud/ontology#call> <http://unics.cloud/ontology#Call-A>
<http://unics.cloud/ontology#Call-A> <http://unics.cloud/ontology#name> "Name of call for projects A"
In RDF and SPARQL it is common to use prefixes to shorten resource and property names, i.e. URIs. As we see in the previous example, all resources and properties of this dataset have URIs starting with "http://unics.cloud/ontology#". To avoid writing this common part each time, we can define an abbreviation for this prefix, for example "tor:", and write this abbreviation instead of the full prefix. The above example would become then:
Prefix tor: <http://unics.cloud/ontology#>
tor:Project-1 tor:acronym “Acronym 1”
tor:Project-1 tor:title “Title 1”
tor:Project-1 tor:call tor:Call-A
tor:Project-2 tor:acronym “Acronym 2”
tor:Project-2 tor:title “Title 2”
tor:Project-1 tor:call tor:Call-A
tor:Call-A tor:name “Name of call for projects A”
In terms of syntax, the abbreviation of a prefix must be in the form of "name:" (note the ":" at the end). The shortest possible abbreviation is the one where "name" is empty, i.e. the abbreviation is just ":". Also, note that when abbreviations are used for prefixes, URIs are no longer written in angular brackets (these are only used for non-abbreviation URIs).
In order to write queries on RDF data using the SPARQL language, you need to know which properties can be applied on which resource sets. This is where ontologies come into play.
An ontology is essentially a definition of the structure of data formatted in RDF format, where resources with similar properties are grouped into concepts (also called classes). In this sense, an ontology defines a set of concepts (e.g. Project, Call, Participant, Country, ...) and a set of properties (both data and object properties) that can be applied to these concepts. Concepts, just like properties and resources, are identified by URIs (which can be abbreviated as seen above).
For data integration the system uses Ontology-Based Data Access (OBDA) approach, which allows data integration through a domain ontology and allows users to access data by queries, eliminating the need to know the technical terms related to the physical organization of the databases and the internal structure.
The documentation of the ontology is available on the ToscanaOpenResearch website, in the section Data Access> Ontology documentation, at the link: http://toscanaopenresearch.it/sparql-endpoint/en/docs/index.html.
In the diagrams, the boxes with rounded borders indicate the concepts, , with the list of names written inside each box, which are the given data properties applicable to thiat particular concept. For example, in the TOR ontology part of CORDIS projects, the "Organization" concept has the data properties "unicsId", "extendedName" and "acronym". The arrows with open tip between the boxes represent object properties. The name of an object property is indicated by a label on the arrow, and the application direction of the property is given by the direction of the arrow itself. For example, the arrow labeled with "principalInvestigator", which links the "EC-Project" concept with the "Person" concept, represents an object property that associates to a European project the person who is the principal investigator of that project.
To each given property and to each object property is associated a cardinality constraint, specified through a "min..max" pair, which establishes a limit to the minimum and maximum number of values that the property in question can have. For example, the "0..1" constraint on the "principalInvestigator" property indicates that an object in EC-Project (i.e., a European project) may or may not have a principalInvestigator (minimum constraint 0), and in any case may have at most one (maximum constraint 1). Similarly, the "[0..1]" constraint on the given "startingYear" property of "EC-Project" indicates that a European project may or may not have a single starting year. In general, the minimum cardinality can be 0 (i.e. no minimum constraint), or a positive integer; the maximum cardinality can be a positive integer, or "n" (i.e. no maximum constraint). If the cardinality constraint specification is missing in the diagram, "1..1" is assumed, i.e. the property is both mandatory (it always has a value) and functional (it has only one).
Another important element of ontologies is the generalization mechanism, which allows concepts to be organized in one or more hierarchies. A generalization (also called “is-a”), A generalization (also called "is-a"), is indicated in the diagram with an arrow ending with a closed triangle, which goes from a more specific concept to a more general one. An is-a indicates that each object that belongs to the more specific concept also belongs to the more general concept. For example, the arrow is-a from "ERC-Project" to "EC-Project" indicates that each ERC project is also a European project. By virtue of this, generalization implies inheritance, i.e. the fact that a more specific concept inherits all the properties of a more general concept. For example, every ERC project, being also a European project, has all the properties of European projects, e.g. a title. But an ERC project also has additional properties that European projects in general do not have, for example the fact of having an "ercGrant".
The Ontology-Based Data Access (OBDA) approach, also called Virtual Knowledge Graph, allows to access relational databases and data sources of different nature (e.g. CSV files, excel tables, JSON data) mediating these accesses through a domain ontology. This ontology provides a consolidated data model, shared between different actors, and expressed using domain-specific terminology and concepts, which can also go beyond what is present in specific data sources. The ontology is then linked to the sources using declarative “mappings”. These make explicit the way the connection is made, and do not "hide" it in programmatic code, as happens in traditional integration systems. The user can then extract the information of interest, using terms familiar to him/her from the ontology. He can easily formulate even complex queries directly on the ontology, without the need to know the terms used in the different source databases, often cryptic and difficult to understand, or how the data are organized there. The user query is then translated into appropriate accesses to the data sources and the answers are combined to provide the answer to the user, using the ontology itself and the mapping. This is done in a completely automatic and transparent way to the user. The entire approach is based on standards defined by the World Wide Web Consortium (W3C) for ontology, mapping, and query language.
It is important to note that the ontology provides a unified view of data as a graph (set of linked elements) of knowledge.
On the one hand, this greatly simplifies the integration and reconciliation of data from different sources.
On the other hand, the ontology can be reused in different contexts where the domain is similar, simply by connecting new data sources.
In addition, data can be made accessible by focusing on who will then use it. This logic also includes the possibility to take into account in the data mapping only specific details that are of interest for certain types of users or in a certain context, and respecting criteria of privacy or confidentiality.
Ultimately, an OBDA system allows data to be used by a wider audience with less familiarity with specific sources. The use of OBDA systems typically does not require changes to the databases themselves, thus allowing the underlying systems to remain fully operational. In summary, the OBDA approach can be considered as a mechanism to provide a high-level API (Application Programming Interface) for heterogeneous data sources.
At the technological level, the ToscanaOpenResearch OBDA system uses Ontop, a particularly high-performance and advanced OBDA implementation, as also highlighted by specific scientific literature. This Open Source software boasts a heterogeneous user community, ranging from users operating in an industrial context to financial institutions and university research groups. The software is in continuous development, with the addition of new or improved features and the possibility of professional support in addition to that of the community.
The ontology developed within ToscanaOpenResearch is now being aligned with the national project OntoPIA (the network of public administration controlled ontologies and vocabularies), coordinated by AgID - Agenzia per l'Italia Digitale, to become the national reference ontology on higher education and research: the ontology is available at this link.
After learning the basics of SPARQL Endpoint and the graphical representation of ontology, it's time to query the system and experiment with queries on your own.
A query to a SPARQL endpoint typically consists of the following keywords:
Below is an example of a query that outputs universities in the UNICS ontology.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#glt;
PREFIX unics: <http://unics.cloud/ontology#>
SELECT ?org
WHERE {
?org rdf:type unics:University .
}
The PREFIX instruction followed by a word (free choice) introduces an abbreviation for potentially long URLs. Returning to the example of universities, in the UNICS ontology universities are available through the URL “http://unics.cloud/ontology/University”, using the PREFIX instruction, you can refer to universities using "unics:University", since the URL of unics has been saved in the word unics.
Note that the query is divided into at least two parts.
In the first part, introduced by SELECT, you list the variables you want to have as a result,, separating them with spaces.
The second part, introduced by WHERE and enclosed by curly brackets, specifies the rules in the form of triple RDF. A triple RDF is composed of:
In the example above we find:
Analyzing the content of the query, you notice that you are asking for entities that are unics:University type. The term "org" is a variable in which the content is saved. It can take an arbitrary name, as long as it starts with a question mark. It allows you to use all the values it takes in the "valid" triples in the results table. Within a variable it is possible to save any type of content, be it data or URI that will be used later to create more complex queries.
Basically, WHERE selects data to be taken into account to build the results table.
You can insert more than one triple pattern in a WHERE clause:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX unics: <http://unics.cloud/ontology#>
SELECT *
WHERE {
?org rdf:type unics:University .
?org unics:extendedName "Università di Firenze" .
}
To the previous example is added the condition that with the same subject of the first pattern and associated to the variable "University of Florence".
You can guess how queries can grow in complexity. In combination with SPARQL's feature of being able to query metadata ("what property can a university have?"), this makes it a very suitable language for progressive data exploration.
There are also several more refined ways to select and process data, for which reference is made to the more specific literature and examples contained on the TOR portals.
A library of predefined queries, organized coherently with the main sections of the Observatory, is available in the Data Access > SPARQL endpoint section at the following link: http://toscanaopenresearch.it/sparql-endpoint/en/.
Through this SPARQL endpoint it is also possible to obtain immediate visual processing of the data obtained from the execution of the queries, using tools such as "Pivot Table" and "Google Chart".
However, once you become familiar with SPARQL endpoint and the ontology diagram, following the instructions above, you can change the queries and adapt them according to your needs, adding more information and/or changing the perimeter (both temporal and geographical) of interest.