ToscanaOpenResearch (TOR) is built on a platform, backed by a relational database, that integrates and provides access to heterogeneous data on the research and innovation ecosystem. To promote data use and interoperability, and to make it easier to extract and analyze information across different classification systems, TOR uses an approach that combines:
- Crosswalks between the sources' native classifications: mappings that harmonize heterogeneous classification systems, both national (such as "Scientific-Disciplinary Sectors" [SSD], "CUN Areas", and "ATECO" economic activities) and international (such as the "International Patent Classification"), so that queries can span sources regardless of the original taxonomy.
- External classification systems implemented by applying automated Natural Language Processing (NLP) techniques to the abstracts of projects and publications.
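As a rough illustration of the crosswalk idea, the sketch below maps codes from different native systems onto one shared set of categories and then groups records by that shared category. It is a minimal sketch, not the platform's implementation: the table layout, the function names, and every code (`ssd-code-1`, `common-A`, and so on) are hypothetical placeholders, not real SSD, ATECO, or IPC entries.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (source_system, source_code, common_category).
# All codes are placeholders, not real SSD, ATECO or IPC entries.
CROSSWALK_ROWS = [
    ("SSD", "ssd-code-1", "common-A"),
    ("SSD", "ssd-code-2", "common-B"),
    ("IPC", "ipc-code-1", "common-A"),
    ("ATECO", "ateco-code-1", "common-B"),
    ("ATECO", "ateco-code-1", "common-C"),  # one source code may map to several targets
]


def build_crosswalk(rows):
    """Index the crosswalk by (system, code) for fast lookup."""
    index = defaultdict(set)
    for system, code, target in rows:
        index[(system, code)].add(target)
    return index


def harmonize(records, crosswalk):
    """Group record ids by common category, whatever their native taxonomy."""
    grouped = defaultdict(list)
    for record_id, system, code in records:
        # Codes missing from the crosswalk are skipped rather than guessed.
        for target in sorted(crosswalk.get((system, code), ())):
            grouped[target].append(record_id)
    return dict(grouped)


# Hypothetical records from three different sources.
RECORDS = [
    ("publication-1", "SSD", "ssd-code-1"),
    ("patent-1", "IPC", "ipc-code-1"),
    ("company-1", "ATECO", "ateco-code-1"),
]

crosswalk = build_crosswalk(CROSSWALK_ROWS)
by_category = harmonize(RECORDS, crosswalk)
print(by_category)
```

In a relational database the same idea would typically be a join between each source table and a crosswalk table; the dictionary here only stands in for that join.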
The new release of TOR was developed through a process led by the Tuscany Region, with the support of its technical partner, SIRIS Academic.
Data
To date, TOR mainly integrates data from national, European, and global open databases. The main sources are:
- CORDIS: the European Union's platform for research projects funded under European framework programs (Horizon 2020, Horizon Europe, etc.);
- OpenCoesione: the Italian portal for projects funded by cohesion policies, used to monitor the expenditure of European structural funds across the national territory;
- OpenAlex: an open bibliometric database covering scientific publications, authors, institutions, and their affiliations;
- PATSTAT: the European Patent Office's database containing worldwide patent information, including thematic classifications, inventors, and applicants;
- USTAT (MUR Statistics Office): statistical data on the Italian university system, including students, graduates, teaching staff, PhD programs, and disciplinary classifications;
- Cerca Università: an information source regarding Italian universities and their educational offerings;
- Excelsior (Unioncamere): an information system focusing on the occupational and training needs of Italian companies, providing data on requested professions, economic sectors, and skills;
- Registro Imprese (Business Register): the official source for Italian companies, featuring information on their sector of economic activity, production value, and registered office.
How Heterogeneous Data Communicate with Each Other
To allow for cross-cutting analyses among heterogeneous sources, crosswalks between classification systems have been established alongside data enrichment operations. In particular:
- SSD-GSD Correspondence: a mapping between Scientific-Disciplinary Sectors (SSD 2015 and 2024) and the new Scientific-Disciplinary Groups (GSD), organized according to the 14 disciplinary areas of the National University Council (CUN). This correspondence is defined by the Ministry of University and Research (D.M. 639/2025).
- Disambiguation of Organizations: a harmonization table was created to link the different spelling variants of Italian university names (as found in the USTAT and Cerca Università sources) to unique identifiers, through automated normalization followed by manual validation.
- Geolocation of Patent Holders: a methodology was developed and implemented to add geographic information (at the provincial and regional levels) for patent holders, since this information is largely incomplete in the original PATSTAT source.
- Text Classifiers: text classifiers were developed and implemented to label research documents automatically through content analysis.
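The organization disambiguation step can be pictured with a small sketch: names are normalized automatically, matched against a harmonization table, and anything left unmatched is set aside for manual validation. This is only an illustration under stated assumptions; the normalization rules, the function names, and the identifier `ORG-0001` are hypothetical and are not taken from the TOR harmonization table.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Any run of non-alphanumeric characters becomes a single space.
    return re.sub(r"[^a-z0-9]+", " ", without_accents).strip()


# Hypothetical harmonization table: normalized name -> unique identifier.
HARMONIZATION = {
    "universita degli studi di firenze": "ORG-0001",
}


def resolve(names, table):
    """Return (matched, unmatched); unmatched names go to manual validation."""
    matched, unmatched = {}, []
    for raw in names:
        key = normalize_name(raw)
        if key in table:
            matched[raw] = table[key]
        else:
            unmatched.append(raw)
    return matched, unmatched


variants = [
    "UNIVERSITA' DEGLI STUDI DI FIRENZE",
    "Università degli Studi di Firenze",
    "Univ. Firenze",  # abbreviation: not resolved by normalization alone
]
matched, unmatched = resolve(variants, HARMONIZATION)
print(matched)
print(unmatched)
```

The abbreviated variant shows why a manual step is still needed: simple normalization removes differences in case, accents, and punctuation, but not differences in wording.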
Focus 1 – Semantic Analysis
The abstracts of publications, patents, and R&I projects contain a wealth of textual information that describes in detail current challenges, proposed or demonstrated progress, and the expected impact of the innovation process. To unlock the value of this semantic richness, Natural Language Processing (NLP) and Deep Learning techniques were used to analyze how the outputs of research activities align with specific taxonomies identified as relevant to the regional context. Specifically, these are the ERC classification (applied through a classifier developed within the TOR project) and the taxonomy of Tuscany's Smart Specialization Strategy (S3); the latter is currently in the implementation phase. The approach relies on an automated classifier that analyzes each document individually and assigns taxonomic categories based on the actual text content rather than on declared metadata or manual tagging. This ensures greater consistency and comparability across heterogeneous sources.
Focus 2 – ERC Classification: Development and First Application
As part of the TOR project, an experimental framework for automated text classification was designed and developed to assign research documents to European Research Council (ERC 2024) panels.
Adopting the ERC classification addresses the need for a shared European disciplinary framework, which makes it possible to:
- compare the research activities of Tuscan actors with those of other regions and countries on a consistent basis, overcoming the limitations of purely national classifications;
- use the classification most commonly adopted in European research policy, rather than the sources' native classifications, which are often more technical in nature (such as the "topics" adopted by OpenAlex);
- categorize documents from heterogeneous sources under a single, "cross-cutting" system.
This activity is experimental: the automated classifier is not only a tool for generating visualizations but a methodological output of the project in its own right, developed jointly by SIRIS Academic and the Sectors of the Tuscany Region involved in the project.
The results presented in the visualizations derive from a first application of the classifier. Its performance is already reliable enough for exploratory analysis and general comparisons, and it will be refined further in subsequent iterations of the model. The classifier is publicly available on the Hugging Face platform. It is a multi-label classification model: given the title and abstract of a scientific publication, it can assign several ERC categories at once, reflecting the often interdisciplinary nature of research.
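To make the multi-label idea concrete, the sketch below shows one common way such a model's output can be turned into labels: each category gets an independent score, and every category above a threshold is kept, so a document may receive several labels or none. This is a minimal sketch, not the published classifier: the input format, the 0.5 threshold, and the raw scores are illustrative assumptions, and the panel codes are used only as example label names.

```python
import math


def build_input(title: str, abstract: str) -> str:
    # Assumption: title and abstract are simply joined into one text.
    return f"{title.strip()}. {abstract.strip()}"


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def assign_labels(logits: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep every label whose probability reaches the threshold.

    Unlike single-label classification (softmax plus argmax), each label is
    scored on its own, so zero, one, or several labels can be returned.
    """
    scores = {label: sigmoid(value) for label, value in logits.items()}
    kept = [label for label, score in scores.items() if score >= threshold]
    # Most confident label first.
    return sorted(kept, key=lambda label: scores[label], reverse=True)


# Made-up raw scores standing in for a model's output on one document.
example_logits = {"PE6": 2.0, "LS2": 0.5, "SH1": -3.0}
assigned = assign_labels(example_logits)
print(assigned)
```

The threshold controls the trade-off between assigning too many categories and missing relevant ones; the value that suits the actual model may well differ from the 0.5 assumed here.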