Inventory of the research infrastructure

iCANDID infrastructure

iCANDID is a software infrastructure providing FAIR access to big data collected either through continuous data aggregations (e.g., the daily harvest from Belgian press agencies) or at the request of research projects needing specific datasets. The infrastructure was developed in the framework of an FWO Medium Size Research Infrastructure project in 2018, with extensions still being made in the ongoing iCANDID 3.0 project (2022-2026). It brings together seven research groups from the social sciences and humanities[1] and is actively being used in several ongoing research projects[2] as well as for educational purposes[3].

Architecture design schematics

The graphical outline below shows the architecture schematics of iCANDID, from data delivery at the bottom to data representation in the user interface at the top.

iCANDID architecture

The central ontology schema

iCANDID uses a shared data model to collect and present data from different data sources in a uniform way and to make it accessible in an interoperable and reusable format. Schema.org is used as the main ontology because it:

Enables semantic interlinking
Is suitable for any type of data
Is flexible and extensible
Is human- and machine-readable

Schema.org is extensible with elements from other ontologies, should this prove necessary in the future to present semantically rich data. The iCANDID application profile is available here.

Data delivery

iCANDID supports different data exchange protocols for aggregating data, such as API, FTP, and OAI-PMH. All collected data undergoes mapping, normalisation, and validation. The normalisation process thus provides the user with integrated, consistent, and standardised access to metadata coming from different sources.
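
The mapping, normalisation, and validation steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the source field names, the `FIELD_MAP` table, the input date format, and the list of required properties are all hypothetical and are not taken from the actual iCANDID pipeline.

```python
from datetime import datetime

# Hypothetical mapping from one source's field names to Schema.org properties.
FIELD_MAP = {
    "title": "headline",
    "body": "articleBody",
    "published": "datePublished",
    "lang": "inLanguage",
}

# Assumed minimum set of properties a record must carry to be accepted.
REQUIRED = ("headline", "datePublished")


def map_record(raw):
    """Mapping: rename known source fields, drop unknown ones."""
    return {FIELD_MAP[key]: value for key, value in raw.items() if key in FIELD_MAP}


def normalise(record):
    """Normalisation: collapse whitespace, use ISO 8601 dates, lowercase language codes."""
    out = dict(record)
    if "headline" in out:
        out["headline"] = " ".join(out["headline"].split())
    if "datePublished" in out:
        # Assumes this particular source delivers dates as DD/MM/YYYY.
        parsed = datetime.strptime(out["datePublished"], "%d/%m/%Y")
        out["datePublished"] = parsed.date().isoformat()
    if "inLanguage" in out:
        out["inLanguage"] = out["inLanguage"].strip().lower()
    return out


def validate(record):
    """Validation: return a list of problems; an empty list means the record is valid."""
    return [f"missing {prop}" for prop in REQUIRED if not record.get(prop)]


def process(raw):
    """Run the three steps in order and reject invalid records."""
    record = normalise(map_record(raw))
    errors = validate(record)
    if errors:
        raise ValueError("; ".join(errors))
    return record
```

Keeping the three steps as separate functions means a new source would only need its own mapping table, while normalisation and validation stay shared across sources.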

Data storage

After the extraction and normalisation process, data is stored as JSON-LD using the Schema.org vocabulary, an ontology model that allows knowledge to be represented in a machine-readable way and that expresses relationships between data within and outside the system (Linked Data).
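
To give an idea of what a stored record might look like, the sketch below builds a small JSON-LD document with Schema.org terms. `NewsArticle`, `headline`, `datePublished`, `inLanguage`, and `publisher` are existing Schema.org terms, but the choice of properties, the helper `to_jsonld`, and the sample values are assumptions for illustration; the actual iCANDID application profile may differ.

```python
import json


def to_jsonld(headline, date_published, language, publisher_name):
    """Build a minimal Schema.org NewsArticle as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "headline": headline,
        "datePublished": date_published,
        "inLanguage": language,
        # A nested node expresses a relationship to another entity (Linked Data).
        "publisher": {"@type": "Organization", "name": publisher_name},
    }


# Sample values are invented for the example.
doc = to_jsonld("Example headline", "2024-02-22", "nl", "Example Press Agency")
serialised = json.dumps(doc, ensure_ascii=False, indent=2)
print(serialised)
```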

Search & retrieval

The use of Schema.org also facilitates integrated search and retrieval of the data. To build the search index, iCANDID uses Elasticsearch[4], a powerful distributed search and analytics engine that can efficiently deal with the large amounts of unstructured full-text data available through iCANDID.
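
As an illustration of how a full-text search over such an index might be expressed, the sketch below assembles a request body in the Elasticsearch Query DSL (a `bool` query with a `match` clause, `term` and `range` filters, and a sort). The field names are assumed to follow the Schema.org properties used elsewhere in this text; the real iCANDID index mapping is not documented here and may differ.

```python
import json


def build_search_body(text, language, date_from, date_to, size=20):
    """Return an Elasticsearch search request body as a plain dictionary."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Full-text relevance matching on the article text (assumed field name).
                "must": [{"match": {"articleBody": text}}],
                # Filters narrow the result set without affecting scoring.
                "filter": [
                    {"term": {"inLanguage": language}},
                    {"range": {"datePublished": {"gte": date_from, "lte": date_to}}},
                ],
            }
        },
        # Newest records first.
        "sort": [{"datePublished": {"order": "desc"}}],
    }


body = build_search_body("climate", "nl", "2023-01-01", "2023-12-31")
print(json.dumps(body, indent=2))
```

Such a body would be sent to the `_search` endpoint of an index; the index name and the client used to send it are left out because they are not specified in the source.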

iCANDID UI - Advanced search & full text view (available for export together with metadata)

Graphical User Interface

The UI offers basic and advanced search functionalities as well as different dynamic data visualisation options based on Named Entity Recognition (NER) results. Facets and sorting options are available to filter search results. Users can also save their records to a collection and export large data batches in a standardised format of choice for further processing in domain-specific tools such as Sketch Engine and SPSS. Currently supported formats are txt, csv, xlsx, and json-ld.

iCANDID UI - Data export functions
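
The export of a saved collection to two of the supported formats (csv and json-ld) could look roughly like the sketch below. It uses only the Python standard library; the sample records, the column selection, and the function names are hypothetical and do not describe how the iCANDID export is actually implemented.

```python
import csv
import io
import json

# Invented sample records standing in for a user's saved collection.
RECORDS = [
    {"@type": "NewsArticle", "headline": "First example", "datePublished": "2024-02-21"},
    {"@type": "NewsArticle", "headline": "Second example", "datePublished": "2024-02-22"},
]


def export_csv(records, columns):
    """Flatten the selected properties into CSV text, one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    for record in records:
        # Missing properties become empty cells; unselected ones are left out.
        writer.writerow({column: record.get(column, "") for column in columns})
    return buffer.getvalue()


def export_jsonld(records):
    """Wrap the records in a single JSON-LD document with a shared context."""
    document = {"@context": "https://schema.org", "@graph": records}
    return json.dumps(document, ensure_ascii=False, indent=2)


csv_text = export_csv(RECORDS, ["headline", "datePublished"])
jsonld_text = export_jsonld(RECORDS)
```

The CSV output suits tabular tools such as SPSS, while the JSON-LD output keeps the full linked-data structure for reuse.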

API

Users can also access the data collections available through iCANDID via the API to compose more detailed queries and automate data extraction.

Identity and Access Management (IAM/AAI)

Access to the iCANDID UI is regulated via an authentication and authorisation layer based on Keycloak[5] that verifies whether the user has been given access to the iCANDID interface, specific datasets, and functions. The AAI allows KU Leuven users to log in with their KU Leuven credentials (Shibboleth) but also facilitates access for external users such as researchers from other Flemish universities and beyond. By default, iCANDID is a restricted environment because many of the data collections have been acquired under specific terms (e.g., news media data can only be accessed by KU Leuven staff according to the agreement with Belga.press) or have been collected under the text-and-data-mining exceptions for research[6]. A procedure for requesting access to the platform and its collections is available (cf. ‘User plan’ for more info).

Infrastructure management

The infrastructure runs on KU Leuven ICTS servers and storage. Docker is used as the containerisation technology, code is managed in GitHub, and proper attention is paid to technical documentation and security.

Data collections

The iCANDID datahub keeps growing in volume and in the diversity of available datasets and connections. This happens in two ways:

Nightly data aggregations to build a complete archive over a long period of time. An example is the Belgian Flemish and French-Language Press data archive going back as far as 2014 with a continued nightly update of the news media from the day before.
Datasets on request, usually a specific collection within a date range required for a research project. An example is the Flemish parliamentary data on anti-discriminatory legislation (discussions & law texts). This data is kept in the archive after collection for future consultation and use. Access can be limited according to requirements (legal …).
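
For the nightly aggregation, each run has to select the items published on the day before. A minimal sketch of computing that harvest window is shown below; the function name and the half-open interval convention are assumptions, not a description of the actual iCANDID scheduler.

```python
from datetime import date, datetime, time, timedelta


def previous_day_window(run_date):
    """Return (start, end) covering the full day before run_date.

    The interval is half-open: start is included, end is excluded, so
    consecutive nightly runs neither overlap nor leave gaps.
    """
    previous_day = run_date - timedelta(days=1)
    start = datetime.combine(previous_day, time.min)
    end = datetime.combine(run_date, time.min)
    return start, end


# A nightly job would typically pass date.today(); a fixed date is used here.
start, end = previous_day_window(date(2024, 2, 22))
print(start.isoformat(), end.isoformat())
```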

Collection registry (red lock means no access)

The current iCANDID collection registry contains more than 21 million records and includes the following important collections:

Belga.press data: Flemish and French press articles, BELGA online
Electronic News Archive: transcriptions of VRT & VTM 7 pm news broadcasts
Twitter data: accounts of Belgian news agencies, specific hashtags (e.g., climate, migration), and accounts (e.g., EU politicians) harvested at the request of research projects
Parliamentary data (Flemish parliament, Swedish parliament …)
TikTok (e.g., Belgian politicians) harvested in the framework of specific research projects (restricted access due to TikTok’s terms of service)
IMDb & The Movie Database harvested in the framework of specific research projects

iCANDID data statistics (on 22/02/2024)

iCANDID is not limited in the type of data it can collect and include. Once an integration is set up, such as an API connection or a data scraper, the infrastructure can continue to pull in data from the website or platform. For example, a connection with the Flemish Parliament data API was made to collect discussions and legislative data on the topic of anti-discrimination. If another research project requires other data from the Flemish Parliament, this new dataset can be harvested with minimal effort thanks to the earlier integration.

Access to data platforms managed by big tech companies is, however, increasingly challenging: most of them no longer offer free or affordable API access to their data for research, and they actively take measures against data scraping[7]. In this challenge also lies the strength of the iCANDID infrastructure, as we address these challenges on an aggregated scale instead of each individual researcher hitting the same wall. Moreover, the iCANDID infrastructure is well positioned to maintain close contacts with similar infrastructure providers (e.g., SURF.nl) to find common solutions for these data access challenges. iCANDID also provides more options for automating processes as a more efficient way to collect large amounts of data.

[1] Institute for Media Studies, KADOC, Cultural Studies Research Group, Centre for Sociological Research, Translation and Intercultural Transfer, Leuven School for Mass Communication Research, Quantitative Lexicology and Variational Linguistics.

[2] E.g., CELSA, HumMingbird (H2020-RIA), OPPORTUNITIES (H2020-RIA), COMMunity (C2-project).

[3] Numerous master's theses, PhD research, ‘Teksten als sociale netwerken’ and ‘Lexicale analyse op basis van corpora’ as part of the course ‘Analyse van Mediateksten’ (2BA, COM).

[4] https://www.elastic.co/products/elasticsearch

[5] https://www.keycloak.org/

[6] https://economie.fgov.be/en/themes/intellectual-property/intellectual-property-rights/copyright-and-related-rights/copyright/european-directive-copyright

[7] Twitter’s free academic API access of 10 million Tweets a month has been unavailable since July 2023; API access to 1 million Tweets a month now costs 60,000 euro a year, and the number of views has been limited to prevent scraping (https://twitter.com/elonmusk/status/1675260424109928449). Similar measures have been taken by platforms such as Facebook and TikTok.

Knowledge Base