iCANDID infrastructure
iCANDID is a software infrastructure providing FAIR access to big data collected either through continuous data aggregation (e.g., a daily harvest from Belgian press agencies) or at the request of research projects that need specific datasets. The infrastructure was developed in the framework of an FWO Medium Size Research Infrastructure project in 2018, with extensions still being made in the ongoing iCANDID 3.0 project (2022-2026). It brings together 7 different research groups from social sciences and humanities[1] and is actively being used in several ongoing research projects[2] as well as for educational purposes[3].
Architecture design schematics
The graphical outline below shows the architecture schematics of iCANDID, from data delivery at the bottom to data representation in the user interface at the top.

iCANDID architecture
The central ontology schema
iCANDID uses a shared data model to collect and present data from different data sources in a uniform way and to make it accessible in an interoperable and reusable format. Schema.org is used as the main ontology because:
- It enables semantic interlinking
- It is suitable for any type of data
- It is flexible and extensible
- It is human- and machine-readable
Schema.org can be extended with elements from other ontologies, should this prove necessary in the future to present semantically rich data. The iCANDID application profile is available here.
Data delivery
iCANDID supports several data exchange mechanisms for aggregating data, such as APIs, FTP, and OAI-PMH. All collected data undergoes mapping, normalisation, and validation. The normalisation process thus gives the user integrated, consistent, and standardised access to metadata coming from different sources.
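The mapping, normalisation, and validation steps can be illustrated with a minimal Python sketch. It is not the iCANDID implementation: the source field names in `FIELD_MAP`, the assumed `dd/mm/yyyy` input date format, and the choice of required properties are all hypothetical. Only the target property names (`headline`, `articleBody`, `datePublished`, `inLanguage`) come from Schema.org.

```python
from datetime import datetime

# Hypothetical mapping from one source's field names to Schema.org properties.
FIELD_MAP = {
    "title": "headline",
    "body": "articleBody",
    "published": "datePublished",
    "lang": "inLanguage",
}
# Assumption: a record is only usable if these properties are present.
REQUIRED = ("headline", "datePublished")


def map_record(raw, field_map=FIELD_MAP):
    """Rename source fields to Schema.org property names; drop unmapped fields."""
    return {target: raw[source] for source, target in field_map.items() if source in raw}


def normalise(record):
    """Collapse whitespace, convert dd/mm/yyyy dates to ISO 8601, lowercase language codes."""
    out = dict(record)
    for key in ("headline", "articleBody"):
        if key in out:
            out[key] = " ".join(str(out[key]).split())
    if "datePublished" in out:
        parsed = datetime.strptime(out["datePublished"], "%d/%m/%Y")
        out["datePublished"] = parsed.date().isoformat()
    if "inLanguage" in out:
        out["inLanguage"] = out["inLanguage"].strip().lower()
    return out


def validate(record, required=REQUIRED):
    """Return a list of problems; an empty list means the record passed."""
    return [f"missing {name}" for name in required if not record.get(name)]
```

Each source would get its own field map, while `normalise` and `validate` stay shared, which is what makes records from different sources comparable afterwards.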
Data storage
After the extraction and normalisation process, data is stored as JSON-LD using the Schema.org vocabulary, a relational ontology model that allows knowledge to be represented in a machine-readable way and relationships to be expressed between data within and external to the system (Linked Data).
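As an illustration of what such a stored record might look like, the sketch below wraps a normalised record in a JSON-LD document. The `@context`, `@type`, and `@id` keywords are standard JSON-LD and `NewsArticle` is a Schema.org type; the helper name `to_jsonld` and the `example.org` identifiers are assumptions made for this example, not iCANDID's actual identifier scheme.

```python
import json


def to_jsonld(record, identifier, publisher_id=None):
    """Wrap a flat record in a JSON-LD document using the Schema.org vocabulary."""
    doc = {
        "@context": "https://schema.org",
        "@type": "NewsArticle",
        "@id": identifier,  # identifier of the record inside the system
    }
    doc.update(record)
    if publisher_id:
        # A reference by @id links to another resource, inside or outside
        # the system, instead of copying its data (Linked Data).
        doc["publisher"] = {"@id": publisher_id}
    return doc


article = to_jsonld(
    {"headline": "Climate debate", "datePublished": "2024-02-22"},
    identifier="https://example.org/records/1",        # hypothetical
    publisher_id="https://example.org/publishers/42",  # hypothetical
)
serialised = json.dumps(article, indent=2, ensure_ascii=False)
```

Because the publisher is referenced rather than embedded, the same publisher resource can be shared by many records and resolved by any JSON-LD-aware tool.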
Search & retrieval
The use of Schema.org also facilitates integrated search and retrieval of the data. To build the search index, iCANDID uses Elasticsearch[4], a powerful distributed search and analytics engine that can efficiently deal with the large amounts of unstructured full-text data available through iCANDID.
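To give an idea of how a full-text search over such an index could be expressed, the sketch below builds a request body in the Elasticsearch Query DSL (a `bool` query with a `match` clause and `range`/`term` filters). The index field names `articleBody`, `datePublished`, and `inLanguage` are assumptions based on the Schema.org model; the actual iCANDID index mapping is not described here.

```python
def build_search_body(text, date_from=None, date_to=None, language=None, size=10):
    """Build an Elasticsearch Query DSL body: full-text match plus optional filters."""
    filters = []
    date_range = {}
    if date_from:
        date_range["gte"] = date_from
    if date_to:
        date_range["lte"] = date_to
    if date_range:
        filters.append({"range": {"datePublished": date_range}})
    if language:
        filters.append({"term": {"inLanguage": language}})
    return {
        "size": size,
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"articleBody": text}}],
                # "filter" clauses only restrict the result set.
                "filter": filters,
            }
        },
    }


body = build_search_body("klimaat", date_from="2024-01-01", language="nl")
```

The resulting dictionary can be sent as the JSON body of a search request; keeping date and language restrictions in `filter` rather than `must` means they narrow the results without affecting ranking.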

iCANDID UI - Advanced search & full text view (available for export together with metadata)
Graphical User Interface
The UI offers basic and advanced search functionalities as well as different dynamic data visualisation options based on Named Entity Recognition (NER) results. To filter search results, facets and sorting options have been included. Users can also save their records to an accessible collection and export large data batches in a standardised format of choice for further processing in domain-specific tools such as Sketch Engine and SPSS. Currently supported formats are txt, csv, xlsx, and json-ld.
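A rough sketch of how one set of records could be serialised to several of these formats is given below, using only the Python standard library. It assumes flat records with a hypothetical `articleBody` field for the full text, and it leaves out xlsx, which would need a third-party library. The function name `export_records` is illustrative and does not refer to an iCANDID function.

```python
import csv
import io
import json


def export_records(records, fmt):
    """Serialise a list of flat records as json-ld, csv, or txt."""
    if fmt == "json-ld":
        doc = {"@context": "https://schema.org", "@graph": records}
        return json.dumps(doc, ensure_ascii=False, indent=2)
    if fmt == "csv":
        # Use the union of all keys so that sparse records still fit the table.
        fields = sorted({key for record in records for key in record})
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # missing values are written as empty cells
        return buffer.getvalue()
    if fmt == "txt":
        # Plain full text only, one blank line between documents.
        return "\n\n".join(record.get("articleBody", "") for record in records)
    raise ValueError(f"unsupported format: {fmt}")
```

The csv and txt outputs suit statistical and corpus tools, while the json-ld output keeps the Schema.org structure intact for reuse in other Linked Data systems.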

iCANDID UI - Data export functions
API
Users can also access the data collections available through iCANDID via the API to compose more detailed queries and automate data extraction.
Identity and Access Management (IAM/AAI)
Access control to the iCANDID UI is regulated via an authentication and authorisation layer based on Keycloak[5] that verifies whether the user has been given access to the iCANDID interface, specific datasets, and functions. The AAI allows KU Leuven users to log in with their KU Leuven credentials (Shibboleth) but also facilitates access for external users such as researchers from other Flemish universities and beyond. By default, iCANDID is a restricted environment because many of the data collections were acquired under specific terms (e.g., news media data can only be accessed by KU Leuven staff according to the agreement with Belga.press) or were collected under the text-and-data-mining exceptions for research[6].
A procedure for requesting access to the platform and its collections is available (cf. ‘User plan’ for more info).
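The dataset-level check can be pictured with the following sketch. Keycloak access tokens normally carry realm roles in the `realm_access.roles` claim; everything else here is an assumption for illustration, including the role names, the dataset names, and the idea of mapping one role to one dataset. The sketch works on already decoded claims and deliberately leaves out token signature verification, which a real deployment must perform first.

```python
# Hypothetical mapping from dataset name to the realm role that unlocks it.
DATASET_ROLES = {
    "belga-press": "kuleuven-staff",
    "flemish-parliament": "icandid-user",
}


def allowed_datasets(claims, dataset_roles=DATASET_ROLES):
    """Return the datasets a user may open, given decoded access-token claims."""
    roles = set(claims.get("realm_access", {}).get("roles", []))
    return sorted(name for name, role in dataset_roles.items() if role in roles)


external_researcher = {"realm_access": {"roles": ["icandid-user"]}}
staff_member = {"realm_access": {"roles": ["icandid-user", "kuleuven-staff"]}}
```

Under these assumptions, the external researcher would see only the parliamentary data, while the staff member would see both collections.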
Infrastructure management
The infrastructure runs on KU Leuven ICTS servers and storage. Docker is used as containerisation technology, code is managed in GitHub, and proper attention is paid to technical documentation and security.
Data collections
The iCANDID datahub keeps growing in volume and in the diversity of available datasets and connections. This happens in 2 ways:
- Nightly data aggregations to build a complete archive over a long period of time. An example is the Belgian Flemish- and French-language press data archive, which goes back as far as 2014 and is updated every night with the news media from the day before.
- Datasets on request, usually a specific collection within a date range required for a research project. An example is the Flemish parliamentary data on anti-discriminatory legislation (discussions & law texts). This data is kept in the archive after collection for future consultation and use. Access can be limited according to requirements (legal …).
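For the nightly aggregation, one practical question is which days a given run should fetch. The sketch below is one possible approach, not a description of the actual iCANDID scheduler: it returns every day after the last successfully harvested day up to and including the day before the run, so that a missed night is caught up automatically. The function name and the idea of tracking a last harvested date are assumptions.

```python
from datetime import date, timedelta


def days_to_harvest(last_harvested, run_date):
    """List the days still missing from the archive, ending with yesterday."""
    yesterday = run_date - timedelta(days=1)
    days = []
    current = last_harvested + timedelta(days=1)
    while current <= yesterday:
        days.append(current)
        current += timedelta(days=1)
    return days


# Normal nightly run: only the day before is fetched.
normal_run = days_to_harvest(date(2024, 2, 20), date(2024, 2, 22))
# Run after one failed night: two days are fetched.
catch_up_run = days_to_harvest(date(2024, 2, 19), date(2024, 2, 22))
```

Deriving the window from the last harvested day, rather than always fetching exactly one day, helps keep a long-running archive complete.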

- Belga.press data: Flemish and French press articles, BELGA online
- Electronic News Archive: transcriptions of VRT & VTM 7 pm news broadcasts
- Twitter data: Belgian news agency’s accounts, specific hashtags (e.g., climate, migration) and accounts (e.g., EU politicians) harvested at the request of research projects
- Parliamentary data (Flemish parliament, Swedish parliament …)
- TikTok (e.g., Belgian politicians) harvested in the framework of specific research projects (restricted access due to TikTok’s terms of service)
- IMDb & Themoviedatabase harvested in the framework of specific research projects

iCANDID data statistics (on 22/02/2024)
iCANDID is not limited in the type of data it can collect and include. Once an integration is set up, such as an API connection or data scraping, the infrastructure can continue to pull in data from the website or platform. For example, a connection with the Flemish Parliament data API was made to collect discussions and legislative data on the topic of anti-discrimination. If another research project requires other data from the Flemish Parliament, this new dataset can be harvested with minimal effort thanks to the earlier integration.
There are, however, increasing challenges in accessing data platforms managed by big tech companies, as most of them no longer offer free or affordable API access to their data for research and actively take measures against data scraping[7]. In this challenge also lies the strength of the iCANDID infrastructure, as we address these challenges at an aggregated scale instead of each individual researcher hitting the same wall. Moreover, the iCANDID infrastructure is well positioned to maintain close contacts with similar infrastructure providers (e.g., SURF.nl) to find common solutions for these data access challenges. iCANDID also provides more options for automating processes as a more efficient way to collect large amounts of data.