Information Retrieval, Document and Semantic Web

Recherche d’information, document et web sémantique

RIDoWS - ISSN 2516-3280 - © ISTE Ltd

Aims and scope

The diversity in forms of documents (multimedia, multilingual, with or without a structure) and in their uses encourages different communities to mingle more and more.

Information Retrieval, Document and Semantic Web is a meeting point for the scientific and industrial communities interested in information retrieval, the semantic web, the analysis of documents (texts, images, sounds, videos, etc.) or collections of documents.

Journal issues

2019

Volume 19- 3

Issue 1

2018

Volume 18- 2

Issue 1

2017

Volume 17- 1

Issue 1

Recent articles

Detection of weak signals in weakly structured data masses

Julien Maitre, Michel Menard, Guillaume Chiron, Alain Bouju

This paper is part of a project that aims to discover weak signals in different streams of information, possibly sent by whistleblowers on a platform such as GlobalLeaks. The study tackles the particular problem of clustering topics at multiple levels from multiple documents, and then extracting meaningful descriptors, such as weighted lists of words, for document representations in a multi-dimensional space. In this context, we present a novel idea that combines Latent Dirichlet Allocation and Word2Vec (the latter providing a consistency metric for the partitioned topics) as a potential method for limiting the a priori number of clusters k usually needed in classical partitioning approaches. We propose two implementations of this idea, respectively able to (1) find the best k for LDA in terms of topic consistency and (2) gather the optimal clusters from different levels of clustering. We also propose a non-traditional visualization approach based on a multi-agent system that combines dimension reduction and interactivity.
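
As an illustration only, the selection of k by topic consistency might be sketched as follows. This is not the authors' implementation: the function names, the choice of mean pairwise cosine similarity as the consistency score, and the toy two-dimensional word vectors are all assumptions. In practice the topics would come from LDA runs at several values of k and the vectors from a trained Word2Vec model.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topic_consistency(words, vectors):
    """Mean pairwise cosine similarity of a topic's top words (assumed metric)."""
    known = [w for w in words if w in vectors]
    pairs = list(combinations(known, 2))
    if not pairs:
        return 0.0
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def best_k(topics_by_k, vectors):
    """Return the k whose topics are most consistent on average, plus all scores."""
    scores = {
        k: sum(topic_consistency(t, vectors) for t in topics) / len(topics)
        for k, topics in topics_by_k.items()
    }
    return max(scores, key=scores.get), scores


# Toy word vectors standing in for Word2Vec embeddings (hypothetical values).
vectors = {
    "leak": (1.0, 0.1), "fraud": (0.9, 0.2), "bribe": (0.95, 0.15),
    "storm": (0.1, 1.0), "flood": (0.2, 0.9), "rain": (0.15, 0.95),
}

# Toy top-word lists standing in for LDA output at k = 1 and k = 2.
topics_by_k = {
    1: [["leak", "fraud", "storm", "flood"]],
    2: [["leak", "fraud", "bribe"], ["storm", "flood", "rain"]],
}

k, scores = best_k(topics_by_k, vectors)
print(k, scores)
```

On this toy input the mixed single topic scores lower than the two homogeneous topics, so the sketch selects k = 2.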

DataNews: contextualisation of quantified values in wires

Chloé Monnin, Olivier Hamon, Victor Schmitt, Brice Terdjman

Open Data gives access to plentiful data with broad coverage, but none of these sources offers a structured database built around news. With DataNews, our goal is to find such data automatically and to provide the means to reuse them. To do so, we first defined a typology of events in the specific context of deaths reported in AFP wires. Then, restricting ourselves to natural disasters, we clustered these wires by event so as to identify the events. The last step builds extraction patterns to collect the values corresponding to the number of deaths, together with the context associated with these values. The results of our evaluations confirmed the strong potential of our method, which could lead to several applications.
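
To make the idea of an extraction pattern concrete, here is a minimal sketch that uses a regular expression to pull a death count and its surrounding context out of a wire. The pattern, the function name, the context window size and the English example sentence are all assumptions made for illustration; the abstract does not describe the actual patterns or the language of the wires.

```python
import re

# Hypothetical pattern: a number followed by a death-related verb or adjective.
DEATH_PATTERN = re.compile(
    r"(?P<count>\d[\d,]*)\s+(?:people\s+)?(?:were\s+)?(?:killed|dead|died)",
    re.IGNORECASE,
)


def extract_death_counts(text, window=40):
    """Return (count, context) pairs; context spans `window` characters each side."""
    results = []
    for match in DEATH_PATTERN.finditer(text):
        count = int(match.group("count").replace(",", ""))
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        results.append((count, text[start:end].strip()))
    return results


# Invented example sentence, not taken from a real wire.
wire = "At least 1,200 people were killed when the earthquake struck the region on Monday."
hits = extract_death_counts(wire)
print(hits)
```

Keeping the context next to each value is what allows a number to be attached to an event later on, rather than stored as an isolated figure.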

Influence over Networks, a modelling proposal

Damien Nouvel, Kévin Deturck, Frédérique Segond, Namrata Patel

This paper focuses on influencers, defined as individuals who succeed in having an impact on the decision process of other individuals simply through interaction. The success of social networks in the last decade has led to increasing interest in detecting such profiles. In this context, we present a new influencer model based on the observation of real influence processes. We first define the theoretical framework in which we model the influence process. Then, we describe our empirical approach, based on the observation of influencers in forum discussions, which allows us to characterise each component of our model with linguistic features. Finally, we conclude by presenting, as future work, the implementation of the model, with linguistic feature annotation organised to acquire gold data.

Data correction for transcription in crowdsourcing. A feedback from RECITAL platform.

Benjamin HERVY, Pierre PÉTILLON, Hugo PIGEON, Guillaume RASCHIA

Crowdsourcing has been widely deployed to address some challenges in digital humanities, such as the transcription of old handwritten documents. Such an approach is especially useful for overcoming the existing limits of automatic handwriting recognition techniques. Crowdsourcing allows workers to help experts extract and classify information when the workload is daunting. Yet it raises some specific challenges related to the quality of the data produced. In this paper, we discuss data quality in a research project called CIRESFI, which aims to transcribe Italian Comedy financial archives through the RECITAL web platform. Finally, we propose some leads for tackling these issues.

Earth Observation Datasets for Change Detection in Forests

Julius Akinyemi, Josiane Mothe, Nathalie Neptune

The automatic detection of changes in forests (deforestation, reforestation) relies on various data sets. This article reviews data sets both global and local that can be used to evaluate tasks of land cover classification, change detection, segmentation and annotation of images for the analysis of deforestation and reforestation phenomena.

Construction(s) and contradictions of research data in the Humanities and Social Sciences

Marie-Laure Malingre, Morgane Mignon, Cécile Pierre, Alexandre Serres

In the last decade, political injunctions to curate and share research data have increased significantly. A survey conducted in 2017 at Rennes 2, a French Humanities and Social Sciences university, enabled us to question the habits and representations of researchers in this matter, as well as the term “data” itself. Contrary to the idea that data are given, which is implicit in the French word “données”, the notion of “data” is far from self-evident and proves to be complex and multifaceted. This article aims to show that a triple redefinition and construction of research data is at stake in the discourses of researchers and institutional stakeholders: it operates at the epistemological, intellectual and political levels. These concepts of data conflict with existing practices in the field.

Automatic analysis of old documents: taking advantage of an incomplete, heterogeneous and noisy corpus

Karine Abiven, Gaël Lejeune

In this article we tackle some problems that arise with noisy and heterogeneous data in the domain of digital humanities. We investigate a corpus known as the mazarinades corpus, which gathers around 5,500 documents in French from the 17th century. First, we show that this set of documents is not, strictly speaking, a corpus, since its coverage has not been thoroughly defined. Then, we argue that it is possible to obtain interesting results even with such an incomplete, heterogeneous and noisy dataset by strictly limiting the amount of preprocessing needed to process the texts. Finally, we present some results from a case study on document dating, in which we aim to fill in missing metadata in the mazarinades corpus. We use a method based on character string analysis that is robust to noisy data and can even take advantage of this noise to improve the quality of the results.
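
One way to picture dating with character strings is a nearest-neighbour comparison of character n-gram profiles, sketched below. This is an assumed stand-in, not the method of the article: the trigram size, the cosine comparison, the function names and the two invented snippets with their toy year labels are all hypothetical. The point is that no tokenisation or spelling normalisation is needed, so noisy transcriptions still share most of their n-grams with clean ones.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no tokenisation or cleaning is applied."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(p, q):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


def guess_year(undated_text, dated_docs, n=3):
    """Return the year of the dated document whose profile is most similar."""
    profile = char_ngrams(undated_text, n)
    best = max(dated_docs, key=lambda doc: cosine(profile, char_ngrams(doc[0], n)))
    return best[1]


# Invented snippets with toy year labels, not taken from the mazarinades corpus.
dated = [
    ("le cardinal mazarin et la cour sont a paris", 1649),
    ("la paix est faite et le roy rentre dans sa ville", 1652),
]

# A noisy query imitating OCR or spelling variation.
year = guess_year("le cardinal mazarjn et la covr a paris", dated)
print(year)
```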

Harness the heterogeneity in textual data

Jacques Fize, Mathieu Roche, Maguelonne Teisseire

Over the last decades, the growing use of information systems has resulted in an exponential increase in textual data. Although the volume of these textual data is now manageable, their heterogeneity remains a challenge for the scientific community. Managing this heterogeneity offers many opportunities by giving access to richer information. In our work, we design a process for mapping heterogeneous textual data based on their spatiality. In this article, we present the results returned by this process on data produced in Madagascar as part of the BVLAC project, led by CIRAD. Based on a set of four quality criteria, we obtain good spatial correspondence between these documents.

Editorial Board

Editor in Chief

Vincent CLAVEAU
IRISA-CNRS, Rennes

Co-Editors

Hervé BREDIN
CNRS-LIMSI

Catherine FARON-ZUCKER
Laboratoire I3S
Université Nice Sophia Antipolis

Karen PINEL-SAUVAGNAT
IRIT – Université Paul Sabatier

Haïfa ZARGAYOUNA
LIPN – Université Paris 13