TY - Type of reference
TI - Automatic analysis of old documents: taking advantage of an incomplete, heterogeneous and noisy corpus
AU - Karine Abiven
AU - Gaël Lejeune
AB - In this article we tackle some of the problems that arise with noisy and heterogeneous data in the digital humanities. We investigate a collection known as the mazarinades corpus, which gathers around 5,500 documents in French from the 17th century. First, we show that this set of documents is not, strictly speaking, a corpus, since its coverage has not been thoroughly defined. Then, we argue that it is possible to obtain interesting results even with such an incomplete, heterogeneous and noisy dataset by strictly limiting the amount of preprocessing needed to process the texts. Finally, we present results from a case study on document dating, in which we aim to fill in missing metadata in the mazarinades corpus. We use a method based on character-string analysis that is robust to noisy data and can even take advantage of this noise to improve the quality of the results.
DO - 10.21494/ISTE.OP.2019.0335
JF - Information Retrieval, Document and Semantic Web
KW - Old documents, Mazarinades, Text Mining, Document Dating, corpus, Documents anciens, Mazarinades, Fouille de Textes, datation, corpus, numérisation
L1 - http://www.openscience.fr/IMG/pdf/iste_ridows18v2n1_3.pdf
LA - en
PB - ISTE OpenScience
DA - 2019/02/19
SN - 2516-3280
TT - Analyse automatique de documents anciens : tirer parti d’un corpus incomplet, hétérogène et bruité
UR - http://www.openscience.fr/Automatic-analysis-of-old-documents-taking-advantage-of-an-incomplete
IS - Issue 1
VL - 2
ER -