@ARTICLE{10.21494/ISTE.OP.2017.0172, TITLE={Haruspex, Knowledge Management Tool for Unstructured Data}, AUTHOR={Matthieu Quantin and Benjamin Hervy and Florent Laroche and Jean-Louis Kerouanton}, JOURNAL={Digital Archaeology}, VOLUME={1}, NUMBER={1}, YEAR={2017}, URL={https://www.openscience.fr/Haruspex-Knowledge-Management-Tool-for-Unstructured-Data}, DOI={10.21494/ISTE.OP.2017.0172}, ISSN={2515-7574}, ABSTRACT={This study presents a method designed to analyse and exploit corpora made of unstructured or weakly structured documents. The term unstructured is meant from a computing point of view: data that is not described and not explicitly marked up. Digital corpus creation, whether open or private, is now a massive trend. More and more data is being scanned, photographed, faithfully transcribed, etc. in order to be analysed (among other uses). The digital data set is the exclusive material the researcher handles daily. These sets are often designed specifically for a project, and sometimes collected by the researchers themselves. This trend needs to be accompanied by analytic tools. Indeed, physical and digital data offer different potential for analysis. Yet researchers in the humanities often remain powerless when facing the unstructured data they collect: articles, scans of archives, OCR documents, media and their metadata. Deploying a database is often limited to an “Excel sheet” or a few SQL tables. Big data and data-mining technologies are restricted to large-scale projects working on already structured text with a significant IT support team. This opens a gap between historians, archaeologists, sociologists and the “digital humanities”. The tool presented here, named Haruspex, aims to close this gap. It processes textual data written in French or English, possibly combined with pictures, and outputs a graph-oriented database. This database contains documents interlinked by semantic closeness. Several input formats are supported (PDF, TXT, ODT, LaTeX …). The process runs in four steps: 1. 
Corpus management: create or extract any available metadata (date, place, tags) for each document; manipulate them: concatenate, split, gather, exclude… 2. Semantic indexing of the corpus: extraction of keywords (generic as well as specific) and classification of these keywords into categories (where possible). 3. Monitoring of the results by the researcher. 4. Computing the “semantic closeness” between documents from the monitored keywords. The first tests of Haruspex cover several fields of study: industrial heritage of shipyards, history of chemistry in the 20th century, labour history in the French colonies, and studies of contemporary scientific publications. These tests convinced the researchers involved.}}