WikiTailor is a tool for extracting in-domain corpora from Wikipedia. A domain must be defined as an existing category in Wikipedia (or in Vikipèdia, or in Βικιπαίδεια, or in whatever language edition you like), and the articles belonging to that domain are extracted even if they are not tagged as such. Two extraction methods are implemented: the main one explores Wikipedia's category graph, and a secondary one, based on information retrieval techniques, is also included.
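To give a feel for what exploring a category graph can look like, here is a minimal Python sketch. It is not WikiTailor's code: the function name `collect_domain_articles`, the two dictionaries, and the fixed `max_depth` cut-off are all assumptions made for illustration, and the actual stopping criterion is described in the references. The sketch walks breadth-first from a root category, gathers the articles attached to every category it reaches, and guards against the cycles that Wikipedia's category graph is known to contain.

```python
from collections import deque


def collect_domain_articles(root, subcategories, articles, max_depth=2):
    """Collect articles reachable from `root` within `max_depth` levels.

    `subcategories` maps a category to its child categories and
    `articles` maps a category to its articles (both hypothetical,
    in-memory stand-ins for data taken from a Wikipedia dump).
    """
    seen = {root}                 # categories already queued (cycle guard)
    queue = deque([(root, 0)])    # (category, depth from the root)
    found = set()

    while queue:
        category, depth = queue.popleft()
        found.update(articles.get(category, ()))
        if depth == max_depth:
            continue              # do not expand below the cut-off depth
        for child in subcategories.get(category, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return found


# Toy graph; note the cycle Mechanics -> Physics.
toy_subcategories = {
    "Physics": ["Mechanics", "Optics"],
    "Mechanics": ["Physics", "Statics"],
}
toy_articles = {
    "Physics": ["Energy"],
    "Mechanics": ["Force"],
    "Optics": ["Lens"],
    "Statics": ["Truss"],
}

shallow = collect_domain_articles("Physics", toy_subcategories, toy_articles, max_depth=1)
deep = collect_domain_articles("Physics", toy_subcategories, toy_articles, max_depth=2)
```

With the toy data, a depth of 1 reaches the articles of Physics, Mechanics and Optics, while a depth of 2 also reaches Statics. Deeper exploration tends to add more articles at the cost of drifting away from the domain, which is why some stopping rule is needed.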
WikiTailor 1.0 functionalities
Available languages: Arabic, Basque, Catalan, English, French, German, Greek, Italian, Romanian, Portuguese and Spanish.
- Monolingual in-domain corpora extraction
- Multilingual in-domain comparable corpora extraction
- Multilingual in-domain parallel corpora extraction built from the articles' titles
Upcoming
- New available languages: Czech and Hungarian
- Bilingual in-domain parallel corpora extraction built from the articles' content
- Evaluation of the quality of the extractions
References
For a complete analysis of the implemented methods, see:
- Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction. Cristina España-Bonet, Alberto Barrón-Cedeño and Lluís Màrquez. In preparation.
- A Factory of Comparable Corpora from Wikipedia. Alberto Barrón-Cedeño, Cristina España-Bonet, Josu Boldoba and Lluís Màrquez. Proceedings of the 8th Workshop on Building and Using Comparable Corpora (BUCC), pages 3-13, Beijing, China, July 2015.