TOPIXTRACT is a language-independent keyterm extractor for documents, developed by Luís Teixeira in the framework of his MSc thesis. It represents documents using words, multi-words, or word prefixes (with a fixed length of 4 or 5 characters) as features, and then applies 24 measures to rank the importance of each feature for discriminating each document. The results can be assessed by independent evaluators, whose agreement is measured using Kappa statistics. Among these measures, Tf-idf and Chi-square based metrics have shown higher precision. Word prefixes were used to deal with highly inflected languages, and topic prefixes were used only as an aid for promoting words and multi-words as possible document topics. More information can be found in the paper: Luís Teixeira, Gabriel Pereira Lopes and Rita Ribeiro, "Automatic Extraction of Document Topics", in: Luis M. Camarinha-Matos (Ed.): DoCEIS 2011, IFIP AICT 349, pp. 101–108, 2011.
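
As an illustration of the idea described above, the following minimal sketch scores fixed-length word prefixes with a plain Tf-idf measure. It is not TOPIXTRACT's code: the function names (prefix_features, tf_idf, top_features), the whitespace tokenization, and the exact Tf-idf variant (relative term frequency times log(N / df)) are assumptions made for this example only.

```python
import math
from collections import Counter


def prefix_features(text, length=5):
    """Lowercase the text and map each whitespace token to its first `length` characters."""
    return [token[:length] for token in text.lower().split()]


def tf_idf(docs, length=5):
    """Return one {feature: score} dict per document.

    Assumed variant: (count / document length) * log(N / document frequency).
    """
    tokenized = [prefix_features(doc, length) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each feature occurs.
    df = Counter(feature for tokens in tokenized for feature in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            feature: (count / total) * math.log(n_docs / df[feature])
            for feature, count in counts.items()
        })
    return scores


def top_features(doc_scores, k=3):
    """Return the k highest-scoring features, breaking ties alphabetically."""
    return sorted(doc_scores, key=lambda f: (-doc_scores[f], f))[:k]


if __name__ == "__main__":
    docs = [
        "the parser reads the grammar",
        "the compiler emits code",
        "the parsers and grammars",
    ]
    for doc, doc_scores in zip(docs, tf_idf(docs)):
        print(doc, "->", top_features(doc_scores))
```

With 5-character prefixes, inflected forms such as "parser" and "parsers" collapse into the single feature "parse", which is the motivation for prefix features in highly inflected languages. A feature that occurs in every document (here "the") receives a score of zero under this variant.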

Date: October, 2010

Authors: Gabriel Pereira Lopes, Luis F. S. Teixeira
