CITI stopped operations in 2014 to co-launch NOVA LINCS. This site has not been updated since 2013.
SpSim4Cognates

SpSim4Cognates is a prototype whose main characteristics are described in: Luís Gomes and Gabriel Pereira Lopes, 2011, "Measuring Spelling Similarity for Cognate Identification", In: Luís Antunes and H. Sofia Pinto (Eds.), Progress in Artificial Intelligence, Lecture Notes in Artificial Intelligence 7026, pp. 624-633, Springer-Verlag (Germany), URL: http://www.springerlink.com/content/gtl56j3l06906020

The most commonly used string similarity measures, such as the Longest Common Subsequence Ratio (LCSR) and those based on Edit Distance, take into account only the number of matched and mismatched characters. However, cognates belonging to a pair of languages exhibit recurrent spelling differences, such as “ph” and “f” in the English-Portuguese cognates “phase” and “fase”. These differences are attributable to the evolution of the spelling rules of each language over time, so they should not be penalized in the same way as the arbitrary differences found in non-cognate words when word similarity is used as an indicator of cognateness. SpSim is a new spelling similarity measure for cognate identification that tolerates characteristic spelling differences, which are automatically extracted from a set of cognates known a priori. Compared to LCSR and EdSim (Edit Distance-based similarity), SpSim yields an F-measure 10% higher when used for cognate identification on five different language pairs.
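To make the contrast concrete, the following is a minimal Python sketch, not the prototype's code. It implements LCSR and EdSim using their usual textbook definitions (normalizing by the length of the longer word), plus a simplified tolerant variant in which a known substitution pattern is matched at zero cost. The function names, the `patterns` argument, and the hand-written pattern ("ph", "f") are assumptions made for illustration. The actual SpSim measure extracts its patterns automatically from example cognates and is defined in the paper cited above.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest Common Subsequence Ratio: LCS length over the longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def tolerant_distance(a: str, b: str, patterns=()) -> int:
    """Levenshtein distance in which each (x, y) pair in `patterns`
    (non-empty substrings, an assumption of this sketch) may be
    aligned at zero cost. With no patterns it is plain edit distance."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 or j == 0:
                best = i + j  # only insertions or only deletions
            else:
                cost = 0 if a[i - 1] == b[j - 1] else 1
                best = min(dist[i - 1][j] + 1,         # deletion
                           dist[i][j - 1] + 1,         # insertion
                           dist[i - 1][j - 1] + cost)  # match or substitution
            for x, y in patterns:
                # A known spelling difference ending here is not penalized.
                if a[:i].endswith(x) and b[:j].endswith(y):
                    best = min(best, dist[i - len(x)][j - len(y)])
            dist[i][j] = best
    return dist[n][m]


def edsim(a: str, b: str) -> float:
    """Edit Distance-based similarity: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 - tolerant_distance(a, b) / longest if longest else 1.0


def tolerant_similarity(a: str, b: str, patterns) -> float:
    """Like edsim, but known spelling differences cost nothing."""
    longest = max(len(a), len(b))
    return 1.0 - tolerant_distance(a, b, patterns) / longest if longest else 1.0
```

For “phase” and “fase”, `lcsr` and `edsim` both give 0.6, because each counts “ph” versus “f” as ordinary mismatches. `tolerant_similarity("phase", "fase", [("ph", "f")])` gives 1.0, since the only difference between the two words is the listed pattern.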


Date: June, 2011


Authors: Gabriel Pereira Lopes, Luís Gomes