I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
- convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
- remove all the characters whose Unicode type is “diacritic”.
Do I need to install a library such as pyICU or is this possible with just the Python standard library? And what about python 3?
Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.
12 s
Unidecode is the correct answer for this. It transliterates any unicode string into the closest possible representation in ascii text.
Example:
accented_string = u'Málaga'
# accented_string is of type 'unicode'
import unidecode
unaccented_string = unidecode.unidecode(accented_string)
# unaccented_string contains 'Malaga'and is of type 'str'