Convert Unicode to ASCII without errors in Python

My code just scrapes a web page, then converts it to Unicode.
    html = urllib.urlopen(link).read()
    html.encode("utf8","ignore")
    self.response.out.write(html)
But I get a UnicodeDecodeError:
    Traceback (most recent call last):
      File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/__init__.py", line 507, in __call__
        handler.get(*groups)
      File "/Users/greg/clounce/main.py", line 55, in get
        html.encode("utf8","ignore")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2818: ordinal not in range(128)
…
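The traceback above is typical of calling encode on a byte string in Python 2, which first triggers an implicit ASCII decode. A minimal sketch of the usual remedy, written for Python 3 and assuming the page is Latin-1 encoded (the `raw` value is a hypothetical stand-in for the scraped bytes, not the asker's data): decode to text explicitly, then encode to ASCII.

```python
# Hypothetical stand-in for the bytes returned by urlopen(link).read().
# 0xa0 is a non-breaking space in Latin-1, the same byte named in the traceback.
raw = b"price:\xa0100"

# Decode the bytes to text first. Latin-1 is an assumption for this sketch;
# a real page normally declares its encoding in the HTTP headers or a <meta> tag.
text = raw.decode("latin-1")

# Only now encode to ASCII, dropping characters that have no ASCII equivalent.
ascii_bytes = text.encode("ascii", "ignore")
print(ascii_bytes)
```

Passing "replace" instead of "ignore" would substitute a question mark for each dropped character, which can make the loss easier to spot.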

How do I determine file encoding in OS X?

I’m trying to enter some UTF-8 characters into a LaTeX file in TextMate (which says its default encoding is UTF-8), but LaTeX doesn’t seem to understand them. Running cat my_file.tex shows the characters properly in Terminal. Running ls -al shows something I’ve never seen before: an “@” by the file listing: -rw-r--r--@ 1 me users …
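One rough way to test a file's bytes is to attempt a strict UTF-8 decode. The sketch below uses a hypothetical helper name, `looks_like_utf8`, and only checks whether the bytes are valid UTF-8; it cannot identify which other encoding a file uses, and pure-ASCII data passes the check as well.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "ñ".encode("utf-8")      # two bytes: 0xc3 0xb1
latin1_bytes = "ñ".encode("latin-1")  # one byte: 0xf1, not valid UTF-8 on its own

print(looks_like_utf8(utf8_bytes))
print(looks_like_utf8(latin1_bytes))
```

To check a real file, the bytes could be read with open(path, "rb").read() and passed to the same function.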

Encode String to UTF-8

I have a String with a “ñ” character and I have some problems with it. I need to encode this String as UTF-8. I tried it this way, but it doesn’t work:
    byte ptext[] = myString.getBytes();
    String value = new String(ptext, "UTF-8");
How do I encode that String as UTF-8?
…

Do I really need to encode ‘&’ as ‘&amp;’?

I’m using an ‘&’ symbol with HTML5 and UTF-8 in my site’s <title>. Google shows the ampersand fine on its SERPs, as do all the browsers in their titles. http://validator.w3.org is giving me this: & did not start a character reference. (& probably should have been escaped as &amp;.) Do I really need to do …
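For pages generated from code, the escaping the validator asks for can be applied automatically. A small sketch using Python's standard html module, with a made-up title string as the example input:

```python
import html

# Hypothetical page title containing a bare ampersand.
title = "Fish & Chips"

# html.escape replaces &, <, > (and quotes, by default) with character references.
escaped = html.escape(title)
print(escaped)

# html.unescape reverses the transformation.
print(html.unescape(escaped))
```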

Write to UTF-8 file in Python

I’m really confused with the codecs.open function. When I do:
    file = codecs.open("temp", "w", "utf-8")
    file.write(codecs.BOM_UTF8)
    file.close()
It gives me the error
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
If I do:
    file = open("temp", "w")
    file.write(codecs.BOM_UTF8)
    file.close()
It works fine. The question is: why does the first method …
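The error above is consistent with handing raw BOM bytes to a writer that expects text, which forces an implicit ASCII decode in Python 2. One way to sidestep this, sketched here for Python 3, is the "utf-8-sig" codec, which emits the BOM itself. The temporary path and the "hello" payload are assumptions for the example.

```python
import codecs
import os
import tempfile

# Write into a fresh temporary directory so the sketch leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "temp")

# "utf-8-sig" writes the BOM automatically, so only text is passed to write().
with open(path, "w", encoding="utf-8-sig") as f:
    f.write("hello")

# Read the raw bytes back to confirm the BOM is at the start of the file.
with open(path, "rb") as f:
    data = f.read()

print(data.startswith(codecs.BOM_UTF8))
```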

HTML encoding issues – “Â” character showing up instead of “&nbsp;”

I’ve got a legacy app that has just started misbehaving, and I’m not sure why. It generates a bunch of HTML that gets turned into PDF reports by ActivePDF. The process works like this: Pull an HTML template from a DB with tokens in it to be replaced (e.g. “~CompanyName~”, “~CustomerName~”, etc.) Replace the tokens …