Python: Converting from ISO-8859-1/latin1 to UTF-8

Question 1

I have this string that has been decoded from quoted-printable to ISO-8859-1 with the email module. This gives me strings like "\xC4pple", which would correspond to "Äpple" (Apple in Swedish). However, I can't convert those strings to UTF-8.

>>> apple = "\xC4pple"
>>> apple
'\xc4pple'
>>> apple.encode("UTF-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

What should I do?

Question 2

Try decoding it first, then encoding:

apple.decode('iso-8859-1').encode('utf8')
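
For readers on Python 3, where the str/unicode pair became bytes/str, the same two steps might look like the sketch below. It assumes the value arrives as a bytes object; the variable names are only illustrative.

```python
# Python 3 sketch: the raw ISO-8859-1 data is assumed to be a `bytes` object.
apple_latin1 = b"\xc4pple"

# Step 1: decode the bytes into text using the encoding they were written in.
apple_text = apple_latin1.decode("iso-8859-1")

# Step 2: encode the text into the target encoding.
apple_utf8 = apple_text.encode("utf-8")

print(ascii(apple_text))  # ascii() keeps the output safe on any terminal
print(apple_utf8)
```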

Question 3

This is a common problem, so here's a relatively thorough illustration.

For non-unicode strings (i.e. those without the u prefix, unlike u'\xc4pple'), one must decode from the native encoding (iso8859-1/latin1 here; Python 2 falls back to ASCII for implicit conversions unless that is changed with the enigmatic sys.setdefaultencoding function) to unicode, then encode to a character set that can display the characters you wish. In this case I'd recommend UTF-8.

First, here is a handy utility function that will help illuminate the patterns of Python 2.7 string and unicode:

>>> def tell_me_about(s): return (type(s), s)

A plain string

>>> v = "\xC4pple" # iso-8859-1 aka latin1 encoded string

>>> tell_me_about(v)
(<type 'str'>, '\xc4pple')

>>> v
'\xc4pple'        # representation in memory

>>> print v
?pple             # the raw iso-8859-1 bytes are written to the terminal as-is;
                  # on a UTF-8 terminal the lone byte '\xc4' is not a valid
                  # sequence, so it is printed as "?".
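
One plausible reading of the "?" above is that the terminal expects UTF-8 and the lone byte 0xC4 is not a valid UTF-8 sequence. A small Python 3 sketch of that check follows; the variable names are made up for illustration.

```python
raw = b"\xc4pple"  # the ISO-8859-1 bytes from the example

# Strict UTF-8 decoding rejects the lone 0xC4 byte ...
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ... while errors="replace" substitutes U+FFFD, which is roughly what a
# UTF-8 terminal does when it shows a placeholder character.
replaced = raw.decode("utf-8", errors="replace")

print(valid_utf8)
print(ascii(replaced))
```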

Decoding an iso8859-1 string - convert the plain string to unicode

>>> uv = v.decode("iso-8859-1")
>>> uv
u'\xc4pple'       # decoding iso-8859-1 becomes unicode, in memory

>>> tell_me_about(uv)
(<type 'unicode'>, u'\xc4pple')

>>> print v.decode("iso-8859-1")
Äpple             # convert unicode to the default character set
                  # (utf-8, based on sys.stdout.encoding)

>>> v.decode('iso-8859-1') == u'\xc4pple'
True              # one could have just used a unicode representation 
                  # from the start

A little more illustration - with "Ä"

>>> u"Ä" == u"\xc4"
True              # the native unicode char and escaped versions are the same

>>> "Ä" == u"\xc4"  
False             # the plain str "Ä" is '\xc3\x84' (UTF-8 bytes), not latin1

>>> "Ä".decode('utf8') == u"\xc4"
True              # one can decode the string to get unicode

>>> "Ä" == "\xc4"
False             # the native character and the escaped string are
                  # of course not equal ('\xc3\x84' != '\xc4').
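
For comparison, here is a hedged Python 3 sketch of the same comparisons. In Python 3 every string literal is text, so the mismatch only appears once an encoding is chosen. The names are illustrative.

```python
a_umlaut = "\u00c4"  # the character Ä, written as an escape to stay ASCII-safe

# In Python 3 both escape forms denote the same one-character text.
same_text = (a_umlaut == "\xc4")

# The byte representation depends on the encoding chosen.
as_latin1 = a_umlaut.encode("latin-1")
as_utf8 = a_umlaut.encode("utf-8")

print(same_text, as_latin1, as_utf8)
```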

Encoding to UTF

>>> u8 = v.decode("iso-8859-1").encode("utf-8")
>>> u8
'\xc3\x84pple'    # convert iso-8859-1 to unicode to utf-8

>>> tell_me_about(u8)
(<type 'str'>, '\xc3\x84pple')

>>> u16 = v.decode('iso-8859-1').encode('utf-16')
>>> tell_me_about(u16)
(<type 'str'>, '\xff\xfe\xc4\x00p\x00p\x00l\x00e\x00')

>>> tell_me_about(u8.decode('utf8'))
(<type 'unicode'>, u'\xc4pple')

>>> tell_me_about(u16.decode('utf16'))
(<type 'unicode'>, u'\xc4pple')
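
The leading '\xff\xfe' in the UTF-16 output above is a byte order mark (BOM). The Python 3 sketch below pins the byte order explicitly so the result does not depend on the machine; treat it as an illustration, not as part of the original session.

```python
import codecs

text = "\xc4pple"

# "utf-16-le" fixes the byte order and writes no byte order mark (BOM).
le_bytes = text.encode("utf-16-le")

# Prepending the little-endian BOM reproduces the bytes shown above.
with_bom = codecs.BOM_UTF16_LE + le_bytes

print(le_bytes)
print(with_bom)

# The generic "utf-16" codec reads the BOM to pick the byte order.
decoded = with_bom.decode("utf-16")
print(ascii(decoded))
```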

Relationship between unicode and UTF and latin1

>>> print u8
Äpple             # printing utf-8 - because of the encoding we now know
                  # how to print the characters

>>> print u8.decode('utf-8') # printing unicode
Äpple

>>> print u16     # printing 'bytes' of u16
���pple

>>> print u16.decode('utf16')
Äpple             # printing unicode

>>> v == u8
False             # v is an iso8859-1 string; u8 is a utf-8 string

>>> v.decode('iso8859-1') == u8
False             # v.decode(...) returns unicode

>>> u8.decode('utf-8') == v.decode('latin1') == u16.decode('utf-16')
True              # all decode to the same unicode memory representation
                  # (latin1 is iso-8859-1)

Unicode Exceptions

>>> u8.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
  ordinal not in range(128)

>>> u16.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0:
  ordinal not in range(128)

>>> v.encode('iso8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0:
  ordinal not in range(128)

One would get around these by first converting from the specific encoding (latin-1, utf8, utf16) to unicode, e.g. u8.decode('utf8').encode('latin1').
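
In Python 3 the implicit ASCII decode behind these tracebacks is gone, because bytes has no encode method at all. A hedged sketch of what works and what can still fail (the Czech sample word is borrowed from another answer in this thread):

```python
u8 = "\xc4pple".encode("utf-8")

# bytes objects cannot be encoded again; only str has .encode().
bytes_can_encode = hasattr(u8, "encode")

# Going through text works: UTF-8 bytes -> str -> Latin-1 bytes.
latin1 = u8.decode("utf-8").encode("latin-1")

# Encoding still fails when the target charset lacks the character.
try:
    "Ve\u0159ejn\xe9".encode("latin-1")  # U+0159 is not in Latin-1
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

print(bytes_can_encode, latin1, encode_failed)
```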

So perhaps one could draw the following principles and generalizations:

A str is a sequence of bytes, which may be in one of a number of encodings such as Latin-1, UTF-8, and UTF-16.
A unicode is a sequence of characters that can be converted to any number of encodings, most commonly UTF-8 and latin-1 (iso8859-1).
The print command has its own logic for encoding, driven by sys.stdout.encoding (UTF-8 in the session above).
One must decode a str to unicode before converting it to another encoding.
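
The last principle can be sketched as a small helper. This is a Python 3 illustration, and `transcode` is a hypothetical name, not something from the standard library.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode `data` from `source` to `target` by way of text (str)."""
    text = data.decode(source)   # bytes -> text
    return text.encode(target)   # text -> bytes


latin1_bytes = b"\xc4pple"
utf8_bytes = transcode(latin1_bytes, "iso-8859-1", "utf-8")
print(utf8_bytes)

# Round-tripping back to Latin-1 recovers the original bytes.
round_trip = transcode(utf8_bytes, "utf-8", "iso-8859-1")
print(round_trip == latin1_bytes)
```

Going through text in the middle is the design choice that matters here: there is no direct bytes-to-bytes conversion that would know what the bytes mean.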

Of course, all of this changes in Python 3.x.

Hope that is illuminating.

Further reading

Characters vs. Bytes, by Tim Bray.

And the very illustrative rants by Armin Ronacher:

Question 4

For Python 3:

bytes(apple, 'iso-8859-1').decode('utf-8')

I used this for a text that was incorrectly decoded as iso-8859-1 instead of utf-8 (showing words like VeÅ\x99ejnÃ©). This code produces the correct version, Veřejné.
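
A self-contained version of that repair, as a sketch with the garbled word written out using escapes (the variable names are illustrative):

```python
# Mojibake: UTF-8 bytes that were wrongly decoded as ISO-8859-1.
garbled = "Ve\xc5\x99ejn\xc3\xa9"

# Re-encoding as ISO-8859-1 recovers the original bytes, which are then
# decoded with the encoding they were really written in.
fixed = bytes(garbled, "iso-8859-1").decode("utf-8")

print(ascii(fixed))
```

This only works if the wrong decode really was ISO-8859-1. If the text went through Windows-1252 instead, the byte 0x99 would have become "™" and the re-encode step would fail.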

Question 5

Decode to Unicode, then encode the result to UTF8.

apple.decode('latin1').encode('utf8')

Question 6

import MySQLdb
# Note: 'ignore' silently drops every non-ASCII character, so this step is lossy.
concept = concept.encode('ascii', 'ignore')
concept = MySQLdb.escape_string(concept.decode('latin1').encode('utf8').rstrip())

This is what I do. I'm not sure if it's a good way, but it works every time!!
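
One caveat worth seeing concretely: encoding to ASCII with 'ignore' throws away non-ASCII characters. The Python 3 sketch below uses a hypothetical input value and leaves MySQLdb out entirely; it only compares the lossy step with a direct UTF-8 encode.

```python
concept = "\xc4pple "  # hypothetical input, held as text

# errors="ignore" drops characters that ASCII cannot represent.
lossy = concept.encode("ascii", "ignore")

# Encoding straight to UTF-8 keeps every character.
lossless = concept.rstrip().encode("utf-8")

print(lossy)
print(lossless)
```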