Is there any difference between the u"" prefix and unicode() in Python?

Asked 2 years ago, Updated 2 years ago, 15 views

I think there are two ways to get a Unicode string in Python: the u"abc" literal and unicode().
When I run the following, the two results differ. What is the difference?
The first line prints 1 and the second prints 6.

print len(u"\u2192")
print len(unicode("\u2192"))

I want to read Unicode data from an external source and display it as characters, but this is where I got stuck.
Thank you for your cooperation.

python

2022-09-30 17:47

2 Answers

This is clearly Python 2.

unicode(string[, encoding, errors]) takes a str (a byte string). If encoding is not specified, the default encoding is used.

Let's check the default encoding first.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

The default is ascii. In UTF-8, → (U+2192) is the three bytes 0xe2 0x86 0x92, but

>>> print unicode("\xe2\x86\x92")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

There's that dreaded error. The reason is that the ASCII decoder only accepts 0x00-0x7F, and 0xe2 is out of range. So let's tell unicode() that the input string is UTF-8 encoded.

>>> print unicode("\xe2\x86\x92", "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2192' in position 0: ordinal not in range(128)

Another error! But look closely at the message: earlier it was a decode error, and this time it is an encode error. The decoding worked, but print has to encode the Unicode object U+2192 for output, and the ascii encoder can't represent it. My terminal is UTF-8, so I'll convert it to a UTF-8 byte string with .encode('utf-8').

>>> print unicode("\xe2\x86\x92", "utf-8").encode('utf-8')
→
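
The same decode-then-encode round trip can be sketched without print, so it doesn't depend on the terminal's encoding. This is only a minimal sketch and not part of the session above: the variable names are made up for illustration, and it uses bytes.decode() rather than unicode() so that it should run unchanged on Python 2.6+ and Python 3.

```python
# UTF-8 bytes for the rightwards arrow (U+2192).
raw = b"\xe2\x86\x92"

# Decoding with the ascii codec fails: 0xe2 is outside 0x00-0x7F.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the right codec gives a one-character Unicode object.
text = raw.decode("utf-8")

# Encoding back to UTF-8 reproduces the original three bytes.
round_trip = text.encode("utf-8")

print(ascii_ok)           # False
print(len(text))          # 1
print(round_trip == raw)  # True
```

Decoding goes from bytes to Unicode, and encoding goes from Unicode back to bytes. The two tracebacks above come from those two different directions.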

Thankfully, → is displayed. Next, let's pass in a UTF-16 encoded string, convert it to UTF-8, and display it on the terminal. utf-16be means big endian.

>>> print unicode("\x21\x92", "utf-16be").encode('utf-8')
→

That printed fine! By the way, if you specify little endian with utf-16le, you get the wrong character.

>>> print unicode("\x21\x92", "utf-16le").encode('utf-8')
鈡

Swap the byte order and it displays correctly.

>>> print unicode("\x92\x21", "utf-16le").encode('utf-8')
→

Plain utf-16 seems to be treated the same as little endian here, judging by the results below. I'm not sure whether that comes from the UTF-16 standard, from Python, or from the CPU.

>>> print unicode("\x92\x21", "utf-16").encode('utf-8')
→
>>> print unicode("\x21\x92", "utf-16").encode('utf-8')
鈡
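
One way to take the guesswork out of plain utf-16 is a byte order mark (BOM): when the input starts with one, the utf-16 codec uses it to pick the byte order and drops it from the result. The sketch below only illustrates that behavior. The variable names are hypothetical, and bytes.decode() stands in for unicode() so that it should run on both Python 2.6+ and Python 3. For input without a BOM, CPython falls back to the machine's native byte order as far as I know, which would explain the little-endian result above on a typical PC, but treat that as an assumption rather than a guarantee.

```python
import sys

arrow = u"\u2192"

# Explicit byte order in the codec name: no BOM needed.
big = b"\x21\x92".decode("utf-16be")
little = b"\x92\x21".decode("utf-16le")

# Plain utf-16 with a BOM: FE FF marks big endian, FF FE little endian.
bom_big = b"\xfe\xff\x21\x92".decode("utf-16")
bom_little = b"\xff\xfe\x92\x21".decode("utf-16")

print(big == arrow and little == arrow)          # True
print(bom_big == arrow and bom_little == arrow)  # True
print(sys.byteorder)  # 'little' or 'big', depending on the machine
```

When the data comes from outside and its byte order is known, naming it explicitly with utf-16be or utf-16le avoids depending on the platform.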

Finally, the u"" prefix turns the literal into a Unicode object: the bytes written in the source file are decoded using the file's encoding, and a \u escape is interpreted as the Unicode code point it names. If coding: utf-8 is declared at the top of the file, the bytes inside the quotes are interpreted as UTF-8 and converted to a Unicode object. \u has no special meaning outside u"".

Now that we know how byte strings, Unicode objects, encoders, and decoders relate, let's go back to the question.

print len(u"\u2192")
print len(unicode("\u2192"))

The first line turns the escape \u2192 into a single-character Unicode object. The second line starts from the plain string "\u2192", and as noted earlier, \u outside u"" does not mean Unicode. It is simply the six characters \, u, 2, 1, 9, and 2. Converting those six characters to a Unicode object still leaves six characters.

>>> print repr("\u2192")
'\\u2192'  # the backslash is displayed as \\, but this is still the six-character ASCII string \u2192
>>> print repr(unicode("\u2192"))
u'\\u2192'  # the ASCII string \u2192 simply converted to a Unicode object, so still six characters
>>> print repr(u"\u2192")
u'\u2192'  # the single character U+2192
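
If the goal is to turn escape text that arrives from outside the program into real characters, one option is the unicode_escape codec, which interprets \uXXXX sequences in a byte string much as a u"" literal does. This is a hedged sketch rather than the asker's actual code: `incoming` is a hypothetical stand-in for the external data, and the backslash is written as \\ so the literal means the same six bytes on Python 2 and Python 3.

```python
# Six ASCII bytes: backslash, 'u', '2', '1', '9', '2'.
# This is what "\u2192" (without the u prefix) holds in Python 2.
incoming = b"\\u2192"

# Decoding as ascii keeps the six characters exactly as they are.
as_text = incoming.decode("ascii")

# Decoding with unicode_escape interprets the \u2192 escape instead.
as_char = incoming.decode("unicode_escape")

print(len(incoming))         # 6
print(len(as_text))          # 6
print(len(as_char))          # 1
print(as_char == u"\u2192")  # True
```

Note that unicode_escape treats any non-escape bytes as Latin-1, so it is best suited to input that is pure ASCII apart from the escapes.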


2022-09-30 17:47

In Python 3, strings are Unicode by default, and there is no unicode() constructor.
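
For comparison, here is a small Python 3 sketch of the same ideas. It is not from the original answer and the variable names are illustrative. String literals are already Unicode, and bytes are turned into text with bytes.decode() or str(bytes, encoding) instead of unicode().

```python
# Python 3 only: str is the Unicode type, bytes is the byte-string type.
arrow = "\u2192"       # \u escapes work in ordinary string literals
raw = b"\xe2\x86\x92"  # the same character encoded as UTF-8

decoded = raw.decode("utf-8")
also_decoded = str(raw, "utf-8")  # closest equivalent of unicode(raw, "utf-8")

print(len(arrow))                        # 1
print(decoded == arrow == also_decoded)  # True
```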

Python 2

Input:

print(unicode("\u2192"))

Results:

\u2192

Input:

print(u"\u2192")

Results:

→

From the Unicode HOWTO:

The unicode() constructor has the signature unicode(string[, encoding, errors]).

So unicode("\u2192") is not the same as u"\u2192".


2022-09-30 17:47


© 2024 OneMinuteCode. All rights reserved.