Python Hexadecimalization of Shift-jis Character Codes

Asked 2 years ago, Updated 2 years ago, 112 views

This is Win10 Pro, Python 3.7.

Shift-jis characters (external characters)

 " "".encode('cp932').hex()

I'd like to convert to hexadecimal using a conversion like , but it doesn't match the site below.
https://dencode.com/ja/string/hex

Run Results
Python: 'eee0'
sites:FBFC

I'd like to have Python say FBFC.

By the way

 " "".encode('Shift-jis').hex()

Then it seems to be an error.

If you understand, please take care of me.

python shift-jis

2022-09-30 15:54

2 Answers

As stated on the dictionary's 」 page, the official meaning of Shift_JIS (using underscore as the delimiter) does not have the letter 」, so it is normal for the latter example to fail.The byte string of the letter 」髙 as CP932 (depending on the interpretation of CP932), "EEE0" and "FBFC", are both correct.

The letter " 」" is NEC-selected IBM extension , one of the characters IBM and NEC assigned different codes to the same character for historical reasons (so the term "CP932" is one encoding method).It does not refer to an expression and is strictly ambiguous.)Microsoft then decided to integrate the two into a unified character code called Windows-31J.This integration was done in a compatible manner, resulting in two codes assigned to each of the same characters.

There seems to be a subtle difference in how each language library handles "CP932", but Python doesn't have the encoding name "Windows-31J", and "CP932" actually points to Windows-31J.So when you decode it, both "EEE0" and "FBFC" will be " "" characters:

>>b'\xEE\xE0'.decode('cp932')
' '髙'm sorry,'
>>b'\xFB\xFC'.decode ('cp932')
' '髙'm sorry,'


2022-09-30 15:54

Dear @Yuta Kitamura,The following is a supplement to the introduction and two more articles in the answer.

ウィ -Wictionary Japanese version -Wiktionary

At the bottom of the article, it is stated that -1990 JIS X0208 Series Shift JIS contains 0xFBFC as IBM extension and 0xEEE0 as NEC selection IBM extension.

Fehlersuchen und Verzeichnisse meines Windows

Later in the part titled "02.05.27 Ladder Height" on this page, 0xFBFC is selected for IME input and transcoding on Windows.

@Yuta Kitamura, the behavior of duplicate codes is "3. Duplicate NEC Selected IBM Extensions" and "IBM Extensions" in the second introduction, consistent with IBM Extensions" and "Ladder or not", so 0xFBFC seems to be the case.

Both are different from what Python does (0xEEE0).

This is a question about the JIS kanji code. Chinese character code...

The answers to this Q&A article explain the historical background in a digest.
If it's short and rough, this might be fine.

When I tried converting the Duplicate Registered Code in Windows-31J table in Microsoft Code Page 932-Wikipedia in Python, it seems that Python does not apply to "3. "NEC Selected IBM Extension" and "IBM Extension" overlap.All were converted to NEC selected IBM extension characters instead of IBM extension characters.It may be used differently.

"Also, the "" 」"" and "" 」"" on that page seemed to have different codes, so I picked them up from other articles IBM extension characters and changed them."try:except: is a remnant of it.

I'm trying Windows 10 64-bit 1909, Python 3.8.5.

extchars = list('¬∵≒≡∫√⊥∠∩∪①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳㍉㌔㌢㍍㌘㌧㌃㌶㍑㍗㌍㌦㌣㌫㍊㌻㎜㎝㎞㎎㎏㏄㎡㍻〝〟㏍㊤㊥㊦㊧㊨㈲㈹㍾㍽㍼∮∑∟⊿ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ№℡㈱ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ¦'"纊褜鍈銈蓜俉炻昱棈鋹曻彅丨仡仼伀伃伹佖侒侊侚侔俍偀倢俿倞偆偰偂傔僴僘兊兤冝冾凬刕劜劦勀勛匀匇匤卲厓厲叝﨎咜咊咩哿喆坙坥垬埈埇﨏塚增墲夋奓奛奝奣妤妺孖寀甯寘寬尞岦岺峵崧嵓﨑嵂嵭嶸嶹巐弡弴彧德忞恝悅悊惞惕愠惲愑愷愰憘戓抦揵摠撝擎敎昀昕昻昉昮昞昤晥晗晙晴晳暙暠暲暿曺朎朗杦枻桒柀栁桄棏﨓楨﨔榘槢樰橫橆橳橾櫢櫤毖氿汜沆汯泚洄涇浯涖涬淏淸淲淼渹湜渧渼溿澈澵濵瀅瀇瀨炅炫焏焄煜煆煇凞燁燾犱犾猤猪獷玽珉珖珣珒琇珵琦琪琩琮瑢璉璟甁畯皂皜皞皛皦益睆劯砡硎硤硺礰礼神祥禔福禛竑 竧 yasushi 竫 箞 sei 絈 絜 綷 midori 緖 繒 罇 羡 茁 荢 a feather 荿 菇 菶 葈 蒴 蕓 蕙 蕫 﨟 薰 蘒 﨡 蠇 裵 訒 訷 sen 誧 誾 諟 sho 諶 譓 譿 賰 賴 贒 赶 﨣 軏 﨤 itsu 鄕 tou the capital 遧 郞 釚 釗 釞 釭 釮 釤 釥 鈆 鈐 鈊 鈺 鉀 鈼 鉎 鉙 鉑 鈹 鉧 銧 鉷 鉸 鋧 鋗 鋙 鋐 﨧 鋕 鋠 鋓 錥 錡 鋻 﨨 錞 鋿 錝 錂 鍰 鍗 鎤 鏆 鏞 鏸 鐱 鑅 鑈 閒 takashi 﨩 隝 隯 霳 霻 靃 靍 靏 靑 靕 顗 顥 馞 驎 the rice gai 餧 髙 髜 魵 魲 鮏 鮱 鮻 鰀 鵰 鵫 tsuru 鸙 What ')

with open('extcharscp932.txt', 'w', encoding='utf-8', newline='') as f:
    for c in extchars:
        Industry :
            s = c + ' = ' + c.encode('cp932').hex() + '\n'
        except:
            s = c + ' = utf-8 ' + c.encode('utf-8').hex() + " is can't encode to cp932\n"
        
        f.write(s)

When I tried the .NET Core 3.1 C# console app to see what was going on with Microsoft languages and libraries instead of Python, it was unified to ...IBM extensions as described in Wikipedia.

using System;
using System.IO;
namespace ConsoleApp1
{
    class Program
    {
        static void Main(string[] args)
        {
            System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
            System.Text.Encoding sjis = System.Text.Encoding.GetEncoding("shift_jis");
            string extchars = "¬∵≒≡∫√⊥∠∩∪①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳㍉㌔㌢㍍㌘㌧㌃㌶㍑㍗㌍㌦㌣㌫㍊㌻㎜㎝㎞㎎㎏㏄㎡㍻〝〟㏍㊤㊥㊦㊧㊨㈲㈹㍾㍽㍼∮∑∟⊿ⅠⅡⅢⅣⅤⅥⅦⅧⅨⅩ№℡㈱ⅰⅱⅲⅳⅴⅵⅶⅷⅸⅹ¦'"纊褜鍈銈蓜俉炻昱棈鋹曻彅丨仡仼伀伃伹佖侒侊侚侔俍偀倢俿倞偆偰偂傔僴僘兊兤冝冾凬刕劜劦勀勛匀匇匤卲厓厲叝﨎咜咊咩哿喆坙坥垬埈埇﨏塚增墲夋奓奛奝奣妤妺孖寀甯寘寬尞岦岺峵崧嵓﨑嵂嵭嶸嶹巐弡弴彧德忞恝悅悊惞惕愠惲愑愷愰憘戓抦揵摠撝擎敎昀昕昻昉昮昞昤晥晗晙晴晳暙暠暲暿曺朎朗杦枻桒柀栁桄棏﨓楨﨔榘槢樰橫橆橳橾櫢櫤毖氿汜沆汯泚洄涇浯涖涬淏淸淲淼渹湜渧渼溿澈澵濵瀅瀇瀨炅炫焏焄煜煆煇凞燁燾犱犾猤猪獷玽珉珖珣珒琇珵琦琪琩琮瑢璉璟甁畯皂皜皞皛皦益睆劯砡硎硤硺礰礼神祥禔福禛竑竧靖竫箞精絈絜綷綠緖 繒 罇 羡 茁 荢 a feather 荿 菇 菶 葈 蒴 蕓 蕙 蕫 﨟 薰 蘒 﨡 蠇 裵 訒 訷 sen 誧 誾 諟 sho 諶 譓 譿 賰 賴 贒 赶 﨣 軏 﨤 itsu 鄕 tou the capital 遧 郞 釚 釗 釞 釭 釮 釤 釥 鈆 鈐 鈊 鈺 鉀 鈼 鉎 鉙 鉑 鈹 鉧 銧 鉷 鉸 鋧 鋗 鋙 鋐 﨧 鋕 鋠 鋓 錥 錡 鋻 﨨 錞 鋿 錝 錂 鍰 鍗 鎤 鏆 鏞 鏸 鐱 鑅 鑈 閒 takashi 﨩 隝 隯 霳 霻 靃 靍 靏 靑 靕 顗 顥 馞 驎 the rice gai 餧 髙 髜 魵 魲 鮏 鮱 鮻 鰀 鵰 鵫 tsuru 鸙 What " ;
            using (StreamWriter sw = File.CreateText("extcharssjis.txt"))
            {
                foreach (char c in extchars)
                {
                    sw.WriteLine(c + " = " + BitConverter.ToString(sjis.GetBytes(new char[] { c })).Replace("-", "").ToLower());
                }
            }
        }
    }
}

Simple invocation of standard features does not result in the corresponding hexadecimal number, so this is a possible solution.

  • If there is a range of NEC-selected IBM extended characters (edxx-eexx), convert them to IBM extended characters (faxx-fcxx)
    I don't think I can add or subtract simple numbers, so maybe I should make a one-to-one conversion table.

  • Find and use libraries that encode compatible with Microsoft
    I can't find it after a little search, but it might be somewhere

  • Create your own library that encodes compatible with Microsoft
    The article around here probably deals with a similar issue.
    How do I create custom text codecs?
    Custom Python Charmap Codec

If there is a range of NEC-selected IBM extended characters (edxx-eexx), convert to IBM extended characters (faxx-fcxx)
I don't think I can add or subtract simple numbers, so maybe I should make a one-to-one conversion table.

Find and use libraries that encode compatible with Microsoft
I can't find it after a little search, but it might be somewhere

Create your own library that encodes compatible with Microsoft
The article around here probably deals with a similar issue.
How do I create custom text codecs?
Custom Python Charmap Codec


2022-09-30 15:54

If you have any answers or tips


© 2024 OneMinuteCode. All rights reserved.