Removing a special character with Python's replace

Asked 2 years ago, Updated 2 years ago, 342 views

Hello, I have a question.

If the data contains the special character "₩" when I save it to Excel after crawling, encoding it as euc-kr raises an error, so it cannot be saved to Excel.

The problem with the current code is this: when one of the product options I check has a value such as "(+3,000₩)", I want to replace ₩ with an empty string, but replace does not remove it.

    # ... omitted ...
    option_chk2 = option_chk.select('a[rel]')
    print(option_chk2)
    # Printed result:
    # [<a class="sbFocus" href="#" rel="">- Select option -</a>, <a href="#Medium" rel="Medium">Medium</a>, <a href="#Large (+3,000₩)" rel="Large (+3,000₩)">Large (+3,000₩)</a>]

    a = ' + '.join([i.string for i in option_chk2])
    b = a.replace('₩₩', '')  # ₩₩ stands for an escaped backslash here

    if not a:
        a = "Sold Out"
    print(a)
    # Printed result:
    # - Select option - + Medium + Large (+3,000₩)
    print(b)
    # Printed result (unchanged):
    # - Select option - + Medium + Large (+3,000₩)

That is the current situation.

I searched and found that, because Python treats the backslash "₩" as a special character, it has to be written twice as ₩₩, but that does not work here.

The test below works well.

    import re

    char = "₩₩"
    string = "3,000" + char
    print(string)
    # Result: 3,000₩

    string2 = string.replace("₩₩", "")
    print(string2)
    # Result: 3,000

The backslash was displayed as \ in the code block of this question form, so I arbitrarily wrote it as the special character ₩ instead.

If you have run into a similar problem, please share what you know.

Thank you.

python crawling

2022-10-09 01:00

1 Answer

Um...

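Isn't the problem that the backslash and the character actually in your data are two different characters? The won sign ₩ is U+20A9, while the backslash is U+005C, so replacing one does not touch the other.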

>>> s = '\u20a9'
>>> s
'₩'
>>> print(s)
₩
>>> s.encode("cp949")
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    s.encode("cp949")
UnicodeEncodeError: 'cp949' codec can't encode character '\u20a9' in position 0: illegal multibyte sequence
>>> s2 = '\\'
>>> s2
'\\'
>>> print(s2)
\
>>> ord(s2)
92
>>> ord(s)
8361
>>> hex(ord(s2))
'0x5c'
>>> hex(ord(s))
'0x20a9'
>>> s2.encode("cp949")
b'\\'
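
To check which character is really in the scraped text, something like the following may help. It is only a sketch: `describe` and `unencodable_chars` are hypothetical helper names, and the sample string is an assumed stand-in for the option text in the question.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and the official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


def unencodable_chars(text: str, encoding: str = "cp949") -> list:
    """List the characters in text that the given codec cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


print(describe("\u20a9"))  # U+20A9 WON SIGN
print(describe("\\"))      # U+005C REVERSE SOLIDUS

# Assumed sample value, standing in for the scraped option text
sample = "Large (+3,000\u20a9)"
for ch in unencodable_chars(sample):
    print("cannot encode:", describe(ch))
```

Running this on the real scraped string should show whether the offending character is the won sign or an actual backslash.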

>>> import re
>>> 
>>> s1 = "3000 \u20a9"
>>> s2 = "3000 \u005c"
>>> s1
'3000 ₩'
>>> s2
'3000 \\'
>>> print(s1, s2)
3000 ₩ 3000 \
>>> 
>>> _s1 = s1.replace("₩", "")
>>> _s1
'3000 '
>>> __s1 = s1.replace("\\", "")
>>> __s1
'3000 ₩'
>>> 
>>> _s2 = s2.replace("₩", "")
>>> _s2
'3000 \\'
>>> __s2 = s2.replace("\\", "")
>>> __s2
'3000 '
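
Applied to the question, a possible fix is to replace the won sign itself rather than a backslash. The snippet below is a minimal sketch under assumptions: `option_text` is a made-up stand-in for the joined option string, and `strip_unencodable` is a hypothetical helper that drops every character the target codec cannot represent, in case other characters cause the same error.

```python
def strip_unencodable(text: str, encoding: str = "cp949") -> str:
    """Drop every character that the target codec cannot represent."""
    return text.encode(encoding, errors="ignore").decode(encoding)


# Assumed stand-in for the string built with ' + '.join(...) in the question
option_text = "- Select option - + Medium + Large (+3,000\u20a9)"

# Option 1: remove the won sign (U+20A9) explicitly
cleaned = option_text.replace("\u20a9", "")
print(cleaned)

# Option 2: remove anything the codec rejects, whatever it is
print(strip_unencodable(option_text))

# Encoding no longer raises UnicodeEncodeError
cleaned.encode("cp949")
```

The explicit replace keeps control over what is removed, while the second option silently discards any unsupported character, which may or may not be acceptable for the data being saved.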


2022-10-09 01:00
