I have a Python question.

['01', '15 Good Goods \xa0\xa0\xa0\xa0\xa0\xa0', '034252340', '96,000', '814,000', '\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa01, '01, '01, '01, '01, '01, '01, '0
['']
['02', '16 Recommendations \xa0\xa0\xa0\xa0', '\n', '0342742342', '96,000', '814,000, '\n\xa0\xa00\xa0\n\xa01\xa0\xa0\xa0\xa0\xa03,\xa0\xa0\xa0\xa0\n\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa
Process finished with exit code 0

I've been scraping a site, and the result comes out weird, as shown above.

At the very least, it should come out like this:

01, 15 Good Product SLEFT, 034252340, 96,000, 814,000

When I searched on Google, it said this was a Unicode problem.

I tried regular expressions (the re module), but I still can't fix it.

# Build an episode list >> each row is printed on its own line in the final output.
episodes = []

# Extract the table tag that contains the episode contents.
table = bs.find('table', class_='td00')

for row in table.find_all('tr'):

    values = []

    for col in row.find_all('td'):
        text = col.get_text()
        rp_text = text.replace('(?<!\x0d)\x0a',' ')
        values.append(rp_text)

    if values:
        episodes.append(values)

del episodes[0]

for episode in episodes:
    print(episode)

Please give me a hint.

python

2022-09-21 21:23

1 Answer

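Try unicodedata.normalize. With the NFKD form, each non-breaking space (\xa0) becomes a regular space, which you can then strip.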

from unicodedata import normalize
s = '16 Recommended Product \xa0\xa0\xa0\xa0'
normalize('NFKD', s)
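
To show how this could fit into the loop from the question, here is a minimal sketch rather than a definitive fix. The helper name clean_cell and the sample row are made up for illustration. It uses NFKC instead of NFKD, which is my own substitution: both forms turn \xa0 into a regular space, but NFKC recomposes characters afterwards, which may matter if the page contains text that NFKD would split apart (Hangul or accented letters, for example).

```python
from unicodedata import normalize


def clean_cell(text):
    """Hypothetical helper: normalize one cell's text and tidy its whitespace."""
    # NFKC maps the non-breaking space (\xa0) to an ordinary space.
    normalized = normalize('NFKC', text)
    # split() with no arguments splits on any run of whitespace (spaces, \n, \r),
    # so joining with a single space collapses runs and trims both ends.
    return ' '.join(normalized.split())


# Sample row modeled on the output in the question (values are illustrative).
raw_row = ['02', '16 Recommendations \xa0\xa0\xa0\xa0', '\n',
           '0342742342', '96,000', '814,000']

cleaned = [clean_cell(cell) for cell in raw_row]
# Drop cells that held nothing but whitespace.
cleaned = [cell for cell in cleaned if cell]
print(cleaned)
```

In the question's loop, this would presumably replace the text.replace(...) line with values.append(clean_cell(col.get_text())). Note that str.replace does not interpret regular expression patterns, so the original replace call most likely never matched anything.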

2022-09-21 21:23
