How do I extract only one element when the element I want from the HTML is duplicated?

I'm asking because I couldn't solve this even after googling it and reading the official documentation. The problem is as follows.

After parsing, I want to output only one link address corresponding to href from the output result below.

Code:

site = requests.get("http://www.alba.co.kr/")

alba = BeautifulSoup(site.text, 'html.parser')

brands = list(alba.find(id = "MainSuperBrand").find('ul', {"class" : "goodsBox"}).find_all('a', {"class" : "goodsBox-info"}))

for b in brands :
    if "http" in b :
        b = b.select('a.href')
    print(b)

I am trying to extract the href attribute of the first tag from the parsed output below.

[
<li class="first impact"><div class="B_MyAd_"></div>
<a class="goodsBox-info" href="http://barogo.alba.co.kr/">
<span class="logo"> <img alt="(Note)" src="//imagelogo.alba.kr/data_image2/logo/brand/20200916174910805.gif"/> </span> <span class="company">Barogo</span> <span class="title"><span>Barogo Recruitment &lt;National Riders&gt;</span></span> </a>
<a class="brandHover" href="http://barogo.alba.co.kr/"></a></li>,
... (rest of the list)
]

Under each li element there are two <a> tags with an href. In this case, how can I output only one of them?

html python java scraping

2022-09-20 12:31

4 Answers

Check the type of the return value.

a = soup.find_all('a')
print(a)
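
To make that check concrete, here is a minimal offline sketch. The inline HTML string is a made-up stand-in for the real page, not its actual markup. It suggests that find_all returns a list-like ResultSet of Tag objects, and that an attribute such as href is read from a single Tag rather than from the ResultSet.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<ul class="goodsBox">
  <li><a class="goodsBox-info" href="http://barogo.alba.co.kr/">Barogo</a>
      <a class="brandHover" href="http://barogo.alba.co.kr/"></a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

a = soup.find_all("a")
print(type(a).__name__)     # the container type returned by find_all
print(type(a[0]).__name__)  # the type of one element
first_href = a[0]["href"]   # attributes are read from a single Tag
print(first_href)
```

Indexing the result (a[0]) before asking for ['href'] is the key step; asking the whole result for an attribute does not work.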

Additional content:

It doesn't work? Are you sure? Perhaps you are doing it differently from the way I explained earlier.

aaa is converted to a list, so how did you end up with a ResultSet as the result?

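import requests
from bs4 import BeautifulSoup

a = requests.get("http://www.alba.co.kr/")

aa = BeautifulSoup(a.text, 'html.parser')

# collect only the <a class="goodsBox-info"> tags inside the brand list
aaa = list(aa.find(id = "MainSuperBrand").find('ul', {"class" : "goodsBox"}).find_all('a', {"class" : "goodsBox-info"}))


for aaaa in aaa :
    print()
    # index the tag by attribute name to get only the link address
    print(aaaa['href'])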

2022-09-20 12:31

If you're using Python's Beautiful Soup, there's also a function called find. Look it up.
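
As a rough illustration of that suggestion, the sketch below runs find on a made-up HTML string (the markup and the link text are assumptions, not the real page). find returns only the first match, so it can be used to pick one <a> per li even when two tags carry the same href.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: each <li> holds two <a> tags with the same href.
SAMPLE_HTML = """
<div id="MainSuperBrand">
  <ul class="goodsBox">
    <li class="first impact">
      <a class="goodsBox-info" href="http://barogo.alba.co.kr/">Barogo</a>
      <a class="brandHover" href="http://barogo.alba.co.kr/"></a>
    </li>
    <li>
      <a class="goodsBox-info" href="http://dadam.alba.co.kr/">Dadam</a>
      <a class="brandHover" href="http://dadam.alba.co.kr/"></a>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
box = soup.find(id="MainSuperBrand")

# find() returns only the first match (or None when nothing matches).
first_link = box.find("a", class_="goodsBox-info")
first_href = first_link["href"] if first_link is not None else None
print(first_href)

# One href per <li>: take the first <a> inside each list item.
hrefs = []
for li in box.find_all("li"):
    link = li.find("a", href=True)  # href=True skips <a> tags without href
    if link is not None:
        hrefs.append(link["href"])
print(hrefs)
```

Checking for None before indexing avoids an exception when the page layout changes and a tag is missing.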

2022-09-20 12:31

The code is as follows:

site = requests.get("http://www.alba.co.kr/")

alba = BeautifulSoup(site.text, 'html.parser')

brands = list(alba.find(id = "MainSuperBrand").find('ul', {"class" : "goodsBox"}).find_all('a', {"class" : "goodsBox-info"}))


for b in brands :
  if "http" in b : 
    b = b.select('a.href')
  print(b)

Parsed HTML to extract from:

<a class="goodsBox-info" href="http://dadam.alba.co.kr/"> 
<span class="logo"> <img alt="Three Great Pigs' Feet" src="//image-logo.alba.kr/data_image2/logo/brand/20211125132816761.gif"/> </span> 
<span class="company">Three major pigs' feet</span>
 <span class="title">
<span>Recruitment of employees and part-timers nationwide</span></span> 
<span class="wrap"> 
<span class="local">National</span> 
<span class="pay"><span class="pay Letter">Check by announcement</span>
 <span class="payIcon talk"></span></span> </span> </a>

2022-09-20 12:31

I also tried the revised code. The link address is printed correctly, but as in the image, the tag text is printed in duplicate.

So I posted a question because I thought I needed to do something more within the tag to avoid the duplicate content.
I don't know why the tag text is printed together with the link address.
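
One possible explanation, offered as a guess since the image is not available: if the loop still ends with print(b) as in the question, the whole tag is printed. In Beautiful Soup, "http" in b tests the tag's direct children rather than its href text, and select('a.href') is a CSS selector for an <a> with class "href", so neither line yields the link. The sketch below reproduces this on a made-up HTML string (an assumption, not the real page) and reads only the href instead.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one entry of the real page.
SAMPLE_HTML = (
    '<a class="goodsBox-info" href="http://dadam.alba.co.kr/">'
    '<span class="company">Three major pigs\' feet</span></a>'
)

b = BeautifulSoup(SAMPLE_HTML, "html.parser").find("a")

# Membership on a Tag checks its direct children, not the href string.
contains_http = "http" in b
# 'a.href' is a CSS selector for <a class="href">, so nothing matches here.
selected = b.select("a.href")
# Reading the attribute returns just the link address.
href = b.get("href")

print(contains_http, len(selected), href)
```

If that is what happened, printing b.get("href") (or b["href"]) in place of print(b) should leave only the address.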

2022-09-20 12:31
