Web Image Crawling Question.

Hello.

I'm studying web crawling with Python and using the findall function. After inspecting the page with F12 > Ctrl + Shift + C, I don't know what pattern to write for the HTML structure I found.

The site is https://chilgok.fowi.or.kr/facility/admin.do. The image tag I'm trying to match is <img src="/images/facility/admin_photo05.jpg" alt="library thumbnail">

My code is as follows:

import urllib.request as req
import re

rep = req.urlopen('https://chilgok.fowi.or.kr/facility/admin.do')

data = rep.read().decode('utf8')

result = re.findall('+.jpg', data)  # I don't know what pattern to put here.
for link in result:
    idx = link.rfind('/')
    with open(link[idx+1:], "wb") as f:
        pic = req.urlopen(link)
        f.write( pic.read() )

Source: https://www.youtube.com/watch?v=NKE0ozQ1Esw

I will not use this commercially.

python crawling html css

2022-09-20 21:36

1 Answer

The site is built with JSP and refers to its images by relative path, so you can't download them directly from a full URL the way the video does.

I checked and found that the image paths have to be appended to https://chilgok.fowi.or.kr, so I modified the code slightly.
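The joining step can also be done with urllib.parse.urljoin from the standard library, which resolves a relative path against the page URL. This is only a minimal sketch; the image path is the one seen in the page source, and the variable names are my own.

```python
from urllib.parse import urljoin

page_url = 'https://chilgok.fowi.or.kr/facility/admin.do'
relative_path = '/images/facility/admin_photo01.jpg'  # path as it appears in the HTML

# A path starting with "/" is resolved against the site root,
# not against the directory of the page.
image_url = urljoin(page_url, relative_path)
print(image_url)
```

Plain string concatenation works here because every path starts with "/"; urljoin also handles paths that do not.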

I also added an ssl context, because urllib could not open the site over HTTPS without it. Note that the context I used turns off certificate verification.
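To make the difference concrete, here is a small sketch comparing the default context with the unverified one. It does not touch the network. Whether the default context would work for this particular site is something I have not checked, so treat the unverified context as a workaround for practice only.

```python
import ssl

# Default context: verifies the server certificate and the host name.
verified = ssl.create_default_context()

# Unverified context (a private helper in the ssl module): no verification at all.
unverified = ssl._create_unverified_context()

print(verified.verify_mode == ssl.CERT_REQUIRED, verified.check_hostname)
print(unverified.verify_mode == ssl.CERT_NONE, unverified.check_hostname)
```

Either object can be passed as the context argument of urlopen.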

The full code is below.

import urllib.request as req
import re
import ssl

# Skips certificate verification (workaround for the HTTPS error with urllib)
context = ssl._create_unverified_context()
url = 'https://chilgok.fowi.or.kr'
active = '/facility/admin.do'
rep = req.urlopen(url + active, context=context)
data = rep.read().decode('utf8')
result = re.findall(r'/images.+?\.jpg', data)
# Regular expression: "/images" + the shortest run of one or more characters + ".jpg"
# "/images/facility/admin_photo01.jpg" is a relative path, so it needs the root path in front. In this case, "https://chilgok.fowi.or.kr"

for link in result:
    # Root path + relative path
    imgUrl = url + link
    print(imgUrl)
    idx = imgUrl.rfind('/')
    fileName = imgUrl[idx+1:]
    print(fileName)
    # If an error occurs partway through, check how far the printed values are correct.

    with open(fileName, "wb") as f:
        pic = req.urlopen(imgUrl, context=context)  # SSL context here too (https)
        f.write(pic.read())
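How greedy and non-greedy matching behave with a pattern of this kind can be checked on a small sample without downloading anything. The HTML string below is a made-up stand-in for the real page, with two image tags on one line.

```python
import re

# Hypothetical one-line HTML with two images (not copied from the real site).
sample = ('<img src="/images/facility/admin_photo01.jpg" alt="a">'
          '<img src="/images/facility/admin_photo02.jpg" alt="b">')

# Greedy ".+" runs to the LAST "jpg" on the line, so both tags merge into one match.
greedy = re.findall(r'/images.+jpg', sample)

# Non-greedy ".+?" stops at the first ".jpg", giving one match per image.
lazy = re.findall(r'/images.+?\.jpg', sample)

print(len(greedy))
print(lazy)
```

If the downloaded file names look like long chunks of HTML, the greedy form is the likely cause.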

I didn't know about Ctrl + Shift + C before; it turns out to be a useful shortcut.

To learn more about web crawling, see Beautiful Soup.
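Beautiful Soup is a third-party package, so as a rough illustration of the same idea (reading the src attribute from parsed img tags instead of using a regular expression), here is a sketch with the standard library's html.parser. The class name and the sample HTML are my own, not taken from the site.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


# Hypothetical fragment modeled on the tag in the question
sample = '<div><img src="/images/facility/admin_photo05.jpg" alt="library thumbnail"></div>'
collector = ImgSrcCollector()
collector.feed(sample)
print(collector.sources)
```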

Thank you.

2022-09-20 21:36
