Python 2.7 code questions.

Asked 1 year ago, Updated 1 year ago, 133 views

This program extracts image URLs from a page, and I'm stuck on two problems with it.

https://stackoverflow.com/questions/14587728/what-does-this-error-in-beautiful-soup-means

I think my symptom is similar to the one in that question, but the error persists even after I apply the fix, so I'm asking here.

How should I fix it? For reference, I'm using Python 2.7.

import requests, bs4

_url = 'https://twitter.com/'


content = requests.get(_url).content
imgs = bs4.BeautifulSoup(content, 'html.parser', parse_only=bs4.SoupStrainer('img'))
img_urls = (img['src'] if img['src'].startswith('http') else "{}{}".format(_url, img['src']) for img in imgs)
for url in img_urls: print(url)

keyerror python crawling

2022-09-21 20:53

1 Answer

Python ships with a built-in debugger called pdb.

It is simple to use. Python 3.7 and later provide a built-in function, breakpoint(), for this; on earlier versions such as 2.7, add import pdb; pdb.set_trace() as follows.

import requests, bs4

_url = 'https://twitter.com/'


content = requests.get(_url).content
imgs = bs4.BeautifulSoup(content, 'html.parser', parse_only=bs4.SoupStrainer('img'))
import pdb; pdb.set_trace()
img_urls = (img['src'] if img['src'].startswith('http') else "{}{}".format(_url, img['src']) for img in imgs)
for url in img_urls: print(url)
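If the same script has to run on both old and new interpreters, one way to pick between the two calls is sketched here. The name pause is hypothetical and only serves the illustration; the version check relies on breakpoint() having been added in Python 3.7.

```python
import pdb
import sys

# breakpoint() exists only on Python 3.7+, so the name is touched
# only inside the branch that runs on those versions.
if sys.version_info >= (3, 7):
    pause = breakpoint
else:
    pause = pdb.set_trace  # works on older interpreters, including 2.7

# Call pause() at the point where execution should stop, for example:
# pause()
```

Both callables stop in the frame of the caller, so pause() behaves like writing the debugger call inline.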

Save the script as pdb_test.py and run it; execution pauses at the line after import pdb; pdb.set_trace() and drops into the (Pdb) prompt.

At this prompt, the list command displays the surrounding source, and you can inspect any variable assigned so far by typing its name.

python pdb_test.py #Run the pdb_test.py file in the cmd window
> d:\pdb_test.py(9)<module>()
-> img_urls = (img['src'] if img['src'].startswith('http') else "{}{}".format(_url, img['src']) for img in imgs)
(Pdb) list
  4
  5
  6     content = requests.get(_url).content
  7     imgs = bs4.BeautifulSoup(content, 'html.parser', parse_only=bs4.SoupStrainer('img'))
  8     import pdb; pdb.set_trace()
  9  -> img_urls = (img['src'] if img['src'].startswith('http') else "{}{}".format(_url, img['src']) for img in imgs)
 10     for url in img_urls: print(url)
[EOF]
(Pdb) imgs
<img alt="" class="avatar size32"/><img alt="" class="avatar size32"/><img alt="" class="avatar size32"/>

Inspecting imgs at the prompt shows that none of the img tags has a src attribute, so the KeyError raised by img['src'] is expected.
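The same behavior can be reproduced without the network. The sketch below uses a small inline HTML string as a stand-in for the fetched page (the markup and the /logo.png path are assumptions made for the example) and contrasts subscripting, which raises KeyError, with Tag.get, which returns None when the attribute is absent.

```python
import bs4

# Inline HTML standing in for the fetched page:
# two tags without src, one tag with it.
html = '<img alt="" class="avatar size32"/><img alt=""/><img src="/logo.png"/>'
soup = bs4.BeautifulSoup(html, 'html.parser')
tags = soup.find_all('img')

try:
    tags[0]['src']  # subscripting a missing attribute raises KeyError
    raised = False
except KeyError:
    raised = True

# Tag.get returns None instead of raising when the attribute is absent.
srcs = [tag.get('src') for tag in tags]
missing = sum(1 for src in srcs if src is None)

print(raised, srcs, missing)
```

Counting the None values this way gives a quick check of how many tags on a page lack src, without stepping through the debugger.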

In other words, the HTML document uses img tags that carry no src attribute, so the code has to allow for src being absent.

import requests, bs4

_url = 'https://twitter.com/'


content = requests.get(_url).content
imgs = bs4.BeautifulSoup(content, 'html.parser', parse_only=bs4.SoupStrainer('img'))
img_urls = (img['src'] if img['src'].startswith('http') else "{}{}".format(_url, img['src'])
            for img in imgs
            if img.has_attr('src'))  # Keep only the tags that actually have a src attribute.
for url in img_urls:
    # Print only URLs that contain a dot and do not end in .gif.
    if url.rfind('.') > -1 and url[url.rfind('.'):].lower() != '.gif':
        print(url)
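Two details in the listing above may be worth tightening: joining with "{}{}".format produces a double slash when src starts with "/", and searching the whole URL for "." also matches dots in the host name or query string. The following is a minimal sketch of an alternative using only the standard library. The function clean_image_urls, its skip_ext parameter, and the sample values are hypothetical names introduced for illustration; it takes raw src values (None for a missing attribute) rather than tags.

```python
import posixpath

try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse      # Python 2.7


def clean_image_urls(base_url, srcs, skip_ext='.gif'):
    """Return absolute URLs, skipping missing src values and skip_ext files."""
    result = []
    for src in srcs:
        if not src:  # None or empty string: the tag had no usable src
            continue
        absolute = urljoin(base_url, src)  # resolves relative and root-relative paths
        # Look at the extension of the path only, ignoring host and query string.
        ext = posixpath.splitext(urlparse(absolute).path)[1].lower()
        if ext and ext != skip_ext:
            result.append(absolute)
    return result


sample = [None, '/a/logo.png', 'https://cdn.example.com/pic.JPG?size=2',
          'anim.gif', '/no-extension']
print(clean_image_urls('https://twitter.com/', sample))
```

With bs4, the srcs argument could be built as (img.get('src') for img in imgs), which folds the missing-attribute check into the same loop.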


2022-09-21 20:53



© 2024 OneMinuteCode. All rights reserved.