Python web crawling question.

Hello, I found a bug while building and testing my program.

My program can report word frequencies, count chapters, and count verses; the chapter-count and verse-count functions are omitted from this post. I ran into an error while testing the word-frequency function and couldn't fix it myself, so I'm posting here for help.

When you run the program, it prompts for a starting point and an ending point. When I checked the results, though, I found that the starting point was excluded. For example, if I specify Chapter 6 of Book 49 (its last chapter) through Chapter 2 of Book 50, Chapter 6 of Book 49 is excluded and words are collected starting from Chapter 1 of Book 50.

How can I fix this problem?

Thank you.

from bs4 import BeautifulSoup
import operator
import requests
import re
import bs4


def word_frequency(max_book_w, max_chapter_w):
    word_list = []
    book = b
    chapter = c
    while book <= max_book_w:
        while chapter <= max_chapter_w:
            url = 'http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={}&CN={}&CV=99'.format(book, chapter)
            source_code = requests.get(url).text
            soup = BeautifulSoup(source_code, "html.parser")
            for bible_text in soup.findAll('font', {'class': 'tk4l'}):
                content = bible_text.get_text()
                words = content.lower().replace('-', ' ').split()
                for each_word in words:
                    word_list.append(each_word)
            chapter += 1
        book += 1
        chapter = 1
    clean_up_list(word_list)


def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        symbols = "~:!@#$%^&*()_+`{}|\"?><`-=\\][';/.,']"
        for symbol in symbols:
            word = word.replace(symbol, "")
        if len(word) > 0:
            clean_word_list.append(word)
    dictionary(clean_word_list)


def dictionary(clean_word_list):
    word_count = {}
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1
    for key, value in sorted(word_count.items(), key=operator.itemgetter(1)):
        print(key, value)

user = int(input('''What do you need a help on?
          1 - word frequency, 2 - chapter count, 3 - verse count'''))
if user == 1:
    b = int(input("type the starting book"))
    c = int(input("type the starting chapter of the book"))
    x = max_book_w = int(input("type the last book"))
    y = max_chapter_w = int(input("what chapter of the book would be the last chapter?"))
    word_frequency(x, y)
elif user == 2:
    min_b = int(input("what is the starting book?"))
    max_b = int(input("what is the last book?"))
    chapter_counter(max_b)
elif user == 3:
    books = int(input("which book do you want?"))
    chapters = int(input("which chapter of the book you want?"))
    if __name__ == '__main__':
        verse(books, chapters)

Edit 1: I also tested with Chapter 5 of Book 49 as the starting point. It doesn't exclude only Chapter 5; both Chapters 5 and 6 of Book 49 are missing from the results.
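The behavior described above can be reproduced without any network access. The sketch below copies only the loop structure of `word_frequency` and records which (book, chapter) pairs would be requested; `visited_chapters` is a hypothetical helper name, not part of the original program. It suggests that the inner condition `chapter <= max_chapter_w` is compared against the ending chapter for every book, so a starting chapter larger than the ending chapter is never visited.

```python
def visited_chapters(b, c, max_book_w, max_chapter_w):
    """Return the (book, chapter) pairs the original nested loops would visit.

    Hypothetical helper: it mirrors the loop structure of word_frequency
    from the question but skips the HTTP request and parsing.
    """
    visited = []
    book = b
    chapter = c
    while book <= max_book_w:
        # The ending chapter is applied to every book, not only the last one.
        while chapter <= max_chapter_w:
            visited.append((book, chapter))
            chapter += 1
        book += 1
        chapter = 1
    return visited


# Start at Book 49, Chapter 6 and end at Book 50, Chapter 2.
print(visited_chapters(49, 6, 50, 2))  # [(50, 1), (50, 2)]

# Start at Book 49, Chapter 5 instead: Chapters 5 and 6 are both skipped.
print(visited_chapters(49, 5, 50, 2))  # [(50, 1), (50, 2)]
```

In both runs the inner loop is skipped for Book 49 because 6 (or 5) is already greater than the ending chapter 2, which matches the missing chapters reported in the question.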

python crawling

2022-09-22 20:39

1 Answer

To go from one chapter to another chapter, you need four inputs (starting book, starting chapter, last book, last chapter), but the function seems to take only two. There seems to be no parameter in the code that carries the starting position. I'm a beginner, though, and I've never used Python before.
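One possible way to make sure the starting chapter is included is to pass all four values explicitly and to treat the ending chapter as a limit only for the last book. The sketch below is a minimal, network-free illustration rather than a drop-in fix. `chapter_range` and `CHAPTERS_PER_BOOK` are hypothetical names, and the chapter counts (6 for Book 49, as stated in the question; 4 for Book 50) are assumptions that would have to be supplied for every book actually crawled.

```python
def chapter_range(start_book, start_chapter, end_book, end_chapter, chapters_per_book):
    """Yield (book, chapter) pairs from the start to the end position, inclusive.

    chapters_per_book is an assumed mapping of book number -> chapter count.
    """
    for book in range(start_book, end_book + 1):
        # Only the first book starts at start_chapter; later books start at 1.
        first = start_chapter if book == start_book else 1
        # Only the last book stops at end_chapter; earlier books run to their end.
        last = end_chapter if book == end_book else chapters_per_book[book]
        for chapter in range(first, last + 1):
            yield book, chapter


# Assumed chapter counts for the two books used in the question's example.
CHAPTERS_PER_BOOK = {49: 6, 50: 4}

pairs = list(chapter_range(49, 6, 50, 2, CHAPTERS_PER_BOOK))
print(pairs)  # [(49, 6), (50, 1), (50, 2)]
```

Each yielded pair could then be substituted into the URL template from the question in place of the nested `while` loops.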

2022-09-22 20:39
