Python web crawler question.

Asked 2 years ago, Updated 2 years ago, 56 views

Hello, I'm trying to write a program that gets the number of verses in each chapter from the web page below.

url = http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99

I have two questions.

1) Each chapter is on its own page. In the URL above, VL is the book number (66 books in total) and CN is the chapter number within that book. For example, VL=20&CN=15 is the URL for chapter 15 of the 20th book. However, each book has a different number of chapters, and even when a book has only 50 chapters, URLs for chapters 51, 52, and 53 still seem to exist, just without content. In a word-frequency program I looked at, VL was incremented by 1 whenever the class tkl could not be found, but I don't know how to handle this case.

2) The number of verses also differs from chapter to chapter, so how can I extract it? For each book, the total would be the sum of the verse counts of its chapters, but... I have no idea, haha.

I'd appreciate your help.

I wrote the code below to count the chapters in each book, and I'm attaching it for your reference. Thank you.

import re

import requests


def chapter_counter(max_book):
    # Open chapter 1 of each book from 1 to max_book and read the chapter count from it.
    for book in range(1, max_book + 1):
        page = requests.get("http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={}&CN=1&CV=99".format(book))
        contents = str(page.content)
        # The largest number among the paging links is taken as the last chapter.
        chapter = max(int(i) for i in re.findall(r'>(\d+)</[ab]>&nbsp;', contents))
        # The bold heading text: an optional leading number followed by letters.
        w = re.search(r'(?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+', contents).group()
        print(w, '-', chapter, 'chapters')


chapter_counter(10)

python python3

2022-09-21 22:04

1 Answer

Three modules are required: requests, beautifulsoup4 (imported as bs4), and html5lib.

import requests
import bs4


def main():
    html = requests.get('http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=40&CN=8&CV=99').content
    bs = bs4.BeautifulSoup(html, 'html5lib')

    # The paging cell that holds the chapter links.
    pageTD = bs.find_all('td', attrs={'align': 'center', 'class': 'tk3'})[0]
    # The second-to-last link in the paging cell is the last chapter.
    lastChapter = pageTD.find_all('a')[-2].text

    # The verses are held in <ol> lists.
    ol = bs.find_all('ol')
    # Each <ol> tag's start attribute is the number of its first verse (e.g. 11, 16, ...).
    # Verse count = start value of the last <ol> + number of <li> items (verses) in it - 1.
    sectionCnt = int(ol[-1].attrs['start']) + len(ol[-1].find_all('li')) - 1

    print('Last chapter of this book: {0}\nNumber of verses in this chapter: {1}'.format(lastChapter, sectionCnt))


if __name__ == '__main__':
    main()
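
As a rough sketch, the single-page idea above could be extended to walk every chapter of one book. Everything named here (VerseCounter, count_verses, count_book, fetch_page, BASE_URL, max_chapters) is hypothetical and not part of the original answer. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs, and it assumes that a page past the last chapter contains no <ol> list, which has not been verified against the site.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# URL pattern taken from the question; VL is the book, CN is the chapter.
BASE_URL = "http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={book}&CN={chapter}&CV=99"


class VerseCounter(HTMLParser):
    """Remember the start value and <li> count of the last <ol> on the page."""

    def __init__(self):
        super().__init__()
        self.last_start = None  # stays None if the page has no <ol>
        self.last_items = 0
        self.in_ol = False

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            # A missing start attribute means the list is numbered from 1.
            self.last_start = int(dict(attrs).get("start") or 1)
            self.last_items = 0
            self.in_ol = True
        elif tag == "li" and self.in_ol:
            self.last_items += 1

    def handle_endtag(self, tag):
        if tag == "ol":
            self.in_ol = False


def count_verses(html):
    """Return start + number of <li> - 1 for the last <ol>, or 0 without any <ol>."""
    parser = VerseCounter()
    parser.feed(html)
    if parser.last_start is None:
        return 0
    return parser.last_start + parser.last_items - 1


def count_book(book, fetch, max_chapters=200):
    """Collect verse counts for chapters 1, 2, ... until a page has no verses.

    Assumption: an out-of-range chapter yields a page without verses.
    max_chapters is only a safety limit against looping forever.
    """
    counts = []
    for chapter in range(1, max_chapters + 1):
        verses = count_verses(fetch(BASE_URL.format(book=book, chapter=chapter)))
        if verses == 0:
            break
        counts.append(verses)
    return counts


def fetch_page(url):
    # Assumption: the tags are ASCII, so undecodable bytes can be replaced safely.
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling count_book(40, fetch_page) would then return one verse count per chapter of book 40, and sum() over that list gives the book total. Passing the fetch function as a parameter keeps the parsing testable without network access.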


2022-09-21 22:04


