Questions about Python regex!

Asked 2 years ago, Updated 2 years ago, 69 views

As practice, I'm writing a program that shows how many chapters each book has, but I can't get the output to come out the way I want. My question is about the title variable below. For example, I want just Ruth as the title but I get Ruth 1. If I change the code so that Ruth comes out as Ruth, then the next book, 1 Samuel, comes out as just 1.

Is there a way to get both Ruth and 1 Samuel right? Thank you.

import requests
import re


def chapter_counter(max_book):
    min_book = 1
    while min_book <= max_book:
        page = requests.get("http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={}&CN=1&CV=99".format(min_book))
        contents = str(page.content)
        chapter = max(int(i) for i in re.findall(r'>(\d+)</[ab]>&nbsp;', contents))
        title = [str(s) for s in re.findall(r'height=12> <b>(\w+)(\s)(\w+)', contents)]
        for w in title:
            symbols = "~:!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
            for i in range(0, len(symbols)):
                w = w.replace(symbols[i], "")
        print(w, 'has', chapter, 'chapters')
        min_book += 1


chapter_counter(10)

regex python3

2022-09-22 20:57

1 Answer

If you run the code you uploaded, it prints:

Genesis   1 has 50 chapters
Exodus   1 has 40 chapters
Leviticus   1 has 27 chapters
Numbers   1 has 36 chapters
Deuteronomy   1 has 34 chapters
Joshua   1 has 24 chapters
Judges   1 has 21 chapters
Ruth   1 has 4 chapters
1   Samuel has 31 chapters
2   Samuel has 24 chapters

That is the result, and I take it you want the last lines to look like this instead:

...
Ruth has 4 chapters
1 Samuel has 31 chapters
2 Samuel has 24 chapters

I don't know the Bible, but it looks like a book title shows up on the page as either <number> <word> (1 Samuel) or <word> <number> (Ruth 1), and the trailing number in the second form is not wanted.

In short, a title is a word that may have a number in front of it.

The regex that reflects this is (\d+\s)?[a-zA-Z]+. The code that builds title and w can then be replaced with the following single line:

w = re.search(r'(?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+', contents).group()
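As a rough sketch of how that line could fit into the rest of the program, the parsing can be pulled into small helpers that are easy to try without hitting the website. The helper names extract_title and max_chapter, the None / 0 fallbacks, and the two sample strings are assumptions made for illustration; the samples are made-up markup that only imitates what the regexes expect, and the real page may differ.

```python
import re

# Title: a word, optionally preceded by a number, right after 'height=12> <b>'.
TITLE_RE = re.compile(r'(?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+')
# Chapter links, same pattern as in the question.
CHAPTER_RE = re.compile(r'>(\d+)</[ab]>&nbsp;')


def extract_title(contents):
    """Return the book title, or None when no matching header is found."""
    match = TITLE_RE.search(contents)
    return match.group() if match else None


def max_chapter(contents):
    """Return the largest chapter number found, or 0 when none is found."""
    return max((int(n) for n in CHAPTER_RE.findall(contents)), default=0)


# Made-up snippets standing in for str(page.content); the real markup may differ.
SAMPLE_RUTH = ('<td height=12> <b>Ruth 1</b></td>'
               '<a>1</a>&nbsp;<a>2</a>&nbsp;<a>3</a>&nbsp;<b>4</b>&nbsp;')
SAMPLE_SAMUEL = ('<td height=12> <b>1 Samuel 1</b></td>'
                 '<a>30</a>&nbsp;<a>31</a>&nbsp;')

for contents in (SAMPLE_RUTH, SAMPLE_SAMUEL):
    print(extract_title(contents), 'has', max_chapter(contents), 'chapters')
```

Checking for None before calling .group() avoids an AttributeError on a page where the header is missing, which the one-line version would raise.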

The pattern (?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+ is explained piece by piece below.

The part you want to find in contents always comes right after height=12>\s<b>, so the pattern has to mention that prefix. The prefix itself is not something to extract, though, so it is wrapped in (?<=...), which is a positive lookbehind.

A positive lookbehind requires the text just before the match to satisfy the given pattern, but leaves that text out of the result.
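To make the difference concrete, here is a small comparison between a capturing group and a lookbehind on a made-up fragment (the text variable is an assumption, not the real page). It also shows a limitation worth knowing: Python's re module only accepts fixed-width lookbehinds, which is why \s works in the prefix but \s+ would not.

```python
import re

text = 'height=12> <b>Ruth 1'  # made-up fragment, not the real page

# Without lookbehind: the prefix is part of the match, so a group is needed.
plain = re.search(r'height=12>\s<b>([a-zA-Z]+)', text)
print(plain.group())   # whole match, prefix included
print(plain.group(1))  # only the captured word

# With lookbehind: the prefix is checked but not included in the match.
behind = re.search(r'(?<=height=12>\s<b>)[a-zA-Z]+', text)
print(behind.group())

# A variable-width lookbehind (\s+) is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=height=12>\s+<b>)[a-zA-Z]+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

If the amount of whitespace before <b> can vary, a capturing group as in the first search is probably the simpler choice.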

The ? after (\d+\s) makes that group optional: it may or may not be present. I added it so the pattern covers both titles that start with a number (1 Samuel) and titles that don't (Ruth 1, where only Ruth is matched).
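The effect of the optional group can be checked on plain strings, without the lookbehind prefix. The sample strings below are assumptions chosen to mirror the output shown earlier; the last one is included only to show that this pattern stops at the first word of a multi-word title.

```python
import re

pattern = re.compile(r'(\d+\s)?[a-zA-Z]+')

samples = ['Ruth 1', '1 Samuel 1', '2 Samuel 1', 'Song of Songs 1']
results = {}
for s in samples:
    m = pattern.match(s)
    # group() is the whole title; group(1) is the optional number part,
    # which is None when the title does not start with a number.
    results[s] = (m.group(), m.group(1))
    print(repr(s), '->', repr(m.group()), repr(m.group(1)))
```

If multi-word titles matter, the pattern would need to be extended, for example by allowing further words separated by spaces.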


2022-09-22 20:57



© 2024 OneMinuteCode. All rights reserved.