Questions about Python regex!

I'm writing a program, as practice, that shows how many chapters there are in each book, but I can't get the output to come out the way I want. My question is about the title variable below. For example, when I try to get just Ruth as the title, I get Ruth 1; and when I change the code so that Ruth comes out as Ruth, the next book, 1 Samuel, comes out as just 1.

Is there a way to get both Ruth and 1 Samuel right? Thank you.

import requests
import re


def chapter_counter(max_book):
    min_book = 1
    while min_book <= max_book:
        page = requests.get("http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL={}&CN=1&CV=99".format(min_book))
        contents = str(page.content)
        chapter = max(int(i) for i in re.findall(r'>(\d+)</[ab]>&nbsp;', contents))
        title = [str(s) for s in re.findall(r'height=12> <b>(\w+)(\s)(\w+)', contents)]
        for w in title:
            symbols = "~:!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
            for i in range(0, len(symbols)):
                w = w.replace(symbols[i], "")
        print(w, 'has', chapter, 'chapters')
        min_book += 1


chapter_counter(10)

regex python3

2022-09-22 20:57

1 Answer

If you run the code you uploaded, you get this output:

Genesis   1 has 50 chapters
Exodus   1 has 40 chapters
Leviticus   1 has 27 chapters
Numbers   1 has 36 chapters
Deuteronomy   1 has 34 chapters
Joshua   1 has 24 chapters
Judges   1 has 21 chapters
Ruth   1 has 4 chapters
1   Samuel has 31 chapters
2   Samuel has 24 chapters

That is the current result. I take it you want it to look like this instead:

...
Ruth has 4 chapters
1 Samuel has 31 chapters
2 Samuel has 24 chapters
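As a side note, the three spaces in lines such as "Ruth   1" seem to come from how re.findall behaves when the pattern has several capture groups: it returns tuples, and the question's code turns each tuple into its string form before stripping the punctuation. The sketch below reproduces that on a made-up fragment; the sample string is an assumption about the page markup, not copied from the site.

```python
import re

# Hypothetical fragment shaped like the markup the question's regex targets.
sample = "height=12> <b>Ruth 1</b>"

# Three capture groups, so findall returns a list of 3-tuples.
matches = re.findall(r'height=12> <b>(\w+)(\s)(\w+)', sample)
print(matches)

# str() of a tuple keeps its quotes, commas and parentheses.
as_text = str(matches[0])

# Stripping that punctuation leaves the spaces after the two commas
# plus the captured space itself: three spaces in total.
for ch in "(),'":
    as_text = as_text.replace(ch, "")
print(repr(as_text))
```

So the odd spacing is a side effect of printing a tuple, not something in the page itself.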

I don't know the Bible, but it looks like each heading has the form <number> <word> (1 Samuel) or <word> <number> (Ruth 1), and the part you want is the title without the trailing number.

Put briefly, the title is a word that may have a number in front of it.

The regex that reflects this is (\d+\s)?[a-zA-Z]+. With it, the code that deals with title and w can be shortened to the following line:

w = re.search(r'(?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+', contents).group()
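To try that line without fetching the page, here is a minimal sketch that applies the same pattern to a few made-up strings. The sample fragments and the helper name extract_title are assumptions for illustration only; the real page markup may differ. The helper also returns None instead of raising when nothing matches, which the one-liner above does not do.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
TITLE_RE = re.compile(r'(?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+')


def extract_title(contents):
    """Return the book title found in contents, or None if there is no match."""
    match = TITLE_RE.search(contents)
    return match.group() if match else None


# Hypothetical fragments; the real page may look different.
samples = [
    "<td height=12> <b>Ruth 1</b>",
    "<td height=12> <b>1 Samuel 1</b>",
    "<p>no heading here</p>",
]

for html in samples:
    print(extract_title(html))
```

Inside chapter_counter, this would replace the title list and the symbol-stripping loop with w = extract_title(contents).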

The parts of (?<=height=12>\s<b>)(\d+\s)?[a-zA-Z]+ are explained one at a time below.

The height=12>\s<b> part is needed because the text you want to find in contents comes right after it. It is not itself part of what should be extracted, though, so it is wrapped in (?<=...), which is a positive lookbehind.

A positive lookbehind requires that the text just before the match satisfies the given pattern, but leaves that text out of the result.
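A small comparison may make the difference concrete. The sketch below runs three variants on a made-up string (an assumption, not the real page): with a lookbehind, with the prefix as part of the match, and with a capture group as an alternative.

```python
import re

# Hypothetical text for illustration.
text = "height=12> <b>Ruth 1</b>"

# Lookbehind: "<b>" must come before the match but is not part of it.
with_lookbehind = re.search(r'(?<=<b>)[a-zA-Z]+', text)

# Plain prefix: "<b>" becomes part of the matched text.
with_prefix = re.search(r'<b>[a-zA-Z]+', text)

# Alternative: keep the prefix in the pattern and read a capture group.
with_group = re.search(r'<b>([a-zA-Z]+)', text)

print(with_lookbehind.group())
print(with_prefix.group())
print(with_group.group(1))
```

Note that Python's re module only accepts lookbehind patterns of a fixed width. height=12>\s<b> qualifies because every part of it matches exactly one character.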

The ? after (\d+\s) makes that group optional: it may or may not be present. I added it so that the pattern covers both titles that begin with a number (1 Samuel) and titles that do not (Ruth).
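The effect of the optional group can be checked on its own, without the lookbehind. This is a minimal sketch on made-up heading strings (assumed for illustration); group(1) shows whether the numeric prefix took part in the match.

```python
import re

# The title part of the pattern only, without the lookbehind.
pattern = re.compile(r'(\d+\s)?[a-zA-Z]+')

# Hypothetical heading texts.
headings = ["Ruth 1", "1 Samuel 1", "2 Samuel 1"]

results = []
for heading in headings:
    m = pattern.match(heading)
    # group() is the whole title; group(1) is the optional numeric
    # prefix, or None when the heading does not start with a number.
    results.append((m.group(), m.group(1)))

for title, prefix in results:
    print(repr(title), repr(prefix))
```

One limitation to be aware of: [a-zA-Z]+ stops at the first space, so a title made of several words would be cut off after its first word.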

2022-09-22 20:57

			