Excluding only part of a string with Python regular expressions

I want to keep the first part of each address (up to the lot number) and delete the rest. I tried Python regular expressions, but they don't work well.

I want the regular expression to skip numbers such as those in "3rd floor", "No. 3070", "105-dong" (building 105), and "Junghwasan-dong 2-ga". Please tell me how.

Regular expression used: (\d{1,4}\-\d{1,4})|(\d{1,4})|(San\d{1,4}\-\d{1,4})|(San\d{1,4})

Target Data:

 1470-1, 1497, Namyang-ri, Okgye-myeon, Gangneung-si, Gangwon-do
No. 3070 on the 3rd floor of Renecite 529-1 Gwaebeop-dong, Sasang-gu, Busan
1405, Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do
No. 307, 3rd floor, 105-dong, Donggwang Apartment, 1320-41, Taein-dong, Gwangyang-si, Jeollanam-do
No. 101 of the lower floor of the paper, 549-14, Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do
No. 203, 2nd Floor, Jaesang-dong, 1st Apartment, Cheonan Mokcheon-dong, 422 Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do
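
To see why the pattern above over-matches, here is a minimal check against one of the rows. The mountain-lot prefix is written as `San` here to match the sample data, and the variable names are only for illustration. Every alternative accepts any run of 1 to 4 digits, so floor, unit, and building numbers are picked up along with the lot number.

```python
import re

# The pattern from the question; "San" is assumed to be the mountain-lot prefix.
pattern = re.compile(r'(\d{1,4}-\d{1,4})|(\d{1,4})|(San\d{1,4}-\d{1,4})|(San\d{1,4})')

row = ('No. 307, 3rd floor, 105-dong, Donggwang Apartment, '
       '1320-41, Taein-dong, Gwangyang-si, Jeollanam-do')

# group(0) is the matched text, whichever alternative fired.
matches = [m.group(0) for m in pattern.finditer(row)]
print(matches)  # ['307', '3', '105', '1320-41']
```

Only the last match is the lot number, which is why the answers below look at the position of the number relative to the dong/ri/ga name instead of at the digits alone.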


regex
python
					
					

	


		
	

	
		2022-09-21 12:45



			

			
			2 Answers


	
		
It is faster and easier to handle this with a small token-based function, as below.
addr = '1470-1, 1497 Namyang-ri, Okgye-myeon, Gangneung-si, Gangwon-do'
addr1 = "Renecite 3rd floor No. 3070 in 529-1 Gwaebeop-dong, Sasang-gu, Busan"
addr2 = '1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do'
addr3 = 'No. 307, 3rd floor of Donggwang Apartment No. 105-dong, Taein-dong, Gwangyang-si, Jeollanam-do'
addr4 = 'No. 101 of the lower floor of the paper, 549-14, Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do'
addr5 = "San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do, 2nd floor, Jaesang-dong, Cheonan-si, Chungcheongnam-do"
addr6 = 'San 48-1, Guam-ri, Sacheon-eup, Sacheon-si'
addr7 = "Jindeok-ri 35, Hongnong-eup, Yeonggwang-gun, Jeollanam-do, 1st floor 27"


def extract_addr(addr):
    # Remove commas, then split on whitespace into tokens.
    tokens = addr.translate({ord(','): None}).split()
    for i, token in enumerate(tokens):
        # If a token ends in -dong, -ri, or -ga, the next token is the lot number.
        # This relies on the original Korean word order, where the lot number
        # follows the dong/ri/ga name.
        if token.endswith(('-dong', '-ri', '-ga')):
            return ' '.join(tokens[0:i+2])

list(map(extract_addr, (addr, addr1, addr2, addr3, addr4, addr5, addr6, addr7)))

['Namyang-ri 1470-1, Okgye-myeon, Gangneung-si, Gangwon-do',
 '529-1 Gwaebeop-dong, Sasang-gu, Busan',
 '1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do',
 '1320-41 Taein-dong, Gwangyang-si, Jeollanam-do',
 '549-14 Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do',
 'San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do',
 'San 48-1, Guam-ri, Sacheon-eup, Sacheon-si',
 '35 Jindeok-ri, Hongnong-eup, Yeonggwang-gun, Jeollanam-do']

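The token rule depends on the lot number coming right after the dong/ri/ga name, so here is a minimal runnable sketch with romanized addresses rearranged into that order. The sample strings, the `suffixes` parameter, and the handling of `2-ga` and of a separate `San` token are my assumptions for illustration, not part of the answer above.

```python
def extract_addr(addr, suffixes=('-dong', '-ri', '-ga')):
    """Return the address up to and including the lot number, or None."""
    tokens = addr.replace(',', ' ').split()
    for i, token in enumerate(tokens[:-1]):
        if not token.endswith(suffixes):
            continue
        if tokens[i + 1].endswith(suffixes):
            continue  # e.g. "Junghwasan-dong 2-ga": the real unit name is the next token
        end = i + 2
        if tokens[i + 1] == 'San' and end < len(tokens):
            end += 1  # "San 48-1" written with a space before the number
        return ' '.join(tokens[:end])
    return None  # no dong/ri/ga token followed by another token


# Hypothetical inputs: the question's rows rewritten region-first.
samples = [
    'Gangwon-do Gangneung-si Okgye-myeon Namyang-ri 1470-1, 1497',
    'Jeollanam-do Gwangyang-si Taein-dong 1320-41 Donggwang Apartment 105-dong 3rd floor No. 307',
    'Jeollabuk-do Jeonju-si Wansan-gu Junghwasan-dong 2-ga 549-14, lower floor No. 101',
    'Gyeongsangnam-do Sacheon-si Sacheon-eup Guam-ri San 48-1',
]
results = [extract_addr(s) for s in samples]
for r in results:
    print(r)
```

With addresses in any other word order (for example, number first) the rule would need to look at the token before the unit name instead.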


		
		
			

				

					
				

				
					2022-09-21 12:45
				
			
		
	


	
		
It seems difficult to handle every address with regular expressions alone.
It would be better to write and use a simple dedicated parser.
Because the administrative hierarchy differs between cities and counties, the lot number cannot be found by counting tokens, and it cannot be found by the presence or absence of digits either, because of cases like Junghwasan-dong 2-ga.
For example, "Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do" has four tokens, "Taein-dong, Gwangyang-si, Jeollanam-do" has three, and "Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do" has five.
So consider the rule "there are no digits until the lot number appears": it breaks on Junghwasan-dong 2-ga.
The lot number might not even contain a '-'.
An address like "105 Jungnim 1-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do" is also possible, and ambiguous cases like that are awkward to handle in a regular expression.
Thinking it through, the lot number is whatever follows the dong or ri, so after splitting into tokens, if a token ends in dong or ri, the next token is the lot number. Programming that directly seems simpler and more efficient.
I did write a regular expression anyway, but there will probably be addresses it fails to match once more varied formats come in.
import re

addr = 'Namyang-ri 1470-1, 1497 Okgye-myeon, Gangneung-si, Gangwon-do'
addr1 = "Renecite 3rd floor No. 3070 in 529-1 Gwaebeop-dong, Sasang-gu, Busan"
addr2 = '1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do'
addr3 = 'No. 307, 3rd floor of Donggwang Apartment No. 105-dong, Taein-dong, Gwangyang-si, Jeollanam-do'
addr4 = 'No. 101 of the lower floor of the paper, 549-14, Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do'
addr5 = "San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do, 2nd floor, Jaesang-dong, Cheonan-si, Chungcheongnam-do"
addr6 = 'San 48-1, Guam-ri, Sacheon-eup, Sacheon-si'

# "San" is the mountain-lot prefix; the sub-number after "-" is optional.
# Like the output below, this assumes the original Korean word order
# (region names first, lot number last, names written without hyphens).
m = re.compile(r'^([\w\s]+(?:\s|San)\d{1,4}(?:-\d{1,4})?),?')
list(map(m.findall, (addr, addr1, addr2, addr3, addr4, addr5, addr6)))

[['Namyang-ri 1470-1, Okgye-myeon, Gangneung-si, Gangwon-do'],
 ['529-1 Gwaebeop-dong, Sasang-gu, Busan'],
 ['1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do'],
 ['1320-41 Taein-dong, Gwangyang-si, Jeollanam-do'],
 ['549-14 Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do'],
 ['San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do'],
 ['San 48-1 Guam-ri, Sacheon-eup, Sacheon-si']]
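
If you do want a pure regular expression, the idea "the lot number is whatever follows the dong/ri/ga name" can be sketched roughly as below. This is only a sketch under assumptions: the inputs are romanized with region names first (hypothetical strings based on the question's rows), and `LOT_ADDRESS` and `lot_address` are names I made up. The lazy prefix stops at the first unit name that is followed by a number, and the lookahead rejects a number that is itself part of a name such as `2-ga`.

```python
import re

LOT_ADDRESS = re.compile(
    r'^(.*?-(?:dong|ri|ga)'    # shortest prefix ending in a unit name
    r'\s+(?:San\s?)?'          # optional mountain-lot prefix, with or without a space
    r'\d{1,4}(?:-\d{1,4})?)'   # main number and optional sub-number
    r'(?![\w-])'               # the number must end here, so "2-ga" is not a lot number
)


def lot_address(addr):
    """Return the address up to the lot number, or None if nothing matches."""
    m = LOT_ADDRESS.match(addr)
    return m.group(1) if m else None


# Hypothetical region-first inputs.
a = lot_address('Gangwon-do Gangneung-si Okgye-myeon Namyang-ri 1470-1, 1497')
b = lot_address('Jeollabuk-do Jeonju-si Wansan-gu Junghwasan-dong 2-ga 549-14, lower floor No. 101')
c = lot_address('Chungcheongnam-do Cheonan-si Dongnam-gu Mokcheon-eup Singye-ri San422')
d = lot_address('Busan Sasang-gu')
print(a, b, c, d, sep='\n')
```

As noted above, more varied address formats will probably still need extra cases, which is the argument for the token-based parser.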


		
		
			

				

					
				

				
					2022-09-21 12:45
				
			
		
	

			