I want to keep only the leading part of each address (up to the lot number) and delete the rest. I have tried Python regular expressions, but they don't work well.
I want the regular expression to skip things like "3rd floor", "No. 3070", "105-dong" (a building number), and the "2-ga" in Junghwasan-dong 2-ga. Please tell me how.
Regular expression used: (\d{1,4}\-\d{1,4})|(\d{1,4})|(San\d{1,4}\-\d{1,4})|(San\d{1,4})
Target Data:
1470-1, 1497, Namyang-ri, Okgye-myeon, Gangneung-si, Gangwon-do
No. 3070 on the 3rd floor of Renecite 529-1 Gwaebeop-dong, Sasang-gu, Busan
1405, Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do
No. 307, 3rd floor, 105-dong, Donggwang Apartment, 1320-41, Taein-dong, Gwangyang-si, Jeollanam-do
No. 101 of the lower floor of the paper, 549-14, Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do
No. 203, 2nd Floor, Jaesang-dong, 1st Apartment, Cheonan Mokcheon-dong, 422 Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do
It is quicker and simpler to handle it as shown below.
addr = '1470-1, 1497 Namyang-ri, Okgye-myeon, Gangneung-si, Gangwon-do'
addr1 = "Renecite 3rd floor No. 3070 in 529-1 Gwaebeop-dong, Sasang-gu, Busan"
addr2 = '1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do'
addr3 = 'No. 307, 3rd floor of Donggwang Apartment No. 105-dong, Taein-dong, Gwangyang-si, Jeollanam-do'
addr4 = 'No. 101 of the lower floor of the paper, 549-14, Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do'
addr5 = "San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do, 2nd floor, Jaesang-dong, Cheonan-si, Chungcheongnam-do"
addr6 = 'San 48-1, Guam-ri, Sacheon-eup, Sacheon-si'
addr7 = "Jindeok-ri 35, Hongnong-eup, Yeonggwang-gun, Jeollanam-do, 1st floor 27"
def extract_addr(addr):
    # Remove the commas, then split the address into tokens.
    tokens = addr.translate({ord(','): None}).split()
    for i, token in enumerate(tokens):
        # If a token ends in dong, ri, or ga, the next token is the lot number.
        # (The romanized names end in -dong, -ri, -ga, so the check is a suffix match.)
        if token.endswith(('-dong', '-ri', '-ga')):
            return ' '.join(tokens[0:i+2])

list(map(extract_addr, (addr, addr1, addr2, addr3, addr4, addr5, addr6, addr7)))
['Namyang-ri 1470-1, Okgye-myeon, Gangneung-si, Gangwon-do',
 '529-1 Gwaebeop-dong, Sasang-gu, Busan',
 '1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do',
 '1320-41 Taein-dong, Gwangyang-si, Jeollanam-do',
 '549-14 Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do',
 'San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do',
 'San 48-1, Guam-ri, Sacheon-eup, Sacheon-si',
 '35 Jindeok-ri, Hongnong-eup, Yeonggwang-gun, Jeollanam-do']
Handling every address with regular expressions alone looks difficult.
It seems better to write a simple dedicated parser and use that.
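To make the difficulty concrete, here is a small sketch that runs the pattern from the question against one of the sample addresses. It assumes the prefix the pattern looks for is written San, as in the sample data; the variable names are only illustrative.

```python
import re

# Pattern from the question; writing the prefix as "San" is an assumption
# based on the sample data (San422, San 48-1).
pattern = re.compile(r'(\d{1,4}\-\d{1,4})|(\d{1,4})|(San\d{1,4}\-\d{1,4})|(San\d{1,4})')

sample = ('No. 307, 3rd floor, 105-dong, Donggwang Apartment, '
          '1320-41, Taein-dong, Gwangyang-si, Jeollanam-do')

# Collect every match in order of appearance.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # ['307', '3', '105', '1320-41']
```

The unit number, the floor, and the building number match just as readily as the lot number 1320-41, so the pattern by itself cannot tell which number to keep.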
Because the administrative divisions take different forms (cities, counties, and so on), the address cannot be extracted by counting tokens, and it cannot be decided by the presence or absence of digits either, because of cases like Junghwasan-dong 2-ga.
For example, "Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do" has four tokens, "Taein-dong, Gwangyang-si, Jeollanam-do" has three, and "Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do" has five.
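The token counts can be checked with a few lines; this is only a quick illustration, and the name admin_parts is made up for it.

```python
admin_parts = [
    'Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do',
    'Taein-dong, Gwangyang-si, Jeollanam-do',
    'Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do',
]

# Drop the commas, split on whitespace, and count the tokens in each string.
token_counts = [len(part.replace(',', ' ').split()) for part in admin_parts]
print(token_counts)  # [4, 3, 5]
```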
So consider the rule "everything before the first number contains no digits, so cut at the first number": it breaks on cases like Junghwasan-dong 2-ga, where a digit appears before the lot number.
The lot number might not even contain a '-'.
An address such as "105 Jungnim 1-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do" is also possible, and ambiguous cases like this are awkward to handle with a regular expression.
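A short sketch shows how a "first number" rule picks the wrong token. The two input strings are hypothetical: they are the examples from the text rearranged so that the largest unit comes first and the lot number follows the dong/ri name, and the function name is only illustrative.

```python
import re

def first_token_with_digit(addr):
    """Return the first whitespace-separated token that contains a digit."""
    for token in addr.replace(',', ' ').split():
        if re.search(r'\d', token):
            return token
    return None

# Hypothetical inputs, rearranged largest-unit-first for illustration.
a = 'Jeollabuk-do Jeonju-si Wansan-gu Junghwasan-dong 2-ga 549-14'
b = 'Gyeongsangnam-do Tongyeong-si Gwangdo-myeon Jungnim 1-ri 105'

print(first_token_with_digit(a))  # 2-ga, not the lot number 549-14
print(first_token_with_digit(b))  # 1-ri, not the lot number 105
```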
Thinking it through, the number that follows dong or ri is the lot number. So after splitting the address into tokens, if a token ends in dong or ri, the next token is the lot number, and programming that directly seems more efficient than a regular expression.
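A slightly stricter variant of that token rule might look like the sketch below. It is not the code from this thread: it assumes romanized input ordered from the largest unit down to the lot number, the names UNIT_SUFFIXES, LOT_NUMBER, and extract_lot_address are hypothetical, and the sample strings are rearranged versions of the addresses above. Besides the suffix check, it requires the next token to look like a lot number, so that 2-ga after Junghwasan-dong is skipped rather than returned.

```python
import re

# Romanized stand-ins for the unit suffixes (an assumption of this sketch).
UNIT_SUFFIXES = ('-dong', '-ri', '-ga')
# A lot number: optional "San" prefix, 1-4 digits, optional "-" plus 1-4 digits.
LOT_NUMBER = re.compile(r'^(San)?\d{1,4}(-\d{1,4})?$')

def extract_lot_address(addr):
    """Return the address up to and including the lot number, or None."""
    tokens = addr.replace(',', ' ').split()
    for i, token in enumerate(tokens[:-1]):
        if not token.endswith(UNIT_SUFFIXES):
            continue
        nxt = tokens[i + 1]
        # "San 48-1" written with a space: keep both tokens.
        if nxt == 'San' and i + 2 < len(tokens) and LOT_NUMBER.match(tokens[i + 2]):
            return ' '.join(tokens[:i + 3])
        if LOT_NUMBER.match(nxt):
            return ' '.join(tokens[:i + 2])
    return None

# Hypothetical, reordered inputs.
samples = [
    'Gangwon-do Gangneung-si Okgye-myeon Namyang-ri 1470-1, 1497',
    'Jeollabuk-do Jeonju-si Wansan-gu Junghwasan-dong 2-ga 549-14 No. 101',
    'Jeollanam-do Gwangyang-si Taein-dong 1320-41 Donggwang Apartment 105-dong No. 307',
    'Sacheon-si Sacheon-eup Guam-ri San 48-1',
]
for s in samples:
    print(extract_lot_address(s))
```

Real data will contain forms this does not cover, so the suffix list and the lot-number pattern would need to be adjusted against actual addresses.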
I also wrote a version that uses a regular expression, though it will probably fail to match some cases once more varied kinds of addresses come in.
import re
addr = 'Namyang-ri 1470-1, 1497 Okgye-myeon, Gangneung-si, Gangwon-do'
addr1 = "Renecite 3rd floor No. 3070 in 529-1 Gwaebeop-dong, Sasang-gu, Busan"
addr2 = '1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do'
addr3 = 'No. 307, 3rd floor of Donggwang Apartment No. 105-dong, Taein-dong, Gwangyang-si, Jeollanam-do'
addr4 = 'No. 101 of the lower floor of the paper, 549-14, Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do'
addr5 = "San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do, 2nd floor, Jaesang-dong, Cheonan-si, Chungcheongnam-do"
addr6 = 'San 48-1, Guam-ri, Sacheon-eup, Sacheon-si'
m = re.compile(r'^([\w|\s]+(?:\s|San)\d{1,4}[-]?\d{1,4}?)[,]?')
list(map(m.findall, (addr, addr1, addr2, addr3, addr4, addr5, addr6)))
[['Namyang-ri 1470-1, Okgye-myeon, Gangneung-si, Gangwon-do'],
 ['529-1 Gwaebeop-dong, Sasang-gu, Busan'],
 ['1405 Jukrim-ri, Gwangdo-myeon, Tongyeong-si, Gyeongsangnam-do'],
 ['1320-4 Taein-dong, Gwangyang-si, Jeollanam-do'],
 ['549-1 Junghwasan-dong 2-ga, Wansan-gu, Jeonju-si, Jeollabuk-do'],
 ['San422, Singye-ri, Mokcheon-eup, Dongnam-gu, Cheonan-si, Chungcheongnam-do'],
 ['San 48-1 Guam-ri, Sacheon-eup, Sacheon-si']]
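One detail worth checking in the pattern above is the trailing \d{1,4}?: the ? makes that quantifier lazy, so it stops after a single digit. The sketch below compares the number part as written with an alternative that makes the hyphen and the second number optional together; the names lazy_tail and optional_tail are only illustrative.

```python
import re

# Number part as written in the pattern: the last quantifier is lazy.
lazy_tail = re.compile(r'\d{1,4}[-]?\d{1,4}?')
# Alternative: the hyphen and the second number are optional as one group.
optional_tail = re.compile(r'\d{1,4}(?:-\d{1,4})?')

for lot in ('1320-41', '549-14', '1405'):
    print(lot, lazy_tail.match(lot).group(0), optional_tail.match(lot).group(0))
# 1320-41 1320-4 1320-41
# 549-14 549-1 549-14
# 1405 1405 1405
```

This would be consistent with results such as 1320-4 and 549-1 in the output above, where the last digit of the lot number is missing.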