I want to remove the beginning and end of a URL string using a regular expression.

Asked 2 years ago, Updated 2 years ago, 36 views

In Python 3, given a string like

https://hoge1.hoge2

how can I remove https:// and hoge2 and extract only hoge1?

python python3

2022-09-30 19:10

2 Answers

I'm interpreting the goal of the question as "retrieve the subdomain from a URL".

If you want to extract information from a URL, try urllib.parse first (I'll leave the regular-expression solution to someone else).

>>> from urllib.parse import urlparse
>>> o = urlparse('https://hoge1.hoge2')
>>> print(o.hostname.split('.')[0])
hoge1
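
If it helps, the same idea can be wrapped in a small function. This is only a sketch: the name first_label is made up for illustration, and it assumes that "the part you want" is always the first dot-separated label of the hostname.

```python
from urllib.parse import urlparse


def first_label(url):
    """Return the first dot-separated label of the URL's hostname, or None."""
    host = urlparse(url).hostname  # None when no network location is parsed
    if host is None:
        return None
    return host.split('.')[0]


print(first_label('https://hoge1.hoge2'))           # hoge1
print(first_label('https://hoge1.hoge2/path?q=1'))  # hoge1
print(first_label('hoge1.hoge2'))                   # None (no scheme, so no hostname)
```

Note that a string without a scheme is treated as a path by urlparse, which is why the last call returns None.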

By the way, if you want to get domains and subdomains more accurately, you can also use a package called tldextract.
https://pypi.python.org/pypi/tldextract


2022-09-30 19:10

If you want to use a regular expression:

(?<=https:\/\/)\w+(?=\.\w+)

This pattern matches hoge1.
(?<=...) is a lookbehind: the match succeeds only if it is immediately preceded by the pattern in the parentheses.
(?=...) is a lookahead: the match succeeds only if it is immediately followed by the pattern in the parentheses.
So https:\/\/ (with the slashes escaped) is given as the lookbehind,
\.\w+ (\w matches a letter, digit, or underscore; here it matches .hoge2) is given as the lookahead,
and the \w+ between the two is what gets extracted.
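
For reference, here is roughly how the pattern above could be applied with Python's re module. The function name extract_with_lookaround is hypothetical, and the sketch assumes the URL always starts with https://.

```python
import re

# Pattern from the answer above. Escaping "/" is harmless in Python but not required.
PATTERN = re.compile(r'(?<=https:\/\/)\w+(?=\.\w+)')


def extract_with_lookaround(url):
    """Return the text matched between 'https://' and the next '.label', or None."""
    m = PATTERN.search(url)
    return m.group(0) if m else None


print(extract_with_lookaround('https://hoge1.hoge2'))      # hoge1
print(extract_with_lookaround('https://my-host.example'))  # None (\w does not match '-')
```

As the second call shows, \w does not cover hyphens, so hostnames containing one will not match unless the character class is widened.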


2022-09-30 19:10


© 2024 OneMinuteCode. All rights reserved.