I want to understand a regular expression that checks HTML start and end tags.

I'd like to validate the start and end tags of HTML elements in Java, and I'd like to understand the regular expression below.

(<(\\s*(?:/\\s*)?[tag name](?:\\s*|\\s+[^>]+))(?:>|(?=<)|$(?!\\n)))

I found this in some reference code that checks tags, but what does the regular expression mean?
I have tried to analyze it myself, but it is hard to understand. If you have any advice, please let me know.

Additional information

I have tried to analyze it in my own way. I would appreciate your advice.
①
(<(\s*(?:/\s*)?[tag name](?:\s*|\s+[^>]+))
(?:>|(?=<)|$(?!\n)))
Are the bold parentheses "grouping"? Does that mean this part is given priority? Does it separate what comes before and after the tag name (for example, a) of the tag?

②
Is \\s* "zero or more spaces"?

③
(?:/\\s*)
I don't really understand this part.
Does it mean the tag contains one or more ":"? I wonder whether there are any cases where a tag contains ":"...
And is the / required? Also, is the rest "zero or more blanks"?

④
What is the ? after the parentheses? I mean the ? just before [tag name].

⑤
(?:\s*|\s+[^>]+)
Are the bold parentheses grouping?

⑥
?:
Is this a colon? I don't really understand the colon.

⑦
Is \s* zero or more blank characters?

⑧
Does | mean either what comes before it or what comes after it, that is, \\s* or \\s+[^>]+?

⑨
Is \s+ at least one blank?

⑩
Is [^>]+ at least one character other than >?

⑪
(?:>|(?=<)|$(?!\n))
Are the bold parentheses grouping?

⑫
?: Is this a colon as well?

⑬
Does > mean that a > is required?

⑭
?:>|(?=<)|$(?!\n)
Does this pipe mean one of "?:>", "(?=<)" or "$(?!\n)"?

I would appreciate your guidance on the rest as well.

java regular-expression

2022-09-30 21:50

2 Answers

Hello. I suggest pasting the regular expression into this site and reading the analysis it produces.

https://regexr.com/

If you can read English, the explanation there should be understandable as is.

After matching <, the first open parenthesis allows whitespace. The second open parenthesis matches an optional / followed by whitespace, so that both opening and closing tags are recognized. Then comes the tag name, and the third open parenthesis allows either whitespace alone or whitespace followed by characters other than >. The fourth open parenthesis then closes the tag with >, or checks that a < follows, or checks for the end of the input with no newline after it.

I think that is roughly how it works.
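
To try the walk-through above on real input, here is a minimal Java sketch. It assumes that [tag name] is a placeholder to be replaced by a concrete tag name. The class TagPatternDemo, its helper methods, and the sample tag div are hypothetical names chosen only for illustration; they do not come from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagPatternDemo {
    // Builds the question's pattern with a concrete tag name in place of [tag name].
    // Pattern.quote makes the tag name match literally (an assumption for this sketch).
    static Pattern forTag(String tagName) {
        return Pattern.compile(
            "(<(\\s*(?:/\\s*)?" + Pattern.quote(tagName)
            + "(?:\\s*|\\s+[^>]+))(?:>|(?=<)|$(?!\\n)))");
    }

    // Returns what capturing group 2 holds (the text after <), or null if nothing matches.
    static String inner(String tagName, String input) {
        Matcher m = forTag(tagName).matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(inner("div", "<div>"));             // prints div
        System.out.println(inner("div", "</div>"));            // prints /div
        System.out.println(inner("div", "<div class=\"x\">")); // prints div class="x"
        System.out.println(inner("div", "<divider>"));         // prints null
    }
}
```

The last call shows that the pattern does not accept a longer tag name that merely begins with div, because the tag name must be followed by whitespace, by >, by <, or by the end of the input.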

2022-09-30 21:50

First of all, the regular expression you wrote appears to be taken from inside a Java string literal. You probably already know that "\\" in a Java string literal represents the single character \, but \ also has a meaning in the Markdown notation used here on Stack Overflow, so in your question it shows up sometimes as \\s and sometimes as \s.
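
The point about string literals can be checked directly. The following is a small sketch; the class name EscapeDemo and its methods are made up for this illustration.

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // In Java source, "\\s*" is the three-character string \s* that the regex engine sees.
    static final String WHITESPACE_RUN = "\\s*";

    static int regexLength() {
        return WHITESPACE_RUN.length();
    }

    // True when the whole input consists of zero or more whitespace characters.
    static boolean matchesBlanks(String input) {
        return Pattern.matches(WHITESPACE_RUN, input);
    }

    public static void main(String[] args) {
        System.out.println(WHITESPACE_RUN);         // prints \s*
        System.out.println(regexLength());          // prints 3
        System.out.println(matchesBlanks(" \t "));  // prints true
        System.out.println(matchesBlanks(" x "));   // prints false
    }
}
```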

In the following description,

Notation inside a string literal is always enclosed in " on both sides to mark it as such
(Fortunately, " does not appear in this regular expression, so this should not cause confusion.)
Characters that stand for themselves are in code style and not bold
Metacharacters (symbols that represent something other than the character itself) are in code style and bold

Rewriting the regular expression in your question according to these rules gives the following.

"(<(\\s*(?:/\\s*)?[tag name](?:\\s*|\\s+[^>]+))(?:>|(?=<)|$(?!\\n)))"

(Gaps may appear in odd places, but no spaces are intended there.)

I hope your browser displays this legibly. Only three kinds of characters here stand for themselves: <, /, and >.

All the other characters are metacharacters that carry some meaning. Every metacharacter that can be used in Java regular expressions is listed in the official Java documentation. (It is not your fault if you think "I don't understand this!", because that document can hardly be called easy to read. Please read it alongside other explanatory articles. Also, some details may differ depending on the Java version.)

Pattern (Java platform SE8)

From that document, here are the constructs used in the regular expression in your question, formatted according to the rules above.

"(X)" X, as a regular expression group with forward reference

"\\s" A whitespace character: [ \t\n\x0B\f\r]

"X*" X, zero or more times

"(?:X)" X, as a regular expression group without forward reference

"X?" X, once or not at all

"X|Y" Either X or Y

"X+" X, one or more times

"[^abc]" Any character except a, b, or c (negation)

"(?=X)" X, via zero-width positive lookahead

"$" The end of a line

"(?!X)" X, via zero-width negative lookahead

"\\n" The newline character ('\u000A')

Many explanatory articles use the term "capture" where the official document says "forward reference."
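
The constructs listed above, including the difference between capturing and non-capturing groups, can be tried one at a time. This is only a sketch: the class ConstructDemo and its two helpers are hypothetical, and the sample inputs are invented for illustration.

```java
import java.util.regex.Pattern;

final class ConstructDemo {
    // True when the regex matches the entire input.
    static boolean full(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // Number of capturing groups; (?:X), (?=X) and (?!X) are not counted.
    static int captureCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(full("\\s*", ""));                     // true: zero or more
        System.out.println(full("\\s+", ""));                     // false: at least one
        System.out.println(full("/?", ""));                       // true: / is optional
        System.out.println(full("/?", "//"));                     // false: at most one /
        System.out.println(full("[^>]+", "class=x"));             // true: no > inside
        System.out.println(full("[^>]+", "a>b"));                 // false: contains >
        System.out.println(full("\\s*|\\s+[^>]+", " id=1"));      // true: second alternative
        System.out.println(captureCount("(/\\s*)?p"));            // 1: (X) captures
        System.out.println(captureCount("(?:/\\s*)?p"));          // 0: (?:X) does not
    }
}
```

Applied to the whole expression in the question, this way of counting gives two capturing groups: the outermost parentheses and the ones that open right after <.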

As for "zero width" and "lookahead": the engine checks whether the pattern matches at that point, but it does not advance the match position. If you search for something like "regex positive lookahead", you will find explanatory articles that are easy to read.
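
A short sketch may make "zero width" concrete by comparing where a match ends with and without a lookahead. The class LookaheadDemo, the helper matchEnd, and the sample strings are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaheadDemo {
    // Index just past the first match, or -1 when the regex is not found in the input.
    static int matchEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Without lookahead, the second < is consumed: the match is "<p<".
        System.out.println(matchEnd("<p<", "<p<b>"));      // prints 3
        // With (?=<), the < is only checked: the match is "<p".
        System.out.println(matchEnd("<p(?=<)", "<p<b>"));  // prints 2
        // (?!<) succeeds only when the next character is not <.
        System.out.println(matchEnd("<p(?!<)", "<p<b>"));  // prints -1
        System.out.println(matchEnd("<p(?!<)", "<p>"));    // prints 2
    }
}
```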

As for the numbered items in your postscript: a construct such as (?:X) is a single unit, so if you split it into the parentheses and ?:, the pieces cannot be addressed individually. Please read this answer (and the linked document) and let me know if you still have questions.

2022-09-30 21:50
