How to split a string on spaces while keeping double-quoted parts together

I am looking for a way to split the input into individual keywords when one or more keywords are entered for a full-text search, as in Google Search.

I'd like to split strings like the following on half-width and full-width spaces.
I don't know how to write the regular expression.

Split conditions

Example Original String

Kita A+!^"*-="P"Rain""Snow" only　　Don't break "abc Time 123 - "Gate Limit"

As a Java string literal:

String moto = "North A+!^\"*-=\"P\"Rain\"\"\"Snow\"\"Only　　US>\"abc time 123-\"gate limit\"\"\"Don't break";

Each split string

North
A+!^"*-="
P "rain", "snow"
"only　　US>"
abc
Time 123 - "Exit"
"Don't break ← the closing double quotation mark is intentionally omitted

Can the regular expression below be used? It might work, but when I write this code in Java,
Eclipse marks the string literal in red as an error before the code can even run.
(I probably need to add \ to escape it, but no matter how many \ I add before / or ", the string stays red.)

Pattern p=Pattern.compile("\s+(?"|[^"])*"?|[^"]\S*"));
String[] result = p.split(moto);
for (int i = 0; i < result.length; i++) {
    System.out.println("["+result[i]+"]");
}

Question 1

Is the regular expression correct?

Question 2

When writing this in Java, how should \ and so on be added so that Eclipse stops reporting the string as invalid?

Here is one of the several patterns I tried.

Pattern p=Pattern.compile("\\s+(\"(?\"|[^\"])*\"?|[^\"]\S*");

When I add \ as below, Eclipse no longer shows the string in red.

String[] result = test.split("\\s+(\"(?:\"|[^\"])*\"?|[^\"]\\S*)");
for (int i = 0; i < result.length; i++) {
    out.print("["+result[i]+"]");
}

However, the regular expression itself seems to be wrong, because the string is not split as intended, as shown below.
test="rain" only";
　↓
after splitting Rain "Mizu"

java regular-expression

2022-09-29 22:35

3 Answers

A solution is already nearly in sight, so this is just another one, but if the specification is as I understand it (that is, the correct token is [A+!^"*-=P"]), it can be done with a regular expression.

String moto = "Northern A+!^\"*-=P\"Rain\"\"\"Snow\"\"Misutani\"abc Time 123-\"Gate`Limit\"\"\"\"Don't Break";
// A token is one or more lumps: a plain character, a backslash escape, or a quoted run
Pattern p = Pattern.compile("((?:[^\\s\"\\\\]|\\\\.|\"(?:\\\\.|[^\\\\\"])*(?:\"|$))+)");
Matcher m = p.matcher(moto);
while (m.find()) {
    System.out.println("[" + m.group(1) + "]");
}

Output:

 [Here we go]
[A+!^"*-=P"]
[Rain]
[""]
[Snow]
["Only"]
[Still]
[Still]
["Abc Time 123 - "Gate Limit"]
[Don't break it]

The result looks quite different from the expected result in the question, but given that the first token is A+!^"*-=P" rather than A+!^"*-= and that quoted and unquoted ranges alternate at each ", this should be the result.

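My idea is to treat each of the following as one lump:

a character other than whitespace, " and \
\ followed by any one character
a run that starts with " and continues up to the closing " (or the end of the string)

and then extract each longest run of consecutive lumps (longest match is the default behavior of regular expressions).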
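The lump idea above can be sketched as a small runnable program. This is only an illustration under assumptions: the class name ChunkSplitDemo, the helper tokenize and the sample input are made up for the example. U+3000 (the full-width space) is listed explicitly in the character class because Java's \s does not match it unless Pattern.UNICODE_CHARACTER_CLASS is set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkSplitDemo {
    // One token is one or more lumps, where a lump is any of:
    //   - a character that is not whitespace, U+3000, a double quote or a backslash
    //   - a backslash followed by any one character
    //   - a quoted run that ends at the closing quote or at the end of the input
    private static final Pattern TOKEN = Pattern.compile(
            "(?:[^\\s\\u3000\"\\\\]|\\\\.|\"(?:\\\\.|[^\\\\\"])*(?:\"|$))+");

    // Hypothetical helper: collects every match of TOKEN, left to right.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up input: contains a full-width space (U+3000), a backslash escape,
        // and a final quote that is intentionally left open.
        String input = "abc \"hello world\" x\"y z\"w\u3000k\\ m \"open end";
        for (String t : tokenize(input)) {
            System.out.println("[" + t + "]");
        }
    }
}
```

For this made-up input the program prints [abc], ["hello world"], [x"y z"w], [k\ m] and ["open end], one per line. Using find() to collect tokens, rather than split() to describe the gaps between them, keeps the pattern focused on what a token looks like.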

If you have time, please try it.

(As for the code I wrote a while ago: if I were told its behavior was wrong, I would struggle to work out where to fix it. In the end, it may be better to prioritize readability.)
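On Question 2, the red marking comes from Java's string-literal rules rather than from the regular expression itself. Here is a minimal sketch (the class name EscapeDemo and the sample pattern are made up for illustration): every \ in the regex is written as \\, every " as \", and / needs no escape at all, since \/ is not a valid escape sequence in a Java string literal.

```java
import java.util.regex.Pattern;

public class EscapeDemo {
    // Regex as written on paper:          \s+"[^"]*"/
    // The same regex as a Java literal:   double each \, put \ before each ".
    static final String REGEX = "\\s+\"[^\"]*\"/";

    public static void main(String[] args) {
        // Printing the literal shows the regex the engine actually receives.
        System.out.println(REGEX);
        // The pattern compiles and matches a sample string.
        System.out.println(Pattern.compile(REGEX).matcher("  \"rain\"/").matches());
    }
}
```

Printing the string is a quick way to check that the literal holds the regex you meant to write; here it prints \s+"[^"]*"/ followed by true.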

2022-09-29 22:35

This seems difficult to do with regular expressions.
I'm not sure it matches what you intend, but how about something like this?

import java.util.Iterator;

public class Tokenizer implements Iterator<String> {

    private static final char DELIMITER_SPACE = ' ';
    private static final char DELIMITER_SPACE_JP = '\u3000'; // full-width space
    private static final char DELIMITER_DOUBLE_QUOTE = '"';

    private String nextToken;
    private String target;
    private Character delimiter = null;

    private int pos = 0;
    private int start = 0;

    public Tokenizer (String target) {
        this.target=target;
        this.nextToken = getNextToken();
    }

    @Override
    public boolean hasNext(){
        return nextToken!=null;
    }

    @Override
    public String next() {
        String next = this.nextToken;
        this.nextToken = getNextToken();
        return next;
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }

    private String getNextToken() {
        int size = target.length();
        while(pos<size){
            char c = target.charAt(pos);
            pos++;
            int length = pos - start;
            if (isDelimiter(c, delimiter)) {
                if(length>1){
                    boolean isDoubleQuote=isDoubleQuote(delimiter);
                    String token=getToken(isDoubleQuote, start, length, false);
                    if(isDoubleQuote){
                        delimiter = null;
                    }
                    start = pos;
                    return token;

                } else{
                    delimiter=c;
                    start = pos;
                }

            } else if(c==DELIMITER_DOUBLE_QUOTE&&length<=1){
                delimiter=c;
                start = pos;

            } else if ((c==DELIMITER_SPACE||c==DELIMITER_SPACE_JP)&&length<=1){
                if(delimiter==null||delimiter!=DELIMITER_DOUBLE_QUOTE){
                    delimiter=c;
                    start = pos;
                }
            }
        }

        int length = pos-start;
        if(length>0){
            String token=getToken(isDoubleQuote(delimiter), start, length, true);
            start = pos;
            return token;
        }
        return null;
    }

    private boolean isDoubleQuote(Character delimiter) {
        // && short-circuits, so a null delimiter is never unboxed
        return delimiter != null && delimiter == DELIMITER_DOUBLE_QUOTE;
    }

    private boolean isDelimiter(char c, Character delimiter) {
        if(delimiter==null){
            return c==DELIMITER_SPACE||c==DELIMITER_SPACE_JP||c==DELIMITER_DOUBLE_QUOTE;
        } else{
            return c==delimiter;
        }
    }

    private String getToken(boolean isDoubleQuote, int start, int length, boolean isLast) {
        if(isDoubleQuote){
            return target.substring (start-1, start+length);
        } else{
            return target.substring(start, start+length-(isLast?0:1));
        }
    }
}

Please use it like this.

String moto = "North A+!^\"*-=P\"Rain\"\"\"Snow\"\"Only　　US>\"abc time 123-\"gate limit\"\"\"Don't break";
Tokenizer tokenizer = new Tokenizer(moto);
while(tokenizer.hasNext()){
    System.out.println("["+tokenizer.next()+"]");
}

Results

[Here we go]
[A+!^"*-=]
[P "Rain" and "Snow"]
["Only"]
[abc]
[Time 123 - "Limited Gate"]
["Don't break it]

UPDATE1

The intent of the program is as follows:

Split on half-width spaces, full-width spaces, and double quotation marks.
Scan the string from left to right, splitting as it goes.
Once a token has been opened by one delimiter character, do not split again until that same character appears.
Example) After splitting at a half-width space, everything up to the next half-width space becomes a single string.
However, if the same delimiter never appears again, scan to the end of the string and treat the rest as a single string.
Double quotation marks are kept in the resulting string.

UPDATE2

String moto = "Northern A+!^\"*-=P\"Rain\"\"\"Snow\"\"Misutani\"abc Time 123-\"Gate`Limit\"\"\"\"Don't Break";

Running the program on this string produces the following results:

[Here we go]
[A+!^"*-=]
[P"]
[Rain]
["Snow"]
["Only"]
[abc]
[Time 123 - "Limited Gate"]
["Don't break it]

2022-09-29 22:35

Regular Expressions

This is Le Pered'OO's answer with the handling of \ removed.

import java.util.regex.*;

public class Main {
    public static void main(String[]args) {
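        String moto = "Northern A+!^\"*-=\"P\"Rain\"\"\"Snow\"\"Only　　US>\"abc time 123-\"gate limit\"\"\"Don't break";

        // U+3000 (full-width space) is listed explicitly; \s alone does not match it by default
        Pattern p = Pattern.compile("((?:[^\\s\u3000\"]|\"[^\"]*(?:\"|$))+)");
        Matcher m = p.matcher(moto);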

        while(m.find()){
            System.out.println("["+m.group(1)+"]");
        }
    }
}

The pattern string may be a little hard to read, but you can either lay it out with indentation

 Pattern p = Pattern.compile(
            "(" +
                "(?:" +
                    "[^\\s\u3000\"]|" +
                    "\"[^\"]*(?:\"|$)" +
                ")+" +
            ")"
            );

or build it up part by part with named pieces:

    // Ordinary character: anything except whitespace, a full-width space, or a double quote
    String normal_char = "[^\\s\u3000\"]";

    // Quoted string: a double quote, zero or more non-quote characters,
    // then either a closing double quote or the end of the input
    String quoted_str = "\"[^\"]*(?:\"|$)";

    // Token: one or more ordinary characters or quoted strings
    String token = "(?:" + normal_char + "|" + quoted_str + ")+";

    // Capture the part that matches the token as a group
    Pattern p = Pattern.compile("(" + token + ")");

Either of these makes the pattern easier to follow.

Similar cases

Splitting a string while taking quotes into account is a common need, so a search will turn up many examples.

https://stackoverflow.com/a/7804472/4368502
https://stackoverflow.com/a/3366634/4368502

They differ from this question in that they also split at the quote positions, but they are worth a look.
That behavior seems closer to how Google search works.
The specification in this question is closer to quoting in a shell (sh, bash, etc.) than to the Google search box.
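For contrast, the search-box style described above, where a quoted phrase becomes its own keyword and the quotes are dropped, might look like the following sketch. The class name PhraseSplitDemo, the method keywords and the sample query are assumptions for illustration. The code is not taken from the linked answers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhraseSplitDemo {
    // Either a quoted phrase (group 1, quotes not captured)
    // or a run of characters that are not whitespace, U+3000 or a double quote (group 2).
    private static final Pattern KEYWORD =
            Pattern.compile("\"([^\"]*)\"|([^\\s\\u3000\"]+)");

    // Hypothetical helper: returns the keywords with the surrounding quotes dropped.
    static List<String> keywords(String query) {
        List<String> result = new ArrayList<>();
        Matcher m = KEYWORD.matcher(query);
        while (m.find()) {
            String word = (m.group(1) != null) ? m.group(1) : m.group(2);
            if (!word.isEmpty()) {      // skip empty phrases such as ""
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up query; a full-width space (U+3000) separates the last two words.
        System.out.println(keywords("rain \"heavy snow\" P\"wet ice\"\u3000wind"));
    }
}
```

This prints [rain, heavy snow, P, wet ice, wind]. Note that P"wet ice" is split at the quote into two keywords, which is exactly the difference from the shell-like behavior asked for in the question.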

No Regular Expression

I thought this might be simpler than regular expressions, so I wrote it out, but it turned out not to be, and it ended up longer.
Written in a more readable and extensible way, I think it would get longer still.
Long code is not bad in itself, but at this level of complexity I think regular expressions are fine.

public class Main {
    public static void main(String[]args) {
        String moto = "Northern A+!^\"*-=\"P\"Rain\"\"\"Snow\"\"Only　　US>\"abc time 123-\"gate limit\"\"\"Don't break";

        State state = State.DELIMITER;
        for (int i = 0, start = 0, length = moto.length(); i < length; i++) {
            // Delimiter characters:
            //  space, tab, full-width space, line feed, form feed,
            //  carriage return, vertical tab
            final String delimiters = " \t\u3000\n\f\r\u000b";
            final String c = moto.substring(i, i + 1);

            if(delimiters.contains(c)){
                if(state==State.UNQUOTED) {
                    out(moto.substring(start,i));
                    state = State.DELIMITER;
                }
            }
            else{
                if(state==State.DELIMITER) {
                    start = i;
                    state = State.UNQUOTED;
                }

                if (c.equals("\"")) {
                    state=(state==State.QUOTED)?
                        State.UNQUOTED: State.QUOTED;
                }
            }

            if (i == length - 1 && state != State.DELIMITER) {
                out(moto.substring(start, i+1));
            }
        }
    }


    private enum State {DELIMITER, QUOTED, UNQUOTED}

    private static void out(String s) {
        if(s.length()==0){return;}
        System.out.println("["+s+"]");
    }
}

2022-09-29 22:35
