Remove commas in CSV data

A,"100",Z
B,"1,000",Z
C,"1,000,000",Z

↓

A,"100",Z
B,"1000",Z
C,"1000000",Z

I would like to convert the data as shown above. How can this be done with sed, awk, or other Linux tools?

linux sed awk

2022-09-30 21:26

5 Answers

If you can use gawk 4.0 or later, gawk's FPAT can handle this. However, it does not cope with fields that contain line breaks or embedded double quotes.

$ gawk -v OFS=, -v FPAT='([^,]*)|("[^"]+")' '{for(i=1;i<=NF;i++){if($i~/^[ \t]*"/){gsub(",","",$i)}};print}'

Following "Defining Fields by Content" in the gawk manual and metropolis' comment, I used ([^,]*) rather than ([^,]+) so that zero-length fields are supported.

If only a gawk older than 4.0 is available, I think this is easier to write in C than in awk.
The following also handles line breaks and double quotes inside fields.

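#include <stdio.h>

// Handle the rest of a quoted field; the opening quote has already been read.
// Returns 0 to continue with the next field, 1 at end of input.
static int parse_element_quoted(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        switch (c) {
        case '"':
            c = getchar();
            if (c == '"') {
                // Two consecutive quotes ("") stand for one literal quote in the original data.
                // putchar('"'); // enable this line to keep "" as "" instead of converting it to "
                putchar('"');
                break;
            }
            else {
                putchar('"'); // not needed if the quotes are to be removed (A)
                if (c == EOF)
                    return 1; // closing quote was the last character of the input
                putchar(c);
                return 0;
            }
        case '\r':
        case '\n':
        // case '\\':
        case ',':
            // ignore (remove) them
            break;
        default:
            putchar(c);
        }
    }
    return 1;
}

// Handle one field, quoted or not.
// Returns 0 to continue with the next field, 1 at end of input.
static int parse_element(void)
{
    int c = getchar();
    if (c == EOF)
        return 1;
    else if (c == '"') {
        putchar('"'); // not needed if the quotes are to be removed (B)
        return parse_element_quoted();
    }
    else {
        putchar(c);
        if (c == '\n' || c == ',')
            return 0;
        while ((c = getchar()) != EOF) {
            switch (c) {
            case ',':
            case '\n':
                putchar(c);
                return 0;
            case '\r':
                break;
            default:
                putchar(c);
            }
        }
        return 1;
    }
}

int main(void)
{
    while (parse_element() == 0)
        ;
    return 0;
}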

Personally, I think it is simpler to remove the double quotes together with the commas.
To do that, comment out lines (A) and (B) in the code above.
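
For comparison, the same character-by-character scan can be sketched in plain awk. This is only an illustration under stated assumptions: the function name strip_quoted_commas_scan is made up here, and unlike the C code it keeps "" as "" and leaves the surrounding quotes in place. Line breaks inside a quoted field are dropped, as in the C version.

```shell
#!/bin/sh
# Hypothetical helper (not from the answer above): reads CSV on stdin and
# removes commas that appear between double quotes.
strip_quoted_commas_scan() {
    awk '{
        n = length($0)
        for (i = 1; i <= n; i++) {
            ch = substr($0, i, 1)
            # An escaped quote ("") toggles the state twice, so it stays unchanged.
            if (ch == "\"") in_quotes = !in_quotes
            # Drop commas while inside a quoted field.
            if (ch == "," && in_quotes) continue
            printf "%s", ch
        }
        # Emit the line break only outside quotes, joining multi-line fields.
        if (!in_quotes) printf "\n"
    }'
}

printf '%s\n' 'B,"1,000",Z' 'D,"say ""1,000""",Z' | strip_quoted_commas_scan
```

The quote state is deliberately not reset per line, which is what lets a quoted field span several input lines.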

2022-09-30 21:26

With sed, you can do it like this. It assumes that the original data contains no tab characters.

sed -e 's/"\([^"]*\)"/\tA\1\tB/g;:loop;s/\(\tA[^\t]*\),/\1/g;tloop;s/\t./"/g'

Tab characters serve as markers: the opening and closing quotation marks are first replaced with \tA and \tB.

The loop then removes the commas that follow a \tA, one per pass. When there are no commas left to remove, \tA and \tB are turned back into the original quotation marks.
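
The \t escape and the ;-separated label are not understood by every sed, so here is a sketch of the same marker-and-loop idea written for a POSIX sed. The function name strip_quoted_commas_sed is hypothetical, and the input is still assumed to contain no tab characters and no embedded double quotes.

```shell
#!/bin/sh
# Hypothetical helper: same three steps as the one-liner above, but with a
# literal tab and one -e per command so it does not rely on sed extensions.
strip_quoted_commas_sed() {
    tab=$(printf '\t')
    # 1. mark the quotes, 2. loop removing one comma per pass, 3. restore the quotes
    sed -e "s/\"\([^\"]*\)\"/${tab}A\1${tab}B/g" \
        -e ':loop' \
        -e "s/\(${tab}A[^${tab}]*\),/\1/g" \
        -e 't loop' \
        -e "s/${tab}./\"/g"
}

printf '%s\n' 'A,"100",Z' 'B,"1,000",Z' 'C,"1,000,000",Z' | strip_quoted_commas_sed
```

Each pass removes only the last remaining comma of every marked field, which is why the t branch is needed for values such as "1,000,000".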

2022-09-30 21:26

To write it as a one-liner, replace the line breaks with semicolons.

awk 'BEGIN {FS="\""; OFS="\""}; {gsub(",", "", $2); print}' data.csv

2022-09-30 21:26

More simply, how about writing it like this?

cat data.csv | awk 'match($0, /"[^"]*"/) {tmp0=substr($0, RSTART, RLENGTH); tmp1=tmp0; tmp2=gsub(/,/, "", tmp0); sub(tmp1, tmp0, $0); print $0}'

I think this will misbehave if a line contains more than one double-quoted field. I can't think of a way to repeat the match in awk right away, so here is a PHP version instead.

<?php
$csv_data = '
    A,"100",Z
    B,"1,000","1,000",Z,"1,000"
    C,"1,000,000",Z
    ';
$array0 = array();
$end_flag = false;

do {
    $end_flag = false;
    // Split off everything before the next opening quote.
    $array = explode('"', $csv_data, 2);
    if (count($array) > 1) {
        $array0[] = $array[0] . '"';
        // Split the remainder at the closing quote.
        $array2 = explode('"', $array[1], 2);
        if (count($array2) > 1) {
            // Remove the commas inside the quoted part only.
            $array0[] = str_replace(',', '', $array2[0]) . '"';
            $csv_data = $array2[1];
            $end_flag = true;
        } else {
            $array0[] = $array2[0];
        }
    } else {
        $array0[] = $array[0];
    }
} while ($end_flag);

echo implode('', $array0);
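
For lines with several quoted fields, one possible awk approach is to split on the double quote and clean every even-numbered field. This is a sketch, not a full CSV parser: the function name strip_quoted_commas_awk is made up for illustration, and it assumes fields contain no embedded double quotes or line breaks.

```shell
#!/bin/sh
# Hypothetical helper: removes commas inside every double-quoted field on each line.
strip_quoted_commas_awk() {
    awk 'BEGIN { FS = OFS = "\"" }
    {
        # With the double quote as separator, even-numbered fields are the quoted ones.
        for (i = 2; i <= NF; i += 2)
            gsub(/,/, "", $i)
        print
    }' "$@"
}

printf '%s\n' 'B,"1,000","1,000",Z,"1,000"' | strip_quoted_commas_awk
# prints: B,"1000","1000",Z,"1000"
```

Passing a file name instead of using a pipe also works, for example strip_quoted_commas_awk data.csv.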

2022-09-30 21:26

I tried using the double quote as the field separator in awk.
Batch file jikko.bat

awk -f src.txt < data.csv

Source file src.txt

BEGIN {FS="\""; OFS="\""}
{gsub(",", "", $2); print}

Please tell me how to write this as a one-liner.

2022-09-30 21:26
