In C or C++, I want to create a program that reads a txt or csv file line by line and removes duplicate lines.

I thought this would be simple to program, but I can't get it to work, so I'd like to ask for help.
In C or C++, I want to write a program that reads a txt or csv file line by line and, when the same line appears more than once, deletes the extra copies so that only one remains.
Specifically, the txt and csv files look like this:

About In About
Ada NP Ada
Additional JJ additional
Adventures NNS adventure
Adventures NNS adventure
Adventures NP<unknown>

In the example above, I want to delete one of the two "Adventures NNS adventure" lines.
The lines are already sorted alphabetically, so duplicate lines are always next to each other.
There are nearly 30,000 lines.

while(fgets(buf,20,fp)!=NULL){
    strcpy(word[i], buf);
    printf("%s\n", word[i]);
    i++;
}

I tried several ways of reading the file like this, but none of them worked: a single line was read in two pieces, and line breaks appeared in odd places.
Once the lines are read correctly, I think I can do it by printing a line only when it differs from the next one.
Thank you in advance for your advice.

c++ c

2022-09-29 21:53

2 Answers

Since you asked for advice, I will only point out the problems, but something like the following should work.

fgets() reads at most one character less than the size you pass, so if the string it returns does not end with a newline, it holds less than one full line. In that case, call fgets() again and append the result to the end of the buffer (word[i] in the question's code).
Before appending, check whether word[i] has enough space.
If it does not, use realloc to enlarge it.
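
The reading step above could be sketched roughly as follows. This is only an illustration under my own assumptions: the name read_full_line, the starting size of 32, and the doubling strategy are not from the question, and the buffer is assumed to come from malloc (realloc cannot be used on a plain array).

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads one whole line of any length from fp.
 * Returns a malloc'd string without the trailing newline, or NULL at
 * end of file or on allocation failure. The caller must free() it. */
char *read_full_line(FILE *fp)
{
    size_t cap = 32, len = 0;
    char *line = malloc(cap);
    if (line == NULL) return NULL;

    /* Each fgets() call appends at most cap - len - 1 characters. */
    while (fgets(line + len, (int)(cap - len), fp) != NULL) {
        len += strlen(line + len);
        if (len > 0 && line[len - 1] == '\n') {   /* complete line */
            line[len - 1] = '\0';
            return line;
        }
        /* No newline yet: the buffer was too small (or this is the last
         * line of a file with no final newline), so grow and read on. */
        char *bigger = realloc(line, cap * 2);
        if (bigger == NULL) { free(line); return NULL; }
        line = bigger;
        cap *= 2;
    }
    if (len > 0) return line;   /* last line without a trailing newline */
    free(line);
    return NULL;
}
```

Calling it in a loop such as `while ((s = read_full_line(fp)) != NULL) { ...; free(s); }` should give one complete line per iteration, however long the line is.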

Then compare the line with the previous one. If they match, do nothing and move on to the next iteration of the loop.
If they do not match, print the line just read and copy it into the variable that holds the previous line.
Here too, check whether the buffer for the previous line has enough space.
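
As a rough sketch of the comparison step, something like the following might do. The function name print_if_changed and its return codes are made up for this example, and it assumes the line passed in has already had its newline removed.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: writes `line` to `out` only if it differs from
 * *prev, then remembers it in *prev (grown with realloc when needed).
 * Start with *prev == NULL and *prev_cap == 0.
 * Returns 1 if printed, 0 if it was a duplicate, -1 if out of memory. */
int print_if_changed(const char *line, char **prev, size_t *prev_cap, FILE *out)
{
    if (*prev != NULL && strcmp(line, *prev) == 0)
        return 0;                       /* same as the previous line: skip */

    size_t need = strlen(line) + 1;     /* +1 for the terminating '\0' */
    if (need > *prev_cap) {             /* previous-line buffer too small */
        char *bigger = realloc(*prev, need);
        if (bigger == NULL) return -1;
        *prev = bigger;
        *prev_cap = need;
    }
    memcpy(*prev, line, need);          /* remember it for the next call */
    fprintf(out, "%s\n", line);
    return 1;
}
```

Because the input is sorted, comparing against only the previous line is enough; no table of all lines seen so far is needed.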

Instead of checking each time whether the memory is sufficient, I think you could also use a fixed-length buffer sized for the longest line you expect. Be careful, though, whether that suits your purpose, because it becomes a limitation of the program.

If you are worried that copying the string into the previous-line variable is slow, you can prepare two buffers and swap the pointers that point to them.
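
A minimal sketch of the two-buffer idea, using fixed-length buffers, might look like this. MAX_LINE and dedupe_adjacent are names I chose for illustration, and the code assumes every line (including its newline) fits in MAX_LINE - 1 characters; longer lines would be split by fgets and could be handled incorrectly.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE 1024   /* assumed upper bound on line length, incl. '\n' */

/* Hypothetical helper: copies `in` to `out`, dropping any line equal to
 * the line just before it. Returns the number of lines written. */
long dedupe_adjacent(FILE *in, FILE *out)
{
    char buf_a[MAX_LINE], buf_b[MAX_LINE];
    char *cur = buf_a, *prev = buf_b;
    long written = 0;

    prev[0] = '\0';   /* never equals a real line: fgets keeps the '\n' */
    while (fgets(cur, MAX_LINE, in) != NULL) {
        if (strcmp(cur, prev) != 0) {
            fputs(cur, out);
            written++;
            /* Swap roles instead of calling strcpy(prev, cur). */
            char *tmp = prev;
            prev = cur;
            cur = tmp;
        }
    }
    return written;
}
```

The swap only happens when a new line is printed, so prev always points at the last line that was written out.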

Reading the source of the UNIX uniq command is also very instructive.
GNU coreutils

If you just need something practical, you can use the uniq command, or PowerShell's Get-Unique on Windows.

2022-09-29 21:53

To begin with, fgets(), which reads up to a fixed number of characters, is not the right tool here.
There are several functions that read a file line by line, so use one of those.
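
The answer does not name a specific function, so as one possible example only: on systems that provide POSIX, getline() reads a whole line and grows the buffer by itself. The sketch below assumes a POSIX environment (it is not part of standard C), and uniq_with_getline is a made-up name.

```c
#define _POSIX_C_SOURCE 200809L   /* asks the C library to declare getline() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>   /* ssize_t */

/* Hypothetical helper: copies `in` to `out`, skipping lines identical to
 * the previous one. getline() allocates and grows the buffers itself. */
long uniq_with_getline(FILE *in, FILE *out)
{
    char *cur = NULL, *prev = NULL;
    size_t cur_cap = 0, prev_cap = 0;
    long written = 0;

    while (getline(&cur, &cur_cap, in) != -1) {
        if (prev == NULL || strcmp(cur, prev) != 0) {
            fputs(cur, out);            /* the line still ends with '\n' */
            written++;
            /* Swap the buffers together with their capacities. */
            char *tp = prev; prev = cur; cur = tp;
            size_t tc = prev_cap; prev_cap = cur_cap; cur_cap = tc;
        }
    }
    free(cur);
    free(prev);
    return written;
}
```

In C++, std::getline with std::string plays a similar role and avoids the manual memory management.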

2022-09-29 21:53
