Is there a Bash function that allows me to separate/delete/isolate lines from a file when they have the same first word?

by odran   Last Updated July 12, 2019 08:26 AM

I have a text file like this:

id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg
id ; lorem ipsum  fgdg df gdg

If two ids are identical, I want to separate the lines whose id appears more than once from the lines whose id is unique.

uniquefile should contain the lines with a unique id; notuniquefile should contain the lines whose id is not unique.

I already found a way to almost do it, but only with the first word: it just isolates the id and deletes the rest of the line.

Command 1: isolates the unique ids (but loses the rest of the line):

awk -F ";" '{!seen[$1]++};END{for(i in seen) if(seen[i]==1)print i }' originfile >> uniquefile

Command 2: isolates the non-unique ids (but again loses the rest of the line, including the "lorem ipsum" content that can differ from one line to another):

awk -F ":" '{!seen[$1]++;!ligne$0};END{for(i in seen) if(seen[i]>1)print i  }' originfile >> notuniquefile

So in a perfect world I would like you to help me obtain this type of result:

originfile:

1 ; toto
2 ; toto
3 ; toto
3 ; titi
4 ; titi

uniquefile:

1 ; toto
2 ; toto
4 ; titi

notuniquefile:

3 ; toto
3 ; titi

Have a good day.



4 Answers


Here is a small Python script which does this:

#!/usr/bin/env python3

import sys

unique_markers = []
unique_lines = []
nonunique_markers = set()
for line in sys.stdin:
  marker = line.split(';')[0]
  if marker in nonunique_markers:
    # found a line which is not unique
    print(line, end='', file=sys.stderr)
  elif marker in unique_markers:
    # found a double
    index = unique_markers.index(marker)
    print(unique_lines[index], end='', file=sys.stderr)
    print(line, end='', file=sys.stderr)
    del unique_markers[index]
    del unique_lines[index]
    nonunique_markers.add(marker)
  else:
    # marker not known yet
    unique_markers.append(marker)
    unique_lines.append(line)
for line in unique_lines:
  print(line, end='', file=sys.stdout)

It is not a pure shell solution (which would be cumbersome and hard to maintain IMHO), but maybe it helps you.

Call it like this:

separate_uniq.py < original.txt > uniq.txt 2> nonuniq.txt
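
With the sample data from the question saved as original.txt (an assumption for illustration), a run would be expected to split the lines like this (a sketch of the expected result, not an actual transcript):

$ separate_uniq.py < original.txt > uniq.txt 2> nonuniq.txt
$ cat uniq.txt
1 ; toto
2 ; toto
4 ; titi
$ cat nonuniq.txt
3 ; toto
3 ; titi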

Alfe
July 11, 2019 15:30 PM

Untested: process the file twice, first to count the ids, then to decide where each record should be printed:

awk -F';' '
    NR == FNR      {count[$1]++; next}
    count[$1] == 1 {print > "uniquefile"}
    count[$1]  > 1 {print > "nonuniquefile"}
' file file
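
The same file name is passed twice on purpose: during the first pass NR == FNR is true, so awk only fills count[$1] and next skips the other rules; on the second pass each whole record is routed according to its count. A quick way to sanity-check the split afterwards (hypothetical output for the sample data, assuming it is saved as file):

$ wc -l uniquefile nonuniquefile
3 uniquefile
2 nonuniquefile
5 total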
glenn jackman
July 11, 2019 15:39 PM

Not sure if it's the most elegant way to achieve this (it reads originfile several times, so it depends on your file size), but this produced the desired results locally with the test data you provided:

# ids that occur more than once, each turned into a single anchored grep pattern;
# the trailing space keeps "^3" from also matching an id like "30"
awk '++x[$1] > 1 { print "^" $1 " " }' originfile | sort -u > dupe_patterns

grep -v -f dupe_patterns originfile > uniquefile      # lines whose id is unique
grep    -f dupe_patterns originfile > notuniquefile   # lines whose id repeats
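
With the sample originfile, dupe_patterns (the intermediate file created above) would contain a single line, "^3 " with a trailing space, so both grep passes only match one anchored pattern. A hypothetical check with GNU cat -A, which makes the trailing space visible by marking line ends with $:

$ cat -A dupe_patterns
^3 $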
thmsdnnr
July 11, 2019 16:04 PM

This buffers every line, then at the end writes the lines whose id is unique to UNIQUE.txt and the rest to NONUNIQUE.txt:

awk -F ';' '
    # remember every record, its id and how often that id occurs
    { line[NR] = $0; id[NR] = $1; count[$1]++ }
    END {
        # replay the records in their original order and route each one
        for (i = 1; i <= NR; i++) {
            if (count[id[i]] == 1) print line[i] >> "UNIQUE.txt"
            else                   print line[i] >> "NONUNIQUE.txt"
        }
    }' originfile
j23
July 11, 2019 16:07 PM
