Ranjish (Beginner)
Joined: 22 Dec 2002  Posts: 64  Topics: 28  Location: Chennai
Posted: Wed May 07, 2003 11:47 am    Post subject: Removing the duplicates
Hi,
I know that we can remove the duplicates by using
SUM FIELDS=NONE
But can we remove the duplicates without doing a sort?
When I tried
SORT FIELDS=COPY
SUM FIELDS=NONE
it did not work. Is there any way of doing this?
cheers
Ranjish
coolman (Intermediate)
Joined: 03 Jan 2003  Posts: 283  Topics: 27  Location: US
Posted: Wed May 07, 2003 12:36 pm
Ranjish,
SORT FIELDS=COPY means you are just copying the input to the output; you can't combine it with SUM FIELDS=NONE. The explanation is given below:
Duplicate removal:
1 -> How do you identify whether a record is a duplicate or not?
2 -> You need a key field to determine that. Only when two records in the file have the same key would you say they are duplicate records.
3 -> So, essentially, what you need to do is change your sort card like this:
Code:
  SORT FIELDS=(1,10,CH,A)    * assuming the first 10 bytes are the key
  SUM FIELDS=NONE
and run the job.
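For context, a minimal job step wrapping those two control statements might look like this (dataset names, space, and disposition are placeholders; adjust them to your shop's conventions):

```jcl
//DEDUP   EXEC PGM=SORT
//SYSOUT  DD SYSOUT=*
//SORTIN  DD DSN=MY.INPUT.FILE,DISP=SHR
//SORTOUT DD DSN=MY.OUTPUT.FILE,DISP=(NEW,CATLG,DELETE),
//           UNIT=SYSDA,SPACE=(CYL,(5,5)),DCB=*.SORTIN
//SYSIN   DD *
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE
/*
```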
Hope this helps...
Cheers,
Coolman.
P.S.: Why would you want to remove the duplicates other than by using SORT? BTW, there are lots of ways of doing it, but SORT is the most efficient and neatest way of doing it.
Frank Yaeger (Sort Forum Moderator)
Joined: 02 Dec 2002  Posts: 1618  Topics: 31  Location: San Jose
Posted: Wed May 07, 2003 5:16 pm
Ranjish,
Theoretically, you could simulate SUM FIELDS=NONE while copying by storing all of the records in storage as you read them, determining which ones are the duplicates (that is, which ones have the same key) after all of the records are read, and then only writing out the non-duplicates and the first record of each group of duplicates from storage. You could probably even do this with DFSORT using an E15 or E35 exit.
But unless you only have a few records to deal with, this isn't practical. It makes more sense to SORT the records so the duplicates are in order. That way, you can remove the unwanted duplicates as they're read without having to store them all. That's what SORT with SUM FIELDS=NONE does.
_________________
Frank Yaeger - DFSORT Development Team (IBM)
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
DFSORT is on the Web at: www.ibm.com/storage/dfsort
Ranjish (Beginner)
Posted: Thu May 08, 2003 12:55 am
Frank/Coolman,
Thanks a lot for the replies.
So does this mean that we don't have any option to preserve the original sequence and still remove the duplicates?
cheers
Ranjish
CaptBill (Beginner)
Joined: 02 Dec 2002  Posts: 100  Topics: 2  Location: Pasadena, California, USA
Posted: Thu May 08, 2003 4:19 pm
Sure you can preserve the original sequence and still get rid of duplicates. How did you get the file into that sequence to begin with? Figure that out, then just redo that sort or other process AFTER you remove the duplicates.
I suppose what you are saying is that you have a file in sequence by FIELD-1, and you want to remove the duplicates based upon FIELD-2. So sort it by FIELD-2 with SUM FIELDS=NONE, then take the output of that and re-sort it by FIELD-1.
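As a sketch, those two passes could look like this, assuming (hypothetically) FIELD-2 is a 10-byte key in columns 1-10 and FIELD-1 an 8-byte key in columns 11-18; adjust the positions to your record layout:

```jcl
* Step 1: sort on FIELD-2 and drop the duplicates
  SORT FIELDS=(1,10,CH,A)
  SUM FIELDS=NONE
* Step 2 (a separate job step reading step 1's output):
* re-sort on FIELD-1 to get back the original sequence
  SORT FIELDS=(11,8,CH,A)
```

Note this only restores the original order if that order really was produced by a sort on FIELD-1, which is why you first need to know how the file got its sequence.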
Premkumar (Moderator)
Joined: 28 Nov 2002  Posts: 77  Topics: 7  Location: Chennai, India
Posted: Thu May 08, 2003 11:03 pm
You can preserve the original sequence by following these steps:
- Add a sequence number to each record before removing duplicates,
- Remove the duplicates by sorting on the key with SUM FIELDS=NONE,
- Sort by the sequence number to restore the original sequence, and
- Remove the sequence number.
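A sketch of those steps as two DFSORT passes, assuming 80-byte fixed-length records with the key in columns 1-10 (SEQNUM in INREC is a later DFSORT feature; on older levels the sequence number would have to be added some other way, e.g. in an exit or a prior step):

```jcl
* Pass 1: append a sequence number, sort on the key, drop duplicates
  INREC FIELDS=(1,80,SEQNUM,8,ZD)    * record is now 88 bytes
  SORT FIELDS=(1,10,CH,A),EQUALS     * EQUALS keeps the FIRST of each key
  SUM FIELDS=NONE
* Pass 2 (a separate step, reading pass 1's output):
* sort back into original order, then strip the sequence number
  SORT FIELDS=(81,8,ZD,A)
  OUTREC FIELDS=(1,80)
```

EQUALS matters here: it guarantees the surviving record of each key is the one with the lowest sequence number, i.e. the first occurrence in the original file.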