Joined: 02 Dec 2002 Posts: 629 Topics: 176 Location: Stockholm, Sweden
Posted: Tue Jul 08, 2014 2:16 am Post subject: Producing a guaranteed sample of records from a large file
We are testing various programs where the files come from production. Obviously (?), in test, we don't want to test using hundreds of thousands of records, so we create a sample extract of those records using SORT and the SAMPLE keyword. This works fine, but .....
is there a simple way of ensuring that the output file contains samples of all the "key" values found in the input file. What I mean by this is the following. Assuming your key values in the file are in position 5 and contain FRED, BERT and JOHN. Now, let's assume that records 1-1000 all contain FRED, 1001-1002 contain BERT, and the rest contain JOHN. If I run SORT using a simple sample of every 300 records, I'll get examples of FRED & JOHN, but none of BERT.
What I would like to do would be to run sort and either specify the values in my keys (FRED, BERT, JOHN here) or, even better, let SORT figure them out, and get SORT to produce an output file of max 20 records for each unique key. Is this doable (and does my append make sense) _________________ Michael
Joined: 02 Dec 2002 Posts: 629 Topics: 176 Location: Stockholm, Sweden
Posted: Tue Jul 08, 2014 3:33 am Post subject: A follow-on question to the above
It all works fine but for the small detail that the input file has a header date record. Is there any way of using the HEADER option so that the first record remains in place (as it is now, it gets sorted/selected as the last record in the output file as positions 11,5 contains a date in the form 2014-).
I have tried the following, but it didn't seem to work
Sort the records so that the date record always ends up first, then select the first 10 unique records for all positions 11-15
(Anyone wondering why I include all these questions/answers. The next person who needs an answer similar to this might hopefully find it based on my original header) _________________ Michael
If you need to SORT and preserve positions of headers/trailers, ICETOOL has DATASORT.
If you don't need to SORT, did you try turning it off for the SELECT? You could relocate your ON field to an extension of the record in INREC, clobber it to some low value for the header, and remove it later in OUTFIL.
A sequence number from WHEN=GROUP might also do what you want.
If you SORT Descending to keep the header at the front, haven't you ruined your sampling, by no longer representing your file (data not in the same order)?
Joined: 26 Nov 2002 Posts: 12378 Topics: 75 Location: San Jose
Posted: Tue Jul 08, 2014 10:25 am Post subject:
misi01,
If your input is already SORTED on the Key (11,5,CH) a simple COPY operation with WHEN=GROUP will give you the desired results. _________________ Kolusu
www.linkedin.com/in/kolusu
Joined: 02 Dec 2002 Posts: 629 Topics: 176 Location: Stockholm, Sweden
Posted: Wed Jul 09, 2014 1:30 am Post subject: In answer to your suggestions
William. I tried/looked at the DATASORT and have used it elsewhere. Trouble was, I couldn't figure out how to "extract" groups of similar records (as ON(11,5,CH) FIRST(10) does with select) using DATASORT.
Kolusu. The data isn't sorted, that's why the header record ends up as the last record using the SELECT statement.
Both of you. WHEN=GROUP. I've googled it but didn't really understand how it would apply. I also tried variations on the following
Forgetting any header and trailer records, is the data a;ready in the sequence that you want before you extract it? Are data for the same "key" contiguous? Forget any header/trailer records for the moment. Is the data in the sequence you want already?
If you get error messages, you look them up.
If you want to look at something new, you read the manual and search here for some working examples.
WHEN=GROUP is told when a record is the start of a "group" (using BEGIN or KEYBEGIN) and can optionally be told ho to identify the end of a group (END or RECORDS). RECORDS is almost certainly not doing what you think.
With WHEN=GROUP, to achieve anything, you need to use PUSH. This puts data onto your records, the same data for each record in the group. The source of the data must be the first record of the group, or the special data SEQ= or ID=. SEQ= gets you a sequence number within the group. ID= gets you a sequence number of the groups.
You want to have 10 records, you PUSH a SEQ and on OUTFIL use INCLUDE=/OMIT= for less-than-or-equal-to-10 (or however you want to phrase it).
Joined: 02 Dec 2002 Posts: 629 Topics: 176 Location: Stockholm, Sweden
Posted: Wed Jul 09, 2014 2:59 am Post subject: Thanks William
Forgetting any header and trailer records, is the data a;ready in the sequence that you want before you extract it? Are data for the same "key" contiguous? Forget any header/trailer records for the moment. Is the data in the sequence you want already? Yes
If you get error messages, you look them up. Hmmm, not quite THAT easy. Here is the error message from one job.
Quote:
ICE107A 6 DUPLICATE, CONFLICTING, OR MISSING INREC OR OUTREC STATEMENT OPERANDS
Now since I HAD defined an INREC statement (though it might have been incorrect), the error message above gives THREE possible error possibilities. If you're not an expert, that's a rather nebulous error message.
If you want to look at something new, you read the manual and search here for some working examples. Trust me, I do that (or try to). In the case of WHEN=GROUP I kept on finding, basically, the SAME examples everywhere.
WHEN=GROUP is told when a record is the start of a "group" (using BEGIN or KEYBEGIN) and can optionally be told ho to identify the end of a group (END or RECORDS). RECORDS is almost certainly not doing what you think I think this is covered in my answer above.
With WHEN=GROUP, to achieve anything, you need to use PUSH. This puts data onto your records, the same data for each record in the group. The source of the data must be the first record of the group, or the special data SEQ= or ID=. SEQ= gets you a sequence number within the group. ID= gets you a sequence number of the groups. Maybe WHEN=GROUP isn't what I want then. The select statement does (basically) what I want. Look for each group of unique keys in pos 11-15 and extract a max of 10 records. I'm not interested in changing the input records in any way.
You want to have 10 records, you PUSH a SEQ and on OUTFIL use INCLUDE=/OMIT= for less-than-or-equal-to-10 (or however you want to phrase it) See comment just above.
I do appreciate both you and Kolusu answering my queries, but I'm beginning to wonder if the WHEN=GROUP keyword is what I'm really after ??? _________________ Michael
Did you look up the message? There's a manual, there's IBM LookAt. You may find that knowing what the 6 means is useful. The message is located after INREC or OUTREC, so that bit is not so hard. DUPLICATE you can eyeball. Just leaves CONFLICTING and MISSING. MISSING is easier, so take the element you have coded, and look in the manual for the syntax and you should find the MISSING bit.
If the data is in order, GROUP is going to allow you to say (on OUTFIL) "give me the first 10 records of this key and not the other 9381, without having to sort the data. I thought that is what you were after.
Joined: 02 Dec 2002 Posts: 629 Topics: 176 Location: Stockholm, Sweden
Posted: Wed Jul 09, 2014 6:16 am Post subject:
Sorry William. What I am after is reducing the file so that each UNIQUE identifier (in pos 11-15) is written to the output file a max of 10 times. Ie, I have 100 KEY1, 150 KEY2 and 3 KEY3. The resulting output file would have 10xKEY1, 10xKEY2 and 3xKEY3 _________________ Michael
And that is what you'd get if you PUSH a SEQ and use INCLUDE= on OUTFIL for the SEQ-field being less-than-or-equal-to-10. You just need to drop the SEQ-field afterwards with BUILD on OUTFIL.
Joined: 26 Nov 2002 Posts: 12378 Topics: 75 Location: San Jose
Posted: Wed Jul 09, 2014 10:56 am Post subject:
Misi,
As I mentioned earlier it is quite easy to pick the first "n" records using WHEN=GROUP. If your input is already sorted and you want to pick first "5" records, then you simply need to use SEQ on WHEN=GROUP . I assumed your Input has RECFM=FB and LRECL=30. If you want to understand how the SEQ and ID of WHEN=GROUP work, run the jobs removing the OUTFIL statements and see the contents at position 31 and 40.
ex:
Code:
//STEP0100 EXEC PGM=SORT
//SYSOUT DD SYSOUT=*
//SORTIN DD *
HEADER - 2014-07-09
D BERT
D BERT
D BERT
D BERT
D BERT
D BERT
D BERT
D BERT
D BERT
D FRED
D FRED
D FRED
D FRED
D FRED
D FRED
D FRED
D JOHN
D JOHN
D JOHN
D JOHN
D JOHN
D JOHN
//SORTOUT DD SYSOUT=*
//SYSIN DD *
OPTION COPY
INREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:SEQ=8))
OUTFIL INCLUDE=(31,8,ZD,LE,5),BUILD=(1,30)
//*
If you want to split each key into a different output file and also retain the header in each file, then it is also quite easy. All you need is ID instead of SEQ on WHEN=GROUP
Further more if you want to split each key into a different file AND also limit the number of keys written to each file(say 5), then you can use BOTH ID and SEQ on WHEN=GROUP and use the following control cards
If the Data is NOT sorted you can move the WHEN=GROUP to OUTREC after the SORT is completed. I used INREC to tag the header record so that it will stay in the same place. You can use the following set of control cards
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum