MVSFORUMS.com Forum Index MVSFORUMS.com
A Community of and for MVS Professionals
 
 FAQFAQ   SearchSearch   Quick Manuals   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Producing a guaranteed sample of records from a large file
Goto page 1, 2  Next
 
Post new topic   Reply to topic   printer-friendly view    MVSFORUMS.com Forum Index -> Utilities
View previous topic :: View next topic  
Author Message
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Tue Jul 08, 2014 2:16 am    Post subject: Producing a guaranteed sample of records from a large file Reply with quote

We are testing various programs where the files come from production. Obviously (?), in test, we don't want to test using hundreds of thousands of records, so we create a sample extract of those records using SORT and the SAMPLE keyword. This works fine, but .....

is there a simple way of ensuring that the output file contains samples of all the "key" values found in the input file. What I mean by this is the following. Assuming your key values in the file are in position 5 and contain FRED, BERT and JOHN. Now, let's assume that records 1-1000 all contain FRED, 1001-1002 contain BERT, and the rest contain JOHN. If I run SORT using a simple sample of every 300 records, I'll get examples of FRED & JOHN, but none of BERT.

What I would like to do would be to run sort and either specify the values in my keys (FRED, BERT, JOHN here) or, even better, let SORT figure them out, and get SORT to produce an output file of max 20 records for each unique key. Is this doable (and does my append make sense)
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Tue Jul 08, 2014 2:24 am    Post subject: Reply with quote

Googling at the same time. I should mention that if the answer is to USE ICETOOL, that's not a problem
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Tue Jul 08, 2014 2:57 am    Post subject: Found the answer to that as well Reply with quote

Wow !!!! ICETOOL is slick. I found the original answer.
but I'll include my solution as well
Code:

//S1       EXEC PGM=ICETOOL                                   
//TOOLMSG  DD SYSOUT=*                                         
//DFSMSG   DD SYSOUT=*                                         
//*                                                           
//SORTIN   DD DSN=xxxx.xxxx.xxxxxxx(0),DISP=SHR               
//*                                                           
//SORTOUT  DD DSN=xxxxxx.xxxx.xxxxxxx.SAMPLE,DISP=(,CATLG),   
//         RECFM=FB,LRECL=270,DATACLAS=DCLARGE                 
//*                                                           
//*                                                           
//TOOLIN   DD   *                                             
 SELECT FROM(SORTIN) TO(SORTOUT) ON(11,5,CH) FIRST(10)         
/*                                                             

This extract 10 records for each unique identifier in position 11 with a length of 5 characters.
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Tue Jul 08, 2014 3:33 am    Post subject: A follow-on question to the above Reply with quote

It all works fine but for the small detail that the input file has a header date record. Is there any way of using the HEADER option so that the first record remains in place (as it is now, it gets sorted/selected as the last record in the output file as positions 11,5 contains a date in the form 2014-).

I have tried the following, but it didn't seem to work
Code:

//TOOLIN DD *                                                     
  SELECT FROM(SORTIN) TO(SORTOUT) ON(11,5,CH) FIRST(10) USING(CTL1)
//CTL1CNTL DD *                                                   
  SORT FIELDS(11,5,CH,A) HEADER(1)                                 
/*                                                                 

It isn't a big deal to re-sort the file afterwards, I was just wondering if it could be done in one foul sweep.
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Tue Jul 08, 2014 4:06 am    Post subject: Got it Reply with quote

The following code will do this
Code:

//TOOLIN   DD   *                                                   
 SELECT FROM(SORTIN) TO(SORTOUT) ON(11,5,CH) FIRST(10) USING(CTL1) 
/*                                                                 
//CTL1CNTL DD *                                                     
 SORT FIELDS=(11,5,CH,D)                                           

Sort the records so that the date record always ends up first, then select the first 10 unique records for all positions 11-15

(Anyone wondering why I include all these questions/answers. The next person who needs an answer similar to this might hopefully find it based on my original header)
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
William Collins
Supermod


Joined: 03 Jun 2012
Posts: 437
Topics: 0

PostPosted: Tue Jul 08, 2014 9:03 am    Post subject: Reply with quote

If you need to SORT and preserve positions of headers/trailers, ICETOOL has DATASORT.

If you don't need to SORT, did you try turning it off for the SELECT? You could relocate your ON field to an extension of the record in INREC, clobber it to some low value for the header, and remove it later in OUTFIL.

A sequence number from WHEN=GROUP might also do what you want.

If you SORT Descending to keep the header at the front, haven't you ruined your sampling, by no longer representing your file (data not in the same order)?
Back to top
View user's profile Send private message
kolusu
Site Admin
Site Admin


Joined: 26 Nov 2002
Posts: 12378
Topics: 75
Location: San Jose

PostPosted: Tue Jul 08, 2014 10:25 am    Post subject: Reply with quote

misi01,

If your input is already SORTED on the Key (11,5,CH) a simple COPY operation with WHEN=GROUP will give you the desired results.
_________________
Kolusu
www.linkedin.com/in/kolusu
Back to top
View user's profile Send private message Send e-mail Visit poster's website
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Wed Jul 09, 2014 1:30 am    Post subject: In answer to your suggestions Reply with quote

William. I tried/looked at the DATASORT and have used it elsewhere. Trouble was, I couldn't figure out how to "extract" groups of similar records (as ON(11,5,CH) FIRST(10) does with select) using DATASORT.

Kolusu. The data isn't sorted, that's why the header record ends up as the last record using the SELECT statement.

Both of you. WHEN=GROUP. I've googled it but didn't really understand how it would apply. I also tried variations on the following
Code:

SORT FIELDS=COPY                                   
INREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),RECORDS=10)

(remember, I'm definitely no expert on SORT), but all the variations I tried only gave me S000 U0107 abends
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
William Collins
Supermod


Joined: 03 Jun 2012
Posts: 437
Topics: 0

PostPosted: Wed Jul 09, 2014 2:17 am    Post subject: Reply with quote

Forgetting any header and trailer records, is the data a;ready in the sequence that you want before you extract it? Are data for the same "key" contiguous? Forget any header/trailer records for the moment. Is the data in the sequence you want already?

If you get error messages, you look them up.

If you want to look at something new, you read the manual and search here for some working examples.

WHEN=GROUP is told when a record is the start of a "group" (using BEGIN or KEYBEGIN) and can optionally be told ho to identify the end of a group (END or RECORDS). RECORDS is almost certainly not doing what you think.

With WHEN=GROUP, to achieve anything, you need to use PUSH. This puts data onto your records, the same data for each record in the group. The source of the data must be the first record of the group, or the special data SEQ= or ID=. SEQ= gets you a sequence number within the group. ID= gets you a sequence number of the groups.

You want to have 10 records, you PUSH a SEQ and on OUTFIL use INCLUDE=/OMIT= for less-than-or-equal-to-10 (or however you want to phrase it).
Back to top
View user's profile Send private message
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Wed Jul 09, 2014 2:59 am    Post subject: Thanks William Reply with quote

Forgetting any header and trailer records, is the data a;ready in the sequence that you want before you extract it? Are data for the same "key" contiguous? Forget any header/trailer records for the moment. Is the data in the sequence you want already? Yes

If you get error messages, you look them up. Hmmm, not quite THAT easy. Here is the error message from one job.
Quote:
ICE107A 6 DUPLICATE, CONFLICTING, OR MISSING INREC OR OUTREC STATEMENT OPERANDS
Now since I HAD defined an INREC statement (though it might have been incorrect), the error message above gives THREE possible error possibilities. If you're not an expert, that's a rather nebulous error message.

If you want to look at something new, you read the manual and search here for some working examples. Trust me, I do that (or try to). In the case of WHEN=GROUP I kept on finding, basically, the SAME examples everywhere.

WHEN=GROUP is told when a record is the start of a "group" (using BEGIN or KEYBEGIN) and can optionally be told ho to identify the end of a group (END or RECORDS). RECORDS is almost certainly not doing what you think I think this is covered in my answer above.

With WHEN=GROUP, to achieve anything, you need to use PUSH. This puts data onto your records, the same data for each record in the group. The source of the data must be the first record of the group, or the special data SEQ= or ID=. SEQ= gets you a sequence number within the group. ID= gets you a sequence number of the groups. Maybe WHEN=GROUP isn't what I want then. The select statement does (basically) what I want. Look for each group of unique keys in pos 11-15 and extract a max of 10 records. I'm not interested in changing the input records in any way.

You want to have 10 records, you PUSH a SEQ and on OUTFIL use INCLUDE=/OMIT= for less-than-or-equal-to-10 (or however you want to phrase it) See comment just above.

I do appreciate both you and Kolusu answering my queries, but I'm beginning to wonder if the WHEN=GROUP keyword is what I'm really after ???
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
William Collins
Supermod


Joined: 03 Jun 2012
Posts: 437
Topics: 0

PostPosted: Wed Jul 09, 2014 4:36 am    Post subject: Reply with quote

Did you look up the message? There's a manual, there's IBM LookAt. You may find that knowing what the 6 means is useful. The message is located after INREC or OUTREC, so that bit is not so hard. DUPLICATE you can eyeball. Just leaves CONFLICTING and MISSING. MISSING is easier, so take the element you have coded, and look in the manual for the syntax and you should find the MISSING bit.

If the data is in order, GROUP is going to allow you to say (on OUTFIL) "give me the first 10 records of this key and not the other 9381, without having to sort the data. I thought that is what you were after.
Back to top
View user's profile Send private message
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Wed Jul 09, 2014 6:16 am    Post subject: Reply with quote

Sorry William. What I am after is reducing the file so that each UNIQUE identifier (in pos 11-15) is written to the output file a max of 10 times. Ie, I have 100 KEY1, 150 KEY2 and 3 KEY3. The resulting output file would have 10xKEY1, 10xKEY2 and 3xKEY3
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
William Collins
Supermod


Joined: 03 Jun 2012
Posts: 437
Topics: 0

PostPosted: Wed Jul 09, 2014 9:15 am    Post subject: Reply with quote

And that is what you'd get if you PUSH a SEQ and use INCLUDE= on OUTFIL for the SEQ-field being less-than-or-equal-to-10. You just need to drop the SEQ-field afterwards with BUILD on OUTFIL.

Here's a an example you could apply, at a very quick search: http://www.mvsforums.com/helpboards/viewtopic.php?p=57821#57821 (I used GROUP SEQ INCLUDE and selected to search for all words, and user kolusu).
Back to top
View user's profile Send private message
kolusu
Site Admin
Site Admin


Joined: 26 Nov 2002
Posts: 12378
Topics: 75
Location: San Jose

PostPosted: Wed Jul 09, 2014 10:56 am    Post subject: Reply with quote

Misi,

As I mentioned earlier it is quite easy to pick the first "n" records using WHEN=GROUP. If your input is already sorted and you want to pick first "5" records, then you simply need to use SEQ on WHEN=GROUP . I assumed your Input has RECFM=FB and LRECL=30. If you want to understand how the SEQ and ID of WHEN=GROUP work, run the jobs removing the OUTFIL statements and see the contents at position 31 and 40.

ex:

Code:

//STEP0100 EXEC PGM=SORT                                       
//SYSOUT   DD SYSOUT=*                                         
//SORTIN   DD *                                               
HEADER -  2014-07-09                                           
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         BERT                                                 
D         FRED                                                 
D         FRED                                                 
D         FRED                                                 
D         FRED                                                 
D         FRED                                                 
D         FRED                                                 
D         FRED                                                 
D         JOHN                                                 
D         JOHN                                                 
D         JOHN                                                 
D         JOHN                                                 
D         JOHN                                                 
D         JOHN                                                 
//SORTOUT  DD SYSOUT=*                                         
//SYSIN    DD *                                               
  OPTION COPY                                                 
  INREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:SEQ=8))
   
  OUTFIL INCLUDE=(31,8,ZD,LE,5),BUILD=(1,30)                   
//*


If you want to split each key into a different output file and also retain the header in each file, then it is also quite easy. All you need is ID instead of SEQ on WHEN=GROUP

Code:
                                       
//OUT01    DD SYSOUT=*                                         
//OUT02    DD SYSOUT=*                                         
//OUT03    DD SYSOUT=*                                         
//SYSIN    DD *                                               
  OPTION COPY                                                 
  INREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:ID=8))     

  OUTFIL FNAMES=OUT01,BUILD=(1,30),                           
  INCLUDE=(31,8,ZD,EQ,1,OR,31,8,ZD,EQ,2)                       

  OUTFIL FNAMES=OUT02,BUILD=(1,30),                           
  INCLUDE=(31,8,ZD,EQ,1,OR,31,8,ZD,EQ,3)                       

  OUTFIL FNAMES=OUT03,BUILD=(1,30),                           
  INCLUDE=(31,8,ZD,EQ,1,OR,31,8,ZD,EQ,4)                       
//*



Further more if you want to split each key into a different file AND also limit the number of keys written to each file(say 5), then you can use BOTH ID and SEQ on WHEN=GROUP and use the following control cards

Code:

//SYSIN    DD *                                                   
  OPTION COPY                                                     
  INREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:ID=8,40:SEQ=8))

  OUTFIL FNAMES=OUT01,BUILD=(1,30),                               
  INCLUDE=((31,8,ZD,EQ,1,OR,31,8,ZD,EQ,2),AND,40,8,ZD,LE,5)       

  OUTFIL FNAMES=OUT02,BUILD=(1,30),                               
  INCLUDE=((31,8,ZD,EQ,1,OR,31,8,ZD,EQ,3),AND,40,8,ZD,LE,5)       

  OUTFIL FNAMES=OUT03,BUILD=(1,30),                               
  INCLUDE=((31,8,ZD,EQ,1,OR,31,8,ZD,EQ,4),AND,40,8,ZD,LE,5)       
//*   


If the Data is NOT sorted you can move the WHEN=GROUP to OUTREC after the SORT is completed. I used INREC to tag the header record so that it will stay in the same place. You can use the following set of control cards

Code:

//SYSIN    DD *                                               
  INREC IFOUTLEN=31,                                         
  IFTHEN=(WHEN=INIT,OVERLAY=(31:C'A',SEQNUM,8,ZD)),           
  IFTHEN=(WHEN=(32,8,ZD,EQ,1),OVERLAY=(31:X))                 
                                                             
  SORT FIELDS=(31,1,CH,A,                                     
               11,5,CH,A),EQUALS                             
                                                             
  OUTREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:SEQ=8)) 

  OUTFIL INCLUDE=(31,8,ZD,LE,5),BUILD=(1,30)                 
//*



Code:

//OUT01    DD SYSOUT=*                                     
//OUT02    DD SYSOUT=*                                     
//OUT03    DD SYSOUT=*                                     
//SYSIN    DD *                                           
  INREC IFOUTLEN=31,                                       
  IFTHEN=(WHEN=INIT,OVERLAY=(31:C'A',SEQNUM,8,ZD)),       
  IFTHEN=(WHEN=(32,8,ZD,EQ,1),OVERLAY=(31:X))             
                                                           
  SORT FIELDS=(31,1,CH,A,                                 
               11,5,CH,A),EQUALS                           
                                                           
  OUTREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:ID=8))
                                                           
  OUTFIL FNAMES=OUT01,BUILD=(1,30),                       
  INCLUDE=(31,8,ZD,EQ,1,OR,31,8,ZD,EQ,2)                   
                                                           
  OUTFIL FNAMES=OUT02,BUILD=(1,30),                       
  INCLUDE=(31,8,ZD,EQ,1,OR,31,8,ZD,EQ,3)                   
                                                           
  OUTFIL FNAMES=OUT03,BUILD=(1,30),                       
  INCLUDE=(31,8,ZD,EQ,1,OR,31,8,ZD,EQ,4)                   
//*



Code:

//SYSIN    DD *                                                     
  INREC IFOUTLEN=31,                                                 
  IFTHEN=(WHEN=INIT,OVERLAY=(31:C'A',SEQNUM,8,ZD)),                 
  IFTHEN=(WHEN=(32,8,ZD,EQ,1),OVERLAY=(31:X))                       
                                                                     
  SORT FIELDS=(31,1,CH,A,                                           
               11,5,CH,A),EQUALS                                     
                                                                     
  OUTREC IFTHEN=(WHEN=GROUP,KEYBEGIN=(11,5),PUSH=(31:ID=8,40:SEQ=8))
                                                                     
  OUTFIL FNAMES=OUT01,BUILD=(1,30),                                 
  INCLUDE=((31,8,ZD,EQ,1,OR,31,8,ZD,EQ,2),AND,40,8,ZD,LE,5)         
                                                                     
  OUTFIL FNAMES=OUT02,BUILD=(1,30),                                 
  INCLUDE=((31,8,ZD,EQ,1,OR,31,8,ZD,EQ,3),AND,40,8,ZD,LE,5)         
                                                                     
  OUTFIL FNAMES=OUT03,BUILD=(1,30),                                 
  INCLUDE=((31,8,ZD,EQ,1,OR,31,8,ZD,EQ,4),AND,40,8,ZD,LE,5)         
//*

_________________
Kolusu
www.linkedin.com/in/kolusu
Back to top
View user's profile Send private message Send e-mail Visit poster's website
misi01
Advanced


Joined: 02 Dec 2002
Posts: 629
Topics: 176
Location: Stockholm, Sweden

PostPosted: Mon Jul 14, 2014 4:03 am    Post subject: Thanks Kolusu Reply with quote

Tried your suggestion (the first bit) and removed the OUTFIL statement. I understand now (I think) how it works.

That ISN'T to say I'll be able to put together the next variation on this theme without a load of experimenation.

Thanks again
_________________
Michael
Back to top
View user's profile Send private message Send e-mail
Display posts from previous:   
Post new topic   Reply to topic   printer-friendly view    MVSFORUMS.com Forum Index -> Utilities All times are GMT - 5 Hours
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


MVSFORUMS
Powered by phpBB © 2001, 2005 phpBB Group