News & Events
Free Training!
SSC is offering free training sessions to companies in the Midwest! Contact us for details! admin@sys-seminar.com
SUM Enchanted Procedure:
The Advanced Power of PROC SUMMARY
David Beam
Systems Seminar Consultants, Inc.
Madison, WI 53711
(608) 278-9964
Abstract
Many SAS users are uncomfortable or unsure about the inherent power of PROC SUMMARY. The creation of summarized data sets, with emphasis on the CLASS statement will be demonstrated in this paper.
Explanations of how SAS organizes the output file will help the user take advantage of this powerful SAS tool. There are many techniques of preparing new variables before going into PROC SUMMARY that can allow the procedure to generate counters and statistics, as well as the traditional sums.
The use of PROC FORMAT together with PROC SUMMARY can also allow much flexibility in the use of this procedure.
Introduction
PROC SUMMARY is a basic 'statistical' procedure, in that numeric variables can be used to compute a variety of statistics, including averages, sums, standard deviations, etc. Often overlooked is how this procedure can save programming time when used to full potential.
This paper is not meant to be a statistics course, but will look at the procedure for very common business analysis needs.
In the simple sense, PROC SUMMARY will read in a data set and create an output data set with whatever statistics are needed. The statistics coming out can be grouped together by values of the CLASS variables, or only a single total can be created.
The two procedures, MEANS and SUMMARY, are now (with release 6 or higher of The SAS System) virtually identical. The main difference is that SUMMARY does not produce a printed report as a default, but PROC MEANS assumes a report is needed.
The basic syntax of PROC SUMMARY is shown here. Many of the optional statements and keywords will be discussed later. This is NOT meant to replace the SAS documentation, refer to the SAS Language: Reference manual for complete description of the procedure.
PROC SUMMARY <option list> <statistic-keyword-list>;
VAR variable-list;
CLASS variable-list;
FREQ variable;
WEIGHT variable;
ID variable-list;
BY variable-list;
OUTPUT <OUT=SAS-data-set>
<output-statistic-list>
<statistic(variable)=new-var-name>
<statistic(variable)=new-var-name>
...
;
RUN;
Partial List of Keywords on the PROC Statement
DATA= data set name (input)
DESCENDING
orders the output data set in order of the _TYPE_ variable, default is ascending
MISSING
treats missing values for the CLASS variable as a valid subgroup
NWAY
only the highest value for the _TYPE_ variable will be output
ORDER= DATA | EXTERNAL | FORMATTED |
FREQ | INTERNAL
specifies sorting order of the CLASS variables for output of the data set. Default is INTERNAL, which is sequential unformatted values.
Partial List of Keyword Statistics Available
N the count of observations having non-missing values for the selected variable
NMISS the count of missing values for the variable
MIN the minimum value
MAX the maximum value
SUM the total of the specified variable
MEAN the average (mean)
STD the standard deviation
If no keyword statistics are present, PROC SUMMARY produces the following default statistics:
N, MEAN, STD, MIN, MAX
Statements used with PROC SUMMARY
The following is a partial list of the statements that can be used to control the procedure:
BY variables;
Names variable(s) that will be used to obtain separate statistics when the value changes. The data set must be sorted or indexed by these variables.
CLASS variables;
Used to form sub-groups of output data. Similar to the BY statement, but does not require sorting the data first.
With multiple variables on the CLASS statement, the procedure will generate statistics for each unique value of the CLASS variable, as well as combination statistics for all combinations of all CLASS values.
OUTPUT statement
This "drives" the procedure. The OUTPUT statement can name the new data set being created, select the variables and statistics to generate, and name the new variables being created.
There is no limit to the number of OUTPUT statements in a single PROC SUMMARY.
Sample Data Set
The examples that follow will use the following set of data as input. The CONTENTS of the data set is:
# Variable Type Len Pos
-----------------------------------
1 STATE Char 2 0
2 SALESREP Char 5 2
3 MONTH Char 2 7
4 SALES Num 8 9
5 COST Num 8 17
PROC PRINT of the sample data set:
STATE SALESREP MONTH SALES COST
CA DONNA 01 231 180
CA FRED 02 74 83
CA MARY 01 156 148
CA MARY 02 143 97
CA MARY 02 98 65
NY DONNA 01 . .
NY FRED 02 210 155
NY SAM 02 104 96
OK FRED 02 64 72
OK SAM 01 237 184
OK SAM 01 . .
OK SAM 02 94 72
WI FRED 01 53 53
WI FRED 02 197 142
WI JERRY 01 . .
WI JERRY 01 215 184
WI JERRY 02 14 8
WV DONNA 01 315 207
WV DONNA 02 96 78
WV SAM 02 174 169
WV SAM 02 97 53
====== ======
2572 2046
Case 1 - Summary by State
For our first example, we will create a summary file for each state. We need to know total sales and the count of missing values for sales in each state, as well as the highest (MAX) sales in each state.
PROC SUMMARY DATA= sugi.sales;
CLASS state;
VAR sales;
OUTPUT OUT=example1
SUM(sales) = totsales
MAX(sales) = maxsales
NMISS(sales) = badsales
;
RUN;
PROC PRINT DATA=example1;
RUN;
OBS STATE _TYPE_ _FREQ_ TOTSALES MAXSALES BADSALES
1 0 21 2572 315 3
2 CA 1 5 702 231 0
3 NY 1 3 314 210 1
4 OK 1 4 395 237 1
5 WI 1 5 479 215 1
6 WV 1 4 682 315 0
The output data set contains one observation with total statistics for the complete incoming data set, as well as one observation for each value of the CLASS variable.
The _TYPE_ variable created by SAS shows the level of the totals on that observation:
The _TYPE_ = 0 observation are the stats for the complete data set.
The _TYPE_ = 1 level are the stats for the values of STATE, or the class variable.
When more variables are present on the CLASS statement, then SAS produces even more output observations. The value of the _TYPE_ variable will increase as well. A summary observation is produced for each possible combination of values of the CLASS variables.
Case 2 - Summary by State and Salesrep
Now we will run the same PROC SUMMARY as above, changing the CLASS statement to include two variables:
PROC SUMMARY DATA= sugi.sales;
CLASS state salesrep;
VAR sales;
OUTPUT OUT=example2
SUM(sales) = totsales
MAX(sales) = maxsales
NMISS(sales) = badsales
;
RUN;
PROC PRINT DATA=example2 NOOBS
WIDTH=MIN HEADING=V;
RUN;
S T M B
A O A A
L _ _ T X D
S E T F S S S
T S Y R A A A
A R P E L L L
T E E Q E E E
E P _ _ S S S
0 21 2572 315 3
DONNA 1 4 642 315 1
FRED 1 5 598 210 0
JERRY 1 3 229 215 1
MARY 1 3 397 156 0
SAM 1 6 706 237 1
CA 2 5 702 231 0
NY 2 3 314 210 1
OK 2 4 395 237 1
WI 2 5 479 215 1
WV 2 4 682 315 0
CA DONNA 3 1 231 231 0
CA FRED 3 1 74 74 0
CA MARY 3 3 397 156 0
NY DONNA 3 1 . . 1
NY FRED 3 1 210 210 0
NY SAM 3 1 104 104 0
OK FRED 3 1 64 64 0
OK SAM 3 3 331 237 1
WI FRED 3 2 250 197 0
WI JERRY 3 3 229 215 1
WV DONNA 3 2 411 315 0
WV SAM 3 2 271 174 0
As can be seen, the output now contains values of the _TYPE_ variable that range from 0 to 3:
_TYPE_ Description
0 totals for complete data set
1 totals for each SALESREP
2 totals for each STATE
3 totals for SALESREP within STATE
The data set produced by PROC SUMMARY can now be used to generate a variety of reports, by using the WHERE statement to pull the required data:
PROC PRINT DATA=example2(DROP=salesrep)
NOOBS WIDTH=MIN HEADING=V;
TITLE "State Totals Only";
WHERE _type_ = 2;
RUN;
State Totals Only
T M B
O A A
_ _ T X D
S T F S S S
T Y R A A A
A P E L L L
T E Q E E E
E _ _ S S S
CA 2 5 702 231 0
NY 2 3 314 210 1
OK 2 4 395 237 1
WI 2 5 479 215 1
WV 2 4 682 315 0
Case 3 - Three (or more) variables on the CLASS statement
Change the above program so the CLASS statement contains three variables:
PROC SUMMARY DATA= sugi.sales;
CLASS state salesrep month;
VAR sales;
OUTPUT OUT=example3
SUM(sales) = totsales
MAX(sales) = maxsales
NMISS(sales) = badsales
;
RUN;
It now becomes difficult to determine what the values of the _TYPE_ variable will represent. The following simple DATA step can be used to get the first observation for each value of the _TYPE_ variable. These can then be printed to review what the level of totals the _TYPE_ variable represent.
This program works because the data set produced by PROC SUMMARY is output in the order of the _TYPE_ values:
/* Produce data set with one observation per value */
/* of the _TYPE_ variable that is created from */
/* PROC SUMMARY earlier. */
DATA small;
SET example3;
BY _type_;
if FIRST._type_;
RUN;
PROC PRINT DATA = small;
TITLE "One per value of _TYPE_";
VAR _type_ state salesrep month;
run;
One per value of _TYPE_
OBS _TYPE_ STATE SALESREP MONTH
1 0
2 1 01
3 2 DONNA
4 3 DONNA 01
5 4 CA
6 5 CA 01
7 6 CA DONNA
8 7 CA DONNA 01
This report can now be used to easily see what the _TYPE_ values represent:
_TYPE_ Description
0 totals for complete data set
1 totals for each MONTH alone
2 totals for each SALESREP alone
3 totals for each combination of MONTH and SALESREP
4 totals for STATE alone
5 each MONTH and STATE combo
6 each SALESREP and STATE combo
7 final detail, totals for each combination of MONTH, SALESREP and STATE
Using the NWAY Option
Adding the keyword NWAY to the PROC statement tells SAS that the output data set should only contain the highest value of the _TYPE_ variable. So:
PROC SUMMARY DATA=sugi.sales NWAY;
...
will create a data set with only _TYPE_ = 7 in the above example.
Case 5 - No output Statistics Specified
If you use a CLASS statement and an OUTPUT statement but do not specify the desired statistic keywords, you will get a very different type of data set.
SAS produces five observations for each value of the CLASS variables, one observation for each of the five default statistics: N, MEAN, STD, MIN and MAX.
The name of the statistic generated is stored in the variable _STAT_ in the new data set.. The following program and print show the results:
PROC SUMMARY DATA= sugi.sales;
CLASS state;
VAR sales;
OUTPUT OUT=example4;
RUN;
PROC PRINT DATA= example4;
RUN;
OBS STATE _TYPE_ _FREQ_ _STAT_ SALES
1 0 21 N 18.000
2 0 21 MIN 14.000
3 0 21 MAX 315.000
4 0 21 MEAN 142.889
5 0 21 STD 78.926
6 CA 1 5 N 5.000
7 CA 1 5 MIN 74.000
8 CA 1 5 MAX 231.000
9 CA 1 5 MEAN 140.400
10 CA 1 5 STD 60.550
11 NY 1 3 N 2.000
12 NY 1 3 MIN 104.000
13 NY 1 3 MAX 210.000
14 NY 1 3 MEAN 157.000
15 NY 1 3 STD 74.953
16 OK 1 4 N 3.000
17 OK 1 4 MIN 64.000
18 OK 1 4 MAX 237.000
19 OK 1 4 MEAN 131.667
20 OK 1 4 STD 92.446
21 WI 1 5 N 4.000
22 WI 1 5 MIN 14.000
23 WI 1 5 MAX 215.000
24 WI 1 5 MEAN 119.750
25 WI 1 5 STD 101.125
26 WV 1 4 N 4.000
27 WV 1 4 MIN 96.000
28 WV 1 4 MAX 315.000
29 WV 1 4 MEAN 170.500
30 WV 1 4 STD 103.029
This data set can be used for specific kinds of reports, where each statistic is a separate observation. A common use for this data would be a PROC PLOT, with each statistic as a separate plot.
Case 7 - Counting a Variety of Situations
Frequently we need counts of activity happening with the data, but don't have the specific information needed as separate variables in the data set.
For example, in the sample data set, suppose we need to know the number or count of observations that had a cost/sales ratio of greater than 75%. We also need a count of SALES under $150.00.
One method would be to pass the data set through a DATA step, create new variables with either a value 1 or a missing value to signify the desired "true or false" conditions. These can then be counted with PROC SUMMARY, at the same time other statistics are generated.
DATA changed;
SET sugi.sales;
/************************************/
/* assign 1 to counter variable for */
/* profit percent over 75% */
/* and sales under $150 */
/************************************/
profit = ROUND(cost/sales, .01);
IF profit > .75 THEN over75 = 1;
IF sales < 150 THEN und150 = 1;
RUN;
PROC SUMMARY DATA= changed NWAY;
CLASS state;
VAR sales cost over75 und150;
OUTPUT OUT=example5(DROP=_FREQ_)
SUM(sales) = totsales
SUM(cost) = totcost
N(over75) = over75
N(und150) = und150
;
RUN;
PROC PRINT DATA=example5 NOOBS LABEL SPLIT='*'
WIDTH=MIN;
TITLE "Counting Computed events in data";
LABEL over75 = "Count*Ratio*> 75%"
und150 = "Count*Sales*< $150";
RUN;
Counting Computed events in data
Count Count
Ratio Sales
STATE _TYPE_ TOTSALES TOTCOST > 75% < $150
CA 1 702 573 3 3
NY 1 314 251 1 2
OK 1 395 328 3 3
WI 1 479 387 2 3
WV 1 682 507 2 2
Case 8 - Using Formatted CLASS Variables
Suppose you really want to group the STATE values together into regions for the summary totals. You could create a new variable containing the region assignments in a DATA step.
A better approach is to use PROC FORMAT to assign the region codes based on STATE values, and tell PROC SUMMARY to use the formatted region assignments.
PROC FORMAT;
VALUE $regions
'WI' = 'Midwest'
'CA', 'OK' = 'West'
'NY', 'WV' = 'East'
;
RUN;
PROC SUMMARY DATA= sugi.sales NWAY;
CLASS state;
FORMAT state $regions.;
VAR sales cost;
OUTPUT OUT=example6
SUM(sales) = totsales
SUM(cost) = totcost
;
RUN;
PROC PRINT DATA=example6 NOOBS
WIDTH=MIN;
TITLE "Proc Summary used formatted States";
RUN;
Proc Summary used formatted States
STATE _TYPE_ _FREQ_ TOTSALES TOTCOST
West 1 9 1097 901
East 1 7 996 758
Midwest 1 5 479 387
NOTE - the data set created above has a variable STATE, that contains the unformatted value of state. The $REGIONS. Format was carried with the data set and used in PROC PRINT. The value of STATE came out, as a default, in ascending order ('CA' first).
If you do a PROC PRINT with the following FORMAT assigned to STATE, you will see the original STATE value is passed to the data set:
FORMAT STATE $2.;
Using the ORDER= option on PROC SUMMARY
To change to order of the output observations to reflect the formatted values, the above PROC SUMMARY should be changed to request SAS use the formatted values instead of the unformatted values for output
(the results are shown below):
PROC SUMMARY DATA= sugi.sales NWAY ORDER=FORMATTED;
STATE _TYPE_ _FREQ_ TOTSALES TOTCOST
East 1 7 996 758
Midwest 1 5 479 387
West 1 9 1097 901
Remember that the variable STATE still contains the unformatted two character state code for the first observation in the formatted CLASS grouping..
Other values for the ORDER= option include:
DATA in original order of the incoming data set
EXTERNAL or FORMATTED
As shown above, using the assigned formats
FREQ output is in descending frequency counts (levels with most observations first)
INTERNAL
unformatted value (default)
Using the MAXID or MINID options
Suppose you want to compute the total sales by STATE, and find out the name of the SALESREP that had the highest and lowest sales in each STATE.
Using the MAXID and MINID options on the OUTPUT statement, name the variable that contains the value you want to see to identify the highest/lowest value for that value of the CLASS variable:
PROC SUMMARY DATA= sugi.sales;
CLASS state;
VAR sales;
OUTPUT OUT=example7
SUM(sales) = totsales
MAXID(sales(salesrep)) = best
MINID(sales(salesrep)) = worst
;
RUN;
PROC PRINT DATA=example7 NOOBS
WIDTH=MIN;
TITLE "Best/Worst Salesreps in Each State";
TITLE2 "and in Total U.S.";
RUN;
Best/Worst Salesreps in Each State
and in Total U.S.
STATE _TYPE_ _FREQ_ TOTSALES BEST WORST
0 21 2572 DONNA JERRY
CA 1 5 702 DONNA FRED
NY 1 3 314 FRED SAM
OK 1 4 395 SAM FRED
WI 1 5 479 JERRY JERRY
WV 1 4 682 DONNA DONNA
NOTE - it is important to understand that the above shows which salesrep had the highest/lowest individual observation in the data set, not the highest/lowest total sales. In 'CA', Mary had total sales higher than Donna, but Donna had the single highest sales amount in that state.
Missing Values were ignored in the statistics, including the selection of the lowest sales by salesrep.
CONCLUSION
PROC SUMMARY is one of my favorite tools to reduce a large set of data to meaningful and manageable totals. With some creative use of new numeric variables, formats and options in the procedure, this tool can perform very complex summarization tasks.
I suggest reviewing the SAS documentation for further use of this PROC, specifically the more statistically oriented features.
SAS is a registered trademark of SAS Institute Inc.
References
SAS Institute Inc. , SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute Inc.
Author
The author will be glad to answer questions and accept suggestions at the following address:
David Beam
Systems Seminar Consultants
2997 Yarmouth Greenway Drive
Madison, WI 53711
voice: (608) 278-9964
Fax: (608) 278-0065
E-mail: <train@sys-seminar.com>
Website: <http://www.sys-seminar.com>
