SAS Base

Author: ShawnDuan | Published 2018-03-09 05:23

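Variable names

A name must be at most 32 bytes long (a letter takes 1 byte; a Chinese character takes 2 bytes in a double-byte encoding such as GBK).
It must start with a letter or an underscore.
It may contain letters, digits, and underscores, but not special characters such as % $ ! * & # @.
Letters may be uppercase or lowercase; names are not case-sensitive.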
Missing numeric data are represented by a single period (.) and missing character data are represented by blanks.

library name

1 to 8 characters; the first must be a letter or an underscore, and the rest may be letters, digits, or underscores.
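
A library name that follows these rules is assigned with a LIBNAME statement. The sketch below is only an illustration: the libref mylib and the folder C:\mydata are hypothetical, and the folder is assumed to exist already. It also shows how a one-level dataset name goes to the temporary WORK library while a two-level name is stored permanently.

```sas
/* 'C:\mydata' is a hypothetical folder; it must already exist */
libname mylib 'C:\mydata';

/* One-level name: stored in the temporary WORK library */
data scores;
  x = 1;
run;

/* Two-level name: stored permanently in the mylib library */
data mylib.scores;
  set scores;
run;
```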

Comments

Start with an asterisk (*) and end with a semicolon (;)
Or start with slash-asterisk (/*) and end with asterisk-slash (*/)
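
A minimal sketch of both comment styles; the DATA step itself is only a placeholder so that the comments have something to annotate.

```sas
* This is a statement-style comment, ended by a semicolon;

/* This is a block-style comment.
   It can span several lines and contain semicolons; like this one. */

data _null_;
  x = 1;   /* block-style comments can also follow a statement */
  put x=;  /* writes x=1 to the log */
run;
```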

Difference between DATA steps and PROC steps

The DATA statement does three things

Tells SAS that a DATA step is starting.

Names the SAS dataset being created.

Sets the variables used in the DATA step to missing values.

Three default windows

1. Program Editor window
2. Log window
3. Output window

The basics of using SAS

Prepare the SAS program

Submit it for analysis

Review the resulting log for errors

Examine the output files to view the results of your analysis

Executing the program

Pull down the Locals menu and select Submit.

Click on the run icon on taskbar, which is a picture of a man running.

Push F8.

Highlight text and click on run symbol

Note: A DATA or PROC step is not executed until SAS encounters the next DATA or PROC statement. Use a RUN; statement to force execution.

Reading a .dat file

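DATA NAME;
  /* FIRSTOBS=4 starts reading at line 4 of the file. */
  /* DLM= is omitted because it only affects list input, not column input. */
  INFILE 'E:\data\a.dat' FIRSTOBS=4;
  /* Column input: V1 in columns 1-5, V2 in columns 6-10, V3 is the character in column 15. */
  INPUT V1 1-5   V2 6-10   V3 $ 15;
RUN;
PROC PRINT DATA=NAME; RUN;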

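INFILE options

Syntax: INFILE 'AAAAA.DAT' options;
FIRSTOBS=n: the line at which SAS starts reading data
OBS=n: the last line SAS reads
MISSOVER: if the INPUT statement reaches the end of a line before all variables have values, the remaining variables are set to missing and SAS does not go to the next line; without it, SAS automatically continues reading from the next line
TRUNCOVER: for column or formatted input on lines of varying length; if a line ends partway through a field, SAS takes what is there instead of going to the next line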
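
A small sketch of MISSOVER with list input. The dataset name, variable names, and data lines are made up for illustration. Without MISSOVER, SAS would go to the next line to look for Bob's missing test scores.

```sas
data scores;
  infile datalines missover;   /* do not read the next line when values run out */
  input name $ test1 test2 test3;
  datalines;
Ann 90 85 88
Bob 72
Cal 65 70
;
run;

proc print data=scores; run;
```

With MISSOVER, Bob's test2 and test3 and Cal's test3 are set to missing, and each raw line still produces exactly one observation.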

INPUT Notes

(1) Informats can be repeated when several variables share the same informat. The three statements below are equivalent ways of reading x1-x5 with the 4. informat.
INPUT x1 4. x2 4. x3 4. x4 4. x5 4.;
INPUT (x1 x2 x3 x4 x5) (4. 4. 4. 4. 4.);
INPUT (x1-x5) (5*4.);
(2) @@ tells SAS to hold the line of raw data and use it when processing the next observation. The @@ must be the last entry in the INPUT statement.
(3) @ tells SAS to hold this line of data for possible use by INPUT statements later in the DATA step. The @ must be the last entry in the INPUT statement.
(4) / tells SAS to move to the next line of the raw dataset.
(5) #n tells SAS to skip to the nth line of the raw data for the observation.
(6) @n tells SAS to move to the nth column.
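
Notes (2) and (3) can be sketched as follows. The dataset names, variable names, and data lines are invented for illustration only.

```sas
/* Double trailing @@: several observations come from one raw line */
data pairs;
  input id score @@;
  datalines;
1 90 2 85 3 77
4 60 5 95
;
run;

/* Single trailing @: hold the line, test a value, then read the rest */
data females;
  input sex $ 1 @;
  if sex = 'F';                 /* subsetting IF: keep only the F records */
  input age 3-4 height 6-8;
  datalines;
F 12 150
M 13 155
F 11 142
;
run;
```

The first step creates five observations from two raw lines; the second keeps only the two F records.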

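Special characters

@40 moves to column 40; @'aa' moves to just after the string aa
Slash (/) moves to the next line of the raw data
#2 moves to the second line of the current observation
To read several observations from one line, put @@ at the end of the INPUT statement
A single @ at the end of the INPUT statement (trailing @) holds the line; it can be used to select part of the data, for example by testing one value before reading the rest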

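Reading delimited files in a DATA step

DLM=',' specifies a comma delimiter; DLM='09'x specifies a tab delimiter
DSD: ignores delimiters inside quoted values. For example, in the record Joseph,76,"Red Racers, Washington" the commas outside the quotation marks are treated as delimiters, while the comma inside them is not. DSD also strips the quotation marks from character values and treats two adjacent delimiters as a missing value.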
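
A minimal sketch of DSD on in-stream data. The first record is the one quoted above; the second record, the dataset name, and the variable names are made up to show two adjacent delimiters.

```sas
data racers;
  infile datalines dsd;          /* DSD uses a comma as the default delimiter */
  length name $ 10 team $ 30;    /* make room for the long team value */
  input name $ score team $;
  datalines;
Joseph,76,"Red Racers, Washington"
Mitchel,,"Blue Bunnies, Richmond"
;
run;

proc print data=racers; run;
```

The quoted commas stay inside team, the quotation marks are removed, and the second record gets a missing score.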

Reading Excel data

PROC IMPORT DATAFILE='D:\A.XLS' OUT=A  REPLACE DBMS=XLS; GETNAMES=YES; SHEET="Sheet1"; RUN;
PROC PRINT DATA=A; RUN;

OUT= name of the output dataset
DBMS= file type: XLS or XLSX

Reading a sas7bdat file (a file on the desktop)

data new; set 'C:\Users\sdkyc\Desktop\hsb2.sas7bdat'; run;
proc print data=new; run;

Temporary vs. permanent datasets

Variable assignment and arithmetic

IF-THEN, DO, IF-ELSE

DO and END form a pair; all the statements between them are executed.

DATA A;
INFILE 'C:\A.DAT';
INPUT V1 $ V2 V3;
IF V2 = .  THEN   V4='MISSING';
  ELSE IF V2<100  THEN   V4='LOW';
  ELSE IF V2<1000  THEN   V4='MEDIUM';
  ELSE V4 = 'HIGH';
RUN;

IF can also be used to construct subsets (subsetting IF).
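
A short sketch of a DO ... END group and a subsetting IF in one step. The dataset name, variable names, pass mark of 60, and data lines are all invented for illustration.

```sas
data graded;
  input name $ score;
  if score >= 60 then do;      /* both statements run when the condition holds */
    result = 'PASS';
    bonus  = score - 60;
  end;
  else do;
    result = 'FAIL';
    bonus  = 0;
  end;
  if score ne .;               /* subsetting IF: drop rows with a missing score */
  datalines;
Ann 90
Bob 45
Cal .
;
run;
```

Only Ann and Bob are written to the output dataset, because Cal's missing score fails the subsetting IF.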

Simplifying programs with arrays (ARRAY)

ARRAY array-name <{n}> <$> <length> <elements> <(initial-values)>;
array-name - is the name of the array.
{n} - is either the dimension of the array, or an asterisk (*) to indicate that the dimension is determined from the number of array elements or initial values.
$ - indicates that the array type is character.
length - is the maximum length of elements in the array. For character arrays, the 200-byte limit applied to SAS 6; current releases allow character variables of up to 32,767 bytes.
elements - are the variables that make up the array. They may already exist in the dataset; any that do not are created by the ARRAY statement.
initial-values - are the values to use to initialize some or all of the array elements. Separate these values with commas or blanks.
ARRAY rain {5} janr febr marr aprr mayr;
ARRAY days{7} d1-d7;
ARRAY month{*} jan feb jul oct nov;
ARRAY x{*} _NUMERIC_;
ARRAY qbx{10};
ARRAY meal{3};
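
A sketch of how an array shortens repetitive code, using the rain array declared above. Treating 999 as a code for "missing" is an assumption made only for this example, as are the dataset name and the data line.

```sas
data rainfall;
  input janr febr marr aprr mayr;
  array rain {5} janr febr marr aprr mayr;
  do i = 1 to 5;
    /* assumed convention: 999 marks a missing measurement */
    if rain{i} = 999 then rain{i} = .;
  end;
  drop i;                       /* the loop counter is not needed in the output */
  datalines;
10 999 25 30 999
;
run;
```

One loop replaces five separate IF statements, one per month variable.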

Link to notes on each PROC

https://stats.idre.ucla.edu/other/annotatedoutput/

PROC CONTENTS: retrieves the descriptor portion of a dataset, not the data itself

PROC MEANS

Outputs descriptive statistics; its functionality overlaps with PROC UNIVARIATE.
MAXDEC= sets the number of decimal places.
proc means data=a N NMISS MEAN STD STDERR MAXDEC=4; run;

PROC UNIVARIATE t-test sample mean mu0

The Tests for Location table reports a two-tailed t-test: look at the Student's t row. If p < α, the mean of write is significantly different from 30 (the value given by mu0=).
proc univariate data = "D:\hsb2" plots normal mu0=30; var write; run;
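PROC UNIVARIATE can also test normality and draw plots: if the Shapiro-Wilk p-value is greater than α, normality is not rejected and the data can be treated as normally distributed.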
proc univariate data=a normal plot; var write; run;

1. These tests check the assumption that the data follow a normal distribution.
2. Null hypothesis: the data are normal. Alternative hypothesis: the data are not normal.
3. A large p-value (e.g. > 0.05) means we fail to reject the null hypothesis, so the data are consistent with normality.
4. If 6 < sample size < 2001, use Shapiro-Wilk.
5. If sample size > 2000, use the Kolmogorov-Smirnov test.
6. Within the appropriate sample size range, Shapiro-Wilk is more powerful than the Kolmogorov-Smirnov test.
7. Skewness or kurtosis far from 0 suggests non-normality.

PROC FREQ TABLES chisq

Tests whether two variables are associated, i.e. whether they are independent of each other. Find the chi-square statistic in the output: a large chi-square statistic corresponds to a small p-value. If the p-value is small enough (say < 0.05), we reject the null hypothesis that the two variables are independent and conclude that there is an association between the row and the column variables.
PROC FREQ DATA=CLASSFIT2; TABLES SEX*HT/CHISQ; RUN;
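
A self-contained sketch of the same test on pre-counted data. The dataset name, variable names, and cell counts are made up; the WEIGHT statement tells PROC FREQ that each row stands for count observations.

```sas
data survey;
  input sex $ answer $ count;
  datalines;
F yes 30
F no  20
M yes 15
M no  35
;
run;

proc freq data=survey;
  weight count;                    /* each row represents 'count' people */
  tables sex*answer / chisq;       /* chi-square test of independence */
run;
```

The Chi-Square row of the output gives the statistic and its p-value, read as described above.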

PROC REG

Assumption

a. Normality of errors: the error distribution is normal.
b. Normality of errors is checked by residual analysis: first compute the residuals (r = y - ŷ, observed minus predicted), then verify that the residuals are normal using PROC UNIVARIATE or Q-Q plots.
c. Independence: the errors (observations) are independent of each other. Counterexample: Apple's stock price recorded on 10 consecutive days; those 10 observations are not independent.
d. Variables must be numeric.
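
The residual check in (b) might look like the sketch below. It uses the SASHELP.CLASS sample dataset that ships with SAS rather than the author's data; the output dataset name resid_out and the variable names resid and pred are arbitrary choices.

```sas
proc reg data=sashelp.class;
  model weight = height;
  output out=resid_out r=resid p=pred;   /* save residuals and predicted values */
run;
quit;

proc univariate data=resid_out normal;   /* NORMAL requests the normality tests */
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;
```

The Tests for Normality table and the Q-Q plot are then read the same way as in the PROC UNIVARIATE section.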

PROC ANOVA

Assumption: the sampled populations are normally distributed.
One-way ANOVA: only one factor (a single variable, which may have several levels).
See the slides (PPT).

PROC GLM contrast

http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#glm_toc.htm
1. Question: is the mean height the same at every age? H0: μ1=μ2=μ3=μ4
proc glm data=a; class age; model height=age; run;
2. Question: does the mean height of 11- and 12-year-olds differ from the mean height of 13- to 16-year-olds?

proc glm data=a; class age; 
model height=age;
contrast '11&12 vs. rest' 
age 2 2 -1 -1 -1 -1; run; quit;
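
The coefficients in the CONTRAST statement can be derived as follows, assuming age has the six levels 11 to 16 in sorted order (which the six coefficients suggest). Here \mu_a denotes the mean height at age a; multiplying the first line by 4 gives the second.

```latex
\begin{align*}
H_0:\quad \frac{\mu_{11}+\mu_{12}}{2} &= \frac{\mu_{13}+\mu_{14}+\mu_{15}+\mu_{16}}{4} \\
\Longleftrightarrow\quad 2\mu_{11}+2\mu_{12}-\mu_{13}-\mu_{14}-\mu_{15}-\mu_{16} &= 0
\end{align*}
```

The coefficients 2, 2, -1, -1, -1, -1 sum to zero, as a contrast requires.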

PROC CORR

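Shows the correlation coefficients between variables (Pearson correlation coefficients): a negative value means negative correlation, a positive value means positive correlation.
NOSIMPLE suppresses the descriptive statistics.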
proc corr data = "D:\hsb2" pearson nosimple; var read write; run;

PROC TTEST t-test

Assumption: all variables are normally distributed.

Single-sample t-test. Example: test whether the mean of score equals 50; if p < α, the mean is significantly different from 50.
proc ttest data="D:\hsb2" H0=50; var score; run;
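Dependent-group t-test (paired t-test). Example: a group of students each took two exams; test whether the mean write score equals the mean read score. If p < α, the two means differ significantly.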
proc ttest data="D:\hsb2"; paired write*read; run;
Independent-group t-test. Example: does sex affect the write score?

If Pr > F in the Equality of Variances table is less than α, the variances of the two groups differ, so use the Satterthwaite (unequal) method and read its Pr > |t|; otherwise use the Pooled (equal) method.
proc ttest data="D:\hsb2"; class sex; var write; run;

PROC NPAR1WAY

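Used for Wilcoxon tests. With the WILCOXON option, PROC NPAR1WAY performs the Wilcoxon rank-sum test for independent groups; for paired data such as the examples below, the Wilcoxon signed-rank test applies, which SAS reports in the Tests for Location table of PROC UNIVARIATE run on the paired differences. Example questions:
Are test scores different from 4th grade to 5th grade on the same students?
Does a particular diet drug have an effect on BMI when tested on the same individuals?
Assumptions of the test:
Data come from two matched, or dependent, populations.
The data are continuous.
Because it is a non-parametric test, it does not require any particular distribution of the dependent variable in the analysis.
It is especially suitable for small sample sizes.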
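
A sketch of both Wilcoxon variants. The BMI values and the dataset name bmi are made up; the second step uses the SASHELP.CLASS sample dataset that ships with SAS.

```sas
/* Made-up BMI values before and after treatment for the same 6 people */
data bmi;
  input id before after;
  diff = after - before;
  datalines;
1 31.2 29.8
2 28.5 28.9
3 35.0 33.1
4 30.4 29.0
5 27.9 27.5
6 33.3 31.6
;
run;

/* Paired data: the Signed Rank row of Tests for Location tests whether diff is centered at 0 */
proc univariate data=bmi;
  var diff;
run;

/* Independent groups: Wilcoxon rank-sum test */
proc npar1way data=sashelp.class wilcoxon;
  class sex;
  var height;
run;
```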

one- and two-tail test

P value

If the test has H0: mean = 0 and the result is p < α, reject H0: the mean is significantly different from 0.

Code template

proc print data= ; run;
