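The previous post covered the text-processing tool grep; this one summarizes awk, another and more powerful tool.
AWK is an interpreted programming language. It is very powerful and was designed specifically for processing text data. Its name comes from the initials of its designers: Alfred Aho, Peter Weinberger, and Brian Kernighan. Besides text processing, awk can generate formatted text reports and perform arithmetic, string operations, and more.
Workflow
The awk workflow has just three steps: read, execute, and repeat.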
![](https://img.haomeiwen.com/i5914725/18ff4691ce85d62c.jpg)
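(1) Read
AWK reads one line from the input stream (a file, a pipe, or standard input) and stores it in memory.
(2) Execute
For each input line, all AWK commands are executed in order. By default the commands apply to every input line, but they can be restricted to lines that match a given pattern.
(3) Repeat
The two steps above repeat until the end of the input.
awk program structure
A program has three blocks: the BEGIN block, the body block, and the END block.
BEGIN block
Syntax:
BEGIN {awk-commands}
Runs exactly once, at program start and before any input is read. It is typically used for initialization such as assigning variables. BEGIN must be uppercase.
The BEGIN block is optional.
Body block
Syntax:
/pattern/ {awk-commands}
The body block is where the program processes each line of the file. The pattern is optional; without one, the commands run for every line.
END block
Syntax:
END {awk-commands}
Like the BEGIN block, the END block runs only once, after all input has been processed. It is also optional, and the END keyword must be uppercase.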
Example
File contents of test.txt (the sample file used throughout this post)
1) Amit Physics 80
2) Rahul Maths 90
3) Shyam Biology 87
4) Kedar English 85
5) Hari History 89
awk command example
awk 'BEGIN{printf "NO\tName\tSubject\tMark\n"} {print $0}' test.txt
Output:
NO Name Subject Mark
1) Amit Physics 80
2) Rahul Maths 90
3) Shyam Biology 87
4) Kedar English 85
5) Hari History 89
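The three blocks can be seen working together in a slightly larger program. The following is a minimal sketch, not part of the original post: it recreates the sample data in a temporary file (the path and the variable names `total`, `high`, and `summary` are illustrative assumptions) and uses a BEGIN block for initialization, one rule with a pattern, one rule without, and an END block for the report.

```shell
# Recreate the sample file in a temporary location (illustrative path).
tmp=$(mktemp)
printf '%s\n' '1) Amit Physics 80' '2) Rahul Maths 90' '3) Shyam Biology 87' \
    '4) Kedar English 85' '5) Hari History 89' > "$tmp"

summary=$(awk '
BEGIN { total = 0; high = 0 }   # runs once, before any input is read
$4 >= 87 { high++ }             # rule with a pattern: only marks of 87 or more
{ total += $4 }                 # rule without a pattern: every record
END { printf "records=%d total=%d high=%d\n", NR, total, high }  # runs once, after the last record
' "$tmp")

printf '%s\n' "$summary"
rm -f "$tmp"
```

With the five sample records this prints `records=5 total=431 high=3`.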
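Basic syntax
The basic invocation is awk [options] 'program' file: the awk command itself, then the awk program (the blocks described above), and finally the file to process.
When the program is typed on the command line, it must be enclosed in single quotes, and each action must be enclosed in {}. For example, to print all of test.txt:
awk '{print}' test.txt
An awk program can be given on the command line or stored in a file. To repeat the operation above from a file, create exe.awk with this content:
{print}
Running a program file requires the -f option: awk -f awkcommandfile targetfile
awk -f exe.awk test.txt
This prints the full contents of test.txt, the same as the command-line version.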
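As a self-contained illustration of -f, here is a hedged sketch that writes a small program file and runs it. The file names `count.awk` and `in.txt` and the variable `numbered` are made up for this example.

```shell
dir=$(mktemp -d)

# A program file may contain comments and several rules, one per line.
cat > "$dir/count.awk" <<'EOF'
# count.awk (hypothetical name): prefix each record with its record number
{ print NR ": " $0 }
EOF

printf 'alpha\nbeta\n' > "$dir/in.txt"

# -f reads the program from the file instead of from the command line.
numbered=$(awk -f "$dir/count.awk" "$dir/in.txt")
printf '%s\n' "$numbered"
rm -rf "$dir"
```

Keeping the program in a file avoids shell quoting problems once it grows beyond a one-liner.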
Common awk options
Run awk --help to list all available options (the output below is from gawk):
Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
Usage: awk [POSIX or GNU style options] [--] 'program' file ...
POSIX options: GNU long options:
-f progfile --file=progfile
-F fs --field-separator=fs
-v var=val --assign=var=val
-m[fr] val
-W compat --compat
-W copyleft --copyleft
-W copyright --copyright
-W dump-variables[=file] --dump-variables[=file]
-W gen-po --gen-po
-W help --help
-W lint[=fatal] --lint[=fatal]
-W lint-old --lint-old
-W non-decimal-data --non-decimal-data
-W profile[=file] --profile[=file]
-W posix --posix
-W re-interval --re-interval
-W source=program-text --source=program-text
-W traditional --traditional
-W usage --usage
-W version --version
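① -f: run a program file
② -F: set the field separator. By default fields are split on runs of whitespace (spaces and tabs). For example, -F':' uses : as the separator; the single quotes may be omitted in this case. Several separators can be used at once by writing them as a bracket expression, which should be quoted so the shell does not interpret it: awk -F'[:\t]' splits on both colons and tabs.
③ -v: assign a variable. Besides assigning in the BEGIN block, awk provides the -v option to assign a variable outside the program text.
awk -v name=Chuck 'BEGIN{printf "my name is %s\n", name}'
Output: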
my name is Chuck
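The -F and -v options can be combined. The sketch below is illustrative only: the input line, the variable name `label`, and the shell variable `fields` are assumptions, not taken from the original post.

```shell
# Split on either ':' or ',' and pass a label into the program with -v.
fields=$(printf 'root:x:0,extra\n' |
    awk -F'[:,]' -v label=user '{ print label "=" $1, "uid=" $3, "nf=" NF }')
printf '%s\n' "$fields"
```

The bracket expression yields four fields (`root`, `x`, `0`, `extra`), so this prints `user=root uid=0 nf=4`.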
④ --dump-variables[=file]: dump awk's global variables
# dump awk's global variables
awk --dump-variables ''
# the result is written to awkvars.out by default
cat awkvars.out
Output:
ARGC: number (1)
ARGIND: number (0)
ARGV: array, 1 elements
BINMODE: number (0)
CONVFMT: string ("%.6g")
ERRNO: number (0)
FIELDWIDTHS: string ("")
FILENAME: string ("")
FNR: number (0)
FS: string (" ")
IGNORECASE: number (0)
LINT: number (0)
NF: number (0)
NR: number (0)
OFMT: string ("%.6g")
OFS: string (" ")
ORS: string ("\n")
RLENGTH: number (0)
RS: string ("\n")
RSTART: number (0)
RT: string ("")
SUBSEP: string ("\034")
TEXTDOMAIN: string ("messages")
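⑤ --profile: pretty-print the awk program, writing the program given on the command line to a file in formatted form
# run the program from the command line; --profile writes to awkprof.out by default
awk --profile 'BEGIN{print "This is BEGIN"} {print $1,$2,$3} END{print "AWK command end"}' test.txt
# view the formatted program
cat awkprof.out
Output: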
# gawk profile, created Sat Apr 7 09:39:57 2018
# BEGIN block(s)
BEGIN {
print "This is BEGIN"
}
# Rule(s)
{
print $1, $2, $3
}
# END block(s)
END {
print "AWK command end"
}
Nice and tidy. To write to a specific file, use --profile=filename; the same applies to --dump-variables in ④.
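Built-in variables
① ARGC
The number of command-line arguments (the awk command itself counts as one).
awk 'BEGIN{print "num of argument is", ARGC}' one two three four
Output:
num of argument is 5
② ARGV
This variable is the array that stores the command-line arguments. Its indices run from 0 to ARGC - 1.
awk 'BEGIN{for(i = 0; i < ARGC; i++){printf "ARGV[%d] = %s\n", i, ARGV[i]}}' one two three four
Output:
ARGV[0] = awk
ARGV[1] = one
ARGV[2] = two
ARGV[3] = three
ARGV[4] = four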
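③ ENVIRON
This variable is an array of the system environment variables, comparable to what the Linux env command prints.
awk 'BEGIN{printf "environment params user is %s\n", ENVIRON["USER"]}'
Output:
environment params user is Chuck
④ FILENAME
The name of the file currently being read.
awk 'END{printf "execute file name is %s\n", FILENAME}' test.txt
Output:
execute file name is test.txt
Note that FILENAME cannot usefully be printed in the BEGIN block: no file has been opened at that point, so it is empty. Try it.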
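⑤ FS
The field separator. The default is a single space, which awk treats as "split on runs of whitespace"; it can be changed with the -F option.
awk 'BEGIN{print "FS =", FS}'
Output (the default value is a space, so nothing visible follows the equals sign):
FS =
# specify the separator
awk -F: 'BEGIN{print "FS =", FS}'
Output:
FS = :
⑥ NF
This variable holds the number of fields in the current input record. A field is a column produced by splitting the line on the separator, so NF is the number of columns.
echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk 'BEGIN{i = 0}{printf "line %d NF = %d\n", i++, NF}'
Output: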
line 0 NF = 2
line 1 NF = 3
line 2 NF = 4
# NF can also be used as a condition
echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk 'NF > 2'
Output:
One Two Three
One Two Three Four
NF > 2 is the condition (pattern). When no action follows it, awk performs the default action, print, which outputs the current line.
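Because NF is a number, it can also be used inside a field reference: $NF is the last field and $(NF - 1) the one before it. A small sketch (the shell variable `last` is only there to capture the output):

```shell
# Print the field count, the last field, and the second-to-last field.
last=$(printf 'One Two\nOne Two Three\n' | awk '{ print NF, $NF, $(NF - 1) }')
printf '%s\n' "$last"
```

This prints `2 Two One` for the first line and `3 Three Two` for the second.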
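⑦ NR
This variable holds the number of records read so far, that is, the current line number.
In ⑥ a user-defined variable i tracked the line number; the built-in NR can be used instead.
echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk '{printf "now line number is %d\n", NR}'
Output: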
now line number is 1
now line number is 2
now line number is 3
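As shown, NR starts counting at 1.
⑧ FNR
When several files are read, NR keeps counting from the first line of the first file through the last line of the last file.
FNR, by contrast, restarts from 1 each time a new file begins.
awk '{printf "now file is %s and NR = %d content is %s\n", FILENAME, NR, $0}' test.txt data.txt
Output: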
now file is test.txt and NR = 1 content is 1) Amit Physics 80
now file is test.txt and NR = 2 content is 2) Rahul Maths 90
now file is test.txt and NR = 3 content is 3) Shyam Biology 87
now file is test.txt and NR = 4 content is 4) Kedar English 85
now file is test.txt and NR = 5 content is 5) Hari History 89
now file is data.txt and NR = 6 content is root:x:0:0:root:/root:/bin/bash
now file is data.txt and NR = 7 content is bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa
now file is data.txt and NR = 8 content is DADddd:x:2:2:daemon:/sbin:/bin/false
now file is data.txt and NR = 9 content is mail:x:8:12:mail:/var/spool/mail:/bin/false
now file is data.txt and NR = 10 content is ftp:x:14:11:ftp:/home/ftp:/bin/false
now file is data.txt and NR = 11 content is &nobody:$:99:99:nobody:/:/bin/false
now file is data.txt and NR = 12 content is zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
now file is data.txt and NR = 13 content is http:x:33:33::/srv/http:/bin/false
now file is data.txt and NR = 14 content is dbus:x:81:81:System message bus:/:/bin/false
now file is data.txt and NR = 15 content is hal:x:82:82:HAL daemon:/:/bin/false
now file is data.txt and NR = 16 content is mysql:x:89:89::/var/lib/mysql:/bin/false
now file is data.txt and NR = 17 content is aaa:x:1001:1001::/home/aaa:/bin/bash
now file is data.txt and NR = 18 content is ba:x:1002:1002::/home/zhangy:/bin/bash
now file is data.txt and NR = 19 content is test:x:1003:1003::/home/test:/bin/bash
now file is data.txt and NR = 20 content is @zhangying:*:1004:1004::/home/test:/bin/bash
now file is data.txt and NR = 21 content is policykit:x:102:1005:Po
# using FNR
awk '{printf "now file is %s and NR = %d content is %s\n", FILENAME, FNR, $0}' test.txt data.txt
Output:
now file is test.txt and NR = 1 content is 1) Amit Physics 80
now file is test.txt and NR = 2 content is 2) Rahul Maths 90
now file is test.txt and NR = 3 content is 3) Shyam Biology 87
now file is test.txt and NR = 4 content is 4) Kedar English 85
now file is test.txt and NR = 5 content is 5) Hari History 89
now file is data.txt and NR = 1 content is root:x:0:0:root:/root:/bin/bash
now file is data.txt and NR = 2 content is bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa
now file is data.txt and NR = 3 content is DADddd:x:2:2:daemon:/sbin:/bin/false
now file is data.txt and NR = 4 content is mail:x:8:12:mail:/var/spool/mail:/bin/false
now file is data.txt and NR = 5 content is ftp:x:14:11:ftp:/home/ftp:/bin/false
now file is data.txt and NR = 6 content is &nobody:$:99:99:nobody:/:/bin/false
now file is data.txt and NR = 7 content is zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
now file is data.txt and NR = 8 content is http:x:33:33::/srv/http:/bin/false
now file is data.txt and NR = 9 content is dbus:x:81:81:System message bus:/:/bin/false
now file is data.txt and NR = 10 content is hal:x:82:82:HAL daemon:/:/bin/false
now file is data.txt and NR = 11 content is mysql:x:89:89::/var/lib/mysql:/bin/false
now file is data.txt and NR = 12 content is aaa:x:1001:1001::/home/aaa:/bin/bash
now file is data.txt and NR = 13 content is ba:x:1002:1002::/home/zhangy:/bin/bash
now file is data.txt and NR = 14 content is test:x:1003:1003::/home/test:/bin/bash
now file is data.txt and NR = 15 content is @zhangying:*:1004:1004::/home/test:/bin/bash
now file is data.txt and NR = 16 content is policykit:x:102:1005:Po
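As shown, FNR restarts its count whenever the input file changes (the label in the format string still reads NR, but the value printed is FNR).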
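The difference is easier to see when both counters are printed side by side. A minimal sketch with two small throwaway files (the names `first.txt` and `second.txt` and the variable `counters` are illustrative):

```shell
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/first.txt"
printf 'c\nd\ne\n' > "$dir/second.txt"

# NR counts across all files; FNR restarts at 1 for each file.
counters=$(cd "$dir" && awk '{ print FILENAME, NR, FNR }' first.txt second.txt)
printf '%s\n' "$counters"
rm -rf "$dir"
```

The second file's lines are reported with NR 3, 4, 5 but FNR 1, 2, 3.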
⑨ RLENGTH
The length of the string matched by the most recent call to match.
awk 'BEGIN{if(match("three", "re")){printf "regex length is %d\n", RLENGTH}}'
Output:
regex length is 2
⑩ RSTART
The starting position of the string matched by the most recent call to match (positions start at 1).
awk 'BEGIN{if(match("three", "re")){printf "start pos of str is %d\n", RSTART}}'
Output:
start pos of str is 3
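RSTART and RLENGTH are most useful together with substr, which can cut the matched text out of the original string. The following sketch assumes a made-up input string and variable names (`s`, `matched`):

```shell
# Locate the first run of digits, then extract it with substr.
matched=$(awk 'BEGIN {
    s = "order-2018-04"
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
}')
printf '%s\n' "$matched"
```

The first digit run starts at position 7 and is 4 characters long, so this prints `7 4 2018`.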
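⑪ $n
A field reference: $0 is the whole line, and for n > 0, $n is the n-th column after splitting on the separator.
⑫ IGNORECASE
Controls whether matching is case-sensitive. This is a gawk extension; set it to a non-zero value to ignore case.
awk 'BEGIN{IGNORECASE=1} /amit/' test.txt
Output: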
1) Amit Physics 80
# without setting IGNORECASE
awk '/amit/' test.txt
No record matches.
Common awk built-in functions
1. String functions
① sub, gsub
sub matches the regular expression against the leftmost, longest substring of the target and replaces it with the replacement string. If no target string is given, the whole record is used. Only the first match is replaced.
sub (regular expression, substitution string)
sub (regular expression, substitution string, target string)
Example
echo "hello i am Chuck i am" | awk '{sub(/am/, "ok"); print $0}'
echo "hello i am Chuck i am" | awk '{sub(/am/, "ok", $6); print $0}'
Output:
hello i ok Chuck i am
hello i am Chuck i ok
sub replaces only the first match. To replace every match, use gsub, which takes the same arguments.
# replace every am in the record with ok
echo "hello i am Chuck i am" | awk '{gsub(/am/, "ok"); print $0}'
# replace every am within the 6th field with ok
echo "hello i am Chuck i am" | awk '{gsub(/am/, "ok", $6); print $0}'
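Both sub and gsub return the number of replacements they made, which is a convenient way to check whether anything changed. A short sketch (the variable names `n` and `replaced` are illustrative):

```shell
# gsub replaces every match in the record and returns how many it replaced.
replaced=$(echo "hello i am Chuck i am" | awk '{ n = gsub(/am/, "ok"); print n ": " $0 }')
printf '%s\n' "$replaced"
```

Here both occurrences of `am` are replaced, so the output is `2: hello i ok Chuck i ok`.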
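② index
Returns the position of the first occurrence of a substring within a string. Positions start at 1; the result is 0 if the substring is not found.
index(string, substring)
③ length
Returns the length of a string.
length #number of characters in the whole record
length(string) #number of characters in string
④ substr
Extracts a substring.
substr(string, startpos); #all characters from startpos onward
substr(string, startpos, length); #the substring of the given length starting at startpos
⑤ match
Matches a regular expression against a string. It returns the starting position of the match, or 0 if there is none.
match(string, regular expression);
⑥ split
Splits a string into an array and returns the number of elements.
split(string, array); #splits on FS by default
split(string, array, separator); #splits on the given separator
Example
awk 'BEGIN{ split( "20:18:00", time, ":" ); print time[2] }'
Output: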
18
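The functions index, length, and substr have no example above, so here is a brief sketch that exercises all three on one string. The sample string and the variable names `s` and `strfuncs` are assumptions made for illustration.

```shell
# index: position of a substring; length: character count; substr: extraction.
strfuncs=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "world"), length(s), substr(s, 7), substr(s, 1, 5)
}')
printf '%s\n' "$strfuncs"
```

This prints `7 11 world hello`: the substring starts at position 7, the string has 11 characters, and the two substr calls return the second and first word.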
2. Time functions
① systime
Returns the current timestamp (seconds since the epoch).
② strftime
Formats a timestamp as a string.
![](https://img.haomeiwen.com/i5914725/a92873dcc10abcb7.png)
strftime(format, [timestamp]);
Example:
awk 'BEGIN{now = strftime("%D"); print now}'
awk 'BEGIN{now = strftime("%D", systime()); print now}'
3. Math functions
![](https://img.haomeiwen.com/i5914725/88f45a1e6ef94ca3.png)
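The table above is only available as an image, so the following sketch assumes it covers the standard awk numeric functions such as int, sqrt, and atan2, plus the exponent operator. The variable name `mathout` is illustrative.

```shell
# int truncates toward zero, sqrt is the square root, ^ is exponentiation,
# and atan2(0, -1) yields pi (scaled and truncated here to avoid decimals).
mathout=$(awk 'BEGIN { print int(3.9), sqrt(16), 2 ^ 10, int(atan2(0, -1) * 1000) }')
printf '%s\n' "$mathout"
```

This prints `3 4 1024 3141`.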
awk operators
1. Arithmetic operators
① Addition
awk 'BEGIN{a = 10; b = 30; print "(a + b) =", a + b}'
Output:
(a + b) = 40
② Subtraction
awk 'BEGIN{a = 10; b = 30; print "(a - b) =", a - b}'
Output:
(a - b) = -20
③ Multiplication
awk 'BEGIN{a = 10; b = 30; print "a * b =", a * b}'
Output:
a * b = 300
④ Division
awk 'BEGIN{a = 10; b = 20; print "a / b =", a / b}'
Output:
a / b = 0.5
⑤ Modulo
awk 'BEGIN{a = 10; b = 20; print "a % b =", a % b}'
Output:
a % b = 10
2. Increment and decrement operators
As in most programming languages, there are prefix and postfix forms of both increment and decrement.
# postfix increment
awk 'BEGIN{a = 10; printf "the res of a++ is %d then print a is %d\n", a++, a}'
Output:
the res of a++ is 10 then print a is 11
# prefix increment
awk 'BEGIN{a = 10; printf "the res of ++a is %d then print a is %d\n", ++a, a}'
Output:
the res of ++a is 11 then print a is 11
# postfix decrement
awk 'BEGIN{a = 10; printf "the res of a-- is %d then print a is %d\n", a--, a}'
Output:
the res of a-- is 10 then print a is 9
# prefix decrement
awk 'BEGIN{a = 10; printf "the res of --a is %d then print a is %d\n", --a, a}'
Output:
the res of --a is 9 then print a is 9
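3. Assignment operators
This section covers simple assignment and the compound forms: addition, subtraction, multiplication, division, modulo, and exponent assignment.
# simple assignment
awk 'BEGIN{a = 10; printf "a is %d\n", a}'
Output:
a is 10
# addition assignment
awk 'BEGIN{a = 10; printf "a += 10 is %d\n", a += 10}'
Output:
a += 10 is 20
# subtraction assignment
awk 'BEGIN{a = 10; printf "a -= 5 is %d\n", a -= 5}'
Output:
a -= 5 is 5
# multiplication assignment
awk 'BEGIN{a = 10; printf "a *= 5 is %d\n", a *= 5}'
Output:
a *= 5 is 50
# division assignment
awk 'BEGIN{a = 10; printf "a /= 5 is %d\n", a /= 5}'
Output:
a /= 5 is 2
# modulo assignment (a literal % in a printf format must be written as %%)
awk 'BEGIN{a = 10; printf "a %%= 3 is %d\n", a %= 3}'
Output:
a %= 3 is 1
# exponent assignment
awk 'BEGIN{a = 10; printf "a ^= 3 is %d\n", a ^= 3}'
Output:
a ^= 3 is 1000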
4. Relational operators
① Equal to
awk 'BEGIN { a = 10; b = 10; if (a == b) print "a == b" }'
Output:
a == b
② Not equal to
awk 'BEGIN{ a = 10; b = 5; if(a != b){print "a != b"} }'
Output:
a != b
③ Less than
awk 'BEGIN{ a = 5; b = 10; if(a < b){print "a < b"} }'
Output:
a < b
④ Less than or equal to
awk 'BEGIN{ a = 5; b = 10; if(a <= b){print "a <= b"} }'
Output:
a <= b
⑤ Greater than
awk 'BEGIN{a = 5; b = 10; if(a > b){print "a > b"}}'
Output:
(no output)
⑥ Greater than or equal to
awk 'BEGIN{a = 5; b = 10; if(a >= b){print "a >= b"}}'
Output:
(no output)
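Control flow
The examples below use the 4th field of test.txt, which holds the numeric mark (the 3rd field is the subject name, so a numeric comparison against it is not meaningful).
1. if-else
awk '{if($4 > 85){print "Y"}else{print "N"}}' test.txt
2. while
awk 'BEGIN{i = 0}{while(i < $4){i++}} END{print i}' test.txt
3. for
awk '{for(i = 1; i <= NF; i++){print $i}}' test.txt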
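The three constructs can be combined in a single action. The sketch below is an illustration under stated assumptions: it uses two records in the format of the sample file, and the threshold 85 and the names `grade`, `chars`, `halvings`, and `flow` are invented for the example.

```shell
flow=$(printf '%s\n' '1) Amit Physics 80' '2) Rahul Maths 90' | awk '
{
    # if-else: classify the mark in field 4
    if ($4 >= 85) { grade = "pass" } else { grade = "retry" }

    # for: add up the lengths of all fields in the record
    chars = 0
    for (i = 1; i <= NF; i++) chars += length($i)

    # while: count how often the mark can be halved while it is still >= 10
    halvings = 0; m = $4
    while (m >= 10) { m /= 2; halvings++ }

    print $2, grade, chars, halvings
}')
printf '%s\n' "$flow"
```

This prints `Amit retry 15 4` and `Rahul pass 14 4`.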
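awk arrays
awk arrays are similar to PHP arrays: they are key-value maps. Subscripts may be numbers or strings, and are always stored as strings internally. Multidimensional arrays are supported by joining several subscripts into one key.
Declaring an array
An array is declared by assigning to an element: arrayname[key] = value
arr[0] = "Chuck"
arr["Craig"] = 1;
arr[$0] = $1;
arr["a", "b"] = 100; #two-dimensional form; the key stored is "a" SUBSEP "b"
Printing elements
Method 1: print
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr[0]}'
Output:
1
Printing a subscript that does not exist in the array:
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr[100]}'
Output:
(an empty string is printed)
Method 2: for
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; for(i in arr){print i}}'
Output:
0
a
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; for(i in arr){print arr[i]}}'
Output:
1
2
For a key that does not exist, awk yields an empty string. In a for loop, i is the key of the array, and the value is only available through arr[i]. The order in which for visits the keys is not specified. print arr on its own is an error.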
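A common use of associative arrays is counting occurrences, and the `in` operator tests whether a key exists without creating it. The following is a minimal sketch; the input words and the names `seen`, `missing`, and `counts` are assumptions for illustration.

```shell
counts=$(printf 'a b a\nc a b\n' | awk '
{ for (i = 1; i <= NF; i++) seen[$i]++ }   # one counter per distinct word
END {
    missing = ("z" in seen)                # 0: the key was never assigned
    print seen["a"], seen["b"], seen["c"], missing
}')
printf '%s\n' "$counts"
```

This prints `3 2 1 0`. Specific keys are printed here instead of looping with for, because the iteration order is unspecified.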
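Deleting array elements
awk can delete a single element or a whole array, both with the delete statement.
# delete a single element
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr["a"]; delete arr["a"]; print arr["a"];}'
Output:
2
# delete the whole array
awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr["a"]; delete arr; print arr[0]; print arr["a"];}'
Output:
2
Printing an element after it has been deleted shows an empty string. Deleting a key that does not exist is not an error.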
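Multidimensional arrays
A multidimensional array is declared as arrayname[index1, index2, ...]
awk 'BEGIN{arr["a", "b"] = 1; print arr["a", "b"];}'
Output:
1
Printing a multidimensional array
Besides print as above, a for loop also works. Unlike typical languages, where an n-dimensional array needs n nested loops, an awk multidimensional array can be traversed with a single for loop.
awk 'BEGIN{arr["a", "b"] = 1; arr["c", "d"] = 2; for(i in arr){print i, arr[i];}}'
Output:
ab 1
cd 2
awk joins the subscripts of each dimension with the value of SUBSEP, which defaults to the non-printing character "\034" (which is why the keys above look like ab and cd). Assign SUBSEP to change the separator. It must be set before the array elements are assigned, otherwise it has no effect on the existing keys.
awk 'BEGIN{SUBSEP = ":"; arr["a", "b"] = 1; arr["c", "d"] = 2; for(i in arr){print i, arr[i];}}'
Output:
a:b 1
c:d 2
When choosing SUBSEP, avoid characters that occur in the subscripts themselves, otherwise keys can collide. For example:
awk 'BEGIN{SUBSEP = ":"; arr["a", "b:c"] = 1; arr["a:b", "c"] = 2; for(i in arr){print i, arr[i];}}'
Output:
a:b:c 2
Joined with ':', both keys become 'a:b:c', so the second assignment overwrites the first. A separator that does not occur in the subscripts avoids the collision:
awk 'BEGIN{SUBSEP = "~"; arr["a", "b:c"] = 1; arr["a:b", "c"] = 2; for(i in arr){print i, arr[i];}}'
Output:
a:b~c 2
a~b:c 1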
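Since a multidimensional key is just a string joined with SUBSEP, the individual subscripts can be recovered with split, and membership can be tested with the parenthesized form of `in`. A hedged sketch (the names `parts`, `has_ab`, `has_ba`, and `dims` are illustrative):

```shell
dims=$(awk 'BEGIN {
    arr["a", "b"] = 1
    for (k in arr) {
        n = split(k, parts, SUBSEP)      # break the joined key apart again
        print n, parts[1], parts[2], arr[k]
    }
    has_ab = (("a", "b") in arr)         # 1: this combination was assigned
    has_ba = (("b", "a") in arr)         # 0: subscript order matters
    print has_ab, has_ba
}')
printf '%s\n' "$dims"
```

This prints `2 a b 1` followed by `1 0`.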
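awk user-defined functions
Like other programming languages, awk lets you define your own functions, in this form:
function funcName(parameter1, parameter2, parameter3, ...){
statements;
[return value;]
}
Example:
awk 'function add(a, b){a += 5; res = a + b; return res;}BEGIN{print add(10, 20)}'
Output:
35
A function definition must appear at the top level of the program, alongside the BEGIN, body, and END blocks and not inside one of them. It may come before or after the code that calls it. Note that res in this example is a global variable, because only names in the parameter list are local to the function.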
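Two properties of awk functions are worth seeing in action: scalar arguments are passed by value, and extra names in the parameter list act as local variables (by convention separated from the real parameters with extra spaces). The sketch below reuses the add function from above with `res` made local; the variable names `x` and `fn` are illustrative, and the function is deliberately defined after the BEGIN block that calls it.

```shell
fn=$(awk '
BEGIN {
    x = 10
    # add() modifies its own copy of a, so x is unchanged afterwards;
    # res is local to add(), so the global res is still uninitialized.
    print add(x, 20), x, (res == "")
}
function add(a, b,    res) { a += 5; res = a + b; return res }
')
printf '%s\n' "$fn"
```

This prints `35 10 1`.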