Text Processing Commands (Part 2): An awk Study Summary

Author: Chuck_Hu | Published: 2018-05-02 15:55

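    The previous article covered the text processing tool grep. This one summarises awk, a more powerful tool.
    AWK is an interpreted programming language designed specifically for processing text data. The name AWK comes from the initials of its designers: Alfred Aho, Peter Weinberger, and Brian Kernighan. Besides processing text, awk can produce formatted text reports, do arithmetic, and manipulate strings.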

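    Workflow

    awk works in three steps: read, execute, and repeat.

    [Figure: awk workflow]

    (1) Read
    AWK reads one line from the input stream (a file, a pipe, or standard input) and stores it in memory.
    (2) Execute
    For each input line, all AWK commands are executed in order. By default the commands apply to every input line, but a pattern can restrict them to matching lines.
    (3) Repeat
    The two steps above are repeated until the end of the input is reached.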

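    awk program structure

    An awk program has three parts: a BEGIN block, a body block, and an END block.

    BEGIN block

    Syntax:

    BEGIN {awk-commands}
    

    The BEGIN block runs exactly once, at program start-up, before any input is read. It is typically used to initialise variables. The keyword BEGIN must be uppercase.
    The BEGIN block is optional.

    Body block

    Syntax:

    /pattern/ {awk-commands}
    

    The body block is the part that processes each line of the input. If /pattern/ is omitted, the commands run for every line.

    END block

    Syntax:

    END {awk-commands}
    

    Like the BEGIN block, the END block runs exactly once, after all input has been processed. It is also optional, and the keyword END must be uppercase.
    Example
    Contents of test.txt (the sample file used throughout this article):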

    1)    Amit     Physics    80
    2)    Rahul    Maths      90
    3)    Shyam    Biology    87
    4)    Kedar    English    85
    5)    Hari     History    89
    

    Example awk command:

    awk 'BEGIN{printf "NO\tName\tSubject\tMark\n"} {print $0}' test.txt
    

    Output:

    NO    Name    Subject    Mark
    1)    Amit     Physics    80
    2)    Rahul    Maths      90
    3)    Shyam    Biology    87
    4)    Kedar    English    85
    5)    Hari     History    89
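
    To show all three blocks working together, here is a minimal sketch. It assumes the same column layout as test.txt but feeds the rows in from a shell variable so that it runs on its own; the names data, report, total and n are made up for illustration.

```shell
# Sample records in the same layout as test.txt (inline so the sketch is self-contained)
data='1)    Amit     Physics    80
2)    Rahul    Maths      90
3)    Shyam    Biology    87
4)    Kedar    English    85
5)    Hari     History    89'

# BEGIN runs once before any input, the body runs per line, END runs once after the last line
report=$(printf '%s\n' "$data" | awk '
BEGIN { total = 0 }                 # initialise the accumulator
$4 >= 85 { total += $4; n++ }       # pattern: only rows with a mark of 85 or more
END { printf "rows=%d total=%d\n", n, total }')

echo "$report"
```

    The body block here uses a comparison rather than a /regex/ as its pattern; any expression can select which lines the action applies to.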
    

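    Basic syntax

    The basic form is awk [awk commands] file: the awk command itself, then the awk program (the blocks described above), and finally the file to process.
    When the program is typed on the command line, it must be enclosed in single quotes (''), and each action must be enclosed in {}. For example, to print all of test.txt:

    awk '{print}' test.txt
    

    An awk program can be given on the command line or stored in a file. To repeat the operation above from a file, create exe.awk:

    {print}
    

    A program file is run with the -f option: awk -f awkcommandfile targetfile

    awk -f exe.awk test.txt
    

    This produces the same output as before.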
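
    The same steps can be scripted end to end. The sketch below is only an illustration: the directory from mktemp and the file names input.txt and number.awk are assumptions, not files from this article.

```shell
# Work in a throwaway directory; the file names here are illustrative
workdir=$(mktemp -d)
printf '%s\n' 'first line' 'second line' > "$workdir/input.txt"

# The program file holds exactly what would otherwise go between the single quotes
cat > "$workdir/number.awk" <<'EOF'
# prefix every line with its line number
{ print NR ": " $0 }
EOF

numbered=$(awk -f "$workdir/number.awk" "$workdir/input.txt")
echo "$numbered"
rm -r "$workdir"
```

    Keeping the program in a file avoids shell quoting issues and leaves room for comments once the program grows beyond one line.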

    常用awk选项

    Running awk --help lists all available options (the output below is from GNU awk):

    Usage: awk [POSIX or GNU style options] -f progfile [--] file ...
    Usage: awk [POSIX or GNU style options] [--] 'program' file ...
    POSIX options:      GNU long options:
        -f progfile     --file=progfile
        -F fs           --field-separator=fs
        -v var=val      --assign=var=val
        -m[fr] val
        -W compat       --compat
        -W copyleft     --copyleft
        -W copyright        --copyright
        -W dump-variables[=file]    --dump-variables[=file]
        -W gen-po       --gen-po
        -W help         --help
        -W lint[=fatal]     --lint[=fatal]
        -W lint-old     --lint-old
        -W non-decimal-data --non-decimal-data
        -W profile[=file]   --profile[=file]
        -W posix        --posix
        -W re-interval      --re-interval
        -W source=program-text  --source=program-text
        -W traditional      --traditional
        -W usage        --usage
        -W version      --version
    

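    ① -f: run a program file.
    ② -F: set the field separator. The default is whitespace (runs of spaces and tabs). For example, -F':' makes : the separator; the quotes can be omitted for a single ordinary character. Several separators can be given at once by putting them in square brackets: awk -F'[:\t]' splits on colons and tabs. Here the quotes are needed, otherwise the shell strips the backslash and may expand the brackets.
    ③ -v: assign a variable. Besides assigning in BEGIN, awk lets you assign a value outside the program text with -v.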

    awk -v name=Chuck 'BEGIN{printf "my name is %s\n", name}'
    Output:
    my name is Chuck
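
    -F and -v are often combined. As a rough sketch (the passwd-style sample rows and the variable names min, users and count are assumptions for illustration), the following prints the login names whose third field is at least a threshold passed in from the shell, then counts fields with a bracketed separator list:

```shell
passwd_like='root:x:0:0:root:/root:/bin/bash
zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
aaa:x:1001:1001::/home/aaa:/bin/bash'

# -F: splits on colons; -v min=1000 sets the awk variable min before the program starts
users=$(printf '%s\n' "$passwd_like" | awk -F: -v min=1000 '$3 >= min { print $1 }')
echo "$users"

# A bracket expression lets several characters act as separators; quote it for the shell
count=$(printf 'a:b,c\n' | awk -F'[:,]' '{ print NF }')
echo "$count"
```

    Passing the threshold with -v keeps the awk program text free of shell variable expansion.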
    

    ④ --dump-variables[=file]: dump awk's global variables (GNU awk)

    # dump awk's global variables
    awk --dump-variables ''
    # by default they are written to awkvars.out
    cat awkvars.out
    Output:
    ARGC: number (1)
    ARGIND: number (0)
    ARGV: array, 1 elements
    BINMODE: number (0)
    CONVFMT: string ("%.6g")
    ERRNO: number (0)
    FIELDWIDTHS: string ("")
    FILENAME: string ("")
    FNR: number (0)
    FS: string (" ")
    IGNORECASE: number (0)
    LINT: number (0)
    NF: number (0)
    NR: number (0)
    OFMT: string ("%.6g")
    OFS: string (" ")
    ORS: string ("\n")
    RLENGTH: number (0)
    RS: string ("\n")
    RSTART: number (0)
    RT: string ("")
    SUBSEP: string ("\034")
    TEXTDOMAIN: string ("messages")
    

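    ⑤ --profile: pretty-print the program; the awk program typed on the command line is written, formatted, to a file (GNU awk)

    # run the program from the command line; --profile writes to awkprof.out by default
    awk --profile 'BEGIN{print "This is BEGIN"} {print $1,$2,$3} END{print "AWK command end"}' test.txt
    # view the formatted program
    cat awkprof.out
    Output: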
    # gawk profile, created Sat Apr  7 09:39:57 2018
    
        # BEGIN block(s)
    
        BEGIN {
            print "This is BEGIN"
        }
    
        # Rule(s)
    
        {
            print $1, $2, $3
        }
    
        # END block(s)
    
        END {
            print "AWK command end"
        }
    

    The result is neatly formatted. To write to a specific file, use --profile=filename; the same form works for --dump-variables in ④.

    Built-in variables

    ① ARGC
    The number of command-line arguments, counting the program name awk itself.

    awk 'BEGIN{print "num of argument is", ARGC}' one two three four
    Output:
    num of argument is 5
    

    ② ARGV
    The array that holds the command-line arguments, indexed from 0 to ARGC - 1.

    awk 'BEGIN{for(i = 0; i < ARGC; i++){printf "ARGV[%d] = %s\n", i, ARGV[i]}}' one two three four
    Output:
    ARGV[0] = awk
    ARGV[1] = one
    ARGV[2] = two
    ARGV[3] = three
    ARGV[4] = four
    

    ③ ENVIRON
    An array holding the environment variables, the same values that the Linux env command lists.

    awk 'BEGIN{printf "environment params user is %s\n", ENVIRON["USER"]}'
    Output:
    environment params user is Chuck
    

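    ④ FILENAME
    The name of the file currently being read.

    awk 'END{printf "execute file name is %s\n", FILENAME}' test.txt
    Output:
    execute file name is test.txt
    

    Note that FILENAME is not yet set in the BEGIN block, because no file has been opened at that point; try it.
    ⑤ FS
    The field separator. The default is a single space, which makes awk split on runs of whitespace. It can be set with the -F option.

    awk 'BEGIN{print "FS =", FS}'
    Output:
    FS = 
    # specify the separator
    awk -F: 'BEGIN{print "FS =", FS}'
    Output:
    FS = :
    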

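    ⑥ NF
    The number of fields in the current input record. A field is one of the columns produced by splitting the line on the separator, so NF is the column count.

    echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk 'BEGIN{i = 0}{printf "line %d NF = %d\n", i++, NF}'
    Output:
    line 0 NF = 2
    line 1 NF = 3
    line 2 NF = 4
    # NF can also be used as a condition
    echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk 'NF > 2'
    Output:
    One Two Three
    One Two Three Four
    

    NF > 2 is a pattern; when no action follows it, awk performs the default action, print, which outputs the current line.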
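
    Because NF is a number, $NF refers to the last field of the line, however many fields there are. A small sketch on the same three input lines (the shell variable name last is just for illustration):

```shell
# Print the field count followed by the last field of every line
last=$(printf 'One Two\nOne Two Three\nOne Two Three Four\n' | awk '{ print NF ": " $NF }')
echo "$last"
```

    The same idea extends to $(NF-1) for the second-to-last field.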
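    ⑦ NR
    The number of records read so far, that is, the current line number.
    Example ⑥ used a user-defined variable i to count lines; NR can be used instead.

    echo -e "One Two\nOne Two Three\nOne Two Three Four" | awk '{printf "now line number is %d\n", NR}'
    Output:
    now line number is 1
    now line number is 2
    now line number is 3
    

    As the output shows, NR starts counting from 1.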
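    ⑧ FNR
    When several files are read, NR keeps counting from the first line of the first file through the last line of the last file.
    FNR, by contrast, restarts from 1 each time a new file begins.

    awk '{printf "now file is %s and NR = %d content is %s\n", FILENAME, NR, $0}' test.txt data.txt
    Output: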
    now file is test.txt and NR = 1 content is 1)    Amit     Physics    80
    now file is test.txt and NR = 2 content is 2)    Rahul    Maths      90
    now file is test.txt and NR = 3 content is 3)    Shyam    Biology    87
    now file is test.txt and NR = 4 content is 4)    Kedar    English    85
    now file is test.txt and NR = 5 content is 5)    Hari     History    89
    now file is data.txt and NR = 6 content is root:x:0:0:root:/root:/bin/bash
    now file is data.txt and NR = 7 content is bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa
    now file is data.txt and NR = 8 content is DADddd:x:2:2:daemon:/sbin:/bin/false
    now file is data.txt and NR = 9 content is mail:x:8:12:mail:/var/spool/mail:/bin/false
    now file is data.txt and NR = 10 content is ftp:x:14:11:ftp:/home/ftp:/bin/false
    now file is data.txt and NR = 11 content is &nobody:$:99:99:nobody:/:/bin/false
    now file is data.txt and NR = 12 content is zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
    now file is data.txt and NR = 13 content is http:x:33:33::/srv/http:/bin/false
    now file is data.txt and NR = 14 content is dbus:x:81:81:System message bus:/:/bin/false
    now file is data.txt and NR = 15 content is hal:x:82:82:HAL daemon:/:/bin/false
    now file is data.txt and NR = 16 content is mysql:x:89:89::/var/lib/mysql:/bin/false
    now file is data.txt and NR = 17 content is aaa:x:1001:1001::/home/aaa:/bin/bash
    now file is data.txt and NR = 18 content is ba:x:1002:1002::/home/zhangy:/bin/bash
    now file is data.txt and NR = 19 content is test:x:1003:1003::/home/test:/bin/bash
    now file is data.txt and NR = 20 content is @zhangying:*:1004:1004::/home/test:/bin/bash
    now file is data.txt and NR = 21 content is policykit:x:102:1005:Po
    
    # using FNR
    awk '{printf "now file is %s and NR = %d content is %s\n", FILENAME, FNR, $0}' test.txt data.txt
    Output:
    now file is test.txt and NR = 1 content is 1)    Amit     Physics    80
    now file is test.txt and NR = 2 content is 2)    Rahul    Maths      90
    now file is test.txt and NR = 3 content is 3)    Shyam    Biology    87
    now file is test.txt and NR = 4 content is 4)    Kedar    English    85
    now file is test.txt and NR = 5 content is 5)    Hari     History    89
    now file is data.txt and NR = 1 content is root:x:0:0:root:/root:/bin/bash
    now file is data.txt and NR = 2 content is bin:x:1:1:bin:/bin:/bin/false,aaa,bbbb,cccc,aaaaaa
    now file is data.txt and NR = 3 content is DADddd:x:2:2:daemon:/sbin:/bin/false
    now file is data.txt and NR = 4 content is mail:x:8:12:mail:/var/spool/mail:/bin/false
    now file is data.txt and NR = 5 content is ftp:x:14:11:ftp:/home/ftp:/bin/false
    now file is data.txt and NR = 6 content is &nobody:$:99:99:nobody:/:/bin/false
    now file is data.txt and NR = 7 content is zhangy:x:1000:100:,,,:/home/zhangy:/bin/bash
    now file is data.txt and NR = 8 content is http:x:33:33::/srv/http:/bin/false
    now file is data.txt and NR = 9 content is dbus:x:81:81:System message bus:/:/bin/false
    now file is data.txt and NR = 10 content is hal:x:82:82:HAL daemon:/:/bin/false
    now file is data.txt and NR = 11 content is mysql:x:89:89::/var/lib/mysql:/bin/false
    now file is data.txt and NR = 12 content is aaa:x:1001:1001::/home/aaa:/bin/bash
    now file is data.txt and NR = 13 content is ba:x:1002:1002::/home/zhangy:/bin/bash
    now file is data.txt and NR = 14 content is test:x:1003:1003::/home/test:/bin/bash
    now file is data.txt and NR = 15 content is @zhangying:*:1004:1004::/home/test:/bin/bash
    now file is data.txt and NR = 16 content is policykit:x:102:1005:Po
    

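    As the output shows, the FNR count restarts at 1 when the file changes (the label still reads NR only because the format string was left unchanged).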
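
    A common use of the difference between the two counters is the NR == FNR test, which is true only while the first file is being read. The sketch below is a hedged illustration: the temporary files wanted.txt and marks.txt and their contents are assumptions, not files from this article.

```shell
workdir=$(mktemp -d)
printf 'Amit\nHari\n' > "$workdir/wanted.txt"
printf '1) Amit Physics 80\n2) Rahul Maths 90\n5) Hari History 89\n' > "$workdir/marks.txt"

matched=$(awk '
NR == FNR { wanted[$1] = 1; next }   # first file: remember each name
$2 in wanted { print $2, $4 }        # second file: keep rows whose name was seen
' "$workdir/wanted.txt" "$workdir/marks.txt")
echo "$matched"
rm -r "$workdir"
```

    Note that this idiom assumes the first file is not empty; with an empty first file, NR == FNR would also hold for the second file.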
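    ⑨ RLENGTH
    The length of the substring matched by the most recent match() call.

    awk 'BEGIN{if(match("three", "re")){printf "regex length is %d\n", RLENGTH}}'
    Output:
    regex length is 2
    

    ⑩ RSTART
    The starting position (counted from 1) of the substring matched by the most recent match() call.

    awk 'BEGIN{if(match("three", "re")){printf "start pos of str is %d\n", RSTART}}'
    Output: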
    start pos of str is 3
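
    RSTART and RLENGTH are most useful together with substr, to pull out the text that actually matched. A minimal sketch, using a made-up input line and an illustrative variable name found:

```shell
# Extract the first run of digits, plus where it starts and how long it is
found=$(echo 'order id=4521 shipped' | awk '
match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH), RSTART, RLENGTH }')
echo "$found"
```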
    

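    ⑪ $n
    Refers to a field: $0 is the whole line, and $n with n > 0 is the n-th column after splitting on the separator.
    ⑫ IGNORECASE
    Controls whether matching is case-sensitive (GNU awk only).

    awk 'BEGIN{IGNORECASE=1}  /amit/' test.txt
    Output:
    1)    Amit     Physics    80
    
    # without setting IGNORECASE
    awk '/amit/' test.txt
    No record matches.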
    

    Common awk built-in functions

    1. String functions

    ① sub, gsub
    sub finds the leftmost, longest substring of the target that matches the regular expression and replaces it with the substitution string. If no target string is given, the whole record ($0) is used. Only the first match is replaced.

    sub(regular expression, substitution string)
    sub(regular expression, substitution string, target string)
    Example:
    echo "hello i am Chuck i am" | awk '{sub(/am/, "ok"); print $0}'
    echo "hello i am Chuck i am" | awk '{sub(/am/, "ok", $6); print $0}'
    Output:
    hello i ok Chuck i am
    hello i am Chuck i ok
    

    sub replaces only the first match. To replace every match in the target, use gsub:

    # replace every am in the record with ok
    echo "hello i am Chuck i am" | awk '{gsub(/am/, "ok"); print $0}'
    # replace every am within field 6 with ok
    echo "hello i am Chuck i am" | awk '{gsub(/am/, "ok", $6); print $0}'
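
    Two details of sub and gsub are easy to miss: both return the number of replacements made, and an & in the replacement string stands for the matched text. A short sketch (the shell variable names replaced and wrapped are illustrative):

```shell
# gsub returns how many replacements it made
replaced=$(echo 'hello i am Chuck i am' | awk '{ n = gsub(/am/, "ok"); print n ": " $0 }')
echo "$replaced"

# & in the replacement is the matched text itself
wrapped=$(echo 'hello Chuck' | awk '{ sub(/Chuck/, "[&]"); print }')
echo "$wrapped"
```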
    

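    ② index
    Returns the position of the first occurrence of a substring within a string; positions start at 1, and 0 means not found.

    index(string, substring)
    

    ③ length
    Returns the length of a string.

    length    # number of characters in the whole record
    length(string)    # number of characters in string
    

    ④ substr
    Extracts a substring.

    substr(string, startpos);  # all characters from startpos onwards
    substr(string, startpos, length);  # length characters starting at startpos
    

    ⑤ match
    Matches a regular expression against a string; returns the position of the match, or 0 if there is none.

    match(string, regular expression);
    

    ⑥ split
    Splits a string into an array.

    split(string, array);  # splits on FS by default
    split(string, array, separator);  # splits on the given separator
    Example:
    awk 'BEGIN{ split( "20:18:00", time, ":" ); print time[2] }'
    Output:
    18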
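
    split also returns the number of pieces, and stores them at indices 1 through that number, which makes it easy to loop over the result. A small sketch on the same time string (the variable name parts is illustrative):

```shell
parts=$(awk 'BEGIN {
    n = split("20:18:00", time, ":")     # n is the number of pieces
    for (i = 1; i <= n; i++)             # pieces are stored at indices 1..n
        printf "time[%d]=%s\n", i, time[i]
}')
echo "$parts"
```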
    
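    2. Time functions (GNU awk)

    ① systime
    Returns the current time as a Unix timestamp.
    ② strftime
    Formats a timestamp as text.


    [Table: time format specifiers]
    strftime(format, [timestamp]);
    Example:
    awk 'BEGIN{now = strftime("%D"); print now}'
    awk 'BEGIN{now = strftime("%D", systime()); print now}'
    
    3. Math functions
    [Table: math functions]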
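
    The standard numeric functions in awk include int, sqrt, exp, log, sin, cos, atan2, rand and srand. As a brief illustration (the choice of functions and the variable name math are arbitrary):

```shell
# int truncates, sqrt takes the square root, atan2(0, -1) is pi, ^ is exponentiation
math=$(awk 'BEGIN {
    printf "%d %d %.2f %d\n", int(3.9), sqrt(16), atan2(0, -1), 2 ^ 10
}')
echo "$math"
```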

    awk operators

    1. Arithmetic operators

    ① Addition

    awk 'BEGIN{a = 10; b = 30; print "(a + b) =", a + b}'
    Output:
    (a + b) = 40
    

    ② Subtraction

    awk 'BEGIN{a = 10; b = 30; print "(a - b) =", a - b}'
    Output:
    (a - b) = -20
    

    ③ Multiplication

    awk 'BEGIN{a = 10; b = 30; print "a * b =", a * b}'
    Output:
    a * b = 300
    

    ④ Division

    awk 'BEGIN{a = 10; b = 20; print "a / b = ", a / b}'
    Output:
    a / b =  0.5
    

    ⑤ Modulo

    awk 'BEGIN{a = 10; b = 20; print "a % b = ", a % b}'
    Output:
    a % b =  10
    
    2. Increment and decrement operators

    As in most programming languages, both prefix and postfix forms are available.

    # postfix increment
    awk 'BEGIN{a = 10; printf "the res of a++ is %d then print a is %d\n", a++, a}'
    Output:
    the res of a++ is 10 then print a is 11
    
    # prefix increment
    awk 'BEGIN{a = 10; printf "the res of ++a is %d then print a is %d\n", ++a, a}'
    Output:
    the res of ++a is 11 then print a is 11
    
    # postfix decrement
    awk 'BEGIN{a = 10; printf "the res of a-- is %d then print a is %d\n", a--, a}'
    Output:
    the res of a-- is 10 then print a is 9
    
    # prefix decrement
    awk 'BEGIN{a = 10; printf "the res of --a is %d then print a is %d\n", --a, a}'
    Output:
    the res of --a is 9 then print a is 9
    
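    3. Assignment operators

    This section covers simple assignment and the compound forms for addition, subtraction, multiplication, division, modulo, and exponentiation.

    # simple assignment
    awk 'BEGIN{a = 10; printf "a is %d\n", a}'
    Output:
    a is 10
    
    # addition assignment
    awk 'BEGIN{a = 10; printf "a += 10 is %d\n", a += 10}'
    Output:
    a += 10 is 20
    
    # subtraction assignment
    awk 'BEGIN{a = 10; printf "a -= 5 is %d\n", a -= 5}'
    Output:
    a -= 5 is 5
    
    # multiplication assignment
    awk 'BEGIN{a = 10; printf "a *= 5 is %d\n", a *= 5}'
    Output:
    a *= 5 is 50
    
    # division assignment
    awk 'BEGIN{a = 10; printf "a /= 5 is %d\n", a /= 5}'
    Output:
    a /= 5 is 2
    
    # modulo assignment (a literal % in a printf format is written %%)
    awk 'BEGIN{a = 10; printf "a %%= 3 is %d\n", a %= 3}'
    Output:
    a %= 3 is 1
    
    # exponentiation assignment
    awk 'BEGIN{a = 10; printf "a ^= 3 is %d\n", a ^= 3}'
    Output:
    a ^= 3 is 1000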
    
    4. Relational operators

    ① Equal to

    awk 'BEGIN { a = 10; b = 10; if (a == b) print "a == b" }'
    Output:
    a == b
    

    ② Not equal to

    awk 'BEGIN{ a = 10; b = 5; if(a != b){print "a != b"} }'
    Output:
    a != b
    

    ③ Less than

    awk 'BEGIN{ a = 5; b = 10; if(a < b){print "a < b"} }'
    Output:
    a < b
    

    ④ Less than or equal to

    awk 'BEGIN{ a = 5; b = 10; if(a <= b){print "a <= b"} }'
    Output:
    a <= b
    

    ⑤ Greater than

    awk 'BEGIN{a = 5; b = 10; if(a > b){print "a > b"}}'
    Output:
    (no output)
    

    ⑥ Greater than or equal to

    awk 'BEGIN{a = 5; b = 10; if(a >= b){print "a >= b"}}'
    Output:
    (no output)
    

    Control flow

    1. if-else
    awk '{if($4 > 85){print "Y"}else{print "N"}}' test.txt
    
    2. while
    awk '{i = 0; while(i < $4){i++} print i}' test.txt
    
    3. for
    awk '{for(i = 1; i <= NF; i++){print $i}}' test.txt
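
    As a sketch of how these statements combine in practice, the following sums the fields of each line with a for loop and then classifies the sum with if-else. The input numbers, the threshold of 20, and the names summary, sum and size are all assumptions chosen for illustration.

```shell
summary=$(printf '3 4 5\n10 20\n' | awk '{
    sum = 0
    for (i = 1; i <= NF; i++)    # visit every field of the current line
        sum += $i
    if (sum > 20)
        size = "big"
    else
        size = "small"
    print NR, sum, size
}')
echo "$summary"
```

    Looping from 1 to NF rather than to a fixed number keeps the loop correct for lines of any width.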
    

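    awk arrays

    awk arrays are associative, much like PHP arrays: they map keys to values. Keys can be numbers or strings (internally every key is stored as a string), and multi-dimensional arrays are supported.

    Declaring an array

    An array is declared simply by assigning to an element: arrayname[key] = value

    arr[0] = "Chuck"
    arr["Craig"] = 1;
    arr[$0] = $1;
    arr["a", "b"] = 100;  # two-dimensional: the key is "a" and "b" joined with SUBSEP
    
    Printing elements
    Method 1: print
    awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr[0]}'
    Output:
    1
    Printing a key that does not exist in the array:
    awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr[100]}'
    Output:
    
    An empty string is printed.
    
    Method 2: for
    awk 'BEGIN{arr[0] = 1; arr["a"] = 2; for(i in arr){print i}}'
    Output:
    0
    a
    awk 'BEGIN{arr[0] = 1; arr["a"] = 2; for(i in arr){print arr[i]}}'
    Output:
    1
    2
    

    awk prints an empty string for a key that does not exist. In a for loop, i is the key; the value is only available as arr[i]. The order in which for (i in arr) visits the keys is unspecified. Writing print arr on its own is an error.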
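
    A typical use of associative arrays is counting. The sketch below (input words and the names counts, seen and w are illustrative) counts word occurrences and uses the in operator to test for a key without creating it; the output is piped through sort because the traversal order is not defined.

```shell
counts=$(printf 'a b a\nc a b\n' | awk '
{ for (i = 1; i <= NF; i++) seen[$i]++ }      # count every word
END {
    for (w in seen) print w, seen[w]           # traversal order is unspecified
    if (!("z" in seen)) print "z", 0           # "in" tests a key without creating it
}' | sort)
echo "$counts"
```

    By contrast, merely referencing seen["z"] in an expression would create that element with an empty value.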

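    Deleting array elements

    awk can delete a single element or a whole array, using the delete statement.

    # delete a single element
    awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr["a"]; delete arr["a"]; print arr["a"];}'
    Output:
    2
    
    # delete the whole array
    awk 'BEGIN{arr[0] = 1; arr["a"] = 2; print arr["a"]; delete arr; print arr[0]; print arr["a"];}'
    Output:
    2
    
    
    

    After deletion, printing the element gives an empty string. Deleting a key that does not exist is not an error.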

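    Multi-dimensional arrays

    A multi-dimensional array is declared as arrayname[index1, index2, ...]

    awk 'BEGIN{arr["a", "b"] = 1; print arr["a", "b"];}'
    Output:
    1
    

    Printing a multi-dimensional array
    Besides print as above, a for loop also works. Unlike most programming languages, where an n-dimensional array needs n nested loops, an awk multi-dimensional array can be traversed with a single for loop.

    awk 'BEGIN{arr["a", "b"] = 1; arr["c", "d"] = 2; for(i in arr){print i, arr[i];}}'
    Output:
    ab 1
    cd 2
    

    awk joins the indices with the value of SUBSEP, which defaults to the non-printing character "\034"; that is why the keys above appear as ab and cd. Set SUBSEP to change the separator. SUBSEP must be set before the array elements are assigned, otherwise it has no effect on them.

    awk 'BEGIN{SUBSEP = ":"; arr["a", "b"] = 1; arr["c", "d"] = 2; for(i in arr){print i, arr[i];}}'
    Output:
    a:b 1
    c:d 2
    

    When choosing SUBSEP, avoid characters that occur in the indices, otherwise distinct keys can collide, for example:

    awk 'BEGIN{SUBSEP = ":"; arr["a", "b:c"] = 1; arr["a:b", "c"] = 2; for(i in  arr){print i, arr[i];}}'
    Output:
    a:b:c 2
    awk 'BEGIN{SUBSEP = "~"; arr["a", "b:c"] = 1; arr["a:b", "c"] = 2; for(i in  arr){print i, arr[i];}}'
    Output:
    a:b~c 2
    a~b:c 1
    

    Joined with ':', both keys become 'a:b:c', so the second assignment overwrites the first.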
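
    Since a multi-dimensional key is really one string, the individual indices can be recovered by splitting the key on SUBSEP. A minimal sketch under that assumption (the names pairs, key and idx are illustrative; sort is used only to make the output order predictable):

```shell
pairs=$(awk 'BEGIN {
    arr["a", "b"] = 1
    arr["c", "d"] = 2
    for (key in arr) {
        split(key, idx, SUBSEP)              # recover the individual indices
        print idx[1], idx[2], arr[key]
    }
}' | sort)
echo "$pairs"
```

    This works with the default SUBSEP precisely because "\034" is unlikely to occur in ordinary index values.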

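    User-defined functions in awk

    As in other programming languages, awk lets you define your own functions, in the following form:

    function funcName(parameter1, parameter2, parameter3, ...){
        statements;
        [return xxx;]
    }
    Example:
    awk 'function add(a, b){a += 5; res = a + b; return res;}BEGIN{print add(10, 20)}'
    Output:
    35
    

    A function must be defined at the top level of the program, alongside the rules and not inside a block. Placing definitions before the rules that call them, as above, is the usual convention.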
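
    In the example above, res is a global variable. awk has no keyword for local variables; the usual convention is to list them as extra parameters that callers do not pass. The sketch below is one way to write this (the extra spacing before res is only a readability convention, and the names result and x are illustrative):

```shell
result=$(awk '
# extra parameters (here: res) act as local variables; callers simply omit them
function add(a, b,    res) {
    a += 5          # scalars are passed by value, so the caller is unaffected
    res = a + b
    return res
}
BEGIN {
    x = 10
    print add(x, 20), x, (res == "" ? "res is unset" : "res leaked")
}')
echo "$result"
```

    The output shows that x keeps its value after the call and that res does not leak into the global scope.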
