When faced with massive amounts of data, how should we compute basic statistics such as intersections and unions?
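1. Basic concepts
Intersection
3 is the intersection of A and B.
Union
1, 2, 3, 4, and 5 form the union of A and B.
Difference
1, 2, 4, and 5 form the difference of A and B: the elements that belong to exactly one of the two sets (strictly speaking, the symmetric difference).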
Complement
The complement of A is made up of the elements of the universal set that are not in A.
2. Code
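Before moving on to files, the four concepts can be checked on a tiny in-memory example. This is only a sketch: the sets A = (1, 2, 3) and B = (3, 4, 5) are an assumption chosen to agree with the numbers quoted above, and the universal set U = 1..6 is a hypothetical choice needed only to illustrate the complement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed example sets, chosen to match the numbers quoted in the concepts section.
my @A = (1, 2, 3);
my @B = (3, 4, 5);
my @U = (1 .. 6);    # hypothetical universal set, used only for the complement

# Hash lookups make the membership tests cheap.
my %in_a = map { $_ => 1 } @A;
my %in_b = map { $_ => 1 } @B;

my %seen;
my @union        = grep { !$seen{$_}++ } (@A, @B);               # in A or in B
my @intersection = grep { $in_b{$_} } @A;                        # in A and in B
my @difference   = grep { !($in_a{$_} && $in_b{$_}) } @union;    # in exactly one of them
my @complement_a = grep { !$in_a{$_} } @U;                       # in U but not in A

print "union:        @union\n";           # prints "union:        1 2 3 4 5"
print "intersection: @intersection\n";    # prints "intersection: 3"
print "difference:   @difference\n";      # prints "difference:   1 2 4 5"
print "complement A: @complement_a\n";    # prints "complement A: 4 5 6"
```

The same hash-based membership test is what the file-based script relies on, only with lines read from disk instead of hard-coded lists.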
The code below only compares two files against each other.
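#!/usr/bin/perl
use warnings;
use strict;

# Usage: perl <this script> fileA fileB   (one item per line in each file)
@ARGV == 2 or die "Usage: $0 fileA fileB\n";
open my $fh_a, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
open my $fh_b, '<', $ARGV[1] or die "Cannot open $ARGV[1]: $!\n";

my (@a, @b);
while (<$fh_a>) {
    chomp;
    push @a, $_;
}
while (<$fh_b>) {
    chomp;
    push @b, $_;
}
close $fh_a;
close $fh_b;

# The declarations can all be written together in one my (...) list.
my (@union, @diff, @intsec, @a_spec, @b_spec, %union, %a, %b);

# Record each distinct item once per file, so that a line repeated
# within a single file is not mistaken for an item shared by both.
$a{$_} = 1 foreach @a;
$b{$_} = 1 foreach @b;
$union{$_}++ foreach keys %a;
$union{$_}++ foreach keys %b;    # an item of A that also occurs in B ends up with the value 2

@union = keys %union;            # union (hash key order is not fixed)
foreach (@union) {
    if ($union{$_} == 2) {
        push @intsec, $_;        # intersection: present in both files
    }
    elsif (exists $a{$_}) {
        push @a_spec, $_;        # unique to A
    }
    else {
        push @b_spec, $_;        # unique to B
    }
}
@diff = (@a_spec, @b_spec);      # difference: items found in exactly one file

# Write one item per line to the named output file.
sub write_list {
    my ($file, @items) = @_;
    open my $out, '>', $file or die "Cannot write $file: $!\n";
    print {$out} "$_\n" foreach @items;
    close $out or die "Cannot close $file: $!\n";
}
write_list('union.txt',  @union);
write_list('intsec.txt', @intsec);
write_list('diff.txt',   @diff);
write_list('a_spec.txt', @a_spec);
write_list('b_spec.txt', @b_spec);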
The code is somewhat long; it can in fact be written more simply.
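One possible shortening is sketched below. It is not the author's own shorter version, which is not reproduced here. The helper names `compare_sets` and `read_lines` are hypothetical, and the output file names are assumed to follow the same union/intsec/diff/a_spec/b_spec naming as the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# compare_sets is a hypothetical helper: it takes two array references
# and returns a hash reference holding the five result lists.
sub compare_sets {
    my ($list_a, $list_b) = @_;
    my %in_a   = map { $_ => 1 } @$list_a;    # distinct items of A
    my %in_b   = map { $_ => 1 } @$list_b;    # distinct items of B
    my %in_any = (%in_a, %in_b);              # distinct items of A or B

    my @union  = keys %in_any;
    my @intsec = grep {  $in_b{$_} } keys %in_a;
    my @a_only = grep { !$in_b{$_} } keys %in_a;
    my @b_only = grep { !$in_a{$_} } keys %in_b;

    # Sort (as strings) so the output order is reproducible.
    @a_only = sort @a_only;
    @b_only = sort @b_only;
    return {
        union  => [ sort @union ],
        intsec => [ sort @intsec ],
        a_spec => \@a_only,
        b_spec => \@b_only,
        diff   => [ @a_only, @b_only ],
    };
}

# read_lines is a hypothetical helper: one item per line, newlines removed.
sub read_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    chomp(my @lines = <$fh>);
    close $fh;
    return \@lines;
}

# When two file names are given, write each result list to <name>.txt.
if (@ARGV == 2) {
    my $result = compare_sets(read_lines($ARGV[0]), read_lines($ARGV[1]));
    for my $name (sort keys %$result) {
        open my $out, '>', "$name.txt" or die "Cannot write $name.txt: $!\n";
        print {$out} "$_\n" for @{ $result->{$name} };
        close $out or die "Cannot close $name.txt: $!\n";
    }
}
```

Called directly, `compare_sets([1, 2, 3, 3], [3, 4, 5])` puts 3 in `intsec`, 1 and 2 in `a_spec`, and 4 and 5 in `b_spec`. Membership is tested with hash lookups, so an item repeated inside one list is not counted as shared.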