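# Association Analysis

Association analysis is the task of searching for relationships in large data sets. The relationships it looks for take two forms: frequent itemsets and association rules.

Support: the proportion of records in the data set that contain a given itemset. For example, in a collection of purchase records, if orders that include a pencil make up 10% of all orders, then the support of the itemset {pencil} is 10%, that is,

$$P(\{\text{pencil}\}) = 0.1$$

Confidence: defined as a conditional probability. For the association rule {diaper} --> {wine}, the confidence is support({diaper, wine}) / support({diaper}), that is, the probability that a customer who buys diapers also buys wine.

Frequent itemset: a set of items that often appear together, defined as an itemset whose support is greater than some threshold $c$:

$$P(\text{itemset}) > c$$

Association rule: a relationship whose confidence is greater than some threshold. For the relationship {diaper} --> {wine}, if the probability that a customer who buys diapers also buys wine exceeds the threshold, the relationship is called an association rule. Note: association rules are directional, so {diaper} --> {wine} being a rule does not make {wine} --> {diaper} one.

## The Apriori principle

Take shopping as the example again. For a supermarket with $N$ products, the number of possible item combinations a customer can buy is

$$\sum_{i=1}^{N} C_N^i = (1+1)^N - 1 = 2^N - 1$$

which is far too many to enumerate for any realistic $N$. Apriori relies on the following principle: if a set A is not a frequent itemset, then no set that contains A as a subset is a frequent itemset either. This holds because every record that contains the larger set also contains A, so the larger set's support cannot exceed that of A.

## Finding frequent itemsets with the Apriori algorithm

The Apriori algorithm takes two inputs: the data set and the minimum support (the threshold). The procedure is:

1) Generate all single-item itemsets, scan every transaction record, and keep the frequent 1-itemsets.
2) Combine the frequent k-itemsets pairwise to generate candidate (k+1)-itemsets, discard the candidates that are not frequent to obtain the frequent (k+1)-itemsets, and repeat until no new frequent itemsets are found.
3) Return the list of frequent itemsets.

The code is as follows:

```python
# Generate the 1-item candidate sets
def Creat_C1(item_set):
    """item_set is the collection of orders; each order is the collection
    of product types bought in that order."""
    C1 = []
    for i in item_set:
        for j in i:
            if frozenset({j}) not in C1:
                C1.append(frozenset({j}))
    return C1


# Compute the support of each candidate and select the frequent k-itemsets
def Fre_Support_cal(D, Ck, minSupport):
    """
    Input:
        D: data set
        Ck: candidate k-itemsets
        minSupport: minimum support
    Output:
        Freq_listk: frequent k-itemsets
        support_data_dictk: support of every candidate that occurs in D
    """
    support_count_dictk = {}
    for raw in D:
        for item_set in Ck:
            if item_set.issubset(raw):
                if item_set not in support_count_dictk:
                    support_count_dictk[item_set] = 1
                else:
                    support_count_dictk[item_set] += 1
    num_all = len(D)
    support_data_dictk = {}
    Freq_listk = []
    for key in support_count_dictk:
        support = support_count_dictk[key] / num_all
        support_data_dictk[key] = support
        if support >= minSupport:
            Freq_listk.append(key)
    return Freq_listk, support_data_dictk


# Generate the candidate k-itemsets from the frequent (k-1)-itemsets
def Creat_Ck(Freq_listk_1, k):
    Ck = []
    for i in range(len(Freq_listk_1)):
        for j in range(i + 1, len(Freq_listk_1)):
            # Two (k-1)-itemsets that differ in exactly one item merge into a k-itemset
            if len(Freq_listk_1[i] - Freq_listk_1[j]) == 1:
                if frozenset(Freq_listk_1[i] | Freq_listk_1[j]) not in Ck:
                    Ck.append(frozenset(Freq_listk_1[i] | Freq_listk_1[j]))
    return Ck


# Generate all frequent itemsets
def apriori(dataset, minSupport):
    C1 = Creat_C1(dataset)
    Freq_list1, support_data_dict1 = Fre_Support_cal(dataset, C1, minSupport)
    # Start the accumulated results from the 1-itemsets
    Freq_list = list(Freq_list1)
    support_data_dict = dict(support_data_dict1)
    k = 2
    Freq_listk_1 = Freq_list1
    # Stop once a level produces no frequent itemsets
    while len(Freq_listk_1) > 0:
        Ck = Creat_Ck(Freq_listk_1, k)
        Freq_listk, support_data_dictk = Fre_Support_cal(dataset, Ck, minSupport)
        Freq_list.extend(Freq_listk)
        support_data_dict.update(support_data_dictk)
        k += 1
        Freq_listk_1 = Freq_listk
    return Freq_list, support_data_dict
```

```python
# Test code
dataset = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
Fre_list, support_dict = apriori(dataset, 0.5)
Fre_list, support_dict
```

```
([frozenset({1}), frozenset({3}), frozenset({2}), frozenset({5}),
  frozenset({1, 3}), frozenset({2, 3}), frozenset({3, 5}), frozenset({2, 5}),
  frozenset({2, 3, 5})],
 {frozenset({1}): 0.5, frozenset({3}): 0.75, frozenset({4}): 0.25,
  frozenset({2}): 0.75, frozenset({5}): 0.75, frozenset({1, 3}): 0.5,
  frozenset({2, 3}): 0.5, frozenset({3, 5}): 0.5, frozenset({2, 5}): 0.75,
  frozenset({1, 2}): 0.25, frozenset({1, 5}): 0.25, frozenset({2, 3, 5}): 0.5,
  frozenset({1, 2, 3}): 0.25, frozenset({1, 3, 5}): 0.25})
```

## Mining association rules from frequent itemsets

For a frequent itemset with $N$ items, the number of candidate rules (ways of choosing a non-empty antecedent that leaves a non-empty consequent) is

$$\sum_{i=1}^{N-1} C_N^i = 2^N - 2$$

Rule generation can also rely on the Apriori principle. For a frequent itemset $A = (X_1, X_2, \cdots, X_N)$, suppose the relationship $B \longrightarrow C$, with $B = (X_1, X_2, \cdots, X_k)$ and $C = (X_{k+1}, X_{k+2}, \cdots, X_N)$, is not an association rule, that is,

$$P(C \mid B) = \frac{P(BC)}{P(B)} = \frac{P(X_1, X_2, \cdots, X_N)}{P(X_1, X_2, \cdots, X_k)} < \text{min\_conf}$$

Then no relationship built from $A$ whose antecedent $B'$ is a subset of $B$ can be an association rule either: $P(B') \ge P(B)$, so its confidence $P(X_1, X_2, \cdots, X_N) / P(B')$ is no larger than $P(C \mid B)$.

```python
# Generate association rules from the frequent itemsets
def association_rules(freq_list, support_data_dict, min_conf):
    # NOTE: the opening lines of this function are missing from the source.
    # The loops and the definitions of frq, conf and rule are reconstructed
    # from the surviving tail of the function and from the printed output.
    rules = []
    for i in range(len(freq_list)):
        for j in range(i + 1, len(freq_list)):
            # freq_list[i] can be an antecedent only if it is a proper subset of freq_list[j]
            if freq_list[i] < freq_list[j]:
                frq = support_data_dict[freq_list[j]]
                conf = frq / support_data_dict[freq_list[i]]
                rule = (freq_list[i], freq_list[j] - freq_list[i], frq, conf)
                if conf >= min_conf:
                    print(freq_list[i], "-->", freq_list[j] - freq_list[i], 'frq:', frq, 'conf:', conf)
                    rules.append(rule)
    return rules
```

```python
# Test code
association_rules(Fre_list, support_dict, 0.5)
```

```
frozenset({1}) --> frozenset({3}) frq: 0.5 conf: 1.0
frozenset({3}) --> frozenset({1}) frq: 0.5 conf: 0.6666666666666666
frozenset({3}) --> frozenset({2}) frq: 0.5 conf: 0.6666666666666666
frozenset({3}) --> frozenset({5}) frq: 0.5 conf: 0.6666666666666666
frozenset({3}) --> frozenset({2, 5}) frq: 0.5 conf: 0.6666666666666666
frozenset({2}) --> frozenset({3}) frq: 0.5 conf: 0.6666666666666666
frozenset({2}) --> frozenset({5}) frq: 0.75 conf: 1.0
frozenset({2}) --> frozenset({3, 5}) frq: 0.5 conf: 0.6666666666666666
frozenset({5}) --> frozenset({3}) frq: 0.5 conf: 0.6666666666666666
frozenset({5}) --> frozenset({2}) frq: 0.75 conf: 1.0
frozenset({5}) --> frozenset({2, 3}) frq: 0.5 conf: 0.6666666666666666
frozenset({2, 3}) --> frozenset({5}) frq: 0.5 conf: 1.0
frozenset({3, 5}) --> frozenset({2}) frq: 0.5 conf: 1.0
frozenset({2, 5}) --> frozenset({3}) frq: 0.5 conf: 0.6666666666666666
[(frozenset({1}), frozenset({3}), 0.5, 1.0),
 (frozenset({3}), frozenset({1}), 0.5, 0.6666666666666666),
 (frozenset({3}), frozenset({2}), 0.5, 0.6666666666666666),
 (frozenset({3}), frozenset({5}), 0.5, 0.6666666666666666),
 (frozenset({3}), frozenset({2, 5}), 0.5, 0.6666666666666666),
 (frozenset({2}), frozenset({3}), 0.5, 0.6666666666666666),
 (frozenset({2}), frozenset({5}), 0.75, 1.0),
 (frozenset({2}), frozenset({3, 5}), 0.5, 0.6666666666666666),
 (frozenset({5}), frozenset({3}), 0.5, 0.6666666666666666),
 (frozenset({5}), frozenset({2}), 0.75, 1.0),
 (frozenset({5}), frozenset({2, 3}), 0.5, 0.6666666666666666),
 (frozenset({2, 3}), frozenset({5}), 0.5, 1.0),
 (frozenset({3, 5}), frozenset({2}), 0.5, 1.0),
 (frozenset({2, 5}), frozenset({3}), 0.5, 0.6666666666666666)]
```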
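
A few of the figures in the output above can be checked directly from the definitions of support and confidence. The sketch below is an illustration only and is not part of the original code: the helper names `support` and `confidence` are made up here, and the four transactions are the test data used above.

```python
# Hypothetical helpers (not from the original code) that apply the
# definitions of support and confidence directly to the test transactions.
transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({2, 5}, transactions))      # 3 of the 4 transactions contain both 2 and 5
print(confidence({1}, {3}, transactions))  # every transaction with 1 also has 3
print(confidence({3}, {1}, transactions))  # only 2 of the 3 transactions with 3 have 1
```

The last two calls show why a rule is directional: {1} --> {3} has confidence 1.0, while {3} --> {1} only reaches about 0.67.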

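
Apriori-style pruning can also be applied when generating rules from a single frequent itemset. The idea, sketched below under stated assumptions, is to grow the consequent one item at a time and to extend only consequents whose rule met the confidence threshold: enlarging the consequent shrinks the antecedent, which can only lower the confidence. The function name `rules_from_itemset` is hypothetical, and `support` is assumed to be a dictionary from `frozenset` to support value, like the one printed earlier.

```python
from itertools import combinations


def rules_from_itemset(itemset, support, min_conf):
    """Return (antecedent, consequent, confidence) rules for one frequent itemset.

    Assumes `support` holds the support of `itemset` and of all its subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    # Level 1: consequents that contain a single item.
    consequents = [frozenset([x]) for x in sorted(itemset)]
    # The consequent must stay a proper subset so the antecedent is non-empty.
    while consequents and len(consequents[0]) < len(itemset):
        passed = []
        for c in consequents:
            antecedent = itemset - c
            conf = support[itemset] / support[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, c, conf))
                passed.append(c)
        # Build the next level only from consequents that passed; anything
        # containing a failed consequent is never generated from it.
        size = len(consequents[0]) + 1
        next_level = {a | b for a, b in combinations(passed, 2) if len(a | b) == size}
        consequents = sorted(next_level, key=sorted)
    return rules


# Supports of {2, 3, 5} and its subsets, copied from the output shown earlier.
support = {
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({5}): 0.75,
    frozenset({2, 3}): 0.5, frozenset({3, 5}): 0.5, frozenset({2, 5}): 0.75,
    frozenset({2, 3, 5}): 0.5,
}

for antecedent, consequent, conf in rules_from_itemset({2, 3, 5}, support, 0.7):
    print(set(antecedent), "-->", set(consequent), "conf:", conf)
```

With a threshold of 0.7, the consequent {3} fails at the first level, so the larger consequents {2, 3} and {3, 5} are never tested. This sketch merges any two surviving consequents, so it may still test a candidate that has another failed subset; a stricter version would check all subsets before testing.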