# Simple Natural Language Processing with Python

This article is an introduction to simple natural language processing tasks with Python; all of the code for it can be found here. Reading the Python syntax quick overview and the machine learning development environment setup guide beforehand is recommended. For more machine learning material, see the recommended book list for machine learning, deep learning, and natural language processing, as well as the data science and machine learning knowledge map and resource collection for programmers.

## Processing the 20 Newsgroups Corpus

The 20 Newsgroups dataset contains roughly 20,000 documents drawn from different newsgroups and was originally collected by Ken Lang. This part covers fetching the dataset, feature extraction, training a simple classifier, and training a topic model. The code consists of a small wrapper library for the main processing steps plus an interactive Notebook demo.

First we need to fetch the data:

```python
def fetch_data(self, subset='train', categories=None):
    """
    Fetch the dataset.
    Args:
        subset -> string -- which subset to fetch: train / test / all
    """
    rand = np.random.mtrand.RandomState(8675309)
    data = fetch_20newsgroups(subset=subset,
                              categories=categories,
                              shuffle=True,
                              random_state=rand)
    self.data[subset] = data
```

Then we can inspect the data format interactively in the Notebook:

```python
# Instantiate the wrapper object
twp = TwentyNewsGroup()
# Fetch the data
twp.fetch_data()
twenty_train = twp.data['train']
print("Dataset structure", "->", twenty_train.keys())
print("Number of documents", "->", len(twenty_train.data))
print("Target categories", "->", [twenty_train.target_names[t] for t in twenty_train.target[:10]])
```

```
Dataset structure -> dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
Number of documents -> 11314
Target categories -> ['sci.space', 'comp.sys.mac.hardware', 'sci.electronics', 'comp.sys.mac.hardware', 'sci.space', 'rec.sport.hockey', 'talk.religion.misc', 'sci.med', 'talk.religion.misc', 'talk.politics.guns']
```

Next we can extract features from the corpus:

```python
# Feature extraction
# Build the document-term matrix (DTM)
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
print("DTM shape", "->", X_train_counts.shape)
# Look up the index of a word in the vocabulary
print("Index of the word", "->", count_vect.vocabulary_.get(u'algorithm'))
```

```
DTM shape -> (11314, 130107)
Index of the word -> 27366
```

To use the documents for classification, we still need to turn them into feature vectors with a common weighting scheme such as TF-IDF:

```python
# Build TF feature vectors for the documents
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
print("TF feature vector of a document", "->", X_train_tf)

# Build TF-IDF feature vectors for the documents
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
print("TF-IDF feature vector of a document", "->", X_train_tfidf)
```

```
TF feature vector of a document -> (0, 6447)   0.0380693493813
                                   (0, 37842)  0.0380693493813
```

We can package feature extraction, classifier training, and prediction into separate methods:

```python
def extract_feature(self):
    """
    Extract document features from the corpus.
    """
    # Document-term matrix of the training data
    self.train_dtm = self.count_vect.fit_transform(self.data['train'].data)
    # TF features of the documents
    tf_transformer = TfidfTransformer(use_idf=False).fit(self.train_dtm)
    self.train_tf = tf_transformer.transform(self.train_dtm)
    # TF-IDF features of the documents
    tfidf_transformer = TfidfTransformer().fit(self.train_dtm)
    self.train_tfidf = tfidf_transformer.transform(self.train_dtm)

def train_classifier(self):
    """
    Train a classifier on the training set.
    """
    self.extract_feature()
    self.clf = MultinomialNB().fit(self.train_tfidf, self.data['train'].target)

def predict(self, docs):
    """
    Predict the categories of new documents.
    """
    X_new_counts = self.count_vect.transform(docs)
    tfidf_transformer = TfidfTransformer().fit(X_new_counts)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    return self.clf.predict(X_new_tfidf)
```
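As an aside, the same count → TF-IDF → Naive Bayes flow can also be written as a scikit-learn `Pipeline`, which fits the vectorizer and the IDF weights once on the training data and reuses them at prediction time instead of re-fitting a transformer on the new documents. This is only a minimal sketch; the names `train` and `text_clf` are illustrative and not part of the wrapper class above:

```python
# A minimal sketch (not part of the TwentyNewsGroup wrapper): the same
# count -> TF-IDF -> Naive Bayes flow expressed as a scikit-learn Pipeline.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

text_clf = Pipeline([
    ('vect', CountVectorizer()),    # raw documents -> term counts
    ('tfidf', TfidfTransformer()),  # term counts -> TF-IDF weights
    ('clf', MultinomialNB()),       # TF-IDF features -> newsgroup labels
])
text_clf.fit(train.data, train.target)

print(text_clf.predict(['God is love', 'OpenGL on the GPU is fast']))
```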
We can then train the classifier and use it for prediction and evaluation:

```python
# Train the classifier
twp.train_classifier()

# Run prediction
docs_new = ['God is love', 'OpenGL on the GPU is fast']
predicted = twp.predict(docs_new)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

# Evaluate the model
twp.fetch_data(subset='test')
predicted = twp.predict(twp.data['test'].data)

import numpy as np

# Error measures
# Simple mean accuracy
np.mean(predicted == twp.data['test'].target)

# Metrics
from sklearn import metrics
print(metrics.classification_report(
    twp.data['test'].target, predicted,
    target_names=twp.data['test'].target_names))

# Confusion matrix
metrics.confusion_matrix(twp.data['test'].target, predicted)
```

```
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => rec.autos

                        precision    recall  f1-score   support
           alt.atheism       0.79      0.50      0.61       319
                   ...
    talk.religion.misc       1.00      0.08      0.15       251
           avg / total       0.82      0.79      0.77      7532

Out[16]:
array([[158,   1,   1,   0,   1,   0,   3,   7,   1,   2,   6,   1,   8,   3, 114,   6,   7,   0,   0],
       ...,
       [ 35,   3,   1,   0,   0,   0,   1,   4,   1,   1,   6,   3,   0,   6,   5, 127,  30,   5,   2,  21]])
```

We can also extract topics from the document set:

```python
# Topic extraction
twp.topics_by_lda()
```

```
Topic 0 : stream s1 astronaut zoo laurentian maynard s2 gtoal pem fpu
Topic 1 : 145 cx 0d bh sl 75u 6um m6 sy gld
Topic 2 : apartment wpi mars nazis monash palestine ottoman sas winner gerard
Topic 3 : livesey contest satellite tamu mathew orbital wpd marriage solntze pope
Topic 4 : x11 contest lib font string contrib visual xterm ahl brake
Topic 5 : ax g9v b8f a86 1d9 pl 0t wm 34u giz
Topic 6 : printf null char manes behanna senate handgun civilians homicides magpie
Topic 7 : buf jpeg chi tor bos det que uwo pit blah
Topic 8 : oracle di t4 risc nist instruction msg postscript dma convex
Topic 9 : candida cray yeast viking dog venus bloom symptoms observatory roby
Topic 10 : cx ck hz lk mv cramer adl optilink k8 uw
Topic 11 : ripem rsa sandvik w0 bosnia psuvm hudson utk defensive veal
Topic 12 : db espn sabbath br widgets liar davidian urartu sdpa cooling
Topic 13 : ripem dyer ucsu carleton adaptec tires chem alchemy lockheed rsa
Topic 14 : ingr sv alomar jupiter borland het intergraph factory paradox captain
Topic 15 : militia palestinian cpr pts handheld sharks igc apc jake lehigh
Topic 16 : alaska duke col russia uoknor aurora princeton nsmca gene stereo
Topic 17 : uuencode msg helmet eos satan dseg homosexual ics gear pyron
Topic 18 : entries myers x11r4 radar remark cipher maine hamburg senior bontchev
Topic 19 : cubs ufl vitamin temple gsfc mcall astro bellcore uranium wesleyan
```

## Wrapping Common NLP Tools

From the walkthrough of the 20 Newsgroups corpus above, we can see that common natural language processing tasks include data acquisition, data preprocessing, feature extraction, training classification models, and extracting higher-level features such as topic models or word vectors. The author also likes to use python-fire to quickly wrap a class into a tool that can be invoked from the command line while still being importable as a module. This part mainly uses Chinese corpora as examples. For instance, to analyse Chinese Wikipedia data we can use the Wikipedia processing class built on gensim:

```python
class Wiki(object):
    """
    Wikipedia corpus processing
    """

    def wiki2texts(self, wiki_data_path, wiki_texts_path='./wiki_texts.txt'):
        """
        Convert a Wikipedia dump into plain text.
        Args:
            wiki_data_path -- path to the compressed Wiki dump
        """
        if not wiki_data_path:
            print("Please pass the path to the Wiki dump, or download one from https://dumps.wikimedia.org/zhwiki/")
            exit()

        # Build the Wiki corpus
        wiki_corpus = WikiCorpus(wiki_data_path, dictionary={})
        texts_num = 0

        with open(wiki_texts_path, 'w', encoding='utf-8') as output:
            for text in wiki_corpus.get_texts():
                output.write(b' '.join(text).decode('utf-8') + '\n')
                texts_num += 1
                if texts_num % 10000 == 0:
                    logging.info("%d articles processed" % texts_num)

        print("Done. Please convert the output to Simplified Chinese with OpenCC.")
```

After the dump has been processed, we convert the text to Simplified Chinese with OpenCC and then segment the resulting file with jieba. The code can be found here; we can run `python chinese_text_processor.py tokenize_file /output.txt` directly to perform the task and generate the tokenized output file.

Once we have the tokenized file, we can turn it into a simple bag-of-words representation or a document-term matrix. See here for the full code:

```python
class CorpusProcessor:
    """
    Corpus processing
    """

    def corpus2bow(self, tokenized_corpus=default_documents):
        """returns (vocab, corpus_in_bow)
        Convert the corpus into its bag-of-words (BOW) representation.
        Args:
            tokenized_corpus -- a list of tokenized documents
        Return:
            vocab -- {'human': 0, ... 'minors': 11}
            corpus_in_bow -- [[(0, 1), (1, 1), (2, 1)] ...]
        """
        dictionary = corpora.Dictionary(tokenized_corpus)
        # The vocabulary
        vocab = dictionary.token2id
        # Bag-of-words representation of the documents
        corpus_in_bow = [dictionary.doc2bow(text) for text in tokenized_corpus]
        return (vocab, corpus_in_bow)

    def corpus2dtm(self, tokenized_corpus=default_documents, min_df=10, max_df=100):
        """returns (vocab, DTM)
        Convert the corpus into a document-term matrix.
        - dtm -> matrix: document-term matrix

                I   like  hate  databases
            D1  1   1     0     1
            D2  1   0     1     1
        """
        if type(tokenized_corpus[0]) is list:
            documents = [" ".join(document) for document in tokenized_corpus]
        else:
            documents = tokenized_corpus

        if max_df == -1:
            max_df = round(len(documents) / 2)

        # Build the count vectorizer for the corpus
        vec = CountVectorizer(min_df=min_df,
                              max_df=max_df,
                              analyzer="word",
                              token_pattern="[\S]+",
                              tokenizer=None,
                              preprocessor=None,
                              stop_words=None)
        # Fit the vectorizer on the data
        DTM = vec.fit_transform(documents)
        # The vocabulary
        vocab = vec.get_feature_names()
        return (vocab, DTM)
```
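To make the return values of `corpus2bow` concrete, here is a minimal, self-contained sketch of the same conversion done with gensim directly; the tiny tokenized corpus below is made up purely for illustration:

```python
# A minimal sketch of what corpus2bow produces, using gensim directly;
# the toy tokenized corpus is illustrative only.
from gensim import corpora

tokenized_corpus = [
    ["human", "machine", "interface", "computer"],
    ["survey", "computer", "system", "response", "time"],
]

dictionary = corpora.Dictionary(tokenized_corpus)

# vocab maps each token to an integer id, e.g. {'computer': 0, 'human': 1, ...}
vocab = dictionary.token2id

# Each document becomes a sparse list of (token_id, count) pairs
corpus_in_bow = [dictionary.doc2bow(text) for text in tokenized_corpus]

print(vocab)
print(corpus_in_bow)
```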
We can also run a topic model over the tokenized documents, or extract word vectors from them. Working on already-tokenized documents lets the same code ignore the differences between Chinese and English:

```python
def topics_by_lda(self, tokenized_corpus_path, num_topics=20, num_words=10,
                  max_lines=10000, split="\s+", max_df=100):
    """
    Read a tokenized corpus file and train an LDA model on it.
    Args:
        tokenized_corpus_path -> string -- path to the tokenized corpus
        num_topics -> integer -- number of topics
        num_words -> integer -- number of words shown per topic
        max_lines -> integer -- maximum number of lines to read at a time
        split -> string -- separator between words in a document
        max_df -> integer -- filter out common words above this threshold
    """
    # Holds the whole corpus
    corpus = []

    with open(tokenized_corpus_path, 'r', encoding='utf-8') as tokenized_corpus:
        flag = 0
        for document in tokenized_corpus:
            # Stop once enough lines have been read
            if flag > max_lines:
                break
            # Add the line to the corpus
            corpus.append(re.split(split, document))
            flag = flag + 1

    # Build the BOW representation of the corpus
    (vocab, DTM) = self.corpus2dtm(corpus, max_df=max_df)

    # Train the LDA model
    lda = LdaMulticore(
        matutils.Sparse2Corpus(DTM, documents_columns=False),
        num_topics=num_topics,
        id2word=dict([(i, s) for i, s in enumerate(vocab)]),
        workers=4)

    # Print and return the topic data
    topics = lda.show_topics(
        num_topics=num_topics,
        num_words=num_words,
        formatted=False,
        log=False)

    for ti, topic in enumerate(topics):
        print("Topic", ti, ":", " ".join(word[0] for word in topic[1]))
```

This function can also be called directly from the command line, passing in the tokenized file. We can likewise build word vectors for the corpus; the code is here. If you are not familiar with the basics of word vectors, see "Word2Vec in Practice with Gensim":

```python
def wv_train(self, tokenized_text_path, output_model_path='./wv_model.bin'):
    """
    Train word vectors on the text and save the resulting model.
    """
    sentences = word2vec.Text8Corpus(tokenized_text_path)
    # Train the model
    model = word2vec.Word2Vec(sentences, size=250)
    # Save the model
    model.save(output_model_path)

def wv_visualize(self, model_path, word=["China", "Airline"]):
    """
    Look up the neighbours of the input words and visualise them.
    Args:
        model_path: path to the Word2Vec model
    """
    # Load the model
    model = word2vec.Word2Vec.load(model_path)

    # Find the most similar words
    words = [wp[0] for wp in model.most_similar(word, topn=20)]

    # Extract the corresponding word vectors
    wordsInVector = [model[word] for word in words]

    # Reduce the dimensionality with PCA
    pca = PCA(n_components=2)
    pca.fit(wordsInVector)
    X = pca.transform(wordsInVector)

    # Plot the points
    xs = X[:, 0]
    ys = X[:, 1]
    plt.figure(figsize=(12, 8))
    plt.scatter(xs, ys, marker='o')

    # Annotate every word with its label
    for i, w in enumerate(words):
        plt.annotate(
            w,
            xy=(xs[i], ys[i]), xytext=(6, 6),
            textcoords='offset points', ha='left', va='top',
            **dict(fontsize=10))
    plt.show()
```

[This article is an original piece by columnist 张子雄; please contact the author before reposting.]
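Before plotting, it is often worth sanity-checking the trained model by printing a few nearest neighbours. Below is a minimal sketch, assuming the model file written by `wv_train` above and the same (pre-4.0) gensim API that the code relies on (`model.most_similar`); the query words are placeholders:

```python
# A minimal sketch of sanity-checking the vectors produced by wv_train above.
# Assumes the pre-4.0 gensim API (model.most_similar on the model object);
# the query words are placeholders, not a recommendation.
from gensim.models import word2vec

model = word2vec.Word2Vec.load('./wv_model.bin')

# Print the 5 nearest neighbours of the query words in the embedding space
for neighbour, similarity in model.most_similar(["China", "Airline"], topn=5):
    print(neighbour, similarity)
```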
