[PythonSnippets] Article Summary Extraction Libraries

Time: 2023-03-26 17:26:08 Python

Python libraries for extracting article summaries. The sample text comes from http://news.steelcn.cn/a/105/... Save the text to content.txt.

1. Textrank4zh
https://github.com/letiantian...

Install:

    $ pip install textrank4zh

Example:

    import codecs
    from textrank4zh import TextRank4Keyword, TextRank4Sentence

    content = codecs.open('content.txt', 'r', 'utf-8').read()

    tr4s = TextRank4Sentence()
    tr4s.analyze(text=content, lower=True, source='all_filters')
    # Print the index, weight, and text of the three highest-ranked sentences
    for item in tr4s.get_key_sentences(num=3):
        print(item.index, item.weight, item.sentence)

    # Result:
    # 0 0.11783211562891267 日前获悉，世界上第一家采用日本神户制钢公司第三代炼铁法（ITmk3）的商业炼铁厂，SteelDynamics位于明尼苏达州的霍伊特莱克斯工厂
    # 6 0.09533764028919228 的产量炼铁厂后续投产预计2010年年中达到设计年产50万吨粒铁
    # 1 0.08828227247879757 已正式投产粒铁

2. FastTextRank
https://github.com/ArtistScri...

Install:

    $ pip install

Example:

    import codecs
    from FastTextRank.FastTextRank4Sentence import FastTextRank4Sentence

    mod = FastTextRank4Sentence(use_w2v=False, tol=0.0001)
    sentence_number = 1
    content = codecs.open('content.txt', 'r', 'utf-8').read()
    print(mod.summarize(content, sentence_number))

    # Result:
    # ['前几天了解到世界上第一家采用第三代炼铁法（ITmk3）的商业公司神户制钢所、SteelDynamics位于明尼苏达州的HoytLakes工厂已正式投产粒状铁']

3. Sumy
https://github.com/miso-belic...

Install:

    $ pip install sumy

Example:

    from __future__ import absolute_import
    from __future__ import division, print_function, unicode_literals

    from sumy.parsers.html import HtmlParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.summarizers.lsa import LsaSummarizer as Summarizer
    from sumy.nlp.stemmers import Stemmer
    from sumy.utils import get_stop_words

    LANGUAGE = "chinese"
    SENTENCES_COUNT = 1

    if __name__ == "__main__":
        url = "http://news.steelcn.cn/a/105/20100123/103370A9F83806.html"
        parser = HtmlParser.from_url(url, Tokenizer(LANGUAGE))
        # Or parse a plain-text file or a string instead
        # parser = PlaintextParser.from_file("content.txt", Tokenizer(LANGUAGE))
        # parser = PlaintextParser.from_string("检查一下。", Tokenizer(LANGUAGE))
        stemmer = Stemmer(LANGUAGE)

        summarizer = Summarizer(stemmer)
        summarizer.stop_words = get_stop_words(LANGUAGE)

        for sentence in summarizer(parser.document, SENTENCES_COUNT):
            print(sentence)

    # Result:
    # 除了北美，神户制钢所在越南、印度、俄罗斯、澳大利亚等国家也有粒铁项目，年产能总计数百万吨

4. Gensim
https://github.com/RaRe-Techn...

Install:

    $ pip install gensim

The gensim.summarization module was removed in Gensim 4.0, so the import below only works with an earlier release.

Example:

    import codecs
    from gensim.summarization.summarizer import summarize

    content = codecs.open('content.txt', 'r', 'utf-8').read()
    summary = summarize(content, ratio=0.2)
    print(summary)

    # Result:
    # The result is empty; Gensim may not be suited to summarizing short texts.

5. SnowNLP
https://github.com/isnowfy/sn...

Install:

    $ pip install snownlp

Example:

    from snownlp import SnowNLP
    import codecs

    content = codecs.open('content.txt', 'r', 'utf-8').read()
    s = SnowNLP(content)
    print(s.keywords(3))   # top three keywords
    print(s.summary(3))    # top three summary sentences

    # Result:
    # ['公司', '铁', '生产']
    # ['已正式投入生产粒铁', '该工厂于去年第四季度投产', 'SteelDynamics明尼苏达州霍伊特湖工厂']

6. Textteaser
https://github.com/IndigoRese...

This one appears to support English only.

    import codecs
    from textteaser import TextTeaser

    content = codecs.open('content.txt', 'r', 'utf-8').read()
    title = ""
    tt = TextTeaser(content)
    # Summarize the loaded text (the title is left empty here)
    summary = tt.summarize(title, content)
    print(summary)

All of the summaries above are extractive summaries. I tried every library listed here: Textrank4zh and FastTextRank gave good results, with Sumy next best. More summarization libraries will be added in the future.
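To make "extractive summary" concrete: an extractive summarizer does not write new text, it scores the sentences already in the document and returns the best ones. The following is a minimal sketch of TextRank-style sentence ranking in plain Python, not the implementation used by Textrank4zh, FastTextRank, or any other library listed here. The function names (split_sentences, bigrams, similarity, extractive_summary), the character-bigram similarity, and the damping value of 0.85 are all assumptions chosen for illustration; real libraries use proper word segmentation and stop-word filtering.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split text on common Chinese and Western sentence terminators."""
    parts = re.split(r"[。！？!?\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def bigrams(sentence):
    """Character bigrams: a crude stand-in for real word segmentation."""
    chars = [c for c in sentence if not c.isspace()]
    return Counter(a + b for a, b in zip(chars, chars[1:]))


def similarity(a, b):
    """Overlap of two bigram multisets, normalised by their combined size."""
    shared = sum((a & b).values())
    total = sum(a.values()) + sum(b.values())
    return shared / total if total else 0.0


def extractive_summary(text, num=3, damping=0.85, iterations=50):
    """Return the `num` highest-scoring sentences, in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []
    grams = [bigrams(s) for s in sentences]
    # Weighted sentence graph: edge weight = similarity, no self-loops.
    weights = [[similarity(grams[i], grams[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the sentence graph.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num]
    return [sentences[i] for i in sorted(top)]
```

Assuming content.txt has been saved as described above, calling extractive_summary(content, num=3) on its contents returns three of the original sentences unchanged, which is the defining property of an extractive method.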
