
NLP for Beginners: News Text Classification, Data Reading and Data Analysis

Posted: 2023-03-25 21:03:08 · Python

Data download

Download the competition data:

```shell
!wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/train_set.csv.zip
!wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a.csv.zip
!wget https://tianchi-competition.oss-cn-hangzhou.aliyuncs.com/531810/test_a_sample_submit.csv
```

After unzipping there are three files: the training data (train_set.csv), the test data (test_a.csv), and a sample submission file (test_a_sample_submit.csv).

```shell
!mkdir /content/drive/My\ Drive/competitions/NLPNews
!unzip /content/test_a.csv.zip -d /content/drive/My\ Drive/competitions/NLPNews/test
!unzip /content/train_set.csv.zip -d /content/drive/My\ Drive/competitions/NLPNews/train
!mv /content/test_a_sample_submit.csv /content/drive/My\ Drive/competitions/NLPNews/submit.csv
!mv /content/drive/My\ Drive/competitions/NLPNews/test/test_a.csv /content/drive/My\ Drive/competitions/NLPNews/test.csv
!mv /content/drive/My\ Drive/competitions/NLPNews/train/train_set.csv /content/drive/My\ Drive/competitions/NLPNews/train.csv
```

Reading the data

```python
import os
import re
from collections import Counter

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

root_dir = '/content/drive/MyDrive/competitions/NLPNews'
train_df = pd.read_csv(root_dir + '/train.csv', sep='\t')

# Each document is a sequence of space-separated, anonymized character ids.
train_df['word_cnt'] = train_df['text'].apply(lambda x: len(x.split(' ')))
train_df.head(10)
```

(Output: the first 10 rows of the DataFrame, with columns label, text and word_cnt.)

```python
train_df['word_cnt'] = train_df['word_cnt'].apply(int)
train_df['word_cnt'].describe()
```

```
count    200000.000000
mean        907.207110
std         996.029036
min           2.000000
25%         374.000000
50%         676.000000
...
max       57921.000000
```

The statistics show that article length varies widely: the shortest news item contains only 2 characters, while the longest contains 57921.

```python
plt.hist(train_df['word_cnt'], bins=255)
plt.title('word counts statistics')
plt.xlabel('word counts')
plt.show()

plt.bar(range(1, 15), train_df['label'].value_counts().values)
plt.title('label counts statistic')
# plt.xticks(range(1, 15), labels=labels)
plt.xlabel('label')
plt.show()
```

As the figure shows, the imbalance between classes is still severe:

```python
labels = ['Technology', 'Stocks', 'Sports', 'Entertainment', 'Current affairs',
          'Society', 'Education', 'Finance', 'Home', 'Games',
          'Real estate', 'Fashion', 'Lottery', 'Horoscope']
for label, cnt in zip(labels, train_df['label'].value_counts()):
    print(label, cnt)
```

```
Technology 38918
Stocks 36945
Sports 31425
Entertainment 22133
Current affairs 15016
Society 12232
Education 9985
Finance 8841
Home 7847
Games 5878
Real estate 4920
Fashion 3131
Lottery 1821
Horoscope 908
```

Character frequencies over the whole corpus:

```python
s = ' '.join(list(train_df['text']))
counter = Counter(s.split(' '))
counter = sorted(counter.items(), key=lambda x: x[1], reverse=True)
print('Most frequent character:', counter[0])
print('Least frequent character:', counter[-1])
```

Assuming that characters 3750, 900 and 648 are sentence punctuation, how many sentences does each news article in the competition data consist of on average?

```python
train_df['sentence_cnt'] = train_df['text'].apply(
    lambda x: len(re.split('3750|900|648', x)))
pd.concat([train_df.head(5), train_df.tail(5)], axis=0)
```

The most frequent character within each class:

```python
for i in train_df['label'].unique():
    s = ' '.join(list(train_df.loc[train_df['label'] == i, 'text']))
    counter = Counter(s.split(' '))
    most = sorted(counter.items(), key=lambda x: x[1], reverse=True)[0]
    print(i, most)
```

References
[1] Datawhale "NLP for Beginners" event, Task 2: Data Reading and Data Analysis
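Since the full competition file is large, the word-count and sentence-count logic above can be sanity-checked on a tiny synthetic DataFrame first. Everything below (the three toy documents and their token values) is invented for illustration; only the token ids 3750/900/648 come from the punctuation assumption in the task.

```python
import re

import pandas as pd

# Toy stand-in for train.csv: space-separated anonymized character ids,
# with '3750' and '900' playing the role of punctuation characters.
toy_df = pd.DataFrame({
    'label': [0, 1, 0],
    'text': [
        '57 44 3750 66 22 3750',
        '12 3750 99',
        '7 7 7 900 7',
    ],
})

# Characters per document: count the space-separated tokens.
toy_df['word_cnt'] = toy_df['text'].apply(lambda x: len(x.split(' ')))

# Sentences per document: splitting on the assumed punctuation ids yields
# one more segment than there are separators in the text.
toy_df['sentence_cnt'] = toy_df['text'].apply(
    lambda x: len(re.split('3750|900|648', x)))

print(toy_df[['word_cnt', 'sentence_cnt']].to_dict('list'))
```

On this toy data the first document has 6 characters split into 3 segments by its two occurrences of 3750, which matches what the same lambda computes on the real training set.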
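The per-class frequency loop can also be written with `Counter.most_common`, which avoids sorting the entire vocabulary just to take its first element. A minimal sketch on toy data (the documents and token values here are invented for illustration):

```python
from collections import Counter

import pandas as pd

toy_df = pd.DataFrame({
    'label': [0, 0, 1],
    'text': ['3750 648 3750', '3750 900', '648 900'],
})

# Corpus-wide frequencies over all space-separated character ids.
counter = Counter(' '.join(toy_df['text']).split(' '))
print(counter.most_common(1)[0])  # most frequent character overall

# Most frequent character within each class, via groupby instead of
# repeated boolean indexing.
for label, group in toy_df.groupby('label'):
    chars = ' '.join(group['text']).split(' ')
    print(label, Counter(chars).most_common(1)[0])
```

`most_common(1)` runs in a single pass over the counter, so it scales better than a full sort when the vocabulary is large, as it is on the real corpus.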