How do we group a pile of articles into different topics and, at the same time, extract keywords for each topic so that we get a rough, intuitive sense of what each topic is about? LDA (Latent Dirichlet Allocation) is an algorithm that does exactly this. In this post we try out LDA with Python's gensim library. Before running LDA, however, we need to segment the Chinese text into words with pyltp.
My development environment:

- Windows 10
- Python 3.6
- gensim
- pyltp

The corpus I used: the Sogou news corpus (see the next section).

## Install gensim

Both gensim and pyltp can be installed from PyPI with pip, e.g. `pip install gensim pyltp`; pyltp additionally needs the LTP model files, which the segmentation code below loads from disk.
## Prepare the corpus

I used the Sogou news corpus. It is public and free and can be downloaded from many places on the web; I keep a backup on my own network drive: https://pan.baidu.com/s/1gg2y3Gf (password: hk3y). Download the corpus and unpack it somewhere convenient.
## Basic configuration

```python
import os
from pathlib import Path

project_dir = Path('D:\\项目\\lda主题提取')   # project root
source_dir = project_dir / 'source'            # raw corpus files
words_dir = project_dir / 'words'              # segmented output goes here
```
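The segmentation code further down expects `source_dir` to contain one subdirectory per news category, with the raw GBK-encoded text files inside; this matches the two-level iteration in `fenci_all`. A sketch of the layout I am assuming (the folder and file names here are only illustrative):

```
D:\项目\lda主题提取\
├── source\            # unpacked corpus: one folder per category
│   ├── <category-1>\
│   │   ├── 10.txt
│   │   └── ...
│   └── <category-2>\
│       └── ...
└── words\              # segmented articles are written here, mirroring source\
```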
## Word segmentation

I first define a function `fenci_all` that walks over every file under the `source_dir` directory and feeds it to the `fenci` function for segmentation. `fenci` reads the file's content, splits the passage into sentences and each sentence into words, and then saves the segmentation result under the `words_dir` directory.
```python
import pyltp
from pyltp import SentenceSplitter
from pyltp import Segmentor
import json

# path to the LTP word-segmentation model
ci_model_path = 'D:\\mysites\\text-characters\\tcharacters\\ltp\\ltp_data\\cws.model'
sspliter = SentenceSplitter()
wspliter = Segmentor()
wspliter.load(ci_model_path)


def fenci(filepath: Path):
    relative = filepath.relative_to(source_dir)
    write_path = words_dir / relative
    try:
        passage = open(filepath, 'r', encoding='gbk').read()
    except UnicodeDecodeError:
        print('Error to Open file:{}'.format(filepath))
        return
    sentences = sspliter.split(passage)      # split the passage into sentences
    splited = []
    for sent in sentences:
        words = wspliter.segment(sent)       # split each sentence into words
        splited.append(list(words))          # keep sentence structure: a list of word lists
    write_path.parent.mkdir(parents=True, exist_ok=True)
    f = open(write_path, 'w')
    f.write(json.dumps(splited))
    f.close()


def fenci_all():
    d: Path
    for d in source_dir.iterdir():           # one directory per category
        for fpath in d.iterdir():
            print('split {}'.format(fpath))
            fenci(fpath)


fenci_all()
wspliter.release()
```
## Preparing the data

After the steps above the corpus has been segmented, and all of the segmented material is stored under the `words_dir` directory. Now let's look at what data we need to put together to run LDA.

First we write a function to load the data: it reads the segmented words from the files and merges all the sentences of each article into a single word list (dropping the sentence boundaries).
```python
def load_corpus():
    d: Path
    corpus = []
    for d in words_dir.iterdir():
        for fpath in d.iterdir():
            print('reading {}'.format(fpath))
            passage = json.load(open(fpath, 'r'))
            passage = [word for sentence in passage for word in sentence]
            corpus.append(passage)
    return corpus
```
With this function in place, we can now load the data:
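This is the same call that appears in the complete source at the end of the post:

```python
corpus = load_corpus()   # a list of articles, each article being a list of words
```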
Because a computer cannot work with text directly, and what is stored in our corpus right now is still words, we need to map each word to a number; in other words, we need to build a dictionary recording the correspondence between numbers and words.
```python
from gensim import models, corpora

dic = corpora.Dictionary(corpus)
clean_data = [dic.doc2bow(words) for words in corpus]
```
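To make the mapping concrete, here is a tiny standalone example (the documents are made up) of what `Dictionary` and `doc2bow` produce:

```python
from gensim import corpora

docs = [['航母', '海军', '训练'], ['海军', '演习', '演习']]
d = corpora.Dictionary(docs)
print(d.token2id)          # word -> integer id, e.g. {'训练': 0, '海军': 1, '航母': 2, '演习': 3}
print(d.doc2bow(docs[1]))  # (word id, count) pairs, e.g. [(1, 1), (3, 2)]
```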
## Topic extraction

```python
# train an LDA model with 20 topics and print the top 10 words of each topic
lda = models.ldamodel.LdaModel(clean_data, id2word=dic, num_topics=20)
for topic in lda.print_topics(num_words=10):
    print(topic[1])
```
Output (the number before each asterisk is the word's weight; the string after it is the word):
```
0.029*"," + 0.026*"航空" + 0.021*"飞机" + 0.017*"的" + 0.014*"," + 0.012*"。" + 0.012*"旅游" + 0.010*"、" + 0.009*"飞" + 0.005*")"
0.024*"," + 0.021*"的" + 0.020*"》" + 0.017*"《" + 0.015*"&" + 0.015*"nbsp" + 0.014*"。" + 0.008*"、" + 0.007*":" + 0.005*"为"
0.035*"." + 0.017*"的" + 0.016*"," + 0.016*"," + 0.014*"防务" + 0.010*"。" + 0.009*")" + 0.007*"a" + 0.005*"A" + 0.004*"、"
0.021*"的" + 0.016*"。" + 0.014*")" + 0.013*"," + 0.013*"(" + 0.011*"=" + 0.010*"战斗机" + 0.009*"+" + 0.008*"、" + 0.008*"坦克"
0.054*"," + 0.050*"的" + 0.025*"。" + 0.018*"海军" + 0.017*"训练" + 0.015*"美军" + 0.012*"、" + 0.009*"在" + 0.008*"潜艇" + 0.008*"作战"
0.059*"," + 0.051*"的" + 0.034*"。" + 0.019*"、" + 0.014*"”" + 0.014*"在" + 0.013*"“" + 0.010*"和" + 0.009*"中国" + 0.008*"了"
0.040*"、" + 0.032*"," + 0.022*"的" + 0.019*"。" + 0.016*"专业" + 0.014*"考生" + 0.011*"学校" + 0.011*"大学" + 0.010*"考试" + 0.009*":"
0.134*"," + 0.014*"的" + 0.013*")" + 0.013*"、" + 0.013*""" + 0.012*"(" + 0.011*"自卫队" + 0.011*"&" + 0.010*"nbsp" + 0.009*"'"
0.026*"," + 0.017*"的" + 0.017*"。" + 0.008*"兵力" + 0.007*"师" + 0.007*"编队" + 0.007*"3" + 0.006*"在" + 0.006*"1" + 0.005*"主力"
0.025*"," + 0.025*"的" + 0.018*"。" + 0.009*"、" + 0.007*"推演" + 0.006*"在" + 0.005*"和" + 0.005*"生产" + 0.005*"断代" + 0.005*"”"
0.125*""" + 0.064*")" + 0.062*"(" + 0.011*"," + 0.010*"。" + 0.008*":" + 0.007*"," + 0.007*"的" + 0.006*"&" + 0.006*"nbsp"
0.175*";" + 0.168*"&" + 0.168*"nbsp" + 0.014*"的" + 0.013*":" + 0.010*"," + 0.010*"。" + 0.006*"、" + 0.004*"1" + 0.004*"2"
0.047*"伊朗" + 0.024*"," + 0.022*"。" + 0.019*"的" + 0.010*"nbsp" + 0.009*"、" + 0.009*"军费" + 0.008*"&" + 0.007*"公司" + 0.006*"("
0.037*"," + 0.025*"演习" + 0.025*"(" + 0.024*")" + 0.019*"。" + 0.016*"、" + 0.009*"的" + 0.009*"在" + 0.009*"部队" + 0.008*"“"
0.075*"," + 0.065*"的" + 0.030*"。" + 0.018*"是" + 0.012*"在" + 0.012*"“" + 0.011*"”" + 0.009*"和" + 0.009*"、" + 0.007*"不"
0.053*";" + 0.047*"," + 0.030*"&" + 0.029*"的" + 0.023*"。" + 0.021*"nbsp" + 0.008*":" + 0.008*"是" + 0.008*"、" + 0.005*"gt"
0.030*"," + 0.023*"《" + 0.022*"》" + 0.019*"的" + 0.015*"。" + 0.015*"、" + 0.014*"之" + 0.009*"&" + 0.008*"nbsp" + 0.007*"为"
0.041*"," + 0.028*"。" + 0.021*"解放军" + 0.018*"、" + 0.015*"的" + 0.014*"台军" + 0.009*"空军" + 0.008*"架" + 0.007*"陆军" + 0.007*")"
0.085*"," + 0.058*"的" + 0.035*"。" + 0.017*"了" + 0.017*"是" + 0.014*"”" + 0.013*"“" + 0.012*"在" + 0.011*"一" + 0.011*"不"
0.116*"[" + 0.115*"]" + 0.048*":" + 0.043*"-" + 0.031*"," + 0.017*"防空" + 0.011*"1" + 0.009*"。" + 0.008*"2" + 0.007*"的"
```
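Besides printing each topic's keywords, the trained model can also be queried for the topic distribution of an individual document. A minimal sketch using the `lda`, `dic` and `corpus` objects built above:

```python
# topic distribution of the first article as (topic id, probability) pairs;
# topics below the default probability threshold are omitted
bow = dic.doc2bow(corpus[0])
print(lda.get_document_topics(bow))   # e.g. [(3, 0.42), (17, 0.31), ...]
```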
## Cleaning up punctuation and meaningless words

Looking at the topic keywords printed above, you can see that lots of punctuation marks and all kinds of meaningless words made it into the model, which greatly degrades the quality of the topic clusters. So we should remove the punctuation and the meaningless words beforehand.

We use pyltp to tag the part of speech of each word and keep only the words whose tag is in `['a', 'b', 'd', 'i', 'j', 'n', 'nh', 'ni', 'nl', 'ns', 'nt', 'nz', 'v']`.
```python
from pyltp import Postagger

poster = Postagger()
poster.load('D:\\mysites\\text-characters\\tcharacters\\ltp\\ltp_data\\pos.model')


def select_word(sentence: list):
    '''Keep only words whose POS tag is in
    ['a', 'b', 'd', 'i', 'j', 'n', 'nh', 'ni', 'nl', 'ns', 'nt', 'nz', 'v'];
    see https://www.ltp-cloud.com/intro/#pos_how for the meaning of each tag.'''
    keep = ['a', 'b', 'd', 'i', 'j', 'n', 'nh', 'ni', 'nl', 'ns', 'nt', 'nz', 'v']
    tags = poster.postag(sentence)
    # iterate backwards so deleting items does not shift the indices we still need
    for i in range(len(sentence)-1, -1, -1):
        if tags[i] not in keep:
            del sentence[i]
```
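To see what the filter does to a single segmented sentence, here is a small hypothetical example (the sentence is made up, and exactly which words get dropped depends on the POS model):

```python
sent = ['中国', '海军', '在', '南海', '进行', '了', '训练', '。']
select_word(sent)   # modifies the list in place
print(sent)         # function words and punctuation such as '在', '了', '。' should be removed
```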
Then we call this function while loading the data, so that the word filtering happens there; modify `load_corpus` as follows:
```python
def load_corpus():
    d: Path
    corpus = []
    for d in words_dir.iterdir():
        for fpath in d.iterdir():
            print('reading {}'.format(fpath))
            passage = json.load(open(fpath, 'r'))
            for sentence in passage:
                select_word(sentence)
            passage = [word for sentence in passage for word in sentence]
            corpus.append(passage)
    return corpus
```
## Final result

```
0.010*"工作" + 0.010*"专业" + 0.008*"是" + 0.007*"大学" + 0.007*"解放军" + 0.006*"教育" + 0.005*"能力" + 0.005*"发展" + 0.005*"建设" + 0.005*"技术"
0.031*"是" + 0.019*"不" + 0.014*"也" + 0.011*"人" + 0.011*"有" + 0.011*"就" + 0.009*"都" + 0.009*"说" + 0.007*"没有" + 0.007*"到"
0.014*"航母" + 0.014*"我军" + 0.009*"是" + 0.007*"建设" + 0.006*"发展" + 0.005*"教育部" + 0.005*"毛" + 0.005*"主权" + 0.005*"大" + 0.005*"为"
0.012*"师" + 0.009*"军" + 0.009*"直播员" + 0.008*"战斗力" + 0.006*"球" + 0.005*"拦截" + 0.004*"5月" + 0.004*"为" + 0.003*"指挥" + 0.003*"退役"
0.017*"伊朗" + 0.013*"空军" + 0.013*"飞机" + 0.009*"旅游" + 0.009*"考生" + 0.008*"志愿" + 0.008*"军事" + 0.008*"是" + 0.007*"录取" + 0.006*"陈水扁"
0.015*"将" + 0.015*"美国" + 0.010*"是" + 0.008*"中国" + 0.007*"航空" + 0.007*"为" + 0.006*"目前" + 0.006*"台" + 0.006*"飞机" + 0.006*"到"
0.013*"战争" + 0.010*"防务" + 0.006*"文章" + 0.006*"是" + 0.005*"为" + 0.005*"考研" + 0.005*"蒋介石" + 0.005*"有" + 0.003*"师团" + 0.003*"作者"
0.011*"为" + 0.010*"农民" + 0.007*"政权" + 0.006*"农村" + 0.006*"是" + 0.005*"斯大林" + 0.004*"有" + 0.003*"不" + 0.003*"大" + 0.003*"历史"
0.030*"是" + 0.024*"不" + 0.016*"有" + 0.014*"就" + 0.013*"人" + 0.009*"都" + 0.009*"也" + 0.008*"要" + 0.008*"说" + 0.007*"会"
0.018*"公司" + 0.013*"银行" + 0.009*"贷款" + 0.007*"北京" + 0.007*"投资" + 0.006*"市场" + 0.006*"毛泽东" + 0.005*"会计" + 0.005*"中国" + 0.005*"集团"
0.024*"部队" + 0.014*"防空" + 0.013*"官兵" + 0.009*"台军" + 0.008*"军队" + 0.007*"战士" + 0.007*"图" + 0.006*"指挥" + 0.005*"球" + 0.004*"兵力"
0.015*"是" + 0.012*"不" + 0.008*"有" + 0.007*"人" + 0.005*"就" + 0.005*"也" + 0.005*"时" + 0.004*"吃" + 0.004*"都" + 0.004*"可"
0.025*"中国" + 0.021*"是" + 0.012*"美国" + 0.009*"日本" + 0.007*"有" + 0.006*"也" + 0.006*"大" + 0.005*"导弹" + 0.005*"文化" + 0.005*"最"
0.040*"公司" + 0.023*"美国" + 0.016*"雷达" + 0.010*"英国" + 0.006*"隐形" + 0.006*"系统" + 0.005*"进行" + 0.005*"时间" + 0.005*"机动" + 0.005*"将"
0.044*"训练" + 0.006*"干部" + 0.005*"职工" + 0.005*"工资" + 0.005*"军区" + 0.005*"兵团" + 0.004*"规定" + 0.004*"不" + 0.004*"野战" + 0.004*"沈阳"
0.045*"是" + 0.016*"有" + 0.014*"不" + 0.012*"就" + 0.011*"要" + 0.010*"也" + 0.007*"问题" + 0.007*"可以" + 0.006*"会" + 0.006*"能"
0.023*"演习" + 0.022*"海军" + 0.013*"是" + 0.010*"潜艇" + 0.010*"有" + 0.008*"美军" + 0.007*"军演" + 0.007*"舰队" + 0.006*"军事" + 0.005*"直升机"
0.016*"武器" + 0.015*"发射" + 0.015*"导弹" + 0.012*"是" + 0.008*"试验" + 0.006*"不" + 0.006*"系统" + 0.006*"卫星" + 0.006*"弹道导弹" + 0.005*"将"
0.022*"俄" + 0.020*"俄罗斯" + 0.010*"自卫队" + 0.010*"将" + 0.006*"系统" + 0.005*"进行" + 0.005*"是" + 0.004*"主席" + 0.004*"编队" + 0.004*"拦截"
0.014*"为" + 0.005*"是" + 0.004*"右派" + 0.004*"大" + 0.004*"有" + 0.003*"位" + 0.003*"夏" + 0.003*"义和团" + 0.003*"不" + 0.003*"将"
```
If this result still doesn't look very clear, we can also take a look at the top 20 keywords of each topic:
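This is the same `print_topics` call as before, only with `num_words=20`:

```python
for topic in lda.print_topics(num_words=20):
    print(topic[1])
```

which prints: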
```
0.015*"空军" + 0.009*"不" + 0.008*"战机" + 0.008*"记者" + 0.008*"训练" + 0.008*"到" + 0.007*"解放军" + 0.006*"是" + 0.006*"时" + 0.005*"人" + 0.005*"学校" + 0.005*"学生" + 0.005*"还" + 0.004*"北京" + 0.004*"考试" + 0.004*"飞行" + 0.004*"有" + 0.004*"飞行员" + 0.003*"也" + 0.003*"就"
0.038*"专业" + 0.030*"考生" + 0.022*"招生" + 0.016*"录取" + 0.015*"志愿" + 0.013*"大学" + 0.011*"学校" + 0.010*"院校" + 0.010*"本科" + 0.009*"计划" + 0.007*"高考" + 0.007*"成绩" + 0.007*"高校" + 0.007*"规定" + 0.006*"工作" + 0.006*"为" + 0.006*"填报" + 0.005*"不" + 0.005*"考试" + 0.005*"分"
0.015*"导弹" + 0.015*"是" + 0.011*"海军" + 0.010*"美军" + 0.008*"有" + 0.007*"美国" + 0.005*"战斗机" + 0.005*"军事" + 0.005*"飞机" + 0.005*"不" + 0.005*"潜艇" + 0.005*"到" + 0.004*"也" + 0.004*"朝鲜" + 0.004*"为" + 0.004*"时" + 0.004*"将" + 0.004*"说" + 0.003*"还" + 0.003*"大"
0.012*"军事" + 0.009*"部队" + 0.009*"演习" + 0.009*"系统" + 0.009*"作战" + 0.009*"台湾" + 0.008*"中国" + 0.007*"武器" + 0.007*"是" + 0.007*"装备" + 0.006*"将" + 0.006*"能力" + 0.006*"发展" + 0.006*"伊朗" + 0.006*"技术" + 0.006*"战略" + 0.005*"军队" + 0.005*"进行" + 0.005*"美国" + 0.005*"国家"
0.008*"时间" + 0.007*"5月" + 0.005*"考生" + 0.005*"是" + 0.004*"作文" + 0.004*"大臣" + 0.004*"为" + 0.004*"答题" + 0.004*"语文" + 0.004*"南沙" + 0.004*"关羽" + 0.003*"朝" + 0.003*"豆腐块" + 0.003*"庐山" + 0.003*"有" + 0.003*"项" + 0.003*"地点" + 0.003*"高考" + 0.003*"改" + 0.003*"横"
0.012*"教育" + 0.011*"是" + 0.011*"工作" + 0.010*"中国" + 0.010*"文化" + 0.010*"发展" + 0.008*"人才" + 0.006*"大学" + 0.006*"社会" + 0.006*"建设" + 0.006*"旅游" + 0.005*"国家" + 0.005*"毕业生" + 0.004*"职业" + 0.004*"学生" + 0.004*"为" + 0.004*"学校" + 0.004*"新" + 0.004*"有" + 0.004*"学习"
0.012*"是" + 0.011*"面试" + 0.007*"不" + 0.006*"工作" + 0.006*"学生" + 0.006*"复习" + 0.005*"大学" + 0.005*"也" + 0.005*"有" + 0.005*"要" + 0.004*"进行" + 0.004*"能" + 0.004*"人" + 0.004*"指挥员" + 0.004*"时" + 0.003*"可" + 0.003*"多" + 0.003*"教学" + 0.003*"简历" + 0.003*"到"
0.017*"旅游" + 0.017*"飞机" + 0.010*"游客" + 0.009*"成都" + 0.007*"旅游局" + 0.007*"集团军" + 0.007*"机场" + 0.007*"铁路" + 0.006*"飞" + 0.006*"航班" + 0.006*"上海" + 0.005*"鱼雷" + 0.005*"旅客" + 0.005*"景区" + 0.005*"到" + 0.005*"将" + 0.004*"沈阳" + 0.004*"线路" + 0.004*"是" + 0.004*"为"
0.041*"中国" + 0.033*"美国" + 0.028*"日本" + 0.011*"将" + 0.010*"印度" + 0.009*"俄罗斯" + 0.008*"是" + 0.008*"国家" + 0.005*"国际" + 0.005*"计划" + 0.005*"中" + 0.005*"韩国" + 0.004*"战争" + 0.004*"英国" + 0.004*"伊朗" + 0.004*"世界" + 0.004*"国" + 0.004*"技术" + 0.004*"美" + 0.004*"俄"
0.026*"训练" + 0.012*"战术" + 0.011*"搜狐" + 0.006*"是" + 0.006*"为" + 0.005*"不" + 0.005*"右派" + 0.005*"远程" + 0.005*"军" + 0.004*"击败" + 0.004*"到" + 0.004*"将" + 0.004*"歼" + 0.003*"大" + 0.003*"直播员" + 0.003*"队长" + 0.003*"杀" + 0.003*"反击" + 0.003*"指数" + 0.003*"防守"
0.058*"公司" + 0.008*"投资" + 0.008*"为" + 0.007*"会计" + 0.007*"贷款" + 0.006*"银行" + 0.005*"相关" + 0.005*"有限公司" + 0.005*"合同" + 0.004*"集团" + 0.004*"资产" + 0.004*"网页." + 0.004*"共" + 0.004*"保险" + 0.003*"审计" + 0.003*"财务" + 0.003*"会议" + 0.003*"明思克" + 0.003*"不" + 0.003*"支付"
0.006*"派" + 0.005*"战舰" + 0.004*"为" + 0.004*"不" + 0.003*"连长" + 0.003*"出" + 0.003*"打" + 0.003*"利刃" + 0.003*"江青" + 0.003*"澳门" + 0.003*"国军" + 0.002*"任" + 0.002*"军长" + 0.002*"下" + 0.002*"关岛" + 0.002*"胡适" + 0.002*"刘彻" + 0.002*"总司令" + 0.002*"入" + 0.002*"陈独秀"
0.038*"是" + 0.020*"不" + 0.015*"有" + 0.013*"就" + 0.012*"也" + 0.012*"人" + 0.010*"都" + 0.008*"说" + 0.007*"要" + 0.007*"会" + 0.006*"到" + 0.006*"能" + 0.006*"很" + 0.005*"没有" + 0.005*"大" + 0.005*"来" + 0.004*"还" + 0.004*"好" + 0.004*"多" + 0.004*"让"
0.010*"有" + 0.009*"是" + 0.007*"不" + 0.004*"就" + 0.004*"吃" + 0.004*"可以" + 0.004*"人" + 0.004*"可" + 0.003*"要" + 0.003*"时" + 0.003*"为" + 0.003*"都" + 0.003*"也" + 0.003*"大" + 0.003*"多" + 0.003*"会" + 0.003*"如" + 0.002*"最" + 0.002*"断代" + 0.002*"到"
0.008*"作品" + 0.006*"敌" + 0.006*"照片" + 0.005*"兵力" + 0.005*"将" + 0.005*"电影" + 0.005*"中国" + 0.005*"观众" + 0.005*"图片" + 0.004*"比赛" + 0.004*"部队" + 0.004*"师" + 0.004*"伊拉克" + 0.004*"参演" + 0.004*"图" + 0.004*"是" + 0.003*"侦察" + 0.003*"对手" + 0.003*"战机" + 0.003*"埃及"
0.014*"解放军" + 0.013*"战斗" + 0.010*"防空" + 0.008*"是" + 0.007*"空军" + 0.007*"战斗力" + 0.007*"中国" + 0.006*"比赛" + 0.006*"'" + 0.005*"演练" + 0.004*"将" + 0.004*"也" + 0.004*"军" + 0.004*"队员" + 0.004*"编队" + 0.004*"大" + 0.003*"到" + 0.003*"胜利" + 0.003*"不" + 0.003*"体育"
0.015*"为" + 0.008*"我军" + 0.006*"推演" + 0.006*"文明" + 0.005*"是" + 0.005*"蒋介石" + 0.004*"斯大林" + 0.004*"历史" + 0.003*"师团" + 0.003*"不" + 0.003*"有" + 0.003*"苏" + 0.003*"作者" + 0.003*"考古" + 0.003*"文章" + 0.002*"文化" + 0.002*"机动" + 0.002*"大" + 0.002*"猎鹰" + 0.002*"实"
0.023*"是" + 0.017*"不" + 0.011*"企业" + 0.010*"有" + 0.008*"问题" + 0.006*"也" + 0.005*"工作" + 0.005*"就" + 0.005*"会" + 0.005*"要" + 0.004*"到" + 0.004*"社会" + 0.004*"人" + 0.004*"大" + 0.004*"没有" + 0.004*"经济" + 0.004*"高" + 0.003*"能" + 0.003*"工资" + 0.003*"认为"
0.016*"公司" + 0.013*"是" + 0.013*"航空" + 0.012*"中国" + 0.011*"将" + 0.010*"市场" + 0.009*"大" + 0.007*"也" + 0.007*"有" + 0.006*"目前" + 0.006*"发展" + 0.006*"网络" + 0.006*"企业" + 0.005*"航母" + 0.005*"到" + 0.004*"最" + 0.004*"飞机" + 0.004*"更" + 0.004*"新" + 0.004*"行业"
0.015*"是" + 0.007*"皇帝" + 0.006*"有" + 0.006*"历史" + 0.005*"大" + 0.004*"不" + 0.004*"为" + 0.004*"文化" + 0.003*"出" + 0.003*"也" + 0.003*"义和团" + 0.003*"仿佛" + 0.002*"文章" + 0.002*"又" + 0.002*"时期" + 0.002*"就" + 0.002*"宗教" + 0.002*"纳粹" + 0.002*"人" + 0.002*"开"
```
Here is the complete source code:
```python
import os
from pathlib import Path
import json

project_dir = Path('D:\\项目\\lda主题提取')
source_dir = project_dir / 'source'
words_dir = project_dir / 'words'

import pyltp
from pyltp import SentenceSplitter
from pyltp import Segmentor

ci_model_path = 'D:\\mysites\\text-characters\\tcharacters\\ltp\\ltp_data\\cws.model'
sspliter = SentenceSplitter()
wspliter = Segmentor()
wspliter.load(ci_model_path)


def fenci(filepath: Path):
    relative = filepath.relative_to(source_dir)
    write_path = words_dir / relative
    try:
        passage = open(filepath, 'r', encoding='gbk').read()
    except UnicodeDecodeError:
        print('Error to Open file:{}'.format(filepath))
        return
    sentences = sspliter.split(passage)
    splited = []
    for sent in sentences:
        words = wspliter.segment(sent)
        splited.append(list(words))
    write_path.parent.mkdir(parents=True, exist_ok=True)
    f = open(write_path, 'w')
    f.write(json.dumps(splited))
    f.close()


def fenci_all():
    d: Path
    for d in source_dir.iterdir():
        for fpath in d.iterdir():
            print('split {}'.format(fpath))
            fenci(fpath)


fenci_all()
wspliter.release()

from pyltp import Postagger

poster = Postagger()
poster.load('D:\\mysites\\text-characters\\tcharacters\\ltp\\ltp_data\\pos.model')


def select_word(sentence: list):
    '''Keep only words whose POS tag is in
    ['a', 'b', 'd', 'i', 'j', 'n', 'nh', 'ni', 'nl', 'ns', 'nt', 'nz', 'v'];
    see https://www.ltp-cloud.com/intro/#pos_how for the meaning of each tag.'''
    keep = ['a', 'b', 'd', 'i', 'j', 'n', 'nh', 'ni', 'nl', 'ns', 'nt', 'nz', 'v']
    tags = poster.postag(sentence)
    for i in range(len(sentence)-1, -1, -1):
        if tags[i] not in keep:
            del sentence[i]


def load_corpus():
    d: Path
    corpus = []
    for d in words_dir.iterdir():
        for fpath in d.iterdir():
            print('reading {}'.format(fpath))
            passage = json.load(open(fpath, 'r'))
            for sentence in passage:
                select_word(sentence)
            passage = [word for sentence in passage for word in sentence]
            corpus.append(passage)
    return corpus


corpus = load_corpus()

from gensim import models, corpora

dic = corpora.Dictionary(corpus)
clean_data = [dic.doc2bow(words) for words in corpus]

lda = models.ldamodel.LdaModel(clean_data, id2word=dic, num_topics=20)
for topic in lda.print_topics(num_words=10):
    print(topic[1])
```
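If you want to reuse the trained model later without re-training it, gensim models can be saved to disk and loaded back; a minimal sketch with made-up file names:

```python
# persist the trained model and dictionary
lda.save(str(project_dir / 'news.lda'))
dic.save(str(project_dir / 'news.dict'))

# ...in a later session...
lda = models.ldamodel.LdaModel.load(str(project_dir / 'news.lda'))
dic = corpora.Dictionary.load(str(project_dir / 'news.dict'))
```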