大数据揭示心理学规律(利用gensim训练词向量)

分享时@该用户已经被封, 我就能回答你的问题奥!

文章目录
  1. 1. 简介
  2. 2. 技术路线
  3. 3. 代码部分
    1. 3.1. 这是构建语料库的类
    2. 3.2. 实例化一个语料库
    3. 3.3. 保存分词结果
    4. 3.4. 训练词向量
    5. 3.5. 保存词向量模型
  4. 4. 结果展示
    1. 4.1. 用100维向量表示一个词
    2. 4.2. 找到最相近的n个词
    3. 4.3. 概念间的计算
    4. 4.4. 寻找概念间的关系
  5. 5. 视频演示


简介

本文通过训练词向量的方式试图解释一些心理学现象. 在实验之前, 我并没有太多的想法和对结果的预期, 只是抱着试试看的态度来的. 不过最终看效果, 还是蛮有意思的. 对代码不感兴趣的人可以直接跳过代码部分, 直接看结果展示部分.

技术路线

本次实验是前两次实验的继承, 前两篇分别是<中文语料库的构建>和<爬去知网期刊论文信息和摘要信息>. 具体过程如下:

  • 这次实验的语料来自知网论文的摘要, 爬取了大概有7000篇论文摘要(搜索关键词有完美主义, 幸福, 自卑等, 还有一些无关词语).
  • 然后构建了一个中文语料库.
  • 使用pyltp进行了分句和分词
  • 使用gensim进行词向量的训练

代码部分

这是构建语料库的类

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
# author: mlln.cn
# email: xxxspy@126.com
# qq: 675495787
from gensim.corpora.textcorpus import TextDirectoryCorpus
import jieba
from pyltp import Segmentor
from pyltp import SentenceSplitter
from gensim.corpora.dictionary import Dictionary
class MyTextDirCorpus(TextDirectoryCorpus):

def __init__(self, input, **kwargs):

kwargs['tokenizer'] = self.tokenizer
super().__init__(input, **kwargs)

def tokenizer(self, text):
if not hasattr(self, '_segmentor'):
model_path = r'D:\mysites\text-characters\tcharacters\ltp\ltp_data\cws.model'
segmentor = Segmentor() # 初始化实例
segmentor.load(model_path)
self._segmentor = segmentor
segmentor = self._segmentor
sents = SentenceSplitter.split(text)
words = [list(segmentor.segment(s)) for s in sents]
return words

def __del__(self):
'''释放资源'''
if hasattr(self, '_segmentor'):
self._segmentor.release()
try:
super().__del__()
except AttributeError:
pass

# 为了强制使用'utf8'编码, 我们复写了这个方法
def getstream(self):
"""Yield documents from the underlying plain text collection (of one or more files).
Each item yielded from this method will be considered a document by subsequent
preprocessing methods.
If `lines_are_documents` was set to True, items will be lines from files. Otherwise
there will be one item per file, containing the entire contents of the file.
"""
num_texts = 0
for path in self.iter_filepaths():
with open(path, 'rt', encoding='utf8') as f:
if self.lines_are_documents:
for line in f:
yield line.strip()
num_texts += 1
else:
content = f.read().strip()
yield content
num_texts += 1

self.length = num_texts

def get_texts_from_tokens(self):
for fpath in self.iter_filepaths():
fpath = Path(fpath)
token_path = fpath.parent / (fpath.name + '.cached_tokens')
yield pickle.loads(token_path.read_bytes())

def save_tokens(self):
'''保存tokens到硬盘, 只需要运行一次'''
for fpath in self.iter_filepaths():
fpath = Path(fpath)
token_path = fpath.parent / (fpath.name + '.cached_tokens')
txt = fpath.read_text(encoding='utf8').strip()
tokens = self.tokenizer(txt)
token_path.write_bytes(pickle.dumps(list(tokens)))

def save_dictionary(self, dpath):
'''把字典保存到硬盘'''
self.dictionary.save_as_text(fname=dpath)

def sentence_iter(self):
texts = self.get_texts_from_tokens()
for text in texts:
for sent in text:
yield sent

实例化一个语料库

1
2
3
4
5
6
7
8
9
10
dpath = r'D:\mysites\cnki-search\OUTPUTS'
dic_path = 'dictionary.dict'
dic = Dictionary.load_from_text(dic_path)
corpus = MyTextDirCorpus(dpath,
min_depth=1,
dictionary=dic,
pattern='.*\.abstract$',
lines_are_documents=False,
token_filters=[])
len(dic)
输出: 47345

保存分词结果

1
2
3
4
# 保存tokens
from pathlib import Path
import pickle
corpus.save_tokens()

训练词向量

1
2
3
4
5
6
from gensim.models.word2vec import Word2Vec   


model= Word2Vec()
model.build_vocab(corpus.sentence_iter())
model.train(corpus.sentence_iter(), total_examples=model.corpus_count, epochs=model.epochs)
输出: (2390588, 3090228)

保存词向量模型

1
2
model_path = r"D:\mysites\cnki-search\OUTPUTS\vector.model"
model.save(model_path)

结果展示

用100维向量表示一个词

1
model['基于']
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead). """Entry point for launching an IPython kernel.
输出: array([ 0.20481601, 0.5714214 , -1.3871742 , 0.06323086, -0.9507892 , 0.74798256, -0.7659924 , 1.5762316 , -0.48113433, -0.47097418, -0.26745576, 0.27904513, -0.3113326 , -0.14472784, -0.5994872 , -0.04263587, -1.0676603 , -0.3256847 , 0.15162984, -0.04746385, -0.67858386, 0.06392714, -0.65016 , -0.21514435, -0.7567799 , 0.2408507 , -0.8350066 , -0.44951594, 0.17692119, -0.32248417, -0.5894241 , -0.39148644, -0.2975546 , 0.16461797, -0.30091375, -0.43784717, 0.31717032, 0.4437195 , -0.9399812 , -0.22935824, 0.17857902, 0.95873344, 0.5222857 , 0.36783326, -0.00854115, -1.5987526 , 0.00588302, 0.13465759, 0.7462688 , 0.54626375, -0.36806348, 0.92049253, -0.05161149, -0.77675563, 0.36425716, -0.5438 , -0.32902893, 0.02579404, 0.52441996, -0.31838223, -0.02959022, -0.9052237 , -0.24905084, -0.4933001 , -0.05206076, -0.30906802, 0.5123301 , -0.7087073 , 0.08212207, -0.80762994, 0.08510961, 0.2877228 , -1.2759535 , 1.4057134 , 0.56749785, -1.0927128 , 0.30251205, -0.513295 , 0.18849482, -1.0221461 , -0.07769577, 1.0415772 , -0.60217077, -0.25533095, 0.55302244, 0.6333061 , 0.41302222, 0.46321914, -1.2012811 , -0.1631474 , 0.5391119 , -0.08092288, -0.26842424, 0.14027756, 0.88385963, 0.05708294, -0.629938 , -1.2831743 , -0.16318122, -0.51519424], dtype=float32)

找到最相近的n个词

1
model.most_similar(['我'])
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead). """Entry point for launching an IPython kernel.
输出: [('他', 0.8343698382377625), ('善', 0.8118964433670044), ('你', 0.7913036942481995), ('大人', 0.7887542247772217), ('作家', 0.7695958018302917), ('五四', 0.7694318294525146), ('童年', 0.76503586769104), ('她', 0.7642527222633362), ('说', 0.7629194259643555), ('至善', 0.7568099498748779)]

概念间的计算

找到没有自尊的完美主义者:

1
2
v =  model['完美主义'] - model['自尊']
model.most_similar(positive=[v], topn=10)
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead). """Entry point for launching an IPython kernel. d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
输出: [('完美主义', 0.7952172756195068), ('meta', 0.4440059959888458), ('定式', 0.4229651689529419), ('幸福观', 0.42237338423728943), ('感性认识', 0.4190525710582733), ('人格', 0.3875422179698944), ('犯罪', 0.36745792627334595), ('自我', 0.3505035638809204), ('拖延', 0.3485129177570343), ('历时性', 0.345354288816452)]

寻找概念间的关系

1
2

model.most_similar(positive=['完美主义', '消极'], negative=[], topn=20)
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:2: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead).
输出: [('自尊', 0.8451846837997437), ('倾向', 0.8100054860115051), ('反刍', 0.7790679931640625), ('特质', 0.7788870334625244), ('自悯', 0.7783106565475464), ('情绪', 0.7755159139633179), ('自卑感', 0.7730893492698669), ('态度', 0.7722681760787964), ('自杀', 0.7719931602478027), ('拖延', 0.7688606381416321), ('适应性', 0.761386513710022), ('抑郁', 0.7555066347122192), ('人格', 0.7441696524620056), ('学业', 0.7429016828536987), ('强迫', 0.7383431792259216), ('风格', 0.734379768371582), ('信念', 0.732912540435791), ('完美主义者', 0.731903612613678), ('外表', 0.720329225063324), ('显', 0.7195327281951904)]
1
model.most_similar(positive=['完美主义', '幸福'], negative=[], topn=20)
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead). """Entry point for launching an IPython kernel.
输出: [('幸福感', 0.6743337512016296), ('幸福观', 0.6489542722702026), ('自卑感', 0.6120619177818298), ('自我', 0.607430100440979), ('婚恋观', 0.5844637155532837), ('自我批评', 0.5719105005264282), ('自然界', 0.5718179941177368), ('幸福度', 0.5650844573974609), ('阶级性', 0.5615861415863037), ('自尊', 0.5536830425262451), ('追求', 0.5530370473861694), ('个人', 0.5523806214332581), ('整饰', 0.547888457775116), ('人格', 0.5456241369247437), ('畅爽', 0.5450509190559387), ('信仰', 0.5448511838912964), ('世界观', 0.5446830987930298), ('人生观', 0.542322039604187), ('金钱观', 0.5381442308425903), ('优越感', 0.5363531112670898)]
1
model.most_similar(positive=['完美主义'], negative=['卓越'], topn=20)
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead). """Entry point for launching an IPython kernel.
输出: [('感性认识', 0.5076549053192139), ('自信感', 0.4832497835159302), ('定式', 0.43401259183883667), ('拖延三者', 0.4103550910949707), ('差异.', 0.4100268483161926), ('回归法', 0.4001217484474182), ('变量', 0.38459616899490356), ('6-12', 0.36245742440223694), ('拖延', 0.34290611743927), ('meta', 0.33427050709724426), ('流利', 0.3265514373779297), ('产后', 0.32360613346099854), ('结果', 0.30865421891212463), ('预调查', 0.30783766508102417), ('政界', 0.30159276723861694), ('488', 0.2899872660636902), ('指导语', 0.28705894947052), ('318', 0.28507083654403687), ('1986', 0.2814595699310303), ('因子', 0.280342698097229)]
1
model.most_similar(positive=['完美主义', '成功'], negative=[], topn=20)
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead). """Entry point for launching an IPython kernel.
输出: [('成就', 0.7487295269966125), ('自我', 0.7096918225288391), ('乐观', 0.7087293863296509), ('自尊', 0.6956290006637573), ('积极', 0.6946340799331665), ('依赖', 0.6929352283477783), ('归因', 0.6833783388137817), ('倾向', 0.6816843748092651), ('保存', 0.6750534772872925), ('自信', 0.6739674806594849), ('创造力', 0.6691659688949585), ('倦怠', 0.6682588458061218), ('生涯', 0.663487434387207), ('兴趣', 0.6600993871688843), ('创造性', 0.6511282324790955), ('效能', 0.6472607851028442), ('成功感', 0.6456454992294312), ('能力', 0.6446733474731445), ('定向', 0.6438237428665161), ('反刍', 0.6435714364051819)]
1
model.most_similar(positive=['完美主义', '失败'], negative=[], topn=20)
输出: d:\mysites\deeplearning.ai-master\.env\lib\site-packages\ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `most_similar` (Method will be removed in 4.0.0, use self.wv.most_similar() instead). """Entry point for launching an IPython kernel.
输出: [('自尊', 0.8850756287574768), ('倾向', 0.8513725996017456), ('自悯', 0.8307703733444214), ('成就', 0.8234578371047974), ('消极', 0.8121293783187866), ('学业', 0.8060901165008545), ('抑郁', 0.8046209812164307), ('回避', 0.804327130317688), ('归因', 0.8038434982299805), ('强迫', 0.8031525611877441), ('信念', 0.7968993186950684), ('拖延', 0.7954033017158508), ('情绪', 0.7894195318222046), ('知觉', 0.7877265810966492), ('风格', 0.7872787714004517), ('适应性', 0.785957932472229), ('神经质', 0.7785789370536804), ('反刍', 0.7781296372413635), ('苦恼', 0.7760308384895325), ('自卑感', 0.7698969841003418)]

视频演示

注意
本文由jupyter notebook转换而来, 您可以在这里下载notebook.ipynb)
有问题可以直接在下方留言
或者给我发邮件675495787[at]qq.com
请记住我的网址: mlln.cn 或者 jupyter.cn