{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: 文本向量系列-如何基于词频矩阵和TF-IDF权重构建词向量\n", "date: 2018-08-18 18:17:55\n", "tags: [python, 文本挖掘]\n", "toc: true\n", "xiongzhang: true\n", "xiongzhang_images: [main.jpg]\n", "\n", "---\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 系列文章\n", "\n", "这个是系列博客, 所有文章链接都列在这里, 并持续更新中。\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 篇章/词频矩阵\n", "\n", "基于矩阵的分布表示通常又称为语义分布模型PU,该方法的主要思想是构建一个共现矩阵,矩阵的每行对应一个单词,每列表示一种上下文(通常是一篇文章),而每个元素的值为对应单词与上下文在语料库中的共现次数。因此,每个单词可由矩阵中对应的行向量表示,而任意两个单词的相似性可直接由它们向量的相似性衡量。\n", "\n", "下面就是一个共现矩阵, 每一行就是一个词向量:\n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", "
文档1文档2文档3...
词1103...
词210...
词3100...
.1343...
\n", "
\n", "\n", "### TF-IDF权重\n", "\n", "TF-IDF基于上面提到的共现矩阵。从实际操作角度来说, [TF-IDF](http://mlln.cn)只是对共现矩阵进行了加权。\n", "\n", "传统经典模型TF-IDF以及一些基于它改进的方法:主要思想是通过提取文本中词语的权重来标识句子, 使文本构成向量表达。权重主要由两部分组成, 即该词语在文本中的频率 (term frequency, TF) 与反文档频率 (inverse document frequency, IDF) 。它衡量了一个词的常见程度,TF-IDF的假设是:如果某个词或短语在一篇文章中出现的频率高,并且在其他文章中很少出,那么它很可能就反映了这篇文章的特性,因此要提高它的权值。然而这种方法太过于依赖词语的共现, 加上本身短文本消息就由很少的字组成, 往往实际应用中得不到很好的效果。因为两个文本消息可能没有共同的词语但也可以语义相关, 相反如果两个文本消息有一些共同的词语也不一定语义相关。如”富士苹果很好吃, 赶紧买”, “苹果六代真好用, 赶紧买”和”乔布斯逝世了”。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 语料库\n", "\n", "因为中文语料库往往需要涉及[分词](http://mlln.cn), 之后分词后才能对词进行向量化, 但是目前分词还不是我们教程的内容, 所以为了降低学习难度, 我们使用英文作为这次教程的语料库, 这个语料库比较小, 只有几个文档, 我们直接写在代码里了:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "documents = [\"Human machine interface for lab abc computer applications\",\n", " \"A survey of user opinion of computer system response time\",\n", " \"The EPS user interface management system\",\n", " \"System and human system engineering testing of EPS\",\n", " \"Relation of user perceived response time to error measurement\",\n", " \"The generation of random binary unordered trees\",\n", " \"The intersection graph of paths in trees\",\n", " \"Graph minors IV Widths of trees and well quasi ordering\",\n", " \"Graph minors A survey\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 计算词典和词频数矩阵\n", "\n", "词典就是词到词id的映射, 这样我们可以用id(一个整数)表示一个词了。词频矩阵参考上面的表格。" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "from itertools import chain\n", "import numpy as np\n", "\n", "def word_matrix(documents):\n", " '''计算词频矩阵'''\n", " # 所有字母转换位小写\n", " docs = [d.lower() for d in documents]\n", " # 分词\n", " docs = [d.split() for d in docs]\n", " # 获取所有词\n", " words = list(set(chain(*docs)))\n", " # 词到ID的映射, 使得每个词有一个ID\n", " dictionary = dict(zip(words, range(len(words))))\n", " # 创建一个空的矩阵, 行数等于词数, 列数等于文档数\n", " matrix = np.zeros((len(words), len(docs)))\n", " # 逐个文档统计词频\n", " for col, d in enumerate(docs):\n", " # 统计词频\n", " count = Counter(d)\n", " for word in count:\n", " # 用word的id表示word在矩阵中的行数\n", " id = dictionary[word]\n", " # 把词频赋值给矩阵\n", " matrix[id, col] = count[word]\n", " return matrix, dictionary\n", "\n", "matrix, dictionary = word_matrix(documents)\n" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 0. 0. 0. 0. 1. 0. 0. 0.]\n", " [0. 0. 0. 1. 0. 0. 0. 0. 0.]\n", " [0. 1. 0. 0. 0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0. 0. 0. 1. 1.]\n", " [0. 0. 1. 1. 0. 0. 0. 0. 0.]\n", " [1. 0. 0. 0. 0. 0. 0. 0. 0.]\n", " [0. 0. 0. 0. 0. 0. 0. 1. 0.]\n", " [0. 0. 1. 0. 0. 0. 0. 0. 0.]\n", " [1. 0. 0. 0. 0. 0. 0. 0. 0.]\n", " [0. 1. 1. 2. 0. 0. 0. 0. 0.]]\n" ] }, { "data": { "text/plain": [ "{'unordered': 0,\n", " 'engineering': 1,\n", " 'opinion': 2,\n", " 'minors': 3,\n", " 'eps': 4,\n", " 'abc': 5,\n", " 'iv': 6,\n", " 'management': 7,\n", " 'machine': 8,\n", " 'system': 9,\n", " 'random': 10,\n", " 'ordering': 11,\n", " 'well': 12,\n", " 'the': 13,\n", " 'relation': 14,\n", " 'a': 15,\n", " 'perceived': 16,\n", " 'and': 17,\n", " 'of': 18,\n", " 'graph': 19,\n", " 'widths': 20,\n", " 'computer': 21,\n", " 'quasi': 22,\n", " 'user': 23,\n", " 'lab': 24,\n", " 'for': 25,\n", " 'trees': 26,\n", " 'to': 27,\n", " 'binary': 28,\n", " 'time': 29,\n", " 'testing': 30,\n", " 'in': 31,\n", " 'survey': 32,\n", " 'error': 33,\n", " 'human': 34,\n", " 'intersection': 35,\n", " 'paths': 36,\n", " 'interface': 37,\n", " 'applications': 38,\n", " 'measurement': 39,\n", " 'generation': 40,\n", " 'response': 41}" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(matrix[:10,:10])\n", "dictionary" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'unordered': 0, 'engineering': 1, 'opinion': 2, 'minors': 3, 'eps': 4, 'abc': 5, 'iv': 6, 'management': 7, 'machine': 8, 'system': 9, 'random': 10, 'ordering': 11, 'well': 12, 'the': 13, 'relation': 14, 'a': 15, 'perceived': 16, 'and': 17, 'of': 18, 'graph': 19, 'widths': 20, 'computer': 21, 'quasi': 22, 'user': 23, 'lab': 24, 'for': 25, 'trees': 26, 'to': 27, 'binary': 28, 'time': 29, 'testing': 30, 'in': 31, 'survey': 32, 'error': 33, 'human': 34, 'intersection': 35, 'paths': 36, 'interface': 37, 'applications': 38, 'measurement': 39, 'generation': 40, 'response': 41}\n" ] } ], "source": [ "print(dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF权重计算\n", "\n", "#### 计算TF\n", "\n", "词在每个文档中的频数已经由上面的步骤计算得到, 下面我们要计算TF(词频率term frequency), 它其实就是每个词的频数除以文档的总词数:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 8. 10. 6. 8. 9. 7. 7. 10. 4.]\n" ] }, { "data": { "text/plain": [ "array([[0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0.125 , 0. ],\n", " [0. , 0.1 , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0.16666667, 0.125 , 0. ]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def tf(matrix):\n", " # 计算每个文档的总次数\n", " sm = np.sum(matrix, axis=0)\n", " print(sm)\n", " # 每个词的词频除以每个文档的词频\n", " return matrix / sm\n", "\n", "tf(matrix)[:5, :5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 计算IDF\n", "\n", "逆向文件频率(inverse document frequency,IDF)是一个词语普遍重要性的度量, 某一特定词语的idf,可以由总文件数目除以包含该词语之文件的数目,再将得到的商取以10为底的对数得到:\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "其中:\n", "\n", "|D|: 语料库中的文档总数\n", "分子: 包含词ti的文档数\n", "\n", "下面我们用代码来计算idf:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1 1 1 2 2 1 1 1 1 3 1 1 1 3 1 2 1 2 6 3 1 2 1 3 1 1 3 1 1 2 1 1 2 1 2 1 1\n", " 2 1 1 1 2]\n" ] }, { "data": { "text/plain": [ "array([9. , 9. , 9. , 4.5, 4.5, 9. , 9. , 9. , 9. , 3. , 9. , 9. , 9. ,\n", " 3. , 9. , 4.5, 9. , 4.5, 1.5, 3. , 9. , 4.5, 9. , 3. , 9. , 9. ,\n", " 3. , 9. , 9. , 4.5, 9. , 9. , 4.5, 9. , 4.5, 9. , 9. , 4.5, 9. ,\n", " 9. , 9. , 4.5])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def idf(matrix):\n", " '''计算IDF'''\n", " # 文档总数\n", " D = matrix.shape[1]\n", " # 包含每个词的文档数\n", " j = np.sum(matrix>0, axis=1)\n", " print(j)\n", " return D / j\n", "\n", "idf(matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 计算tf-idf\n", "\n", "这个就很简单了:\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "下面是用代码实现的tf-idf函数:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 8. 10. 6. 8. 9. 7. 7. 10. 4.]\n", "[1 1 1 2 2 1 1 1 1 3 1 1 1 3 1 2 1 2 6 3 1 2 1 3 1 1 3 1 1 2 1 1 2 1 2 1 1\n", " 2 1 1 1 2]\n" ] }, { "data": { "text/plain": [ "array([[0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 1.125 , 0. ],\n", " [0. , 0.9 , 0. , 0. , 0. ],\n", " [0. , 0. , 0. , 0. , 0. ],\n", " [0. , 0. , 0.75 , 0.5625, 0. ]])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def tf_idf(matrx):\n", " return tf(matrix) * idf(matrix).reshape(matrix.shape[0], 1)\n", "\n", "weights = tf_idf(matrix)\n", "weights[:5, :5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 词向量\n", "\n", "我们得到了weights矩阵, 它的每一行就是一个词的词向量, 每一列就是一个文档的向量, 如果要计算词的相似性可以计算行向量的余弦值, 如果要计算文档的相似性, 我们可以计算列向量的余弦值。\n", "\n", "下面分别计算词和文档的相关矩阵:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 词相关矩阵\n", "\n", "由于语料库太小了, 只有十几篇文档, 导致很多词的词向量都是相同的, 所以计算得到的词之间的相关很可能出现一些强相关, 但并不意味着他们由语义相关。但是在语料库比较大的时候, 很难出现两个词的词向量相同, 哪怕两个词是同义词也不太可能。\n", "\n", "但是, 就今天的例子而言, 结果不是很好。大部分cosine值都是1附近, 说明词之间的相关几乎是0。这种方法的效果较差, 很难揭示语义相似性。" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQQAAAECCAYAAAAYUakXAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAFSpJREFUeJzt3X+0VWWdx/H3N0Ipf6QoGoEjSGkRYzcjpEEdxQzFlmLLmbJmctRSMwpXMyg2a000U6PmGFrmb1GcTG1MB5dhykKcdJUYKBevgvJTwYuABgaZBvidP86+T9ez9+buc/be595z7+e11l33nOfufZ7nOV6+7vPs536/5u6IiAC8q7sHICI9hwKCiAQKCCISKCCISKCAICKBAoKIBA0PCGZ2opk9b2YrzGxayX2tMbNnzGyxmS0s+LVnmtlGM2vr1DbQzOaa2fLo+74l9TPdzF6O5rXYzCYW0M9BZjbfzJaa2bNmNiVqL2NOaX2VMa8BZvakmbVGfX03ah9uZguied1tZruV1M9tZra605xa8s4pet1+Zva0mT1Q6HzcvWFfQD9gJXAIsBvQCowssb81wP4lvfYxwBFAW6e2HwDTosfTgMtL6mc68C8Fz2cwcET0eC/gBWBkSXNK66uMeRmwZ/S4P7AAGAv8HPhC1H498LWS+rkNOL2E379vAT8DHoieFzKfRl8hjAFWuPsqd/8zcBdwaoPHUAh3/zXw+6rmU4FZ0eNZwKSS+imcu69396eix1uBpcAQyplTWl+F84pt0dP+0ZcD44F7ovbc89pFP4Uzs6HAycDN0XOjoPk0OiAMAdZ2er6Okn4RIg48bGaLzOzcEvvpcKC7r4fKLz1wQIl9TTazJdFHityX8Z2Z2TDg41T+L1fqnKr6ghLmFV1eLwY2AnOpXKVucfcd0SGF/B5W9+PuHXP6fjSnGWa2e95+gKuAi4C3o+f7UdB8Gh0QLKGtzL3T49z9COAk4OtmdkyJfTXSdcAIoAVYD1xZ1Aub2Z7AL4AL3f0PRb1uxr5KmZe773T3FmAolavUjyQdVnQ/ZjYKuAT4MPBJYCBwcZ4+zOyzwEZ3X9S5OWk49bx+owPCOuCgTs+HAu1ldebu7dH3jcB9VH4ZyrTBzAYDRN83ltGJu2+IfvneBm6ioHmZWX8q/0DvcPd7o+ZS5pTUV1nz6uDuW4BHqXy238fM3h39qNDfw079nBh9PHJ3fwu4lfxzGgecYmZrqHzkHk/liqGQ+TQ6IPwO+FC0Irob8AXg/jI6MrM9zGyvjsfAZ4C2XZ+V2/3AmdHjM4HZZXTS8Q80choFzCv6HHoLsNTdf9jpR4XPKa2vkuY1yMz2iR6/B/g0lTWL+cDp0WG555XSz7JOwdSofK7PNSd3v8Tdh7r7MCr/fh5x9y9R1HyKXv3MsDo6kcqq8krgX0vs5xAqdzFagWeL7gu4k8pl7XYqVz7nUPksNw9YHn0fWFI//w08Ayyh8g92cAH9HEXlMnMJsDj6mljSnNL6KmNehwNPR6/ZBvxbp9+PJ4EVwP8Au5fUzyPRnNqAnxLdiSjod/BY/nKXoZD5WPRiIiLaqSgif6GAICKBAoKIBAoIIhIoIIhI0G0BoUFbiRvWTyP70pzUV1n9dOcVQqP+ozTsP34D+9Kc1Fcp/eQKCNbA3AYiUr66NyaZWT8qOw5PoLKD7nfAGe7+XNo5u9nuPoA9ANjOW/Rndw49/I26+s9q02s7GbRfv9Je/4Ul7w2PO+YEZJ5X26uDEttH7b8p9Zx65/TM1v1ibX+912u7PKfs9y+tn+deib8vI98ff086v/+d7er9b9ScGtFXx/w7/+5BfP5r1m7n1d/vTPojqHd4d1cH7ELIbQBgZh25DVIDwgD24Eg7/h1tDz20OMcQut+EDyQnwMk6r8Nmfi2x/cmzr6t7TGlGzDsr3s/xtxbeTxFaLrsg1vbktGtjbXnf/2aXdf5jJqxNPK5ano8Mjc5tICIly3OFkOlvsKPVz3MBBpB8eSciPUOeNYRPAdPdfUL0/BIAd7807Zy9baBXf2RI81B7tku+UVfHLy0B2qbELy/zGv/lc2Jtj9x+S+H9FOG4s78aa3vt/G2xtiVj7mzEcHqEpMvrrL9nacZOPT+x/Ykrrs/1ukUbM2EtC1vf7HINIc9HhoblNhCRxqj7I4O77zCzycBDVLIpz3T3ZwsbmYg0XJ41BNx9DjCnoLGISDfT3zKISKCAICJBQ1Oo1XKXIUneFeG8jm2L1754dNT/Nqz/j/72S7G2Zz91R+bzR/4kfkfmua9nvxvzsSvi57dOTT5/+C/jdzlWn3xT5r7yGv2d+Iavhd/Nvtkr73vd3arnv+zeGbyxaW2pdxlEpJdRQBCRQAFBRAIFBBEJmmpRMUktC42NXChK6itJLf2fdNIZsbYHH0zeenzBy2NjbZP2XRRru+rk5OLbL/5nvCZp0ljT5pn3ff3x5oNjbd/Y98Vcr1mLWt7rZtCIrcsi0ssoIIhIoIAgIoECgogETb+omCbrYuPx/xDPcQAw76fxPAcTjzkt1jbn1/fVNrAqH/xZ8t/Tr/hivr+nHz47noT3c2MWxtquHPxUrn7SjPpRfFdj2zeLz1FRi6TcGXnzZhwy9+zE9lUnzMx0fsulyfk8Fl9S7HulRUURqZkCgogECggiEiggiEiQK2OSma0BtgI7gR3uPrqIQYlI98h1lyEKCKPd/dUsxzfyLkOSWrY5l5Ght9klZXKePzN7joO0oiJJmuW9zpt1OS1r+JDLfxNry/Oe6C6DiNQsb0Bw4GEzW9TIEtsiUo5cawjAOHdvN7MDgLlmtszdf935AFVuEmkeua4Q3L09+r4RuI9KAdjqY25099HuPrpzdVoR6XnylHLbA3iXu2+NHs8F/t3df5V2TncvKiYpa/Eq6wJaWf0nJUQdcsqaWNvrbw1IPH/PE1fF2pLGmlp9OOe8hs/5Sqxt9cSbc71mX5Z1UTHPR4YDgfvMrON1frarYCAiPV+eUm6rgI8VOBYR6Wa67SgigQKCiAS9Nh9CXkmLYh+6PV4NaPmXs1cDStJyWcrfw0/r3twBeR09+bxY22PX3NANI+nZasnHkYd2KopIzRQQRCRQQBCRQAFBRAIFBBEJdJehBs3yN/plUT6EOOVDEJFeSwFBRAIFBBEJFBBEJMibMalPKSPx6tjFpye2P9FyT6zt/j/GM06dsscbiecffmV8sWrriB2xttWTbuxqiMExCQtdaQ6dFd/m/UJ7vm3etZj6ysdjbYsuOiLW9sjtyVuEs77XWw7N9//UNz76ZmJ73t+r6vmv2/5apvN0hSAigQKCiAQKCCISKCCISNDlTkUzmwl8Ftjo7qOitoHA3cAwYA3w9+6+uavOmn2nYi2yLgodNjO++Abw/NnFL8CNmHdWrG3l8bcW3k8RkvJEJOWIKCvJa7PIOv8idyreBpxY1TYNmOfuHwLmRc9FpMl1GRCiwiu/r2o+FZgVPZ4FTCp4XCLSDepdQzjQ3dcDRN8PKG5IItJdSt+YpFJuIs2j3iuEDWY2GCD6vjHtQJVyE2ke9V4h3A+cCVwWfZ9d2Ih6iax/+/98Sdt5k0q5HZZQym3cks8lnt/dpdxeP/zPmY7rK3cT0hQ9/y6vEMzsTuC3wGFmts7MzqESCE4ws+XACdFzEWlyXV4huPsZKT/qGxsKRPoQ7VQUkUABQUQCBQQRCRQQRCRQQBCRQAFBRAIFBBEJlGS1m5W10691ajx3wPAH4pWXxo1annj+TzP2nzbOQ+47L9a26rQbMr0mwKDH+8cbJ8ab8uaTKCsfRdrrVts5IDkfyYovZqv8VPT4dYUgIoECgogECggiEiggiEiggCAiQZdZl4vUl7IulyFvPoKjEnIfPH74vfkHlmDE3efH2lZ+PtvKeZpaSukdPTl+l+Oxa7Lf5cjbfxnyzL913tVs27y2kKzLItJHKCCISKCAICKBAoKIBF1uXU4p5TYd+CqwKTrs2+4+p6xBSkXeRa0jB62Jtf3teecmHvt/N9wYa7vg5bGxtmuHPJF4/oWfeTDzuLKqZa4fvOi5XH119wJikjzzf/7ZP2U6r95SbgAz3L0l+lIwEOkF6i3lJiK9UJ41hMlmtsTMZprZvmkHmdm5ZrbQzBZu560c3YlI2eoNCNcBI4AWYD1wZdqBqtwk0jzqyofg7hs6HpvZTcADhY1IalLLTsUFm4bF2h5PWDxMk7aAmOSqh0+KtX2jgTsVV/xgZLzxmscy95V1V2hP3alYPf83X5mbqY+6rhA66jpGTgPa6nkdEelZstx2vBM4FtjfzNYB3wGONbMWwIE1QHzjuIg0nXpLud1SwlhEpJtpp6KIBAoIIhIoH0If8sL1Y2JtZxyZfOfgl7cfFWtLyuT8sSsuSDw/6dhafG7FCbG2ez+YbaU8zXFnx7NOz595U67X/PDNyVmPl30lX9bmoo2ZsJaFrW8qH4KIZKeAICKBAoKIBAoIIhJoUbGP6zf/A5mPXff6+2JtQ9/3euKxy9a+P9a26tMzM/d1xH/EF+vee+qGWNv6TfExAey56D2xNj92c6ztj1sHJJ4/aWRrrC1p6/ewvZP/EPi3K4cntldbefytmY6D5CS5Wee/4o4f8qcNSrIqIjVQQBCRQAFBRAIFBBEJ6sqHIL3HzuPaE9u3/eqQWNuSMXfG2sYlLHRBbQuISTZ/Ynus7alaqkxlXLsePjs5yezC6aNjbbXkjhjeemjmY7OqqcpW1fzHzNuUfFwVXSGISKCAICKBAoKIBAoIIhJ0GRDM7CAzm29mS83sWTObErUPNLO5ZrY8+p6ail1EmkOWuww7gH9296fMbC9gkZnNBf4JmOful5nZNGAacHF5Q5VG2vPEVfHGhBsSicelHNsTrT415c7BqSW9bg+XpXLTend/Knq8FVgKDKHyls2KDpsFTCprkCLSGDWtIZjZMODjwALgQHdfD5WgARxQ9OBEpLEyBwQz2xP4BXChu/+hhvNUyk2kSWQKCGbWn0owuMPdO7ZLbego2BJ935h0rkq5iTSPLIVajEodhqXu/sNOP7ofOBO4LPo+u5QRSo9RSymxScsnxNpev/SvMve1Omfy0xHzzoq11ZJ74IKXx8bakkrZ5U2ymrZ1etjseJ6SWhLCVs+/fetPMp2X5S7DOOAfgWfMrOO//repBIKfm9k5wEvA32UdrIj0TFkqNz0OpGVaUfojkV5EOxVFJFBAEJFASValFGmLjVkNn/OVWNvqiTfnes2+TJWbRKRmCggiEiggiEiggCAigQKCiAS6yyANk/fOw4i7z4+1rfz89YnHfmJ6fEvxounZthMD/HjzwbG2qx4+KXP/WY360QWJ7W3fvDbWlmf+z/9iBm9sUik3EamBAoKIBAoIIhIoIIhIoEVF6ZHyLkA2ytip8YU+gCeuyLbYOOrq5EXFIZf/JtaW5z3R1mURqZkCgogECggiEiggiEiQp5TbdDN72cwWR18Tyx+uiJQpTyk3gBnu/l/lDU9EGilLktX1QEeFpq1m1lHKTUR6mTyl3AAmm9kSM5up6s8izS9PKbfrgBFAC5UriCtTzlMpN5EmUXcpN3ff4O473f1t4CZgTNK5KuUm0jzqLuVmZoM7qj8DpwFt5QxR+qKsZeNaLkve+rt4WjyfQJK856dtUU573Wptaf1MyXR67vFXy1PK7QwzawEcWAOcV9cIRKTHyFPKbU7xwxGR7qSdiiISKCCISKB8CNLU0nIEHD05vqT12DU35Oor60JnWWrpv3r+rfOuZttmJVkVkRooIIhIoIAgIoECgogEWlSUXqlZkrTmlbTQCPH5K8mqiNRMAUFEAgUEEQkUEEQkUEAQkSDLnz+LNJ2824zPeunoWNuKH4yMteXdDl2LxO3Y7cn9V8//BX8tUx+6QhCRQAFBRAIFBBEJslRuGmBmT5pZa1S56btR+3AzW2Bmy83sbjPbrfzhikiZuty6HCVZ3cPdt0XZlx+nkgLyW8C97n6XmV0PtLr7dbt6LW1dlp4m7xbnsVPPT2xPS77aXQrbuuwV26Kn/aMvB8YD90Tts4BJdY5VRHqIrHUZ+kUZlzcCc4GVwBZ33xEdsg6VdxNpepkCQlSQpQUYSqUgy0eSDks6V5WbRJpHTXcZ3H0L8CgwFtjHzDo2Ng0F2lPOUeUmkSaRpXLTIGC7u28xs/cAnwYuB+YDpwN3AWcCs8scqEgZsuYTSJO2eJj2uvX20yhZti4PBmaZWT8qVxQ/d/cHzOw54C4z+x7wNJVybyLSxLJUblpCpQR8dfsqUgq8ikhz0k5FEQkUEEQkUEAQkUD5EEQS5M2n0N13D6rzOaz58/2ZztMVgogECggiEiggiEiggCAigRYVRTLKu9DYyP6rE8K++crcTH3oCkFEAgUEEQkUEEQkUEAQkaDLJKtFUpJV6SuyLjYeMvfsxPYDHo4nMc+TuLWwJKsi0ncoIIhIoIAgIoECgogEeUq53WZmq81scfSVLaukiPRYWbYuvwWM71zKzcwejH421d3v2cW5In1S1qzLq9pnJv/ghAIHU4MsSVYdSCrlJiK9TF2l3Nx9QfSj75vZEjObYWaqwiLS5Ooq5WZmo4BLgA8DnwQGAhcnnatSbiLNo95Sbie6+/qoMvRbwK2k1GhQKTeR5pHlLsMgM9snetxRym2ZmQ2O2oxKKfi2MgcqIuXLU8rtkajuowGLgfNLHKeINECeUm7jSxmRiHQb7VQUkUABQUQCJVkV6UZpOxpfvvhvYm1tU64tezi6QhCRv1BAEJFAAUFEAgUEEQkUEEQk0F0GkR5oyOW/iTdOKb9fXSGISKCAICKBAoKIBAoIIhJoUVGkSSRtc04rGXf05PPe8XzZS1dn6kNXCCISKCCISKCAICJB5oAQpWJ/2sweiJ4PN7MFZrbczO42s3j9ahFpKrUsKk4BlgJ7R88vB2a4+11mdj1wDnBdweMTkV1Iy6fwWPsN73g+ZsKmTK+XtVDLUOBk4ObouQHjgY4ybrOoZF4WkSaW9SPDVcBFwNvR8/2ALe6+I3q+DhhS8NhEpMGy1GX4LLDR3Rd1bk44NLHeoyo3iTSPLGsI44BTzGwiMIDKGsJVwD5m9u7oKmEo0J50srvfCNwIsLcNVJFYkR6syysEd7/E3Ye6+zDgC8Aj7v4lYD5wenTYmcDs0kYpIg2RZ+vyxcBdZvY94GnglmKGJCJ5Vd99eMFfy3ReTQHB3R+lUuwVd19FSoFXEWlO2qkoIoECgogECggiEph74+4Emtkm4MXo6f7Aqw3otlH9NLIvzUl91drPwe4+qKsXaWhAeEfHZgvdfXRv6aeRfWlO6qusfvSRQUQCBQQRCbozINzYy/ppZF+ak/oqpZ9uW0MQkZ5HHxlEJFBAEJFAAUFEAgUEEQkUEEQk+H9fXzjWrIpViAAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "from scipy.spatial.distance import cosine\n", "\n", "\n", "def word_relations(weights, ):\n", " relations = np.zeros((len(weights), len(weights)))\n", " for i in range(len(weights)):\n", " vec1 = weights[i]\n", " for j in range(i, len(weights)):\n", " vec2 = weights[j]\n", " relations[i, j] = cosine(vec1, vec2)\n", " reverse = dict(zip(dictionary.values(), dictionary.keys()))\n", " plt.matshow(relations)\n", " \n", "word_relations(weights)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 文档相关矩阵\n", "\n", "我们得到词向量的同时也得到了文档向量, 我们可以计算文档之间的相关性几乎是没有的, 这不符合我们的直观感受。" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0. 0.93513419 0.92300095 0.94191977 1. 1.\n", " 1. 1. 1. ]\n", " [0. 0. 0.89788806 0.90372145 0.82532862 0.98323348\n", " 0.98114633 0.98558537 0.66476436]\n", " [0. 0. 0. 0.80571111 0.96544268 0.96019457\n", " 0.95523946 1. 1. ]\n", " [0. 0. 0. 0. 0.99348337 0.99249371\n", " 0.9915593 0.93546641 1. ]\n", " [0. 0. 0. 0. 0. 0.99432578\n", " 0.99361943 0.99512172 1. ]\n", " [0. 0. 0. 0. 0. 0.\n", " 0.93385392 0.97190435 1. ]\n", " [0. 0. 0. 0. 0. 0.\n", " 0. 0.94313243 0.93468898]\n", " [0. 0. 0. 0. 0. 0.\n", " 0. 0. 0.83771539]\n", " [0. 0. 0. 0. 0. 0.\n", " 0. 0. 0. ]]\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAECCAYAAADesWqHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADTlJREFUeJzt3WuMXdV5xvH/w9jGNhcZAk0JRlyqFhVRBRByCFS0xSSBJCJf+gEkIiVqRaumKaSVUFJVQvkeRfRDGwkBCVKAiBBQI5RQkBKEEC2JMVAuJlW4m0sMoQQTUmzstx/OIXUst7MH9jqemfX/SSOf8Wyv9z2eec7a58w+a6WqkNSXA/Z3A5Jmz+BLHTL4UocMvtQhgy91yOBLHdqvwU9yXpKfJPlpki82qnFtkm1JHmk0/jFJfphkS5JHk1zaoMbqJD9K8tC0xpfHrjGtM5fkgSS3tRh/WuPpJA8neTDJpgbjr0tyc5LHp9+TD488/onT3t/5eD3JZWPWmNb5wvR7/UiSG5OsHrVAVe2XD2AOeAI4AVgFPASc1KDO2cBpwCON7sdRwGnT24cA/zn2/QACHDy9vRK4DzijwX35W+AG4LaG3/engSMajn8d8OfT26uAdQ1rzQEvAceOPO7RwFPAmunnNwGfGbPG/pzxNwA/raonq2oH8C3gU2MXqaq7gVfHHneP8V+sqs3T29uBLUy+cWPWqKp6Y/rpyunHqFdeJVkPfAK4esxxZynJoUwe6K8BqKodVfVaw5IbgSeq6pkGY68A1iRZAawFXhhz8P0Z/KOB5/b4fCsjB2bWkhwHnMpkRh577LkkDwLbgDurauwaVwKXA7tHHndvBdyR5P4kl4w89gnAy8DXp09Zrk5y0Mg19nQhcOPYg1bV88BXgGeBF4FfVNUdY9bYn8HPPv5uyV4/nORg4DvAZVX1+tjjV9WuqjoFWA9sSHLyWGMn+SSwraruH2vM/8dZVXUacD7wuSRnjzj2CiZP675WVacCvwRavXa0CrgA+HaDsQ9jcvZ7PPAB4KAkF49ZY38GfytwzB6fr2fk05lZSbKSSeivr6pbWtaanrreBZw34rBnARckeZrJU65zknxzxPF/rapemP65DbiVyVO+sWwFtu5xNnQzkweCFs4HNlfVzxqMfS7wVFW9XFU7gVuAM8cssD+D/2Pgd5McP330vBD47n7s511JEibPKbdU1Vcb1Tgyybrp7TVMfjAeH2v8qvpSVa2vquOYfB9+UFWjzjAASQ5Kcsg7t4GPAqP9tqWqXgKeS3Li9K82Ao+NNf5eLqLBaf7Us8AZSdZOf742MnntaDQrxhxsIarq7SR/Dfwrk1dHr62qR8euk+RG4I+BI5JsBa6oqmtGLHEW8Gng4elzcIC/r6rvjVjjKOC6JHNMHqxvqqpmv3Jr6P3ArZOfZVYAN1TV7SPX+Dxw/XQyeRL47Mjjk2Qt8BHgL8YeG6Cq7ktyM7AZeBt4ALhqzBqZ/rpAUke8ck/qkMGXOmTwpQ4ZfKlDBl/q0KIIfoNLN5dljeVwH6yxOMZfFMEHmn+TlkmN5XAfrLEIxl8swZc0Q00u4FmVA2s1w98UtZO3WMmBC6pxzB+8Mf9Be/ivV3dz2OELe5xb6P/Mu6kxt4Aqr766m8MXOD7AgZkbfOzLP9/Fke8bfvy7YY124z/93E5eeXXXvt4A9xuaXLK7moP4UDa2GPrXrrzt3qbjA7xVbX9wAA45YGfzGr+z8uDmNbQ4bPjYc/MfhKf6UpcMvtQhgy91yOBLHTL4UocMvtQhgy91aFDwZ7HjjaTZmTf403Xe/onJqqInARclOal1Y5LaGTLjz2THG0mzMyT4y27HG6l3Q67VH7TjzfS9w5cArGbte2xLUktDZvxBO95U1VVVdXpVnb7Qd9pJmq0hwV8WO95I+l/znurPascbSbMz6P340+2gxtwSStJ+5JV7UocMvtQhgy91yOBLHTL4UocMvtShJstrz8Jlx53ZvMblTzzcvMbrtat5jX/ZfkLT8f9y3eNNx5+Vucy7HP179kcPXdR0/C2/+sag45zxpQ4ZfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qkMGXOjRkee1rk2xL8sgsGpLU3pAZ/xvAeY37kDRD8wa/qu4GXp1BL5JmxOf4UodGe5OO6+pLS8doM77r6ktLh6f6UoeG/DrvRuDfgBOTbE3yZ+3bktTSkA012q4cIGnmPNWXOmTwpQ4ZfKlDBl/qkMGXOmTwpQ4ZfKlDqarRBz00h9eHsnH0cZejMx/a0bzG5w7/cdPxdzb4Gdpb+21HYP2Kg2dQpa0NH3uOTQ/997w7gzjjSx0y+FKHDL7UIYMvdcjgSx0y+FKHDL7UIYMvdWjICjzHJPlhki1JHk1y6Swak9TOkFV23wb+rqo2JzkEuD/JnVX1WOPeJDUyZEONF6tq8/T2dmALcHTrxiS1s6Dn+EmOA04F7mvRjKTZGLyhRpKDge8Al1XV6/v4uhtqSEvEoBk/yUomob++qm7Z1zFuqCEtHUNe1Q9wDbClqr7aviVJrQ2Z8c8CPg2ck+TB6cfHG/clqaEhG2rcA8z7xn5JS4dX7kkdMvhShwy+1CGDL3XI4EsdMvhShwy+1KHB1+qrjXs/uKp5jX94fk3zGq3tpv2mHVt2vNm8xu+vWhzvY3HGlzpk8KUOGXypQwZf6pDBlzpk8KUOGXypQwZf6tCQpbdWJ/lRkoemG2p8eRaNSWpnyJV7bwHnVNUb00U370ny/ar698a9SWpkyNJbBbwx/XTl9KP99ZOSmhm6vPZckgeBbcCdVeWGGtISNij4VbWrqk4B1gMbkpy89zFJLkmyKcmmnbw1dp+SRrSgV/Wr6jXgLuC8fXzNDTWkJWLIq/pHJlk3vb0GOBd4vHVjktoZ8qr+UcB1SeaYPFDcVFW3tW1LUktDXtX/DyY75EpaJrxyT+qQwZc6ZPClDhl8qUMGX+qQwZc6ZPClDrmhRgc+fvRpTcf/3vObm44PcABpXuP3Vq5uXmOxcMaXOmTwpQ4ZfKlDBl/qkMGXOmTwpQ4ZfKlDBl/q0ODgT1fafSCJq+9IS9xCZvxLgS2tGpE0O0PX1V8PfAK4um07kmZh6Ix/JXA5sLthL5JmZMjy2p8EtlXV/fMc54Ya0hIxZMY/C7ggydPAt4Bzknxz74PcUENaOuYNflV9qarWV9VxwIXAD6rq4uadSWrG3+NLHVrQQhxVdReTvfMkLWHO+FKHDL7UIYMvdcjgSx0y+FKHDL7UIdfV13vWet1+gCufvrd5jd+ea16Cw+bWti8ygDO+1CGDL3XI4EsdMvhShwy+1CGDL3XI4EsdMvhShwy+1KFBV+5N19vbDuwC3q6q01s2JamthVyy+ydV9UqzTiTNjKf6UoeGBr+AO5Lcn+SSlg1Jam/oqf5ZVfVCkt8C7kzyeFXdvecB0weESwBWszjegSRp3wbN+FX1wvTPbcCtwIZ9HOOGGtISMWQLrYOSHPLObeCjwCOtG5PUzpBT/fcDtyZ55/gbqur2pl1Jamre4FfVk8AHZ9CLpBnx13lShwy+1CGDL3XI4EsdMvhShwy+1CE31NCScNlxZzav8c/P3NO8xku73mw6/q9q96DjnPGlDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+qQwZc6NCj4SdYluTnJ40m2JPlw68YktTP0yr1/BG6vqj9NsgpcTVNayuYNfpJDgbOBzwBU1Q5gR9u2JLU05FT/BOBl4OtJHkhy9XTRTUlL1JDgrwBOA75WVacCvwS+uPdBSS5JsinJpp28NXKbksY0JPhbga1Vdd/085uZPBD8BtfVl5aOeYNfVS8BzyU5cfpXG4HHmnYlqamhr+p/Hrh++or+k8Bn27UkqbVBwa+qB4HTG/ciaUa8ck/qkMGXOmTwpQ4ZfKlDBl/qkMGXOmTwpQ65oYY09VfH/mHzGhc89vOm42/fvX3Qcc74UocMvtQhgy91yOBLHTL4UocMvtQhgy91yOBLHZo3+ElOTPLgHh+vJ7lsFs1JamPeK/eq6ifAKQBJ5oDngVsb9yWpoYWe6m8EnqiqZ1o0I2k2Fhr8C4EbWzQiaXYGB3+6wu4FwLf/j6+7oYa0RCxkxj8f2FxVP9vXF91QQ1o6FhL8i/A0X1oWBgU/yVrgI8AtbduRNAtDN9R4E3hf414kzYhX7kkdMvhShwy+1CGDL3XI4EsdMvhShwy+1CE31JBm6Lsntb0c5rUaFmlnfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qkMGXOmTwpQ4NXYHnC0keTfJIkhuTrG7dmKR2huykczTwN8DpVXUyMMdkmW1JS9TQU/0VwJokK4C1wAvtWpLU2rzBr6rnga8AzwIvAr+oqjtaNyapnSGn+ocBnwKOBz4AHJTk4n0c54Ya0hIx5FT/XOCpqnq5qnYyWWL7zL0PckMNaekYEvxngTOSrE0SJhtnbmnblqSWhjzHvw+4GdgMPDz9N1c17ktSQ0M31LgCuKJxL5JmxCv3pA4ZfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qUKpq/EGTl4FnFvBPjgBeGb2R5VdjOdwHa7Qd/9iqOnK+g5oEf6GSbKqq062xf8e3xuKq0XJ8T/WlDhl8qUOLJfizeNPPcqixHO6DNRbB+IviOb6k2VosM76kGTL4UocMvtQhgy91yOBLHfofEdsTx480Vq8AAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def doc_relations(weights):\n", " n = weights.shape[1]\n", " relations = np.zeros((n, n))\n", " for i in range(n):\n", " vec1 = weights[:, i]\n", " for j in range(i, n):\n", " vec2 = weights[:, j]\n", " relations[i, j] = cosine(vec1, vec2)\n", " reverse = dict(zip(dictionary.values(), dictionary.keys()))\n", " plt.matshow(relations)\n", " print(relations)\n", " \n", "doc_relations(weights)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }