{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: 文本向量系列-基于潜在语义分析的词向量\n", "date: 2018-08-18 18:17:55\n", "tags: [python, 文本挖掘]\n", "toc: true\n", "xiongzhang: true\n", "xiongzhang_images: [main.jpg]\n", "\n", "---\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 系列文章\n", "\n", "这个是系列博客, 所有文章链接都列在这里, 并持续更新中。\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 概念\n", "\n", "摘自(潜在语义分析理论及其应用)\n", "\n", "潜在语义分析(LSA)是一种用于知识获取和展示的计算理论和方法,它使用统计计算的方法对大量的文本集进行分析,从而提取和表示出词的语义, 这种潜在语义,是词语所有的上下文语境信息的总和。这是因为,上下文环境对其中的事物提供了一组相互联系和制约,在很大程度上决定了词语之间应用义上的相关性。\n", "\n", "潜在语义分析出发点就是文本中的词与词之间存在某种联系,即存在某种潜在的语义结构。这种潜在的语义结构隐含在文本中词语的上下文使用模式中。 因此采用统计计算的方法,对大量的文本中进行分析来寻找这种潜在的语义结构,它不需要确定的语义编码,仅依赖于上下文中事物的联系, 并用语义结构来表示词和文本, 达到消除词之间的相关性, 简化文本向量的目的" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 实现方法\n", "\n", "为实现 LSA , 需要通过数学方法建立潜在语义空间模型, 这是 LSA 一个关键性的问题, 直接影响运用 LSA 的性能。从提出 LSA 思想方法以来,研究者不断尝试与改进, 努力寻求最佳的提取潜在语义空间的数学方法, 使 LSA 思想得到有效的应用。模型选择时,需要综合考虑处理大数据量的计算复杂度、存储空间代价、计算时内存消耗、语义模型的表达能力、模型的最优化衡量标准、模型的结合能力、更新算法复杂度等多种因素。还可以根据特定的使用要求、所需处理数据的特点等,选择适合特定需求的最佳方法。\n", "\n", "一般常用的方法是: LSA/SVD, PLSA, SOM等, 今天我们主要使用LSA/SVD方法来示范潜在语义分析, 以后有时间会把PLSA和SOM方法也写出来, 不过那就是另一篇文章了。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LSA/SVD实现\n", "\n", "LSA/SVD 是目前普遍使用的典型LSA空间的构造方法。通过对文本集的词-文档矩阵的奇异值分解(Singular ValueDecomposition ,SVD)计算, 并提取K 个最大的奇异值及其对应的奇异矢量构成新矩阵来近似表示原文本集的词条-文本矩阵。\n", "\n", "#### 语料库\n", "\n", "为了加快运算, 我们使用一个很小的语料库, 这个语料库就直接写在代码里了。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "documents = [\"Human machine interface for lab abc computer applications\",\n", " \"A survey of user opinion of computer system response time\",\n", " \"The EPS user interface management system\",\n", " \"System and human system engineering testing of EPS\",\n", " \"Relation of user perceived response time to error measurement\",\n", " \"The generation of random 
binary unordered trees\",\n", "             \"The intersection graph of paths in trees\",\n", "             \"Graph minors IV Widths of trees and well quasi ordering\",\n", "             \"Graph minors A survey\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Building the dictionary and the term-frequency matrix\n", "\n", "The dictionary maps each word to a word ID, so that a word can be represented by its ID (an integer)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "from itertools import chain\n", "import numpy as np\n", "\n", "def word_matrix(documents):\n", "    '''Compute the term-frequency matrix (rows are words, columns are documents).'''\n", "    # Convert all letters to lowercase\n", "    docs = [d.lower() for d in documents]\n", "    # Tokenize on whitespace\n", "    docs = [d.split() for d in docs]\n", "    # Collect the distinct words (set order is arbitrary, so IDs can differ between runs)\n", "    words = list(set(chain(*docs)))\n", "    # Map each word to an ID\n", "    dictionary = dict(zip(words, range(len(words))))\n", "    # Create a zero matrix: one row per word, one column per document\n", "    matrix = np.zeros((len(words), len(docs)))\n", "    # Count word frequencies document by document\n", "    for col, d in enumerate(docs):\n", "        # Count word frequencies in this document\n", "        count = Counter(d)\n", "        for word in count:\n", "            # The word's ID is its row index in the matrix\n", "            word_id = dictionary[word]\n", "            # Store the frequency in the matrix\n", "            matrix[word_id, col] = count[word]\n", "    return matrix, dictionary\n", "\n", "matrix, dictionary = word_matrix(documents)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 0., 0., 0., 0., 0., 1., 0.],\n", "       [0., 0., 0., 0., 0., 0., 0., 1., 1.],\n", "       [1., 0., 0., 0., 0., 0., 0., 0., 0.],\n", "       [0., 0., 0., 0., 0., 1., 1., 1., 0.],\n", "       [0., 2., 0., 1., 1., 1., 1., 1., 0.],\n", "       [0., 0., 0., 0., 0., 0., 1., 1., 1.],\n", "       [0., 0., 1., 1., 0., 0., 0., 0., 0.],\n", "       [0., 1., 0., 0., 0., 0., 0., 0., 0.],\n", "       [0., 0., 0., 0., 0., 1., 0., 0., 0.],\n", "       [0., 0., 0., 0., 1., 0., 0., 0., 0.]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix[:10, :10]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'iv': 0, 
'minors': 1, 'lab': 2, 'trees': 3, 'of': 4, 'graph': 5, 'eps': 6, 'opinion': 7, 'binary': 8, 'measurement': 9, 'survey': 10, 'paths': 11, 'widths': 12, 'testing': 13, 'ordering': 14, 'engineering': 15, 'machine': 16, 'for': 17, 'computer': 18, 'to': 19, 'applications': 20, 'response': 21, 'system': 22, 'perceived': 23, 'intersection': 24, 'a': 25, 'time': 26, 'relation': 27, 'human': 28, 'random': 29, 'well': 30, 'unordered': 31, 'the': 32, 'generation': 33, 'quasi': 34, 'user': 35, 'in': 36, 'management': 37, 'error': 38, 'abc': 39, 'interface': 40, 'and': 41}\n" ] } ], "source": [ "print(dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SVD decomposition\n", "\n", "For background on SVD, see the many references available online. In short, the purpose of applying SVD here is to reduce noise and to lower the dimensionality of the space in which the word and document vectors live. Because our corpus is small, we reduce to just 3 dimensions. For LSA on large corpora, the dimensionality is typically reduced to somewhere between 100 and 300.\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from scipy import linalg\n", "# Use SciPy to compute the SVD, which yields three arrays\n", "U, sigma, VT = linalg.svd(matrix)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.shape: (42, 42)\n", "sigma.shape: (9,)\n", "VT.shape: (9, 9)\n" ] } ], "source": [ "print('U.shape:', U.shape)\n", "print('sigma.shape:', sigma.shape)\n", "print('VT.shape:', VT.shape)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Estimated latent semantic space\n", "X_new.shape: (42, 9)\n" ] } ], "source": [ "# Number of dimensions to keep\n", "n = 3\n", "\n", "# Keep the first n singular vectors\n", "U2 = U[:, :n]\n", "VT2 = VT[:n]\n", "\n", "# Turn the vector of singular values into a diagonal matrix\n", "sigma_matrix = linalg.diagsvd(sigma, U2.shape[0], VT2.shape[1])\n", "# Keep only the top-left n x n block\n", "sigma_matrix = sigma_matrix[:U2.shape[1], :VT2.shape[0]]\n", "\n", "# Multiply the truncated factors back together: a rank-n approximation\n", "# of the term-frequency matrix, with the same shape\n", "X_new = np.dot(U2, sigma_matrix)\n", "X_new = np.dot(X_new, VT2)\n", "print('Estimated latent semantic space')\n", "print('X_new.shape:', X_new.shape)" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "#### 使用词向量计算词之间的相关性\n", "\n", "我们可以看到语义接近的词的cosine值偏小。感兴趣的人可以拿这个图跟前一篇文章进行对比(文本向量系列-如何基于词频矩阵和TF-IDF权重构建词向量), 可以看出LSA可以揭示词的语义相关性。而tf-idf方法效果较差。" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "from scipy.spatial.distance import cosine\n", "\n", "\n", "def word_relations(weights, ):\n", " relations = np.zeros((len(weights), len(weights)))\n", " for i in range(len(weights)):\n", " vec1 = weights[i]\n", " for j in range(i, len(weights)):\n", " vec2 = weights[j]\n", " relations[i, j] = cosine(vec1, vec2)\n", " reverse = dict(zip(dictionary.values(), dictionary.keys()))\n", " plt.matshow(relations)\n", " \n", "word_relations(X_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 根据文档向量计算文档的相似性\n", "\n", "同类的文档cosine值较小。感兴趣的人可以拿这个图跟前一篇文章进行对比(文本向量系列-如何基于词频矩阵和TF-IDF权重构建词向量), 可以看出LSA可以揭示文档的语义相关性。而tf-idf方法效果较差。" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAECCAYAAADesWqHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADYxJREFUeJzt3W/MnXV9x/H3h7ut/YulwhQo4U8ymhgTgTQMJTEbRYPT4LL4ABZNNFu6LJsDZ2Z0T4hP9mTGuAeLiUGciVCnFRLjJoOoxJg4XFvKBItGsELlT3Gk/Omgf797cA71puns1Xn9Ttv7934lJz1379Pv93t67s+5rnPu61y/VBWS+nLGyR5A0uwZfKlDBl/qkMGXOmTwpQ4ZfKlDJzX4Sa5L8pMkP0vyiUY9bkuyO8lDjepfkOS7SXYkeTjJTQ16LE3ywyQPTnt8auwe0z5zSR5I8s0W9ac9dib5UZLtSbY0qL86yeYkj0wfk7eNXH/ddPZXLy8kuXnMHtM+H50+1g8l2ZRk6agNquqkXIA54FHgEmAJ8CDw5gZ93gFcATzU6H6cC1wxvb4K+OnY9wMIsHJ6fTFwP3BVg/vyN8AdwDcbPu47gbMb1v8S8GfT60uA1Q17zQFPAxeOXPd84OfAsunXXwU+NGaPk7nFvxL4WVU9VlX7ga8A7xu7SVV9D3hu7Lrz6j9VVdum118EdjB54MbsUVX10vTLxdPLqEdeJVkLvAe4dcy6s5TkTCZP9F8AqKr9VbWnYcsNwKNV9YsGtRcBy5IsApYDT45Z/GQG/3zgiXlf72LkwMxakouAy5lskceuPZdkO7AbuLeqxu7xWeDjwOGR6x6tgHuSbE2yceTalwDPAl+cvmS5NcmKkXvMdwOwaeyiVfVL4NPA48BTwPNVdc+YPU5m8HOMvzttjx9OshL4OnBzVb0wdv2qOlRVlwFrgSuTvGWs2kneC+yuqq1j1fwNrq6qK4B3A3+Z5B0j1l7E5GXd56rqcmAv0Oq9oyXA9cDXGtQ+i8ne78XAecCKJB8Ys8fJDP4u4IJ5X69l5N2ZWUmymEnob6+qO1v2mu663gdcN2LZq4Hrk+xk8pLrmiRfHr
H+EVX15PTP3cBdTF7yjWUXsGve3tBmJk8ELbwb2FZVzzSofS3w86p6tqoOAHcCbx+zwckM/n8Cv5vk4umz5w3AN07iPP8vScLkNeWOqvpMox7nJFk9vb6MyQ/GI2PVr6pPVtXaqrqIyePwnaoadQsDkGRFklWvXgfeBYz225aqehp4Ism66V9tAH48Vv2j3EiD3fypx4Grkiyf/nxtYPLe0WgWjVnsRFTVwSR/Bfw7k3dHb6uqh8fuk2QT8PvA2Ul2AbdU1RdGbHE18EHgR9PX4AB/V1X/NmKPc4EvJZlj8mT91apq9iu3ht4I3DX5WWYRcEdV3T1yj48At083Jo8BHx65PkmWA+8E/nzs2gBVdX+SzcA24CDwAPD5MXtk+usCSR3xyD2pQwZf6pDBlzpk8KUOGXypQ6dE8BscurkgeyyE+2CPU6P+KRF8oPmDtEB6LIT7YI9ToP6pEnxJM9TkAJ4leV0tZfiHog6wj8W8bvQ5FlqPhXAf7NG2/ivsZX/tO9YH4F6jySG7S1nB72VDi9KSfoP769uDbueuvtQhgy91yOBLHTL4UocMvtQhgy91yOBLHRoU/FmseCNpdo4b/Ol53v6JyVlF3wzcmOTNrQeT1M6QLf5MVryRNDtDgr/gVryRejfkWP1BK95MPzu8EWApy3/LsSS1NGSLP2jFm6r6fFWtr6r1rT8VJem3MyT4C2LFG0m/dtxd/VmteCNpdgZ9Hn+6HNSYS0JJOok8ck/qkMGXOmTwpQ4ZfKlDBl/qkMGXOtTk9Nr7Ll7GY39/WYvSR1zyJ9ub1pcWMrf4UocMvtQhgy91yOBLHTL4UocMvtQhgy91yOBLHRpyeu3bkuxO8tAsBpLU3pAt/j8D1zWeQ9IMHTf4VfU94LkZzCJpRnyNL3VotOAn2ZhkS5Ith1/cO1ZZSQ2MFvz559U/Y9WKscpKasBdfalDQ36dtwn4AbAuya4kf9p+LEktDVlQ48ZZDCJpdtzVlzpk8KUOGXypQwZf6pDBlzpk8KUOGXypQ00W1Fi66wCXfuzpFqWP2P3htzWtD7Dmiz9o3kM6GdziSx0y+FKHDL7UIYMvdcjgSx0y+FKHDL7UIYMvdWjIGXguSPLdJDuSPJzkplkMJqmdIUfuHQQ+VlXbkqwCtia5t6p+3Hg2SY0MWVDjqaraNr3+IrADOL/1YJLaOaHX+EkuAi4H7m8xjKTZGPwhnSQrga8DN1fVC8f4/kZgI8DSuZWjDShpfIO2+EkWMwn97VV157FuM39BjSVnLBtzRkkjG/KufoAvADuq6jPtR5LU2pAt/tXAB4FrkmyfXv6w8VySGhqyoMb3gcxgFkkz4pF7UocMvtQhgy91yOBLHTL4UocMvtQhgy91qMmCGiSwZHGT0q86sLL9oQVZ1Oa/Z746eLB5D+lobvGlDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+qQwZc6NOTUW0uT/DDJg9MFNT41i8EktTPk0LR9wDVV9dL0pJvfT/KtqvqPxrNJamTIqbcKeGn65eLppVoOJamtoafXnkuyHdgN3FtVLqghncYGBb+qDlXVZcBa4Mokbzn6Nkk2JtmSZMv+Qy+PPaekEZ3Qu/pVtQe4D7juGN/79YIacy6oIZ3Khryrf06S1dPry4BrgUdaDyapnSHv6p8LfCnJHJMniq9W1TfbjiWppSHv6v8XkxVyJS0QHrkndcjgSx0y+FKHDL7UIYMvdcjgSx0y+FKHmqwY8aZLn+dvv/GvLUof8Q/XvKdpfYCd/7KueY+1q59v3oMNu9r30GnFLb7UIYMvdcjgSx0y+FKHDL7UIYMvdcjgSx0y+FKHBgd/eqbdB5J49h3pNHciW/ybgB2tBpE0O0PPq78WeA9wa9txJM3C0C3+Z4GPA4cbziJpRoacXvu9wO6q2nqc2x1ZUGPPc4dGG1DS+IZs8a8Grk+yE/gKcE2SLx99o/kLaqxeMzfymJLGdNzgV9Unq2ptVV0E3AB8p6o+0H
wySc34e3ypQyd0Io6quo/J2nmSTmNu8aUOGXypQwZf6pDBlzpk8KUOGXypQ03Oq/9KLean+9/UovQRT7z/gqb1AV55+eXmPR7dubZ5j0OfO69p/Uv/4odN62t8bvGlDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+qQwZc6ZPClDg06cm96vr0XgUPAwapa33IoSW2dyCG7f1BVv2o2iaSZcVdf6tDQ4BdwT5KtSTa2HEhSe0N39a+uqieT/A5wb5JHqup7828wfULYCLDmvNeNPKakMQ3a4lfVk9M/dwN3AVce4zZHFtRYedbicaeUNKohS2itSLLq1evAu4CHWg8mqZ0hu/pvBO5K8urt76iqu5tOJamp4wa/qh4D3jqDWSTNiL/Okzpk8KUOGXypQwZf6pDBlzpk8KUONVlQY0kOcv7i51qUPmLfmmpaH+DMVe0X1Nizt8lD8BqLVh1o22ByjEdb1f7x7olbfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qkMGXOmTwpQ4NCn6S1Uk2J3kkyY4kb2s9mKR2hh429o/A3VX1/iRLgOUNZ5LU2HGDn+RM4B3AhwCqaj+wv+1Ykloasqt/CfAs8MUkDyS5dXrSTUmnqSHBXwRcAXyuqi4H9gKfOPpGSTYm2ZJky/PPHRp5TEljGhL8XcCuqrp/+vVmJk8ErzH/vPqvXzM35oySRnbc4FfV08ATSdZN/2oD8OOmU0lqaui7+h8Bbp++o/8Y8OF2I0lqbVDwq2o7sL7xLJJmxCP3pA4ZfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qUJPVHIpwoNouFHFg7b6m9QFWLW3fY8/ilc17vPWCXU3r7122rGl9gMOvtH8sONzPZ0zc4ksdMvhShwy+1CGDL3XI4EsdMvhShwy+1CGDL3XouMFPsi7J9nmXF5LcPIvhJLVx3MPrquonwGUASeaAXwJ3NZ5LUkMnuqu/AXi0qn7RYhhJs3Giwb8B2NRiEEmzMzj40zPsXg987f/4/rwFNQ6ONZ+kBk5ki/9uYFtVPXOsb752QY22n8yT9Ns5keDfiLv50oIwKPhJlgPvBO5sO46kWRi6oMb/AG9oPIukGfHIPalDBl/qkMGXOmTwpQ4ZfKlDBl/qkMGXOtTk2NrVZxzmj1a81KL0EbdsW9q0PsDqP365eY89a/Y27/Hwty9tWv/CQ1ub1gc4Y8ni5j3y+vaHqhx6ZnfzHkO4xZc6ZPClDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+rQ0DPwfDTJw0keSrIpSfujZyQ1M2QlnfOBvwbWV9VbgDkmp9mWdJoauqu/CFiWZBGwHHiy3UiSWjtu8Kvql8CngceBp4Dnq+qe1oNJamfIrv5ZwPuAi4HzgBVJPnCM2x1ZUOPZ/z40/qSSRjNkV/9a4OdV9WxVHWByiu23H32j+QtqnPOGubHnlDSiIcF/HLgqyfIkYbJw5o62Y0lqachr/PuBzcA24EfTf/P5xnNJamjoghq3ALc0nkXSjHjkntQhgy91yOBLHTL4UocMvtQhgy91yOBLHUpVjV80eRb4xQn8k7OBX40+yMLrsRDugz3a1r+wqs453o2aBP9EJdlSVevtcXLr2+PU6tGyvrv6UocMvtShUyX4s/jQz0LosRDugz1OgfqnxGt8SbN1qmzxJc2QwZc6ZPClDhl8qUMGX+rQ/wK9uhrEXCWiNwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def doc_relations(weights):\n", " n = weights.shape[1]\n", " relations = np.zeros((n, n))\n", " for i in range(n):\n", " vec1 = weights[:, i]\n", " for j in range(i, n):\n", " vec2 = weights[:, j]\n", " relations[j, i] = cosine(vec1, vec2)\n", " reverse = dict(zip(dictionary.values(), dictionary.keys()))\n", " plt.matshow(relations)\n", " \n", "doc_relations(X_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 总结\n", "\n", "根据上面的实验, 可以看出LSA即便在较小的语料库上也能显现出效果, 而tf-idf或者词频矩阵这些方法构建的词向量的效果要差很多。所以, LSA经常被用来计算词/文档的相似性。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }