{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "title: 文本向量系列-基于潜在语义分析的词向量\n", "date: 2018-08-18 18:17:55\n", "tags: [python, 文本挖掘]\n", "toc: true\n", "xiongzhang: true\n", "xiongzhang_images: [main.jpg]\n", "\n", "---\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 系列文章\n", "\n", "这个是系列博客, 所有文章链接都列在这里, 并持续更新中。\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 概念\n", "\n", "摘自(潜在语义分析理论及其应用)\n", "\n", "潜在语义分析(LSA)是一种用于知识获取和展示的计算理论和方法,它使用统计计算的方法对大量的文本集进行分析,从而提取和表示出词的语义, 这种潜在语义,是词语所有的上下文语境信息的总和。这是因为,上下文环境对其中的事物提供了一组相互联系和制约,在很大程度上决定了词语之间应用义上的相关性。\n", "\n", "潜在语义分析出发点就是文本中的词与词之间存在某种联系,即存在某种潜在的语义结构。这种潜在的语义结构隐含在文本中词语的上下文使用模式中。 因此采用统计计算的方法,对大量的文本中进行分析来寻找这种潜在的语义结构,它不需要确定的语义编码,仅依赖于上下文中事物的联系, 并用语义结构来表示词和文本, 达到消除词之间的相关性, 简化文本向量的目的" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 实现方法\n", "\n", "为实现 LSA , 需要通过数学方法建立潜在语义空间模型, 这是 LSA 一个关键性的问题, 直接影响运用 LSA 的性能。从提出 LSA 思想方法以来,研究者不断尝试与改进, 努力寻求最佳的提取潜在语义空间的数学方法, 使 LSA 思想得到有效的应用。模型选择时,需要综合考虑处理大数据量的计算复杂度、存储空间代价、计算时内存消耗、语义模型的表达能力、模型的最优化衡量标准、模型的结合能力、更新算法复杂度等多种因素。还可以根据特定的使用要求、所需处理数据的特点等,选择适合特定需求的最佳方法。\n", "\n", "一般常用的方法是: LSA/SVD, PLSA, SOM等, 今天我们主要使用LSA/SVD方法来示范潜在语义分析, 以后有时间会把PLSA和SOM方法也写出来, 不过那就是另一篇文章了。" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### LSA/SVD实现\n", "\n", "LSA/SVD 是目前普遍使用的典型LSA空间的构造方法。通过对文本集的词-文档矩阵的奇异值分解(Singular ValueDecomposition ,SVD)计算, 并提取K 个最大的奇异值及其对应的奇异矢量构成新矩阵来近似表示原文本集的词条-文本矩阵。\n", "\n", "#### 语料库\n", "\n", "为了加快运算, 我们使用一个很小的语料库, 这个语料库就直接写在代码里了。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "documents = [\"Human machine interface for lab abc computer applications\",\n", " \"A survey of user opinion of computer system response time\",\n", " \"The EPS user interface management system\",\n", " \"System and human system engineering testing of EPS\",\n", " \"Relation of user perceived response time to error measurement\",\n", " \"The generation of random binary unordered trees\",\n", " \"The intersection graph of paths in trees\",\n", " \"Graph minors IV Widths of trees and well quasi ordering\",\n", " \"Graph minors A survey\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 计算词典和词频数矩阵\n", "\n", "词典就是词到词id的映射, 这样我们可以用id(一个整数)表示一个词了。" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from collections import Counter\n", "from itertools import chain\n", "import numpy as np\n", "\n", "def word_matrix(documents):\n", " '''计算词频矩阵'''\n", " # 所有字母转换位小写\n", " docs = [d.lower() for d in documents]\n", " # 分词\n", " docs = [d.split() for d in docs]\n", " # 获取所有词\n", " words = list(set(chain(*docs)))\n", " # 词到ID的映射, 使得每个词有一个ID\n", " dictionary = dict(zip(words, range(len(words))))\n", " # 创建一个空的矩阵, 行数等于词数, 列数等于文档数\n", " matrix = np.zeros((len(words), len(docs)))\n", " # 逐个文档统计词频\n", " for col, d in enumerate(docs):\n", " # 统计词频\n", " count = Counter(d)\n", " for word in count:\n", " # 用word的id表示word在矩阵中的行数\n", " id = dictionary[word]\n", " # 把词频赋值给矩阵\n", " matrix[id, col] = count[word]\n", " return matrix, dictionary\n", "\n", "matrix, dictionary = word_matrix(documents)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 0., 0., 0., 0., 0., 1., 0.],\n", " [0., 0., 0., 0., 0., 0., 0., 1., 1.],\n", " [1., 0., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 1., 1., 1., 0.],\n", " [0., 2., 0., 1., 1., 1., 1., 1., 0.],\n", " [0., 0., 0., 0., 0., 0., 1., 1., 1.],\n", " [0., 0., 1., 1., 0., 0., 0., 0., 0.],\n", " [0., 1., 0., 0., 0., 0., 0., 0., 0.],\n", " [0., 0., 0., 0., 0., 1., 0., 0., 0.],\n", " [0., 0., 0., 0., 1., 0., 0., 0., 0.]])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "matrix[:10, :10]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'iv': 0, 'minors': 1, 'lab': 2, 'trees': 3, 'of': 4, 'graph': 5, 'eps': 6, 'opinion': 7, 'binary': 8, 'measurement': 9, 'survey': 10, 'paths': 11, 'widths': 12, 'testing': 13, 'ordering': 14, 'engineering': 15, 'machine': 16, 'for': 17, 'computer': 18, 'to': 19, 'applications': 20, 'response': 21, 'system': 22, 'perceived': 23, 'intersection': 24, 'a': 25, 'time': 26, 'relation': 27, 'human': 28, 'random': 29, 'well': 30, 'unordered': 31, 'the': 32, 'generation': 33, 'quasi': 34, 'user': 35, 'in': 36, 'management': 37, 'error': 38, 'abc': 39, 'interface': 40, 'and': 41}\n" ] } ], "source": [ "print(dictionary)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### SVD分解\n", "\n", "关于svd分解可以参阅网上相关资料, 宗旨使用svd分解的目的就是降低噪音并降低词向量/文档向量所在空间的维度, 因为我们的语料库比较小, 所以我们就让维度降到3好了。但是通常在大型的语料库进行LSA分析的时候, 通常降维到100-300之间。\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from scipy import linalg\n", "# 使用scipy模块进行svd分解, 得到三个矩阵\n", "U, sigma, VT = linalg.svd(matrix)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.shape: (42, 42)\n", "sigma.shape: (9,)\n", "VT.shape: (9, 9)\n" ] } ], "source": [ "print('U.shape:', U.shape)\n", "print('sigma.shape:', sigma.shape)\n", "print('VT.shape:', VT.shape)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "估计得到的潜在语义空间\n", "X_new.shape: (42, 9)\n" ] } ], "source": [ "# 降维后的维数\n", "n = 3\n", "\n", "# 取前n个向量\n", "U2=U[:, :n]\n", "VT2=VT[:n]\n", "\n", "# 把向量转换位对角矩阵\n", "sigma_matrix=linalg.diagsvd(sigma, U2.shape[0], VT2.shape[1])\n", "# 截取相应的部分\n", "sigma_matrix=sigma_matrix[:U2.shape[1],:VT2.shape[0]]\n", "\n", "# SVD的逆运算, 得到新的语义空间, 它和词频矩阵有相同的形状\n", "X_new=np.dot(U2, sigma_matrix)\n", "X_new=np.dot(X_new, VT2)\n", "print('估计得到的潜在语义空间')\n", "print('X_new.shape:', X_new.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 使用词向量计算词之间的相关性\n", "\n", "我们可以看到语义接近的词的cosine值偏小。感兴趣的人可以拿这个图跟前一篇文章进行对比(文本向量系列-如何基于词频矩阵和TF-IDF权重构建词向量), 可以看出LSA可以揭示词的语义相关性。而tf-idf方法效果较差。" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQQAAAECCAYAAAAYUakXAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAAHtxJREFUeJzt3Xt8VeWZL/DfkwuEOwQCDQQloCiUYkTxUtQ6oPWGWk/xDLb1cuoZW87oKac6Vj2fTul02qH2qHX8dJhiZWDGCzpSDz0MahlEQLDc76TcJEhIuIVLEiD35/yxV5YJz7PNInvvQPD3/XzySfaT9a53rb13nqz97nc/r6gqiIgAIO1sHwARnTuYEIgoxIRARCEmBCIKMSEQUYgJgYhCbZ4QRORWEdkmIjtF5KkU91UkIptEZL2IrE7yvmeIyEER2dwkli0iC0RkR/C9V4r6mSIi+4LzWi8ityehn4EiskhECkVki4j8IIin4pzi9ZWK88oSkZUisiHo66dBPF9EVgTn9aaIdEhRPzNFZHeTcypI9JyC/aaLyDoRmZfU81HVNvsCkA5gF4DBADoA2ABgeAr7KwLQJ0X7vgHAKACbm8SeBfBU8PNTAH6Zon6mAHgiyeeTC2BU8HM3ANsBDE/ROcXrKxXnJQC6Bj9nAlgB4BoAbwGYGMT/GcCkFPUzE8CEFDz/fgjgdQDzgttJOZ+2vkK4CsBOVf1EVWsAzAZwdxsfQ1Ko6hIAR04L3w1gVvDzLADfSFE/Saeqpaq6Nvi5AkAhgAFIzTnF6yvpNKYyuJkZfCmAsQDeDuIJn9fn9JN0IpIH4A4AvwtuC5J0Pm2dEAYA2NvkdjFS9EQIKIA/isgaEXkkhf006qeqpUDsSQ+gbwr7elRENgYvKRK+jG9KRAYBuByx/3IpPafT+gJScF7B5fV6AAcBLEDsKvWYqtYFmyTleXh6P6raeE4/D87pBRHpmGg/AH4N4EkADcHt3kjS+bR1QhAnlsq502NUdRSA2wD8tYjckMK+2tI0AEMAFAAoBfBcsnYsIl0BzAEwWVXLk7XfiH2l5LxUtV5VCwDkIXaVOszbLNn9iMgIAE8DuBTAaADZAH6USB8iMh7AQVVd0zTsHU5r9t/WCaEYwMAmt/MAlKSqM1UtCb4fBPAOYk+GVDogIrkAEHw/mIpOVPVA8ORrAPAyknReIpKJ2B/oa6r6+yCcknPy+krVeTVS1WMAPkTstX1PEckIfpXU52GTfm4NXh6pqlYD+Bckfk5jANwlIkWIveQei9gVQ1LOp60TwioAFwcjoh0ATATwh1R0JCJdRKRb488Avg5g8+e3StgfADwY/PwggLmp6KTxDzRwD5JwXsHr0FcAFKrq801+lfRzitdXis4rR0R6Bj93AnATYmMWiwBMCDZL+Lzi9PPnJslUEHtdn9A5qerTqpqnqoMQ+/v5QFW/jWSdT7JHPyOMjt6O2KjyLgD/O4X9DEbsXYwNALYkuy8AbyB2WVuL2JXPw4i9llsIYEfwPTtF/fwbgE0ANiL2B5ubhH6uQ+wycyOA9cHX7Sk6p3h9peK8RgJYF+xzM4C/bfL8WAlgJ4B/B9AxRf18EJzTZgCvIngnIknPwRvx2bsMSTkfCXZGRMSZikT0GSYEIgoxIRBRiAmBiEJMCEQUOmsJoY2mErdZP23ZF8+JfaWqn7N5hdBWD0qbPfht2BfPiX2lpJ+EEoK0YW0DIkq9Vk9MEpF0xGYc3ozYDLpVAO5T1a3x2nSQjpqFLgCAWlQjE/E/+PWlESdNbP/mziZWl9PFbZ9x6ITtZ2im2S4zrcHEAKBum41nD682sZKTPcOf68tPIL177Hg67rbH76m/yL8P0nfavhq3rTt+Ehk9YvfFgKyjkfoBgFNq+8pAvbttpsQ+OHf8SD16ZKcDACobssx2lXX+8ffOPGFiZVtszY7q/Nh5NL3vAKB3p0qzbUepM7Hy+k5u/97j1/hcb/qciM0odmTa5wpqa+Pu83Tdhsf6P3G0Bl16xc67Yqv9/9vZ+5gVgE5pNfaQnPOv1thxVh6tRddenx3z0U97NNuuquooampOxDnZzySSEK4FMEVVbwluPw0AqvoP8dp0l2y9WsZF2v+TuzaZ2LMXjTSxg5Ouddv3nfaxPeaF/U2sT5Z94gJA2dgqE7tv/U4Tm7LyLrf9xQ9tsMEG+8d3ZN5Qt332+O2Rtp06bI7b3rPh1IUm1ifD/0DjgEybaJZUXmpiyw8PdtvfP8De/68Nt/3vmHmZ3/6yFSY2pOMBE/vjkRFue+/xa6iysbQsm+QAIC23n21favv39gkAX9t4ysQWF3Q1sZGr/X9IX+m818S8x2RXjT1OAHhr0q3Nbq9e/RuUlxe3mBASecnQ1rUNiCjFMlreJK5In8EORj8fAYAs2Et+Ijp3JJIQItU2UNXpAKYDsZcMUXc+0LuUdV7eZJyKs0tn254d7WXcvhM9TAwAOtbZ17CbTg60G1Y4rzUB9+WBp3uWf8kZdVv3forjo3p7eVxa45//Nb33mNi2Snt5mpnmn6d7X3n3SZz7z+vrZL0dg8jNOu62P1yX7sZPp3X2dTkAwBlbiLutY09Vtg022OdfH2esBfDvP+8xqVL//qvPan7+2uKLhZhEXjK0WW0DImobrb5CUNU6EXkUwPuIVVOeoapbknZkRNTmEnnJAFWdD2B+ko6FiM4yfpaBiEJMCEQUSuglQyo9duEYE/vpJ2tM7Ce789z2L01ZZmKTC8abWP1sPyf+ePtKExuTZbetvtq/CycVLTexkvpuJvbc1b3d9o87E7O8bR87bO+neMoetpO4Vv9smrvtLf3tfu/Zus3E5k683m3/q/fsK8nCIjt7c9phf5R8eGdbNPj7PfeZ2K35V7vtf7zdTozyRuSzxM4+BIDvrhpuYjNG2+dEvFH+Jzbfa2KvF80wsccLbnPbz9+62MS8xwRp/rspZZObH1fd2mhvM/AKgYhCTAhEFGJCIKIQEwIRhdp0XYYz+bRjVOXfusbv6/U/2eBCOwB5tMr/+GyvOz8xsUtW2vz5H0uucNtf9EOnf2c67JH/d7HbPvvOHZG2fWn4G257z7+W2UGpDPGnHn8r2x7/L/beYWKHTvofPx+d86mJbbvSDuDtfN5//L58RZGJDepSZmIVdf6nFUu+6nz8XJ1PFor/PzGj/5dMrK5kf7R9Auix1E5dPn6dPf5ha/xB6QZnrrH3mGyp9j9P+Mb/uL3Z7dWrUv9pRyI6zzAhEFGICYGIQkwIRBRq94OK8Rz/jh2sOjbUjqkMfuOw2766f3cTqxhoP4/fZ54tdQYA5TdeZGIn+tpZZaf8CljoZKt1udteeJ0dvIun9h/sDirz7DkBQK/7bQmv+p/1te0H+O0bMu193aHSDsB1/9CWpQOAmhG23Fr5IFu/8dDV/qDo4LedcnWX2PbZ22ztSgDIPG7jtT1se2+fANBvVYWJVV5oCwQdKvD/J/dw7hbvMdmxzz4mANDv3eaPy+b3fo0TZXs5qEhE0TEhEFGICYGIQkwIRBRK6OPPIlIEoAJAPYA6Vb0yGQdFRGdHQu8yBAnhSlX1h+pP05bvMni6Le1jYsUVPZ0tgR7jd5vYkBX2s+/vf1Tgtr/of0Wbulz9vh1NB4COt9gKu9620y95zW3vebnsOhM7Ue+Pkk/uu9DEnij6pomdqvPrAVzc/ZCJ7Rptq0bvfCHO1OVRRSaW1/mYiXkrHAHA5tHOgPqZTF2+wE4JrvvU1mOIN3W590f2eVU2xi608tUN/vEfqLHvcnmPydY4C7W8+OjEZrfXLn8JFcc5dZmIzkCiCUEB/FFE1rTlEttElBqJllAbo6olItIXwAIR+bOqLmm6AVduImo/ErpCUNWS4PtBAO8AuMrZZrqqXqmqV37eas9EdPa1+gpBRLoASFPViuDnrwP4u6QdWQpUXG/HPg+9YKcYA0DGfDtYNK7Hf5rYu9381Yc7Lrafp6+pt1OXO44tctvLB3ZQy9t2ci9bODaemoJ8E5v027fdbb2CtJWzbe2ILv/loNt+3LqtJla8+Ksm1rDDXx7tmFOn4sFcW7h2+iVD3PYn3h1kYl072OnIlTX+P6nyBfbx636z/XPx9gkAK5dfYGLDF9uB6uUFzhx1AI9ss/ef95g0VPpLwWGsH25JIi8Z+gF4R2Ij5xkAXlfV9xLYHxGdZYks5fYJgMuSeCxEdJbxbUciCjEhEFHovK2HkKjt00eb2LBn7aBkw167whAA7H3dDlZenmtnui1bP9RtP6bA1lnwth04xM4IjKfbJPtYa5o/ea3yNzbW5e5SEyv7y8vd9n2W2YKke35p33Ye+C2/HoI4szolL9fE/vy3vdz2WmUHcMeMsIVrl232i9xmHLOvput62gFQb58AcOirdlZl8Zwv230W2hmJADBkpr3/vMdkb4kt5goAmVnNj7X4mWmo2rWPMxWJKDomBCIKMSEQUYgJgYhCTAhEFOK7DGeg8r3BJlb+gZ3iCgD9f/WxiaV1tNNk982204kBYMBEO83V23bayOj1EJ7cPiHyts8OtVOa/9vKh0wsLc1//vTubqfUdr3NnlPJ31zrtq8eVWliOT1t7MJutsYA4NceSMuyy741VNkaDQCQ3s9WM64/YKdpe/sEgD2v2neZBt67xcRylvVw2+8ut+8eeI/J2lP+82fuo83/zriUGxGdMSYEIgoxIRBRiAmBiEKJVkz6Qul66ycm1mOYnSILAJ2X9Daxu3LWm9igDk4xVgBF621BWG/bMVnRc/qpubYgZ840O/gJAGNK7H6HPGSnU5fNGei29+4rr8jtwEf8qdf6j3bZMm8A8Gd7PnLbP/DeAyb2vUFLTOy3RTe47fO7HzGx3eV2UNnbJwDMHm+nWXvPiUu7FrvtvanP3mOSDns/A8AtM5vXU5hwR6Q6yLxCIKLPMCEQUYgJgYhCTAhEFGpxUFFEZgAYD+Cgqo4IYtkA3gQwCEARgP+qqv6UsfNcfaH/efiNS+0MvLV5tvAmGuJMHvNmADrbjh+x8XOPr6mcdXb2YPowvx7A/yyxKzJJlp1pWftujts+fZiNbXQGFfML/UHN9J52Bl96vh3AnLD+Urd93TI702/KiLtMrMtmf6bhpmo7A7XBqcfq7RMALj1o6zxsXGrrIazpZQcqAWDYMPvn5D0mH+/3ZyqW7WleJ6L0+IvudqeLcoUwE8Ctp8WeArBQVS8GsDC4TUTtXIsJIVh45fT3YO4GMCv4eRaAbyT5uIjoLGjtGEI/VS0FgOC7/SQIEbU7KZ+YxKXciNqP1l4hHBCRXAAIvvvL94BLuRG1J629QvgDgAcBTA2+z03aEZ0n8p9xRs/T7DTnE/MvdNt3uX1PpG0n9Vkc+Zie/bUdUa+o9ZO0t9//Pvs7JtYZtjowAHT+y3ITy7/eviOz+xd+PYScUXaJs9wuduT9+k7H3fbbfrXNBsX5/6d2yT4AyMizS+nVFduq2e4+AdQu6G9i+ePsc+KKdX7/20fYV+HeY3JddzudHACmvXRvs9tlx6LVPWnxCkFE3gDwMYBLRKRYRB5GLBHcLCI7ANwc3Caidq7FKwRVvS/Or9pv6SMicnGmIhGFmBCIKMR6CG2pod6Equv8h6BLxG1L6rtF7v4rXe2gWJ8MO/gXb7835dqBuuWH/am3Xu2H19LsoGhdXrXb3utrSEc70PjHIyPc9mkdbe0Er55CvCKpyLADwF6R3HhFWv8ixw72LU7ramK16k9d9+4/7zE5Vt/FbZ9e3XywUiLWUuYVAhGFmBCIKMSEQEQhJgQiCnFQ8SzLHu/PNDsyb2ikbZ/rc2Pkvk6OtgOAf/ebl91tp15lp5mkvd3BxDLv8WcKDlpvi3qemO8UNN3vD6qtOWprR4y7wK58dGiM33/pO7ZOQscMb6DWL5KbvqCXidXfbD+L4+0TAH632LbvP98WlN101adu+zu32/vPe0y00ta4AAAZfdooYsQV2niFQEQhJgQiCjEhEFGICYGIQkwIRBTiuwznKO8dBe+dh6nD5kTe54ZTu02sqMZWQgaAx1d8aGJLKu3I/fJ3/KnL3n69Gg+Y2dNtf0UvO/ru7TNnma3ODABpY+25nsnU5bTck7b9K3bqdLypy/dsPGViiwvs1OWvrPbrIXjn6j0mu2rs8nwA8Nak4c0DEqe692l4hUBEISYEIgoxIRBRiAmBiEKtXcptCoC/AtA4F/MZVZ2fqoOkGE5d5tTl052NqcszYZdyA4AXVLUg+GIyIDoPtHYpNyI6DyUyhvCoiGwUkRkiYq+PAiLyiIisFpHVtfDLZRHRuaG1CWEagCEACgCUAngu3oZcuYmo/WjVTEVVDadsicjLAOYl7YjojNQfLou8bWV/O9Pxhjg1Rn/u7PeuvnZQbO6g69323n5fvvRVE5vW52tu++GdSyLt8xcd7EAnAEwb+ZqJVWmmiWVJrdv+u9UPmdiMiPsEgCfq7zUx7/wf73qb2947V+8x8VYDA4ADozs1u127Ldr//lZdITSu6xi4B8Dm1uyHiM4tUd52fAPAjQD6iEgxgJ8AuFFECgAogCIA30vhMRJRG2ntUm6vpOBYiOgs40xFIgoxIRBRiAmBiEJMCEQUYkIgohATAhGFmBCIKCQa8XPSydBdsvVqsZ/pprPn0KRr3XhajY312G2DWdv3u+1L77T1DC570E5oLZmc77av62ynBB+9xE5T7rLfL1La8D07zfqbeetNbE5xgdv+0DpbvDTncltk1dsnACy8bbiJ9f13W7uh9LFBbvuDV9qCrN5j0nV/ndu+9JrmU4z2/tMLqNq3t8VKq7xCIKIQEwIRhZgQiCjEhEBEIa7c9AWXM+1jN/5+iR0su23wNSZ2aM5Af7932P1WfMuuRpR+3K5wBABp621B2ZwP7CpJ/7TnI7f9A4UPmFifjHJ3W89VXys0sd3l2ZH3qR3toGhFrS0QdN0rq9z2S0faggjeY/KnKr/Ia3Z68/tqwpu2aKuHVwhEFGJCIKIQEwIRhZgQiCjUYkIQkYEiskhECkVki4j8IIhni8gCEdkRfI9bip2I2ocWpy4HBVVzVXWtiHQDsAbANwA8BOCIqk4VkacA9FLVH33evjh1uX27Z6tTdXmiX3X53fdmm1hhzUkTm3Y4etXl7/fcZ2K35l/ttv9xoX2X44yqLq96yMRmjJ4ZaZ8A8MRmW3X59ctmmNjjBX7V5flbF5vYLf2dadZxqi6XTm5+v+x69Xmc2p+EqcuqWqqqa4OfKwAUAhgA4G4As4LNZiGWJIioHTujMQQRGQTgcgArAPRT1VIgljQA9E32wRFR24qcEESkK4A5ACarauQZHlzKjaj9iJQQRCQTsWTwmqr+PggfaFywJfh+0GvLpdyI2o8og4qC2BjBEVWd3CT+KwBlTQYVs1X1yc/bFwcVzz+7p/r1FIZMWWdi5f+3v4n1mOgvRVd/zNYOSMuy03kvWOKPky365GITu2/YGhN7o/AKt31U3j4BYOW1PUys/PdfMrFjJzqZGABc8O1dJnb32r0mtuSoXZ4PAHp2aD4lfM7983Foa1mLg4pRPsswBsD9ADaJSONk6mcATAXwlog8DOBTAHZYlYjalSgrN30EIF5m4b97ovMIZyoSUYgJgYhCrIdACcl/yq+nUPYfdlAvE/az+6Xf+bLb/kSeHezu+RX7mX650w4+AkDaS7b98sOD7XZp/qB650W2yOnJv6iMtE8AKP9mnollwBZp9QYPAaDMqTMxd+IgE0s7dMxtv+/K5vd/Vckidzuzv0hbEdEXAhMCEYWYEIgoxIRARCEmBCIK8V0GSoled+wwsW5LbdXlkwv9z/PrbjtNt6HKVl1+6QyqLt8/wL4j8tvaG9z2+ffb/r2qy94+AWD2x7ZOROdHKkysYKX/LsHSkfZc302k6vIdrLpMRGeICYGIQkwIRBRiQiCiUIv1EJKJ9RDodLt/4ddTyBllp/nmdrGFuvp38qcubxtdZ4Pi/P/TBrd9Rt4AE6srtkVe3X0CaFhgaz+kjbMDlVes8/vfXmkrEv5s4B9MbFNNrtt+2mPNqxGsXf4SKo4XJ15klYi+OJgQiCjEhEBEISYEIgpFKbI6EMC/AvgSgAYA01X1RRGZAuCvADQu5/OMqs7/vH1xUJGiSu9pi5Qi1w60lU71ZzrWLbOzCk+MsLP/umy2hVsBIM1ZMaDBKRru7RMALn10p4ntfMrWfqjt5c80HPaPR03son8rMrGP9+e77cv2NF9ZsXTqi6je0/LKTVGmLtcBeLzpUm4isiD43Quq+n8i7IOI2oEoRVZLATSu0FQhIo1LuRHReSaRpdwA4FER2SgiM7j6M1H7l8hSbtMADAFQgNgVxHNx2nEpN6J2otVLuanqAVWtV9UGAC8DuMpry6XciNqPFscQgqXcXgFQqKrPN4nnNq7+DOAeAJtTc4j0ReQu5VZlrzCvyo2zlNuoDiZ2v7eUW4fElnLz9gkAK+vsuyTedOx4S7l59SCGdy4xscO9bHVoABjd79Nmt+d0Oelud7pElnK7T0QKACiAIgDfi9QjEZ2zElnK7XPnHBBR+8OZikQUYkIgohCLrFK74RVZfbKfX2R18xFbJ2BIRzuol9PTLs8GAPndj5iYV2TV2ycArOo/1MS8eg639d/qtl9aZadUf7+nrcdQkPWpiQG2yOpHGbbAq4dXCEQUYkIgohATAhGFmBCIKMQiq3Re6r3MftYuTexzvUH9mY5r/3OYiY26qTDSPgFgWeFFJnbnZRtMbPs1fvvf7PrQxP56qP3b0Zoat/2pu0c3u71+0YuoPMoiq0R0BpgQiCjEhEBEISYEIgoxIRBRiFOX6bxUNsZWLU7LstOBvenQADC4n62aXPaTaPsEgE6v2v16y8vlLHOqSwN4oPABE3u28G0TW3vKr7o899HmfaX5xZ0NXiEQUYgJgYhCTAhEFGoxIYhIloisFJENIrJFRH4axPNFZIWI7BCRN0XEFrEjonYlyqBiNYCxqloZVF/+SETeBfBDxFZumi0i/wzgYcRKsxOdk7wBxHiDgtLZFj89k0HJ7wxdZWKLxRZEzc2yxWQB4OvZtmZxlWaaWMe0Wrc9WvmJhBavEDSmsYpEZvClAMYCaBz2nAXgG607BCI6V0RdlyE9qLh8EMACALsAHFPVxvc2isHl3YjavUgJIViQpQBAHmILstiPgsW5SOHKTUTtxxm9y6CqxwB8COAaAD1FpHEMIg+AXUUCXLmJqD2JsnJTDoBaVT0mIp0A3ATglwAWAZgAYDaABwHMTeWBEqVCvEFBOXkq0rbxBiVf3T7CxAbqFhMrrfJnKi4/aGcgPjvUzlSsbrADjQD8lVQiiPIuQy6AWSKSjtgVxVuqOk9EtgKYLSJ/D2AdYsu9EVE7FmXlpo2ILQF/evwTxFnglYjaJ85UJKIQEwIRhZgQiCjEeghEjvoDB00svV9fE/OmOAPAqcOdTSwjz87dW7VwoNt+9Dhb4fm7qx4ysZoy/12OYet3NLstJ6PNAeIVAhGFmBCIKMSEQEQhJgQiCnFQkSgib6AxbpHVPt1MrK54n4mNHldpYgCwuzzbxGaMnmlicYusFjRf9k1XRfscEa8QiCjEhEBEISYEIgoxIRBRiIOKRAn4whVZJaIvDiYEIgoxIRBRiAmBiEKJLOU2U0R2i8j64Ksg9YdLRKmUyFJuAPA3qmpLwRJ9wS0eaeskNCzsb2Lz5vr1ELa8NsjESp+z04+PlXZ323efXNHsdu0PG9ztThelyKoC8JZyI6LzTKuWclPVFcGvfi4iG0XkBRHhKixE7VyrlnITkREAngZwKYDRALIB/Mhry6XciNqP1i7ldquqlgYrQ1cD+BfEWaOBS7kRtR+tXspNRHJVtVREBLGl4O1cSyIKpY3ba2L3rNvjbrv9JlvQ9fWBr5nYpmG5bvtpj93b7HbGoWj/+xNZyu2DIFkIgPUAvh+pRyI6ZyWylNvYlBwREZ01nKlIRCEmBCIKsR4C0Vm05nL/f/L1G4tN7PGC20ysofKE277q3uZ/2g0ZEul4eIVARCEmBCIKMSEQUYgJgYhCTAhEFOK7DETnoKUj7RJx75csNrE/VdW77bPTm287YdPhSP3yCoGIQkwIRBRiQiCiEBMCEYU4qEjUTtzS3ylsnpbubls6+epmt3cdfD5SH7xCIKIQEwIRhZgQiCgUOSEEpdjXici84Ha+iKwQkR0i8qaIdEjdYRJRWziTK4QfAChscvuXAF5Q1YsBHAXwcDIPjIgiaKh3vzqUa7Mv8Sc0GlEXaskDcAeA3wW3BcBYAI3LuM1CrPIyEbVjUa8Qfg3gSQCNC8T1BnBMVeuC28UABiT52IiojUVZ/Xk8gIOquqZp2NnUXe+RKzcRtR9RJiaNAXCXiNwOIAtAd8SuGHqKSEZwlZAHoMRrrKrTAUwHgO6SzUViic5hLV4hqOrTqpqnqoMATATwgap+G8AiABOCzR4EMDdlR0lEbSKRqcs/AjBbRP4ewDoAryTnkIgoUb1/93Gz2xnqV2c+3RklBFX9ELHFXqGqnyDOAq9E1D5xpiIRhZgQiCjEhEBEIVFtu3cCReQQgD3BzT4AolV+TExb9dOWffGc2NeZ9nOhqua0tJM2TQjNOhZZrapXni/9tGVfPCf2lap++JKBiEJMCEQUOpsJYfp51k9b9sVzYl8p6eesjSEQ0bmHLxmIKMSEQEQhJgQiCjEhEFGICYGIQv8fykLEwV4h+qcAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "from scipy.spatial.distance import cosine\n", "\n", "\n", "def word_relations(weights, ):\n", " relations = np.zeros((len(weights), len(weights)))\n", " for i in range(len(weights)):\n", " vec1 = weights[i]\n", " for j in range(i, len(weights)):\n", " vec2 = weights[j]\n", " relations[i, j] = cosine(vec1, vec2)\n", " reverse = dict(zip(dictionary.values(), dictionary.keys()))\n", " plt.matshow(relations)\n", " \n", "word_relations(X_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 根据文档向量计算文档的相似性\n", "\n", "同类的文档cosine值较小。感兴趣的人可以拿这个图跟前一篇文章进行对比(文本向量系列-如何基于词频矩阵和TF-IDF权重构建词向量), 可以看出LSA可以揭示文档的语义相关性。而tf-idf方法效果较差。" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAP4AAAECCAYAAADesWqHAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvhp/UCwAADYxJREFUeJzt3W/MnXV9x/H3h7ut/YulwhQo4U8ymhgTgTQMJTEbRYPT4LL4ABZNNFu6LJsDZ2Z0T4hP9mTGuAeLiUGciVCnFRLjJoOoxJg4XFvKBItGsELlT3Gk/Omgf797cA71puns1Xn9Ttv7934lJz1379Pv93t67s+5rnPu61y/VBWS+nLGyR5A0uwZfKlDBl/qkMGXOmTwpQ4ZfKlDJzX4Sa5L8pMkP0vyiUY9bkuyO8lDjepfkOS7SXYkeTjJTQ16LE3ywyQPTnt8auwe0z5zSR5I8s0W9ac9dib5UZLtSbY0qL86yeYkj0wfk7eNXH/ddPZXLy8kuXnMHtM+H50+1g8l2ZRk6agNquqkXIA54FHgEmAJ8CDw5gZ93gFcATzU6H6cC1wxvb4K+OnY9wMIsHJ6fTFwP3BVg/vyN8AdwDcbPu47gbMb1v8S8GfT60uA1Q17zQFPAxeOXPd84OfAsunXXwU+NGaPk7nFvxL4WVU9VlX7ga8A7xu7SVV9D3hu7Lrz6j9VVdum118EdjB54MbsUVX10vTLxdPLqEdeJVkLvAe4dcy6s5TkTCZP9F8AqKr9VbWnYcsNwKNV9YsGtRcBy5IsApYDT45Z/GQG/3zgiXlf72LkwMxakouAy5lskceuPZdkO7AbuLeqxu7xWeDjwOGR6x6tgHuSbE2yceTalwDPAl+cvmS5NcmKkXvMdwOwaeyiVfVL4NPA48BTwPNVdc+YPU5m8HOMvzttjx9OshL4OnBzVb0wdv2qOlRVlwFrgSuTvGWs2kneC+yuqq1j1fwNrq6qK4B3A3+Z5B0j1l7E5GXd56rqcmAv0Oq9oyXA9cDXGtQ+i8ne78XAecCKJB8Ys8fJDP4u4IJ5X69l5N2ZWUmymEnob6+qO1v2mu663gdcN2LZq4Hrk+xk8pLrmiRfHrH+EVX15PTP3cBdTF7yjWUXsGve3tBmJk8ELbwb2FZVzzSofS3w86p6tqoOAHcCbx+zwckM/n8Cv5vk4umz5w3AN07iPP8vScLkNeWOqvpMox7nJFk9vb6MyQ/GI2PVr6pPVtXaqrqIyePwnaoadQsDkGRFklWvXgfeBYz225aqehp4Ism66V9tAH48Vv2j3EiD3fypx4Grkiyf/nxtYPLe0WgWjVnsRFTVwSR/Bfw7k3dHb6uqh8fuk2QT8PvA2Ul2AbdU1RdGbHE18EHgR9PX4AB/V1X/NmKPc4EvJZlj8mT91apq9iu3ht4I3DX5WWYRcEdV3T1yj48At083Jo8BHx65PkmWA+8E/nzs2gBVdX+SzcA24CDwAPD5MXtk+usCSR3xyD2pQwZf6pDBlzpk8KUOGXypQ6dE8BscurkgeyyE+2CPU6P+KRF8oPmDtEB6LIT7YI9ToP6pEnxJM9TkAJ4leV0tZfiHog6wj8W8bvQ5FlqPhXAf7NG2/ivsZX/tO9YH4F6jySG7S1nB72VDi9KSfoP769uDbueuvtQhgy91yOBLHTL4UocMvtQhgy91yOBLHRoU/FmseCNpdo4b/Ol53v6JyVlF3wzcmOTNrQeT1M6QLf5MVryRNDtDgr/gVryRejfkWP1BK95MPzu8EWApy3/LsSS1NGSLP2jFm6r6fFWtr6r1rT8VJem3MyT4C2LFG0m/dtxd/VmteCNpdgZ9Hn+6HNSYS0JJOok8ck/qkMGXOmTwpQ4ZfKlDBl/qkMGXOtTk9Nr7Ll7GY39/WYvSR1zyJ9ub1pcWMrf4UocMvtQhgy91yOBLHTL4UocMvtQhgy91yOBLHRpyeu3bkuxO8tAsBpLU3pAt/j8D1zWeQ9IMHTf4VfU94LkZzCJpRnyNL3VotOAn2ZhkS5Ith1/cO1ZZSQ2MFvz559U/Y9WKscpKasBdfalDQ36dtwn4AbAuya4kf9p+LEktDVlQ48ZZDCJpdtzVlzpk8KUOGXypQwZf6pDBlzpk8KUOGXypQ00W1Fi66wCXfuzpFqWP2P3htzWtD7Dmiz9o3kM6GdziSx0y+FKHDL7UIYMvdcjgSx0y+FKHDL7UIYMvdWjIGXguSPLdJDuSPJzkplkMJqmdIUfuHQQ+VlXbkqwCtia5t6p+3Hg2SY0MWVDjqaraNr3+IrADOL/1YJLaOaHX+EkuAi4H7m8xjKTZGPwhnSQrga8DN1fVC8f4/kZgI8DSuZWjDShpfIO2+EkWMwn97VV157FuM39BjSVnLBtzRkkjG/KufoAvADuq6jPtR5LU2pAt/tXAB4FrkmyfXv6w8VySGhqyoMb3gcxgFkkz4pF7UocMvtQhgy91yOBLHTL4UocMvtQhgy91qMmCGiSwZHGT0q86sLL9oQVZ1Oa/Z746eLB5D+lobvGlDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+qQwZc6NOTUW0uT/DDJg9MFNT41i8EktTPk0LR9wDVV9dL0pJvfT/KtqvqPxrNJamTIqbcKeGn65eLppVoOJamtoafXnkuyHdgN3FtVLqghncYGBb+qDlXVZcBa4Mokbzn6Nkk2JtmSZMv+Qy+PPaekEZ3Qu/pVtQe4D7juGN/79YIacy6oIZ3Khryrf06S1dPry4BrgUdaDyapnSHv6p8LfCnJHJMniq9W1TfbjiWppSHv6v8XkxVyJS0QHrkndcjgSx0y+FKHDL7UIYMvdcjgSx0y+FKHmqwY8aZLn+dvv/GvLUof8Q/XvKdpfYCd/7KueY+1q59v3oMNu9r30GnFLb7UIYMvdcjgSx0y+FKHDL7UIYMvdcjgSx0y+FKHBgd/eqbdB5J49h3pNHciW/ybgB2tBpE0O0PPq78WeA9wa9txJM3C0C3+Z4GPA4cbziJpRoacXvu9wO6q2nqc2x1ZUGPPc4dGG1DS+IZs8a8Grk+yE/gKcE2SLx99o/kLaqxeMzfymJLGdNzgV9Unq2ptVV0E3AB8p6o+0HwySc34e3ypQyd0Io6quo/J2nmSTmNu8aUOGXypQwZf6pDBlzpk8KUOGXypQ03Oq/9KLean+9/UovQRT7z/gqb1AV55+eXmPR7dubZ5j0OfO69p/Uv/4odN62t8bvGlDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+qQwZc6ZPClDg06cm96vr0XgUPAwapa33IoSW2dyCG7f1BVv2o2iaSZcVdf6tDQ4BdwT5KtSTa2HEhSe0N39a+uqieT/A5wb5JHqup7828wfULYCLDmvNeNPKakMQ3a4lfVk9M/dwN3AVce4zZHFtRYedbicaeUNKohS2itSLLq1evAu4CHWg8mqZ0hu/pvBO5K8urt76iqu5tOJamp4wa/qh4D3jqDWSTNiL/Okzpk8KUOGXypQwZf6pDBlzpk8KUONVlQY0kOcv7i51qUPmLfmmpaH+DMVe0X1Nizt8lD8BqLVh1o22ByjEdb1f7x7olbfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qkMGXOmTwpQ4NCn6S1Uk2J3kkyY4kb2s9mKR2hh429o/A3VX1/iRLgOUNZ5LU2HGDn+RM4B3AhwCqaj+wv+1Ykloasqt/CfAs8MUkDyS5dXrSTUmnqSHBXwRcAXyuqi4H9gKfOPpGSTYm2ZJky/PPHRp5TEljGhL8XcCuqrp/+vVmJk8ErzH/vPqvXzM35oySRnbc4FfV08ATSdZN/2oD8OOmU0lqaui7+h8Bbp++o/8Y8OF2I0lqbVDwq2o7sL7xLJJmxCP3pA4ZfKlDBl/qkMGXOmTwpQ4ZfKlDBl/qUJPVHIpwoNouFHFg7b6m9QFWLW3fY8/ilc17vPWCXU3r7122rGl9gMOvtH8sONzPZ0zc4ksdMvhShwy+1CGDL3XI4EsdMvhShwy+1CGDL3XouMFPsi7J9nmXF5LcPIvhJLVx3MPrquonwGUASeaAXwJ3NZ5LUkMnuqu/AXi0qn7RYhhJs3Giwb8B2NRiEEmzMzj40zPsXg987f/4/rwFNQ6ONZ+kBk5ki/9uYFtVPXOsb752QY22n8yT9Ns5keDfiLv50oIwKPhJlgPvBO5sO46kWRi6oMb/AG9oPIukGfHIPalDBl/qkMGXOmTwpQ4ZfKlDBl/qkMGXOtTk2NrVZxzmj1a81KL0EbdsW9q0PsDqP365eY89a/Y27/Hwty9tWv/CQ1ub1gc4Y8ni5j3y+vaHqhx6ZnfzHkO4xZc6ZPClDhl8qUMGX+qQwZc6ZPClDhl8qUMGX+rQ0DPwfDTJw0keSrIpSfujZyQ1M2QlnfOBvwbWV9VbgDkmp9mWdJoauqu/CFiWZBGwHHiy3UiSWjtu8Kvql8CngceBp4Dnq+qe1oNJamfIrv5ZwPuAi4HzgBVJPnCM2x1ZUOPZ/z40/qSSRjNkV/9a4OdV9WxVHWByiu23H32j+QtqnPOGubHnlDSiIcF/HLgqyfIkYbJw5o62Y0lqachr/PuBzcA24EfTf/P5xnNJamjoghq3ALc0nkXSjHjkntQhgy91yOBLHTL4UocMvtQhgy91yOBLHUpVjV80eRb4xQn8k7OBX40+yMLrsRDugz3a1r+wqs453o2aBP9EJdlSVevtcXLr2+PU6tGyvrv6UocMvtShUyX4s/jQz0LosRDugz1OgfqnxGt8SbN1qmzxJc2QwZc6ZPClDhl8qUMGX+rQ/wK9uhrEXCWiNwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "def doc_relations(weights):\n", " n = weights.shape[1]\n", " relations = np.zeros((n, n))\n", " for i in range(n):\n", " vec1 = weights[:, i]\n", " for j in range(i, n):\n", " vec2 = weights[:, j]\n", " relations[j, i] = cosine(vec1, vec2)\n", " reverse = dict(zip(dictionary.values(), dictionary.keys()))\n", " plt.matshow(relations)\n", " \n", "doc_relations(X_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 总结\n", "\n", "根据上面的实验, 可以看出LSA即便在较小的语料库上也能显现出效果, 而tf-idf或者词频矩阵这些方法构建的词向量的效果要差很多。所以, LSA经常被用来计算词/文档的相似性。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }