Python使用gensim计算文档相似性

2025-04-14 13:10:38

pre_file.py

#-*-coding:utf-8-*-
import MySQLdb
import MySQLdb as mdb
import os,sys,string
import jieba
import codecs
reload(sys)
sys.setdefaultencoding('utf-8')
#连接数据库
try:
  conn=mdb.connect(host='127.0.0.1',user='root',passwd='kongjunli',db='test1',charset='utf8')
except Exception,e:
  print e
  sys.exit()
#获取cursor对象操作数据库
cursor=conn.cursor(mdb.cursors.DictCursor) #cursor游标
#获取内容
sql='SELECT link,content FROM test1.spider;'
cursor.execute(sql)   #execute()方法，将字符串当命令执行
data=cursor.fetchall()#fetchall()接收全部返回结果行
f=codecs.open('C:\Users\kk\Desktop\hello-result1.txt','w','utf-8')

for row in data:    #row接收结果行的每行数据
  seg='/'.join(list(jieba.cut(row['content'],cut_all='False')))
  f.write(row['link']+' '+seg+'\r\n')
f.close()

cursor.close()
      #提交事务，在插入数据时必须

jiansuo.py

#-*-coding:utf-8-*-
import sys
import string
import MySQLdb
import MySQLdb as mdb
import gensim
from gensim import corpora,models,similarities
from gensim.similarities import MatrixSimilarity
import logging
import codecs
reload(sys)
sys.setdefaultencoding('utf-8')

con=mdb.connect(host='127.0.0.1',user='root',passwd='kongjunli',db='test1',charset='utf8')
with con:
  cur=con.cursor()
  cur.execute('SELECT * FROM cutresult_copy')
  rows=cur.fetchall()
  class MyCorpus(object):
    def __iter__(self):
      for row in rows:
        yield str(row[1]).split('/')
#开启日志
logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s',level=logging.INFO)
Corp=MyCorpus()
#将网页文档转化为tf-idf
dictionary=corpora.Dictionary(Corp)
corpus=[dictionary.doc2bow(text) for text in Corp] #将文档转化为词袋模型
#print corpus
tfidf=models.TfidfModel(corpus)#使用tf-idf模型得出文档的tf-idf模型
corpus_tfidf=tfidf[corpus]#计算得出tf-idf值
#for doc in corpus_tfidf:
  #print doc
###
'''
q_file=open('C:\Users\kk\Desktop\q.txt','r')
query=q_file.readline()
q_file.close()
vec_bow=dictionary.doc2bow(query.split(' '))#将请求转化为词带模型
vec_tfidf=tfidf[vec_bow]#计算出请求的tf-idf值
#for t in vec_tfidf:
 # print t
'''
###
query=raw_input('Enter your query:')
vec_bow=dictionary.doc2bow(query.split())
vec_tfidf=tfidf[vec_bow]
index=similarities.MatrixSimilarity(corpus_tfidf)
sims=index[vec_tfidf]
similarity=list(sims)
print sorted(similarity,reverse=True)

encodings.xml

<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
 <component name="Encoding">
  <file url="PROJECT" charset="UTF-8" />
 </component>
</project>

misc.xml

<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
 <component name="ProjectLevelVcsManager" settingsEditedManually="false">
  <OptionsSetting value="true" id="Add" />
  <OptionsSetting value="true" id="Remove" />
  <OptionsSetting value="true" id="Checkout" />
  <OptionsSetting value="true" id="Update" />
  <OptionsSetting value="true" id="Status" />
  <OptionsSetting value="true" id="Edit" />
  <ConfirmationsSetting value="0" id="Add" />
  <ConfirmationsSetting value="0" id="Remove" />
 </component>
 <component name="ProjectRootManager" version="2" project-jdk-name="Python 2.7.11 (C:\Python27\python.exe)" project-jdk-type="Python SDK" />
</project>

modules.xml

<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
 <component name="ProjectModuleManager">
  <modules>
   <module fileurl="file://$PROJECT_DIR$/.idea/爬虫练习代码.iml" filepath="$PROJECT_DIR$/.idea/爬虫练习代码.iml" />
  </modules>
 </component>
</project>

Python实现计算最小编辑距离

最小编辑距离或莱文斯坦距离(Levenshtein),指由字符串A转化为字符串B的最小编辑次数.允许的编辑操作有:删除,插入,替换.具体内容可参见:维基百科-莱文斯坦距离.一般代码实现的方式都是通过动态规划算法,找出从A转化为B的每一步的最小步骤.从Google图片借来的图, Python代码实现, (其中要注意矩阵的下标从1开始,而字符串的下标从0开始): def normal_leven(str1, str2): len_str1 = len(str1) + 1 len_str2 = len
Python计算字符宽度的方法

本文实例讲述了Python计算字符宽度的方法.分享给大家供大家参考,具体如下: 最近在用python写一个CLI小程序,其中涉及到计算字符宽度,目标是以友好的方式将一个长字符串截取为等宽的片段. 对于unicode字符,python的len函数可以准确的计算其中所包含的字符个数,但是个数并不代表宽度,如: >>>len(u'你好a') 3 因此无法简单的使用这种方式来计算宽度. GBK decode 首先我想到GBK编码,00–7F范围内的字符是一字节编码,其余是双字节编码,正好与字符的
Python 字符串操作方法大全

1.去空格及特殊符号复制代码代码如下: s.strip().lstrip().rstrip(',') 2.复制字符串复制代码代码如下: #strcpy(sStr1,sStr2)sStr1 = 'strcpy'sStr2 = sStr1sStr1 = 'strcpy2'print sStr2 3.连接字符串复制代码代码如下: #strcat(sStr1,sStr2)sStr1 = 'strcat'sStr2 = 'append'sStr1 += sStr2print sStr1 4.查
Python计算一个文件里字数的方法

本文实例讲述了Python计算一个文件里字数的方法.分享给大家供大家参考.具体如下: 这段程序从所给文件中找出字数来. from string import * def countWords(s): words=split(s) return len(words) #returns the number of words filename=open("welcome.txt",'r') #open an file in reading mode total_words=0 for li
Python字符转换

如:>>> print ord('a') 97 >>> print chr(97) a 下面我们可以开始来设计我们的大小写转换的程序了: 复制代码代码如下: #!/usr/bin/env python #coding=utf-8 def UCaseChar(ch): if ord(ch) in range(97, 122): return chr(ord(ch) - 32) return ch def LCaseChar(ch): if ord(ch) in rang
详解Python编程中基本的数学计算使用

数在 Python 中,对数的规定比较简单,基本在小学数学水平即可理解. 那么,做为零基础学习这,也就从计算小学数学题目开始吧.因为从这里开始,数学的基础知识列位肯定过关了. >>> 3 3 >>> 3333333333333333333333333333333333333333 3333333333333333333333333333333333333333L >>> 3.222222 3.222222 上面显示的是在交互模式下,如果输入 3,就显
python利用datetime模块计算时间差

今天写了点东西,要计算时间差,我记得去年写过,于是今天再次mark一下,以免自己忘记 In [27]: from datetime import datetime In [28]: a=datetime.now() In [29]: b=datetime.now() In [32]: a Out[32]: datetime.datetime(2015, 4, 7, 4, 30, 3, 628556) In [33]: b Out[33]: datetime.datetime(2015, 4, 7
python计算圆周率pi的方法

本文实例讲述了python计算圆周率pi的方法.分享给大家供大家参考.具体如下: from sys import stdout scale = 10000 maxarr = 2800 arrinit = 2000 carry = 0 arr = [arrinit] * (maxarr + 1) for i in xrange(maxarr, 1, -14): total = 0 for j in xrange(i, 0, -1): total = (total * j) + (scale * a
基于python的Tkinter实现一个简易计算器

本文实例介绍了基于python的Tkinter实现简易计算器的详细代码,分享给大家供大家参考,具体内容如下第一种:使用python 的 Tkinter实现一个简易计算器 #coding:utf-8 from Tkinter import * import time root = Tk() def cacl(input_str): if "x" in input_str: ret = input_str.split("x") return int(ret[0]) *
Python 匹配任意字符（包括换行符）的正则表达式写法

想使用正则表达式来获取一段文本中的任意字符,写出如下匹配规则: (.*) 结果运行之后才发现,无法获得换行之后的文本.于是查了一下手册,才发现正则表达式中,"."(点符号)匹配的是除了换行符"\n"以外的所有字符. 以下为正确的正则表达式匹配规则: ([\s\S]*) 同时,也可以用 "([\d\D]*)"."([\w\W]*)" 来表示. Web技术之家_www.waweb.cn 在文本文件里, 这个表达式可以匹配所有的英文
python计算一个序列的平均值的方法

本文实例讲述了python计算一个序列的平均值的方法.分享给大家供大家参考.具体如下: def average(seq, total=0.0): num = 0 for item in seq: total += item num += 1 return total / num 如果序列是数组或者元祖可以简单使用下面的代码 def average(seq): return float(sum(seq)) / len(seq) 希望本文所述对大家的Python程序设计有所帮助.
python实现计算倒数的方法

本文实例讲述了python实现计算倒数的方法.分享给大家供大家参考.具体如下: class Expr: def __add__(self, other): return Plus(self, other) def __mul__(self, other): return Times(self, other) class Int(Expr): def __init__(self, n): self.n = n def d(self, v): return Int(0) def __str__(se
python分割和拼接字符串

关于string的split 和 join 方法对导入os模块进行os.path.splie()/os.path.join() 貌似是处理机制不一样,但是功能上一样. 1.string.split(str=' ',num=string.count(str)): 以str为分隔,符切片string,如果num有指定值,则仅分隔num个子字符串.S.split([sep [,maxsplit]]) -> 由字符串分割成的列表返回一组使用分隔符(sep)分割字符串形成的列表.如果指定最大分割数,则在

Python使用gensim计算文档相似性

相关推荐

随机推荐