python中文分词+词频统计的实现步骤

2025-03-01 08:39:16

前言

本文记录了一下Python在文本处理时的一些过程+代码

一、文本导入

我准备了一个名为abstract.txt的文本文件

接着是在网上下载了stopword.txt(用于结巴分词时的停用词)

有一些是自己觉得没有用加上去的

另外建立了自己的词典extraDict.txt

准备工作做好了，就来看看怎么使用吧！

二、使用步骤

1.引入库

代码如下：

import jieba
from jieba.analyse import extract_tags
from sklearn.feature_extraction.text import TfidfVectorizer

2.读入数据

代码如下：

jieba.load_userdict('extraDict.txt')  # 导入自己建立词典

3.取出停用词表

def stopwordlist():
    stopwords = [line.strip() for line in open('chinesestopwords.txt', encoding='UTF-8').readlines()]
    # ---停用词补充,视具体情况而定---
    i = 0
    for i in range(19):
        stopwords.append(str(10 + i))
    # ----------------------

    return stopwords

4.分词并去停用词（此时可以直接利用python原有的函数进行词频统计）

def seg_word(line):
    # seg=jieba.cut_for_search(line.strip())
    seg = jieba.cut(line.strip())
    temp = ""
    counts = {}
    wordstop = stopwordlist()
    for word in seg:
        if word not in wordstop:
            if word != ' ':
                temp += word
                temp += '\n'
                counts[word] = counts.get(word, 0) + 1#统计每个词出现的次数
    return  temp #显示分词结果
    #return str(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:20])  # 统计出现前二十最多的词及次数

5. 输出分词并去停用词的有用的词到txt

def output(inputfilename, outputfilename):
    inputfile = open(inputfilename, encoding='UTF-8', mode='r')
    outputfile = open(outputfilename, encoding='UTF-8', mode='w')
    for line in inputfile.readlines():
        line_seg = seg_word(line)
        outputfile.write(line_seg)
    inputfile.close()
    outputfile.close()
    return outputfile

6.函数调用

if __name__ == '__main__':
    print("__name__", __name__)
    inputfilename = 'abstract.txt'
    outputfilename = 'a1.txt'
    output(inputfilename, outputfilename)

7.结果

附：输入一段话，统计每个字母出现的次数

先来讲一下思路：

例如给出下面这样一句话

Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven’t got a clue.

那么想要统计里面每一个单词出现的次数，思路很简单，遍历一遍这个字符串，再定义一个空字典count_dict，看每一个单词在这个用于统计的空字典count_dict中的key中存在否，不存在则将这个单词当做count_dict的键加入字典内，然后值就为1，若这个单词在count_dict里面已经存在，那就将它对应的键的值+1就行

下面来看代码：

#定义字符串
sentences = """           # 字符串很长时用三个引号
Love is more than a word
it says so much.
When I see these four letters,
I almost feel your touch.
This is only happened since
I fell in love with you.
Why this word does this,
I haven't got a clue.
"""
#具体实现
#  将句子里面的逗号去掉,去掉多种符号时请用循环，这里我就这样吧
sentences=sentences.replace(',','')
sentences=sentences.replace('.','')   #  将句子里面的.去掉
sentences = sentences.split()         # 将句子分开为单个的单词，分开后产生的是一个列表sentences
# print(sentences)
count_dict = {}
for sentence in sentences:
    if sentence not in count_dict:    # 判断是否不在统计的字典中
        count_dict[sentence] = 1
    else:                              # 判断是否不在统计的字典中
        count_dict[sentence] += 1
for key,value in count_dict.items():
    print(f"{key}出现了{value}次")

输出结果是这样：

总结

以上就是今天要讲的内容，本文仅仅简单介绍了python的中文分词及词频统计！

到此这篇关于python中文分词+词频统计的实现步骤的文章就介绍到这了,更多相关python中文分词词频统计内容请搜索我们以前的文章或继续浏览下面的相关文章希望大家以后多多支持我们！

python实现简单中文词频统计示例

本文介绍了python实现简单中文词频统计示例,分享给大家,具体如下: 任务简单统计一个小说中哪些个汉字出现的频率最高知识点 1.文件操作 2.字典 3.排序 4.lambda 代码 import codecs import matplotlib.pyplot as plt from pylab import mpl mpl.rcParams['font.sans-serif'] = ['FangSong'] # 指定默认字体 mpl.rcParams['axes.unicode_minus
详解Python用三种方式统计词频的方法

三种方法: ①直接使用dict ②使用defaultdict ③使用Counter ps:`int()`函数默认返回0 ①dict text = "I'm a hand some boy!" frequency = {} for word in text.split(): if word not in frequency: frequency[word] = 1 else: frequency[word] += 1 ②defaultdict import collections f
Python统计词频并绘制图片(附完整代码)

效果 1 实现代码读取txt文件: def readText(text_file_path): with open(text_file_path, encoding='gbk') as f: # content = f.read() return content 得到文章的词频: def getRecommondArticleKeyword(text_content, key_word_need_num = 10, custom_words = [], stop_words =[], quer
python写程序统计词频的方法

在李笑来所著<时间当作朋友>中有这么一段: 可问题在于,当年我在少年宫学习计算机程序语言的时候,怎么可能想象得到,在20多年后的某一天,我需要先用软件调取语料库中的数据,然后用统计方法为每个单词标注词频,再写一个批处理程序从相应的字典里复制出多达20MB的内容,重新整理-- 在新书<自学是门手艺>中,他再次提及: 又过了好几年,我去新东方教书.2003 年,在写词汇书的过程中,需要统计词频,C++ 倒是用不上,用之前学过它的经验,学了一点 Python,写程序统计词频 --<
Python jieba 中文分词与词频统计的操作

我就废话不多说了,大家还是直接看代码吧~ #! python3 # -*- coding: utf-8 -*- import os, codecs import jieba from collections import Counter def get_words(txt): seg_list = jieba.cut(txt) c = Counter() for x in seg_list: if len(x)>1 and x != '\r\n': c[x] += 1 print('常用词频度统
如何利用python实现词频统计功能

目录功能要求方法如下运行结果总结功能要求这是我们老师的作业代码中都有注释要求词频统计软件: 1)从文本中读入数据:(文件的输入输出) 2)不区分大小写,去除特殊字符. 3) 统计单词例如:about :10 并统计总共多少单词 4)对单词排序.出现次数 5)输出词频最高的10个单词和次数 6)把统计结果存入文本方法如下 1.文件的读取,区分大小写,去除特殊字符 import re def getword(): # 读取文件 f=open('read.txt','r',enc
python 文本单词提取和词频统计的实例

这些对文本的操作经常用到, 那我就总结一下. 陆续补充... 操作: strip_html(cls, text) 去除html标签 separate_words(cls, text, min_lenth=3) 文本提取 get_words_frequency(cls, words_list) 获取词频源码: class DocProcess(object): @classmethod def strip_html(cls, text): """ Delete html ta
python jieba分词并统计词频后输出结果到Excel和txt文档方法

前两天,班上同学写论文,需要将很多篇论文题目按照中文的习惯分词并统计每个词出现的频率. 让我帮她实现这个功能,我在网上查了之后发现jieba这个库还挺不错的. 运行环境: 安装python2.7.13:https://www.python.org/downloads/release/python-2713/ 安装jieba:pip install jieba 安装xlwt:pip install xlwt 具体代码如下: #!/usr/bin/python # -*- coding:utf-8
python利用多种方式来统计词频（单词个数）

python的思维就是让我们用尽可能少的代码来解决问题.对于词频的统计,就代码层面而言,实现的方式也是有很多种的.之所以单独谈到统计词频这个问题,是因为它在统计和数据挖掘方面经常会用到,尤其是处理分类问题上.故在此做个简单的记录. 统计的材料如下: document = [ 'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes', 'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not',
Python实现统计英文文章词频的方法分析

本文实例讲述了Python实现统计英文文章词频的方法.分享给大家供大家参考,具体如下: 应用介绍: 统计英文文章词频是很常见的需求,本文利用python实现. 思路分析: 1.把英文文章的每个单词放到列表里,并统计列表长度: 2.遍历列表,对每个单词出现的次数进行统计,并将结果存储在字典中: 3.利用步骤1中获得的列表长度,求出每个单词出现的频率,并将结果存储在频率字典中: 4.以字典键值对的"值"为标准,对字典进行排序,输出结果(也可利用切片输出频率最大或最小的特定几个,因为经过排序