Python机器学习NLP自然语言处理基本操作新闻分类

2025-01-30 05:01:32

概述

从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁.

TF-IDF 关键词提取

TF-IDF (Term Frequency-Inverse Document Frequency), 即词频-逆文件频率是一种用于信息检索与数据挖掘的常用加权技术. TF-IDF 可以帮助我们挖掘文章中的关键词. 通过数值统计, 反映一个词对于语料库中某篇文章的重要性.

TF

TF (Term Frequency), 即词频. 表示词在文本中出现的频率.

公式:

IDF

IDF (Inverse Document Frequency), 即逆文档频率. 表示语料库中包含词的文档的数目的倒数.

公式:

TF-IDF

公式:

TF-IDF = (词的频率 / 句子总字数) × (总文档数 / 包含该词的文档数)

如果一个词非常常见, 那么 IDF 就会很低, 反之就会很高. TF-IDF 可以帮助我们过滤常见词语, 提取关键词.

TfidfVectorizer

TfidfVectorizer 可以帮助我们把原始文本转化为 tf-idf 的特征矩阵, 从而进行相似度计算. sklearn 的TfidfVectorizer 默认输入文本矩阵每行表示一篇文本. 不同文本中相同词项的 tf 值不同, 因此 tf 值与词项所在文本有关.

格式:

tfidfVectorizer(input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False)

参数:

input: 输入

encoding: 编码, 默认为 utf-8

analyzer: “word” 或 “char”, 默认按词 (word) 分析

stopwords: 停用词

ngram_range: ngrame 上下限

lowercase: 转换为小写

max_features: 关键词个数

数据介绍

数据由 12 个不同网站的新闻数据组成. 如下:

  Class  ...                                            Content
0  news  ...  中广网唐山６月１２日消息（记者汤一亮　庄胜春）据中国之声《新闻晚高峰》报道，今天（１２日）上...
1  news  ...  天津卫视求职节目《非你莫属》“晕倒门”事件余波未了，主持人张绍刚前日通过《非你莫属》节目组发...
2  news  ...  临沂（山东），２０１２年６月４日　夫妻“麦客”忙麦收　６月４日，在山东省临沂市郯城县郯城街道...
3  news  ...  中广网北京６月１３日消息（记者王宇）据中国之声《新闻晚高峰》报道，明天凌晨两场欧洲杯的精彩比...
4  news  ...  环球网记者李亮报道，正在意大利度蜜月的“脸谱”创始人扎克伯格与他华裔妻子的一举一动都处于媒体...

流程:

读取数据
计算数据 tf-idf 值
贝叶斯分类

代码实现

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
import jieba
def load_data():
    """读取数据/停用词"""
    # 读取数据
    data = pd.read_csv("test.txt", sep="\t", names=["Class", "Title", "Content"])
    print(data.head())
    # 读取停用词
    stop_words = pd.read_csv("stopwords.txt", names=["stop_words"], encoding="utf-8")
    stop_words = stop_words["stop_words"].values.tolist()
    print(stop_words)
    return data, stop_words
def main():
    """主函数"""
    # 读取数据
    data, stop_words = load_data()
    # 分词
    segs = data["Content"].apply(lambda x: ' '.join(jieba.cut(x)))
    # Tf-Idf
    tf_idf = TfidfVectorizer(stop_words=stop_words, max_features=1000, lowercase=False)
    # 拟合
    tf_idf.fit(segs)
    # 转换
    X = tf_idf.transform(segs)
    # 分割数据
    X_train, X_test, y_train, y_test = train_test_split(X, data["Class"], random_state=0)
    # 调试输出
    print(X_train[:2])
    print(y_train[:2])
    # 实例化朴素贝叶斯
    classifier = MultinomialNB()
    # 拟合
    classifier.fit(X_train, y_train)
    # 计算分数
    acc = classifier.score(X_test, y_test)
    print("准确率:", acc)
    # 报告
    report = classification_report(y_test, classifier.predict(X_test))
    print(report)
if __name__ == '__main__':
    main()

输出结果:

  Class  ...                                            Content
0  news  ...  中广网唐山６月１２日消息（记者汤一亮　庄胜春）据中国之声《新闻晚高峰》报道，今天（１２日）上...
1  news  ...  天津卫视求职节目《非你莫属》“晕倒门”事件余波未了，主持人张绍刚前日通过《非你莫属》节目组发...
2  news  ...  临沂（山东），２０１２年６月４日　夫妻“麦客”忙麦收　６月４日，在山东省临沂市郯城县郯城街道...
3  news  ...  中广网北京６月１３日消息（记者王宇）据中国之声《新闻晚高峰》报道，明天凌晨两场欧洲杯的精彩比...
4  news  ...  环球网记者李亮报道，正在意大利度蜜月的“脸谱”创始人扎克伯格与他华裔妻子的一举一动都处于媒体...
[5 rows x 3 columns]
['?', '、', '。', '“', '”', '《', '》', '！', '，', '：', '；', '？', '啊', '阿', '哎', '哎呀', '哎哟', '唉', '俺', '俺们', '按', '按照', '吧', '吧哒', '把', '罢了', '被', '本', '本着', '比', '比方', '比如', '鄙人', '彼', '彼此', '边', '别', '别的', '别说', '并', '并且', '不比', '不成', '不单', '不但', '不独', '不管', '不光', '不过', '不仅', '不拘', '不论', '不怕', '不然', '不如', '不特', '不惟', '不问', '不只', '朝', '朝着', '趁', '趁着', '乘', '冲', '除', '除此之外', '除非', '除了', '此', '此间', '此外', '从', '从而', '打', '待', '但', '但是', '当', '当着', '到', '得', '的', '的话', '等', '等等', '地', '第', '叮咚', '对', '对于', '多', '多少', '而', '而况', '而且', '而是', '而外', '而言', '而已', '尔后', '反过来', '反过来说', '反之', '非但', '非徒', '否则', '嘎', '嘎登', '该', '赶', '个', '各', '各个', '各位', '各种', '各自', '给', '根据', '跟', '故', '故此', '固然', '关于', '管', '归', '果然', '果真', '过', '哈', '哈哈', '呵', '和', '何', '何处', '何况', '何时', '嘿', '哼', '哼唷', '呼哧', '乎', '哗', '还是', '还有', '换句话说', '换言之', '或', '或是', '或者', '极了', '及', '及其', '及至', '即', '即便', '即或', '即令', '即若', '即使', '几', '几时', '己', '既', '既然', '既是', '继而', '加之', '假如', '假若', '假使', '鉴于', '将', '较', '较之', '叫', '接着', '结果', '借', '紧接着', '进而', '尽', '尽管', '经', '经过', '就', '就是', '就是说', '据', '具体地说', '具体说来', '开始', '开外', '靠', '咳', '可', '可见', '可是', '可以', '况且', '啦', '来', '来着', '离', '例如', '哩', '连', '连同', '两者', '了', '临', '另', '另外', '另一方面', '论', '嘛', '吗', '慢说', '漫说', '冒', '么', '每', '每当', '们', '莫若', '某', '某个', '某些', '拿', '哪', '哪边', '哪儿', '哪个', '哪里', '哪年', '哪怕', '哪天', '哪些', '哪样', '那', '那边', '那儿', '那个', '那会儿', '那里', '那么', '那么些', '那么样', '那时', '那些', '那样', '乃', '乃至', '呢', '能', '你', '你们', '您', '宁', '宁可', '宁肯', '宁愿', '哦', '呕', '啪达', '旁人', '呸', '凭', '凭借', '其', '其次', '其二', '其他', '其它', '其一', '其余', '其中', '起', '起见', '起见', '岂但', '恰恰相反', '前后', '前者', '且', '然而', '然后', '然则', '让', '人家', '任', '任何', '任凭', '如', '如此', '如果', '如何', '如其', '如若', '如上所述', '若', '若非', '若是', '啥', '上下', '尚且', '设若', '设使', '甚而', '甚么', '甚至', '省得', '时候', '什么', '什么样', '使得', '是', '是的', '首先', '谁', '谁知', '顺', '顺着', '似的', '虽', '虽然', '虽说', '虽则', '随', '随着', '所', '所以', '他', '他们', '他人', '它', '它们', '她', '她们', '倘', '倘或', '倘然', '倘若', '倘使', '腾', '替', '通过', '同', '同时', '哇', '万一', '往', '望', '为', '为何', '为了', '为什么', '为着', '喂', '嗡嗡', '我', '我们', '呜', '呜呼', '乌乎', '无论', '无宁', '毋宁', '嘻', '吓', '相对而言', '像', '向', '向着', '嘘', '呀', '焉', '沿', '沿着', '要', '要不', '要不然', '要不是', '要么', '要是', '也', '也罢', '也好', '一', '一般', '一旦', '一方面', '一来', '一切', '一样', '一则', '依', '依照', '矣', '以', '以便', '以及', '以免', '以至', '以至于', '以致', '抑或', '因', '因此', '因而', '因为', '哟', '用', '由', '由此可见', '由于', '有', '有的', '有关', '有些', '又', '于', '于是', '于是乎', '与', '与此同时', '与否', '与其', '越是', '云云', '哉', '再说', '再者', '在', '在下', '咱', '咱们', '则', '怎', '怎么', '怎么办', '怎么样', '怎样', '咋', '照', '照着', '者', '这', '这边', '这儿', '这个', '这会儿', '这就是说', '这里', '这么', '这么点儿', '这么些', '这么样', '这时', '这些', '这样', '正如', '吱', '之', '之类', '之所以', '之一', '只是', '只限', '只要', '只有', '至', '至于', '诸位', '着', '着呢', '自', '自从', '自个儿', '自各儿', '自己', '自家', '自身', '综上所述', '总的来看', '总的来说', '总的说来', '总而言之', '总之', '纵', '纵令', '纵然', '纵使', '遵照', '作为', '兮', '呃', '呗', '咚', '咦', '喏', '啐', '喔唷', '嗬', '嗯', '嗳', 'a', 'able', 'about', 'above', 'abroad', 'according', 'accordingly', 'across', 'actually', 'adj', 'after', 'afterwards', 'again', 'against', 'ago', 'ahead', "ain't", 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'also', 'although', 'always', 'am', 'amid', 'amidst', 'among', 'amongst', 'an', 'and', 'another', 'any', 'anybody', 'anyhow', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'appear', 'appreciate', 'appropriate', 'are', "aren't", 'around', 'as', "a's", 'aside', 'ask', 'asking', 'associated', 'at', 'available', 'away', 'awfully', 'b', 'back', 'backward', 'backwards', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'both', 'brief', 'but', 'by', 'c', 'came', 'can', 'cannot', 'cant', "can't", 'caption', 'cause', 'causes', 'certain', 'certainly', 'changes', 'clearly', "c'mon", 'co', 'co.', 'com', 'come', 'comes', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', "couldn't", 'course', "c's", 'currently', 'd', 'dare', "daren't", 'definitely', 'described', 'despite', 'did', "didn't", 'different', 'directly', 'do', 'does', "doesn't", 'doing', 'done', "don't", 'down', 'downwards', 'during', 'e', 'each', 'edu', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'entirely', 'especially', 'et', 'etc', 'even', 'ever', 'evermore', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'f', 'fairly', 'far', 'farther', 'few', 'fewer', 'fifth', 'first', 'five', 'followed', 'following', 'follows', 'for', 'forever', 'former', 'formerly', 'forth', 'forward', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'get', 'gets', 'getting', 'given', 'gives', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'greetings', 'h', 'had', "hadn't", 'half', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', "here's", 'hereupon', 'hers', 'herself', "he's", 'hi', 'him', 'himself', 'his', 'hither', 'hopefully', 'how', 'howbeit', 'however', 'hundred', 'i', "i'd", 'ie', 'if', 'ignored', "i'll", "i'm", 'immediate', 'in', 'inasmuch', 'inc', 'inc.', 'indeed', 'indicate', 'indicated', 'indicates', 'inner', 'inside', 'insofar', 'instead', 'into', 'inward', 'is', "isn't", 'it', "it'd", "it'll", 'its', "it's", 'itself', "i've", 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'know', 'known', 'knows', 'l', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', "let's", 'like', 'liked', 'likely', 'likewise', 'little', 'look', 'looking', 'looks', 'low', 'lower', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', "mayn't", 'me', 'mean', 'meantime', 'meanwhile', 'merely', 'might', "mightn't", 'mine', 'minus', 'miss', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'must', "mustn't", 'my', 'myself', 'n', 'name', 'namely', 'nd', 'near', 'nearly', 'necessary', 'need', "needn't", 'needs', 'neither', 'never', 'neverf', 'neverless', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'no-one', 'nor', 'normally', 'not', 'nothing', 'notwithstanding', 'novel', 'now', 'nowhere', 'o', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'on', 'once', 'one', 'ones', "one's", 'only', 'onto', 'opposite', 'or', 'other', 'others', 'otherwise', 'ought', "oughtn't", 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'own', 'p', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'possible', 'presumably', 'probably', 'provided', 'provides', 'q', 'que', 'quite', 'qv', 'r', 'rather', 'rd', 're', 'really', 'reasonably', 'recent', 'recently', 'regarding', 'regardless', 'regards', 'relatively', 'respectively', 'right', 'round', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'second', 'secondly', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'shall', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'since', 'six', 'so', 'some', 'somebody', 'someday', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specified', 'specify', 'specifying', 'still', 'sub', 'such', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", 'thats', "that's", "that've", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', "there'd", 'therefore', 'therein', "there'll", "there're", 'theres', "there's", 'thereupon', "there've", 'these', 'they', "they'd", "they'll", "they're", "they've", 'thing', 'things', 'think', 'third', 'thirty', 'this', 'thorough', 'thoroughly', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'till', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', "t's", 'twice', 'two', 'u', 'un', 'under', 'underneath', 'undoing', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'upwards', 'us', 'use', 'used', 'useful', 'uses', 'using', 'usually', 'v', 'value', 'various', 'versus', 'very', 'via', 'viz', 'vs', 'w', 'want', 'wants', 'was', "wasn't", 'way', 'we', "we'd", 'welcome', 'well', "we'll", 'went', 'were', "we're", "weren't", "we've", 'what', 'whatever', "what'll", "what's", "what've", 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', "where's", 'whereupon', 'wherever', 'whether', 'which', 'whichever', 'while', 'whilst', 'whither', 'who', "who'd", 'whoever', 'whole', "who'll", 'whom', 'whomever', "who's", 'whose', 'why', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wonder', "won't", 'would', "wouldn't", 'x', 'y', 'yes', 'yet', 'you', "you'd", "you'll", 'your', "you're", 'yours', 'yourself', 'yourselves', "you've", 'z', 'zero']
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Windows\AppData\Local\Temp\jieba.cache
Loading model cost 0.797 seconds.
Prefix dict has been built successfully.
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py:300: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ain', 'aren', 'couldn', 'daren', 'didn', 'doesn', 'don', 'hadn', 'hasn', 'haven', 'isn', 'll', 'mayn', 'mightn', 'mon', 'mustn', 'needn', 'oughtn', 'shan', 'shouldn', 've', 'wasn', 'weren', 'won', 'wouldn'] not in stop_words.
  'stop_words.' % sorted(inconsistent))
C:\Users\Windows\Anaconda3\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
  (0, 1)	0.172494787172401
  (0, 13)	0.03617578927683419
  (0, 19)	0.044685283861169885
  (0, 24)	0.04378669110667244
  (0, 32)	0.04770060616202845
  (0, 37)	0.08714906173699981
  (0, 42)	0.03262617791847282
  (0, 79)	0.03598272479613044
  (0, 80)	0.03384787551537572
  (0, 83)	0.105263111952599
  (0, 88)	0.1963277784717525
  (0, 89)	0.04008970433219022
  (0, 90)	0.1412328663779052
  (0, 98)	0.03994454517190565
  (0, 134)	0.04594067399577792
  (0, 142)	0.04422553495980068
  (0, 147)	0.05830540319790606
  (0, 149)	0.02851184938909845
  (0, 163)	0.03588762154160368
  (0, 166)	0.11695593718680143
  (0, 168)	0.06847746933720501
  (0, 170)	0.08047415949662211
  (0, 176)	0.03776179057073174
  (0, 185)	0.03924979201525634
  (0, 200)	0.04184963649844074
  :	:
  (0, 855)	0.04946215984804544
  (0, 865)	0.04649241857748127
  (0, 869)	0.138252930106818
  (0, 870)	0.21384828173878706
  (0, 873)	0.1667028225043511
  (0, 876)	0.09298483715496254
  (0, 885)	0.19784215331311594
  (0, 888)	0.04008970433219022
  (0, 896)	0.04985458905113395
  (0, 897)	0.49416003160549193
  (0, 902)	0.03938479984191792
  (0, 909)	0.03885574129240138
  (0, 910)	0.03911664642497605
  (0, 917)	0.05050266101265748
  (0, 922)	0.03966061581314387
  (0, 933)	0.02711876416903459
  (0, 947)	0.0329697338326223
  (0, 955)	0.15044205095521537
  (0, 972)	0.03255891003473477
  (0, 998)	0.1578946679288985
  (1, 196)	0.37511948551579166
  (1, 224)	0.21966463546937165
  (1, 420)	0.24197656045436386
  (1, 460)	0.49133294291978213
  (1, 787)	0.7148930709574247
1394    news.cn
1365    news.cn
Name: Class, dtype: object
准确率: 0.7176165803108808
               precision    recall  f1-score   support
1688.autos.cn       0.00      0.00      0.00         3
     autos.cn       1.00      0.22      0.36         9
       biz.cn       0.63      0.73      0.68        45
          cpc       0.00      0.00      0.00         8
      dangshi       0.48      0.77      0.59        13
       ent.cn       0.86      0.57      0.68        44
        henan       0.98      0.77      0.86        69
         news       0.00      0.00      0.00        16
      news.cn       0.64      0.93      0.76       131
      society       0.00      0.00      0.00         5
    sports.cn       0.86      0.86      0.86        37
       theory       0.00      0.00      0.00         6
     accuracy                           0.72       386
    macro avg       0.45      0.40      0.40       386
 weighted avg       0.69      0.72      0.68       386

以上就是Python机器学习NLP自然语言处理基本操作新闻分类的详细内容，更多关于Python机器学习NLP自然语言处理的资料请关注我们其它相关文章！

浅谈Python NLP入门教程

正文本文简要介绍Python自然语言处理(NLP),使用Python的NLTK库.NLTK是Python的自然语言处理工具包,在NLP领域中,最常使用的一个Python库. 什么是NLP? 简单来说,自然语言处理(NLP)就是开发能够理解人类语言的应用程序或服务. 这里讨论一些自然语言处理(NLP)的实际应用例子,如语音识别.语音翻译.理解完整的句子.理解匹配词的同义词,以及生成语法正确完整句子和段落. 这并不是NLP能做的所有事情. NLP实现搜索引擎: 比如谷歌,Yahoo等.谷歌搜索引
Python编程使用NLTK进行自然语言处理详解

自然语言处理是计算机科学领域与人工智能领域中的一个重要方向.自然语言工具箱(NLTK,NaturalLanguageToolkit)是一个基于Python语言的类库,它也是当前最为流行的自然语言编程与开发工具.在进行自然语言处理研究和应用时,恰当利用NLTK中提供的函数可以大幅度地提高效率.本文就将通过一些实例来向读者介绍NLTK的使用. NLTK NaturalLanguageToolkit,自然语言处理工具包,在NLP领域中,最常使用的一个Python库. NLTK是一个开源的项目,包含:P
Python机器学习NLP自然语言处理基本操作之精确分词

目录概述分词器 jieba 安装精确分词全模式搜索引擎模式获取词性概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. 分词器 jieba jieba 算法基于前缀词典实现高效的词图扫描, 生成句子中汉字所有可能成词的情况所构成的有向无环图. 通过动态规划查找最大概率路径, 找出基于词频的最大切分组合. 对于未登录词采用了基于汉字成词能力的 HMM 模型, 使用 Viter
Pytorch在NLP中的简单应用详解

因为之前在项目中一直使用Tensorflow,最近需要处理NLP问题,对Pytorch框架还比较陌生,所以特地再学习一下pytorch在自然语言处理问题中的简单使用,这里做一个记录. 一.Pytorch基础首先,第一步是导入pytorch的一系列包 import torch import torch.autograd as autograd #Autograd为Tensor所有操作提供自动求导方法 import torch.nn as nn import torch.nn.functional
用Python进行一些简单的自然语言处理的教程

本月的每月挑战会主题是NLP,我们会在本文帮你开启一种可能:使用pandas和python的自然语言工具包分析你Gmail邮箱中的内容. NLP-风格的项目充满无限可能: 情感分析是对诸如在线评论.社交媒体等情感内容的测度.举例来说,关于某个话题的tweets趋向于正面还是负面的意见?一个新闻网站涵盖的主题,是使用了更正面/负面的词语,还是经常与某些情绪相关的词语?这个"正面"的Yelp点评不是很讽刺么?(祝最后去的那位好运!) 分析语言在文学中的使用,进而衡量词汇或者写作风格随时间/
Python机器学习NLP自然语言处理基本操作新闻分类

目录概述 TF-IDF 关键词提取 TF IDF TF-IDF TfidfVectorizer 数据介绍代码实现概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. TF-IDF 关键词提取 TF-IDF (Term Frequency-Inverse Document Frequency), 即词频-逆文件频率是一种用于信息检索与数据挖掘的常用加权技术. TF-IDF 可以帮助我
Python机器学习NLP自然语言处理基本操作之京东评论分类

目录概述 RNN 权重共享计算过程 LSTM 阶段数据介绍代码预处理主函数概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. RNN RNN (Recurrent Neural Network), 即循环神经网络. RNN 相较于 CNN, 可以帮助我们更好的处理序列信息, 挖掘前后信息之间的联系. 对于 NLP 这类的任务, 语料的前后概率有极大的联系. 比如: "明天
Python机器学习NLP自然语言处理基本操作家暴归类

目录概述数据介绍词频统计朴素贝叶斯代码实现预处理主函数概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. 数据介绍该数据是家庭暴力的一份司法数据.分为 4 个不同类别: 报警人被老公打,报警人被老婆打,报警人被儿子打,报警人被女儿打. 今天我们就要运用我们前几次学到的知识, 来实现一个 NLP 分类问题. 词频统计 CountVectorizer是一个文本特征提取的方
Python机器学习NLP自然语言处理基本操作词向量模型

目录概述词向量词向量维度 Word2Vec CBOW 模型 Skip-Gram 模型负采样模型词向量的训练过程 1. 初始化词向量矩阵 2. 神经网络反向传播词向量模型实战训练模型使用模型概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. 词向量我们先来说说词向量究竟是什么. 当我们把文本交给算法来处理的时候, 计算机并不能理解我们输入的文本, 词向量就由此而生了.
Python机器学习NLP自然语言处理基本操作关键词

目录概述关键词 TF-IDF 关键词提取 TF IDF TF-IDF jieba TF-IDF 关键词抽取 jieba 词性不带关键词权重附带关键词权重 TextRank 概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. 关键词关键词 (keywords), 即关键词语. 关键词能描述文章的本质, 在文献检索, 自动文摘, 文本聚类 / 分类等方面有着重要的应用. 关键词抽
Python机器学习NLP自然语言处理基本操作电影影评分析

目录概述 RNN 权重共享计算过程 LSTM 阶段代码预处理主函数概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. RNN RNN (Recurrent Neural Network), 即循环神经网络. RNN 相较于 CNN, 可以帮助我们更好的处理序列信息, 挖掘前后信息之间的联系. 对于 NLP 这类的任务, 语料的前后概率有极大的联系. 比如: "明天天气真好&
Python机器学习NLP自然语言处理基本操作词袋模型

概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. 词袋模型词袋模型 (Bag of Words Model) 能帮助我们把一个句子转换为向量表示. 词袋模型把文本看作是无序的词汇集合, 把每一单词都进行统计. 向量化词袋模型首先会进行分词, 在分词之后. 通过通过统计在每个词在文本中出现的次数. 我们就可以得到该文本基于词语的特征, 如果将各个文本样本的这些词与对应的词频放在一起
Python机器学习NLP自然语言处理基本操作之Seq2seq的用法

概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. Seq2seq Seq2seq 由 Encoder 和 Decoder 两个 RNN 组成. Encoder 将变长序列输出, 编码成 encoderstate 再由 Decoder 输出变长序列. Seq2seq 的使用领域: 机器翻译: Encoder-Decoder 的最经典应用文本摘要: 输入是一段文本序列, 输出是这段文本
Python机器学习NLP自然语言处理基本操作之命名实例提取

目录概述命名实例 HMM 随机场马尔科夫随机场 CRF 命名实例实战数据集 crf 预处理主程序概述从今天开始我们将开启一段自然语言处理 (NLP) 的旅程. 自然语言处理可以让来处理, 理解, 以及运用人类的语言, 实现机器语言和人类语言之间的沟通桥梁. 命名实例命名实例 (Named Entity) 指的是 NLP 任务中具有特定意义的实体, 包括人名, 地名, 机构名, 专有名词等. 举个例子: Luke Rawlence 代表人物 Aiimi 和 University o