chinese_text_analysis

Chinese Text Classification Methods

Text classification = text representation + classification model

Text representation: BOW / N-gram / TF-IDF / word2vec / word embedding / ELMo

Bag-of-Words Model (Chinese)

① Word segmentation:
Sentence 1: [w1 w3 w5 w2 w1…]
Sentence 2: [w11 w32 w51 w21 w15…]
Sentence 3: …

  • Load the jieba library and use jieba.lcut to segment the text
import jieba
import pandas as pd

df = pd.read_csv("./origin_data/entertainment_news.csv", encoding='utf-8')
df = df.dropna()
content = df["content"].values.tolist()

segment = []
for line in content:
    try:
        segs = jieba.lcut(line)
        for seg in segs:
            if len(seg) > 1 and seg != '\r\n':
                segment.append(seg)
    except Exception:
        print(line)
        continue
  • Remove stopwords
words_df = pd.DataFrame({'segment': segment})
# words_df.head()
stopwords = pd.read_csv("origin_data/stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')  # quoting=3 (QUOTE_NONE): never treat quote characters specially
# stopwords.head()
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]

② Count word frequencies:
w3 count3
w7 count7
wi count_i

# group by word and count occurrences (the dict-based agg rename is removed in newer pandas)
words_stat = words_df.groupby('segment').agg(计数=('segment', 'size'))
words_stat = words_stat.reset_index().sort_values(by='计数', ascending=False)
words_stat.head()

③ Build the dictionary:
Pick the N most frequent words
Allocate a [1*N] vector space
(each position corresponds to one word)

from gensim import corpora

dictionary = corpora.Dictionary(sentences)  # build the dictionary from the tokenized sentences
corpus = [dictionary.doc2bow(sentence) for sentence in sentences]  # build the corpus (BOW vectors)

④ Mapping: map each sentence onto the constructed dictionary
Sentence 1: [1 0 1 0 1 0…]
Sentence 2: [0 0 0 0 0 0…1, 0…1, 0…]
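To make the mapping concrete, here is a minimal sketch with gensim's doc2bow; the toy_sentences token lists are made up for the demo and the id assignments shown are only examples:

from gensim import corpora

toy_sentences = [["李雷", "喜欢", "韩梅梅"], ["韩梅梅", "喜欢", "李雷"]]
dictionary = corpora.Dictionary(toy_sentences)        # e.g. token2id == {'喜欢': 0, '李雷': 1, '韩梅梅': 2}
print(dictionary.doc2bow(["李雷", "喜欢", "李雷"]))    # sparse (word_id, count) pairs, e.g. [(0, 1), (1, 2)]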

⑤ Improve the expressiveness of the representation:

  • Replace the presence/absence indicator with word counts

  • Record not only individual words but also contiguous n-grams (see the short sketch after the TF-IDF code below)

    • "李雷喜欢韩梅梅" => ("李雷", "喜欢", "韩梅梅")
    • "韩梅梅喜欢李雷" => ("李雷", "喜欢", "韩梅梅")  (with unigrams alone the two sentences look identical)
    • "李雷喜欢韩梅梅" => ("李雷", "喜欢", "韩梅梅", "李雷喜欢", "喜欢韩梅梅")
    • "韩梅梅喜欢李雷" => ("李雷", "喜欢", "韩梅梅", "韩梅梅喜欢", "喜欢李雷")  (adding bigrams preserves word order)
  • Beyond raw counts, we also want to know how important a word is to a sentence

    • TF-IDF = TF (term frequency) × IDF (inverse document frequency)

      import jieba.analyse

      • jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=())

        • sentence: the text to extract keywords from
        • topK: how many of the top TF-IDF weighted keywords to return, default 20
        • withWeight: whether to also return the keyword weights, default False
        • allowPOS: only include words with the specified POS tags, default empty (no filtering)
import jieba.analyse as analyse
import pandas as pd

df = pd.read_csv("./origin_data/technology_news.csv", encoding='utf-8')
df = df.dropna()
lines = df.content.values.tolist()
content = "".join(lines)
# extract the 30 keywords with the highest TF-IDF weight
print(" ".join(analyse.extract_tags(content, topK=30, withWeight=False, allowPOS=())))
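The n-gram idea from the bullet list above can be illustrated with a few lines of plain Python; the helper name word_ngrams is made up for this demo:

def word_ngrams(tokens, n_max=2):
    """Return all contiguous n-grams of length 1..n_max from a token list."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("".join(tokens[i:i + n]))
    return grams

print(word_ngrams(["李雷", "喜欢", "韩梅梅"]))
# ['李雷', '喜欢', '韩梅梅', '李雷喜欢', '喜欢韩梅梅']
print(word_ngrams(["韩梅梅", "喜欢", "李雷"]))
# ['韩梅梅', '喜欢', '李雷', '韩梅梅喜欢', '喜欢李雷']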

⑥ All the representations above treat words independently (no notion of how words are distributed in a semantic space)
喜欢 = 在乎 = "稀罕" = "中意" (they all roughly mean "to like")

  • WordNet (builds a network of words from their relations: synonyms, antonyms, hypernyms, hyponyms…)
    • How do we keep it updated?
    • What about individual differences in usage?
  • We would rather learn a representation from the word distribution in massive data
    • NNLM => word vectors
    • word2vec (words with similar surrounding words, i.e. the same contexts, can substitute for each other); see the sketch after this list
      • It captures related words, not necessarily synonyms
        • 我 讨厌 你 (I hate you)
        • 我 喜欢 你 (I like you)
    • word2vec optimizations…
    • Fine-tune the word2vec result with supervised learning (word embedding)
  • Text preprocessing
    • Normalize tense and voice
    • Synonym substitution
    • Stemming
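A minimal word2vec sketch with gensim (assuming gensim 4.x; tokenized_sentences is an illustrative variable standing for a list of token lists, e.g. the segmented news content above):

from gensim.models import Word2Vec

# toy data for the demo; use the real segmented corpus in practice
tokenized_sentences = [["我", "喜欢", "你"], ["我", "讨厌", "你"]]

# train skip-gram word vectors on the tokenized corpus
model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, sg=1)

# words sharing similar contexts end up close together: related, not necessarily synonymous
print(model.wv.most_similar("喜欢", topn=5))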

Classification models: NB / LR / SVM / LSTM (GRU) / CNN

Language identification: Latin-script languages are all composed of letters, sometimes even the same alphabet => what differs is how the letters are used (order, frequency). A short character n-gram sketch follows.
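A minimal sketch of that idea using character n-gram counts with scikit-learn; the toy sentences and labels below are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# tiny toy samples; a real language-ID model would use far more text
texts = ["the cat sits on the mat", "el gato se sienta en la alfombra",
         "the dog runs fast", "el perro corre rapido"]
labels = ["en", "es", "en", "es"]

# character n-grams capture letter order and frequency, which differ across languages
vec = CountVectorizer(analyzer='char_wb', ngram_range=(1, 3))
clf = MultinomialNB()
clf.fit(vec.fit_transform(texts), labels)

print(clf.predict(vec.transform(["la gata duerme"])))  # expected: ['es']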

We then build models on the vectorized input
① NB / LR / SVM… models

  • Can accept very high-dimensional sparse representations

② MLP / CNN / LSTM

  • Not well suited to sparse, high-dimensional input => use word2vec features instead

Next, we build a Naive Bayes Chinese text classifier as a project.

Dataset Description

We work with five categories of text data: technology, car, entertainment, military, and sports.

import jieba
import pandas as pd
df_technology = pd.read_csv("./origin_data/technology_news.csv", encoding='utf-8')
df_technology = df_technology.dropna()

df_car = pd.read_csv("./origin_data/car_news.csv", encoding='utf-8')
df_car = df_car.dropna()

df_entertainment = pd.read_csv("./origin_data/entertainment_news.csv", encoding='utf-8')
df_entertainment = df_entertainment.dropna()

df_military = pd.read_csv("./origin_data/military_news.csv", encoding='utf-8')
df_military = df_military.dropna()

df_sports = pd.read_csv("./origin_data/sports_news.csv", encoding='utf-8')
df_sports = df_sports.dropna()

technology = df_technology.content.values.tolist()[1000:21000]
car = df_car.content.values.tolist()[1000:21000]
entertainment = df_entertainment.content.values.tolist()[:20000]
military = df_military.content.values.tolist()[:20000]
sports = df_sports.content.values.tolist()[:20000]

Data Analysis and Preprocessing

  • Load the stopwords
stopwords=pd.read_csv("origin_data/stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
stopwords=stopwords['stopword'].values
  • Remove the stopwords

    and write the processed data to a new folder so the work does not have to be repeated every time

def preprocess_text(content_lines, sentences, category, target_path):
    out_f = open(target_path + "/" + category + ".txt", 'w')
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = list(filter(lambda x: len(x) > 1, segs))          # drop single-character / unparsed tokens
            segs = list(filter(lambda x: x not in stopwords, segs))  # drop stopwords
            sentences.append((" ".join(segs), category))
            out_f.write(" ".join(segs) + "\n")
        except Exception as e:
            print(line)
            continue
    out_f.close()

# generate the training data
sentences = []
preprocess_text(technology, sentences, 'technology', 'processed_data')
preprocess_text(car, sentences, 'car', 'processed_data')
preprocess_text(entertainment, sentences, 'entertainment', 'processed_data')
preprocess_text(military, sentences, 'military', 'processed_data')
preprocess_text(sports, sentences, 'sports', 'processed_data')
  • Generate training and validation sets

Shuffle first so the training set is more reliable

import random
random.shuffle(sentences)

Split the original dataset into training and validation sets

from sklearn.model_selection import train_test_split
x, y = zip(*sentences)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1234)

The next step is to extract useful features from the cleaned data: we extract bag-of-words features from the text.

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    analyzer='word',     # tokenise by word (the input is already space-separated)
    max_features=4000,   # keep the 4000 most common terms
)
vec.fit(x_train)

def get_features(x):
    return vec.transform(x)

Import the classifier and train it

from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Check the accuracy

classifier.score(vec.transform(x_test), y_test)

0.8318188045116215
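Beyond the single accuracy number, per-class precision and recall are worth a look; a small optional sketch using scikit-learn's classification_report on the same validation split:

from sklearn.metrics import classification_report

# per-class precision / recall / F1 on the validation set
y_pred = classifier.predict(vec.transform(x_test))
print(classification_report(y_test, y_pred))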

Feature Engineering

We can see that on 20,000+ samples we reach about 83% accuracy over 5 classes.

Can we push the accuracy a bit higher?

We can make the features better: for example, add 2-gram and 3-gram statistical features and enlarge the vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    analyzer='word',      # tokenise by word
    ngram_range=(1, 4),   # use ngrams of size 1, 2, 3, 4
    max_features=20000,   # keep the 20000 most common ngrams
)
vec.fit(x_train)

def get_features(x):
    return vec.transform(x)

The accuracy improves to 0.8732818850175808 (retraining shown below)
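For completeness, a sketch of the retraining step that the number above refers to, assuming the richer vectorizer defined just above and the same train/test split:

from sklearn.naive_bayes import MultinomialNB

# refit the same Naive Bayes model on the 1-4 gram features
classifier = MultinomialNB()
classifier.fit(vec.transform(x_train), y_train)
classifier.score(vec.transform(x_test), y_test)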

Modeling and Optimization Comparison

  • Cross-validation

A more reliable way to evaluate is cross-validation, and it is best when each fold keeps the class proportions roughly balanced, so we use StratifiedKFold here.

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score
import numpy as np

def stratifiedkfold_cv(x, y, clf_class, shuffle=True, n_folds=5, **kwargs):
    stratifiedk_fold = StratifiedKFold(n_splits=n_folds, shuffle=shuffle)
    y_pred = y.copy()  # holds the out-of-fold predictions (copy so the original labels stay untouched)
    for train_index, test_index in stratifiedk_fold.split(x, y):
        X_train, X_test = x[train_index], x[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

NB = MultinomialNB
print(precision_score(y, stratifiedkfold_cv(vec.transform(x), np.array(y), NB), average='macro'))

0.8812996456456414

  • Try a different model / features
from sklearn.svm import SVC
svm = SVC(kernel='linear')
svm.fit(vec.transform(x_train), y_train)
svm.score(vec.transform(x_test), y_test)
  • RBF kernel
from sklearn.svm import SVC
svm = SVC()
svm.fit(vec.transform(x_train), y_train)
svm.score(vec.transform(x_test), y_test)

Final Project Result

Define a custom class for later reuse

import re

from joblib import dump, load  # assumed here for save/load; the original snippet did not show this import
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


class TextClassifier():

    def __init__(self, classifier=MultinomialNB()):
        self.classifier = classifier
        self.vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 4), max_features=20000)

    def features(self, X):
        return self.vectorizer.transform(X)

    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    def predict(self, x):
        return self.classifier.predict(self.features([x]))

    def score(self, X, y):
        return self.classifier.score(self.features(X), y)

    def save_model(self, path):
        dump((self.classifier, self.vectorizer), path)

    def load_model(self, path):
        self.classifier, self.vectorizer = load(path)
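A quick usage sketch of the class above; the example sentence is made up, x_train/y_train/x_test/y_test come from the earlier split, and the model path is illustrative:

text_classifier = TextClassifier()
text_classifier.fit(x_train, y_train)

# predict takes a single space-separated (pre-segmented) string
print(text_classifier.predict("这 是 有史以来 最 大 的 一 次 军舰 演习"))
print(text_classifier.score(x_test, y_test))

# persist the fitted classifier and vectorizer for later reuse
text_classifier.save_model("processed_data/nb_text_classifier.joblib")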