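Keras's Tokenizer is a handy tool for converting words into numbers. The relevant functions and how they work are described below.

Installing the packages

Building the dictionary requires the following packages: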

pip install scikit-learn nltk gensim xlrd openpyxl pandas tensorflow==2.10.1

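What Tokenizer does

Tokenizer splits the texts passed in into words and assigns each word an integer index, starting from 1. The sentences must be passed in as a list. tok.fit_on_texts() performs the indexing and builds the dictionary, and tok.word_index holds the resulting word-to-index mapping, which can be printed.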

from keras.preprocessing.text import Tokenizer
text = ['I am Thomas', 'Python is a good language']
tok = Tokenizer()
tok.fit_on_texts(text)
print(tok.word_index)

Result:
{'i': 1, 'am': 2, 'thomas': 3, 'python': 4, 'is': 5, 'a': 6, 'good': 7, 'language': 8}
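
To make the numbering rule concrete, here is a minimal pure-Python sketch of the idea behind fit_on_texts(). It is not the Keras implementation: build_word_index and split_words are hypothetical helpers, and FILTERS is assumed to mirror the punctuation Keras strips by default. Words are lowercased, counted, ranked by frequency, and numbered from 1.

```python
from collections import Counter

# Assumption: mirrors the punctuation that Keras filters out by default.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Hypothetical helper: rank words by frequency and number them from 1."""
    counts = Counter()
    for text in texts:
        counts.update(split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

text = ['I am Thomas', 'Python is a good language']
word_index = build_word_index(text)
print(word_index)
```

Because every word in this tiny example occurs once, the indices simply follow the order in which the words first appear. In a larger corpus the most frequent words get the smallest indices.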

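tok.texts_to_sequences() converts each word into its index. Words that are not in the dictionary are skipped: "hello" and "kevin" below are not in the dictionary, so they do not appear in the output.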

from keras.preprocessing.text import Tokenizer
text = ['I am Thomas', 'Python is a good language']
tok = Tokenizer()
tok.fit_on_texts(text)
print(tok.texts_to_sequences(text))
print(tok.texts_to_sequences(["hello, I am kevin"]))

Result:
[[1, 2, 3], [4, 5, 6, 7, 8]]
[[1, 2]]
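
The skipping behaviour can be sketched in plain Python as a dictionary lookup that drops unknown words. This is an illustration only, not the Keras code: texts_to_ids is a hypothetical helper, FILTERS is an assumed copy of the default punctuation filter, and the word index is typed in by hand from the earlier result.

```python
# Word index as printed earlier by tok.word_index.
word_index = {'i': 1, 'am': 2, 'thomas': 3, 'python': 4,
              'is': 5, 'a': 6, 'good': 7, 'language': 8}

# Assumption: mirrors the punctuation that Keras filters out by default.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
TABLE = str.maketrans({ch: " " for ch in FILTERS})

def texts_to_ids(texts, word_index):
    """Hypothetical helper: map known words to indices, silently drop the rest."""
    sequences = []
    for text in texts:
        words = text.lower().translate(TABLE).split()
        sequences.append([word_index[w] for w in words if w in word_index])
    return sequences

print(texts_to_ids(["hello, I am kevin"], word_index))  # [[1, 2]]
```

If dropping unknown words is not what you want, Keras's Tokenizer also accepts an oov_token argument, which reserves an index for out-of-vocabulary words instead of skipping them.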

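pad_sequences() from keras.utils pads each encoded sentence to a length of maxlen so that all sentences end up the same length. Sentences that are too short are filled with 0; by default the zeros are added at the front.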

from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
text = ['I am Thomas', 'Python is a good language']
tok = Tokenizer()
tok.fit_on_texts(text)
print(pad_sequences(tok.texts_to_sequences(text), maxlen=10))
print(pad_sequences(tok.texts_to_sequences(["hello, I am kevin"]), maxlen=10))

Result:
[[0 0 0 0 0 0 0 1 2 3]
 [0 0 0 0 0 4 5 6 7 8]]
[[0 0 0 0 0 0 0 0 1 2]]
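
As a rough illustration of the padding rule, the sketch below left-pads plain Python lists with zeros and, when a sequence is longer than maxlen, keeps only its last maxlen items. pad_front is a hypothetical helper that imitates the default behaviour of pad_sequences() on lists; the real function returns a NumPy array.

```python
def pad_front(sequences, maxlen, value=0):
    """Hypothetical helper: left-pad (or left-truncate) each list to maxlen items."""
    padded = []
    for seq in sequences:
        # Too long: keep only the last maxlen items. Otherwise copy as-is.
        kept = seq[len(seq) - maxlen:] if len(seq) > maxlen else list(seq)
        # Too short: fill the front with the padding value.
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_front([[1, 2, 3], [4, 5, 6, 7, 8]], maxlen=10))
print(pad_front([[4, 5, 6, 7, 8]], maxlen=3))  # [[6, 7, 8]]
```

In Keras itself, the padding and truncating arguments of pad_sequences() switch between "pre" (the default) and "post" if the zeros should go at the end instead.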

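Saving the dictionary

When the corpus is large, reading it takes a long time, and building the dictionary is also slow. It therefore pays to save the finished dictionary and load it directly the next time it is needed, instead of re-reading the texts and rebuilding the dictionary.

The code below, dictory.py, reads in 1.6 million texts, removes stop words and irrelevant tokens, and builds the dictionary. Finally, pickle.dump() saves the dictionary to the file "eng_dictionary.pkl".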

import pickle
import re
import time
import nltk
import pandas as pd
from keras_preprocessing.text import Tokenizer
from nltk import SnowballStemmer
from nltk.corpus import stopwords

def preprocess(text):
    # Lowercase, then replace mentions, links and non-alphanumeric runs with a space
    text = re.sub(text_cleaning_re, ' ', str(text).lower()).strip()
    # Keep only the tokens that are not stop words
    tokens = []
    for token in text.split():
        if token not in stop_words:
            tokens.append(token)
    return " ".join(tokens)

columns = ["target", "ids", "date", "flag", "user", "text"]
# Raw string so that \S reaches the regex engine unchanged
text_cleaning_re = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"
nltk.download('stopwords')
stop_words = stopwords.words("english")
stemmer = SnowballStemmer("english")  # created here but not used by preprocess()
print("Reading CSV file.....", end="")
t1 = time.time()
df = pd.read_csv('training.1600000.processed.noemoticon.csv',
                 encoding="ISO-8859-1",
                 names=columns)
df.text = df.text.apply(preprocess)
t2 = time.time()
print(f'Elapsed time: {t2-t1:.4f} s')

# Build the dictionary
tok = Tokenizer()
print("Building dictionary......", end="")
t1 = time.time()
tok.fit_on_texts(df.text)
t2 = time.time()
# Save the dictionary (the whole Tokenizer object is pickled)
with open("eng_dictionary.pkl", "wb") as file:
    pickle.dump(tok, file, protocol=0)

vocab_size = len(tok.word_index)
print(f'Elapsed time: {t2-t1:.4f} s')
print(f'{vocab_size} words in total')

Result:
Reading CSV file.....Elapsed time: 29.6225 s
Building dictionary......Elapsed time: 10.4470 s
335507 words in total
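
To see what the cleaning step does to a single text, here is a small self-contained sketch that can be run without the CSV file or the NLTK download. It reuses the same regular expression and logic as preprocess() in the script, but the stop-word set is a tiny hand-written stand-in for the NLTK English list, and the sample sentence is invented for illustration.

```python
import re

# Same pattern as in the script: mentions, links, and non-alphanumeric runs.
text_cleaning_re = r"@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+"

# Assumption: a tiny stand-in for stopwords.words("english").
stop_words = {"i", "am", "is", "a", "the"}

def preprocess(text):
    """Lowercase, strip mentions/links/punctuation, then drop stop words."""
    text = re.sub(text_cleaning_re, ' ', str(text).lower()).strip()
    tokens = [token for token in text.split() if token not in stop_words]
    return " ".join(tokens)

sample = "@thomas I am learning Python!! http://example.com/keras"
print(preprocess(sample))  # learning python
```

The mention and the link are replaced by spaces, the punctuation disappears, and "i" and "am" are dropped as stop words, leaving only the words that are useful for the dictionary.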

Loading the dictionary

To load the dictionary, use pickle.load():

import pickle

with open("eng_dictionary.pkl", 'rb') as file:
    tok = pickle.load(file)
vocab_size = len(tok.word_index)
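
The save-and-load round trip can be tried without the large dataset. The sketch below pickles a plain dict as a stand-in for the Tokenizer object and writes it to a temporary directory; the file name demo_dictionary.pkl and the sample mapping are made up for illustration.

```python
import os
import pickle
import tempfile

# Assumption: a small dict standing in for the fitted Tokenizer object.
word_index = {'i': 1, 'am': 2, 'thomas': 3}

# Write to a throwaway directory so nothing in the working folder is touched.
path = os.path.join(tempfile.mkdtemp(), "demo_dictionary.pkl")

with open(path, "wb") as f:
    pickle.dump(word_index, f, protocol=0)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == word_index)  # True
print(len(restored))           # 3
```

The same pattern applies to the real Tokenizer: whatever object is passed to pickle.dump() comes back from pickle.load() with its contents intact, so no refitting is needed.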
