pip install nltk
pip install cufflinks
nltk is a Python toolkit for working with natural language. It covers tokenization (tokenize), part-of-speech tagging (POS), text classification and more, and is an easy-to-use, ready-made tool. At present, however, its word segmentation module only supports English; Chinese word segmentation is not supported.
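For example, a minimal sketch of tokenization and POS tagging with nltk (the sample sentence is made up; the punkt and tagger models have to be downloaded once):
import nltk
nltk.download('punkt')                       # tokenizer model
nltk.download('averaged_perceptron_tagger')  # POS tagger model
tokens = nltk.word_tokenize('The hotel is located in downtown Seattle.')
print(nltk.pos_tag(tokens))                  # [('The', 'DT'), ('hotel', 'NN'), ...]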
cufflinks links pandas with plotly, so a DataFrame can be plotted interactively by calling .iplot() on it directly. (Cuffdiff, the RNA-seq tool for finding changes in transcript expression, splicing and promoter use, is an unrelated project with a similar name.)
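A minimal sketch of how cufflinks is used once it is imported and switched to offline mode (the toy DataFrame here is made up):
import pandas as pd
import cufflinks
cufflinks.go_offline()
pd.DataFrame({'count': [5, 3, 1]}, index=['pike', 'place', 'market']).iplot(kind='bar')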
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import random
import cufflinks
import matplotlib.pyplot as plt   # used for the bar chart and histogram further down
from plotly.offline import iplot
cufflinks.go_offline()
Load the data and look at its format.
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()
View row 100 of the data, all columns.
Then print row 100 of the desc column in full.
df.iloc[100:101,:]
df['desc'][100]
Split all the descriptions into words and keep the counts. CountVectorizer is a word counter:
vec = CountVectorizer().fit(df['desc'])
bag_of_words = vec.transform(df['desc'])
bag_of_words.shape
# 152 documents (hotel descriptions), with a vocabulary of 3200 distinct words
For details, see the sklearn API documentation.
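As a small illustration of what the count matrix contains (the toy corpus here is made up):
toy = CountVectorizer().fit(['great hotel great view', 'small hotel'])
print(toy.vocabulary_)                                       # word -> column index, e.g. {'great': 0, 'hotel': 1, ...}
print(toy.transform(['great hotel great view']).toarray())   # raw count of each vocabulary word in that document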
There are a couple of ways to inspect the counts. One is to sum the counts over all documents and sort:
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
Wrap the steps above into a function that returns the n most frequent words in a corpus.
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
Get the top 20 words after sorting.
common_words = get_top_n_words(df['desc'], 20)
Put the result into a DataFrame so it is easier to work with.
df1 = pd.DataFrame(common_words,columns=['desc','count'])
A DataFrame can normally be plotted directly with iplot(), but that failed on my version for some reason, so instead I turn the columns into lists and draw them with matplotlib.
a = list(df1['desc'])    # the 20 most frequent words
b = list(df1['count'])   # their counts
plt.barh(a, b)
plt.show()
So, based on the data above: each description is split into words, the duplicates are counted, the counts are sorted, and the result is plotted.
You can also replace .iplot with .plot, as in the code below.
common_words = get_top_n_words(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns=['desc', 'count'])
df3.groupby('desc').sum()['count'].sort_values()
df3.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='top 20 before remove stopwords-ngram_range=(2,2)')
Add a word-count column to the data.
df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
# apply passes each description in the Series to the function
# the lambda splits the text on whitespace and counts the resulting words
df['word_count'].plot(kind='hist', bins=50)
# Plot a histogram of the word counts per description
Note: the code below requires the NLTK data (the stopword list) to be downloaded first.
import nltk
nltk.download('stopwords')   # only the English stopword list is needed here
sub_replace = re.compile('[^0-9a-z #+_]')
stopwords = set(stopwords.words('english'))
def clean_txt(text):
    text = text.lower()                # lower-case first so the regex below matches
    text = sub_replace.sub('', text)   # keep only digits, letters, space, #, + and _
    text = ' '.join(word for word in text.split() if word not in stopwords)
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)
# The stopword filtering can also be omitted:
def clean_txt(text):
    text = text.lower()
    text = sub_replace.sub('', text)
    text = ' '.join(word for word in text.split())
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)
Set the hotel name as the index.
df.set_index('name',inplace = True)
Compute TF-IDF weights according to the formula below.
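For reference, a sketch of the (un-smoothed) form documented for sklearn's TfidfVectorizer; by default sklearn additionally smooths the idf and L2-normalizes each row vector:

tf-idf(t, d) = tf(t, d) × idf(t),   where idf(t) = log(n_d / df(d, t)) + 1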
tf(t,d) is the term-frequency value: how often term t occurs in document d. As the formula shows, the tf value depends on both the term and the document.
n_d is the number of documents in the training set.
df(d,t) is the number of documents that contain term t, so the idf value depends on the total number of documents and on how many of them contain t.
The idf value improves on a purely frequency-based vector: it considers not only how often a word occurs in one document, but also how often it occurs across documents in general. A word that appears in almost every document provides little information for telling documents apart, for example function words such as "the" or "of".
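A tiny worked example of the idf term (the numbers are made up): with n_d = 100 documents, a word that appears in 50 of them gets idf = log(100/50) + 1 ≈ 1.69, while a word that appears in all 100 gets idf = log(100/100) + 1 = 1, the smallest possible weight.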
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,3),stop_words='english')
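To see what ngram_range=(1,3) means, here is a small sketch on a made-up sentence: the vectorizer builds features from single words, pairs and triples of consecutive words (after dropping English stop words).
demo = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
demo.fit(['located in downtown seattle near pike place market'])
print(demo.get_feature_names_out())   # unigrams, bigrams and trigrams; use get_feature_names() on older sklearn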
Pass desc_clean in and compute the TF-IDF matrix; later, each exact hotel name will be matched to a row number.
tfidf_matrix=tf.fit_transform(df['desc_clean'])
Compute pairwise similarities with a linear kernel.
cosine_similarity =linear_kernel(tfidf_matrix,tfidf_matrix)
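linear_kernel computes the plain dot product between rows; because TfidfVectorizer L2-normalizes each row by default, this is the same as the cosine similarity. A quick sanity check against sklearn's own cosine_similarity function (imported under another name so it does not clash with the variable above):
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine
import numpy as np
assert np.allclose(cosine_similarity, sk_cosine(tfidf_matrix))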
indices = pd.Series(df.index)   # position -> hotel name, used to look up the row of a given name
Wrap the lookup and recommendation logic in a function.
def recommendations(name, cosine_similarity):
    recommended_hotels = []
    idx = indices[indices == name].index[0]              # row position of the requested hotel
    score_series = pd.Series(cosine_similarity[idx]).sort_values(ascending=False)
    top_10_indexes = list(score_series[1:11].index)      # skip position 0, the hotel itself
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])
    return recommended_hotels
Pass in an exact hotel name to get recommendations; the second argument is the similarity matrix computed by the previous step, which maps each exact name to its row of scores.
recommendations('Hilton Garden Seattle Downtown',cosine_similarity)