pip install nltk
pip install cufflinks
nltk is a Python toolkit for working with natural language. It covers tokenization (tokenize), part-of-speech tagging (POS), text classification and more, and is an easy-to-use, ready-made tool. At present, however, its word segmentation module only supports English; Chinese word segmentation is not supported.
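For example, a minimal sketch of tokenization and POS tagging with nltk (the sample sentence is made up; the punkt and tagger models have to be downloaded once):
import nltk
nltk.download('punkt')                       # tokenizer model
nltk.download('averaged_perceptron_tagger')  # POS tagger model
tokens = nltk.word_tokenize('The hotel is located in downtown Seattle.')
print(nltk.pos_tag(tokens))                  # [('The', 'DT'), ('hotel', 'NN'), ...]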
cufflinks links pandas with plotly, so a DataFrame can be plotted interactively by calling .iplot() on it directly. (Cuffdiff, the RNA-seq tool for finding changes in transcript expression, splicing and promoter use, is an unrelated project with a similar name.)
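A minimal sketch of how cufflinks is used once it is imported and switched to offline mode (the toy DataFrame here is made up):
import pandas as pd
import cufflinks
cufflinks.go_offline()
pd.DataFrame({'count': [5, 3, 1]}, index=['pike', 'place', 'market']).iplot(kind='bar')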
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
import random
import cufflinks
import matplotlib.pyplot as plt   # used for the bar chart and histogram further down
from plotly.offline import iplot
cufflinks.go_offline()
Load the data and look at its format.
df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
df.head()
View row 100 of the data, all columns.
Then print row 100 of the desc column in full.
df.iloc[100:101,:]
df['desc'][100]
Split all the descriptions into words and keep the counts. CountVectorizer is a word counter:
vec = CountVectorizer().fit(df['desc'])
bag_of_words = vec.transform(df['desc'])
bag_of_words.shape
# 152 documents (hotel descriptions), with a vocabulary of 3200 distinct words
For details, see the sklearn API documentation.
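As a small illustration of what the count matrix contains (the toy corpus here is made up):
toy = CountVectorizer().fit(['great hotel great view', 'small hotel'])
print(toy.vocabulary_)                                       # word -> column index, e.g. {'great': 0, 'hotel': 1, ...}
print(toy.transform(['great hotel great view']).toarray())   # raw count of each vocabulary word in that document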
There are a couple of ways to inspect the counts. One is to sum the counts over all documents and sort:
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
Wrap the steps above into a function that returns the n most frequent words in a corpus.
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
Get the top 20 words after sorting.
common_words = get_top_n_words(df['desc'], 20)
Put the result into a DataFrame so it is easier to work with.
df1 = pd.DataFrame(common_words,columns=['desc','count'])
A DataFrame can normally be plotted directly with iplot(), but that failed on my version for some reason, so instead I turn the columns into lists and draw them with matplotlib.
a = list(df1['desc'])    # the 20 most frequent words
b = list(df1['count'])   # their counts
plt.barh(a, b)
plt.show()
So, based on the data above: each description is split into words, the duplicates are counted, the counts are sorted, and the result is plotted.
You can also replace .iplot with .plot, as in the code below.
common_words = get_top_n_words(df['desc'], 20)
df3 = pd.DataFrame(common_words, columns=['desc', 'count'])
df3.groupby('desc').sum()['count'].sort_values()
df3.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='top 20 before remove stopwords-ngram_range=(2,2)')
Add a word-count column to the data.
df['word_count'] = df['desc'].apply(lambda x: len(str(x).split()))
# apply passes each description in the Series to the function
# the lambda splits the text on whitespace and counts the resulting words
df['word_count'].plot(kind='hist', bins=50)
# Plot a histogram of the word counts per description
Note: the code below requires the NLTK data (the stopword list) to be downloaded first.
import nltk
nltk.download('stopwords')   # only the English stopword list is needed here
sub_replace = re.compile('[^0-9a-z #+_]')
stopwords = set(stopwords.words('english'))
def clean_txt(text):
    text = text.lower()                # lower-case first so the regex below matches
    text = sub_replace.sub('', text)   # keep only digits, letters, space, #, + and _
    text = ' '.join(word for word in text.split() if word not in stopwords)
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)
# The stopword filtering can also be omitted:
def clean_txt(text):
    text = text.lower()
    text = sub_replace.sub('', text)
    text = ' '.join(word for word in text.split())
    return text
df['desc_clean'] = df['desc'].apply(clean_txt)
Set the hotel name as the index.
df.set_index('name',inplace = True)
Compute TF-IDF weights according to the formula below.
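For reference, a sketch of the (un-smoothed) form documented for sklearn's TfidfVectorizer; by default sklearn additionally smooths the idf and L2-normalizes each row vector:

tf-idf(t, d) = tf(t, d) × idf(t),   where idf(t) = log(n_d / df(d, t)) + 1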
tf(t,d) is the term-frequency value: how often term t occurs in document d. As the formula shows, the tf value depends on both the term and the document.
n_d is the number of documents in the training set.
df(d,t) is the number of documents that contain term t, so the idf value depends on the total number of documents and on how many of them contain t.
The idf value improves on a purely frequency-based vector: it considers not only how often a word occurs in one document, but also how often it occurs across documents in general. A word that appears in almost every document provides little information for telling documents apart, for example function words such as "the" or "of".
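A tiny worked example of the idf term (the numbers are made up): with n_d = 100 documents, a word that appears in 50 of them gets idf = log(100/50) + 1 ≈ 1.69, while a word that appears in all 100 gets idf = log(100/100) + 1 = 1, the smallest possible weight.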
tf=TfidfVectorizer(analyzer='word',ngram_range=(1,3),stop_words='english')
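To see what ngram_range=(1,3) means, here is a small sketch on a made-up sentence: the vectorizer builds features from single words, pairs and triples of consecutive words (after dropping English stop words).
demo = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
demo.fit(['located in downtown seattle near pike place market'])
print(demo.get_feature_names_out())   # unigrams, bigrams and trigrams; use get_feature_names() on older sklearn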
Pass desc_clean in and compute the TF-IDF matrix; later, each exact hotel name will be matched to a row number.
tfidf_matrix=tf.fit_transform(df['desc_clean'])
Compute pairwise similarities with a linear kernel.
cosine_similarity =linear_kernel(tfidf_matrix,tfidf_matrix)
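linear_kernel computes the plain dot product between rows; because TfidfVectorizer L2-normalizes each row by default, this is the same as the cosine similarity. A quick sanity check against sklearn's own cosine_similarity function (imported under another name so it does not clash with the variable above):
from sklearn.metrics.pairwise import cosine_similarity as sk_cosine
import numpy as np
assert np.allclose(cosine_similarity, sk_cosine(tfidf_matrix))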
indices = pd.Series(df.index)   # position -> hotel name, used to look up the row of a given name
Wrap the lookup and recommendation logic in a function.
def recommendations(name, cosine_similarity):
    recommended_hotels = []
    idx = indices[indices == name].index[0]              # row position of the requested hotel
    score_series = pd.Series(cosine_similarity[idx]).sort_values(ascending=False)
    top_10_indexes = list(score_series[1:11].index)      # skip position 0, the hotel itself
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])
    return recommended_hotels
Pass in an exact hotel name to get recommendations; the second argument is the similarity matrix computed by the previous step, which maps each exact name to its row of scores.
recommendations('Hilton Garden Seattle Downtown',cosine_similarity)