Categorical data in pandas

Official account: Youer cottage | Author: Peter | Editor: Pete

Hello everyone, I am Peter.

This article introduces the Categorical type in pandas. It stores categorical data as integer codes that point into a set of categories, which helps users get better performance and lower memory usage.

Background: counting duplicate values

The values in a Series are often repeated. We frequently need to extract the distinct values and count how often each one occurs:

import numpy as np
import pandas as pd

data = pd.Series(["Chinese language and literature", "mathematics", "English", "mathematics", "English", "Geography", "Chinese language and literature", "Chinese language and literature"])
data

0    Chinese language and literature
1                        mathematics
2                            English
3                        mathematics
4                            English
5                          Geography
6    Chinese language and literature
7    Chinese language and literature
dtype: object

# 1. Extract the distinct values
pd.unique(data)

array(['Chinese language and literature', 'mathematics', 'English', 'Geography'], dtype=object)

# 2. Count the occurrences of each value
# (the top-level pd.value_counts is deprecated in recent pandas, so the Series method is used)
data.value_counts()

Chinese language and literature    3
mathematics                        2
English                            2
Geography                          1
dtype: int64

Categorical (dictionary-encoded) representation

Storing each value as an integer that points into an array of distinct values is called a categorical or dictionary-encoded representation. The array of distinct values can be called the categories, the dictionary, or the levels of the data.
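To make the idea concrete, here is a minimal sketch using pd.factorize, which is not used in the original article but performs this kind of encoding. The variable names and sample values are made up for illustration.

```python
import pandas as pd

values = pd.Series(["Chinese", "mathematics", "English", "mathematics"])

# factorize returns integer codes plus the distinct values,
# numbered in order of first appearance.
codes, uniques = pd.factorize(values)

print(codes.tolist())   # [0, 1, 2, 1]
print(list(uniques))    # ['Chinese', 'mathematics', 'English']

# Looking each code up in the distinct values restores the original data.
restored = uniques.take(codes)
print(list(restored))
```

The codes are small integers and the distinct values are stored only once, which is where the memory savings come from.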

df = pd.Series([0, 1, 1, 0] * 2)
df

0    0
1    1
2    1
3    0
4    0
5    1
6    1
7    0
dtype: int64

# dim acts as a dimension (lookup) table
dim = pd.Series(["Chinese language and literature", "mathematics"])
dim

0    Chinese language and literature
1                        mathematics
dtype: object

How can the codes in df be mapped to their labels (0 for Chinese language and literature, 1 for mathematics)? Use the **take** method:

df1 = dim.take(df)
df1

0    Chinese language and literature
1                        mathematics
1                        mathematics
0    Chinese language and literature
0    Chinese language and literature
1                        mathematics
1                        mathematics
0    Chinese language and literature
dtype: object

type(df1)  # a Series

pandas.core.series.Series
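Note that the index of the result repeats 0 and 1, because take carries over the index labels of the rows it picks. A small sketch of this behavior, with hypothetical names codes, looked_up and clean, and one way to get a fresh 0..n-1 index:

```python
import pandas as pd

codes = pd.Series([0, 1, 1, 0] * 2)
dim = pd.Series(["Chinese language and literature", "mathematics"])

# take selects rows by position; each selected row keeps its label from dim.
looked_up = dim.take(codes.to_numpy())
print(looked_up.index.tolist())   # [0, 1, 1, 0, 0, 1, 1, 0]

# reset_index(drop=True) replaces those labels with a fresh RangeIndex.
clean = looked_up.reset_index(drop=True)
print(clean.index.tolist())       # [0, 1, 2, 3, 4, 5, 6, 7]
```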

Creating the Categorical type

Creating a Categorical instance

The following example shows how the Categorical type is used.

subjects = ["Chinese language and literature", "mathematics", "Chinese language and literature", "Chinese language and literature"] * 2
N = len(subjects)

df2 = pd.DataFrame({
    "subject": subjects,
    "id": np.arange(N),  # consecutive integers
    "score": np.random.randint(3, 15, size=N),  # random integers in [3, 15)
    "height": np.random.uniform(165, 180, size=N)  # uniformly distributed floats in [165, 180)
},
columns=["id", "subject", "score", "height"])  # specify the column order
df2

The subject column can be converted to the Categorical type:

subject_cat = df2["subject"].astype("category")
subject_cat

Two things stand out about subject_cat:

Its values are not a NumPy array but a Categorical
It has two categories: Chinese language and literature, and mathematics

s = subject_cat.values
s

['Chinese language and literature', 'mathematics', 'Chinese language and literature', 'Chinese language and literature', 'Chinese language and literature', 'mathematics', 'Chinese language and literature', 'Chinese language and literature']
Categories (2, object): ['Chinese language and literature', 'mathematics']

type(s)

pandas.core.arrays.categorical.Categorical

s.categories  # view the categories

Index(['Chinese language and literature', 'mathematics'], dtype='object')

s.codes  # view the category codes

array([0, 1, 0, 0, 0, 1, 0, 0], dtype=int8)
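The categories and codes together fully describe the data: each code is a position in the categories index. A minimal sketch with made-up values (the names cat and rebuilt are illustrative only):

```python
import pandas as pd

cat = pd.Categorical(["b", "a", "b", "b"])

print(list(cat.categories))   # ['a', 'b']  (sorted distinct values)
print(cat.codes.tolist())     # [1, 0, 1, 1]

# Looking each code up in the categories rebuilds the original values.
rebuilt = cat.categories.take(cat.codes)
print(list(rebuilt))          # ['b', 'a', 'b', 'b']
```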

How to create a Categorical object

There are three main ways:

Convert a DataFrame column to the category dtype
Build one directly with pandas.Categorical
Use the from_codes constructor, provided the categories and codes are already available

# Method 1
df2["subject"] = df2["subject"].astype("category")
df2.subject

0    Chinese language and literature
1                        mathematics
2    Chinese language and literature
3    Chinese language and literature
4    Chinese language and literature
5                        mathematics
6    Chinese language and literature
7    Chinese language and literature
Name: subject, dtype: category
Categories (2, object): ['Chinese language and literature', 'mathematics']

# Method 2
fruit = pd.Categorical(["Apple", "Banana", "grapes", "Apple", "Apple", "Banana"])
fruit

['Apple', 'Banana', 'grapes', 'Apple', 'Apple', 'Banana']
Categories (3, object): ['Apple', 'Banana', 'grapes']

# Method 3
categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]
my_data = pd.Categorical.from_codes(codes, categories)
my_data

['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height', 'score', 'subject']

By default, a categorical conversion does not assume any order among the categories. Passing ordered=True declares that the order of the categories is meaningful:
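The call that produces the ordered output is not shown in the article. A minimal sketch, assuming it reuses the codes and categories defined for from_codes above (the name ordered_data is hypothetical):

```python
import pandas as pd

categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]

# ordered=True marks the category order as meaningful,
# so the repr shows 'height' < 'score' < 'subject'.
ordered_data = pd.Categorical.from_codes(codes, categories, ordered=True)
ordered_data
```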

['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']

In the output above, height < score means that height comes before score in the category order. If a Categorical instance is unordered, as_ordered returns an ordered version of it:

# my_data is unordered
my_data.as_ordered()

['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
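An ordered categorical supports comparisons and min/max based on the category order, not on alphabetical order. A short sketch with made-up size labels (all names here are illustrative):

```python
import pandas as pd

sizes = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
)

# as_ordered returns a new object; the original stays unordered.
ordered = sizes.as_ordered()
print(sizes.ordered, ordered.ordered)   # False True

# min, max and comparisons follow the declared category order.
print(ordered.min(), ordered.max())     # small large
print((ordered > "small").tolist())     # [False, True, True]
```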

Computing with Categorical objects

Statistical calculations

np.random.seed(12345)
data1 = np.random.randn(100)
data1[:10]

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057,
        1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474])

# Split data1 into four quantile bins (quartiles)
bins_1 = pd.qcut(data1, 4)
bins_1

[(-0.717, 0.106], (0.106, 0.761], (-0.717, 0.106], (-0.717, 0.106], (0.761, 3.249], ..., (0.761, 3.249], (0.106, 0.761], (-2.371, -0.717], (0.106, 0.761], (0.106, 0.761]]
Length: 100
Categories (4, interval[float64]): [(-2.371, -0.717] < (-0.717, 0.106] < (0.106, 0.761] < (0.761, 3.249]]

The result returned above is a Categorical object:

It has 4 categories, one per quartile interval
The minimum and maximum of the whole data set fall in the first and last intervals respectively
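Because qcut bins by sample quantiles, each bin receives roughly the same number of observations. A small sketch on a hypothetical sample (the generator seed and variable names are assumptions, not the article's data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.standard_normal(100)

# Four quantile-based bins over 100 values: about 25 values per bin.
bins = pd.qcut(values, 4)
counts = pd.Series(bins).value_counts(sort=False)
print(counts)

# The bins are ordered interval categories.
print(bins.ordered)
```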

# Label the four bins with quartile names: Q1, Q2, Q3, Q4
bins_2 = pd.qcut(data1, 4, labels=["Q1", "Q2", "Q3", "Q4"])
bins_2

['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q4', 'Q3', 'Q1', 'Q3', 'Q3']
Length: 100
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

bins_2.codes[:10]

array([1, 2, 1, 1, 3, 3, 1, 2, 3, 3], dtype=int8)

Use groupby to compute summary statistics:

bins_2 = pd.Series(bins_2, name="quartile")  # name the Series "quartile"
bins_2

0     Q2
1     Q3
2     Q2
3     Q2
4     Q4
      ..
95    Q4
96    Q3
97    Q1
98    Q3
99    Q3
Name: quartile, Length: 100, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

The following code groups data1 by bins_2 and computes three statistics for each group:

results = pd.Series(data1).groupby(bins_2).agg(["count", "min", "max"]).reset_index()
results

results["quartile"]  # the quartile column keeps the original categorical information

0    Q1
1    Q2
2    Q3
3    Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
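When grouping by a categorical key, the observed parameter of groupby controls whether categories with no rows still appear in the result. A minimal sketch with made-up data (the names values, groups, all_groups and seen_only are illustrative):

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0])
groups = pd.Series(
    pd.Categorical(["Q1", "Q1", "Q3"], categories=["Q1", "Q2", "Q3"], ordered=True),
    name="quartile",
)

# observed=False keeps every category, including the empty Q2.
all_groups = values.groupby(groups, observed=False).count()
print(all_groups.tolist())    # [2, 0, 1]

# observed=True keeps only categories that actually occur.
seen_only = values.groupby(groups, observed=True).count()
print(seen_only.tolist())     # [2, 1]
```

Passing observed explicitly also avoids depending on the default, which differs between pandas versions.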

Memory savings from categoricals

N = 10_000_000  # ten million rows
data3 = pd.Series(np.random.randn(N))
labels3 = pd.Series(["foo", "bar", "baz", "quz"] * (N // 4))

categories3 = labels3.astype("category")  # convert to categorical

# Compare the memory usage of the string labels and their categorical version
print("labels3: ", labels3.memory_usage())
print("categories3: ", categories3.memory_usage())

labels3: 80000128
categories3: 10000332
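Without deep=True, memory_usage on an object column counts only the 8-byte pointers, not the Python strings they refer to, so the real saving is larger than the figures suggest. A sketch on a smaller, hypothetical sample (dtype=object is set explicitly as an assumption, so the comparison does not depend on the default string dtype of the installed pandas):

```python
import pandas as pd

n = 100_000
labels = pd.Series(["foo", "bar", "baz", "quz"] * (n // 4), dtype=object)
cats = labels.astype("category")

# deep=True also counts the memory held by the string objects themselves.
labels_bytes = labels.memory_usage(deep=True)
cats_bytes = cats.memory_usage(deep=True)
print(labels_bytes, cats_bytes)

# With only four categories, each code fits in a single byte.
print(cats.cat.codes.dtype)   # int8
```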

Categorical methods

Accessing categorical information

Categorical methods are reached mainly through the special cat accessor of a Series.

data

0    Chinese language and literature
1                        mathematics
2                            English
3                        mathematics
4                            English
5                          Geography
6    Chinese language and literature
7    Chinese language and literature
dtype: object

cat_data = data.astype("category")
cat_data  # categorical data

0    Chinese language and literature
1                        mathematics
2                            English
3                        mathematics
4                            English
5                          Geography
6    Chinese language and literature
7    Chinese language and literature
dtype: category
Categories (4, object): ['Chinese language and literature', 'English', 'Geography', 'mathematics']

Adding categories

Suppose the actual set of categories is larger than the 4 values observed in the data:

actual_cat = ["Chinese language and literature", "mathematics", "English", "Geography", "biological"]
cat_data2 = cat_data.cat.set_categories(actual_cat)
cat_data2

The categories of the result now include "biological".

cat_data.value_counts()

Chinese language and literature    3
mathematics                        2
English                            2
Geography                          1
dtype: int64

cat_data2.value_counts()  # "biological" appears in the result below with a count of 0

Chinese language and literature    3
mathematics                        2
English                            2
Geography                          1
biological                         0
dtype: int64
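set_categories replaces the whole category set, so any value missing from the new set becomes null, while add_categories only appends. A minimal sketch of the difference, with made-up values (s, added and replaced are illustrative names):

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")

# add_categories appends new categories and leaves the data untouched.
added = s.cat.add_categories(["c"])
print(list(added.cat.categories))      # ['a', 'b', 'c']

# set_categories swaps in a new set; "b" is no longer a category,
# so the value "b" becomes NaN.
replaced = s.cat.set_categories(["a", "c"])
print(list(replaced.cat.categories))   # ['a', 'c']
print(replaced.isna().tolist())        # [False, True, False]
```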

Removing categories

cat_data3 = cat_data[cat_data.isin(["Chinese language and literature", "mathematics"])]  # keep only Chinese language and literature, and mathematics
cat_data3

0    Chinese language and literature
1                        mathematics
3                        mathematics
6    Chinese language and literature
7    Chinese language and literature
dtype: category
Categories (4, object): ['Chinese language and literature', 'English', 'Geography', 'mathematics']

cat_data3.cat.remove_unused_categories()  # drop the categories that no longer appear

0    Chinese language and literature
1                        mathematics
3                        mathematics
6    Chinese language and literature
7    Chinese language and literature
dtype: category
Categories (2, object): ['Chinese language and literature', 'mathematics']

Creating dummy variables

Converting categorical data into dummy variables is also known as one-hot encoding. Each distinct category becomes one column of the resulting DataFrame. See the following example:

data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
data4

0    col1
1    col2
2    col3
3    col4
4    col1
5    col2
6    col3
7    col4
dtype: category
Categories (4, object): ['col1', 'col2', 'col3', 'col4']

pd.get_dummies(data4)  # get_dummies converts one-dimensional categorical data into a DataFrame of dummy variables
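The resulting table is not reproduced in the article. A small sketch that inspects its shape instead (the names data and dummies are illustrative):

```python
import pandas as pd

data = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
dummies = pd.get_dummies(data)

# One row per original value, one column per category.
print(dummies.shape)            # (8, 4)
print(list(dummies.columns))    # ['col1', 'col2', 'col3', 'col4']

# Exactly one column is active in every row.
print((dummies.sum(axis=1) == 1).all())   # True
```

Depending on the pandas version, the cells are booleans or 0/1 integers; the row sums are 1 in both cases.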

Summary of categorical methods

add_categories: append new categories at the end of the existing ones
as_ordered: make the categories ordered
as_unordered: make the categories unordered
remove_categories: remove the given categories, setting any removed values to null
remove_unused_categories: remove all categories that do not appear in the data
rename_categories: replace the category names without changing the number of categories
reorder_categories: reorder the categories
set_categories: replace the categories with a specified new set; categories can be added or removed
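A few of these methods in action, as a hedged sketch on made-up data (all variable names are illustrative). Each method is called through the cat accessor and returns a new Series:

```python
import pandas as pd

s = pd.Series(["low", "high", "mid", "low"], dtype="category")
print(list(s.cat.categories))          # ['high', 'low', 'mid']

# rename_categories: same number of categories, new names.
renamed = s.cat.rename_categories({"mid": "medium"})
print(list(renamed.cat.categories))    # ['high', 'low', 'medium']

# reorder_categories: same categories, new (here ordered) arrangement.
reordered = s.cat.reorder_categories(["low", "mid", "high"], ordered=True)
print(reordered.min(), reordered.max())   # low high

# remove_categories: values of a removed category become NaN.
removed = s.cat.remove_categories(["mid"])
print(removed.isna().tolist())         # [False, False, True, False]
```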