Official account: Youer cottage | Author: Peter | Editor: Pete
Hello everyone, I am Peter.
This article introduces the pandas Categorical type, which is used mainly for categorical data. It stores values as integer-based category codes, which can give better performance and lower memory usage.
Duplicate values often appear in a Series. We need to extract the distinct values and count how often each one occurs:
import numpy as np
import pandas as pd
data = pd.Series(["Chinese language and literature", "mathematics", "English", "mathematics", "English", "Geography", "Chinese language and literature", "Chinese language and literature"])
data
0 Chinese language and literature
1 mathematics
2 English
3 mathematics
4 English
5 Geography
6 Chinese language and literature
7 Chinese language and literature
dtype: object
pd.unique(data)  # 1. extract the distinct values
array(['Chinese language and literature', 'mathematics', 'English', 'Geography'], dtype=object)
data.value_counts()  # 2. count how often each value occurs (method form of pd.value_counts)
Chinese language and literature 3
mathematics 2
English 2
Geography 1
dtype: int64
Representing the values as integers is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, the dictionary, or the levels of the data.
df = pd.Series([0, 1, 1, 0] * 2)
df
0 0
1 1
2 1
3 0
4 0
5 1
6 1
7 0
dtype: int64
dim = pd.Series(["Chinese language and literature", "mathematics"])  # dim acts as the dimension table
dim
0 Chinese language and literature
1 mathematics
dtype: object
How do we map each 0 in df to Chinese language and literature and each 1 to mathematics? Use the **take** method:
df1 = dim.take(df)
df1
0 Chinese language and literature
1 mathematics
1 mathematics
0 Chinese language and literature
0 Chinese language and literature
1 mathematics
1 mathematics
0 Chinese language and literature
dtype: object
type(df1) # Series data
pandas.core.series.Series
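The take example goes from integer codes back to labels. As a rough sketch of the opposite direction, pd.factorize can produce the codes and the distinct values from the raw labels. The variable names below are illustrative and not from the original article:

```python
import pandas as pd

values = pd.Series(["Chinese language and literature", "mathematics", "English",
                    "mathematics", "English", "Geography",
                    "Chinese language and literature", "Chinese language and literature"])

# factorize returns integer codes plus the distinct values in order of appearance
codes, uniques = pd.factorize(values)
print(codes)
print(uniques)

# take() maps the codes back to labels; the index is reset for comparison
restored = pd.Series(uniques).take(codes).reset_index(drop=True)
print(list(restored) == list(values))
```

Here the distinct values play the role of the dim table and the codes play the role of df.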
The following example illustrates how to use the Categorical type:
subjects = ["Chinese language and literature", "mathematics", "Chinese language and literature", "Chinese language and literature"] * 2
N = len(subjects)
df2 = pd.DataFrame({
    "subject": subjects,
    "id": np.arange(N),  # consecutive integers
    "score": np.random.randint(3, 15, size=N),  # random integers
    "height": np.random.uniform(165, 180, size=N),  # uniformly distributed floats
}, columns=["id", "subject", "score", "height"])  # specify the column order
df2
We can convert the subject column to the Categorical type:
subject_cat = df2["subject"].astype("category")
subject_cat
The Categorical object behind subject_cat has two notable attributes, categories and codes:
s = subject_cat.values
s
['Chinese language and literature', 'mathematics', 'Chinese language and literature', 'Chinese language and literature', 'Chinese language and literature', 'mathematics', 'Chinese language and literature', 'Chinese language and literature']
Categories (2, object): ['mathematics', 'Chinese language and literature']
type(s)
pandas.core.arrays.categorical.Categorical
s.categories # Check the categories
Index(['mathematics', 'Chinese language and literature'], dtype='object')
s.codes # View classification code
array([1, 0, 1, 1, 1, 0, 1, 1], dtype=int8)
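To make the relationship between codes and categories concrete, here is a small sketch that decodes the codes by hand. It rebuilds the subject column on its own so that it is self-contained. The exact code assigned to each label depends on the sorted order of the categories, so it may differ from the output shown above:

```python
import pandas as pd

subjects = ["Chinese language and literature", "mathematics",
            "Chinese language and literature", "Chinese language and literature"] * 2
s = pd.Series(subjects).astype("category").values

# Each code is simply a position in s.categories
mapping = dict(enumerate(s.categories))
decoded = [mapping[code] for code in s.codes]

print(mapping)
print(decoded == subjects)
```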
There are three main ways to create a Categorical object:
df2["subject"] = df2["subject"].astype("category")  # Method 1: convert an existing column with astype
df2.subject
0 Chinese language and literature
1 mathematics
2 Chinese language and literature
3 Chinese language and literature
4 Chinese language and literature
5 mathematics
6 Chinese language and literature
7 Chinese language and literature
Name: subject, dtype: category
Categories (2, object): ['mathematics', 'Chinese language and literature']
fruit = pd.Categorical(["Apple", "Banana", "grapes", "Apple", "Apple", "Banana"])  # Method 2: build it directly from a sequence
fruit
['Apple', 'Banana', 'grapes', 'Apple', 'Apple', 'Banana']
Categories (3, object): ['Apple', 'grapes', 'Banana']
categories = ["height", "score", "subject"]  # Method 3: build it from existing codes and categories
codes = [0, 1, 0, 2, 1, 0]
my_data = pd.Categorical.from_codes(codes, categories)
my_data
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height', 'score', 'subject']
By default, a categorical conversion does not assume any ordering of the categories. We can pass the ordered parameter to specify a meaningful order:
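The call that produced the output below does not appear in the text. A minimal sketch, assuming it reuses the codes and categories from the from_codes example and passes ordered=True (the name ordered_cat is hypothetical):

```python
import pandas as pd

categories = ["height", "score", "subject"]
codes = [0, 1, 0, 2, 1, 0]

# ordered=True records that the categories follow the listed order
ordered_cat = pd.Categorical.from_codes(codes, categories, ordered=True)
print(ordered_cat)
```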
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
In the output above, height < score indicates that height comes before score in the ordering. If a categorical instance is unordered, we can make it ordered with as_ordered:
my_data.as_ordered()  # my_data itself is unordered
['height', 'score', 'height', 'subject', 'score', 'height']
Categories (3, object): ['height' < 'score' < 'subject']
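One practical consequence of an ordering is that comparisons and min/max become meaningful. The sketch below uses hypothetical size labels (not from the article) to show this, and shows that the same comparison is rejected once the ordering is dropped:

```python
import pandas as pd

sizes = pd.Categorical(["medium", "small", "large", "small"],
                       categories=["small", "medium", "large"],
                       ordered=True)

# min, max and comparisons follow the category order, not alphabetical order
print(sizes.min(), sizes.max())
larger_than_small = sizes > "small"
print(larger_than_small)

# An unordered categorical only supports equality comparisons
raised = False
try:
    sizes.as_unordered() > "small"
except TypeError:
    raised = True
print(raised)
```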
np.random.seed(12345)
data1 = np.random.randn(100)
data1[:10]
array([-0.20470766, 0.47894334, -0.51943872, -0.5557303 , 1.96578057,
1.39340583, 0.09290788, 0.28174615, 0.76902257, 1.24643474])
bins_1 = pd.qcut(data1, 4)  # split data1 into 4 quantile bins (quartiles)
bins_1
[(-0.717, 0.106], (0.106, 0.761], (-0.717, 0.106], (-0.717, 0.106], (0.761, 3.249], ..., (0.761, 3.249], (0.106, 0.761], (-2.371, -0.717], (0.106, 0.761], (0.106, 0.761]]
Length: 100
Categories (4, interval[float64]): [(-2.371, -0.717] < (-0.717, 0.106] < (0.106, 0.761] < (0.761, 3.249]]
As the result above shows, qcut returns a Categorical object.
bins_2 = pd.qcut(data1, 4, labels=["Q1", "Q2", "Q3", "Q4"])  # label the 4 bins with quartile names Q1 to Q4
bins_2
['Q2', 'Q3', 'Q2', 'Q2', 'Q4', ..., 'Q4', 'Q3', 'Q1', 'Q3', 'Q3']
Length: 100
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
bins_2.codes[:10]
array([1, 2, 1, 1, 3, 3, 1, 2, 3, 3], dtype=int8)
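Because qcut cuts at sample quantiles, each quartile should hold roughly the same number of observations. A quick check, assuming the same seed and sample size as above:

```python
import numpy as np
import pandas as pd

np.random.seed(12345)
data1 = np.random.randn(100)

quartiles = pd.qcut(data1, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Count the observations per quartile, listed in category order
counts = pd.Series(quartiles).value_counts().sort_index()
print(counts)
```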
We can use groupby to compute summary statistics:
bins_2 = pd.Series(bins_2, name="quartile")  # name the Series quartile
bins_2
0 Q2
1 Q3
2 Q2
3 Q2
4 Q4
..
95 Q4
96 Q3
97 Q1
98 Q3
99 Q3
Name: quartile, Length: 100, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
The following code groups data1 by bins_2 and computes three statistics (count, min, and max):
results = pd.Series(data1).groupby(bins_2).agg(["count", "min", "max"]).reset_index()
results
results["quartile"]  # the quartile column retains the original categorical information
0 Q1
1 Q2
2 Q3
3 Q4
Name: quartile, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']
N = 10000000  # ten million rows
data3 = pd.Series(np.random.randn(N))
labels3 = pd.Series(["foo", "bar", "baz", "quz"] * (N // 4))
categories3 = labels3.astype("category")  # convert to the category dtype
print("data3: ", data3.memory_usage())  # compare the memory usage of the two
print("categories3: ", categories3.memory_usage())
data3: 80000128
categories3: 10000332
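The printout above compares the float column data3 with categories3. A more direct comparison may be between the string labels and their categorical version. The sketch below does that on a smaller, hypothetical sample size so that it runs quickly; exact byte counts vary with the pandas version:

```python
import pandas as pd

n = 100_000  # smaller than the article's N so the sketch runs quickly
labels = pd.Series(["foo", "bar", "baz", "quz"] * (n // 4))
as_category = labels.astype("category")

# deep=True also counts the string data itself, not just the pointers to it
label_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(label_bytes, category_bytes)
```

The categorical version only needs one small integer code per row plus a single copy of each distinct label.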
Categorical methods are accessed mainly through the special cat accessor:
data
0 Chinese language and literature
1 mathematics
2 English
3 mathematics
4 English
5 Geography
6 Chinese language and literature
7 Chinese language and literature
dtype: object
cat_data = data.astype("category")
cat_data  # categorical data
0 Chinese language and literature
1 mathematics
2 English
3 mathematics
4 English
5 Geography
6 Chinese language and literature
7 Chinese language and literature
dtype: category
Categories (4, object): ['Geography', 'mathematics', 'English', 'Chinese language and literature']
When the actual set of categories goes beyond the 4 values observed in the data, use set_categories to declare them:
actual_cat = ["Chinese language and literature", "mathematics", "English", "Geography", "biological"]
cat_data2 = cat_data.cat.set_categories(actual_cat)
cat_data2
The resulting categories now include "biological", even though that value never occurs in the data.
cat_data.value_counts()
Chinese language and literature 3
mathematics 2
English 2
Geography 1
dtype: int64
cat_data2.value_counts()  # "biological" appears in the result below with a count of 0
Chinese language and literature 3
mathematics 2
English 2
Geography 1
biological 0
dtype: int64
cat_data3 = cat_data[cat_data.isin(["Chinese language and literature", "mathematics"])]  # keep only Chinese and mathematics
cat_data3
0 Chinese language and literature
1 mathematics
3 mathematics
6 Chinese language and literature
7 Chinese language and literature
dtype: category
Categories (4, object): ['Geography', 'mathematics', 'English', 'Chinese language and literature']
cat_data3.cat.remove_unused_categories()  # drop the categories that no longer occur
0 Chinese language and literature
1 mathematics
3 mathematics
6 Chinese language and literature
7 Chinese language and literature
dtype: category
Categories (2, object): ['mathematics', 'Chinese language and literature']
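The cat accessor offers a few related methods beyond the ones shown here. A short sketch on a hypothetical toy Series (the labels a, b, c, d are not from the article):

```python
import pandas as pd

cat_s = pd.Series(["a", "b", "a", "c"], dtype="category")

added = cat_s.cat.add_categories(["d"])            # append a new, unused category
renamed = cat_s.cat.rename_categories({"a": "A"})  # relabel an existing category
removed = cat_s.cat.remove_categories(["c"])       # values equal to "c" become NaN

print(list(added.cat.categories))
print(list(renamed))
print(removed.isna().sum())
```

Each of these returns a new Series and leaves cat_s unchanged.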
Converting categorical data into dummy variables is also known as one-hot encoding. The resulting DataFrame has one column for each distinct category, as the following example shows:
data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
data4
0 col1
1 col2
2 col3
3 col4
4 col1
5 col2
6 col3
7 col4
dtype: category
Categories (4, object): ['col1', 'col2', 'col3', 'col4']
pd.get_dummies(data4)  # get_dummies converts one-dimensional categorical data into a DataFrame of dummy variables
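The resulting table is not shown above. As a sketch of what to expect, the snippet below inspects the shape and columns of the result and checks that each row marks exactly one category:

```python
import pandas as pd

data4 = pd.Series(["col1", "col2", "col3", "col4"] * 2, dtype="category")
dummies = pd.get_dummies(data4)

print(dummies.shape)          # one row per value, one column per category
print(list(dummies.columns))

# In a one-hot encoding, every row has exactly one active column
one_hot_ok = bool((dummies.sum(axis=1) == 1).all())
print(one_hot_ok)
```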