
Pandas common operations

Editor: Python


First, create a sample DataFrame to experiment with:

import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})
print(df)

Output:

   a  b
0  1  1
1  2  2
2  3  3

1. Modify the header (columns)

1. Rename all columns

For example, rename columns a and b to A and B:

df.columns = ['A','B']
print(df)

Result:

   A  B
0  1  1
1  2  2
2  3  3

2. Rename only specified columns

For example, rename a to A:

df.rename(columns={'a': 'A'}, inplace=True)
print(df)

Result (starting from the original df with columns a and b):

   A  b
0  1  1
1  2  2
2  3  3
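The difference between the two renaming approaches can be sketched as follows. This is a minimal illustration, not part of the original post; the variable names renamed and unchanged are made up for the example. It shows that rename() returns a new object unless inplace=True is passed, and that keys matching no column are ignored by default.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3]})

# rename() returns a new DataFrame; df itself is left untouched
# because inplace=True is not passed here.
renamed = df.rename(columns={'a': 'A'})

# A key that matches no existing column is silently ignored by default.
unchanged = df.rename(columns={'missing': 'X'})

print(list(df.columns))
print(list(renamed.columns))
print(list(unchanged.columns))
```

Assigning to df.columns replaces every name at once and requires a list of exactly the right length, so rename() is usually the safer choice when only a few names change.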

Other common operations

Conventions used below:

df: any pandas DataFrame object
s: any pandas Series object
raw: a row label
col: a column label

Import the required packages:

import pandas as pd
import numpy as np

1. Import data

pd.read_csv(filename_path): import data from a CSV file
pd.read_table(filename_path): import data from a delimited text file (tab-separated by default)
pd.read_excel(filename_path): import data from an Excel file
pd.read_sql(query, connection_object): import data from a SQL table or database
pd.read_json(json_string): import data from a JSON-formatted string (recent pandas versions expect the string to be wrapped in io.StringIO)
pd.read_html(url): parse a URL, string, or HTML file and extract its tables as a list of DataFrames
pd.read_clipboard(): read the clipboard contents and pass them to the text parser (read_table() in older pandas, read_csv() in current versions)
pd.DataFrame(dict): build a DataFrame from a dictionary; the keys become the column names and the values become the column data
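As a rough sketch of the readers listed above, the example below loads the same small table three ways. The sample data and the variable names (csv_text, json_text, df_csv, df_dict, df_json) are invented for illustration; io.StringIO stands in for a real file so the snippet runs without touching disk.

```python
import io
import pandas as pd

# CSV text held in memory; a file path would be passed the same way.
csv_text = "name,score\nann,90\nbob,85\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Dictionary keys become column names, values become column data.
df_dict = pd.DataFrame({'name': ['ann', 'bob'], 'score': [90, 85]})

# JSON records wrapped in StringIO, which works across pandas versions.
json_text = '[{"name": "ann", "score": 90}, {"name": "bob", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text), orient='records')

print(df_csv)
print(df_dict)
print(df_json)
```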

2. Export data

df.to_csv(filename_path): export data to a CSV file
df.to_excel(filename_path): export data to an Excel file
df.to_sql(table_name, connection_object): export data to a SQL table
df.to_json(filename_path): export data to a text file in JSON format
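The export methods can be tried without writing any files: when no path is given, to_csv() and to_json() return the serialized text as a string. The snippet below is a small sketch under that assumption; the sample DataFrame and the names csv_text, json_text, and restored are illustrative only.

```python
import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# With no path argument, the serialized text is returned as a string.
csv_text = df.to_csv(index=False)          # index=False omits the row labels
json_text = df.to_json(orient='records')   # one JSON object per row

# Round trip: read the CSV text back into a DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(json_text)
print(restored)
```

Passing index=False is a common choice when the index is just the default 0..n-1 range, since otherwise it is written out as an extra unnamed column.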

3. Create test data

pd.DataFrame(np.random.rand(20,5)): create a DataFrame of random numbers with 20 rows and 5 columns
pd.Series(my_list): create a Series from the iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]): add a date index
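Put together, the three calls above might look like this. The sketch swaps np.random.rand for a seeded generator (np.random.default_rng) so that the values are reproducible; that substitution, the seed, and the sample list are choices made for this example, not something the original specifies.

```python
import numpy as np
import pandas as pd

# Seeded generator so the random values are the same on every run.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 5)))   # 20 rows, 5 columns in [0, 1)

# One date per row, starting on 1900-01-30 with daily frequency.
df.index = pd.date_range('1900/1/30', periods=df.shape[0])

# A Series built from a plain Python list.
s = pd.Series([10, 20, 30])

print(df.head())
print(s)
```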

4. View and inspect data

df.head(n): view the first n rows of a DataFrame (the first 5 rows when n is omitted)
df.tail(n): view the last n rows of a DataFrame (the last 5 rows when n is omitted)
df.shape: view the number of rows and columns (shape is an attribute, not a method, so it takes no parentheses)
df.info(): view the index, data types, and memory usage
df.describe(): view summary statistics for the numeric columns
s.value_counts(dropna=False): view the unique values of a Series and their counts, including NaN
df.apply(pd.Series.value_counts): view the unique values and counts for each column of a DataFrame
df.dtypes: view the data type of each column (for a single column, df['two'].dtypes shows the type of column "two")
df.isnull(): check for missing values (missing entries are shown as True, all others as False; for a single column, df['two'].isnull() checks column "two")
df.values: view the values in the table as a NumPy array
df.columns: view the column names
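A few of these inspection calls applied to a tiny DataFrame with missing values, as a sketch. The column names 'one' and 'two' echo the examples above; the data and the names null_counts and counts are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'one': [1, 2, 2, np.nan],
    'two': ['x', 'y', None, 'x'],
})

print(df.head(2))   # first two rows
print(df.shape)     # attribute access, no parentheses
print(df.dtypes)

# isnull() gives a Boolean table; summing it counts missing values per column.
null_counts = df.isnull().sum()

# dropna=False keeps NaN as its own entry in the counts.
counts = df['one'].value_counts(dropna=False)

print(null_counts)
print(counts)
```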

5. Data selection

(For detailed usage, see https://www.cnblogs.com/luckyplj/p/13274662.html)

df.isin([5]): check, element by element, whether each value in the DataFrame equals 5
df[col].isin([5]): check whether each value in column col equals 5
df[col]: select a column by name, returned as a Series
df[[col1, col2]]: select multiple columns, returned as a DataFrame
s.iloc[0]: select data by position
s.loc['index_one']: select data by index label
df.loc[:,'reviews']: get the data of a specified column. The first argument, :, selects all rows; the second argument is the column name, here 'reviews'
df.loc[[0,2],['customername','reviews','review_fenci']]: select specific rows and columns. The list [0,2] selects the rows labeled 0 and 2; the list ['customername','reviews','review_fenci'] selects those three columns
df.iloc[0,:]: return the first row
df.iloc[0,0]: return the first element of the first row
df.ix[0] or df.ix[raw]: ix selected rows by either position or label; it was deprecated and then removed in pandas 1.0, so use loc or iloc instead

Note: loc selects by row/column labels (the user-defined row and column names);
iloc selects by row/column position (the integer positions of the rows and columns).
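The label-versus-position distinction is easiest to see when the index is not the default 0..n-1 range. The following sketch assumes a small made-up table with string row labels ('r1', 'r2', 'r3'); the column names are borrowed from the examples above.

```python
import pandas as pd

df = pd.DataFrame(
    {'customername': ['ann', 'bob', 'cy'],
     'reviews': ['good', 'bad', 'ok']},
    index=['r1', 'r2', 'r3'],
)

# loc uses labels: the row labeled 'r2', the column labeled 'reviews'.
by_label = df.loc['r2', 'reviews']

# iloc uses integer positions: second row, second column.
by_pos = df.iloc[1, 1]

# Lists select several rows and columns at once and return a DataFrame.
subset = df.loc[['r1', 'r3'], ['customername']]

# isin() returns a Boolean Series that can also be used as a row filter.
mask = df['reviews'].isin(['good', 'ok'])

print(by_label, by_pos)
print(subset)
print(df[mask])
```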

6. Data cleaning

df.columns = ['a','b','c']: rename the columns
pd.isnull(df): check for null values in a DataFrame and return a Boolean array
pd.notnull(df): check for non-null values in a DataFrame and return a Boolean array
df.dropna(): delete all rows that contain null values
df.dropna(axis=1): delete all columns that contain null values
df.dropna(axis=1,thresh=n): delete all columns with fewer than n non-null values
df.fillna(x): replace all null values in the DataFrame with x (Note: fillna() fills the NaN values and returns the filled result. To modify the original DataFrame in place, set inplace=True, for example df.fillna(0,inplace=True))
mydf['column_name']=mydf['column_name'].fillna(0): fill the null values of one column with zero
s.astype(float): convert the data type of a Series to float
df[col].astype(float): convert the data type of one DataFrame column to float
s.replace(1,'first'): replace all values equal to 1 with 'first' (this replaces values, not column names or index labels)
s.replace([1,3],['one','three']): replace 1 with 'one' and 3 with 'three'
df[col]=df[col].replace(1,1.0): replace the value 1 in column col with 1.0 (assigning the result back is more reliable than calling replace with inplace=True on a selected column)
df.replace([1,3],['one','three']): apply the same replacement across the whole DataFrame
df.rename(columns=lambda x: x + 1): rename columns in bulk (this example assumes numeric column labels)
df.rename(columns={'old_name': 'new_name'}): rename selected columns
df.set_index('column_one'): use the column column_one as the index
df.rename(index=lambda x: x + 1): rename index labels in bulk
df[col]=df[col].str.upper() or df[col].str.lower(): convert a column to upper or lower case
df[col]=df[col].map(str.strip): strip leading and trailing whitespace from a column (df[col].str.strip() does the same and also tolerates missing values)
df.drop_duplicates(subset=col,keep='first',inplace=False): delete duplicate rows

Note: drop_duplicates removes the rows of a DataFrame that are duplicated in the specified columns and returns a DataFrame.

subset: column label or sequence of labels, optional. The columns to consider when identifying duplicates; all columns by default
keep: {'first', 'last', False}, default 'first'. Which duplicate to keep: 'first' keeps the first occurrence, 'last' keeps the last, and False drops all duplicates
inplace: boolean, default False. Whether to modify the original data directly or return a copy
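Several of the cleaning steps above can be chained on one small table. This is only a sketch with invented data; the column names ('name', 'score') and the intermediate names deduped, no_nulls, and as_int are assumptions for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': [' ann ', 'bob', 'bob', None],
    'score': [1.0, np.nan, np.nan, 4.0],
})

# Fill missing scores with 0 and assign the result back to the column.
df['score'] = df['score'].fillna(0)

# The .str accessor skips missing values instead of raising an error.
df['name'] = df['name'].str.strip().str.upper()

# Keep the first row for each distinct name.
deduped = df.drop_duplicates(subset='name', keep='first')

# Drop the remaining row whose name is missing.
no_nulls = deduped.dropna()

# Safe to cast to int now that no NaN is left in the column.
as_int = no_nulls['score'].astype(int)

print(no_nulls)
print(as_int)
```

Each step returns a new object here rather than using inplace=True, which keeps the intermediate results available for inspection.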

7. Data processing

df[df[col] > 0.5]: select the rows where the value of column col is greater than 0.5
df.sort_values(col1): sort the data by column col1, in ascending order by default
df.sort_values(col2, ascending=False): sort the data by column col2 in descending order
df.sort_values([col1,col2], ascending=[True,False]): sort by col1 in ascending order first, then by col2 in descending order
df.groupby(col): return a GroupBy object grouped by column col
df.groupby([col1,col2]): return a GroupBy object grouped by multiple columns
df.groupby(col1)[col2].mean(): return the mean of column col2 after grouping by column col1
df.pivot_table(index=col1, values=[col2,col3], aggfunc='max'): create a pivot table grouped by col1 that holds the maximum of col2 and col3
df.groupby(col1).agg('mean'): return the mean of every numeric column within each group of col1
df.apply(np.mean): apply the function np.mean to each column of the DataFrame
df.apply(np.max,axis=1): apply the function np.max to each row of the DataFrame
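The filtering, sorting, and grouping operations in this section can be sketched on a four-row table. The data and the column names ('team', 'score', 'age') are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['a', 'a', 'b', 'b'],
    'score': [0.2, 0.9, 0.6, 0.7],
    'age': [30, 25, 41, 35],
})

# Boolean mask: keep the rows whose score exceeds 0.5.
high = df[df['score'] > 0.5]

# Sort by team ascending, then by score descending within each team.
ordered = df.sort_values(['team', 'score'], ascending=[True, False])

# groupby alone returns a GroupBy object; an aggregation such as
# mean() is needed to produce values.
team_mean = df.groupby('team')['score'].mean()

# Pivot table: one row per team, the maximum of each value column.
pivot = df.pivot_table(index='team', values=['score', 'age'], aggfunc='max')

print(high)
print(ordered)
print(team_mean)
print(pivot)
```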

8. Data merging

pd.concat([df1, df2]): append the rows of df2 to the end of df1 (df1.append(df2) did this in older versions but was removed in pandas 2.0)
pd.concat([df1, df2],axis=1): append the columns of df2 to the right of df1
df1.join(df2,on=col1,how='inner'): perform a SQL-style join between column col1 of df1 and the index of df2
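A short sketch of the three merging styles, using small invented tables. The names df3, lookup, stacked, side_by_side, and joined are hypothetical. Note that join() with on= matches a column of the left table against the index of the right table.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['k1', 'k2'], 'x': [1, 2]})
df2 = pd.DataFrame({'key': ['k3'], 'x': [3]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side, aligned on the row index.
df3 = pd.DataFrame({'y': [10, 20]})
side_by_side = pd.concat([df1, df3], axis=1)

# SQL-style inner join: df1['key'] is matched against lookup's index,
# so only keys present on both sides survive.
lookup = pd.DataFrame({'y': [100, 300]}, index=['k1', 'k3'])
joined = df1.join(lookup, on='key', how='inner')

print(stacked)
print(side_by_side)
print(joined)
```

For joins on ordinary columns of both tables, pd.merge(df1, df2, on=...) is the more common tool.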

9. Data statistics

df.describe(): view summary statistics for the numeric columns
df.mean(): return the mean of each column
df.corr(): return the correlation coefficients between columns
df.count(): return the number of non-null (non-NaN) values in each column
df.max(): return the maximum value of each column
df.min(): return the minimum value of each column
df.median(): return the median of each column
df.std(): return the standard deviation of each column
df.sum(): return the sum of each column
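To illustrate that these statistics work column by column and skip NaN by default, here is a small sketch with one missing value. The data is invented; axis=1 is shown as the way to aggregate across each row instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan],
    'b': [2.0, 4.0, 6.0, 8.0],
})

col_means = df.mean()       # per column, NaN skipped
non_null = df.count()       # number of non-null values per column
col_sums = df.sum()         # per column
row_sums = df.sum(axis=1)   # per row
corr = df.corr()            # pairwise correlation between columns

print(col_means)
print(non_null)
print(col_sums)
print(row_sums)
print(corr)
```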
