Pandas is a data analysis toolkit built on NumPy. It provides efficient data analysis methods and can be used with large data sets.
The main data structures in Pandas are Series (one-dimensional data) and DataFrame (two-dimensional data). These two structures are enough to handle most typical use cases in fields such as finance, statistics, social science, and engineering.
# Import Pandas
import pandas as pd
# Pandas data structure - Series
# A Pandas Series is similar to a single column of a table: a one-dimensional array that can hold any data type.
# A Series consists of an index and a column of values. The constructor is:
# pandas.Series(data, index, dtype, name, copy)
# 1. data: The values (an ndarray, list, dict, or scalar).
# 2. index: The index labels. If not specified, they default to integers starting from 0.
# 3. dtype: The data type. If not specified, it is inferred from the data.
# 4. name: The name of the Series.
# 5. copy: Whether to copy the input data. The default is False.
# Create a Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s)
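# The constructor parameters listed above can be combined in one call. The sketch below
# is only an illustration: the variable name `prices` and its values are made up for
# this example and do not come from the original tutorial.

```python
import pandas as pd

# Hypothetical sample data: three prices keyed by item name.
prices = pd.Series(
    [1.5, 2, 3],
    index=['apple', 'pear', 'plum'],  # custom index labels
    dtype='float64',                  # force floating-point values
    name='price',                     # name of the Series
)

print(prices)
print(prices['pear'])         # label-based access
print(prices.iloc[0])         # position-based access
print(prices.index.tolist())  # the index labels as a plain list
```

# With a custom index, values can be looked up by label as well as by position.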
# Create an empty Series (the dtype is given explicitly, since older pandas versions warn when it is omitted)
s = pd.Series(dtype='float64')
print(s)
# Create a Series from a dictionary (the keys become the index)
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(s)
# Create a Series from a list
s = pd.Series([1, 2, 3])
print(s)
# Create a Series from a string (a scalar gives a one-element Series)
s = pd.Series('hello')
print(s)
# Create a Series from a number
s = pd.Series(1)
print(s)
# Create a Series from a Boolean value
s = pd.Series(True)
print(s)
# Pandas data structure - DataFrame
# A DataFrame is a tabular data structure with an ordered set of columns, and each column can hold a different value type (numbers, strings, Boolean values). A DataFrame has both a row index and a column index, and it can be viewed as a dictionary of Series that share one index.
# In other words, a Pandas DataFrame is a two-dimensional, table-like structure.
# The DataFrame constructor is:
# pandas.DataFrame(data, index, columns, dtype, copy)
# data: The data (an ndarray, Series, map, list, dict, or similar type).
# index: The index values, also called row labels.
# columns: The column labels. The default is a RangeIndex (0, 1, 2, …, n).
# dtype: The data type.
# copy: Whether to copy the input data. In recent pandas versions the default depends on the input type (dict data is copied, DataFrame or 2D ndarray input is not).
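# The view of a DataFrame as a dictionary of Series that share an index can be sketched
# as follows. The names `age`, `likes`, and `df_sites` and their values are assumptions
# made up for this illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels (hypothetical values).
age = pd.Series([10, 12, 13], index=['Google', 'Runoob', 'Wiki'])
likes = pd.Series([124, 61], index=['Google', 'Runoob'])

# A dict of Series becomes a DataFrame: the keys are the column labels,
# and the row index is the union of the Series indexes.
df_sites = pd.DataFrame({'age': age, 'likes': likes})

print(df_sites)
print(df_sites.index.tolist())
print(df_sites.columns.tolist())
```

# 'Wiki' has no entry in `likes`, so that cell is filled with NaN.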
# Create a Series
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s)
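# Create a DataFrame from a list of rows
data = [['Google', 10], ['Runoob', 12], ['Wiki', 13]]
# Passing dtype=float to the constructor would also try to convert the 'Site' strings
# (recent pandas versions raise an error), so only the 'Age' column is converted here.
df = pd.DataFrame(data, columns=['Site', 'Age']).astype({'Age': float})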
print(df)
# Create an empty DataFrame
df = pd.DataFrame()
print(df)
# Create a DataFrame from a dictionary (each key-value pair becomes a row)
mapping = {'a': 1, 'b': 2, 'c': 3}
df = pd.DataFrame(list(mapping.items()))
print(df)
# Create a DataFrame from a list
df = pd.DataFrame([1, 2, 3])
print(df)
# Pandas CSV files
# CSV (Comma-Separated Values, sometimes called character-separated values because the delimiter does not have to be a comma) files store tabular data (numbers and text) as plain text.
# CSV is a general-purpose, relatively simple file format that is widely used in consumer, business, and scientific applications. Pandas makes it easy to work with CSV files.
# Read a CSV file
# df = pd.read_csv('nba.csv')
# For a large DataFrame, print(df) shows the first 5 rows and the last 5 rows and replaces the rest with "..."
print(df)
# to_string() returns the whole DataFrame as a string, so every row is printed
print(df.to_string())
# Three fields: name, site, age
names = ["Google", "Runoob", "Taobao", "Wiki"]
sites = ["www.google.com", "www.runoob.com", "www.taobao.com", "www.wikipedia.org"]
ages = [90, 40, 80, 98]
# Build a dictionary that maps each column name to its values
site_data = {'name': names, 'site': sites, 'age': ages}
df = pd.DataFrame(site_data)
# Save the DataFrame to a CSV file
df.to_csv('site.csv')
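# By default to_csv() also writes the row index as an unnamed first column. A minimal
# sketch of a write-and-read round trip without the index follows. It uses an in-memory
# buffer instead of a file on disk; the buffer and the names `site_table`, `csv_text`,
# and `restored` are assumptions made for this illustration.

```python
import io
import pandas as pd

site_table = pd.DataFrame({
    'name': ['Google', 'Runoob', 'Taobao', 'Wiki'],
    'age': [90, 40, 80, 98],
})

# Write CSV text into an in-memory buffer; index=False leaves out the row index.
buffer = io.StringIO()
site_table.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a DataFrame.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```

# Passing a file path instead of the buffer works the same way.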
# Data processing
# head() returns the first 5 rows
print(df.head())
# tail() returns the last 5 rows
print(df.tail())
# info() prints a summary of the DataFrame (index, columns, non-null counts, dtypes); it returns None, so it is not wrapped in print()
df.info()
# describe() returns descriptive statistics for the numeric columns
print(df.describe())
# shape is an attribute, not a method: the (rows, columns) tuple
print(df.shape)
# index attribute: the row index
print(df.index)
# columns attribute: the column names
print(df.columns)
# values attribute: the data as a NumPy array
print(df.values)
# dtypes attribute: the data type of each column
print(df.dtypes)
# isnull() returns a Boolean DataFrame that is True where a value is null
print(df.isnull())
# notnull() returns a Boolean DataFrame that is True where a value is not null
print(df.notnull())
# dropna() drops every row that contains at least one null value
print(df.dropna())
# dropna(how='all') drops only the rows in which every value is null
print(df.dropna(how='all'))
# dropna(thresh=n) keeps only the rows that have at least n non-null values
print(df.dropna(thresh=2))
# dropna(subset=[...]) checks for null values only in the listed columns
print(df.dropna(subset=['age']))
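# The DataFrame used above has no null values, so the dropna() variants all return it
# unchanged. The sketch below uses a small made-up table with missing values (the name
# `people` and its contents are assumptions) to show how the variants differ.

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 0 is complete, row 3 is entirely null.
people = pd.DataFrame({
    'name': ['Ann', 'Bob', None, None],
    'site': ['a.example', 'b.example', None, None],
    'age': [30, np.nan, 25, np.nan],
})

any_null = people.dropna()                 # keep rows with no nulls at all
all_null = people.dropna(how='all')        # drop only rows that are entirely null
at_least_2 = people.dropna(thresh=2)       # keep rows with at least 2 non-null values
age_known = people.dropna(subset=['age'])  # only the 'age' column is checked

print(any_null.index.tolist())
print(all_null.index.tolist())
print(at_least_2.index.tolist())
print(age_known.index.tolist())
```

# Each call returns a new DataFrame; `people` itself is not modified.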
# fillna(value) fills null values with the given value
print(df.fillna(value=0))
# ffill() fills each null value with the last valid value before it
print(df.ffill())
# bfill() fills each null value with the next valid value after it
print(df.bfill())
# Older code writes these as fillna(method='ffill') or fillna(method='pad'), and
# fillna(method='bfill') or fillna(method='backfill'). 'pad' and 'backfill' are aliases
# for 'ffill' and 'bfill'. The method argument is deprecated in recent pandas releases.
# ffill(limit=n) fills at most n consecutive null values going forward
print(df.ffill(limit=2))
# bfill(limit=n) fills at most n consecutive null values going backward
print(df.bfill(limit=2))
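# The difference between the fill directions and the effect of `limit` is easier to see
# on data that actually has gaps. A minimal sketch, assuming a made-up Series named
# `readings`:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a gap of three missing values.
readings = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

filled_zero = readings.fillna(0)     # replace every gap with a constant
forward = readings.ffill()           # carry the last valid value forward
backward = readings.bfill()          # pull the next valid value backward
forward_1 = readings.ffill(limit=1)  # fill at most 1 consecutive gap

print(filled_zero.tolist())
print(forward.tolist())
print(backward.tolist())
print(forward_1.tolist())
```

# With limit=1 only the first missing value after 1.0 is filled; the other two stay NaN.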
# Pandas JSON
# JSON (JavaScript Object Notation) is a syntax for storing and exchanging text information, similar to XML.
# JSON is smaller and faster than XML and easier to parse. Pandas makes it easy to work with JSON data.
# Read a JSON file
# df = pd.read_json('sites.json')
# print(df.to_string())
data = [
    {
        "id": "A001",
        "name": "Novice tutorial",
        "url": "www.runoob.com",
        "likes": 61
    },
    {
        "id": "A002",
        "name": "Google",
        "url": "www.google.com",
        "likes": 124
    },
    {
        "id": "A003",
        "name": "Taobao",
        "url": "www.taobao.com",
        "likes": 45
    }
]
df = pd.DataFrame(data)
print(df)
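# read_json() can also parse JSON text directly, which avoids the need for a file on disk.
# A minimal sketch follows; the JSON string and the names `json_text`, `df_json`,
# `records`, and `parsed` are assumptions made for this illustration.

```python
import io
import json
import pandas as pd

# Hypothetical JSON text: an array of objects, one object per row.
json_text = (
    '[{"id": "A001", "name": "Runoob", "likes": 61},'
    ' {"id": "A002", "name": "Google", "likes": 124}]'
)

# read_json accepts a path or a file-like object; StringIO wraps the text.
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)

# to_json(orient='records') writes the rows back as an array of objects.
records = df_json.to_json(orient='records')
parsed = json.loads(records)
print(parsed)
```

# orient='records' gives the same array-of-objects layout as the input text.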
# Pandas data cleaning
# Data cleaning is the process of dealing with unusable data. Many data sets contain missing data, wrongly formatted data, wrong data, or duplicate data. To make data analysis more accurate, such data has to be handled first.
# Cleaning null values: to remove the rows that contain empty fields, use the dropna() method. Its syntax is:
# DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
# Cleaning wrongly formatted data
# Cells in the wrong format make data analysis difficult or even impossible. We can remove the rows that contain such cells, or convert all cells in the column to the same format.
# df['Date'] = pd.to_datetime(df['Date'])
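# A column may contain entries that cannot be converted at all. One way to handle
# that, sketched here on a made-up `log` table (an assumption for this example), is to
# turn unparseable entries into nulls and then drop them.

```python
import pandas as pd

# Hypothetical column: two valid dates and one unparseable entry.
log = pd.DataFrame({'Date': ['2020-12-01', '2020-12-02', 'not a date']})

# errors='coerce' turns values that cannot be parsed into NaT instead of raising.
log['Date'] = pd.to_datetime(log['Date'], errors='coerce')
print(log)

# The unparseable rows can then be dropped like any other null value.
clean = log.dropna(subset=['Date'])
print(clean)
```

# Without errors='coerce', to_datetime() raises an error on the bad entry.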
# Cleaning wrong data
# Wrong values are also common. We can replace them or remove the rows that contain them.
# df.loc[2, 'age'] = 30
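# Fixing one cell by position works for a single known error. For a whole column, a
# Boolean mask can select every wrong value at once. The sketch below assumes a made-up
# table `ages_df` and a made-up rule (an age above 120 counts as an error); neither
# comes from the original tutorial.

```python
import pandas as pd

ages_df = pd.DataFrame({
    'name': ['Google', 'Runoob', 'Taobao'],
    'age': [50, 40, 12345],
})

# Assumed rule for this sketch: an age above 120 is a data-entry error.
too_old = ages_df['age'] > 120

# Option 1: replace the wrong values (here they are capped at 120).
capped = ages_df.copy()
capped.loc[too_old, 'age'] = 120

# Option 2: remove the rows that hold wrong values.
removed = ages_df[~too_old]

print(capped)
print(removed)
```

# Working on a copy keeps the original table available for comparison.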
# Cleaning duplicate data
# To clean up duplicate data, use the duplicated() and drop_duplicates() methods. duplicated() returns True for each row that repeats an earlier row, and False otherwise.
# print(df.duplicated())
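# Both methods can be tried on a small table that contains one repeated row. The table
# `visits` and its values are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical table in which row 2 repeats row 1.
visits = pd.DataFrame({
    'name': ['Google', 'Runoob', 'Runoob', 'Taobao'],
    'age': [50, 40, 40, 23],
})

# True marks a row that repeats an earlier row (the first occurrence stays False).
is_dup = visits.duplicated()
print(is_dup)

# drop_duplicates() returns a new DataFrame; the original is left unchanged.
unique_visits = visits.drop_duplicates()
print(unique_visits)
```

# The surviving rows keep their original index labels (0, 1, 3).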
https://www.runoob.com/pandas/pandas-tutorial.html