Hello everyone, I'm Chen Chen.
Today I'll share 8 common ways to set the index in pandas.
groupby
The groupby grouping method is used very often. For example, add a grouping column named team and group by it.
>>> df0["team"] = ["X", "X", "Y", "Y", "Y"]
>>> df0
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.039738  0.008414  0.226510    Y
4  0.581093  0.750331  0.133022    Y
>>> df0.groupby("team").mean()
             A         B         C
team
X     0.445453  0.248250  0.864881
Y     0.333208  0.306553  0.443828
By default, grouping turns the grouping column into the index. Often, though, we don't want the grouped column to become the index, because later calculations or conditions still need it as a regular column. So we need a way to group without the grouped column becoming the index.
There are two ways to do this. The first is to call reset_index on the result; the second is to pass as_index=False to groupby. Personally, I prefer the second: it takes two method calls instead of three, so it is more concise.
>>> df0.groupby("team").mean().reset_index()
  team         A         B         C
0    X  0.445453  0.248250  0.864881
1    Y  0.333208  0.306553  0.443828
>>> df0.groupby("team", as_index=False).mean()
  team         A         B         C
0    X  0.445453  0.248250  0.864881
1    Y  0.333208  0.306553  0.443828
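To make the difference between the default and the two workarounds easy to check, here is a minimal sketch. It uses a small hypothetical frame with fixed values (not the random data from the article), so the names df, indexed, via_reset and via_flag are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, chosen so the group means are easy to verify by hand.
df = pd.DataFrame({"team": ["X", "X", "Y"], "A": [1.0, 3.0, 5.0]})

# Default behaviour: the grouping column becomes the index of the result.
indexed = df.groupby("team").mean()
print(indexed.index.name)

# Option 1: group first, then move the index back into a column.
via_reset = df.groupby("team").mean().reset_index()

# Option 2: tell groupby not to use the grouping column as the index.
via_flag = df.groupby("team", as_index=False).mean()

# In both results "team" is an ordinary column again.
print(via_reset)
print(via_flag)
```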
Of course, if the data has already been read, or has gone through some processing steps, we can set the index manually with set_index.
>>> df = pd.read_csv("data.csv", parse_dates=["date"])
>>> df.set_index("date")
            temperature  humidity
date
2021-07-01           95        50
2021-07-02           94        55
2021-07-03           94        56
Here are two points to note.
First, the set_index method creates a new DataFrame by default. To change the index of df in place, set inplace=True.
df.set_index("date", inplace=True)
Second, set_index removes the column from the DataFrame once it becomes the index. To keep it as a regular column as well, set drop=False.
df.set_index("date", drop=False)
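The two behaviours of set_index can be checked with a short sketch. The frame below is hypothetical and stands in for the data read from data.csv; the variable names indexed and kept are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the CSV data used in the article.
df = pd.DataFrame(
    {"date": ["2021-07-01", "2021-07-02"], "temperature": [95, 94]}
)

# By default set_index returns a new DataFrame and leaves df untouched.
indexed = df.set_index("date")
print("date" in df.columns)       # the original frame still has the column
print("date" in indexed.columns)  # in the new frame it moved into the index

# drop=False keeps "date" both as the index and as a regular column.
kept = df.set_index("date", drop=False)
print(kept.index.name, list(kept.columns))
```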
When processing a DataFrame, some operations (for example, deleting rows or selecting by index) leave only a subset of the original index, so the default numeric index is no longer contiguous. To rebuild a contiguous index, use the reset_index method.
>>> import numpy as np
>>> import pandas as pd
>>> df0 = pd.DataFrame(np.random.rand(5, 3), columns=list("ABC"))
>>> df0
          A         B         C
0  0.548012  0.288583  0.734276
1  0.342895  0.207917  0.995485
2  0.378794  0.160913  0.971951
3  0.039738  0.008414  0.226510
4  0.581093  0.750331  0.133022
>>> df1 = df0[df0.index % 2 == 0]
>>> df1
          A         B         C
0  0.548012  0.288583  0.734276
2  0.378794  0.160913  0.971951
4  0.581093  0.750331  0.133022
>>> df1.reset_index(drop=True)
          A         B         C
0  0.548012  0.288583  0.734276
1  0.378794  0.160913  0.971951
2  0.581093  0.750331  0.133022
Usually we don't need to keep the old index, so we set the drop parameter to True. Likewise, to reset the index in place, set the inplace parameter to True; otherwise a new DataFrame is returned.
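To see what drop=True actually discards, the sketch below compares reset_index with and without it on a small hypothetical frame (the names subset, kept and dropped are illustrative).

```python
import pandas as pd

# Hypothetical toy frame; filtering leaves gaps in the index.
df = pd.DataFrame({"A": [10, 20, 30, 40]})
subset = df[df.index % 2 == 0]
print(list(subset.index))

# Without drop=True the old labels are preserved in a new "index" column.
kept = subset.reset_index()
print(list(kept.columns))

# With drop=True the old labels are thrown away.
dropped = subset.reset_index(drop=True)
print(list(dropped.columns), list(dropped.index))
```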
The same problem comes up when sorting with the sort_values method: by default, the index labels move with their rows, so the index ends up out of order. If we want a fresh sequential index after sorting, we can set ignore_index=True in sort_values.
>>> df0.sort_values("A")
          A         B         C team
3  0.039738  0.008414  0.226510    Y
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
0  0.548012  0.288583  0.734276    X
4  0.581093  0.750331  0.133022    Y
>>> df0.sort_values("A", ignore_index=True)
          A         B         C team
0  0.039738  0.008414  0.226510    Y
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.548012  0.288583  0.734276    X
4  0.581093  0.750331  0.133022    Y
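A compact way to confirm the effect of ignore_index when sorting is to inspect the index labels directly. This is a sketch on a hypothetical three-row frame, not the article's data.

```python
import pandas as pd

# Hypothetical toy frame with values deliberately out of order.
df = pd.DataFrame({"A": [3, 1, 2]})

# Default: each row keeps its original label, so the labels look shuffled.
default_sorted = df.sort_values("A")
print(list(default_sorted.index))

# ignore_index=True relabels the sorted rows 0, 1, 2, ...
renumbered = df.sort_values("A", ignore_index=True)
print(list(renumbered.index))
```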
Dropping duplicates behaves like sorting: by default, the surviving rows keep their original labels, so the index is no longer contiguous. Likewise, we can set the ignore_index parameter to True in the drop_duplicates method.
>>> df0
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.039738  0.008414  0.226510    Y
4  0.581093  0.750331  0.133022    Y
>>> df0.drop_duplicates("team", ignore_index=True)
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.378794  0.160913  0.971951    Y
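The sketch below shows which labels survive deduplication with and without ignore_index. It uses a hypothetical frame, and relies on drop_duplicates keeping the first occurrence of each value by default.

```python
import pandas as pd

# Hypothetical toy frame with repeated team values.
df = pd.DataFrame({"team": ["X", "X", "Y", "Y"], "A": [1, 2, 3, 4]})

# Default: the first row of each team is kept, with its original label.
deduped = df.drop_duplicates("team")
print(list(deduped.index))

# ignore_index=True relabels the surviving rows from 0.
renumbered = df.drop_duplicates("team", ignore_index=True)
print(list(renumbered.index))
```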
Sometimes we already have a DataFrame and want to give it an index that comes from a different data source or a separate operation. In that case, we can assign the new index directly to df.index.
>>> better_index = ["X1", "X2", "Y1", "Y2", "Y3"]
>>> df0.index = better_index
>>> df0
           A         B         C team
X1  0.548012  0.288583  0.734276    X
X2  0.342895  0.207917  0.995485    X
Y1  0.378794  0.160913  0.971951    Y
Y2  0.039738  0.008414  0.226510    Y
Y3  0.581093  0.750331  0.133022    Y
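One thing worth knowing about direct assignment is that the new index must have exactly as many entries as the DataFrame has rows. The following sketch, on a hypothetical frame, shows both a valid assignment and a rejected one; the flag name rejected is illustrative.

```python
import pandas as pd

# Hypothetical toy frame with three rows.
df = pd.DataFrame({"A": [1, 2, 3]})

# A list of the same length as the frame is accepted as the new index.
df.index = ["X1", "X2", "Y1"]
print(df.loc["X2", "A"])

# A list of the wrong length is rejected with a ValueError.
rejected = False
try:
    df.index = ["only", "two"]
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```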
When exporting data to a CSV file, the DataFrame's default 0-based index is written to the file as well. If we don't want the exported CSV file to contain it, we can set the index parameter to False in the to_csv method.
>>> df0.to_csv("exported_file.csv", index=False)
The exported CSV file then does not contain the index column.
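The effect can be inspected without writing a file: when to_csv is called without a path, it returns the CSV text as a string. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical toy frame with the default 0-based index.
df = pd.DataFrame({"A": [1, 2]})

# With no path argument, to_csv returns the CSV content as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index)     # first column holds the index labels
print(without_index)  # only the data columns are written
```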
In fact, there are many more ways to set the index, but we usually focus on the data and overlook the index, which can lead to errors in later operations. All of the common operations above have index-related settings, so it is worth getting into the habit of setting the index as you go; it will save a lot of time.
In many cases, our data source is a CSV file. Suppose there is a file named data.csv that contains the following data.
date,temperature,humidity
07/01/21,95,50
07/02/21,94,55
07/03/21,94,56
By default, pandas creates an index starting from 0, as follows:
>>> pd.read_csv("data.csv", parse_dates=["date"])
        date  temperature  humidity
0 2021-07-01           95        50
1 2021-07-02           94        55
2 2021-07-03           94        56
However, we can specify the index column directly at import time by setting the index_col parameter to a column.
>>> pd.read_csv("data.csv", parse_dates=["date"], index_col="date")
            temperature  humidity
date
2021-07-01           95        50
2021-07-02           94        55
2021-07-03           94        56
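To try index_col without creating data.csv on disk, the same text can be fed to read_csv through an in-memory buffer. This is a sketch that assumes the file content shown earlier; csv_text is a hypothetical stand-in for the file.

```python
import io

import pandas as pd

# In-memory stand-in for data.csv (same columns as the article's file).
csv_text = (
    "date,temperature,humidity\n"
    "07/01/21,95,50\n"
    "07/02/21,94,55\n"
    "07/03/21,94,56\n"
)

# index_col makes "date" the index right away; parse_dates converts it
# to datetime values instead of leaving it as plain strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

print(df.index.name)
print(list(df.columns))
print(df)
```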