Hello everyone, I'm Chen Chen.
Today I'll share 8 common index settings in pandas.
The groupby method is used often. For example, add a grouping column, team, and group by it.
>>> df0["team"] = ["X", "X", "Y", "Y", "Y"]
>>> df0
A B C team
0 0.548012 0.288583 0.734276 X
1 0.342895 0.207917 0.995485 X
2 0.378794 0.160913 0.971951 Y
3 0.039738 0.008414 0.226510 Y
4 0.581093 0.750331 0.133022 Y
>>> df0.groupby("team").mean()
A B C
team
X 0.445453 0.248250 0.864881
Y 0.333208 0.306553 0.443828
By default, groupby turns the grouping column into the index. Often, though, we don't want that, because later calculations or conditions still need to use the column. So we need the grouping to work without the grouped column becoming the index.
There are two ways to do this. The first is to call reset_index afterwards; the second is to pass as_index=False to groupby. Personally, I prefer the second: it takes only two chained calls and is more concise.
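To illustrate why keeping the grouping column as a regular column can be useful, here is a minimal sketch. The frame and its values are made up for illustration and are not the random data used in this article.

```python
import pandas as pd

# Small hand-made frame (illustrative values, not the article's data)
df = pd.DataFrame({
    "A": [1.0, 3.0, 2.0, 4.0, 6.0],
    "team": ["X", "X", "Y", "Y", "Y"],
})

# as_index=False keeps "team" as an ordinary column in the result
means = df.groupby("team", as_index=False).mean()

# Because "team" is a column, it can be used directly in a boolean filter
x_mean = means.loc[means["team"] == "X", "A"].iloc[0]

print(means)
print(x_mean)
```

With the default as_index=True, the same filter would have to go through the index (for example means.loc["X", "A"]) instead of the column.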
>>> df0.groupby("team").mean().reset_index()
team A B C
0 X 0.445453 0.248250 0.864881
1 Y 0.333208 0.306553 0.443828
>>> df0.groupby("team", as_index=False).mean()
team A B C
0 X 0.445453 0.248250 0.864881
1 Y 0.333208 0.306553 0.443828
Of course, if the data has already been read or has gone through some processing steps, we can set the index manually with set_index.
>>> df = pd.read_csv("data.csv", parse_dates=["date"])
>>> df.set_index("date")
temperature humidity
date
2021-07-01 95 50
2021-07-02 94 55
2021-07-03 94 56
There are two points to note here.
By default, set_index returns a new DataFrame. To change the index of df in place, set inplace=True: df.set_index("date", inplace=True)
By default, set_index also removes the column that becomes the index. To keep it as a regular column as well, set drop=False: df.set_index("date", drop=False)
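The two points can be checked with a short sketch. It builds the frame by hand instead of reading data.csv, so the construction step is an assumption made to keep the example self-contained.

```python
import pandas as pd

# Hand-built stand-in for the data read from data.csv
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-03"]),
    "temperature": [95, 94, 94],
    "humidity": [50, 55, 56],
})

# Default call: returns a new DataFrame and leaves df unchanged
indexed = df.set_index("date")
print(df.index.name)             # df still has its unnamed default index
print(list(indexed.columns))     # "date" is no longer a column here

# drop=False: "date" becomes the index and also stays as a column
kept = df.set_index("date", drop=False)
print(list(kept.columns))
```

Reassigning the result (df = df.set_index("date")) is a common alternative to inplace=True.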
When processing a DataFrame, some operations (for example, deleting rows or selecting by index) produce a subset of the original index, so the default numeric index is no longer consecutive. To rebuild a continuous index, use the reset_index method.
>>> import numpy as np
>>> import pandas as pd
>>> df0 = pd.DataFrame(np.random.rand(5, 3), columns=list("ABC"))
>>> df0
A B C
0 0.548012 0.288583 0.734276
1 0.342895 0.207917 0.995485
2 0.378794 0.160913 0.971951
3 0.039738 0.008414 0.226510
4 0.581093 0.750331 0.133022
>>> df1 = df0[df0.index % 2 == 0]
>>> df1
A B C
0 0.548012 0.288583 0.734276
2 0.378794 0.160913 0.971951
4 0.581093 0.750331 0.133022
>>> df1.reset_index(drop=True)
A B C
0 0.548012 0.288583 0.734276
1 0.378794 0.160913 0.971951
2 0.581093 0.750331 0.133022
Usually we don't need to keep the old index, so we set the drop parameter to True. Likewise, to reset the index in place, set the inplace parameter to True; otherwise a new DataFrame is returned.
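As a small sketch of what drop controls, the following compares reset_index with and without drop=True. The frame uses made-up integer values so the result is easy to check.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30, 40, 50]})

# Keep only the even-numbered rows: the index is now 0, 2, 4
subset = df[df.index % 2 == 0]

# Without drop, the old index is preserved as a new column named "index"
with_old = subset.reset_index()

# With drop=True, the old index is discarded
clean = subset.reset_index(drop=True)

print(with_old)
print(clean)
```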
The same problem comes up when sorting with sort_values: by default, the index labels move with their rows, so the index ends up out of order. If we want a fresh consecutive index after sorting, we only need to set ignore_index=True in sort_values.
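One way to read ignore_index=True is as a shortcut for sorting and then calling reset_index(drop=True). The sketch below checks that reading on a tiny made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.5, 0.3, 0.4, 0.1]})

# Default: each row keeps its original label, so the index is shuffled
sorted_default = df.sort_values("A")

# ignore_index=True: the result gets a fresh 0..n-1 index
sorted_clean = df.sort_values("A", ignore_index=True)

# Compare against the two-step version from the reset_index section
same = sorted_clean.equals(sorted_default.reset_index(drop=True))

print(sorted_default.index.tolist())
print(sorted_clean.index.tolist())
print(same)
```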
>>> df0.sort_values("A")
A B C team
3 0.039738 0.008414 0.226510 Y
1 0.342895 0.207917 0.995485 X
2 0.378794 0.160913 0.971951 Y
0 0.548012 0.288583 0.734276 X
4 0.581093 0.750331 0.133022 Y
>>> df0.sort_values("A", ignore_index=True)
A B C team
0 0.039738 0.008414 0.226510 Y
1 0.342895 0.207917 0.995485 X
2 0.378794 0.160913 0.971951 Y
3 0.548012 0.288583 0.734276 X
4 0.581093 0.750331 0.133022 Y
Dropping duplicates behaves like sorting: by default, the index of the result is no longer consecutive. Likewise, we only need to set ignore_index=True in drop_duplicates.
>>> df0
A B C team
0 0.548012 0.288583 0.734276 X
1 0.342895 0.207917 0.995485 X
2 0.378794 0.160913 0.971951 Y
3 0.039738 0.008414 0.226510 Y
4 0.581093 0.750331 0.133022 Y
>>> df0.drop_duplicates("team", ignore_index=True)
A B C team
0 0.548012 0.288583 0.734276 X
1 0.378794 0.160913 0.971951 Y
Sometimes we have a DataFrame and want to assign an index that comes from a different data source or a separate operation. In that case, we can assign the new index directly to df.index.
>>> better_index = ["X1", "X2", "Y1", "Y2", "Y3"]
>>> df0.index = better_index
>>> df0
A B C team
X1 0.548012 0.288583 0.734276 X
X2 0.342895 0.207917 0.995485 X
Y1 0.378794 0.160913 0.971951 Y
Y2 0.039738 0.008414 0.226510 Y
Y3 0.581093 0.750331 0.133022 Y
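One detail worth keeping in mind with direct assignment is that the new index must have exactly one label per row. The following is a minimal sketch on a made-up three-row frame. The variable mismatch_rejected is introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Length matches the number of rows, so the assignment succeeds
df.index = ["X1", "X2", "Y1"]

# Label-based lookups now use the new labels
row = df.loc["X2", "A"]

# A list of the wrong length is rejected with a ValueError
try:
    df.index = ["only", "two"]
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(row)
print(mismatch_rejected)
```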
When exporting data to a CSV file, the DataFrame's default index, which starts at 0, is written out as well. If we don't want the exported CSV file to contain it, we can set the index parameter to False in to_csv.
>>> df0.to_csv("exported_file.csv", index=False)
The exported CSV file then does not contain the index column.
In fact, there are many ways to set the index. We tend to focus on the data and overlook the index, which can lead to errors in later operations. All of the common operations above have an index setting, so it is worth building the habit of setting the index as you go; it will save a lot of time.
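To see the difference without opening a file, to_csv can be called without a path, in which case it returns the CSV text as a string. This sketch uses a made-up two-row frame.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "team": ["X", "Y"]})

# No path given: to_csv returns the CSV content as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The first version has an extra, unnamed leading column holding the index
print(with_index.splitlines())
print(without_index.splitlines())
```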
In many cases, our data source is a CSV file. Suppose there is a file named data.csv that contains the following data.
date,temperature,humidity
07/01/21,95,50
07/02/21,94,55
07/03/21,94,56
By default, pandas creates an index starting from 0, as follows:
>>> pd.read_csv("data.csv", parse_dates=["date"])
date temperature humidity
0 2021-07-01 95 50
1 2021-07-02 94 55
2 2021-07-03 94 56
However, we can specify the index column directly at import time by setting the index_col parameter to a column.
>>> pd.read_csv("data.csv", parse_dates=["date"], index_col="date")
temperature humidity
date
2021-07-01 95 50
2021-07-02 94 55
2021-07-03 94 56
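For readers who want to try this without creating data.csv, the same call can be sketched against an in-memory buffer. Using io.StringIO in place of the file is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Same content as data.csv, held in memory instead of on disk
csv_text = (
    "date,temperature,humidity\n"
    "07/01/21,95,50\n"
    "07/02/21,94,55\n"
    "07/03/21,94,56\n"
)

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# With "date" as the index, rows can be looked up by timestamp
temp = df.loc[pd.Timestamp("2021-07-02"), "temperature"]

print(df)
print(temp)
```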