
Do you know these 8 common pandas index settings?


Hello everyone, I'm Chen Chen.

Today I'll share 8 commonly used index settings in pandas.

1. Converting the index from a groupby operation into a column

The groupby method is used very often. For example, add a column named team to df0 (the DataFrame built in section 3) and group by it.

>>> df0["team"] = ["X", "X", "Y", "Y", "Y"]
>>> df0
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.039738  0.008414  0.226510    Y
4  0.581093  0.750331  0.133022    Y
>>> df0.groupby("team").mean()
             A         B         C
team                              
X     0.445453  0.248250  0.864881
Y     0.333208  0.306553  0.443828

By default, grouping turns the grouping column into the index. Often we don't want that, because later calculations or conditions still need it as an ordinary column. So we want the grouping to work while the grouping column stays a column.

There are two ways to do this. The first is to call reset_index on the result, and the second is to pass as_index=False to groupby. Personally, I prefer the second: it takes only two method calls (groupby and the aggregation) and is more concise.

>>> df0.groupby("team").mean().reset_index()
  team         A         B         C
0    X  0.445453  0.248250  0.864881
1    Y  0.333208  0.306553  0.443828
>>> df0.groupby("team", as_index=False).mean()
  team         A         B         C
0    X  0.445453  0.248250  0.864881
1    Y  0.333208  0.306553  0.443828
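
To make the comparison concrete, here is a minimal sketch that runs both approaches on a small made-up DataFrame (the column values are assumptions for illustration, not the random data shown above) and checks that they give the same result.

```python
import pandas as pd

# Hypothetical sample data; the numbers are invented for illustration.
df = pd.DataFrame({
    "A": [1.0, 3.0, 2.0, 4.0, 6.0],
    "B": [10.0, 20.0, 30.0, 40.0, 50.0],
    "team": ["X", "X", "Y", "Y", "Y"],
})

# Approach 1: group first, then move the index back into a column.
via_reset = df.groupby("team").mean().reset_index()

# Approach 2: ask groupby not to build the index in the first place.
via_flag = df.groupby("team", as_index=False).mean()

# In both results "team" is an ordinary column next to the group means.
print(via_flag)
print(via_reset.equals(via_flag))
```

Either form works; as_index=False simply avoids creating an index that is then immediately undone.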

2. Setting the index from an existing DataFrame column

Of course, if the data has already been read, or has gone through some processing steps, we can set the index manually with set_index.

>>> df = pd.read_csv("data.csv", parse_dates=["date"])
>>> df.set_index("date")
            temperature  humidity
date                             
2021-07-01           95        50
2021-07-02           94        55
2021-07-03           94        56

There are two points to note here.

By default, set_index returns a new DataFrame and leaves df unchanged. To change the index of df in place, set inplace=True.

>>> df.set_index("date", inplace=True)

If you want to keep the column that is being used as the index, set drop=False.

>>> df.set_index("date", drop=False)
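
The two notes above can be checked with a short sketch. It builds the DataFrame in memory instead of reading data.csv, so the construction step and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical in-memory stand-in for the data read from data.csv.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-01", "2021-07-02", "2021-07-03"]),
    "temperature": [95, 94, 94],
    "humidity": [50, 55, 56],
})

# Default: a new DataFrame is returned and df itself is unchanged.
indexed = df.set_index("date")

# drop=False keeps "date" as a regular column as well as the index.
kept = df.set_index("date", drop=False)

# inplace=True modifies the object it is called on and returns None.
df_copy = df.copy()
result = df_copy.set_index("date", inplace=True)

print(indexed.columns.tolist())
print(kept.columns.tolist())
print(result)
```

Because inplace=True returns None, writing df = df.set_index("date", inplace=True) would replace df with None, which is a common slip.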

3. Resetting the index after some operations

When processing a DataFrame, some operations (for example, deleting rows or selecting rows by index) leave only a subset of the original index, so the default integer index is no longer contiguous. To rebuild a contiguous index, use the reset_index method.

>>> import numpy as np
>>> import pandas as pd
>>> df0 = pd.DataFrame(np.random.rand(5, 3), columns=list("ABC"))
>>> df0
          A         B         C
0  0.548012  0.288583  0.734276
1  0.342895  0.207917  0.995485
2  0.378794  0.160913  0.971951
3  0.039738  0.008414  0.226510
4  0.581093  0.750331  0.133022
>>> df1 = df0[df0.index % 2 == 0]
>>> df1
          A         B         C
0  0.548012  0.288583  0.734276
2  0.378794  0.160913  0.971951
4  0.581093  0.750331  0.133022
>>> df1.reset_index(drop=True)
          A         B         C
0  0.548012  0.288583  0.734276
1  0.378794  0.160913  0.971951
2  0.581093  0.750331  0.133022

Usually we don't need to keep the old index, so the drop parameter can be set to True. Likewise, to reset the index in place, set the inplace parameter to True; otherwise a new DataFrame is returned.
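
To see what drop actually changes, the following sketch compares reset_index with and without drop=True on a small made-up DataFrame (the data is an assumption for illustration).

```python
import pandas as pd

# Hypothetical data with the default integer index 0..4.
df = pd.DataFrame({"A": [10, 20, 30, 40, 50]})

# Keep every second row; the surviving labels are 0, 2 and 4.
subset = df[df.index % 2 == 0]

# Without drop=True the old labels are preserved in a new column named "index".
with_old = subset.reset_index()

# With drop=True the old labels are discarded.
without_old = subset.reset_index(drop=True)

print(with_old)
print(without_old)
```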

4. Resetting the index after sorting

The same problem appears when sorting with sort_values. By default the index labels move together with their rows, so after sorting the index is out of order. If we want a fresh contiguous index after sorting, we just need to pass ignore_index=True to sort_values.

>>> df0.sort_values("A")
          A         B         C team
3  0.039738  0.008414  0.226510    Y
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
0  0.548012  0.288583  0.734276    X
4  0.581093  0.750331  0.133022    Y
>>> df0.sort_values("A", ignore_index=True)
          A         B         C team
0  0.039738  0.008414  0.226510    Y
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.548012  0.288583  0.734276    X
4  0.581093  0.750331  0.133022    Y
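
As a rough check, the sketch below shows that ignore_index=True gives the same result as sorting and then calling reset_index(drop=True). The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data; the values are chosen so the sort order is obvious.
df = pd.DataFrame({"A": [0.5, 0.3, 0.4, 0.1]})

# Default: the original labels travel with their rows.
kept_labels = df.sort_values("A")

# ignore_index=True relabels the sorted rows 0..n-1.
fresh = df.sort_values("A", ignore_index=True)

# The two-step equivalent from the previous section.
two_step = df.sort_values("A").reset_index(drop=True)

print(kept_labels.index.tolist())
print(fresh.index.tolist())
print(fresh.equals(two_step))
```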

5. Resetting the index after removing duplicates

Removing duplicates behaves like sorting: by default the surviving rows keep their original labels, so the index has gaps. Similarly, we can set ignore_index=True in drop_duplicates.

>>> df0
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.342895  0.207917  0.995485    X
2  0.378794  0.160913  0.971951    Y
3  0.039738  0.008414  0.226510    Y
4  0.581093  0.750331  0.133022    Y
>>> df0.drop_duplicates("team", ignore_index=True)
          A         B         C team
0  0.548012  0.288583  0.734276    X
1  0.378794  0.160913  0.971951    Y
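
The output above only shows the ignore_index=True case. For contrast, here is a small sketch with made-up data (the score column is an assumption for illustration) that shows the default labels next to the reset ones.

```python
import pandas as pd

# Hypothetical data with repeated team values.
df = pd.DataFrame({
    "team": ["X", "X", "Y", "Y", "Y"],
    "score": [1, 2, 3, 4, 5],
})

# Default: the first row of each team survives and keeps its original label.
default = df.drop_duplicates("team")

# ignore_index=True relabels the surviving rows 0..n-1.
fresh = df.drop_duplicates("team", ignore_index=True)

print(default.index.tolist())
print(fresh.index.tolist())
```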

6. Assigning the index directly

Sometimes we already have a DataFrame and want to give it an index that comes from a different data source or from a separate operation. In that case, we can assign the new index directly to the existing df.index.

>>> better_index = ["X1", "X2", "Y1", "Y2", "Y3"]
>>> df0.index = better_index
>>> df0
           A         B         C team
X1  0.548012  0.288583  0.734276    X
X2  0.342895  0.207917  0.995485    X
Y1  0.378794  0.160913  0.971951    Y
Y2  0.039738  0.008414  0.226510    Y
Y3  0.581093  0.750331  0.133022    Y
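
One thing worth keeping in mind with direct assignment is that the new index must have exactly one label per row. A minimal sketch with made-up data (the labels and the mismatch_rejected flag are illustrative names) shows both the normal case and what happens when the lengths differ.

```python
import pandas as pd

# Hypothetical three-row DataFrame.
df = pd.DataFrame({"A": [1, 2, 3]})

# Assign labels coming from somewhere else; the length matches the row count.
df.index = ["X1", "X2", "Y1"]

# Rows can now be selected by the new labels.
value = df.loc["X2", "A"]

# A list of the wrong length is rejected with a ValueError.
try:
    df.index = ["only", "two"]
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(df.index.tolist())
print(value)
print(mismatch_rejected)
```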

7. Ignoring the index when writing a CSV file

By default, a DataFrame has an index starting from 0, and to_csv writes it out as the first column. If we don't want the exported CSV file to contain it, we can set index=False in to_csv.

>>> df0.to_csv("exported_file.csv", index=False)

In the exported CSV file, the index column is not included.
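
To inspect the difference without opening a file, the sketch below relies on the fact that to_csv returns the CSV text as a string when no path is given. The data is made up for illustration.

```python
import pandas as pd

# Hypothetical data with the default 0-based index.
df = pd.DataFrame({"temperature": [95, 94], "humidity": [50, 55]})

# With no path argument, to_csv returns the CSV content as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The default output starts each row with the index label;
# with index=False only the data columns are written.
print(with_index)
print(without_index)
```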

In fact, there are many ways to set the index, but we usually focus on the data and overlook the index, which can lead to errors in later operations. All of these high-frequency operations have an index-related setting, so it is worth getting into the habit of setting the index as you go. This will save a lot of time.

8. Specifying the index column when reading

In many cases our data source is a CSV file. Suppose there is a file named data.csv that contains the following data.

date,temperature,humidity
07/01/21,95,50
07/02/21,94,55
07/03/21,94,56

By default, pandas creates an integer index starting from 0, as follows:

>>> pd.read_csv("data.csv", parse_dates=["date"])
        date  temperature  humidity
0 2021-07-01           95        50
1 2021-07-02           94        55
2 2021-07-03           94        56

However, by setting the index_col parameter to a column when reading the file, we can specify the index column directly.

>>> pd.read_csv("data.csv", parse_dates=["date"], index_col="date")
            temperature  humidity
date                             
2021-07-01           95        50
2021-07-02           94        55
2021-07-03           94        56
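
For a version that runs without a data.csv file on disk, the same call can be tried against an in-memory buffer. This is only a sketch: the use of io.StringIO and the variable names are assumptions for illustration, while the CSV content is the one listed above.

```python
import io

import pandas as pd

# The CSV content from the example, held in memory instead of in data.csv.
csv_text = """date,temperature,humidity
07/01/21,95,50
07/02/21,94,55
07/03/21,94,56
"""

# Parse the date column and use it as the index in a single call.
frame = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# The same result in two steps: read first, then set the index.
two_step = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"]).set_index("date")

print(frame)
print(frame.index.name)
```

Setting index_col at read time saves the separate set_index step from section 2.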