Preface
CSV and text files
1 Argument parsing
1.1 Basics
1.2 Columns, indexes, and names
1.3 General parsing configuration
1.4 NA and missing data handling
1.5 Datetime handling
1.6 Iteration
1.7 Quoting, compression, and file format
1.8 Error handling
2. Specifying data column types
Preface

In the previous article we introduced the basic syntax of pandas; now let's look at pandas data reading and writing.

The pandas IO API is a set of top-level reader functions, such as pandas.read_csv(), that return a pandas object. The corresponding writer functions are object methods, such as DataFrame.to_csv().

The full list of reader and writer functions can be found in the pandas IO documentation.
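As a quick illustration of this reader/writer symmetry (a minimal sketch with made-up data, not from the original article), a DataFrame written out by the to_csv() writer method can be read back with the top-level read_csv() reader:

import pandas as pd
from io import StringIO

# Writer: an object method that produces CSV text (or writes to a path).
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
csv_text = df.to_csv(index=False)

# Reader: a top-level function that returns a new pandas object.
df_roundtrip = pd.read_csv(StringIO(csv_text))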
Note: we will use StringIO in the examples below, so make sure to import it:

# Python 3
from io import StringIO

# Python 2
from StringIO import StringIO
CSV and text files

The main function for reading text files is read_csv().

1 Argument parsing

read_csv() accepts the following common parameters:
1.1 Basics

filepath_or_buffer: various
Either a path to a file, a file URL, or any object with a read() method.

sep: str, defaults to ',' (for read_table it is '\t')
The field delimiter. If set to None, the C engine cannot automatically detect the separator, but the Python engine can, using its built-in sniffer tool.
In addition, if the separator is longer than one character and is not '\s+', it is interpreted as a regular expression and forces the use of the Python parsing engine, e.g. '\\r\\t'. Note that regular-expression separators are prone to ignoring quoted data.

delimiter: str, defaults to None
An alternative name for sep, with identical behavior.
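For example, a pipe-separated file can be read by passing sep explicitly (a minimal sketch; the data string is invented for illustration):

import pandas as pd
from io import StringIO

data = "col1|col2|col3\na|b|1\na|b|2"
df = pd.read_csv(StringIO(data), sep="|")  # explicit single-character separator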
1.2 Columns, indexes, and names

header: int or list, defaults to 'infer'
The row number(s) to use as the column names. The default behavior is to infer the column names:
If names is not specified, the behavior is like header=0, i.e. the column names are inferred from the first line of the file.
If names is set, the behavior is identical to header=None.
header can also be a list of ints for multi-level column names, e.g. [0, 1, 3]; intervening rows that are not specified (here, row 2) are skipped. If skip_blank_lines=True, blank lines and comment lines are skipped as well, so header=0 does not necessarily denote the first physical line of the file.
names: array-like, defaults to None
The list of column names to use (see the sketch below). If the file does not contain a header row, you should explicitly pass header=None. Duplicate values are not allowed in this list.
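A small sketch of how names and header interact, using invented data: for a file with no header row, pass the column names yourself along with header=None.

import pandas as pd
from io import StringIO

data = "a,b,1\nc,d,2"  # no header row in the data itself
df = pd.read_csv(StringIO(data), header=None, names=["col1", "col2", "col3"])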
index_col: int, str, sequence of int/str, or False, defaults to None
The column(s) to use as the index of the DataFrame, given either as string names or as column indices. If a list is given, a MultiIndex is used.
Note: index_col=False can be used to force pandas not to use the first column as the index, e.g. when you have a malformed file with a delimiter at the end of each line.
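A brief sketch of both uses (invented data; the trailing-delimiter case mirrors the situation described in the note above):

import pandas as pd
from io import StringIO

# Use the 'id' column as the index.
data = "id,value\nx,1\ny,2"
df = pd.read_csv(StringIO(data), index_col="id")

# A malformed file with a delimiter at the end of every data line:
bad = "a,b,c\n4,apple,bat,\n8,orange,cow,"
df2 = pd.read_csv(StringIO(bad), index_col=False)  # keep 'a' as a normal column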
usecols: list or callable, defaults to None
Read only the specified columns. If it is a list, all elements must either be positional (i.e. integer indices into the file's columns) or strings that correspond to the column names provided by the names parameter or inferred from the header row.
The order of the list is ignored: usecols=[0, 1] is equivalent to [1, 0].
If it is a callable, it is evaluated against the column names, and the columns for which it returns True are kept:
In [1]: import pandas as pd

In [2]: from io import StringIO

In [3]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [4]: pd.read_csv(StringIO(data))
Out[4]:
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [5]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])
Out[5]:
  col1  col3
0    a     1
1    a     2
2    c     3
Using this parameter can considerably speed up parsing and reduce memory usage.

squeeze: boolean, defaults to False
If the parsed data contains only one column, return a Series instead of a DataFrame.

prefix: str, defaults to None
Prefix to add to the automatically generated column numbers when there is no header, e.g. 'X' yields X0, X1, ...

mangle_dupe_cols: boolean, defaults to True
Duplicate columns are renamed 'X', 'X.1', ..., 'X.N' rather than all being 'X'. Passing False when there are duplicate names in the columns will cause data to be overwritten.
1.3 General parsing configuration

dtype: type name or dict of column -> type, defaults to None
The data type for the data or for individual columns, e.g. {'a': np.float64, 'b': np.int32}.

engine: {'c', 'python'}
The parser engine to use. The C engine is faster, while the Python engine is currently more feature-complete.

converters: dict, defaults to None
A dictionary of functions for converting the values in certain columns. Keys can be integers or column labels.

true_values: list, defaults to None
Values to parse as True.

false_values: list, defaults to None
Values to parse as False.
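For instance (a minimal sketch with invented data; the matching strings are replaced by the booleans True and False):

import pandas as pd
from io import StringIO

data = "a,b\nYes,No\nNo,Yes"
df = pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"])
# Both columns now hold the booleans True and False.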
skipinitialspace: boolean, defaults to False
Skip spaces after the delimiter.

skiprows: integer or list of integers, defaults to None
The line numbers to skip at the start of the file (0-indexed), or the number of lines to skip.
If a callable is given, it is evaluated against the row index; the row is skipped if the function returns True, and kept otherwise:
In [6]: data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

In [7]: pd.read_csv(StringIO(data))
Out[7]:
  col1 col2  col3
0    a    b     1
1    a    b     2
2    c    d     3

In [8]: pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
Out[8]:
  col1 col2  col3
0    a    b     2
skipfooter: int, defaults to 0
Number of lines to skip at the end of the file (not supported by the C engine).

nrows: int, defaults to None
Number of rows of the file to read; useful for reading pieces of large files.

memory_map: boolean, defaults to False
If filepath_or_buffer is a file path, map the file object directly into memory and access the data from there. Using this option can improve performance because there is no longer any I/O overhead.
1.4 NA and missing data handling

na_values: scalar, str, list-like, or dict, defaults to None
Additional strings to be recognized as NA values.

keep_default_na: boolean, defaults to True
Whether to include the default NaN values when parsing the data. Depending on whether na_values is passed, the behavior is as follows (see the sketch after this list):
keep_default_na=True and na_values specified: na_values is parsed in addition to the default NaN values.
keep_default_na=True and na_values not specified: only the default NaN values are used.
keep_default_na=False and na_values specified: only the NaN values specified in na_values are parsed.
keep_default_na=False and na_values not specified: no strings are parsed as NaN.
Note: if na_filter=False, the keep_default_na and na_values parameters are ignored.
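A short sketch of the first and third combinations above, using invented data:

import pandas as pd
from io import StringIO

data = "col1,col2\n1,missing\n2,NA"

# Default NaN markers (such as 'NA') plus the custom marker 'missing':
df = pd.read_csv(StringIO(data), na_values=["missing"])

# Only 'missing' is treated as NaN; the literal string 'NA' is kept:
df2 = pd.read_csv(StringIO(data), na_values=["missing"], keep_default_na=False)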
na_filter: boolean, defaults to True
Detect missing value markers (empty strings and the values in na_values). For data without any NA values, setting na_filter=False can improve the performance of reading large files.

skip_blank_lines: boolean, defaults to True
If True, skip blank lines rather than interpreting them as NaN values.
1.5 Datetime handling

parse_dates: boolean, list, list of lists, or dict, defaults to False (see the sketch after this list)
If True -> try to parse the index.
If [1, 2, 3] -> try to parse columns 1, 2, and 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse the result as a single date column.
If {'foo': [1, 3]} -> parse columns 1 and 3 as a date column named foo.
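A sketch of the list and dict forms with invented data (note that the nested-list and dict forms are legacy behavior and have been deprecated in recent pandas releases):

import pandas as pd
from io import StringIO

data = "date,year,month,day,value\n2020-01-05,2020,1,5,10"

# List form: parse the 'date' column as datetimes.
df = pd.read_csv(StringIO(data), parse_dates=["date"])

# Dict form: combine columns 1, 2, 3 (year, month, day) into one column 'ymd'.
df2 = pd.read_csv(StringIO(data), parse_dates={"ymd": [1, 2, 3]})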
infer_datetime_format: boolean, defaults to False
If set to True and parse_dates is enabled, try to infer the datetime format to speed up processing.

date_parser: function, defaults to None
A function for converting a sequence of strings into an array of datetime instances. By default dateutil.parser.parser is used for the conversion. pandas will try to call date_parser in three different ways:
Pass one or more arrays (the columns defined by parse_dates) as arguments;
Concatenate the string values from the columns defined by parse_dates into a single array, row-wise, and pass that;
Call date_parser once for each row, with one or more strings (corresponding to the columns defined by parse_dates) as arguments.

dayfirst: boolean, defaults to False
Parse dates as DD/MM format.

cache_dates: boolean, defaults to True
If True, use a cache of unique, converted dates when applying the datetime conversion.
This can significantly speed up parsing of duplicate date strings, especially ones with timezone offsets.
1.6 Iteration

iterator: boolean, defaults to False
Return a TextFileReader object, either to iterate over or to retrieve chunks with get_chunk(), as sketched below.
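A minimal sketch with invented data; chunksize (a closely related parameter not listed above) also returns a TextFileReader:

import pandas as pd
from io import StringIO

data = "a,b\n1,2\n3,4\n5,6\n7,8"

# iterator=True returns a TextFileReader; fetch rows with get_chunk().
reader = pd.read_csv(StringIO(data), iterator=True)
first_two = reader.get_chunk(2)  # DataFrame with the first 2 rows

# chunksize yields DataFrames of up to 2 rows each when iterated.
for chunk in pd.read_csv(StringIO(data), chunksize=2):
    print(chunk.shape)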
1.7 Quoting, compression, and file format

compression: {'infer', 'gzip', 'bz2', 'zip', 'xz', None, dict}, defaults to 'infer'
For on-the-fly decompression of on-disk data. If 'infer' and filepath_or_buffer is a file path ending in '.gz', '.bz2', '.zip', or '.xz', then gzip, bz2, zip, or xz decompression is used respectively; otherwise no decompression takes place.
If 'zip' is used, the ZIP file must contain exactly one data file to read. Setting compression to None means no decompression.
A dictionary can also be used, with the key 'method' set to one of {'zip', 'gzip', 'bz2'}; other key-value pairs are forwarded to the compression module. For example:
compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}
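A round-trip sketch of 'infer' (the file name data.csv.gz is made up for illustration):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_csv("data.csv.gz")                       # gzip inferred from the '.gz' suffix
df2 = pd.read_csv("data.csv.gz", index_col=0)  # decompressed on the fly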
thousands: str, defaults to None
The thousands separator for numeric values.

decimal: str, defaults to '.'
The character to recognize as the decimal point.

float_precision: string, defaults to None
Specifies which converter the C engine should use for floating-point values: None for the ordinary converter, 'high' for the high-precision converter, and 'round_trip' for the round-trip converter.

quotechar: str (length 1)
The character used to mark the start and end of a quoted item. Delimiters inside quoted items are ignored.
comment: str, defaults to None
Skip lines beginning with this character; e.g. with comment='#', lines starting with # are skipped.

encoding: str, defaults to None
The encoding to use when reading the file.
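A combined sketch of several of these options, reading invented European-style data with a leading comment line:

import pandas as pd
from io import StringIO

data = "# exported by some tool\nlabel;value\nfoo;1.234,5\nbar;6,78"
df = pd.read_csv(
    StringIO(data),
    sep=";",
    comment="#",    # skip the line starting with '#'
    thousands=".",  # '.' separates thousands
    decimal=",",    # ',' is the decimal point
)
# 'value' is parsed as float: 1234.5 and 6.78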
1.8 Error handling

error_bad_lines: boolean, defaults to True
By default, lines with too many fields (e.g. a CSV line with too many commas) raise an exception, and no DataFrame is returned.
If set to False, these "bad lines" are dropped instead.

warn_bad_lines: boolean, defaults to True
If error_bad_lines=False and warn_bad_lines=True, a warning is printed for each bad line.
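A minimal sketch with invented data, using the legacy error_bad_lines/warn_bad_lines parameters described here (newer pandas versions replace them with on_bad_lines):

import pandas as pd
from io import StringIO

data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10"  # the second data row has an extra field
df = pd.read_csv(StringIO(data), error_bad_lines=False, warn_bad_lines=True)
# The bad line '4,5,6,7' is skipped with a warning; the two good rows remain.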
2. Specifying data column types

You can indicate the data type for the whole DataFrame or for individual columns:
In [9]: import numpy as np

In [10]: data = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11"

In [11]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11

In [12]: df = pd.read_csv(StringIO(data), dtype=object)

In [13]: df
Out[13]:
   a   b   c    d
0  1   2   3    4
1  5   6   7    8
2  9  10  11  NaN

In [14]: df["a"][0]
Out[14]: '1'

In [15]: df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})

In [16]: df.dtypes
Out[16]:
a      int64
b     object
c    float64
d      Int64
dtype: object
You can use the converters parameter of read_csv() to unify the data type of a column:
In [17]: data = "col_1\n1\n2\n'A'\n4.22"

In [18]: df = pd.read_csv(StringIO(data), converters={"col_1": str})

In [19]: df
Out[19]:
  col_1
0     1
1     2
2   'A'
3  4.22

In [20]: df["col_1"].apply(type).value_counts()
Out[20]:
<class 'str'>    4
Name: col_1, dtype: int64
Alternatively, after reading the data you can use the to_numeric() function to coerce the dtype:
In [21]: df2 = pd.read_csv(StringIO(data))

In [22]: df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")

In [23]: df2
Out[23]:
   col_1
0   1.00
1   2.00
2    NaN
3   4.22

In [24]: df2["col_1"].apply(type).value_counts()
Out[24]:
<class 'float'>    4
Name: col_1, dtype: int64
This converts all valid numeric values to floating-point numbers and turns the invalid ones into NaN.
Ultimately, how you handle columns of mixed types depends on your specific needs. In the example above, if you simply want to turn the anomalous values into NaN, to_numeric() may be your best option.
However, if you want all the data coerced, no matter the type, then the converters parameter of read_csv() is the better choice.
Note:
In some cases, reading in abnormal data with columns containing mixed dtypes results in an inconsistent dataset.
If you rely on pandas to infer the dtypes of your columns, the parsing engine infers the dtypes for each chunk of the data separately, rather than for the whole dataset at once.
In [25]: col_1 = list(range(500000)) + ["a", "b"] + list(range(500000))

In [26]: df = pd.DataFrame({"col_1": col_1})

In [27]: df.to_csv("foo.csv")

In [28]: mixed_df = pd.read_csv("foo.csv")

In [29]: mixed_df["col_1"].apply(type).value_counts()
Out[29]:
<class 'int'>    737858
<class 'str'>    262144
Name: col_1, dtype: int64

In [30]: mixed_df["col_1"].dtype
Out[30]: dtype('O')
As a result, some chunks of the mixed_df column contain int values while others contain str, because the data that was read in is of mixed type.
That concludes this walkthrough of pandas CSV reading and writing with the IO tools.