Introduction
Create files and datasets
Write datasets
Read datasets
Introduction
How to work with HDF5 in Matlab has already been described in detail in the article "Matlab operation HDF5 File". This article summarizes how to use HDF5 files in Python. We follow the same order as the Matlab article: creating an HDF5 file, writing data, and reading data.
Working with HDF5 files in Python relies on the h5py toolkit.
Create files and datasets
Use the `h5py.File()` method to create an HDF5 file:
h5file = h5py.File(filename, 'w')
Then create a dataset in this file:
X = h5file.create_dataset(shape=(0, args.patch_size, args.patch_size),        # dimensions of the dataset
                          maxshape=(None, args.patch_size, args.patch_size),  # maximum allowed dimensions
                          dtype=float, compression='gzip', name='train',      # data type, compression, and dataset name
                          chunks=(args.chunk_size, args.patch_size, args.patch_size))  # chunked storage, size of each chunk
The two most important parameters here are shape and maxshape. Obviously, we want one dimension of the dataset to be extensible, so in maxshape that dimension is marked as None, while the other dimensions are the same as in shape. Another thing worth noting is compression='gzip': with it, the entire dataset can be compressed significantly, which is very useful for large datasets, and when reading or writing the data no explicit decoding by the user is needed.
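The steps above can be sketched end to end. In this sketch, patch_size and chunk_size are hypothetical stand-ins for args.patch_size and args.chunk_size, and the file path is a throwaway temp file:

```python
import os
import tempfile

import h5py

# Hypothetical sizes standing in for args.patch_size and args.chunk_size
patch_size = 4
chunk_size = 8

path = os.path.join(tempfile.mkdtemp(), "train.h5")
h5file = h5py.File(path, "w")

# An extensible, gzip-compressed dataset: axis 0 starts empty and is unlimited
X = h5file.create_dataset(
    name="train",
    shape=(0, patch_size, patch_size),            # current size: no samples yet
    maxshape=(None, patch_size, patch_size),      # axis 0 may grow without bound
    dtype=float,
    compression="gzip",                           # transparent (de)compression
    chunks=(chunk_size, patch_size, patch_size),  # on-disk block size
)
```

Because axis 0 starts at size 0, the dataset takes almost no space until data is actually appended.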
Write datasets
Once a dataset has been created with create_dataset as above, reading and writing it is as convenient as reading and writing a numpy array. For example, the call above defined the dataset 'train', which is the variable X, and it can be read and written as follows:
data = np.zeros((100, args.patch_size, args.patch_size))
X[0:100, :, :] = data
When the dataset was created earlier, the size of its first dimension was fixed by shape. What should we do when there is more data than that?
We can use the resize method to extend the dimension whose maxshape entry was defined as None:
X.resize(X.shape[0] + args.chunk_size, axis=0)
Because maxshape=(None, args.patch_size, args.patch_size) defines the zeroth dimension as extensible, we first use X.shape[0] to get the current length of that dimension and then extend it. After the dimension has been extended, we can continue writing data into the dataset.
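A minimal sketch of this write-then-grow pattern, again using hypothetical patch_size and chunk_size values in place of the args fields:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sizes standing in for args.patch_size and args.chunk_size
patch_size = 4
chunk_size = 8

path = os.path.join(tempfile.mkdtemp(), "grow.h5")
f = h5py.File(path, "w")
X = f.create_dataset(
    "train",
    shape=(chunk_size, patch_size, patch_size),
    maxshape=(None, patch_size, patch_size),
    dtype=float,
    compression="gzip",
    chunks=(chunk_size, patch_size, patch_size),
)

# Fill the initial chunk, numpy-style
X[0:chunk_size, :, :] = np.ones((chunk_size, patch_size, patch_size))

# More data arrived: grow axis 0 by one chunk, then write into the new region
X.resize(X.shape[0] + chunk_size, axis=0)
X[chunk_size:2 * chunk_size, :, :] = 2.0 * np.ones((chunk_size, patch_size, patch_size))
```

Note that resize only changes the logical shape; the data already written stays in place, and the newly added region is writable immediately.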
Read datasets
Reading an h5 file is also very simple: first open the file with the h5py.File method, then assign one of its datasets to a variable; reading that variable is just like reading a numpy array.
h = h5py.File(hd5file, 'r')
train = h['train']
train[1]
train[2]
...
However, this reading pattern has a problem: every access (train[1], train[2]) reads data from the hard disk, which makes reads slow. A better approach is to read chunk_size elements from the hard disk at a time, store them in memory, and then read from memory when needed, for example:
h = h5py.File(hd5file, 'r')
train = h['train']
X = train[0:100]  # read a larger slice from the hard disk at once; X is held in memory
X[1]  # read from memory
X[2]  # read from memory
This method will be much faster .
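The difference between the two access patterns can be sketched as follows. The file, dataset name, and sizes here are made up for the demonstration; the key point is that slicing an h5py dataset returns a plain numpy array held in memory:

```python
import os
import tempfile

import h5py
import numpy as np

# Build a small file to read back (sizes are illustrative)
path = os.path.join(tempfile.mkdtemp(), "read.h5")
with h5py.File(path, "w") as f:
    f.create_dataset(
        "train",
        data=np.arange(100 * 4 * 4, dtype=float).reshape(100, 4, 4),
        compression="gzip",
        chunks=(10, 4, 4),
    )

h = h5py.File(path, "r")
train = h["train"]   # a lazy handle: nothing is read yet

slow = train[1]      # each such access goes to the file (and decompresses a chunk)

X = train[0:100]     # one bulk read; X is a plain in-memory numpy array
fast = X[1]          # indexing X is pure numpy, no disk I/O
```

Both approaches return identical values; they differ only in how often the hard disk is touched.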
That concludes these examples of working with HDF5 files in Python.