PDF is short for Portable Document Format, a file format developed by Adobe Systems for exchanging documents independently of the application, operating system, and hardware.
Python has several libraries for working with PDF files; one of the most commonly used is PyPDF2.
PyPDF is a module for manipulating PDFs, and PyPDF2 is now its most widely used version.
Note that this library is not well suited to extracting text content from a PDF.
PyPDF2 is a pure-Python PDF library. It can read document information (title, author, and so on), write, split, and merge PDF files, and it can also add watermarks to PDF documents and encrypt or decrypt them.
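Install PyPDF2 with the pip package manager: pip install PyPDF2
Note that the examples in this article use the PyPDF2 1.x names (PdfFileReader, PdfFileWriter, getPage, and so on); later releases renamed them, so if the imports fail you may need to install an older release.
VSCode is the recommended editor: start VSCode and open the "Terminal" menu to install the library and run your programs directly, which is very convenient.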
PyPDF2 provides two main classes:
1. PdfFileReader retrieves the basic information of a PDF document; it can also fetch each page of the PDF, loaded as a PageObject:
from PyPDF2 import PdfFileReader  # import the reader class
pdf = PdfFileReader(input_path)  # create a reader object from the file path
information = pdf.getDocumentInfo()  # get the document information
number_of_pages = pdf.getNumPages()  # get the total number of pages
The complete example code is as follows:
def read():
    '''Read PDF data'''
    from PyPDF2 import PdfFileReader  # import the reader class
    pdf = PdfFileReader(input_path)  # create a reader object from the file path
    # pdf.decrypt('password')  # decrypt an encrypted file first
    information = pdf.getDocumentInfo()  # get the document information
    number_of_pages = pdf.getNumPages()  # get the total number of pages
    txt = f'''{input_path} information:
    Author : {information.author},
    Creator : {information.creator},
    Producer : {information.producer},
    Subject : {information.subject},
    Title : {information.title},
    Number of pages : {number_of_pages}
    '''
    print(txt)
    # Apart from the page count, the fields above may be missing from a given file
    # This library is not well suited to reading document contents
    for i in range(0, number_of_pages):
        pageObject = pdf.getPage(i)
        # print(pageObject.extractText())
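Since the metadata fields in the example above may be missing, a small helper that substitutes a placeholder keeps the printed report readable. This is only a minimal sketch and not part of PyPDF2: the name format_info, the FIELDS tuple, and the "(not set)" placeholder are assumptions made for illustration. It relies only on the info object exposing author, creator, producer, subject, and title attributes, as in the example above.

```python
from types import SimpleNamespace

# Metadata attributes read in the example above.
FIELDS = ("author", "creator", "producer", "subject", "title")

def format_info(info, number_of_pages, missing="(not set)"):
    """Build a one-line-per-field report; `info` may be None or lack fields."""
    lines = []
    for name in FIELDS:
        value = getattr(info, name, None)
        lines.append(f"{name.capitalize()}: {value if value else missing}")
    lines.append(f"Number of pages: {number_of_pages}")
    return "\n".join(lines)

# Stand-in object for demonstration only; in practice you would pass
# the result of pdf.getDocumentInfo() and pdf.getNumPages().
demo = SimpleNamespace(author="Alice", title=None)
print(format_info(demo, 3))
```

Keeping the formatting separate from the file access also makes it easy to try out without a PDF at hand.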
2. PdfFileWriter is used together with PdfFileReader:
from PyPDF2 import PdfFileWriter, PdfFileReader
pdfReader = PdfFileReader(input_path)
pdfWriter = PdfFileWriter()
addPage adds a page to this PDF file; the page is usually obtained from a PdfFileReader instance:
pdfWriter.addPage(pdfReader.getPage(0))
For details, refer to the comments in the following code:
def write():
    '''Write a PDF'''
    from PyPDF2 import PdfFileWriter, PdfFileReader
    pdfReader = PdfFileReader(input_path)
    pdfWriter = PdfFileWriter()
    # addPage appends a page to this PDF file; the page is usually obtained from a PdfFileReader instance
    pdfWriter.addPage(pdfReader.getPage(0))
    # insertBlankPage inserts a blank page into this PDF file and returns its PageObject
    # insertBlankPage(width=None, height=None, index=0) inserts at the beginning by default
    pdfWriter.insertBlankPage(width=100, height=100)
    # addBlankPage(width=None, height=None) appends a blank page; if width/height are omitted, the size of the previous page is used
    # If width/height are omitted and there is no previous page, PageSizeNotDefinedError is raised
    pdfWriter.addBlankPage()
    # insertPage inserts a PageObject into this PDF; the page is usually obtained from a PdfFileReader instance
    # index specifies the insertion position; by default the page is inserted at the beginning
    pdfWriter.insertPage(pdfReader.getPage(2))
    # addAttachment(fname, fdata) embeds a file in the PDF
    # pdfWriter.addAttachment(fname="Annex 1.txt", fdata=b'Hello world!')
    print(pdfWriter.getNumPages())
    # Encryption
    # pdfWriter.encrypt(user_pwd='password', owner_pwd='password')
    with open('H:/test_w.pdf', 'wb') as output_file:
        pdfWriter.write(output_file)
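The same reader/writer pairing can be used to split a document: read the pages from one file and add a selected range to a fresh writer. The following is a minimal sketch, assuming the PyPDF2 1.x names used in this article. The helper names page_chunks and split_pdf and the output file pattern are hypothetical and chosen only for illustration.

```python
def page_chunks(number_of_pages, chunk_size):
    """Return (start, stop) page index pairs covering all pages, stop exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, number_of_pages))
            for start in range(0, number_of_pages, chunk_size)]

def split_pdf(input_path, chunk_size, output_pattern="part_{:02d}.pdf"):
    """Write each chunk of pages to its own file and return the file names."""
    # Imported here so that page_chunks can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    written = []
    # The source file must stay open until every chunk has been written.
    with open(input_path, "rb") as source:
        reader = PdfFileReader(source)
        chunks = page_chunks(reader.getNumPages(), chunk_size)
        for number, (start, stop) in enumerate(chunks, start=1):
            writer = PdfFileWriter()
            for index in range(start, stop):
                writer.addPage(reader.getPage(index))
            name = output_pattern.format(number)
            with open(name, "wb") as target:
                writer.write(target)
            written.append(name)
    return written

# A 7-page document split into chunks of 3 pages.
print(page_chunks(7, 3))  # [(0, 3), (3, 6), (6, 7)]
```

Computing the page ranges in a separate function keeps the index arithmetic easy to check before any file is touched.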
3. An important concept: PageObject
After PdfFileReader loads a PDF document, each page retrieved is converted into a PageObject; operating on a PDF is, in practice, operating on PageObject objects.
The commonly used methods of PageObject are listed below.
PageObject methods:
mergePage(page2): merges the contents of two pages into one; this can be used to achieve a watermark effect
mergeRotatedPage(page2, rotation, expand=False): like mergePage, but page2 is rotated before merging
mergeScaledPage(page2, scale, expand=False): like mergePage, but page2 is scaled before merging
mergeTranslatedPage(page2, tx, ty, expand=False): like mergePage, but page2 is translated (shifted) before merging
mergeRotatedScaledPage(page2, rotation, scale, expand=False): like mergePage, but page2 is rotated and scaled before merging
mergeRotatedScaledTranslatedPage(page2, rotation, scale, tx, ty, expand=False): like mergePage, but page2 is rotated, scaled, and translated before merging
mergeRotatedTranslatedPage(page2, rotation, tx, ty, expand=False): like mergePage, but page2 is rotated and translated before merging
mergeScaledTranslatedPage(page2, scale, tx, ty, expand=False): like mergePage, but page2 is scaled and translated before merging
mergeTransformedPage(page2, ctm, expand=False): like mergePage, but a transformation matrix is applied to page2 before merging
rotateClockwise(angle): rotates the page clockwise; angle must be a multiple of 90 degrees
rotateCounterClockwise(angle): rotates the page counterclockwise; angle must be a multiple of 90 degrees
scale(sx, sy): scales the page by sx along the X axis and sy along the Y axis
scaleBy(factor): scales the page by the same factor along both the X and Y axes
scaleTo(width, height): scales the page to the specified size
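As an illustration of how the methods above combine, here is a minimal sketch of the watermark use case, again assuming the PyPDF2 1.x names used in this article. The helper names quarter_turns and add_watermark and their parameters are hypothetical; the angle check simply mirrors the "multiple of 90 degrees" requirement of rotateClockwise.

```python
def quarter_turns(angle):
    """Validate a rotation angle and return it normalised to 0, 90, 180 or 270."""
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    return angle % 360

def add_watermark(input_path, watermark_path, output_path, angle=0):
    """Merge the first page of watermark_path into every page of input_path."""
    # Imported here so that quarter_turns can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    angle = quarter_turns(angle)
    with open(input_path, "rb") as source, open(watermark_path, "rb") as mark:
        reader = PdfFileReader(source)
        watermark_page = PdfFileReader(mark).getPage(0)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            page = reader.getPage(index)
            page.mergePage(watermark_page)  # merge the watermark's content into this page
            if angle:
                page.rotateClockwise(angle)  # optional rotation in 90-degree steps
            writer.addPage(page)
        with open(output_path, "wb") as target:
            writer.write(target)

# 450 degrees is one full turn plus a quarter turn.
print(quarter_turns(450))  # 90
```

Validating the angle before any file is opened means a bad value fails early instead of halfway through writing the output.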
Refer to the comments in the code for details: