
Python Programming -- parsing metadata in PDF files using pypdf



Metadata

Metadata is a part of a file that is not readily visible. It can exist in documents, spreadsheets, pictures, and audio and video files. The application that creates these files may store details such as the document's author, its creation and modification times, possible revision history, and comments. For example, a mobile phone camera may save the GPS location where a photo was taken, and Microsoft Word may save the author information of a document.

Case analysis

A press release from a member of the hacker group Anonymous can still be downloaded from the Internet: ANONOPS_The_Press_Release.pdf. In it, Anonymous called for distributed denial-of-service (DDoS) attacks against some of the institutions involved, in retaliation. The press release carried no signature and no indication of its source; it was simply released as a PDF (Portable Document Format) file. However, the program used to create the document recorded the author's name in the PDF metadata.

pyPdf is a very good third-party library for working with PDF documents; you can download it from http://pybrary.net/pyPdf/. It lets you extract content from a document, or split, merge, copy, encrypt, and decrypt documents. To extract metadata, we can use the .getDocumentInfo() method, which returns a dictionary-like object that maps the name of each metadata field to its value. By iterating over it, you can print all of the metadata in the PDF document.

The sample code is as follows:

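# Requires the pyPdf package (written for Python 2)
from pyPdf import PdfFileReader

def printMeta(fileName):
    # Open the PDF in binary mode and read its document information
    pdfFile = PdfFileReader(open(fileName, 'rb'))
    docInfo = pdfFile.getDocumentInfo()
    print('[*] PDF MetaData For: ' + str(fileName))
    # Print each metadata field together with its value
    for metaItem in docInfo:
        print(metaItem + ':' + docInfo[metaItem])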
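
pyPdf targets Python 2 and is no longer maintained. As a rough sketch only, the same idea might look like this on Python 3, assuming the newer pypdf package is installed. The helper names format_meta and print_meta are made up for this illustration and are not part of either library.

```python
def format_meta(doc_info):
    """Turn a metadata mapping into printable 'key: value' lines."""
    # A PDF without an information dictionary yields None or an empty mapping
    if not doc_info:
        return []
    return ['{}: {}'.format(key, value) for key, value in doc_info.items()]


def print_meta(file_name):
    # Imported here so format_meta stays usable without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(file_name)
    print('[*] PDF MetaData For: ' + str(file_name))
    # reader.metadata is a dictionary-like object, or None if absent
    for line in format_meta(reader.metadata):
        print(line)
```

Calling print_meta('ANONOPS_The_Press_Release.pdf') would list whatever fields the file carries, typically keys such as /Author, /Creator, and /Producer.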

Adding an OptionParser to the printMeta script lets it parse the metadata of only the file we specify, which gives us a tool that can identify the metadata embedded in a PDF document. In the same way, we could modify the script to check one particular metadata field, such as a designated user. For example, this could help Greek law enforcement officials find every document whose "author" metadata field is set to Alex Tapanaries. The source code is as follows:

# Import the required modules
import optparse
from pyPdf import PdfFileReader

def printMeta(fileName):
    # Read the document information and print each field with its value
    pdfFile = PdfFileReader(open(fileName, 'rb'))
    docInfo = pdfFile.getDocumentInfo()
    print('[*] PDF MetaData For: ' + str(fileName))
    for metaItem in docInfo:
        print(metaItem + ':' + docInfo[metaItem])

def main():
    # -F names the PDF file whose metadata should be printed
    parser = optparse.OptionParser('usage %prog -F <PDF file name>')
    parser.add_option('-F', dest='filename', type='string', help='specify PDF file name')
    (options, args) = parser.parse_args()
    fileName = options.filename
    if fileName is None:
        print(parser.usage)
        exit(0)
    else:
        printMeta(fileName)

if __name__ == '__main__':
    # Call main() so that the command-line option is actually parsed
    main()

Running this pdfReader script on the file published on the Anonymous website, we can see the metadata that led the Greek authorities to arrest Mr. Tapanaries.
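
The author check mentioned earlier is not part of the script above. One possible sketch is shown here, assuming the metadata has already been read into dictionary-like objects keyed by the standard PDF field name /Author. The function names author_matches and files_by_author are hypothetical.

```python
def author_matches(doc_info, wanted):
    """Return True if the '/Author' field equals `wanted`.

    The comparison ignores case and surrounding whitespace.
    """
    author = doc_info.get('/Author') if doc_info else None
    if author is None:
        return False
    return str(author).strip().lower() == wanted.strip().lower()


def files_by_author(infos, wanted):
    """Return the sorted file names whose author matches `wanted`.

    `infos` maps a file name to that file's metadata mapping.
    """
    return sorted(name for name, info in infos.items()
                  if author_matches(info, wanted))
```

In practice the infos mapping could be built by calling getDocumentInfo() on each file in a directory; files with no author field are simply skipped.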