程序師世界是廣大編程愛好者互助、分享、學習的平台,程序師世界有你更精彩!
首頁
編程語言
C語言|JAVA編程
Python編程
網頁編程
ASP編程|PHP編程
JSP編程
數據庫知識
MYSQL數據庫|SqlServer數據庫
Oracle數據庫|DB2數據庫
您现在的位置: 程式師世界 >> 編程語言 >  >> 更多編程語言 >> Python

Python batch export of pictures and embedded files in word documents

編輯:Python

There are screenshots to be submitted for the questions in the student's test paper , There are also documents to be submitted , In order to facilitate the students' examination , Allow separate intersection or embedding Word Submitted in , Then how to sort out the students' answers afterwards ? It is more convenient to submit separately , Directly scan the file name to match the name and put it into the specified folder . But embedded in Word How to extract the pictures and files in ?

There are the following needs : Extract one Word All the pictures in the document (png、jpg) And embedded files ( Any format ) Put it in the specified folder .

solve
docx It's a compressed package , The decompressed image is usually placed in the document name .docx\word\media\ Under the table of contents :

Embedded files are usually placed in the document name .docx\word\embeddings\ Under the table of contents :

After asking Du Niang , I found it easy to extract pictures , Use it directly docx In the library Document.part.rels{k:v.target_ref} Find the relative path of the file , use Document.part.rels{k:v.target_part.blob} Read file contents . Simply judge whether the path and file suffix are what we need media Under the png Document and embeddings Under the bin file , If yes, write to the new file :

To extract the image

install python-docx library

pip install python-docx

extract

import os
from docx import Document # pip install python-docx
is_debug = True
if __name__ == '__main__':
# Need to export Word Document path 
# Python Learning and communication base 279199867
target_file = r'paper\HBase test questions .docx'
# The directory where the exported file is located 
output_dir = r'paper\output'
# load Word file 
doc = Document(target_file)
# Traverse Word All the files in the package 
dict_rel = doc.part.rels
# r_id: Document id ,rel: File object 
for r_id, rel in dict_rel.items():
if not ( # If the file is not in media perhaps embeddings Medium , Just skip 
str(rel.target_ref).startswith('media')
or str(rel.target_ref).startswith('embeddings')
):
continue
# If the file is not the suffix we want , Also skip directly 
file_suffix = str(rel.target_ref).split('.')[-1:][0]
if file_suffix.lower() not in ['png', 'jpg', 'bin']:
continue
# If the output directory does not exist , establish 
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# Build the name and path of the export file 
file_name = r_id + '_' + str(rel.target_ref).replace('/', '_')
file_path = os.path.join(output_dir,file_name)
# Write binary data to a file in a new location 
with open(file_path, "wb") as f:
f.write(rel.target_part.blob)
# Print the results 
if is_debug:
print(' Export file succeeded :', file_name)

Running results :

You can see , Pictures can be exported normally , But the students embed JAVA The file was not exported , In other words, the export is bin file , Not completely exported .

Extract embedded files

Ask Du Niang again and find , In fact, this is also zip Compressed package , But it can't be extracted directly , It has a more professional name , It's called ole file , Our previous doc、xls、ppt Wait till you don't bring it x All the ancient documents are in this format . So how to extract the file ? Du Niang told me that there was a man named oletools Projects can , So I downloaded it and analyzed it , It turns out that it can !

oletools Project address :https://github.com/decalage2/oletools

perhaps gitee To the address of someone else :https://gitee.com/yunqimg/oletools

I am using gitee Previous editions , because github Cannot be opened QwQ

Introduced by relevant documents , Under the project of oletools-master\oletools\oleobj.py You can extract this bin Suffix ole file , Just give it a try , stay oleobj.py Open the command line in the directory , Take the just extracted rId12_embeddings_oleObject1.bin File copy to oleobj.py In the directory , Execute the following command :

Be careful : Before that, I performed the installation oletools The order of , If you do not install it, you may make an error :pip install oletools, Or say oleobj.py rely on olefile:pip install olefile, In the installation oletools By the way olefile.

python oleobj.py rId12_embeddings_oleObject1.bin

Successfully exported

Microsoft Windows [ edition 10.0.22000.708]
(c) Microsoft Corporation. All rights reserved .
D:\Minuy\Downloads\oletools-master\oletools-master\oletools>python oleobj.py rId12_embeddings_oleObject1.bin
oleobj 0.56 - http://decalage.info/oletools
THIS IS WORK IN PROGRESS - Check updates regularly!
Please report any issue at https://github.com/decalage2/oletools/issues
-------------------------------------------------------------------------------
File: 'rId12_embeddings_oleObject1.bin'
extract file embedded in OLE object from stream '\x01Ole10Native':
Parsing OLE Package
Filename = "Boos.java"
Source path = "D:\111\´ó20´óÊý¾Ý Àî¾üÁé\Boos.java"
Temp path = "C:\Users\ADMINI~1\AppData\Local\Temp\Boos.java"
saving to file rId12_embeddings_oleObject1.bin_Boos.java
D:\Minuy\Downloads\oletools-master\oletools-master\oletools>

The exported files can also be accessed normally :

So the oletools Copy the directory to the project , I'm going to modify it a little bit oleobj.py Can make my code call it , stay oleobj.py Add the following code to :

def export_main(ole_files, output_dir, log_leve=DEFAULT_LOG_LEVEL):
ensure_stdout_handles_unicode()
logging.basicConfig(level=LOG_LEVELS[log_leve], stream=sys.stdout,
format='%(levelname)-8s %(message)s')
# Enable the log module 
log.setLevel(logging.NOTSET)
any_err_stream = False
any_err_dumping = False
any_did_dump = False
for container, filename, data \
in xglob.iter_files(ole_files,
recursive=False,
zip_password=None,
zip_fname='*'):
if container and filename.endswith('/'):
continue
# Output folder 
err_stream, err_dumping, did_dump = \
process_file(filename, data, output_dir)
any_err_stream |= err_stream
any_err_dumping |= err_dumping
any_did_dump |= did_dump
return_val = RETURN_NO_DUMP
if any_did_dump:
return_val += RETURN_DID_DUMP
if any_err_stream:
return_val += RETURN_ERR_STREAM
if any_err_dumping:
return_val += RETURN_ERR_DUMP
return return_val
def export_ole_file(ole_files, output_dir, debug=False):
debug_leve = 'critical'
if debug:
debug_leve = 'info'
# export 
result = export_main(
ole_files,
output_dir,
debug_leve
)
if result and debug:
print(' export ole File error ', ole_files)

Add the following call to the code that extracts the file :

if str(rel.target_ref).startswith('embeddings'):
# Unzip the embedded file 
export_ole_file([file_path], output_dir)

Run again

Successfully exported embedded into Word Documents in !

Solve the problem successfully ~

Brothers, go and have a try !


  1. 上一篇文章:
  2. 下一篇文章:
Copyright © 程式師世界 All Rights Reserved