您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python batch export of pictures and embedded files in word documents

編輯：Python

There are screenshots to be submitted for the questions in the student's test paper , There are also documents to be submitted , In order to facilitate the students' examination , Allow separate intersection or embedding Word Submitted in , Then how to sort out the students' answers afterwards ？ It is more convenient to submit separately , Directly scan the file name to match the name and put it into the specified folder . But embedded in Word How to extract the pictures and files in ？

There are the following needs ： Extract one Word All the pictures in the document （png、jpg） And embedded files （ Any format ） Put it in the specified folder .

solve
docx It's a compressed package , The decompressed image is usually placed in the document name .docx\word\media\ Under the table of contents ：

Embedded files are usually placed in the document name .docx\word\embeddings\ Under the table of contents ：

After asking Du Niang , I found it easy to extract pictures , Use it directly docx In the library Document.part.rels{k:v.target_ref} Find the relative path of the file , use Document.part.rels{k:v.target_part.blob} Read file contents . Simply judge whether the path and file suffix are what we need media Under the png Document and embeddings Under the bin file , If yes, write to the new file ：

To extract the image

install python-docx library

pip install python-docx

extract

import os
from docx import Document # pip install python-docx
is_debug = True
if __name__ == '__main__':
# Need to export Word Document path 
# Python Learning and communication base 279199867
target_file = r'paper\HBase test questions .docx'
# The directory where the exported file is located 
output_dir = r'paper\output'
# load Word file 
doc = Document(target_file)
# Traverse Word All the files in the package 
dict_rel = doc.part.rels
# r_id： Document id ,rel： File object 
for r_id, rel in dict_rel.items():
if not ( # If the file is not in media perhaps embeddings Medium , Just skip 
str(rel.target_ref).startswith('media')
or str(rel.target_ref).startswith('embeddings')
):
continue
# If the file is not the suffix we want , Also skip directly 
file_suffix = str(rel.target_ref).split('.')[-1:][0]
if file_suffix.lower() not in ['png', 'jpg', 'bin']:
continue
# If the output directory does not exist , establish 
if not os.path.exists(output_dir):
os.makedirs(output_dir)
# Build the name and path of the export file 
file_name = r_id + '_' + str(rel.target_ref).replace('/', '_')
file_path = os.path.join(output_dir,file_name)
# Write binary data to a file in a new location 
with open(file_path, "wb") as f:
f.write(rel.target_part.blob)
# Print the results 
if is_debug:
print(' Export file succeeded ：', file_name)

Running results ：

You can see , Pictures can be exported normally , But the students embed JAVA The file was not exported , In other words, the export is bin file , Not completely exported .

Extract embedded files

Ask Du Niang again and find , In fact, this is also zip Compressed package , But it can't be extracted directly , It has a more professional name , It's called ole file , Our previous doc、xls、ppt Wait till you don't bring it x All the ancient documents are in this format . So how to extract the file ？ Du Niang told me that there was a man named oletools Projects can , So I downloaded it and analyzed it , It turns out that it can ！

oletools Project address ：https://github.com/decalage2/oletools

perhaps gitee To the address of someone else ：https://gitee.com/yunqimg/oletools

I am using gitee Previous editions , because github Cannot be opened QwQ

Introduced by relevant documents , Under the project of oletools-master\oletools\oleobj.py You can extract this bin Suffix ole file , Just give it a try , stay oleobj.py Open the command line in the directory , Take the just extracted rId12_embeddings_oleObject1.bin File copy to oleobj.py In the directory , Execute the following command ：

Be careful ： Before that, I performed the installation oletools The order of , If you do not install it, you may make an error ：pip install oletools, Or say oleobj.py rely on olefile：pip install olefile, In the installation oletools By the way olefile.

python oleobj.py rId12_embeddings_oleObject1.bin

Successfully exported

Microsoft Windows [ edition 10.0.22000.708]
(c) Microsoft Corporation. All rights reserved .
D:\Minuy\Downloads\oletools-master\oletools-master\oletools>python oleobj.py rId12_embeddings_oleObject1.bin
oleobj 0.56 - http://decalage.info/oletools
THIS IS WORK IN PROGRESS - Check updates regularly!
Please report any issue at https://github.com/decalage2/oletools/issues
-------------------------------------------------------------------------------
File: 'rId12_embeddings_oleObject1.bin'
extract file embedded in OLE object from stream '\x01Ole10Native':
Parsing OLE Package
Filename = "Boos.java"
Source path = "D:\111\´ó20´óÊý¾Ý Àî¾üÁé\Boos.java"
Temp path = "C:\Users\ADMINI~1\AppData\Local\Temp\Boos.java"
saving to file rId12_embeddings_oleObject1.bin_Boos.java
D:\Minuy\Downloads\oletools-master\oletools-master\oletools>

The exported files can also be accessed normally ：

So the oletools Copy the directory to the project , I'm going to modify it a little bit oleobj.py Can make my code call it , stay oleobj.py Add the following code to ：

def export_main(ole_files, output_dir, log_leve=DEFAULT_LOG_LEVEL):
ensure_stdout_handles_unicode()
logging.basicConfig(level=LOG_LEVELS[log_leve], stream=sys.stdout,
format='%(levelname)-8s %(message)s')
# Enable the log module 
log.setLevel(logging.NOTSET)
any_err_stream = False
any_err_dumping = False
any_did_dump = False
for container, filename, data \
in xglob.iter_files(ole_files,
recursive=False,
zip_password=None,
zip_fname='*'):
if container and filename.endswith('/'):
continue
# Output folder 
err_stream, err_dumping, did_dump = \
process_file(filename, data, output_dir)
any_err_stream |= err_stream
any_err_dumping |= err_dumping
any_did_dump |= did_dump
return_val = RETURN_NO_DUMP
if any_did_dump:
return_val += RETURN_DID_DUMP
if any_err_stream:
return_val += RETURN_ERR_STREAM
if any_err_dumping:
return_val += RETURN_ERR_DUMP
return return_val
def export_ole_file(ole_files, output_dir, debug=False):
debug_leve = 'critical'
if debug:
debug_leve = 'info'
# export 
result = export_main(
ole_files,
output_dir,
debug_leve
)
if result and debug:
print(' export ole File error ', ole_files)

Add the following call to the code that extracts the file ：

if str(rel.target_ref).startswith('embeddings'):
# Unzip the embedded file 
export_ole_file([file_path], output_dir)

Run again

Successfully exported embedded into Word Documents in ！

Solve the problem successfully ~

Brothers, go and have a try ！

上一篇文章： Loading C language code using python (ctypes)
下一篇文章： Python learning notes: parsing XML (the elementtree XML API)

Python

Python 設計模式：原型模式

定義原型模式（Prototype Pattern）

Python實現識別文字中的省市區並繪圖

目錄1.准備2.基本使用3.高級使用在做NLP（自然語言處理

python-批量畫子圖+添加主標題+解決中文亂碼

import numpy as np import pand

Solve the shortest path problem, secondary allocation problem and knapsack problem based on the improved ant colony algorithm [matlab & Python code implementation]

一個簡單python小事情

碰到這個情況該怎麼辦，確實有點難住我了。具體的操作是什麼樣子

Python Experiment 6 file access

2022.06.08 Afternoon experimen

Django admin uses import_ Export display and import / export data

Great move! How to use Python to skillfully batch merge excel!

Pandasdataframe method for batch modification of values

Python+excel series: case 6: batch printing workbooks, batch printing specified worksheets in multiple workbooks

Python+excel series: case 7: batch copy all worksheets of a workbook to other workbooks, and batch copy the data of a worksheet to the specified worksheets of other workbooks

Only three lines of Python code are used to realize the import and export between the database and excel

[Python] how to batch name a file name as four or five digits

Python3 export excel one by one

Python batch modify file suffix

Python batch extracts the specified column and writes it to a new file

熱門圖文

實現php加速的eAccelerator dll支持文件打包下載 sql server-sqlserver 事物的使用 Servlet容器Tomcat中web.xml中url-pattern配置詳解《C# to IL》第四章關鍵字和操作符(下)(9) poj 3041 Asteroids (匈牙利算法---二分圖最大匹配) c#窗體背景圖片-C#設置窗體背景圖片，並讓背景圖片每隔五秒改變一次 C#獲得U盤序列號的辦法 new表達式，operator new和placement new介紹

欄目導航