Cleaning XML with Python lxml: removing nodes and attributes

Introduction:

Data migration comes up in many projects. Verifying that the new data is consistent with the old data sometimes requires content-level testing, which means comparing XML content. Because the data processing method has changed, some differences are expected and need to be ignored, so the nodes and attributes matched by particular XPaths have to be handled specially. After looking at several options, I consider lxml the most efficient, and it makes XML handling very convenient. This article works through one example of this kind of data cleaning.

Summary:

XML namespace overview
lxml operations on XML
Using lxml to clean XML

XML namespace

For background on XML namespaces, see XML Namespace. Namespaces mainly exist to resolve naming conflicts. In complex XML the tags usually carry a prefix, and the prefix stands for a namespace.

For example, this is a normal XML document:

<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <element name="foo">
    <complexType>
      <element name="bar" type="B:myType"/>
    </complexType>
  </element>
  <complexType name="myType">
    <choice>
      <element name="baz" type="string"/>
      <element name="bas" type="string"/>
    </choice>
  </complexType>
</schema>

The same document can also be written with a prefix, declared as:
xmlns:myprefix="http://www.w3.org/2001/XMLSchema"

<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <myprefix:element name="foo">
    <myprefix:complexType>
      <myprefix:element name="bar" type="B:myType"/>
    </myprefix:complexType>
  </myprefix:element>
  <myprefix:complexType name="myType">
    <myprefix:choice>
      <myprefix:element name="baz" type="string"/>
      <myprefix:element name="bas" type="string"/>
    </myprefix:choice>
  </myprefix:complexType>
</myprefix:schema>
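
To check that the two spellings really describe the same elements, here is a small sketch that parses a shortened version of each and compares the tag names lxml reports. lxml stores every tag as {namespace-uri}localname, so the prefix itself does not matter. The names default_form and prefixed_form are only illustrative.

```python
import lxml.etree as XE

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Shortened versions of the two documents above (illustrative only).
default_form = '<schema xmlns="%s"><element name="foo"/></schema>' % XSD_NS
prefixed_form = (
    '<myprefix:schema xmlns:myprefix="%s">'
    '<myprefix:element name="foo"/></myprefix:schema>' % XSD_NS
)

root_a = XE.fromstring(default_form)
root_b = XE.fromstring(prefixed_form)

# Both roots carry the same fully qualified tag; only the prefix differs.
print(root_a.tag)  # {http://www.w3.org/2001/XMLSchema}schema
print(root_a.tag == root_b.tag)
print(root_a.prefix, root_b.prefix)

# QName splits a qualified tag into its namespace and local name.
qname = XE.QName(root_b)
print(qname.namespace, qname.localname)
```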

lxml operations on XML

Parse XML:

import lxml.etree as XE
root = XE.fromstring(xml_content)
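
One detail worth knowing when parsing, shown as a minimal sketch: if the content is a str that carries an encoding declaration, lxml refuses it, while the same content passed as bytes parses fine. The sample document here is made up for illustration.

```python
import lxml.etree as XE

declared = '<?xml version="1.0" encoding="UTF-8"?><root><child/></root>'

# A str that carries an encoding declaration is rejected by lxml ...
try:
    XE.fromstring(declared)
    parsed_from_str = True
except ValueError:
    parsed_from_str = False

# ... while the same content as bytes parses normally.
root = XE.fromstring(declared.encode("utf-8"))
print(parsed_from_str, root.tag)
```

So when XML is read from files that start with an XML declaration, opening them in binary mode is probably the simpler choice.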

Get the namespaces

namespaces = root.nsmap
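
As a quick illustration of what nsmap contains, the sketch below prints it for a shortened prefixed document and for one with a default namespace. The sample strings are assumptions made for the example. Note that a default namespace appears under the key None.

```python
import lxml.etree as XE

xml_content = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" '
    'xmlns:B="urn:B"><myprefix:element name="foo"/></myprefix:schema>'
)
root = XE.fromstring(xml_content)

# nsmap maps each prefix visible on this element to its namespace URI.
namespaces = root.nsmap
print(namespaces)

# A default namespace (xmlns="...") shows up under the key None.
default_root = XE.fromstring('<schema xmlns="http://www.w3.org/2001/XMLSchema"/>')
print(default_root.nsmap)
```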

Locate nodes
Note: use a relative XPath, because absolute XPaths are not supported here. For prefixed tags, the namespace map has to be passed in.

nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
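
A minimal sketch of the lookup on a shortened document (the sample XML is an assumption for the example). It also shows what happens with a path that starts with "/".

```python
import lxml.etree as XE

xml_content = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema">'
    '<myprefix:element name="foo"/>'
    '<myprefix:complexType name="myType"><myprefix:choice/></myprefix:complexType>'
    '</myprefix:schema>'
)
root = XE.fromstring(xml_content)

# Relative path: start at the current element (".") and search all descendants ("//").
rele_xpath_ignore = ".//myprefix:complexType/myprefix:choice"
nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
print(len(nodes))

# A path that starts with "/" is rejected by findall on an element.
try:
    root.findall("/myprefix:schema", namespaces=root.nsmap)
    absolute_ok = True
except SyntaxError:
    absolute_ok = False
print(absolute_ok)
```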

Remove a node

node.getparent().remove(node)
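
A small sketch of what removal does, on a made-up document. In lxml the text that follows an element (its tail) belongs to that element, so it disappears together with the node. The root element has no parent, so it cannot be removed this way.

```python
import lxml.etree as XE

root = XE.fromstring("<a><b>drop me</b>tail text<c/></a>")
node = root.find("b")

# remove() detaches the element together with its tail text.
node.getparent().remove(node)
print(XE.tostring(root, encoding="unicode"))  # <a><c/></a>

# The root has no parent, so getparent() returns None for it.
print(root.getparent())
```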

Remove an attribute

node.attrib.pop(attri_name_list[0])
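
For illustration, a minimal sketch on a single made-up element. attrib behaves much like a dict, so pop returns the removed value, and passing a default avoids a KeyError when the attribute is not there.

```python
import lxml.etree as XE

node = XE.fromstring('<element name="bar" type="B:myType"/>')

# pop() returns the removed value.
removed = node.attrib.pop("name")

# With a default, a missing attribute does not raise KeyError.
missing = node.attrib.pop("name", None)

print(removed, missing, dict(node.attrib))
```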

Using lxml to clean XML

Requirements:
Take the XML above as the example.

Remove the node
myprefix:schema/myprefix:complexType/myprefix:choice
Remove the attribute
myprefix:schema/myprefix:element/[@name]

Approach:

The XPaths to handle can be added to a list or read from a file.
An XPath may match many nodes, so a loop is needed to process them.
Both nodes and attributes have to be handled.
When many XML documents are processed, the XPaths that apply may differ per document, so XPaths that do not match the current XML should be skipped.
To get plain XML content, the namespace declarations are cleared as well.
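
The first and fourth points can be sketched roughly as follows. The helpers load_ignore_xpaths and split_ignore_xpath are hypothetical names introduced for this example, and the "#" comment convention and the third rule are assumptions, not part of any existing format.

```python
import re


def load_ignore_xpaths(lines):
    """Collect non-empty, non-comment lines; `lines` may be an open file or a list."""
    return [
        line.strip()
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]


def split_ignore_xpath(xpath_ignore):
    """Split 'prefix:root/rest' into (root local name, relative path, attribute or None)."""
    first_step, _, rest = xpath_ignore.partition("/")
    root_name = first_step.split(":")[-1]
    attr_match = re.search(r"\[@([^\]]+)\]", xpath_ignore)
    attribute = attr_match.group(1) if attr_match else None
    return root_name, ".//" + rest, attribute


rules = load_ignore_xpaths([
    "# XPaths to ignore",
    "myprefix:schema/myprefix:complexType/myprefix:choice",
    "myprefix:schema/myprefix:element/[@name]",
    "other:report/other:timestamp",
])
for rule in rules:
    root_name, relative_path, attribute = split_ignore_xpath(rule)
    # Compare with the local name of the parsed root; "schema" is hard-coded here.
    applies = root_name == "schema"
    print(relative_path, attribute, applies)
```

Reading from a file would only mean passing the open file object to load_ignore_xpaths instead of the list.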

Complete code:

import re
import lxml.etree as XE


def ignore_xpath_handled_by_lxml(xml_content):
    # XPaths to ignore; the first step names the root element they apply to.
    ignore_xpath_set = set()
    ignore_xpath_set.add("myprefix:schema/myprefix:complexType/myprefix:choice")
    ignore_xpath_set.add("myprefix:schema/myprefix:element/[@name]")
    root = XE.fromstring(xml_content)
    # Local name of the root tag, without the "{namespace}" part.
    root_tag_name = XE.QName(root).localname
    for xpath_ignore in ignore_xpath_set:
        xpath_ignore_tag = xpath_ignore.split("/")[0].split(":")[-1]
        # Skip XPaths written for a different root element.
        if xpath_ignore_tag != root_tag_name:
            continue
        # Relative path: drop the root step and search from the root element.
        index = xpath_ignore.find("/")
        rele_xpath_ignore = ".//" + xpath_ignore[index + 1:]
        try:
            # An [@attr] predicate means "remove this attribute"; otherwise remove the node.
            attri_name_list = re.findall(r"\[@([^\]]+)\]", xpath_ignore)
            nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
            for node in nodes:
                if attri_name_list:
                    node.attrib.pop(attri_name_list[0], None)
                else:
                    node.getparent().remove(node)
        except Exception as e:
            print("Error: {}".format(e))
    # Rebuild the root tag name as it appears in the serialized text.
    root_tag = root.prefix + ":" + root_tag_name if root.prefix else root_tag_name
    ignore_result = XE.tostring(root, pretty_print=True, encoding="unicode")
    # Rewrite only the root start tag, dropping its xmlns declarations and other attributes.
    # [^>]* stops at the first ">", so the match cannot run past the start tag.
    # The result no longer declares its prefixes; it is meant for text comparison only.
    namespace_pattern = re.compile("<" + re.escape(root_tag) + r"\s[^>]*>")
    content_without_namespace = namespace_pattern.sub("<" + root_tag + ">", ignore_result, count=1)
    return content_without_namespace


if __name__ == "__main__":
    xml_string = '''<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <myprefix:element name="foo">
    <myprefix:complexType>
      <myprefix:element name="bar" type="B:myType"/>
    </myprefix:complexType>
  </myprefix:element>
  <myprefix:complexType name="myType">
    <myprefix:choice>
      <myprefix:element name="baz" type="string"/>
      <myprefix:element name="bas" type="string"/>
    </myprefix:choice>
  </myprefix:complexType>
</myprefix:schema>'''
    new_xml_string = ignore_xpath_handled_by_lxml(xml_string)
    print(new_xml_string)

Output:

<myprefix:schema>
<myprefix:element>
<myprefix:complexType>
<myprefix:element type="B:myType"/>
</myprefix:complexType>
</myprefix:element>
<myprefix:complexType name="myType">
</myprefix:complexType>
</myprefix:schema>
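
The output above still carries the myprefix: prefix on every tag even though the declaration is gone. If the goal is XML with no namespaces at all, one possible alternative is to drop them on the tree instead of on the serialized text. This is only a sketch: strip_namespaces is a hypothetical helper, the sample document is shortened, and namespaced attributes (none occur here) would need similar handling.

```python
import lxml.etree as XE


def strip_namespaces(root):
    """Rewrite every element tag to its local name and drop unused declarations."""
    for element in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(element.tag, str):
            element.tag = XE.QName(element).localname
    # Remove xmlns declarations that no element or attribute uses any more.
    XE.cleanup_namespaces(root)
    return root


xml_content = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" xmlns:B="urn:B">'
    '<myprefix:element type="B:myType"/>'
    '</myprefix:schema>'
)
root = strip_namespaces(XE.fromstring(xml_content))
result = XE.tostring(root, encoding="unicode")
print(result)
```

Unlike the regular expression approach, the result here stays well-formed and can be parsed again, which may matter if the cleaned content is compared as a tree rather than as text.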