Data migration comes up in many projects. To make sure the new data is consistent with the old data, content testing is sometimes required, and that involves comparing XML content. Because the data processing method has changed, some differences are expected and must be ignored, so certain XPaths need special handling. After looking at several options, I found lxml to be the most efficient, and it is very convenient for handling XML. This article uses an example to work through this data-cleaning problem.
Summary:
For background on XML namespaces, refer to the XML Namespace documentation. Namespaces are mainly used to solve naming conflicts. In more complex XML the tags usually carry a prefix, and the prefix stands for a namespace.
For example, this is an ordinary XML document that uses a default namespace:
<schema xmlns="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <element name="foo">
    <complexType>
      <element name="bar" type="B:myType"/>
    </complexType>
  </element>
  <complexType name="myType">
    <choice>
      <element name="baz" type="string"/>
      <element name="bas" type="string"/>
    </choice>
  </complexType>
</schema>
The same document can also be written with a prefix, declared as xmlns:myprefix="http://www.w3.org/2001/XMLSchema":
<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <myprefix:element name="foo">
    <myprefix:complexType>
      <myprefix:element name="bar" type="B:myType"/>
    </myprefix:complexType>
  </myprefix:element>
  <myprefix:complexType name="myType">
    <myprefix:choice>
      <myprefix:element name="baz" type="string"/>
      <myprefix:element name="bas" type="string"/>
    </myprefix:choice>
  </myprefix:complexType>
</myprefix:schema>
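To check that the two spellings really describe the same elements, here is a minimal sketch. The shortened XML strings and the variable names are my own assumptions, not taken from the project. lxml stores every tag as {namespace-uri}localname, so the prefix does not change the tag itself.

```python
import lxml.etree as XE

XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Shortened, hypothetical versions of the two documents above
default_ns_xml = '<schema xmlns="%s"><element name="foo"/></schema>' % XSD_NS
prefixed_xml = (
    '<myprefix:schema xmlns:myprefix="%s">'
    '<myprefix:element name="foo"/>'
    '</myprefix:schema>' % XSD_NS
)

default_root = XE.fromstring(default_ns_xml)
prefixed_root = XE.fromstring(prefixed_xml)

# Both tags expand to the same qualified name
print(default_root.tag)
print(prefixed_root.tag)

# Only the prefix differs: None for the default namespace, "myprefix" otherwise
print(default_root.prefix, prefixed_root.prefix)
```

Since the qualified names are identical, a path lookup only needs the right prefix-to-URI mapping; the prefix text written in the file is not significant.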
Parse the XML and read the namespace mapping from the root node:
import lxml.etree as XE
root = XE.fromstring(xml_content)
namespaces = root.nsmap
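As a small illustration of what nsmap holds, here is a sketch that assumes a shortened version of the prefixed document (the sample string is an assumption made for this example):

```python
import lxml.etree as XE

# Hypothetical, shortened sample based on the prefixed document
xml_content = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" '
    'xmlns:B="urn:B" targetNamespace="urn:B">'
    '<myprefix:element name="foo"/>'
    '</myprefix:schema>'
)
root = XE.fromstring(xml_content)

# nsmap maps each prefix in scope to its namespace URI
namespaces = root.nsmap
print(namespaces)

# QName splits the qualified tag into namespace URI and local name
qname = XE.QName(root)
print(qname.namespace, qname.localname)

# Child nodes inherit the declarations made on their parents
child = root[0]
print(child.nsmap == namespaces)
```

The mapping can be passed straight to findall() through its namespaces argument.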
Note: findall() needs a relative path; absolute paths are not supported. For tags with a prefix, you must also pass the namespace mapping.
Find the nodes that match a path:
nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
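The difference between a relative and an absolute path can be seen in a short sketch. The sample document is an assumption for illustration. As far as I know, lxml rejects an absolute path passed to findall() on a node with a SyntaxError, which the example catches.

```python
import lxml.etree as XE

# Hypothetical sample document
xml_content = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema">'
    '<myprefix:element name="foo"/>'
    '<myprefix:element name="bar"/>'
    '</myprefix:schema>'
)
root = XE.fromstring(xml_content)

# Relative path plus namespace mapping: matches both element nodes
nodes = root.findall(".//myprefix:element", namespaces=root.nsmap)
names = [node.get("name") for node in nodes]
print(names)

# Absolute path (leading "/"): rejected by findall()
try:
    root.findall("/myprefix:schema/myprefix:element", namespaces=root.nsmap)
    absolute_path_error = None
except SyntaxError as exc:
    absolute_path_error = str(exc)
print(absolute_path_error)
```

This is why the complete code turns each configured path into a relative one before calling findall().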
Remove a node:
node.getparent().remove(node)
Remove an attribute from a node:
node.attrib.pop(attri_name_list[0])
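Both operations can be tried on a tiny document. This is only a sketch: the p:root, p:keep and p:drop tags, the urn:example namespace and the stamp attribute are invented for the example.

```python
import lxml.etree as XE

# Hypothetical sample: one node to drop, one attribute to drop
xml_content = (
    '<p:root xmlns:p="urn:example">'
    '<p:keep id="1" stamp="2024"/>'
    '<p:drop/>'
    '</p:root>'
)
root = XE.fromstring(xml_content)
ns = root.nsmap

# Remove whole nodes through their parent
for node in root.findall(".//p:drop", namespaces=ns):
    node.getparent().remove(node)

# Remove a single attribute; the [@stamp] predicate guarantees it exists
for node in root.findall(".//p:keep[@stamp]", namespaces=ns):
    node.attrib.pop("stamp")

result = XE.tostring(root, encoding="unicode")
print(result)  # <p:root xmlns:p="urn:example"><p:keep id="1"/></p:root>
```

One thing to keep in mind: in lxml, remove() also takes away the tail text that follows the removed node, which matters when the document contains text between tags.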
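Requirement:
Take the prefixed XML above as an example. Before the comparison, two expected differences must be ignored: the choice nodes under complexType are removed, and the name attribute is dropped from every element that has one. The namespace declarations on the root tag are stripped as well.
Solution:
Complete code: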
import re
import lxml.etree as XE


def ignore_xpath_handled_by_lxml(xml_content):
    # Paths to ignore: a plain path removes the node,
    # a path ending in [@attr] removes only that attribute
    ignore_xpath_set = set()
    ignore_xpath_set.add("myprefix:schema/myprefix:complexType/myprefix:choice")
    ignore_xpath_set.add("myprefix:schema/myprefix:element[@name]")
    root = XE.fromstring(xml_content)
    # root.tag looks like "{namespace-uri}schema"; keep only the local name
    root_tag_name = XE.QName(root).localname
    for xpath_ignore in ignore_xpath_set:
        # The first step of each path names the root tag, e.g. "myprefix:schema"
        xpath_ignore_tag = xpath_ignore.split("/")[0].split(":")[1]
        # Only handle paths whose first step matches the root tag
        if xpath_ignore_tag != root_tag_name:
            continue
        # findall() needs a relative path: replace the root step with ".//"
        index = xpath_ignore.find("/")
        rele_xpath_ignore = ".//" + xpath_ignore[index + 1:]
        try:
            attri_name_list = re.findall(r"\[@(.*)\]", xpath_ignore)
            nodes = root.findall(rele_xpath_ignore, namespaces=root.nsmap)
            for node in nodes:
                if attri_name_list:
                    # The path ends in [@attr]: drop only that attribute
                    node.attrib.pop(attri_name_list[0])
                else:
                    # Otherwise remove the whole node (and its tail text)
                    node.getparent().remove(node)
        except Exception as e:
            print("Error: {}".format(e))
    root_tag = "myprefix" + ":" + root_tag_name
    ignore_result = XE.tostring(root, pretty_print=True, encoding="unicode")
    # Strip the declarations from the root start tag only;
    # [^>]* stops at the first ">" so the rest of the document is left alone
    namespace_pattern = re.compile("<" + re.escape(root_tag) + r"\s+xmlns[^>]*>")
    content_without_namespace = re.sub(namespace_pattern, "<" + root_tag + ">", ignore_result, count=1)
    return content_without_namespace


if __name__ == "__main__":
    xml_string = '''<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" targetNamespace="urn:B" xmlns:B="urn:B" elementFormDefault="qualified">
  <myprefix:element name="foo">
    <myprefix:complexType>
      <myprefix:element name="bar" type="B:myType"/>
    </myprefix:complexType>
  </myprefix:element>
  <myprefix:complexType name="myType">
    <myprefix:choice>
      <myprefix:element name="baz" type="string"/>
      <myprefix:element name="bas" type="string"/>
    </myprefix:choice>
  </myprefix:complexType>
</myprefix:schema>'''
    new_xml_string = ignore_xpath_handled_by_lxml(xml_string)
    print(new_xml_string)
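The last step, stripping the namespace declarations from the root start tag, can be tried in isolation. This is a sketch on a hand-written, shortened serialized string (an assumption, not real program output). It uses re.escape and [^>]* so that the match ends at the first ">" and cannot run past the root start tag.

```python
import re

# Hypothetical serialized document, written on one line on purpose
serialized = (
    '<myprefix:schema xmlns:myprefix="http://www.w3.org/2001/XMLSchema" '
    'elementFormDefault="qualified">'
    '<myprefix:element type="B:myType"/>'
    '</myprefix:schema>'
)
root_tag = "myprefix:schema"

# Match the root start tag from its first xmlns declaration to the closing ">"
pattern = re.compile("<" + re.escape(root_tag) + r"\s+xmlns[^>]*>")
stripped = pattern.sub("<" + root_tag + ">", serialized, count=1)
print(stripped)
```

The stripped text is meant for text comparison only. Its prefixes are no longer declared, so it should not be fed back into a namespace-aware parser.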
Output:
<myprefix:schema>
  <myprefix:element>
    <myprefix:complexType>
      <myprefix:element type="B:myType"/>
    </myprefix:complexType>
  </myprefix:element>
  <myprefix:complexType name="myType">
    </myprefix:complexType>
</myprefix:schema>
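Coming back to the original goal, comparing old and new data, the cleaned documents still have to be compared. The following is one possible sketch, not the method used in the project. The cleaned_c14n helper, the urn:demo sample data and the ignore list are all hypothetical, and it swaps in two lxml features that the code above does not use: a parser with remove_blank_text=True, so that indentation differences do not count, and C14N serialization, so that both sides are written out in the same canonical form.

```python
import lxml.etree as XE


def cleaned_c14n(xml_content, ignore_paths):
    """Parse, drop the ignored nodes, and return canonical XML bytes."""
    # Assumption: whitespace-only text between tags is not significant
    parser = XE.XMLParser(remove_blank_text=True)
    root = XE.fromstring(xml_content, parser)
    for path in ignore_paths:
        for node in root.findall(path, namespaces=root.nsmap):
            node.getparent().remove(node)
    return XE.tostring(root, method="c14n")


# Hypothetical old and new records: same id, different migration timestamp
old_xml = (
    '<d:doc xmlns:d="urn:demo">'
    '<d:id>7</d:id><d:migrated_at>2020-01-01</d:migrated_at>'
    '</d:doc>'
)
new_xml = """<d:doc xmlns:d="urn:demo">
  <d:id>7</d:id>
  <d:migrated_at>2024-06-30</d:migrated_at>
</d:doc>"""

ignore = [".//d:migrated_at"]
same = cleaned_c14n(old_xml, ignore) == cleaned_c14n(new_xml, ignore)
print(same)  # True: the only difference was in an ignored node
```

With an empty ignore list the two documents compare as different, which is the behavior wanted for differences that are not expected.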