How to remove an element in lxml

Question 1

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without accumulating the output in a temporary variable and printing it manually, like this:

newxml="<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
    newxml += et.tostring(elt, encoding="unicode")

newxml+="</groceries>"

Question 2

Use the remove() method of an xmlElement:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)     # here I grab the parent of the element to call the remove directly on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine also works when the elements to remove are not directly under the root node of your XML.
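A minimal sketch of that difference, assuming a small nested sample document (the punnet element and the variable names are illustrative). remove() only accepts a direct child of the element it is called on, so calling it on the root fails for a nested match, while going through getparent() works at any depth.

```python
import lxml.etree as et

xml = """
<groceries>
  <fruit state="rotten">apple</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
</groceries>
"""

tree = et.fromstring(xml)
nested = tree.xpath("//punnet/fruit[@state='rotten']")[0]

# remove() only works on direct children, so the root rejects a grandchild.
try:
    tree.remove(nested)
    removed_from_root = True
except ValueError:
    removed_from_root = False

# Asking each match for its own parent works at any depth.
for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

remaining = [fruit.text for fruit in tree.iter("fruit")]
print(removed_from_root, remaining)
```

Here the call on the root is expected to fail for the nested strawberry, and only the fresh blueberry is left after the loop.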

Question 3

You're looking for the remove() function. Call remove() on the parent of the element you want to delete and pass it that subelement.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Question 4

I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I did not intend.

Following another answer, I found that etree.strip_elements is a better solution for me, because the with_tail=(bool) parameter lets you control whether the trailing text is removed as well.

I still don't know whether it can use an xpath filter for the tag, though. I am just posting this for information.

Here is the doc:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage::
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
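As a rough sketch of how this applies to the div/script situation above (using a compact one-line version of that snippet as assumed input), passing with_tail=False removes the script element but leaves the text that follows it. strip_elements matches on tag names only, so filtering by an XPath expression would need a different approach, such as selecting with xpath() and removing each match individually.

```python
import lxml.etree as et

# Compact version of the div/script example; the whitespace is omitted on purpose.
div = et.fromstring("<div><script>some code</script>text here</div>")

# with_tail=False deletes <script> and its content but keeps the text after it.
et.strip_elements(div, "script", with_tail=False)

result = et.tostring(div, encoding="unicode")
print(result)
```

With the default with_tail=True, the "text here" part would be deleted together with the script element, which is the same loss described for div.remove(script).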

Question 5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

But this removes the element together with its tail, which is a problem if you are processing mixed-content documents such as HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

becomes

<div></div>

which is, I guess, not always what you want :) I wrote a helper function that removes just the element and keeps its tail:

def remove_element(el):
    parent = el.getparent()
    if el.tail and el.tail.strip():  # tail may be None
        prev = el.getprevious()
        if prev is not None:  # an element without children is falsy
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way the tail text is preserved:

<div> Hello!</div>
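A self-contained sketch of the same idea, exercising the case where the removed element has a previous sibling. The function name remove_keep_tail and the sample markup are illustrative, and unlike the helper above this variant keeps whitespace-only tails as well.

```python
import lxml.etree as et

def remove_keep_tail(el):
    """Remove el from its parent but keep the text that follows it."""
    parent = el.getparent()
    if el.tail:
        prev = el.getprevious()
        if prev is not None:
            # The tail text now follows the previous sibling instead.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No previous sibling: the text becomes the parent's leading text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

doc = et.fromstring(
    '<div>Start <b>bold</b> middle '
    '<fruit state="rotten">avocado</fruit> Hello!</div>'
)

for bad in doc.xpath("//fruit[@state='rotten']"):
    remove_keep_tail(bad)

result = et.tostring(doc, encoding="unicode")
print(result)
```

The explicit "is not None" check matters here: an lxml element with no children evaluates as false, so a plain truth test on the previous sibling would send the tail to the parent's text and reorder the content.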

Question 6

You can also use the html module from lxml to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should produce this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>
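The blank lines in the output above are the tail whitespace of the dropped elements, since drop_tree() joins an element's tail text to the previous sibling or the parent instead of deleting it. A small sketch of that behaviour on a mixed-content snippet (the sample markup is an assumption for illustration):

```python
from lxml import html

div = html.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')

for bad in div.xpath("//fruit[@state='rotten']"):
    bad.drop_tree()  # drops the element and its content, keeps the tail text

result = html.tostring(div).decode("utf-8")
print(result)
```

This makes drop_tree() a convenient choice for HTML-like documents where the text after a removed element has to survive.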