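The common XML programming interfaces in Python are DOM and SAX. They process XML files in different ways and suit different use cases. Python has three ways to parse XML: SAX, DOM, and ElementTree. The most commonly used option, however, is probably another module: lxml.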
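To make the difference between the three standard-library approaches concrete, here is a minimal sketch that reads the same small document with DOM, SAX, and ElementTree. The XML string, the `TitleHandler` class, and the variable names are assumptions made up for this illustration.

```python
import xml.sax
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Invented sample document, used only for this comparison.
XML = '<collection><movie title="Trigun"/><movie title="Example"/></collection>'

# DOM: the whole document is loaded into an in-memory tree, then queried.
dom = minidom.parseString(XML)
dom_titles = [m.getAttribute('title') for m in dom.getElementsByTagName('movie')]


# SAX: the parser streams through the document and calls back on each event.
class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, name, attrs):
        if name == 'movie':
            self.titles.append(attrs.getValue('title'))


handler = TitleHandler()
xml.sax.parseString(XML.encode('utf-8'), handler)
sax_titles = handler.titles

# ElementTree: a lightweight tree with a Pythonic, list-like element API.
root = ET.fromstring(XML)
et_titles = [m.get('title') for m in root.iter('movie')]

print(dom_titles, sax_titles, et_titles)
```

All three produce the same list of titles. SAX never holds the whole tree in memory, which matters for very large files, while DOM and ElementTree trade memory for easier navigation.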
In [39]:
# pip install lxml
In [40]:
from lxml import etree
In [41]:
root = etree.Element('root')
In [42]:
root.tag
Out[42]:
'root'
In [43]:
root.append(etree.Element('child1'))
In [44]:
child2 = etree.SubElement(root, 'child2')
In [45]:
child3 = etree.SubElement(root, 'child3')
In [46]:
print(etree.tostring(root, pretty_print=True))
b'<root>\n  <child1/>\n  <child2/>\n  <child3/>\n</root>\n'
In [47]:
child = root[0]
In [48]:
child.tag
Out[48]:
'child1'
In [49]:
len(root)
Out[49]:
3
In [50]:
root.index(root[1])
Out[50]:
1
In [51]:
children = list(root)
In [52]:
for child in children:
    print(child.tag)
child1
child2
child3
In [53]:
etree.iselement(root)
Out[53]:
True
In [54]:
if len(root):
    print('got')
got
In [55]:
if len(child2):
    print('got')
In [56]:
child2.getparent()
Out[56]:
<Element root at 0x7f7b71fcac00>
In [57]:
child2.getnext()
Out[57]:
<Element child3 at 0x7f7b71fb1780>
In [58]:
child2.getprevious()
Out[58]:
<Element child1 at 0x7f7b71f2a040>
In [59]:
root = etree.Element('root', intersting = 'totally')
In [60]:
etree.tostring(root)
Out[60]:
b'<root intersting="totally"/>'
Attributes are just unordered name-value pairs, so a convenient way to work with them is through the dictionary-like interface of Elements:
In [61]:
root.get('intersting')
Out[61]:
'totally'
In [62]:
root.get('Hello')
In [63]:
root.set('Hello', 'HuHu')
In [64]:
sorted(root.keys())
Out[64]:
['Hello', 'intersting']
In [65]:
for name, value in sorted(root.items()):
    print(f'{name}: {value}')
Hello: HuHu
intersting: totally
In [66]:
etree.tostring(root)
Out[66]:
b'<root intersting="totally" Hello="HuHu"/>'
In [67]:
attributes = root.attrib
In [68]:
attributes.get('intersting')
Out[68]:
'totally'
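The dictionary-like interface can be pushed a little further. The sketch below assumes only the `root` element built above (including its `intersting` attribute name as spelled in this notebook) and shows that `root.attrib` is a live view of the element, while `dict(root.attrib)` gives an independent copy. The variable names are illustrative.

```python
from lxml import etree

root = etree.Element('root', intersting='totally')

# root.attrib is a live, dict-like view: writing to it changes the element.
attributes = root.attrib
attributes['Hello'] = 'HuHu'          # equivalent to root.set('Hello', 'HuHu')
live_value = root.get('Hello')

# dict(...) takes a snapshot; editing the copy leaves the element untouched.
snapshot = dict(root.attrib)
snapshot['Hello'] = 'changed'
after_copy_edit = root.get('Hello')

# get() accepts a default for missing attributes, like dict.get().
missing = root.get('nope', 'default')

# Attributes can be deleted through the view as well.
del root.attrib['intersting']
remaining = sorted(root.keys())

print(live_value, after_copy_edit, missing, remaining)
```

Taking a snapshot is useful when the attributes need to outlive the element or be modified without touching the tree.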
Elements can contain text:
In [69]:
root = etree.Element('root')
In [70]:
root.text = 'TEXT'
In [71]:
root.text
Out[71]:
'TEXT'
In [72]:
etree.tostring(root)
Out[72]:
b'<root>TEXT</root>'
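Text can also appear after a child element, inside its parent. lxml stores that trailing text on the child's `tail` attribute rather than on the parent. A minimal sketch, with element names chosen only for illustration:

```python
from lxml import etree

html = etree.Element('html')
body = etree.SubElement(html, 'body')
body.text = 'TEXT'                    # text before the first child

br = etree.SubElement(body, 'br')
br.tail = 'TAIL'                      # text that follows <br/> inside <body>

# Serializing shows where text and tail end up in the markup.
serialized = etree.tostring(html)

# method='text' drops the tags and keeps only the text content.
plain = etree.tostring(html, method='text')

print(serialized)
print(plain)
```

Together, `text` and `tail` are enough to represent mixed content without separate text nodes.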
In [73]:
with open('/data/demo/movie.xml') as f:
    # print(f.read())
    text = f.read()
    html = etree.HTML(text.encode())
    # print(html)
    print(html.tag)
html
Starting from the root, find the year nodes at any depth (the // prefix matches at every level):
In [74]:
years = html.xpath('//year')
for year in years:
    print(year.tag)
year
year
In [75]:
for tr in html.xpath('//movie[@title="Trigun"]'):
    print(tr)
<Element movie at 0x7f7b71ff0440>
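The file /data/demo/movie.xml is not reproduced here, so the sketch below runs the same kinds of XPath queries against an inline document. Its structure and values (including the obviously fake years) are invented placeholders, not the contents of that file. It also shows that `etree.fromstring` keeps the real XML root, whereas `etree.HTML` wraps the content in an `html` element, which is why `html.tag` printed `html` above.

```python
from lxml import etree

# Invented placeholder data; not the contents of movie.xml.
XML = b"""<collection>
  <movie title="Trigun"><year>1111</year></movie>
  <movie title="Example"><year>2222</year></movie>
</collection>"""

root = etree.fromstring(XML)          # XML parser: root stays <collection>

# '//' matches nodes at any depth below the document root.
year_tags = [y.tag for y in root.xpath('//year')]

# A predicate filters by attribute; text() returns the text, not the element.
trigun_years = root.xpath('//movie[@title="Trigun"]/year/text()')

# '@title' selects attribute values directly.
all_titles = root.xpath('//movie/@title')

root_tag = root.tag
wrapped_tag = etree.HTML(XML).tag     # HTML parser adds an <html> wrapper

print(year_tags, trigun_years, all_titles, root_tag, wrapped_tag)
```

For genuine XML input, `etree.fromstring` or `etree.parse` is usually the better fit. `etree.HTML` is lenient and meant for HTML that may not be well-formed.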
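The etree module of lxml can also be used to scrape information from websites.
Extract the "Weekly Word-of-Mouth Chart" (本周口碑榜) from Douban Movies: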
In [76]:
import requests
lxml is built on the C libraries libxml2 and libxslt, so it is very efficient.
In [77]:
from lxml import etree
In [78]:
url = 'http://movie.douban.com'
headers = {'User-agent': "Mozilla/7.0 (Windows NT 6.1) AppleWebKit/539.36 (KHTML, like Gecko) \
Chrome/59.0.2883.75 Safari/537.36"}
response = requests.get(url, headers=headers)
In [79]:
with response:
    if response.status_code == 200:
        text = response.text
        html = etree.HTML(text)
        print(html.tag)
        titles = html.xpath('//div[@class="billboard-bd"]//a/text()')
        for title in titles:
            print(title)
        print("*********************")
html
孤独摇滚(上)
孤独的美食家 剧场版
黎明的一切
爱的暂停键
共同的语言
最后的里程
大风杀
雷霆特攻队*
新干线惊爆倒数
女儿的女儿
*********************
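The extraction step can be tried without a network connection. The sketch below applies the same XPath expression to an inline page. The markup in `SAMPLE_PAGE` is a made-up approximation and not Douban's actual HTML, and `extract_titles` is a hypothetical helper name introduced here.

```python
from lxml import etree

# Made-up page fragment; only the class name "billboard-bd" comes from the
# XPath used above. The real page structure may differ.
SAMPLE_PAGE = """
<html><body>
  <div class="billboard-bd">
    <table>
      <tr><td>1</td><td><a href="#">  First Title </a></td></tr>
      <tr><td>2</td><td><a href="#">Second Title</a></td></tr>
    </table>
  </div>
  <div class="other"><a href="#">Not on the chart</a></div>
</body></html>
"""


def extract_titles(page_source):
    """Return the link texts found inside the billboard container."""
    html = etree.HTML(page_source)
    raw = html.xpath('//div[@class="billboard-bd"]//a/text()')
    # Strip surrounding whitespace and drop empty strings.
    return [t.strip() for t in raw if t.strip()]


titles = extract_titles(SAMPLE_PAGE)
print(titles)
```

Keeping the parsing in a function that takes a string makes it easy to test against saved pages. For the live request, passing `timeout=` to `requests.get` and calling `response.raise_for_status()` are reasonable additions, since a scraper otherwise hangs or silently skips on errors.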