lxml

XML

Serialization

Generate an XML string. In Python 3, pass encoding=str; otherwise the result is a bytes object.

etree.tostring(el, encoding=str)

If xml_declaration=True is specified, encoding=str cannot be used. A workaround is:

etree.tostring(root, encoding='UTF-8', xml_declaration=True).decode('UTF-8')

method defaults to xml. With html it produces HTML markup, and with text it outputs only the text content (the same as xpath('string()')).
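The options above can be compared side by side. This is a minimal sketch, assuming a small made-up document; the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
root = etree.fromstring("<root><a>1</a><b>2</b></root>")

# Without an encoding argument, tostring() returns bytes.
as_bytes = etree.tostring(root)

# encoding=str (or the equivalent encoding="unicode") returns a str.
as_text = etree.tostring(root, encoding=str)

# An XML declaration needs a concrete byte encoding, so decode afterwards.
with_decl = etree.tostring(
    root, encoding="UTF-8", xml_declaration=True
).decode("UTF-8")

# method="text" keeps only the text nodes, like xpath("string()").
text_only = etree.tostring(root, method="text", encoding=str)
```

Decoding after the fact keeps the declaration's stated encoding consistent with the bytes that were actually produced.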

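Text

Get the text inside a node (equivalent to jQuery's text()). text_content() is available only on lxml.html elements; xpath('string()') works on any element.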

el.text_content()
el.xpath('string()')
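A short sketch of both calls on the same element, assuming a made-up HTML fragment:

```python
import lxml.html

# Hypothetical fragment; fromstring() returns the <div> element itself.
el = lxml.html.fromstring("<div>Hello <b>big</b> world</div>")

# text_content() is defined on lxml.html elements.
via_method = el.text_content()

# xpath("string()") works on any lxml element, HTML or XML.
via_xpath = el.xpath("string()")
```

Both calls concatenate the text of the element and all of its descendants.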

XPath namespaces

Specify namespaces with the namespaces argument. Several namespaces can be given at once.

doc.xpath('//em:name', namespaces={'em':'http://www.mozilla.org/2004/em-rdf#'})

See the official documentation.
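A sketch with two prefixes in one query. The document below is invented for illustration; only the em namespace URI comes from the example above, and the rdf prefix is an assumption made so that the default namespace can be addressed.

```python
from lxml import etree

# Hypothetical document: a default namespace plus the em namespace.
xml = (
    '<RDF xmlns="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:em="http://www.mozilla.org/2004/em-rdf#">'
    '<Description><em:name>Example</em:name></Description>'
    '</RDF>'
)
doc = etree.fromstring(xml)

# Prefixes used in the XPath are local to this mapping; they need not
# match the prefixes written in the document.
ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "em": "http://www.mozilla.org/2004/em-rdf#",
}
names = doc.xpath("//rdf:Description/em:name/text()", namespaces=ns)
```

XPath has no notion of a default namespace, so elements in the document's default namespace still need a prefix in the expression.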

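Note: an absolute XPath expression is evaluated against the whole document, not the children of the current object. A relative path is evaluated relative to the current object.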
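The difference is easy to miss when calling xpath() on a child element. A small sketch, assuming an invented document with two sections:

```python
from lxml import etree

# Hypothetical document with two sibling sections.
root = etree.fromstring(
    "<root>"
    "<section id='a'><p>one</p></section>"
    "<section id='b'><p>two</p></section>"
    "</root>"
)
section_b = root.xpath("//section[@id='b']")[0]

# Absolute path: evaluated from the document root, so both <p> match.
from_document = section_b.xpath("//p/text()")

# Relative path: evaluated from section_b, so only its own <p> matches.
from_element = section_b.xpath(".//p/text()")
```

Prefixing the expression with a dot is usually what is wanted when searching below a specific element.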

HTML

lxml can also process HTML through the lxml.html module. For example:

html = lxml.html.fromstring(open('a.html').read())

The parse() function can also read an HTML document, but it returns an ElementTree rather than the html element; call its getroot() method to get the element.
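A minimal sketch of the parse() route. An in-memory bytes buffer stands in for a real file such as a.html, which is an assumption made to keep the example self-contained.

```python
import io
import lxml.html

# A bytes buffer plays the role of a file opened in binary mode.
source = io.BytesIO(b"<html><body><p>hi</p></body></html>")

tree = lxml.html.parse(source)  # an ElementTree, not an element
html = tree.getroot()           # the <html> element

paragraphs = html.xpath("//p/text()")
```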

Besides XPath, the html element also supports the CSS-based cssselect() method. See lxml.cssselect for details.

Serialize with lxml.html.tostring. It is similar to lxml.etree.tostring, but some tags are left unclosed by default; use method='xml' to output XHTML.
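The effect of method='xml' shows up on void elements such as br. A sketch, assuming a made-up fragment:

```python
import lxml.html

# Hypothetical fragment containing a void element.
el = lxml.html.fromstring("<p>line one<br>line two</p>")

# Default HTML serialization leaves <br> unclosed.
as_html = lxml.html.tostring(el, encoding=str)

# method="xml" emits a self-closing tag, suitable for XHTML.
as_xhtml = lxml.html.tostring(el, method="xml", encoding=str)
```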

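Unsupported CSS

Escaping special characters with \, e.g. #a\ b and .a\.b (as of June 29, 2012)

Troubleshooting

HTML DOCTYPE

The DOCTYPE is not added automatically when an HTML document is serialized.^[1]^[2]^[3] It can be added manually: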

f.write(doc.getroottree().docinfo.doctype + '\n')
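A fuller sketch of the same workaround, building the output as a string instead of writing to a file. The input document is invented, and the exact behaviour may differ between lxml versions, so treat this as an illustration rather than a guaranteed result.

```python
import io
import lxml.html

# Hypothetical input with an HTML5 doctype.
source = io.BytesIO(b"<!DOCTYPE html><html><body><p>hi</p></body></html>")
doc = lxml.html.parse(source).getroot()

# Serializing the root element alone does not include the DOCTYPE.
body = lxml.html.tostring(doc, encoding=str)

# Prepend the DOCTYPE recorded in the tree's docinfo.
doctype = doc.getroottree().docinfo.doctype
output = doctype + "\n" + body
```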

DLL import error

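This can happen on Windows. It may also appear only inside GVIM while the import works from the command line. Downloading and installing the build linked here resolves it. Alternatively, use Resource Hacker to remove the embedded manifest resource.^[4]^[5]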

BUGS

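When given a URL, lxml.html.parse and lxml.etree.parse block the entire interpreter until the HTTP response headers arrive, because the GIL is not released (November 30, 2011; lxml 2.3.1, libxml2 2.7.8)
When reading from a file-like object, lxml.html.parse treats the data as latin1, and unicode objects are first encoded to UTF-8; lxml.etree.parse behaves correctly (November 30, 2011; lxml 2.3.1, libxml2 2.7.8)^[6]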

External links

Tutorials
- IBM: High-performance XML parsing in Python with lxml

References

[1] Bug #659367 in lxml: “missing doctype when serialized”

[2] python - Creating a doctype with lxml's etree - Stack Overflow

[3] python - lxml, missing doctype when serialized - Stack Overflow

[4] python - lxml: DLL load failed: The specified module could not be found - Stack Overflow

[5] Issue 4120: Do not embed manifest files in *.pyd when compiling with MSVC - Python tracker

[6] Bug #898072 in lxml: “lxml.html.parse treats encoding as Latin1 in Python 3 when reading from unicode file-objects directly”