annotate genshi/input.py @ 820:1837f39efd6f experimental-inline

Sync (old) experimental inline branch with trunk@1027.
author cmlenz
date Wed, 11 Mar 2009 17:51:06 +0000
parents 0742f421caba
children 09cc3627654c
rev   line source
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
1 # -*- coding: utf-8 -*-
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
2 #
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
3 # Copyright (C) 2006-2007 Edgewall Software
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
4 # All rights reserved.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
5 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
6 # This software is licensed as described in the file COPYING, which
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
7 # you should have received as part of this distribution. The terms
230
24757b771651 Renamed Markup to Genshi in repository.
cmlenz
parents: 213
diff changeset
8 # are also available at http://genshi.edgewall.org/wiki/License.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
9 #
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
10 # This software consists of voluntary contributions made by many
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
11 # individuals. For the exact contribution history, see the revision
230
24757b771651 Renamed Markup to Genshi in repository.
cmlenz
parents: 213
diff changeset
12 # history and logs, available at http://genshi.edgewall.org/log/.
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
13
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
14 """Support for constructing markup streams from files, strings, or other
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
15 sources.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
16 """
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
17
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
18 from itertools import chain
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
19 from xml.parsers import expat
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
20 import HTMLParser as html
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
21 import htmlentitydefs
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
22 from StringIO import StringIO
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
23
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
24 from genshi.core import Attrs, QName, Stream, stripentities
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
25 from genshi.core import START, END, XML_DECL, DOCTYPE, TEXT, START_NS, END_NS, \
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
26 START_CDATA, END_CDATA, PI, COMMENT
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
27
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
28 __all__ = ['ET', 'ParseError', 'XMLParser', 'XML', 'HTMLParser', 'HTML']
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
29 __docformat__ = 'restructuredtext en'
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
30
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
31 def ET(element):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
32 """Convert a given ElementTree element to a markup stream.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
33
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
34 :param element: an ElementTree element
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
35 :return: a markup stream
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
36 """
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
37 tag_name = QName(element.tag.lstrip('{'))
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
38 attrs = Attrs([(QName(attr.lstrip('{')), value)
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
39 for attr, value in element.items()])
290
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
40
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
41 yield START, (tag_name, attrs), (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
42 if element.text:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
43 yield TEXT, element.text, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
44 for child in element.getchildren():
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
45 for item in ET(child):
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
46 yield item
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
47 yield END, tag_name, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
48 if element.tail:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module.
cmlenz
parents: 230
diff changeset
49 yield TEXT, element.tail, (None, -1, -1)
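
The traversal order used by `ET()` above can be sketched without Genshi installed. This is a minimal Python 3 illustration, not the library's code: the string constants stand in for the `genshi.core` event kinds, plain strings and lists stand in for `QName` and `Attrs`, and `et_events` is a hypothetical name.

```python
import xml.etree.ElementTree as etree

# Stand-ins for the genshi.core event kinds (assumption: plain strings)
START, END, TEXT = 'START', 'END', 'TEXT'


def et_events(element):
    """Yield (kind, data, pos) tuples for an ElementTree element."""
    # ElementTree spells namespaced names '{uri}local'; as in ET(), only
    # the leading '{' is stripped, leaving 'uri}local'
    tag = element.tag.lstrip('{')
    attrs = [(name.lstrip('{'), value) for name, value in element.items()]
    yield START, (tag, attrs), (None, -1, -1)
    if element.text:
        yield TEXT, element.text, (None, -1, -1)
    # Iterate the element directly; getchildren() is gone in Python 3.9+
    for child in element:
        for item in et_events(child):
            yield item
    yield END, tag, (None, -1, -1)
    # Tail text belongs to the parent's content, so it follows the END event
    if element.tail:
        yield TEXT, element.tail, (None, -1, -1)


root = etree.fromstring('<root id="2"><child>Foo</child>bar</root>')
events = [(kind, data) for kind, data, pos in et_events(root)]
```

Positions are always `(None, -1, -1)` because an in-memory element tree carries no file name or line information.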
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
50
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
51
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
52 class ParseError(Exception):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
53 """Exception raised when fatal syntax errors are found in the input being
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
54 parsed.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
55 """
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
56
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
57 def __init__(self, message, filename=None, lineno=-1, offset=-1):
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
58 """Exception initializer.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
59
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
60 :param message: the error message from the parser
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
61 :param filename: the path to the file that was parsed
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
62 :param lineno: the number of the line on which the error was encountered
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
63 :param offset: the column number where the error was encountered
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
64 """
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
65 self.msg = message
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
66 if filename:
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
67 message += ', in ' + filename
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
68 Exception.__init__(self, message)
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
69 self.filename = filename or '<string>'
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
70 self.lineno = lineno
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
71 self.offset = offset
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
72
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
73
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
74 class XMLParser(object):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
75 """Generator-based XML parser based on roughly equivalent code in
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
76 Kid/ElementTree.
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
77
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
78 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
79
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
80 >>> parser = XMLParser(StringIO('<root id="2"><child>Foo</child></root>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
81 >>> for kind, data, pos in parser:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
82 ... print kind, data
326
08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again.
cmlenz
parents: 316
diff changeset
83 START (QName(u'root'), Attrs([(QName(u'id'), u'2')]))
08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again.
cmlenz
parents: 316
diff changeset
84 START (QName(u'child'), Attrs())
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
85 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
86 END child
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
87 END root
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
88 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
89
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
90 _entitydefs = ['<!ENTITY %s "&#%d;">' % (name, value) for name, value in
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
91 htmlentitydefs.name2codepoint.items()]
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
92 _external_dtd = '\n'.join(_entitydefs)
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
93
316
4ab9edf5e83b Configurable encoding of template files, closing #65.
cmlenz
parents: 312
diff changeset
94 def __init__(self, source, filename=None, encoding=None):
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
95 """Initialize the parser for the given XML input.
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
96
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
97 :param source: the XML text as a file-like object
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
98 :param filename: the name of the file, if appropriate
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
99 :param encoding: the encoding of the file; if not specified, the
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
100 encoding is assumed to be ASCII, UTF-8, or UTF-16, or
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
101 whatever the encoding specified in the XML declaration
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
102 (if any)
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
103 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
104 self.source = source
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
105 self.filename = filename
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
106
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
107 # Setup the Expat parser
316
4ab9edf5e83b Configurable encoding of template files, closing #65.
cmlenz
parents: 312
diff changeset
108 parser = expat.ParserCreate(encoding, '}')
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
109 parser.buffer_text = True
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
110 parser.returns_unicode = True
160
faea6db52ef1 Attribute order in parsed XML is now preserved.
cmlenz
parents: 146
diff changeset
111 parser.ordered_attributes = True
faea6db52ef1 Attribute order in parsed XML is now preserved.
cmlenz
parents: 146
diff changeset
112
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
113 parser.StartElementHandler = self._handle_start
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
114 parser.EndElementHandler = self._handle_end
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
115 parser.CharacterDataHandler = self._handle_data
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
116 parser.StartDoctypeDeclHandler = self._handle_doctype
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
117 parser.StartNamespaceDeclHandler = self._handle_start_ns
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
118 parser.EndNamespaceDeclHandler = self._handle_end_ns
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
119 parser.StartCdataSectionHandler = self._handle_start_cdata
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
120 parser.EndCdataSectionHandler = self._handle_end_cdata
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
121 parser.ProcessingInstructionHandler = self._handle_pi
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
122 parser.XmlDeclHandler = self._handle_xml_decl
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
123 parser.CommentHandler = self._handle_comment
209
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
124
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
125 # Tell Expat that we'll handle non-XML entities ourselves
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
126 # (in _handle_other)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
127 parser.DefaultHandler = self._handle_other
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
128 parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
209
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
129 parser.UseForeignDTD()
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
130 parser.ExternalEntityRefHandler = self._build_foreign
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
131
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
132 self.expat = parser
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
133 self._queue = []
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
134
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
135 def parse(self):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
136 """Generator that parses the XML source, yielding markup events.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
137
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
138 :return: a markup event stream
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
139 :raises ParseError: if the XML text is not well formed
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
140 """
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
141 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
142 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
143 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
144 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
145 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
146 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
147 data = self.source.read(bufsize)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
148 if data == '': # end of data
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
149 if hasattr(self, 'expat'):
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
150 self.expat.Parse('', True)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
151 del self.expat # get rid of circular references
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
152 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
153 else:
207
0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43.
cmlenz
parents: 182
diff changeset
154 if isinstance(data, unicode):
0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43.
cmlenz
parents: 182
diff changeset
155 data = data.encode('utf-8')
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
156 self.expat.Parse(data, False)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
157 for event in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
158 yield event
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
159 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
160 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
161 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
162 except expat.ExpatError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
163 msg = str(e)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
164 raise ParseError(msg, self.filename, e.lineno, e.offset)
146
db0dacc1239a Simplified `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
165 return Stream(_generate()).filter(_coalesce)
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
166
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
167 def __iter__(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
168 return iter(self.parse())
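
The read-feed-drain loop in `parse()` can be illustrated in isolation. The following is a simplified Python 3 sketch under stated assumptions: `iter_events` is a hypothetical name, events are reduced to `(kind, value)` pairs, and error translation and text coalescing are left out.

```python
import io
from xml.parsers import expat


def iter_events(source, bufsize=4 * 1024):
    """Feed `source` to Expat in chunks and yield the queued events."""
    queue = []
    parser = expat.ParserCreate()
    parser.buffer_text = True
    # Handlers only run inside Parse(), so they just append to the queue
    parser.StartElementHandler = lambda name, attrs: queue.append(('START', name))
    parser.EndElementHandler = lambda name: queue.append(('END', name))
    parser.CharacterDataHandler = lambda text: queue.append(('TEXT', text))

    done = False
    while not done:
        data = source.read(bufsize)
        if not data:
            # Tell Expat that the input is complete
            parser.Parse(b'', True)
            done = True
        else:
            parser.Parse(data, False)
        # Drain whatever the handlers queued during this chunk
        for event in queue:
            yield event
        del queue[:]


result = list(iter_events(io.BytesIO(b'<root id="2"><child>Foo</child></root>')))
```

With a small `bufsize`, a run of text that straddles a chunk boundary can arrive as two `TEXT` events, which is the case the `_coalesce` filter in the original addresses.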
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
169
293
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
170 def _build_foreign(self, context, base, sysid, pubid):
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
171 parser = self.expat.ExternalEntityParserCreate(context)
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
172 parser.ParseFile(StringIO(self._external_dtd))
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
173 return 1
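
The foreign-DTD technique behind `_build_foreign` can be tried on its own. This is a hedged Python 3 sketch, not the module's code: `parse_text` is a hypothetical helper, and unlike `XMLParser` it sets no `DefaultHandler`, so Expat expands the entity references itself and passes the result to the character data handler.

```python
import io
from html.entities import name2codepoint
from xml.parsers import expat

# One <!ENTITY> declaration per HTML entity name, as in _entitydefs
EXTERNAL_DTD = '\n'.join(
    '<!ENTITY %s "&#%d;">' % (name, value)
    for name, value in name2codepoint.items()
).encode('ascii')


def parse_text(xml):
    """Return the character data of `xml`, resolving HTML entity names."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    # Pretend the document has an external DTD even without a DOCTYPE
    parser.UseForeignDTD()

    def build_foreign(context, base, sysid, pubid):
        # Supply the entity declarations as the content of that DTD
        sub = parser.ExternalEntityParserCreate(context)
        sub.ParseFile(io.BytesIO(EXTERNAL_DTD))
        return 1  # non-zero signals success to Expat

    parser.ExternalEntityRefHandler = build_foreign
    parser.Parse(xml, True)
    return ''.join(chunks)


text = parse_text('<p>caf&eacute; &nbsp;</p>')
```

Without the foreign DTD, Expat would reject `&eacute;` as an undefined entity, since XML predefines only `lt`, `gt`, `amp`, `quot`, and `apos`.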
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem].
cmlenz
parents: 290
diff changeset
174
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
175 def _enqueue(self, kind, data=None, pos=None):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
176 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
177 pos = self._getpos()
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
178 if kind is TEXT:
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
179 # Expat reports the *end* of the text event as current position. We
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
180 # try to fix that up here as much as possible. Unfortunately, the
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
181 # offset is only valid for single-line text. For multi-line text,
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
182 # it is apparently not possible to determine at what offset it
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
183 # started
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
184 if '\n' in data:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
185 lines = data.splitlines()
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
186 lineno = pos[1] - len(lines) + 1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
187 offset = -1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
188 else:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
189 lineno = pos[1]
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
190 offset = pos[2] - len(data)
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
191 pos = (pos[0], lineno, offset)
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
192 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
193
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
194 def _getpos_unknown(self):
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
195 return (self.filename, -1, -1)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
196
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
197 def _getpos(self):
134
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though).
cmlenz
parents: 69
diff changeset
198 return (self.filename, self.expat.CurrentLineNumber,
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
199 self.expat.CurrentColumnNumber)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
200
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
201 def _handle_start(self, tag, attrib):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
202 attrs = Attrs([(QName(name), value) for name, value in
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
203 zip(*[iter(attrib)] * 2)])
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
204 self._enqueue(START, (QName(tag), attrs))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
205
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
206 def _handle_end(self, tag):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
207 self._enqueue(END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
208
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
209 def _handle_data(self, text):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
210 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
211
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
212 def _handle_xml_decl(self, version, encoding, standalone):
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
213 self._enqueue(XML_DECL, (version, encoding, standalone))
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
214
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
215 def _handle_doctype(self, name, sysid, pubid, has_internal_subset):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
216 self._enqueue(DOCTYPE, (name, pubid, sysid))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
217
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
218 def _handle_start_ns(self, prefix, uri):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
219 self._enqueue(START_NS, (prefix or '', uri))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
220
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
221 def _handle_end_ns(self, prefix):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
222 self._enqueue(END_NS, prefix or '')
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
223
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
224 def _handle_start_cdata(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
225 self._enqueue(START_CDATA)
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
226
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
227 def _handle_end_cdata(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
228 self._enqueue(END_CDATA)
143
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24.
cmlenz
parents: 140
diff changeset
229
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
230 def _handle_pi(self, target, data):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
231 self._enqueue(PI, (target, data))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
232
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
233 def _handle_comment(self, text):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
234 self._enqueue(COMMENT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
235
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
236 def _handle_other(self, text):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
237 if text.startswith('&'):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
238 # deal with undefined entities
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
239 try:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
240 text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
241 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
242 except KeyError:
209
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
243 filename, lineno, offset = self._getpos()
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
244 error = expat.error('undefined entity "%s": line %d, column %d'
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
245 % (text, lineno, offset))
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
246 error.code = expat.errors.XML_ERROR_UNDEFINED_ENTITY
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
247 error.lineno = lineno
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
248 error.offset = offset
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC.
cmlenz
parents: 207
diff changeset
249 raise error
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
250
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
251
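The text-position fix-up performed in `XMLParser._enqueue` (source lines 179-191) can be tried in isolation. The following is a minimal sketch, not part of Genshi: the helper name `fix_text_pos` is hypothetical, and it only assumes that `pos` is a `(filename, lineno, offset)` tuple describing where the text event *ended*, as the comment in the source states.

```python
def fix_text_pos(data, pos):
    """Estimate where a text event started, given where it ended.

    `pos` is assumed to be (filename, end_line, end_column).
    """
    filename, end_line, end_col = pos
    if '\n' in data:
        # Multi-line text: walk back by the number of lines spanned.
        # The start column cannot be recovered, so it is reported as -1.
        # Note that splitlines() does not count a trailing newline as an
        # extra line, which is one reason the result is only approximate.
        lines = data.splitlines()
        return (filename, end_line - len(lines) + 1, -1)
    # Single-line text: the start column is the end column minus the length.
    return (filename, end_line, end_col - len(data))


print(fix_text_pos('Foo', ('doc.xml', 3, 10)))
print(fix_text_pos('a\nb\nc', ('doc.xml', 5, 1)))
```

For single-line text the start column is exact; for multi-line text only the line number is estimated, which matches the "not quite perfect yet" caveat in the changeset message.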
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
252 def XML(text):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
253 """Parse the given XML source and return a markup stream.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
254
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
255 Unlike with `XMLParser`, the returned stream is reusable, meaning it can be
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
256 iterated over multiple times:
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
257
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
258 >>> xml = XML('<doc><elem>Foo</elem><elem>Bar</elem></doc>')
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
259 >>> print xml
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
260 <doc><elem>Foo</elem><elem>Bar</elem></doc>
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
261 >>> print xml.select('elem')
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
262 <elem>Foo</elem><elem>Bar</elem>
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
263 >>> print xml.select('elem/text()')
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
264 FooBar
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
265
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
266 :param text: the XML source
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
267 :return: the parsed XML event stream
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
268 :raises ParseError: if the XML text is not well-formed
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
269 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
270 return Stream(list(XMLParser(StringIO(text))))
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
271
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
272
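The `XML()` docstring says the returned stream is reusable because the parser output is first materialized with `list(...)`. The sketch below illustrates that difference using plain Python only; the `events` generator is a hypothetical stand-in for a parser and does not use any Genshi API.

```python
def events():
    """Hypothetical stand-in for a parser that yields events lazily."""
    yield ('START', 'doc')
    yield ('TEXT', 'Foo')
    yield ('END', 'doc')


def consume_twice(iterable):
    """Iterate over the same object twice and return both results."""
    return list(iterable), list(iterable)


# A generator is exhausted after the first pass.
one_shot_first, one_shot_second = consume_twice(events())

# A list of events can be iterated any number of times.
reusable_first, reusable_second = consume_twice(list(events()))

print(len(one_shot_first), len(one_shot_second))
print(len(reusable_first), len(reusable_second))
```

The trade-off is memory: buffering the events in a list keeps the whole document's event sequence alive, whereas iterating a parser directly processes events as they are produced.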
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
273 class HTMLParser(html.HTMLParser, object):
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
274 """Parser for HTML input based on the Python `HTMLParser` module.
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
275
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
276 This class provides the same interface for generating stream events as
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
277 `XMLParser`, and attempts to automatically balance tags.
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
278
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
279 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
280
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
281 >>> parser = HTMLParser(StringIO('<UL compact><LI>Foo</UL>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
282 >>> for kind, data, pos in parser:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
283 ... print kind, data
326
08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again.
cmlenz
parents: 316
diff changeset
284 START (QName(u'ul'), Attrs([(QName(u'compact'), u'compact')]))
08ada6b4b767 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again.
cmlenz
parents: 316
diff changeset
285 START (QName(u'li'), Attrs())
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
286 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
287 END li
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
288 END ul
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
289 """
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
290
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
291 _EMPTY_ELEMS = frozenset(['area', 'base', 'basefont', 'br', 'col', 'frame',
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
292 'hr', 'img', 'input', 'isindex', 'link', 'meta',
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
293 'param'])
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
294
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
295 def __init__(self, source, filename=None, encoding='utf-8'):
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
296 """Initialize the parser for the given HTML input.
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
297
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
298 :param source: the HTML text as a file-like object
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
299 :param filename: the name of the file, if known
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
300 :param encoding: encoding of the file; ignored if the input is unicode
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
301 """
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
302 html.HTMLParser.__init__(self)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
303 self.source = source
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
304 self.filename = filename
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
305 self.encoding = encoding
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
306 self._queue = []
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
307 self._open_tags = []
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
308
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
309 def parse(self):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
310 """Generator that parses the HTML source, yielding markup events.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
311
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
312 :return: a markup event stream
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
313 :raises ParseError: if the HTML text is not well-formed
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
314 """
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
315 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
316 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
317 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
318 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
319 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
320 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
321 data = self.source.read(bufsize)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
322 if data == '': # end of data
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
323 self.close()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
324 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
325 else:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
326 self.feed(data)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
327 for kind, data, pos in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
328 yield kind, data, pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
329 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
330 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
331 open_tags = self._open_tags
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
332 open_tags.reverse()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
333 for tag in open_tags:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
334 yield END, QName(tag), pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
335 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
336 except html.HTMLParseError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
337 msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
338 raise ParseError(msg, self.filename, e.lineno, e.offset)
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
339 return Stream(_generate()).filter(_coalesce)
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
340
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
341 def __iter__(self):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
342 return iter(self.parse())
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
343
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
344 def _enqueue(self, kind, data, pos=None):
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
345 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
346 pos = self._getpos()
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
347 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
348
21
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
349 def _getpos(self):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
350 lineno, column = self.getpos()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3.
cmlenz
parents: 1
diff changeset
351 return (self.filename, lineno, column)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
352
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
353 def handle_starttag(self, tag, attrib):
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
354 fixed_attrib = []
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
355 for name, value in attrib: # Fixup minimized attributes
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
356 if value is None:
312
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`.
cmlenz
parents: 311
diff changeset
357 value = unicode(name)
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`.
cmlenz
parents: 311
diff changeset
358 elif not isinstance(value, unicode):
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`.
cmlenz
parents: 311
diff changeset
359 value = value.decode(self.encoding, 'replace')
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
360 fixed_attrib.append((QName(name), stripentities(value)))
26
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file.
cmlenz
parents: 21
diff changeset
361
182
41db0260ebb1 Renamed `Attributes` to `Attrs` to reduce the verbosity.
cmlenz
parents: 160
diff changeset
362 self._enqueue(START, (QName(tag), Attrs(fixed_attrib)))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
363 if tag in self._EMPTY_ELEMS:
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
364 self._enqueue(END, QName(tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
365 else:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
366 self._open_tags.append(tag)
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
367
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
368 def handle_endtag(self, tag):
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
369 if tag not in self._EMPTY_ELEMS:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
370 while self._open_tags:
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
371 open_tag = self._open_tags.pop()
395
55cf81951686 inline branch: Merged [439:479/trunk].
cmlenz
parents: 326
diff changeset
372 self._enqueue(END, QName(open_tag))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
373 if open_tag.lower() == tag.lower():
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
374 break
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
375
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
376 def handle_data(self, text):
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
377 if not isinstance(text, unicode):
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
378 text = text.decode(self.encoding, 'replace')
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
379 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
380
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
381 def handle_charref(self, name):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
382 if name.lower().startswith('x'):
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
383 text = unichr(int(name[1:], 16))
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
384 else:
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
385 text = unichr(int(name))
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
386 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
387
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
388 def handle_entityref(self, name):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
389 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
390 text = unichr(htmlentitydefs.name2codepoint[name])
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
391 except KeyError:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
392 text = '&%s;' % name
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
393 self._enqueue(TEXT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
394
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
395 def handle_pi(self, data):
395
55cf81951686 inline branch: Merged [439:479/trunk].
cmlenz
parents: 326
diff changeset
396 target, data = data.split(None, 1)
55cf81951686 inline branch: Merged [439:479/trunk].
cmlenz
parents: 326
diff changeset
397 if data.endswith('?'):
55cf81951686 inline branch: Merged [439:479/trunk].
cmlenz
parents: 326
diff changeset
398 data = data[:-1]
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
399 self._enqueue(PI, (target.strip(), data.strip()))
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
400
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
401 def handle_comment(self, text):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
402 self._enqueue(COMMENT, text)
1
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
403
821114ec4f69 Initial import.
cmlenz
parents:
diff changeset
404
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
405 def HTML(text, encoding='utf-8'):
500
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
406 """Parse the given HTML source and return a markup stream.
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
407
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
408 Unlike with `HTMLParser`, the returned stream is reusable, meaning it can be
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
409 iterated over multiple times:
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
410
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
411 >>> html = HTML('<body><h1>Foo</h1></body>')
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
412 >>> print html
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
413 <body><h1>Foo</h1></body>
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
414 >>> print html.select('h1')
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
415 <h1>Foo</h1>
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
416 >>> print html.select('h1/text()')
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
417 Foo
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
418
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
419 :param text: the HTML source
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
420 :return: the parsed XML event stream
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
421 :raises ParseError: if the HTML text is not well-formed, and error recovery
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
422 fails
0742f421caba Merged revisions 487-603 via svnmerge from
cmlenz
parents: 395
diff changeset
423 """
311
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8).
cmlenz
parents: 293
diff changeset
424 return Stream(list(HTMLParser(StringIO(text), encoding=encoding)))
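The docstring's point about reusability follows from line 424 wrapping the parser's events in `list(...)`. A minimal sketch of the difference, using a hypothetical `parse_events` generator as a stand-in for a one-shot parser (no Genshi classes are involved):

```python
def parse_events():
    """Hypothetical stand-in for a parser that yields events lazily."""
    yield ('START', 'h1')
    yield ('TEXT', 'Foo')
    yield ('END', 'h1')


# A generator can be consumed only once.
one_shot = parse_events()
first_pass = list(one_shot)
second_pass = list(one_shot)  # already exhausted, so this is empty

# Materializing the events into a list makes them re-iterable.
reusable = list(parse_events())
```

The cost of the list is that the whole document is parsed and held in memory before the first event is seen.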
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
425
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
426 def _coalesce(stream):
144
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26.
cmlenz
parents: 143
diff changeset
427 """Coalesces adjacent TEXT events into a single event."""
146
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
428 textbuf = []
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
429 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
430 for kind, data, pos in chain(stream, [(None, None, None)]):
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
431 if kind is TEXT:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
432 textbuf.append(data)
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
433 if textpos is None:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
434 textpos = pos
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
435 else:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
436 if textbuf:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
437 yield TEXT, u''.join(textbuf), textpos
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
438 del textbuf[:]
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
439 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
440 if kind:
db0dacc1239a Simplifed `CoalesceFilter` (now a function)
cmlenz
parents: 145
diff changeset
441 yield kind, data, pos
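The sentinel technique used by `_coalesce` can be exercised outside Genshi. The sketch below is an assumption-laden stand-in: `TEXT`, `START` and `END` are plain strings rather than Genshi's event kinds, `coalesce` is a hypothetical name, and kinds are compared with `==` instead of `is`.

```python
from itertools import chain

# Stand-in event kinds (assumptions, not Genshi's own constants).
TEXT, START, END = 'TEXT', 'START', 'END'


def coalesce(stream):
    """Merge runs of adjacent TEXT events into single events."""
    textbuf = []
    textpos = None
    # The trailing (None, None, None) sentinel forces a final flush,
    # so text at the very end of the stream is not lost.
    for kind, data, pos in chain(stream, [(None, None, None)]):
        if kind == TEXT:
            textbuf.append(data)
            if textpos is None:
                textpos = pos  # keep the position of the first chunk
        else:
            if textbuf:
                yield TEXT, ''.join(textbuf), textpos
                del textbuf[:]
                textpos = None
            if kind:  # skip the sentinel itself
                yield kind, data, pos


events = [
    (START, 'p', 1),
    (TEXT, 'Hello, ', 2),
    (TEXT, 'world', 3),
    (END, 'p', 4),
    (TEXT, 'tail', 5),
]
merged = list(coalesce(events))
```

The merged text event carries the position of its first chunk, and the sentinel means the loop body needs no separate flush after the loop ends.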