genshi/mirror: genshi/input.py annotate

annotate genshi/input.py @ 458:5f5b227b04be trunk

The `ET()` function now correctly handles attributes with a namespace.
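
As a rough illustration of what this changeset is about: ElementTree stores a namespaced name in Clark notation, `{uri}local`, and it does so for attribute names as well as tag names. The sketch below uses only the standard library and modern Python 3 syntax (the annotated module itself is Python 2). `split_clark` is a hypothetical helper, not part of Genshi; it stands in for what `QName(name.lstrip('{'))` does with the remaining `uri}local` string.

```python
import xml.etree.ElementTree as etree


def split_clark(name):
    """Split '{uri}local' into (uri, local); uri is None when unqualified."""
    if name.startswith('{'):
        uri, local = name[1:].split('}', 1)
        return uri, local
    return None, name


elem = etree.fromstring(
    '<a xmlns:xl="http://www.w3.org/1999/xlink" xl:href="#top" id="x"/>'
)
# Attribute names come back in Clark notation when they carry a namespace.
attrs = [(split_clark(name), value) for name, value in elem.items()]
```

Before this revision, only the tag name was treated this way, so a namespaced attribute would presumably have kept its leading `{`.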

author	cmlenz
date	Tue, 17 Apr 2007 18:35:29 +0000
parents	5692bc32ba5f
children	75425671b437

rev	line source
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	1 # -*- coding: utf-8 -*-
5479aae32f5a Initial import. cmlenz parents: diff changeset	2 #
408 4675d5cf6c67 Update copyright year for files modified this year. cmlenz parents: 403 diff changeset	3 # Copyright (C) 2006-2007 Edgewall Software
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	4 # All rights reserved.
5479aae32f5a Initial import. cmlenz parents: diff changeset	5 #
5479aae32f5a Initial import. cmlenz parents: diff changeset	6 # This software is licensed as described in the file COPYING, which
5479aae32f5a Initial import. cmlenz parents: diff changeset	7 # you should have received as part of this distribution. The terms
230 84168828b074 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	8 # are also available at http://genshi.edgewall.org/wiki/License.
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	9 #
5479aae32f5a Initial import. cmlenz parents: diff changeset	10 # This software consists of voluntary contributions made by many
5479aae32f5a Initial import. cmlenz parents: diff changeset	11 # individuals. For the exact contribution history, see the revision
230 84168828b074 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	12 # history and logs, available at http://genshi.edgewall.org/log/.
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	13
425 073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	14 """Support for constructing markup streams from files, strings, or other
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	15 sources.
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	16 """
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	17
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	18 from itertools import chain
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	19 from xml.parsers import expat
5479aae32f5a Initial import. cmlenz parents: diff changeset	20 try:
5479aae32f5a Initial import. cmlenz parents: diff changeset	21 frozenset
5479aae32f5a Initial import. cmlenz parents: diff changeset	22 except NameError:
5479aae32f5a Initial import. cmlenz parents: diff changeset	23 from sets import ImmutableSet as frozenset
5479aae32f5a Initial import. cmlenz parents: diff changeset	24 import HTMLParser as html
5479aae32f5a Initial import. cmlenz parents: diff changeset	25 import htmlentitydefs
5479aae32f5a Initial import. cmlenz parents: diff changeset	26 from StringIO import StringIO
5479aae32f5a Initial import. cmlenz parents: diff changeset	27
293 e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	28 from genshi.core import Attrs, QName, Stream, stripentities
230 84168828b074 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	29 from genshi.core import DOCTYPE, START, END, START_NS, END_NS, TEXT, \
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	30 START_CDATA, END_CDATA, PI, COMMENT
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	31
290 94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	32 __all__ = ['ET', 'ParseError', 'XMLParser', 'XML', 'HTMLParser', 'HTML']
425 073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	33 __docformat__ = 'restructuredtext en'
290 94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	34
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	35 def ET(element):
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	36 """Convert a given ElementTree element to a markup stream.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	37
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	38 :param element: an ElementTree element
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	39 :return: a markup stream
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	40 """
290 94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	41 tag_name = QName(element.tag.lstrip('{'))
458 5f5b227b04be The `ET()` function now correctly handles attributes with a namespace. cmlenz parents: 434 diff changeset	42 attrs = Attrs([(QName(attr.lstrip('{')), value)
5f5b227b04be The `ET()` function now correctly handles attributes with a namespace. cmlenz parents: 434 diff changeset	43 for attr, value in element.items()])
290 94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	44
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	45 yield START, (tag_name, attrs), (None, -1, -1)
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	46 if element.text:
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	47 yield TEXT, element.text, (None, -1, -1)
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	48 for child in element.getchildren():
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	49 for item in ET(child):
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	50 yield item
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	51 yield END, tag_name, (None, -1, -1)
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	52 if element.tail:
94f9f2cc66c8 Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	53 yield TEXT, element.tail, (None, -1, -1)
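
The traversal in `ET()` above can be sketched without Genshi, as a minimal stand-in that yields plain tuples. Assumptions: Python 3, string constants in place of the `genshi.core` event kinds, plain strings in place of `QName` and `Attrs`, no position tuple, and iteration over the element itself instead of `getchildren()` (which newer Python versions no longer provide). `et_events` is a hypothetical name.

```python
import xml.etree.ElementTree as etree

START, TEXT, END = 'START', 'TEXT', 'END'


def et_events(element):
    """Yield (kind, data) pairs for an element and its subtree, depth first."""
    tag = element.tag.lstrip('{')
    attrs = [(name.lstrip('{'), value) for name, value in element.items()]
    yield START, (tag, attrs)
    if element.text:
        yield TEXT, element.text
    for child in element:
        # Each child emits its own END and then its tail text.
        yield from et_events(child)
    yield END, tag
    if element.tail:
        yield TEXT, element.tail


root = etree.fromstring('<p>Hello <b>world</b>!</p>')
events = list(et_events(root))
```

Note that the tail text of a child (`!` here) is emitted after the child's `END` event and before the parent's, which is what keeps the events in document order.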
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	54
5479aae32f5a Initial import. cmlenz parents: diff changeset	55
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	56 class ParseError(Exception):
b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	57 """Exception raised when fatal syntax errors are found in the input being
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	58 parsed.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	59 """
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	60
422 5d08a744636e More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	61 def __init__(self, message, filename=None, lineno=-1, offset=-1):
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	62 """Exception initializer.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	63
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	64 :param message: the error message from the parser
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	65 :param filename: the path to the file that was parsed
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	66 :param lineno: the number of the line on which the error was encountered
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	67 :param offset: the column number where the error was encountered
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	68 """
422 5d08a744636e More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	69 self.msg = message
5d08a744636e More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	70 if filename:
434 5692bc32ba5f * Better method to propogate the full path to the template file on parse errors. Supersedes r513. cmlenz parents: 433 diff changeset	71 message += ', in ' + filename
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	72 Exception.__init__(self, message)
422 5d08a744636e More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	73 self.filename = filename or '<string>'
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	74 self.lineno = lineno
b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	75 self.offset = offset
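
A short usage sketch for the exception above, assuming Python 3. The `ParseError` class here is a local stand-in that copies the constructor shown in the listing, and `check_well_formed` is a hypothetical helper; neither is imported from Genshi.

```python
from xml.parsers import expat


class ParseError(Exception):
    """Local stand-in mirroring the constructor in the listing above."""

    def __init__(self, message, filename=None, lineno=-1, offset=-1):
        self.msg = message  # original parser message, without the file name
        if filename:
            message += ', in ' + filename
        Exception.__init__(self, message)
        self.filename = filename or '<string>'
        self.lineno = lineno
        self.offset = offset


def check_well_formed(text, filename=None):
    """Parse text with Expat and translate its error into ParseError."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except expat.ExpatError as e:
        raise ParseError(str(e), filename, e.lineno, e.offset)


caught = None
try:
    check_well_formed('<root><child></root>', 'broken.xml')
except ParseError as err:
    caught = err
```

The file name ends up in `str(caught)`, while `caught.msg` keeps the bare parser message.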
b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	76
b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	77
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	78 class XMLParser(object):
5479aae32f5a Initial import. cmlenz parents: diff changeset	79 """Generator-based XML parser based on roughly equivalent code in
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	80 Kid/ElementTree.
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	81
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	82 The parsing is initiated by iterating over the parser object:
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	83
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	84 >>> parser = XMLParser(StringIO('<root id="2"><child>Foo</child></root>'))
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	85 >>> for kind, data, pos in parser:
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	86 ... print kind, data
326 f999da894391 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	87 START (QName(u'root'), Attrs([(QName(u'id'), u'2')]))
f999da894391 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	88 START (QName(u'child'), Attrs())
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	89 TEXT Foo
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	90 END child
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	91 END root
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	92 """
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	93
293 e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	94 _entitydefs = ['<!ENTITY %s "&#%d;">' % (name, value) for name, value in
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	95 htmlentitydefs.name2codepoint.items()]
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	96 _external_dtd = '\n'.join(_entitydefs)
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	97
316 a946edefac40 Configurable encoding of template files, closing #65. cmlenz parents: 312 diff changeset	98 def __init__(self, source, filename=None, encoding=None):
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	99 """Initialize the parser for the given XML input.
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	100
425 073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	101 :param source: the XML text as a file-like object
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	102 :param filename: the name of the file, if appropriate
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	103 :param encoding: the encoding of the file; if not specified, the
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	104 encoding is assumed to be ASCII, UTF-8, or UTF-16, or
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	105 whatever encoding is specified in the XML declaration
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	106 (if any)
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	107 """
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	108 self.source = source
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	109 self.filename = filename
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	110
5479aae32f5a Initial import. cmlenz parents: diff changeset	111 # Set up the Expat parser
316 a946edefac40 Configurable encoding of template files, closing #65. cmlenz parents: 312 diff changeset	112 parser = expat.ParserCreate(encoding, '}')
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	113 parser.buffer_text = True
5479aae32f5a Initial import. cmlenz parents: diff changeset	114 parser.returns_unicode = True
160 d19e8a2c549e Attribute order in parsed XML is now preserved. cmlenz parents: 146 diff changeset	115 parser.ordered_attributes = True
d19e8a2c549e Attribute order in parsed XML is now preserved. cmlenz parents: 146 diff changeset	116
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	117 parser.StartElementHandler = self._handle_start
5479aae32f5a Initial import. cmlenz parents: diff changeset	118 parser.EndElementHandler = self._handle_end
5479aae32f5a Initial import. cmlenz parents: diff changeset	119 parser.CharacterDataHandler = self._handle_data
5479aae32f5a Initial import. cmlenz parents: diff changeset	120 parser.StartDoctypeDeclHandler = self._handle_doctype
5479aae32f5a Initial import. cmlenz parents: diff changeset	121 parser.StartNamespaceDeclHandler = self._handle_start_ns
5479aae32f5a Initial import. cmlenz parents: diff changeset	122 parser.EndNamespaceDeclHandler = self._handle_end_ns
143 3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	123 parser.StartCdataSectionHandler = self._handle_start_cdata
3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	124 parser.EndCdataSectionHandler = self._handle_end_cdata
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	125 parser.ProcessingInstructionHandler = self._handle_pi
5479aae32f5a Initial import. cmlenz parents: diff changeset	126 parser.CommentHandler = self._handle_comment
209 fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	127
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	128 # Tell Expat that we'll handle non-XML entities ourselves
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	129 # (in _handle_other)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	130 parser.DefaultHandler = self._handle_other
293 e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	131 parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
209 fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	132 parser.UseForeignDTD()
293 e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	133 parser.ExternalEntityRefHandler = self._build_foreign
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	134
5479aae32f5a Initial import. cmlenz parents: diff changeset	135 # Location reporting is only supported in Python >= 2.4
5479aae32f5a Initial import. cmlenz parents: diff changeset	136 if not hasattr(parser, 'CurrentLineNumber'):
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	137 self._getpos = self._getpos_unknown
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	138
5479aae32f5a Initial import. cmlenz parents: diff changeset	139 self.expat = parser
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	140 self._queue = []
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	141
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	142 def parse(self):
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	143 """Generator that parses the XML source, yielding markup events.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	144
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	145 :return: a markup event stream
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	146 :raises ParseError: if the XML text is not well formed
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	147 """
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	148 def _generate():
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	149 try:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	150 bufsize = 4 * 1024 # 4K
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	151 done = False
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	152 while 1:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	153 while not done and len(self._queue) == 0:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	154 data = self.source.read(bufsize)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	155 if data == '': # end of data
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	156 if hasattr(self, 'expat'):
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	157 self.expat.Parse('', True)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	158 del self.expat # get rid of circular references
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	159 done = True
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	160 else:
207 28bfc6aafab7 The `XMLParser` now correctly handles unicode input. Closes #43. cmlenz parents: 182 diff changeset	161 if isinstance(data, unicode):
28bfc6aafab7 The `XMLParser` now correctly handles unicode input. Closes #43. cmlenz parents: 182 diff changeset	162 data = data.encode('utf-8')
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	163 self.expat.Parse(data, False)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	164 for event in self._queue:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	165 yield event
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	166 self._queue = []
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	167 if done:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	168 break
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	169 except expat.ExpatError, e:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	170 msg = str(e)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	171 raise ParseError(msg, self.filename, e.lineno, e.offset)
146 04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	172 return Stream(_generate()).filter(_coalesce)
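
The buffering loop in `parse()` can be sketched on its own: Expat is fed one chunk at a time, its callbacks append to a queue, and the generator drains that queue after each chunk. This is a simplified sketch assuming Python 3; `iter_events` is a hypothetical name, the events are plain tuples, and there is no position tracking, error translation, or coalescing step.

```python
import io
from xml.parsers import expat


def iter_events(source, bufsize=4 * 1024):
    """Yield (kind, data) pairs while reading source in bufsize chunks."""
    queue = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda tag, attrs: queue.append(('START', tag))
    parser.EndElementHandler = lambda tag: queue.append(('END', tag))
    parser.CharacterDataHandler = lambda text: queue.append(('TEXT', text))

    done = False
    while not done:
        data = source.read(bufsize)
        if data:
            parser.Parse(data, False)  # more input may follow
        else:
            parser.Parse(b'', True)    # signal end of input
            done = True
        # Handlers only run inside Parse(), so the queue is stable here.
        for event in queue:
            yield event
        del queue[:]


events = list(iter_events(io.BytesIO(b'<root><child>Foo</child></root>'),
                          bufsize=8))
```

With a small buffer, Expat may report one run of text as several `TEXT` events when it crosses a chunk boundary, which is the situation the `_coalesce` filter in the original is there to clean up.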
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	173
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	174 def __iter__(self):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	175 return iter(self.parse())
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	176
293 e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	177 def _build_foreign(self, context, base, sysid, pubid):
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	178 parser = self.expat.ExternalEntityParserCreate(context)
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	179 parser.ParseFile(StringIO(self._external_dtd))
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	180 return 1
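
The foreign-DTD mechanism used by `_build_foreign` can be tried in isolation. The sketch below assumes Python 3 (`html.entities` instead of `htmlentitydefs`) and a hypothetical function `parse_attrs`; it wires up the same Expat calls as the listing so that HTML entity references inside attribute values resolve through a generated DTD.

```python
from html.entities import name2codepoint
from xml.parsers import expat

# One <!ENTITY ...> declaration per HTML entity name, as in the listing.
DTD = '\n'.join('<!ENTITY %s "&#%d;">' % item
                for item in name2codepoint.items())


def parse_attrs(xml):
    """Return (tag, attrs) pairs, resolving HTML entities in attributes."""
    seen = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda tag, attrs: seen.append((tag, attrs))
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.UseForeignDTD()  # request an external subset even without DOCTYPE

    def build_foreign(context, base, sysid, pubid):
        # Feed the generated declarations to a sub-parser for the subset.
        sub = parser.ExternalEntityParserCreate(context)
        sub.Parse(DTD, True)
        return 1

    parser.ExternalEntityRefHandler = build_foreign
    parser.Parse(xml, True)
    return seen


result = parse_attrs('<p title="caf&eacute;"/>')
```

Without the foreign DTD, Expat would normally reject `&eacute;` in the attribute value as an undefined entity.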
e17b7459b515 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	181
143 3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	182 def _enqueue(self, kind, data=None, pos=None):
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	183 if pos is None:
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	184 pos = self._getpos()
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	185 if kind is TEXT:
134 d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	186 # Expat reports the end of the text event as current position. We
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	187 # try to fix that up here as much as possible. Unfortunately, the
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	188 # offset is only valid for single-line text. For multi-line text,
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	189 # it is apparently not possible to determine at what offset it
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	190 # started
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	191 if '\n' in data:
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	192 lines = data.splitlines()
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	193 lineno = pos[1] - len(lines) + 1
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	194 offset = -1
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	195 else:
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	196 lineno = pos[1]
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	197 offset = pos[2] - len(data)
d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	198 pos = (pos[0], lineno, offset)
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	199 self._queue.append((kind, data, pos))
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	200
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	201 def _getpos_unknown(self):
134 d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	202 return (self.filename, -1, -1)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	203
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	204 def _getpos(self):
134 d681d2c3cd8d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	205 return (self.filename, self.expat.CurrentLineNumber,
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	206 self.expat.CurrentColumnNumber)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	207
5479aae32f5a Initial import. cmlenz parents: diff changeset	208 def _handle_start(self, tag, attrib):
403 228907abb726 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	209 attrs = Attrs([(QName(name), value) for name, value in
228907abb726 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	210 zip(*[iter(attrib)] * 2)])
228907abb726 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	211 self._enqueue(START, (QName(tag), attrs))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	212
5479aae32f5a Initial import. cmlenz parents: diff changeset	213 def _handle_end(self, tag):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	214 self._enqueue(END, QName(tag))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	215
5479aae32f5a Initial import. cmlenz parents: diff changeset	216 def _handle_data(self, text):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	217 self._enqueue(TEXT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	218
5479aae32f5a Initial import. cmlenz parents: diff changeset	219 def _handle_doctype(self, name, sysid, pubid, has_internal_subset):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	220 self._enqueue(DOCTYPE, (name, pubid, sysid))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	221
5479aae32f5a Initial import. cmlenz parents: diff changeset	222 def _handle_start_ns(self, prefix, uri):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	223 self._enqueue(START_NS, (prefix or '', uri))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	224
5479aae32f5a Initial import. cmlenz parents: diff changeset	225 def _handle_end_ns(self, prefix):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	226 self._enqueue(END_NS, prefix or '')
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	227
143 3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	228 def _handle_start_cdata(self):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	229 self._enqueue(START_CDATA)
143 3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	230
3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	231 def _handle_end_cdata(self):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	232 self._enqueue(END_CDATA)
143 3d4c214c979a CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	233
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	234 def _handle_pi(self, target, data):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	235 self._enqueue(PI, (target, data))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	236
5479aae32f5a Initial import. cmlenz parents: diff changeset	237 def _handle_comment(self, text):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	238 self._enqueue(COMMENT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	239
5479aae32f5a Initial import. cmlenz parents: diff changeset	240 def _handle_other(self, text):
5479aae32f5a Initial import. cmlenz parents: diff changeset	241 if text.startswith('&'):
5479aae32f5a Initial import. cmlenz parents: diff changeset	242 # deal with undefined entities
5479aae32f5a Initial import. cmlenz parents: diff changeset	243 try:
5479aae32f5a Initial import. cmlenz parents: diff changeset	244 text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	245 self._enqueue(TEXT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	246 except KeyError:
209 fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	247 filename, lineno, offset = self._getpos()
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	248 error = expat.error('undefined entity "%s": line %d, column %d'
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	249 % (text, lineno, offset))
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	250 error.code = expat.errors.XML_ERROR_UNDEFINED_ENTITY
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	251 error.lineno = lineno
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	252 error.offset = offset
fc6b2fb66518 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	253 raise error
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	254
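The `zip(*[iter(attrib)] * 2)` expression in `_handle_start` above groups a flat sequence into consecutive pairs. That only makes sense if `attrib` arrives as a flat `[name, value, name, value, ...]` list, which is what pyexpat delivers when `ordered_attributes` is enabled; that setting is not visible in this part of the listing, so treat it as an assumption. A minimal sketch of the idiom, with the hypothetical helper name `pair_attributes`:

```python
def pair_attributes(flat):
    """Group a flat [name, value, name, value, ...] list into pairs.

    `pair_attributes` is a hypothetical helper for illustration only; the
    listing inlines the equivalent `zip(*[iter(attrib)] * 2)` expression.
    """
    it = iter(flat)
    # zip() pulls from the *same* iterator twice per tuple, so consecutive
    # items end up paired. A trailing unpaired item is silently dropped.
    return list(zip(it, it))


pairs = pair_attributes(['id', 'main', 'class', 'nav'])
print(pairs)
```

Sharing one iterator between both `zip` arguments avoids index arithmetic and works for any iterable, not only lists.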
5479aae32f5a Initial import. cmlenz parents: diff changeset	255
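The position fix-up for `TEXT` events in `_enqueue` above can be read in isolation. The sketch below repeats the same arithmetic as a standalone function; the name `fix_text_pos` and the sample positions are made up for illustration, and it assumes, as the comment in the listing states, that the reported position is the end of the text.

```python
def fix_text_pos(data, pos):
    """Estimate where a text event started, given the position of its end.

    `pos` is a (filename, lineno, column) tuple. `fix_text_pos` is a
    hypothetical name; the listing performs this inline in `_enqueue`.
    """
    filename, lineno, column = pos
    if '\n' in data:
        # Multi-line text: step back by the number of lines spanned. The
        # starting column cannot be recovered, so it is reported as -1.
        lines = data.splitlines()
        return (filename, lineno - len(lines) + 1, -1)
    # Single-line text: same line, column moved back by the text length.
    return (filename, lineno, column - len(data))


print(fix_text_pos('Hello', ('doc.xml', 3, 12)))
print(fix_text_pos('a\nb\nc', ('doc.xml', 10, 1)))
```

The `-1` column follows the same convention as `_getpos_unknown`, which uses `-1` for values that are not known.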
5479aae32f5a Initial import. cmlenz parents: diff changeset	256 def XML(text):
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	257 """Parse the given XML source and return a markup stream.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	258
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	259 Unlike with `XMLParser`, the returned stream is reusable, meaning it can be
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	260 iterated over multiple times:
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	261
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	262 >>> xml = XML('<doc><elem>Foo</elem><elem>Bar</elem></doc>')
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	263 >>> print xml
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	264 <doc><elem>Foo</elem><elem>Bar</elem></doc>
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	265 >>> print xml.select('elem')
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	266 <elem>Foo</elem><elem>Bar</elem>
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	267 >>> print xml.select('elem/text()')
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	268 FooBar
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	269
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	270 :param text: the XML source
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	271 :return: the parsed XML event stream
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	272 :raises ParseError: if the XML text is not well-formed
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	273 """
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	274 return Stream(list(XMLParser(StringIO(text))))
5479aae32f5a Initial import. cmlenz parents: diff changeset	275
5479aae32f5a Initial import. cmlenz parents: diff changeset	276
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	277 class HTMLParser(html.HTMLParser, object):
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	278 """Parser for HTML input based on the Python `HTMLParser` module.
5479aae32f5a Initial import. cmlenz parents: diff changeset	279
5479aae32f5a Initial import. cmlenz parents: diff changeset	280 This class provides the same interface for generating stream events as
5479aae32f5a Initial import. cmlenz parents: diff changeset	281 `XMLParser`, and attempts to automatically balance tags.
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	282
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	283 The parsing is initiated by iterating over the parser object:
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	284
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	285 >>> parser = HTMLParser(StringIO('<UL compact><LI>Foo</UL>'))
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	286 >>> for kind, data, pos in parser:
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	287 ... print kind, data
326 f999da894391 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	288 START (QName(u'ul'), Attrs([(QName(u'compact'), u'compact')]))
f999da894391 Fixed `__repr__` of the `QName`, `Attrs`, and `Expression` classes so that the output can be used as code to instantiate the object again. cmlenz parents: 316 diff changeset	289 START (QName(u'li'), Attrs())
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	290 TEXT Foo
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	291 END li
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	292 END ul
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	293 """
5479aae32f5a Initial import. cmlenz parents: diff changeset	294
5479aae32f5a Initial import. cmlenz parents: diff changeset	295 _EMPTY_ELEMS = frozenset(['area', 'base', 'basefont', 'br', 'col', 'frame',
5479aae32f5a Initial import. cmlenz parents: diff changeset	296 'hr', 'img', 'input', 'isindex', 'link', 'meta',
5479aae32f5a Initial import. cmlenz parents: diff changeset	297 'param'])
5479aae32f5a Initial import. cmlenz parents: diff changeset	298
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	299 def __init__(self, source, filename=None, encoding='utf-8'):
8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	300 """Initialize the parser for the given HTML input.
8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	301
425 073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	302 :param source: the HTML text as a file-like object
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	303 :param filename: the name of the file, if known
073640758a42 Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	304 :param encoding: encoding of the file; ignored if the input is unicode
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	305 """
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	306 html.HTMLParser.__init__(self)
5479aae32f5a Initial import. cmlenz parents: diff changeset	307 self.source = source
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	308 self.filename = filename
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	309 self.encoding = encoding
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	310 self._queue = []
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	311 self._open_tags = []
5479aae32f5a Initial import. cmlenz parents: diff changeset	312
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	313 def parse(self):
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	314 """Generator that parses the HTML source, yielding markup events.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	315
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	316 :return: a markup event stream
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	317 :raises ParseError: if the HTML text is not well-formed
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	318 """
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	319 def _generate():
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	320 try:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	321 bufsize = 4 * 1024 # 4K
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	322 done = False
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	323 while 1:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	324 while not done and len(self._queue) == 0:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	325 data = self.source.read(bufsize)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	326 if data == '': # end of data
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	327 self.close()
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	328 done = True
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	329 else:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	330 self.feed(data)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	331 for kind, data, pos in self._queue:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	332 yield kind, data, pos
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	333 self._queue = []
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	334 if done:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	335 open_tags = self._open_tags
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	336 open_tags.reverse()
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	337 for tag in open_tags:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	338 yield END, QName(tag), pos
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	339 break
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	340 except html.HTMLParseError, e:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	341 msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	342 raise ParseError(msg, self.filename, e.lineno, e.offset)
146 04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	343 return Stream(_generate()).filter(_coalesce)
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	344
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	345 def __iter__(self):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	346 return iter(self.parse())
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	347
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	348 def _enqueue(self, kind, data, pos=None):
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	349 if pos is None:
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	350 pos = self._getpos()
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	351 self._queue.append((kind, data, pos))
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	352
21 b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	353 def _getpos(self):
b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	354 lineno, column = self.getpos()
b4d17897d053 * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	355 return (self.filename, lineno, column)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	356
5479aae32f5a Initial import. cmlenz parents: diff changeset	357 def handle_starttag(self, tag, attrib):
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	358 fixed_attrib = []
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	359 for name, value in attrib: # Fixup minimized attributes
3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	360 if value is None:
312 cb7326367f91 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	361 value = unicode(name)
cb7326367f91 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	362 elif not isinstance(value, unicode):
cb7326367f91 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	363 value = value.decode(self.encoding, 'replace')
403 228907abb726 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	364 fixed_attrib.append((QName(name), stripentities(value)))
26 3c1a022be04c * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	365
182 2f30ce3fb85e Renamed `Attributes` to `Attrs` to reduce the verbosity. cmlenz parents: 160 diff changeset	366 self._enqueue(START, (QName(tag), Attrs(fixed_attrib)))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	367 if tag in self._EMPTY_ELEMS:
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	368 self._enqueue(END, QName(tag))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	369 else:
5479aae32f5a Initial import. cmlenz parents: diff changeset	370 self._open_tags.append(tag)
5479aae32f5a Initial import. cmlenz parents: diff changeset	371
5479aae32f5a Initial import. cmlenz parents: diff changeset	372 def handle_endtag(self, tag):
5479aae32f5a Initial import. cmlenz parents: diff changeset	373 if tag not in self._EMPTY_ELEMS:
5479aae32f5a Initial import. cmlenz parents: diff changeset	374 while self._open_tags:
5479aae32f5a Initial import. cmlenz parents: diff changeset	375 open_tag = self._open_tags.pop()
378 873ca2a7ec05 Improve handling of incorrectly nested tags in the HTML parser. cmlenz parents: 376 diff changeset	376 self._enqueue(END, QName(open_tag))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	377 if open_tag.lower() == tag.lower():
5479aae32f5a Initial import. cmlenz parents: diff changeset	378 break
5479aae32f5a Initial import. cmlenz parents: diff changeset	379
5479aae32f5a Initial import. cmlenz parents: diff changeset	380 def handle_data(self, text):
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	381 if not isinstance(text, unicode):
8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	382 text = text.decode(self.encoding, 'replace')
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	383 self._enqueue(TEXT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	384
5479aae32f5a Initial import. cmlenz parents: diff changeset	385 def handle_charref(self, name):
423 56bbe1d94da0 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	386 if name.lower().startswith('x'):
56bbe1d94da0 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	387 text = unichr(int(name[1:], 16))
56bbe1d94da0 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	388 else:
56bbe1d94da0 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	389 text = unichr(int(name))
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	390 self._enqueue(TEXT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	391
5479aae32f5a Initial import. cmlenz parents: diff changeset	392 def handle_entityref(self, name):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	393 try:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	394 text = unichr(htmlentitydefs.name2codepoint[name])
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	395 except KeyError:
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	396 text = '&%s;' % name
d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	397 self._enqueue(TEXT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	398
5479aae32f5a Initial import. cmlenz parents: diff changeset	399 def handle_pi(self, data):
376 0e0952d85d97 Fix parsing of processing instructions in HTML input. cmlenz parents: 326 diff changeset	400 target, data = data.split(None, 1)
0e0952d85d97 Fix parsing of processing instructions in HTML input. cmlenz parents: 326 diff changeset	401 if data.endswith('?'):
0e0952d85d97 Fix parsing of processing instructions in HTML input. cmlenz parents: 326 diff changeset	402 data = data[:-1]
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	403 self._enqueue(PI, (target.strip(), data.strip()))
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	404
5479aae32f5a Initial import. cmlenz parents: diff changeset	405 def handle_comment(self, text):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	406 self._enqueue(COMMENT, text)
1 5479aae32f5a Initial import. cmlenz parents: diff changeset	407
5479aae32f5a Initial import. cmlenz parents: diff changeset	408
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	409 def HTML(text, encoding='utf-8'):
433 bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	410 """Parse the given HTML source and return a markup stream.
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	411
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	412 Unlike with `HTMLParser`, the returned stream is reusable, meaning it can be
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	413 iterated over multiple times:
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	414
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	415 >>> html = HTML('<body><h1>Foo</h1></body>')
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	416 >>> print html
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	417 <body><h1>Foo</h1></body>
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	418 >>> print html.select('h1')
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	419 <h1>Foo</h1>
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	420 >>> print html.select('h1/text()')
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	421 Foo
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	422
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	423 :param text: the HTML source
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	424 :return: the parsed XML event stream
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	425 :raises ParseError: if the HTML text is not well-formed, and error recovery
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	426 fails
bc430fd7c54d More API docs. cmlenz parents: 425 diff changeset	427 """
311 8de1ff534d22 * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	428 return Stream(list(HTMLParser(StringIO(text), encoding=encoding)))
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	429
146 04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	430 def _coalesce(stream):
144 d1ce85a7f296 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	431 """Coalesces adjacent TEXT events into a single event."""
146 04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	432 textbuf = []
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	433 textpos = None
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	434 for kind, data, pos in chain(stream, [(None, None, None)]):
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	435 if kind is TEXT:
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	436 textbuf.append(data)
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	437 if textpos is None:
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	438 textpos = pos
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	439 else:
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	440 if textbuf:
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	441 yield TEXT, u''.join(textbuf), textpos
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	442 del textbuf[:]
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	443 textpos = None
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	444 if kind:
04799355362d Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	445 yield kind, data, pos
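The coalescing step in `_coalesce` above can be tried in isolation. The following is a minimal sketch, not the module's own code: the `TEXT`, `START`, and `END` constants are plain-string stand-ins for Genshi's event kinds, the function is named `coalesce` here, the position tuples are made up for illustration, and it is written for Python 3 rather than the Python 2 of the listing. It shows how the trailing `(None, None, None)` sentinel forces the last run of text to be flushed, and that the merged event keeps the position of the first fragment.

```python
from itertools import chain

# Stand-ins for Genshi's event kind constants (assumption: plain strings).
TEXT = "TEXT"
START = "START"
END = "END"


def coalesce(stream):
    """Merge each run of adjacent TEXT events into a single TEXT event."""
    textbuf = []
    textpos = None
    # The sentinel event at the end guarantees that buffered text is
    # flushed even when the stream ends with a TEXT event.
    for kind, data, pos in chain(stream, [(None, None, None)]):
        if kind is TEXT:
            textbuf.append(data)
            if textpos is None:
                textpos = pos  # remember where the run started
        else:
            if textbuf:
                yield TEXT, "".join(textbuf), textpos
                del textbuf[:]
                textpos = None
            if kind:  # skip the sentinel itself
                yield kind, data, pos


# Hypothetical events, as a parser might emit them when a text node is
# split across buffer boundaries.
events = [
    (START, "p", (None, 1, 0)),
    (TEXT, "Hello, ", (None, 1, 3)),
    (TEXT, "wor", (None, 1, 10)),
    (TEXT, "ld", (None, 1, 13)),
    (END, "p", (None, 1, 15)),
]
merged = list(coalesce(events))
```

Because the function is a generator, it buffers only the current run of text rather than the whole stream, so it can wrap a parser that reads its input in chunks.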
