genshi/genshi-test: genshi/input.py annotate

annotate genshi/input.py @ 932:e53161c2773c

Merge r1140 from py3k: add support for Python 3 to core Genshi components (genshi.core, genshi.input and genshi.output):
* Default input and output encodings changed from UTF-8 to None (i.e. unicode strings).
* Namespace and QName objects do not call stringrepr in __repr__ in Python 3, since repr() returns a unicode string there.
* Track changes to the expat parser in Python 3 (mostly it accepts bytes instead of strings).

author	hodgestar
date	Fri, 18 Mar 2011 09:08:12 +0000
parents	fbe34d12acde
children
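
The changeset description says the expat parser in Python 3 mostly accepts bytes instead of strings. The following is a minimal sketch of that pattern, not the Genshi implementation: it configures an expat parser the same way the annotated source does (a '}' namespace separator, buffered text, ordered attributes, and a guarded returns_unicode assignment) and encodes text input to UTF-8 before feeding it. The function name parse_events and the string event kinds are hypothetical and exist only for illustration.

```python
from xml.parsers import expat


def parse_events(data):
    """Collect (kind, payload) tuples from XML given as bytes or text.

    Hypothetical helper for illustration; Genshi's XMLParser queues
    richer events that also carry position information.
    """
    events = []
    # '}' separates a namespace URI from the local name, e.g. 'urn:x}a'
    parser = expat.ParserCreate(None, '}')
    parser.buffer_text = True
    # Attributes arrive as a flat [name, value, name, value, ...] list
    parser.ordered_attributes = True
    # Only Python 2 parser objects have returns_unicode
    if hasattr(parser, 'returns_unicode'):
        parser.returns_unicode = True

    def on_start(tag, attrs):
        pairs = list(zip(attrs[0::2], attrs[1::2]))
        events.append(('START', (tag, pairs)))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = lambda tag: events.append(('END', tag))
    parser.CharacterDataHandler = lambda text: events.append(('TEXT', text))

    # Feed bytes to expat; text input is encoded to UTF-8 first
    if not isinstance(data, bytes):
        data = data.encode('utf-8')
    parser.Parse(data, False)
    parser.Parse(b'', True)  # signal end of input
    return events


events = parse_events(b'<root id="2"><child>Foo</child></root>')
for kind, payload in events:
    print('%s %s' % (kind, payload))
```

Encoding text before calling Parse keeps a single code path for both input types, which is the same choice the annotated parse() method makes for unicode chunks read from the source.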

rev	line source
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	1 # -*- coding: utf-8 -*-
821114ec4f69 Initial import. cmlenz parents: diff changeset	2 #
854 0d9e87c6cf6e More work on reducing the size of the diff produced by 2to3. cmlenz parents: 853 diff changeset	3 # Copyright (C) 2006-2009 Edgewall Software
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	4 # All rights reserved.
821114ec4f69 Initial import. cmlenz parents: diff changeset	5 #
821114ec4f69 Initial import. cmlenz parents: diff changeset	6 # This software is licensed as described in the file COPYING, which
821114ec4f69 Initial import. cmlenz parents: diff changeset	7 # you should have received as part of this distribution. The terms
230 24757b771651 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	8 # are also available at http://genshi.edgewall.org/wiki/License.
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	9 #
821114ec4f69 Initial import. cmlenz parents: diff changeset	10 # This software consists of voluntary contributions made by many
821114ec4f69 Initial import. cmlenz parents: diff changeset	11 # individuals. For the exact contribution history, see the revision
230 24757b771651 Renamed Markup to Genshi in repository. cmlenz parents: 213 diff changeset	12 # history and logs, available at http://genshi.edgewall.org/log/.
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	13
425 5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	14 """Support for constructing markup streams from files, strings, or other
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	15 sources.
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	16 """
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	17
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	18 from itertools import chain
859 fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	19 import htmlentitydefs as entities
fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	20 import HTMLParser as html
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	21 from xml.parsers import expat
821114ec4f69 Initial import. cmlenz parents: diff changeset	22
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	23 from genshi.core import Attrs, QName, Stream, stripentities
859 fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	24 from genshi.core import START, END, XML_DECL, DOCTYPE, TEXT, START_NS, \
fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	25 END_NS, START_CDATA, END_CDATA, PI, COMMENT
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	26 from genshi.compat import StringIO, BytesIO
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	27
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	28
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	29 __all__ = ['ET', 'ParseError', 'XMLParser', 'XML', 'HTMLParser', 'HTML']
425 5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	30 __docformat__ = 'restructuredtext en'
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	31
859 fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	32
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	33 def ET(element):
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	34 """Convert a given ElementTree element to a markup stream.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	35
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	36 :param element: an ElementTree element
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	37 :return: a markup stream
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	38 """
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	39 tag_name = QName(element.tag.lstrip('{'))
458 160f787cc818 The `ET()` function now correctly handles attributes with a namespace. cmlenz parents: 434 diff changeset	40 attrs = Attrs([(QName(attr.lstrip('{')), value)
160f787cc818 The `ET()` function now correctly handles attributes with a namespace. cmlenz parents: 434 diff changeset	41 for attr, value in element.items()])
290 a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	42
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	43 yield START, (tag_name, attrs), (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	44 if element.text:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	45 yield TEXT, element.text, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	46 for child in element.getchildren():
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	47 for item in ET(child):
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	48 yield item
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	49 yield END, tag_name, (None, -1, -1)
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	50 if element.tail:
a6738047c85e Move the ElementTree ''element-to-stream'' adaptation function `ET()` into the `genshi.input` module. cmlenz parents: 230 diff changeset	51 yield TEXT, element.tail, (None, -1, -1)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	52
821114ec4f69 Initial import. cmlenz parents: diff changeset	53
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	54 class ParseError(Exception):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	55 """Exception raised when fatal syntax errors are found in the input being
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	56 parsed.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	57 """
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	58
422 95089b6e37ca More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	59 def __init__(self, message, filename=None, lineno=-1, offset=-1):
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	60 """Exception initializer.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	61
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	62 :param message: the error message from the parser
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	63 :param filename: the path to the file that was parsed
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	64 :param lineno: the number of the line on which the error was encountered
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	65 :param offset: the column number where the error was encountered
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	66 """
422 95089b6e37ca More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	67 self.msg = message
95089b6e37ca More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	68 if filename:
434 e065d7906b68 * Better method to propogate the full path to the template file on parse errors. Supersedes r513. cmlenz parents: 433 diff changeset	69 message += ', in ' + filename
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	70 Exception.__init__(self, message)
422 95089b6e37ca More work to include absolute file paths in exceptions. cmlenz parents: 419 diff changeset	71 self.filename = filename or '<string>'
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	72 self.lineno = lineno
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	73 self.offset = offset
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	74
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	75
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	76 class XMLParser(object):
821114ec4f69 Initial import. cmlenz parents: diff changeset	77 """Generator-based XML parser based on roughly equivalent code in
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	78 Kid/ElementTree.
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	79
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	80 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	81
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	82 >>> parser = XMLParser(StringIO('<root id="2"><child>Foo</child></root>'))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	83 >>> for kind, data, pos in parser:
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	84 ... print('%s %s' % (kind, data))
857 24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary. cmlenz parents: 856 diff changeset	85 START (QName('root'), Attrs([(QName('id'), u'2')]))
24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary. cmlenz parents: 856 diff changeset	86 START (QName('child'), Attrs())
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	87 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	88 END child
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	89 END root
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	90 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	91
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	92 _entitydefs = ['<!ENTITY %s "&#%d;">' % (name, value) for name, value in
856 1e2be9fb3348 Add a couple of fallback imports for Python 3.0. cmlenz parents: 854 diff changeset	93 entities.name2codepoint.items()]
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	94 _external_dtd = u'\n'.join(_entitydefs).encode('utf-8')
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	95
316 4ab9edf5e83b Configurable encoding of template files, closing #65. cmlenz parents: 312 diff changeset	96 def __init__(self, source, filename=None, encoding=None):
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	97 """Initialize the parser for the given XML input.
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	98
425 5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	99 :param source: the XML text as a file-like object
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	100 :param filename: the name of the file, if appropriate
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	101 :param encoding: the encoding of the file; if not specified, the
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	102 encoding is assumed to be ASCII, UTF-8, or UTF-16, or
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	103 whatever the encoding specified in the XML declaration
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	104 (if any)
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	105 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	106 self.source = source
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	107 self.filename = filename
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	108
821114ec4f69 Initial import. cmlenz parents: diff changeset	109 # Set up the Expat parser
316 4ab9edf5e83b Configurable encoding of template files, closing #65. cmlenz parents: 312 diff changeset	110 parser = expat.ParserCreate(encoding, '}')
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	111 parser.buffer_text = True
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	112 # Python 3 does not have returns_unicode
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	113 if hasattr(parser, 'returns_unicode'):
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	114 parser.returns_unicode = True
160 faea6db52ef1 Attribute order in parsed XML is now preserved. cmlenz parents: 146 diff changeset	115 parser.ordered_attributes = True
faea6db52ef1 Attribute order in parsed XML is now preserved. cmlenz parents: 146 diff changeset	116
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	117 parser.StartElementHandler = self._handle_start
821114ec4f69 Initial import. cmlenz parents: diff changeset	118 parser.EndElementHandler = self._handle_end
821114ec4f69 Initial import. cmlenz parents: diff changeset	119 parser.CharacterDataHandler = self._handle_data
821114ec4f69 Initial import. cmlenz parents: diff changeset	120 parser.StartDoctypeDeclHandler = self._handle_doctype
821114ec4f69 Initial import. cmlenz parents: diff changeset	121 parser.StartNamespaceDeclHandler = self._handle_start_ns
821114ec4f69 Initial import. cmlenz parents: diff changeset	122 parser.EndNamespaceDeclHandler = self._handle_end_ns
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	123 parser.StartCdataSectionHandler = self._handle_start_cdata
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	124 parser.EndCdataSectionHandler = self._handle_end_cdata
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	125 parser.ProcessingInstructionHandler = self._handle_pi
460 6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks! cmlenz parents: 458 diff changeset	126 parser.XmlDeclHandler = self._handle_xml_decl
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	127 parser.CommentHandler = self._handle_comment
209 5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	128
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	129 # Tell Expat that we'll handle non-XML entities ourselves
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	130 # (in _handle_other)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	131 parser.DefaultHandler = self._handle_other
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	132 parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
209 5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	133 parser.UseForeignDTD()
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	134 parser.ExternalEntityRefHandler = self._build_foreign
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	135
821114ec4f69 Initial import. cmlenz parents: diff changeset	136 self.expat = parser
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	137 self._queue = []
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	138
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	139 def parse(self):
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	140 """Generator that parses the XML source, yielding markup events.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	141
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	142 :return: a markup event stream
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	143 :raises ParseError: if the XML text is not well formed
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	144 """
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	145 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	146 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	147 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	148 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	149 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	150 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	151 data = self.source.read(bufsize)
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	152 if not data: # end of data
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	153 if hasattr(self, 'expat'):
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	154 self.expat.Parse('', True)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	155 del self.expat # get rid of circular references
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	156 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	157 else:
207 0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43. cmlenz parents: 182 diff changeset	158 if isinstance(data, unicode):
0619a27f5e67 The `XMLParser` now correctly handles unicode input. Closes #43. cmlenz parents: 182 diff changeset	159 data = data.encode('utf-8')
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	160 self.expat.Parse(data, False)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	161 for event in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	162 yield event
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	163 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	164 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	165 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	166 except expat.ExpatError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	167 msg = str(e)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	168 raise ParseError(msg, self.filename, e.lineno, e.offset)
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	169 return Stream(_generate()).filter(_coalesce)
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	170
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	171 def __iter__(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	172 return iter(self.parse())
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	173
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	174 def _build_foreign(self, context, base, sysid, pubid):
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	175 parser = self.expat.ExternalEntityParserCreate(context)
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	176 parser.ParseFile(BytesIO(self._external_dtd))
293 38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	177 return 1
38adb4aa7df5 Fix a bug in the XML parser, where attributes containing HTML entity references would get pulled out of the attribute value, and instead added as a text node just before the associated start tag. Thanks to Hamish Lawson for [http://groups.google.com/group/genshi/browse_thread/thread/c64eb48676b0ff96/0e6ce786e8820f3d pointing out the problem]. cmlenz parents: 290 diff changeset	178
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	179 def _enqueue(self, kind, data=None, pos=None):
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	180 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	181 pos = self._getpos()
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	182 if kind is TEXT:
134 df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	183 # Expat reports the end of the text event as current position. We
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	184 # try to fix that up here as much as possible. Unfortunately, the
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	185 # offset is only valid for single-line text. For multi-line text,
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	186 # it is apparently not possible to determine at what offset it
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	187 # started
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	188 if '\n' in data:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	189 lines = data.splitlines()
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	190 lineno = pos[1] - len(lines) + 1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	191 offset = -1
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	192 else:
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	193 lineno = pos[1]
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	194 offset = pos[2] - len(data)
df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	195 pos = (pos[0], lineno, offset)
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	196 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	197
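The position fix-up for text events described in the comment above can be tried in isolation. The following is a minimal sketch, assuming that `pos` marks the point where the text ended; the function name `estimate_start` is hypothetical and does not appear in the module.

```python
def estimate_start(pos, data):
    """Estimate where a text node began, given the position where it ended.

    ``pos`` is a (filename, lineno, offset) tuple.
    """
    filename, end_line, end_col = pos
    if '\n' in data:
        # Multi-line text: step the line number back by the number of
        # lines spanned; the starting column cannot be recovered.
        lineno = end_line - len(data.splitlines()) + 1
        offset = -1
    else:
        # Single-line text: step the column back by the text length.
        lineno = end_line
        offset = end_col - len(data)
    return (filename, lineno, offset)


single = estimate_start(('t.xml', 3, 10), 'Foo')
multi = estimate_start(('t.xml', 5, 2), 'a\nb\nc')
```

An offset of -1 signals "unknown column", matching the convention used by `_getpos_unknown`.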
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	198 def _getpos_unknown(self):
134 df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	199 return (self.filename, -1, -1)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	200
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	201 def _getpos(self):
134 df44110ca91d * Improve the accuracy of line numbers for text nodes, so that reported errors about syntax or evaluation errors in expressions point to the right line (not quite perfect yet, though). cmlenz parents: 69 diff changeset	202 return (self.filename, self.expat.CurrentLineNumber,
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	203 self.expat.CurrentColumnNumber)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	204
821114ec4f69 Initial import. cmlenz parents: diff changeset	205 def _handle_start(self, tag, attrib):
403 32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	206 attrs = Attrs([(QName(name), value) for name, value in
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	207 zip(*[iter(attrib)] * 2)])
32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	208 self._enqueue(START, (QName(tag), attrs))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	209
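`_handle_start` receives the attributes as a flat sequence and groups them into (name, value) pairs. The sketch below shows that grouping on its own, assuming the expat parser has `ordered_attributes` enabled so that attributes arrive as `[name, value, name, value, ...]`. The helper `pair_attributes` and the handler `on_start` are hypothetical names used only for illustration.

```python
from xml.parsers import expat


def pair_attributes(flat):
    """Group a flat [name, value, name, value, ...] list into pairs."""
    it = iter(flat)
    # Both zip arguments are the same iterator, so each step consumes
    # two items. This is equivalent to zip(*[iter(flat)] * 2).
    return list(zip(it, it))


collected = []


def on_start(tag, attrib):
    collected.append((tag, pair_attributes(attrib)))


parser = expat.ParserCreate()
parser.ordered_attributes = True  # report attributes as a flat list
parser.StartElementHandler = on_start
parser.Parse('<a href="x" id="y"/>', True)
```

Sharing one iterator keeps the attributes in document order, which a dictionary-based handler would not have guaranteed on older Python versions.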
821114ec4f69 Initial import. cmlenz parents: diff changeset	210 def _handle_end(self, tag):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	211 self._enqueue(END, QName(tag))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	212
821114ec4f69 Initial import. cmlenz parents: diff changeset	213 def _handle_data(self, text):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	214 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	215
460 6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks! cmlenz parents: 458 diff changeset	216 def _handle_xml_decl(self, version, encoding, standalone):
6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks! cmlenz parents: 458 diff changeset	217 self._enqueue(XML_DECL, (version, encoding, standalone))
6b5544bb5a99 Apply patch by Alec Thomas for processing XML declarations (#111). Thanks! cmlenz parents: 458 diff changeset	218
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	219 def _handle_doctype(self, name, sysid, pubid, has_internal_subset):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	220 self._enqueue(DOCTYPE, (name, pubid, sysid))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	221
821114ec4f69 Initial import. cmlenz parents: diff changeset	222 def _handle_start_ns(self, prefix, uri):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	223 self._enqueue(START_NS, (prefix or '', uri))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	224
821114ec4f69 Initial import. cmlenz parents: diff changeset	225 def _handle_end_ns(self, prefix):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	226 self._enqueue(END_NS, prefix or '')
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	227
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	228 def _handle_start_cdata(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	229 self._enqueue(START_CDATA)
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	230
ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	231 def _handle_end_cdata(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	232 self._enqueue(END_CDATA)
143 ef761afcedff CDATA sections in XML input now appear as CDATA sections in the output. This should address the problem with escaping the contents of `<style>` and `<script>` elements, which would only get interpreted correctly if the output was served as `application/xhtml+xml`. Closes #24. cmlenz parents: 140 diff changeset	233
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	234 def _handle_pi(self, target, data):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	235 self._enqueue(PI, (target, data))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	236
821114ec4f69 Initial import. cmlenz parents: diff changeset	237 def _handle_comment(self, text):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	238 self._enqueue(COMMENT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	239
821114ec4f69 Initial import. cmlenz parents: diff changeset	240 def _handle_other(self, text):
821114ec4f69 Initial import. cmlenz parents: diff changeset	241 if text.startswith('&'):
821114ec4f69 Initial import. cmlenz parents: diff changeset	242 # deal with undefined entities
821114ec4f69 Initial import. cmlenz parents: diff changeset	243 try:
856 1e2be9fb3348 Add a couple of fallback imports for Python 3.0. cmlenz parents: 854 diff changeset	244 text = unichr(entities.name2codepoint[text[1:-1]])
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	245 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	246 except KeyError:
209 5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	247 filename, lineno, offset = self._getpos()
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	248 error = expat.error('undefined entity "%s": line %d, column %d'
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	249 % (text, lineno, offset))
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	250 error.code = expat.errors.XML_ERROR_UNDEFINED_ENTITY
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	251 error.lineno = lineno
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	252 error.offset = offset
5b422db07359 * Fix bug in handling of undefined entities. Thanks to Arnar for reporting the issue on IRC. cmlenz parents: 207 diff changeset	253 raise error
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	254
821114ec4f69 Initial import. cmlenz parents: diff changeset	255
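The entity lookup in `_handle_other` can be illustrated separately. This is a minimal sketch using the Python 3 names `html.entities` and `chr`, where the listing uses `entities` and `unichr`. The function name `resolve_entity` is hypothetical.

```python
from html.entities import name2codepoint


def resolve_entity(ref):
    """Map a reference such as '&nbsp;' to the character it names.

    Raises KeyError when the name is not in the HTML entity table.
    """
    name = ref[1:-1]  # strip the leading '&' and the trailing ';'
    return chr(name2codepoint[name])


nbsp = resolve_entity('&nbsp;')

try:
    resolve_entity('&bogus;')
    unknown_raised = False
except KeyError:
    unknown_raised = True
```

In the listing, the `KeyError` case is turned into an expat error that carries the current line and column.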
821114ec4f69 Initial import. cmlenz parents: diff changeset	256 def XML(text):
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	257 """Parse the given XML source and return a markup stream.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	258
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	259 Unlike with `XMLParser`, the returned stream is reusable, meaning it can be
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	260 iterated over multiple times:
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	261
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	262 >>> xml = XML('<doc><elem>Foo</elem><elem>Bar</elem></doc>')
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	263 >>> print(xml)
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	264 <doc><elem>Foo</elem><elem>Bar</elem></doc>
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	265 >>> print(xml.select('elem'))
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	266 <elem>Foo</elem><elem>Bar</elem>
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	267 >>> print(xml.select('elem/text()'))
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	268 FooBar
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	269
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	270 :param text: the XML source
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	271 :return: the parsed XML event stream
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	272 :raises ParseError: if the XML text is not well-formed
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	273 """
859 fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	274 return Stream(list(XMLParser(StringIO(text))))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	275
821114ec4f69 Initial import. cmlenz parents: diff changeset	276
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	277 class HTMLParser(html.HTMLParser, object):
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	278 """Parser for HTML input based on the Python `HTMLParser` module.
821114ec4f69 Initial import. cmlenz parents: diff changeset	279
821114ec4f69 Initial import. cmlenz parents: diff changeset	280 This class provides the same interface for generating stream events as
821114ec4f69 Initial import. cmlenz parents: diff changeset	281 `XMLParser`, and attempts to automatically balance tags.
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	282
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	283 The parsing is initiated by iterating over the parser object:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	284
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	285 >>> parser = HTMLParser(BytesIO(u'<UL compact><LI>Foo</UL>'.encode('utf-8')), encoding='utf-8')
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	286 >>> for kind, data, pos in parser:
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	287 ... print('%s %s' % (kind, data))
857 24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary. cmlenz parents: 856 diff changeset	288 START (QName('ul'), Attrs([(QName('compact'), u'compact')]))
24733a5854d9 Avoid unicode literals in `repr`s of `QName` and `Namespace` when not necessary. cmlenz parents: 856 diff changeset	289 START (QName('li'), Attrs())
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	290 TEXT Foo
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	291 END li
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	292 END ul
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	293 """
821114ec4f69 Initial import. cmlenz parents: diff changeset	294
821114ec4f69 Initial import. cmlenz parents: diff changeset	295 _EMPTY_ELEMS = frozenset(['area', 'base', 'basefont', 'br', 'col', 'frame',
821114ec4f69 Initial import. cmlenz parents: diff changeset	296 'hr', 'img', 'input', 'isindex', 'link', 'meta',
821114ec4f69 Initial import. cmlenz parents: diff changeset	297 'param'])
821114ec4f69 Initial import. cmlenz parents: diff changeset	298
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	299 def __init__(self, source, filename=None, encoding=None):
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	300 """Initialize the parser for the given HTML input.
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	301
425 5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	302 :param source: the HTML text as a file-like object
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	303 :param filename: the name of the file, if known
5b248708bbed Try to use proper reStructuredText for docstrings throughout. cmlenz parents: 423 diff changeset	304 :param encoding: encoding of the file; ignored if the input is unicode
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	305 """
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	306 html.HTMLParser.__init__(self)
821114ec4f69 Initial import. cmlenz parents: diff changeset	307 self.source = source
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	308 self.filename = filename
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	309 self.encoding = encoding
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	310 self._queue = []
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	311 self._open_tags = []
821114ec4f69 Initial import. cmlenz parents: diff changeset	312
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	313 def parse(self):
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	314 """Generator that parses the HTML source, yielding markup events.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	315
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	316 :return: a markup event stream
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	317 :raises ParseError: if the HTML text is not well formed
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	318 """
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	319 def _generate():
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	320 try:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	321 bufsize = 4 * 1024 # 4K
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	322 done = False
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	323 while 1:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	324 while not done and len(self._queue) == 0:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	325 data = self.source.read(bufsize)
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	326 if not data: # end of data
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	327 self.close()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	328 done = True
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	329 else:
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	330 if not isinstance(data, unicode):
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	331 # bytes
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	332 if self.encoding:
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	333 data = data.decode(self.encoding)
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	334 else:
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	335 raise UnicodeError("source returned bytes, but no encoding specified")
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	336 self.feed(data)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	337 for kind, data, pos in self._queue:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	338 yield kind, data, pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	339 self._queue = []
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	340 if done:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	341 open_tags = self._open_tags
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	342 open_tags.reverse()
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	343 for tag in open_tags:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	344 yield END, QName(tag), pos
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	345 break
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	346 except html.HTMLParseError, e:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	347 msg = '%s: line %d, column %d' % (e.msg, e.lineno, e.offset)
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	348 raise ParseError(msg, self.filename, e.lineno, e.offset)
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	349 return Stream(_generate()).filter(_coalesce)
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	350
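The read loop in `parse` decodes each 4K chunk independently, so a multi-byte character that straddles a chunk boundary could fail to decode. One possible alternative, which is not what the listing does, is an incremental decoder from the standard `codecs` module. The sketch below assumes `source.read()` returns bytes; the function name `read_decoded` is hypothetical.

```python
import codecs
from io import BytesIO


def read_decoded(source, encoding, bufsize=4 * 1024):
    """Yield decoded text chunks from a byte source."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        data = source.read(bufsize)
        if not data:
            # Flush the decoder; raises if the input ended mid-character.
            tail = decoder.decode(b'', final=True)
            if tail:
                yield tail
            break
        # Incomplete trailing bytes are buffered until the next call.
        text = decoder.decode(data)
        if text:
            yield text


raw = '\u00e9'.encode('utf-8') * 3  # three two-byte characters
chunks = list(read_decoded(BytesIO(raw), 'utf-8', bufsize=1))
```

A buffer size of one byte is used here only to force every character to span two reads.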
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	351 def __iter__(self):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	352 return iter(self.parse())
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	353
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	354 def _enqueue(self, kind, data, pos=None):
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	355 if pos is None:
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	356 pos = self._getpos()
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	357 self._queue.append((kind, data, pos))
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	358
21 eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	359 def _getpos(self):
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	360 lineno, column = self.getpos()
eca77129518a * Include paths are now interpreted relative to the path of the including template. Closes #3. cmlenz parents: 1 diff changeset	361 return (self.filename, lineno, column)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	362
821114ec4f69 Initial import. cmlenz parents: diff changeset	363 def handle_starttag(self, tag, attrib):
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	364 fixed_attrib = []
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	365 for name, value in attrib: # Fixup minimized attributes
039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	366 if value is None:
312 7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	367 value = unicode(name)
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	368 elif not isinstance(value, unicode):
7e743338a799 Follow-up to [385]: also decode attribute values in the `HTMLParser`. cmlenz parents: 311 diff changeset	369 value = value.decode(self.encoding, 'replace')
403 32b283e1d310 Remove some magic/overhead from `Attrs` creation and manipulation by not automatically wrapping attribute names in `QName`. cmlenz parents: 378 diff changeset	370 fixed_attrib.append((QName(name), stripentities(value)))
26 039fc5b87405 * Split out the XPath tests into a separate `unittest`-based file. cmlenz parents: 21 diff changeset	371
182 41db0260ebb1 Renamed `Attributes` to `Attrs` to reduce the verbosity. cmlenz parents: 160 diff changeset	372 self._enqueue(START, (QName(tag), Attrs(fixed_attrib)))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	373 if tag in self._EMPTY_ELEMS:
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	374 self._enqueue(END, QName(tag))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	375 else:
821114ec4f69 Initial import. cmlenz parents: diff changeset	376 self._open_tags.append(tag)
821114ec4f69 Initial import. cmlenz parents: diff changeset	377
821114ec4f69 Initial import. cmlenz parents: diff changeset	378 def handle_endtag(self, tag):
821114ec4f69 Initial import. cmlenz parents: diff changeset	379 if tag not in self._EMPTY_ELEMS:
821114ec4f69 Initial import. cmlenz parents: diff changeset	380 while self._open_tags:
821114ec4f69 Initial import. cmlenz parents: diff changeset	381 open_tag = self._open_tags.pop()
378 fff4a81ffc56 Improve handling of incorrectly nested tags in the HTML parser. cmlenz parents: 376 diff changeset	382 self._enqueue(END, QName(open_tag))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	383 if open_tag.lower() == tag.lower():
821114ec4f69 Initial import. cmlenz parents: diff changeset	384 break
821114ec4f69 Initial import. cmlenz parents: diff changeset	385
821114ec4f69 Initial import. cmlenz parents: diff changeset	386 def handle_data(self, text):
311 01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	387 if not isinstance(text, unicode):
01e2c48f6dfb * The `HTMLParser` class and the `HTML` function now accept an `encoding` parameter to properly deal with bytestring input (defaults to UTF-8). cmlenz parents: 293 diff changeset	388 text = text.decode(self.encoding, 'replace')
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	389 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	390
821114ec4f69 Initial import. cmlenz parents: diff changeset	391 def handle_charref(self, name):
423 7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	392 if name.lower().startswith('x'):
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	393 text = unichr(int(name[1:], 16))
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	394 else:
7589a0e51001 Applied patch for #106 (handling of hex charrefs in HTML parser). cmlenz parents: 422 diff changeset	395 text = unichr(int(name))
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	396 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	397
821114ec4f69 Initial import. cmlenz parents: diff changeset	398 def handle_entityref(self, name):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	399 try:
856 1e2be9fb3348 Add a couple of fallback imports for Python 3.0. cmlenz parents: 854 diff changeset	400 text = unichr(entities.name2codepoint[name])
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	401 except KeyError:
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	402 text = '&%s;' % name
28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	403 self._enqueue(TEXT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	404
821114ec4f69 Initial import. cmlenz parents: diff changeset	405 def handle_pi(self, data):
376 74b6bf92f0cd Fix parsing of processing instructions in HTML input. cmlenz parents: 326 diff changeset	406 target, data = data.split(None, 1)
74b6bf92f0cd Fix parsing of processing instructions in HTML input. cmlenz parents: 326 diff changeset	407 if data.endswith('?'):
74b6bf92f0cd Fix parsing of processing instructions in HTML input. cmlenz parents: 326 diff changeset	408 data = data[:-1]
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	409 self._enqueue(PI, (target.strip(), data.strip()))
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	410
821114ec4f69 Initial import. cmlenz parents: diff changeset	411 def handle_comment(self, text):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	412 self._enqueue(COMMENT, text)
1 821114ec4f69 Initial import. cmlenz parents: diff changeset	413
821114ec4f69 Initial import. cmlenz parents: diff changeset	414
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	415 def HTML(text, encoding=None):
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	416 """Parse the given HTML source and return a markup stream.
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	417
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	418 Unlike with `HTMLParser`, the returned stream is reusable, meaning it can be
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	419 iterated over multiple times:
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	420
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	421 >>> html = HTML('<body><h1>Foo</h1></body>', encoding='utf-8')
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	422 >>> print(html)
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	423 <body><h1>Foo</h1></body>
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	424 >>> print(html.select('h1'))
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	425 <h1>Foo</h1>
853 4376010bb97e Convert a bunch of print statements to py3k compatible syntax. cmlenz parents: 852 diff changeset	426 >>> print(html.select('h1/text()'))
433 6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	427 Foo
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	428
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	429 :param text: the HTML source
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	430 :return: the parsed XML event stream
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	431 :raises ParseError: if the HTML text is not well-formed, and error recovery
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	432 fails
6d01e91f2a49 More API docs. cmlenz parents: 425 diff changeset	433 """
932 e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	434 if isinstance(text, unicode):
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	435 return Stream(list(HTMLParser(StringIO(text), encoding=encoding)))
e53161c2773c Merge r1140 from py3k: hodgestar parents: 859 diff changeset	436 return Stream(list(HTMLParser(BytesIO(text), encoding=encoding)))
859 fbe34d12acde More bits of 2to3 related cleanup. cmlenz parents: 857 diff changeset	437
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	438
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	439 def _coalesce(stream):
144 28b56f09a7e1 * Coalesce adjacent text events that the parsers would produce when text crossed the buffer boundaries. Fixes #26. cmlenz parents: 143 diff changeset	440 """Coalesces adjacent TEXT events into a single event."""
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	441 textbuf = []
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	442 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	443 for kind, data, pos in chain(stream, [(None, None, None)]):
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	444 if kind is TEXT:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	445 textbuf.append(data)
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	446 if textpos is None:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	447 textpos = pos
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	448 else:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	449 if textbuf:
852 04945cd67dad Remove usage of unicode literals in a couple of places where they were not strictly necessary. cmlenz parents: 750 diff changeset	450 yield TEXT, ''.join(textbuf), textpos
146 db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	451 del textbuf[:]
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	452 textpos = None
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	453 if kind:
db0dacc1239a Simplifed `CoalesceFilter` (now a function) cmlenz parents: 145 diff changeset	454 yield kind, data, pos
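
Note on _coalesce (listing lines 439-454): the function buffers consecutive TEXT events and flushes them as one event when a non-TEXT event arrives; the trailing (None, None, None) sentinel forces a final flush. The following is a minimal standalone sketch of that behaviour, not the module itself. The TEXT, START and END constants are plain-string stand-ins for the genshi.core event kinds, and the sample events are simplified and hypothetical.

```python
from itertools import chain

# Stand-ins (assumptions) for the event kinds defined in genshi.core.
TEXT = "TEXT"
START = "START"
END = "END"


def coalesce(stream):
    """Merge each run of adjacent TEXT events into a single TEXT event."""
    textbuf = []
    textpos = None
    # The sentinel tuple guarantees that text buffered at the very end of
    # the stream is still flushed.
    for kind, data, pos in chain(stream, [(None, None, None)]):
        if kind is TEXT:
            textbuf.append(data)
            if textpos is None:
                # Remember the position of the first chunk in the run.
                textpos = pos
        else:
            if textbuf:
                yield TEXT, ''.join(textbuf), textpos
                del textbuf[:]
                textpos = None
            if kind:
                # Pass through real events; skip the sentinel.
                yield kind, data, pos


# Hypothetical events, as if text had been split at buffer boundaries.
events = [
    (START, "p", (None, 1, 0)),
    (TEXT, "Hello, ", (None, 1, 3)),
    (TEXT, "wor", (None, 1, 10)),
    (TEXT, "ld", (None, 1, 13)),
    (END, "p", (None, 1, 15)),
]
merged = list(coalesce(events))
```

The merged TEXT event keeps the position of the first chunk in the run, which matches how textpos is assigned in the listing.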

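Note on handle_pi (listing lines 405-409): the two-value unpacking of data.split(None, 1) raises ValueError when a processing instruction carries a target but no data (for example the text "php?"). One possible way to tolerate that case is sketched below. split_pi is a hypothetical helper name, not part of Genshi, and stripping the trailing "?" before splitting is a choice made for this sketch rather than what the revision shown here does.

```python
def split_pi(data):
    """Split raw processing-instruction text into (target, data).

    Hypothetical helper: tolerates instructions with no data part.
    """
    # HTML parsers may hand over the text with the closing '?' attached.
    if data.endswith('?'):
        data = data[:-1]
    parts = data.split(None, 1)
    if not parts:
        # Empty or whitespace-only instruction.
        return '', ''
    target = parts[0]
    rest = parts[1] if len(parts) > 1 else ''
    return target.strip(), rest.strip()


with_data = split_pi('php echo "x" ?')
without_data = split_pi('php?')
```

For input that has both a target and data, this returns the same pair that the listing would enqueue with the PI event.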