public class BoilerpipeHTMLParser extends org.apache.xerces.parsers.AbstractSAXParser implements BoilerpipeDocumentSource
A simple SAX parser, used by BoilerpipeSAXInput. The parser uses CyberNeko to parse HTML content.
Fields inherited from class org.apache.xerces.parsers.AbstractSAXParser: ALLOW_UE_AND_NOTATION_EVENTS, DECLARATION_HANDLER, DOM_NODE, fContentHandler, fDeclaredAttrs, fDeclHandler, fDocumentHandler, fDTDHandler, fLexicalHandler, fLexicalHandlerParameterEntities, fNamespaceContext, fNamespacePrefixes, fNamespaces, fParseInProgress, fQName, fResolveDTDURIs, fStandalone, fUseEntityResolver2, fVersion, fXMLNSURIs, LEXICAL_HANDLER, NAMESPACES, STRING_INTERNING
Fields inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser: fDocumentSource, fDTDContentModelSource, fDTDSource, fInDTD
Fields inherited from class org.apache.xerces.parsers.XMLParser: ENTITY_RESOLVER, ERROR_HANDLER, fConfiguration
| Modifier | Constructor and Description |
|---|---|
| | BoilerpipeHTMLParser() Constructs a BoilerpipeHTMLParser using a default HTML content handler. |
| | BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler) Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler. |
| protected | BoilerpipeHTMLParser(boolean ignore) |
| Modifier and Type | Method and Description |
|---|---|
| void | setContentHandler(BoilerpipeHTMLContentHandler contentHandler) |
| void | setContentHandler(org.xml.sax.ContentHandler contentHandler) |
| TextDocument | toTextDocument() Returns a TextDocument containing the extracted TextBlocks. |
Methods inherited from class org.apache.xerces.parsers.AbstractSAXParser: attributeDecl, characters, comment, doctypeDecl, elementDecl, endCDATA, endDocument, endDTD, endElement, endExternalSubset, endGeneralEntity, endNamespaceMapping, endParameterEntity, externalEntityDecl, getAttributePSVI, getAttributePSVIByName, getContentHandler, getDeclHandler, getDTDHandler, getElementPSVI, getEntityResolver, getErrorHandler, getFeature, getLexicalHandler, getProperty, ignorableWhitespace, internalEntityDecl, notationDecl, parse, parse, processingInstruction, reset, setDeclHandler, setDocumentHandler, setDTDHandler, setEntityResolver, setErrorHandler, setFeature, setLexicalHandler, setLocale, setProperty, startCDATA, startDocument, startElement, startExternalSubset, startGeneralEntity, startNamespaceMapping, startParameterEntity, unparsedEntityDecl, xmlDecl
Methods inherited from class org.apache.xerces.parsers.AbstractXMLDocumentParser: any, element, empty, emptyElement, endAttlist, endConditional, endContentModel, endGroup, getDocumentSource, getDTDContentModelSource, getDTDSource, ignoredCharacters, occurrence, pcdata, separator, setDocumentSource, setDTDContentModelSource, setDTDSource, startAttlist, startConditional, startContentModel, startDTD, startGroup, textDecl
public BoilerpipeHTMLParser()
Constructs a BoilerpipeHTMLParser using a default HTML content handler.
public BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler contentHandler)
Constructs a BoilerpipeHTMLParser using the given BoilerpipeHTMLContentHandler.
Parameters: contentHandler - the BoilerpipeHTMLContentHandler to use
protected BoilerpipeHTMLParser(boolean ignore)
public void setContentHandler(BoilerpipeHTMLContentHandler contentHandler)
public void setContentHandler(org.xml.sax.ContentHandler contentHandler)
Specified by: setContentHandler in interface org.xml.sax.XMLReader
Overrides: setContentHandler in class org.apache.xerces.parsers.AbstractSAXParser
public TextDocument toTextDocument()
Returns a TextDocument containing the extracted TextBlocks. NOTE: Only call this after AbstractSAXParser.parse(org.xml.sax.InputSource).
Specified by: toTextDocument in interface BoilerpipeDocumentSource
Returns: the TextDocument
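The call order described for toTextDocument() (set a content handler, run parse, and only then read the result) can be illustrated with a minimal sketch that uses only the JDK's built-in SAX parser. This is not Boilerpipe code: BlockCollectorSketch, toBlocks, extractBlocks, and the BLOCK_TAGS set are hypothetical names invented for illustration, a plain list of strings stands in for TextDocument and TextBlock, and the input must be well-formed XHTML because the JDK parser, unlike CyberNeko, does not repair real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in (not part of Boilerpipe): a SAX content handler that
// gathers the text of block-level elements while the parser runs.
class BlockCollectorSketch extends DefaultHandler {
    // Assumed set of block-level tags, chosen only for this illustration.
    private static final Set<String> BLOCK_TAGS =
            new HashSet<>(Arrays.asList("p", "div", "h1", "h2", "li"));

    private final List<String> blocks = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean finished = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (BLOCK_TAGS.contains(qName)) flush(); // a new block starts here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (BLOCK_TAGS.contains(qName)) flush(); // the current block ends here
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        current.append(ch, start, length); // text may arrive in several chunks
    }

    @Override
    public void endDocument() {
        flush();
        finished = true; // the result is complete only from this point on
    }

    // Stores the buffered text as one block if it is non-blank, then clears the buffer.
    private void flush() {
        String text = current.toString().trim();
        if (!text.isEmpty()) blocks.add(text);
        current.setLength(0);
    }

    // Mirrors the "only call this after parse" rule: fails fast if parsing has not finished.
    List<String> toBlocks() {
        if (!finished) throw new IllegalStateException("call parse() before toBlocks()");
        return Collections.unmodifiableList(new ArrayList<>(blocks));
    }

    // Convenience wrapper: set the handler, parse, then read the result.
    static List<String> extractBlocks(String xhtml) {
        BlockCollectorSketch handler = new BlockCollectorSketch();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new RuntimeException("parsing failed", e);
        }
        return handler.toBlocks();
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second.</p></body></html>";
        System.out.println(extractBlocks(xhtml)); // prints [Title, First paragraph., Second.]
    }
}
```

With the real class, the corresponding sequence would presumably be to construct a BoilerpipeHTMLParser, call parse with an org.xml.sax.InputSource, and then call toTextDocument(). The sketch fails fast when the result is read too early, which is one way to make such an ordering requirement explicit.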