In this post, we introduce the Apache Tika framework and explain its concepts (e.g., N-gram, parsing, MIME detection, and content analysis) via illustrative examples that should be useful not only to seasoned software developers but also to beginners in content analysis and programming. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.
In this tutorial, you will learn about:
- Apache Tika's API, most relevant modules, and related functions
- Apache Nutch (one of the progenitors of Tika) and its NgramProfiler and LanguageIdentifier classes, which have recently been ported to Tika
- cpdetector, the code page detector project, and its functionality
As Apache Tika's site suggests, Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
The parser interface
The org.apache.tika.parser.Parser interface is the key component of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
    void parse(InputStream stream, ContentHandler handler, Metadata metadata)
        throws IOException, SAXException, TikaException;
The parse method takes the document to be parsed and related metadata as input, and outputs the results as XHTML SAX events and extra metadata. The main criteria that led to this design are shown in Table 1.
Table 1. Criteria for Tika parsing design
Criterion | Explanation |
Streamed parsing | The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements. |
Structured content | A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document. |
Input metadata | A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process. |
Output metadata | A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, such as the name of the author, that may be useful to client applications. |
These criteria are reflected in the arguments of the parse method.
Document InputStream
The first argument is an InputStream for reading the document to be parsed.
If this document stream cannot be read, parsing stops and the thrown IOException is passed up to the client application. If the stream can be read but not parsed (if the document is corrupted, for example), the parser throws a TikaException.
The parser implementation will consume this stream, but will not close it. Closing the stream is the responsibility of the client application that opened it initially. Listing 1 shows the recommended pattern for using streams with the parse method.
Listing 1. Recommended pattern for using streams with the parse method
    InputStream stream = ...; // open the stream
    try {
        parser.parse(stream, ...); // parse the stream
    } finally {
        stream.close(); // close the stream
    }
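On Java 7 or later, the same ownership rule (the caller opens the stream, the parser only consumes it, the caller closes it) can also be written with try-with-resources. The sketch below uses only the JDK: StreamConsumer is a hypothetical stand-in for a parser, and TrackingStream exists only to make the close visible. Neither is part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamPattern {

    /** Hypothetical stand-in for a parser: reads the stream but never closes it. */
    interface StreamConsumer {
        int consume(InputStream stream) throws IOException;
    }

    /** Demonstration-only stream that records whether close() was called. */
    static class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** The caller owns the stream: it is closed here even if the consumer throws. */
    static int parseThenClose(InputStream stream, StreamConsumer parser) throws IOException {
        try (InputStream in = stream) {
            return parser.consume(in);
        }
    }

    /** A trivial "parser" that just counts the bytes it reads. */
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        TrackingStream stream =
                new TrackingStream("some document".getBytes(StandardCharsets.UTF_8));
        int bytes = parseThenClose(stream, StreamPattern::countBytes);
        System.out.println(bytes + " bytes read, closed = " + stream.closed);
    }
}
```

The try/finally form in Listing 1 and the try-with-resources form behave the same way for this purpose; the latter simply removes the explicit close() call.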
XHTML SAX events
The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express structured content of the document, and SAX events enable streamed processing. Note that the XHTML format is used here only to convey structural information, not to render the documents for browsing.
The XHTML SAX events produced by the parser implementation are sent to a ContentHandler instance given to the parse method. If the content handler fails to process an event, parsing stops and the thrown SAXException is passed up to the client application.
Listing 2 shows the overall structure of the generated event stream (with indenting added for clarity).
Listing 2. Structure of the generated event stream
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>...</title>
      </head>
      <body>
        ...
      </body>
    </html>
Parser implementations typically use the XHTMLContentHandler utility class to generate the XHTML output. Dealing with the raw SAX events can be complex, so Apache Tika (since V0.2) comes with several utility classes that can be used to process and convert the event stream to other representations.
For example, the BodyContentHandler class can be used to extract just the body part of the XHTML output and feed it as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:
    ContentHandler handler = new BodyContentHandler(System.out);
    parser.parse(System.in, handler, ...);
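To give a feel for what handling the raw events involves, here is a minimal sketch of a handler that keeps only the character data inside the body element. It is not Tika's BodyContentHandler. BodyTextCollector is a hypothetical class built on the JDK's SAX API, and the JDK's own XML parser stands in for a Tika parser as the source of events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BodyTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting of everything below it.
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head> (such as the title) is ignored.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    public String getText() {
        return text.toString();
    }

    /** Feeds an XHTML string through the JDK SAX parser and returns the body text. */
    public static String extractBody(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so that localName is populated
        BodyTextCollector collector = new BodyTextCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Demo</title></head>"
                + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBody(xhtml));
    }
}
```

Because Tika's content handlers implement the same org.xml.sax.ContentHandler interface, a handler along these lines could in principle be passed to the parse method, although the utility classes shipped with Tika are the more practical choice.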
Another useful class is ParsingReader, which uses a background thread to parse the document and returns the extracted text content as a character stream.
Listing 3. Example of the ParsingReader
    InputStream stream = ...; // the document to be parsed
    Reader reader = new ParsingReader(parser, stream, ...);
    try {
        ...; // read the document text using the reader
    } finally {
        reader.close(); // the document stream is closed automatically
    }
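The general idea of producing text on a background thread and exposing it as a Reader can be sketched with the JDK's piped streams. This is only an illustration of the technique under our own assumptions, not ParsingReader's actual source: TextProducer is a hypothetical stand-in for the parsing step.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.Writer;

public class BackgroundTextReader {

    /** Hypothetical stand-in for a parser that writes extracted text to a writer. */
    interface TextProducer {
        void produce(Writer out) throws IOException;
    }

    /** Starts the producer on a background thread and returns the reading end. */
    static Reader open(TextProducer producer) throws IOException {
        PipedReader reader = new PipedReader();
        PipedWriter writer = new PipedWriter(reader);
        Thread worker = new Thread(() -> {
            // Closing the writer signals end of stream to the reader.
            try (PipedWriter out = writer) {
                producer.produce(out);
            } catch (IOException e) {
                // In this sketch a failed producer simply ends the stream early.
            }
        });
        worker.setDaemon(true);
        worker.start();
        return reader;
    }

    /** Drains the reader into a string and closes it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = reader) {
            char[] buffer = new char[1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader reader = open(out -> out.write("extracted text"));
        System.out.println(readAll(reader));
    }
}
```

The benefit of this arrangement is that the client can pull text at its own pace through a familiar Reader while the producer runs concurrently, so the full text never has to be held in memory at once.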
Document metadata
The final argument to the parse method is used to pass document metadata in and out of the parser. Document metadata is expressed as a Metadata object.
Table 2 lists some of the more interesting metadata properties.
Table 2. Metadata properties
Property | Description |
Metadata.RESOURCE_NAME_KEY | The name of the file or resource that contains the document. A client application can set this property to allow the parser to use file name heuristics to determine the format of the document. The parser implementation may set this property if the file format contains the canonical name of the file (the GZIP format has a slot for the file name, for example). |
Metadata.CONTENT_TYPE | The declared content type of the document. A client application can set this property based on, for example, an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document. The parser implementation sets this property to the content type according to which the document was parsed. |
Metadata.TITLE | The title of the document. The parser implementation sets this property if the document format contains an explicit title field. |
Metadata.AUTHOR | The name of the author of the document. The parser implementation sets this property if the document format contains an explicit author field. |
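To make the idea of file name heuristics and content type hints more concrete, the following sketch guesses a content type first from a document's leading bytes and then, failing that, from its resource name. It relies only on the JDK's URLConnection helpers and is not how Tika performs detection. ContentTypeGuess and its guess method are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class ContentTypeGuess {

    /**
     * Guesses a content type from the leading bytes of a document and falls
     * back to the resource name when the bytes are not recognized.
     * Returns null if neither hint helps.
     */
    static String guess(String resourceName, byte[] leadingBytes) throws IOException {
        // guessContentTypeFromStream needs a stream that supports mark/reset,
        // which ByteArrayInputStream does.
        try (InputStream in = new ByteArrayInputStream(leadingBytes)) {
            String fromContent = URLConnection.guessContentTypeFromStream(in);
            if (fromContent != null) {
                return fromContent;
            }
        }
        return resourceName == null
                ? null
                : URLConnection.guessContentTypeFromName(resourceName);
    }

    public static void main(String[] args) throws IOException {
        byte[] gifHeader = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        byte[] plainBytes = "hello".getBytes(StandardCharsets.US_ASCII);

        // Recognized from the content itself, regardless of the name.
        System.out.println(guess("unknown.bin", gifHeader));
        // Content is not recognized, so the file name decides.
        System.out.println(guess("index.html", plainBytes));
    }
}
```

A client application could pass hints like these to a parser through the resource name and content type properties described in Table 2.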
Note that metadata handling is still being discussed by the Apache Tika development team, and it is likely that there will be some (backwards-incompatible) changes in metadata handling before Tika V1.0.
Parser implementations
Apache Tika comes with a number of parser classes for parsing various document formats, as shown in Table 3.
Table 3. Tika parser classes
Format | Description |
Microsoft® Excel® (application/vnd.ms-excel) | Excel spreadsheet support is available in all versions of Tika and is based on the HSSF library from POI. |
Microsoft Word® (application/msword) | Word document support is available in all versions of Tika and is based on the HWPF library from POI. |
Microsoft PowerPoint® (application/vnd.ms-powerpoint) | PowerPoint presentation support is available in all versions of Tika and is based on the HSLF library from POI. |
Microsoft Visio® (application/vnd.visio) | Visio diagram support was added in Tika V0.2 and is based on the HDGF library from POI. |
Microsoft Outlook® (application/vnd.ms-outlook) | Outlook message support was added in Tika V0.2 and is based on the HSMF library from POI. |
GZIP compression (application/x-gzip) | GZIP support was added in Tika V0.2 and is based on the GZIPInputStream class in the Java 5 class library. |
bzip2 compression (application/x-bzip) | bzip2 support was added in Tika V0.2 and is based on bzip2 parsing code from Apache Ant, which was originally based on work by Keiron Liddle from Aftex Software. |
MP3 audio (audio/mpeg) | The parsing of ID3v1 tags from MP3 files was added in Tika V0.2. If found, the tag metadata is extracted and set. |
MIDI audio (audio/midi) | Tika uses the MIDI support in javax.sound.midi to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics as embedded text tracks that Tika knows how to extract. |
Wave audio (audio/basic) | Tika supports sampled wave audio (.wav files, etc.) using the javax.sound.sampled package. Only sampling metadata is extracted. |
Extensible Markup Language (XML) (application/xml) | Tika uses the javax.xml classes to parse XML files. |
HyperText Markup Language (HTML) (text/html) | Tika uses the CyberNeko library to parse HTML files. |
Images (image/*) | Tika uses the javax.imageio classes to extract metadata from image files. |
Java class files | The parsing of Java class files is based on the ASM library and work by Dave Brosius in JCR-1522. |
Java Archive Files | The parsing of JAR files is performed using a combination of the ZIP and Java class file parsers. |
OpenDocument (application/vnd.oasis.opendocument.*) | Tika uses the built-in ZIP and XML features in the Java language to parse the OpenDocument document types used most notably by OpenOffice V2.0 and higher. The older OpenOffice V1.0 formats are also supported, although they are currently not auto-detected as well as the newer formats. |
Plain text (text/plain) | Tika uses the International Components for Unicode Java library (ICU4J) to parse plain text. |
Portable Document Format (PDF) (application/pdf) | Tika uses the PDFBox library to parse PDF documents. |
Rich Text Format (RTF) (application/rtf) | Tika uses Java's built-in Swing library to parse RTF documents. |
TAR (application/x-tar) | Tika uses an adapted version of the TAR parsing code from Apache Ant to parse TAR files. The TAR code is based on work by Timothy Gerard Endres. |
ZIP (application/zip) | Tika uses Java's built-in ZIP classes to parse ZIP files. |
You can also extend Apache Tika with your own parsers, and any contributions to Tika are welcome. The goal of Tika is to reuse existing parser libraries like Apache PDFBox or Apache POI as much as possible, so most of the parser classes in Tika are adapters to such external libraries.
Apache Tika also contains some general-purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the AutoDetectParser class that encapsulates all Tika functionality into a single parser that can handle any type of document. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.
Now it's time for hands-on activities. Here are the classes we will develop throughout our tutorial:
- BudgetScramble: Shows how to use Apache Tika metadata to determine which document has been changed recently and when.
- TikaMetadata: Shows how to get all Apache Tika metadata of a specific document, even if there is no data (just to display all metadata types).
- TikaMimeType: Shows how to use Apache Tika's mimetypes to detect the mimetype of a particular document.
- TikaExtractText: Shows Apache Tika's text-extraction capabilities and saves extracted text as an appropriate file.
- LanguageDetector: Introduces Nutch's language identification ability to identify the language of particular content.
- Summary: Sums up Tika features, such as MimeType, content charset detection, and metadata. In addition, it introduces cpdetector functionality to determine a file's charset encoding. Finally, it shows Nutch's language identification in action.