Introduction
In this post, we introduce the Apache Tika framework and explain its
concepts (e.g., N-gram, parsing, MIME detection, and content analysis) via
illustrative examples that should be useful not only to seasoned
software developers but also to beginners in content analysis and
programming. We assume you have a working knowledge of the Java™
programming language and plenty of content to analyze.
Throughout this tutorial, you will learn:
- Apache Tika's API, most relevant modules, and related functions
- Apache Nutch (one of the progenitors of Tika) and its
NgramProfiler and LanguageIdentifier classes, which have recently
been ported to Tika
- cpdetector, the code page detector project, and its
functionality
What is Apache Tika?
As Apache Tika's site suggests, Apache Tika is a toolkit for detecting and
extracting metadata and structured text content from various documents
using existing parser libraries.
The parser interface
The org.apache.tika.parser.Parser interface is the key component of
Apache Tika. It hides the complexity of different file formats and
parsing libraries while providing a simple and powerful mechanism for
client applications to extract structured text content and metadata
from all sorts of documents. All this is achieved with a single
method:
void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;
The parse method takes the document to be parsed and related metadata
as input, and outputs the results as XHTML SAX events and extra
metadata. The main criteria that led to this design are shown in
Table 1.
Table 1. Criteria for Tika parsing design
Criterion | Explanation
Streamed parsing | The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.
Structured content | A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
Input metadata | A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.
Output metadata | A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, such as the name of the author, that may be useful to client applications.
These criteria are reflected in the arguments of the parse method.
Document InputStream
The first argument is an InputStream for reading the document to be
parsed. If this document stream cannot be read, parsing stops and the
thrown IOException is passed up to the client application. If the
stream can be read but not parsed (if the document is corrupted, for
example), the parser throws a TikaException.
The parser implementation will consume this stream, but will not close
it. Closing the stream is the responsibility of the client application
that opened it initially. Listing 1 shows the recommended pattern for
using streams with the parse method.
Listing 1. Recommended pattern for using streams with the parse method
InputStream stream = ...; // open the stream
try {
    parser.parse(stream, ...); // parse the stream
} finally {
    stream.close(); // close the stream
}
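On Java 7 and later, the ownership rule in Listing 1 can also be written with try-with-resources. The following is a minimal, self-contained sketch rather than Tika code: the consume method is a hypothetical stand-in for parser.parse(stream, ...), and CloseTrackingStream is an illustrative helper that only exists to show that the caller, not the "parser," closes the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamClosingSketch {

    /** Illustrative wrapper that records whether close() was called. */
    static final class CloseTrackingStream extends FilterInputStream {
        private boolean closed;

        CloseTrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical stand-in for parser.parse(stream, ...): reads the stream, never closes it. */
    static int consume(InputStream stream) throws IOException {
        int count = 0;
        while (stream.read() != -1) {
            count++;
        }
        return count;
    }

    /** Opens, consumes, and closes a stream; returns true if the stream ended up closed. */
    static boolean parseAndClose(byte[] document) throws IOException {
        CloseTrackingStream tracked = new CloseTrackingStream(new ByteArrayInputStream(document));
        try (InputStream stream = tracked) { // the code that opened the stream owns it
            consume(stream);                 // the "parser" only reads
        }                                    // close() runs here, even if consume() throws
        return tracked.isClosed();
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("closed after parsing: " + parseAndClose(document));
    }
}
```

The try-with-resources form is equivalent to the try/finally in Listing 1, with the added benefit that an exception thrown by close() does not mask one thrown while parsing.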
XHTML SAX events
The parsed content of the document stream is returned to the client
application as a sequence of XHTML SAX events. XHTML is used to
express structured content of the document, and SAX events enable
streamed processing. Note that the XHTML format is used here only to
convey structural information, not to render the documents for
browsing.
The XHTML SAX events produced by the parser implementation are sent to
a ContentHandler instance given to the parse method. If the content
handler fails to process an event, parsing stops and the thrown
SAXException is passed up to the client application.
Listing 2 shows the overall structure of the generated event stream
(with indenting added for clarity).
Listing 2. Structure of the generated event stream
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body>
    ...
  </body>
</html>
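To see how a client might consume such an event stream, here is a minimal sketch of a content handler that keeps only the text found inside the body element. It is similar in spirit to a body-only handler but is not Tika's implementation: the class name BodyTextCollector and the method extractBodyText are hypothetical, and the events come from the JDK's own SAX parser reading an XHTML string rather than from a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth; // greater than 0 while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, then track nesting of everything inside it.
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head> (for example the title) is ignored.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    /** Feeds an XHTML string through the JDK SAX parser and returns the body text. */
    static String extractBodyText(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        BodyTextCollector collector = new BodyTextCollector();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><h1>Heading</h1><p>Some text.</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

Because the handler sees events one at a time, it never needs the whole document in memory, which is the point of the streamed-parsing criterion in Table 1.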
Parser implementations typically use the XHTMLContentHandler utility
class to generate the XHTML output. Dealing with the raw SAX events
can be complex, so Apache Tika (since V0.2) comes with several utility
classes that can be used to process and convert the event stream to
other representations.
For example, the BodyContentHandler class can be used to extract just
the body part of the XHTML output and feed it as SAX events to
another content handler or as characters to an output stream, a
writer, or simply a string. The following code snippet parses a
document from the standard input stream and outputs the extracted text
content to standard output:
ContentHandler handler = new BodyContentHandler(System.out);
parser.parse(System.in, handler, ...);
Another useful class is ParsingReader, which uses a background thread
to parse the document and returns the extracted text content as a
character stream.
Listing 3. Example of the ParsingReader
InputStream stream = ...; // the document to be parsed
Reader reader = new ParsingReader(parser, stream, ...);
try {
    ...; // read the document text using the reader
} finally {
    reader.close(); // the document stream is closed automatically
}
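The idea behind ParsingReader, a background thread that produces text while the caller reads it as a character stream, can be illustrated with the JDK's piped streams. This sketch is not Tika's implementation: BackgroundTextReader, open, and readAll are hypothetical names, and the "parser" is a stand-in that simply upper-cases its input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.Writer;

final class BackgroundTextReader {

    /** Starts a background thread that "parses" the source; returns a reader over its output. */
    static Reader open(Reader source) throws IOException {
        PipedReader output = new PipedReader();
        PipedWriter sink = new PipedWriter(output);
        Thread worker = new Thread(() -> {
            // Closing the sink (via try-with-resources) signals end-of-stream to the reader.
            try (Writer out = sink; Reader in = source) {
                int c;
                while ((c = in.read()) != -1) {
                    out.write(Character.toUpperCase((char) c)); // stand-in for real parsing
                }
            } catch (IOException e) {
                // In this sketch, a failure simply ends the stream early.
            }
        }, "background-parser");
        worker.setDaemon(true);
        worker.start();
        return output;
    }

    /** Drains a reader into a string and closes it. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(reader)) {
            int c;
            while ((c = in.read()) != -1) {
                text.append((char) c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(open(new StringReader("text from a document"))));
    }
}
```

The caller only ever sees a Reader, so it can hand the result to any API that consumes character streams without knowing that parsing happens on another thread.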
Document metadata
The final argument to the parse method is used to pass document
metadata in and out of the parser. Document metadata is expressed
as a metadata object.
Table 2 lists some of the more interesting metadata properties.
Table 2. Metadata properties
Property | Description
Metadata.RESOURCE_NAME_KEY | The name of the file or resource that contains the document. A client application can set this property to allow the parser to use file name heuristics to determine the format of the document. The parser implementation may set this property if the file format contains the canonical name of the file (the GZIP format has a slot for the file name, for example).
Metadata.CONTENT_TYPE | The declared content type of the document. A client application can set this property based on, for example, an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document. The parser implementation sets this property to the content type according to which the document was parsed.
Metadata.TITLE | The title of the document. The parser implementation sets this property if the document format contains an explicit title field.
Metadata.AUTHOR | The name of the author of the document. The parser implementation sets this property if the document format contains an explicit author field.
Note that metadata handling is still being discussed by the Apache Tika
development team, and it is likely that there will be some
(backwards-incompatible) changes in metadata handling before Tika V1.0.
Parser implementations
Apache Tika comes with a number of parser classes for parsing various
document formats, as shown in Table 3.
Table 3. Tika parser classes
Format | Description
Microsoft® Excel® (application/vnd.ms-excel) | Excel spreadsheet support is available in all versions of Tika and is based on the HSSF library from POI.
Microsoft Word® (application/msword) | Word document support is available in all versions of Tika and is based on the HWPF library from POI.
Microsoft PowerPoint® (application/vnd.ms-powerpoint) | PowerPoint presentation support is available in all versions of Tika and is based on the HSLF library from POI.
Microsoft Visio® (application/vnd.visio) | Visio diagram support was added in Tika V0.2 and is based on the HDGF library from POI.
Microsoft Outlook® (application/vnd.ms-outlook) | Outlook message support was added in Tika V0.2 and is based on the HSMF library from POI.
GZIP compression (application/x-gzip) | GZIP support was added in Tika V0.2 and is based on the GZIPInputStream class in the Java 5 class library.
bzip2 compression (application/x-bzip) | bzip2 support was added in Tika V0.2 and is based on bzip2 parsing code from Apache Ant, which was originally based on work by Keiron Liddle from Aftex Software.
MP3 audio (audio/mpeg) | The parsing of ID3v1 tags from MP3 files was added in Tika V0.2. If found, the following metadata is extracted and set: TITLE (Title) and SUBJECT (Subject).
MIDI audio (audio/midi) | Tika uses the MIDI support in javax.sound.midi to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics as embedded text tracks that Tika knows how to extract.
Wave audio (audio/basic) | Tika supports sampled wave audio (.wav files, etc.) using the javax.sound.sampled package. Only sampling metadata is extracted.
Extensible Markup Language (XML) (application/xml) | Tika uses the javax.xml classes to parse XML files.
HyperText Markup Language (HTML) (text/html) | Tika uses the CyberNeko library to parse HTML files.
Images (image/*) | Tika uses the javax.imageio classes to extract metadata from image files.
Java class files | The parsing of Java class files is based on the ASM library and work by Dave Brosius in JCR-1522.
Java Archive Files | The parsing of JAR files is performed using a combination of the ZIP and Java class file parsers.
OpenDocument (application/vnd.oasis.opendocument.*) | Tika uses the built-in ZIP and XML features in the Java language to parse the OpenDocument document types used most notably by OpenOffice V2.0 and higher. The older OpenOffice V1.0 formats are also supported, although they are currently not auto-detected as well as the newer formats.
Plain text (text/plain) | Tika uses the International Components for Unicode Java library (ICU4J) to parse plain text.
Portable Document Format (PDF) (application/pdf) | Tika uses the PDFBox library to parse PDF documents.
Rich Text Format (RTF) (application/rtf) | Tika uses Java's built-in Swing library to parse RTF documents.
TAR (application/x-tar) | Tika uses an adapted version of the TAR parsing code from Apache Ant to parse TAR files. The TAR code is based on work by Timothy Gerard Endres.
ZIP (application/zip) | Tika uses Java's built-in ZIP classes to parse ZIP files.
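Several rows in Table 3 rely on classes that ship with the JDK. As one small illustration, the sketch below reads GZIP-compressed content through GZIPInputStream, the class the GZIP row names. It only shows the underlying JDK API, not how Tika wraps it; the class name GzipSketch and its helper methods are hypothetical, and the data is compressed in memory so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipSketch {

    /** Compresses text in memory so the example needs no input file. */
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    /** Reads the uncompressed content back through GZIPInputStream. */
    static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] compressed = compress("content inside a GZIP stream");
        System.out.println(decompress(compressed));
    }
}
```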
You can also extend Apache Tika with your own parsers, and any
contributions to Tika are welcome. The goal of Tika is to reuse
existing parser libraries like Apache PDFBox or Apache POI as much as
possible, so most of the parser classes in Tika are adapters to
such external libraries.
Apache Tika also contains some general-purpose parser implementations
that are not targeted at any specific document formats. The most
notable of these is the AutoDetectParser class that encapsulates all
Tika functionality into a single parser that can handle any type of
document. This parser will automatically determine the type of the
incoming document based on various heuristics and will then parse the
document accordingly.
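As a rough illustration of what "heuristics" can mean here, the sketch below guesses a content type from a document's leading magic bytes and falls back to the file name extension. It is a deliberately simplified stand-in, not the actual logic of AutoDetectParser: the class name TypeGuesser, the guess method, and the tiny signature table are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class TypeGuesser {

    /** Returns true if data begins with the given signature bytes. */
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) {
                return false;
            }
        }
        return true;
    }

    /** Guesses a media type from the leading bytes first, then from the file name. */
    static String guess(byte[] prefix, String fileName) {
        if (startsWith(prefix, '%', 'P', 'D', 'F')) {
            return "application/pdf";
        }
        if (startsWith(prefix, 0x1F, 0x8B)) {
            return "application/x-gzip";
        }
        if (startsWith(prefix, 'P', 'K', 0x03, 0x04)) {
            return "application/zip";
        }
        // No known signature: fall back to the (less reliable) file name.
        String name = fileName == null ? "" : fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".html") || name.endsWith(".htm")) {
            return "text/html";
        }
        if (name.endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream"; // unknown
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(pdf, "report.bin"));
        System.out.println(guess(new byte[0], "index.html"));
    }
}
```

Checking content before the file name reflects a common design choice: a file's bytes are harder to mislabel than its extension, so the name is used only as a fallback.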
Now it's time for hands-on activities. Here are the classes we will
develop throughout our tutorial:
- BudgetScramble: Shows how to use Apache Tika metadata to determine
which document has been changed recently and when.
- TikaMetadata: Shows how to get all Apache Tika metadata of a specific
document, even if there is no data (just to display all metadata
types).
- TikaMimeType: Shows how to use Apache Tika's mimetypes to detect
the mimetype of a particular document.
- TikaExtractText: Shows Apache Tika's text-extraction capabilities and
saves extracted text as an appropriate file.
- LanguageDetector: Introduces Nutch's language identification ability,
which identifies the language of particular content.
- Summary: Sums up Tika features, such as MimeType, content charset
detection, and metadata. In addition, it introduces cpdetector
functionality to determine a file's charset encoding. Finally, it
shows Nutch's language identification in action.