Tuesday, October 18, 2011

Understanding information content with Apache Tika

Introduction


In this post, we introduce the Apache Tika framework and explain its concepts (e.g., N-gram, parsing, MIME detection, and content analysis) via illustrative examples that should be useful not only to seasoned software developers but also to beginners in content analysis and programming. We assume you have a working knowledge of the Java™ programming language and plenty of content to analyze.
Throughout this tutorial, you will learn about:
  • Apache Tika's API, most relevant modules, and related functions
  • Apache Nutch (one of the progenitors of Tika) and its NgramProfiler and LanguageIdentifier classes, which have recently been ported to Tika
  • cpdetector, the code page detector project, and its functionality
What is Apache Tika?


As Apache Tika's site suggests, Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

The parser interface


The org.apache.tika.parser.Parser interface is the key component of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:
void parse(InputStream stream, ContentHandler handler, Metadata metadata)
    throws IOException, SAXException, TikaException;
                

The parse method takes the document to be parsed and related metadata as input, and outputs the results as XHTML SAX events and extra metadata. The main criteria that led to this design are shown in Table 1.

Table 1. Criteria for Tika parsing design


Criterion | Explanation
Streamed parsing | The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.
Structured content | A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information, for example, to better judge the relevance of different parts of the parsed document.
Input metadata | A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.
Output metadata | A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata, such as the name of the author, that may be useful to client applications.

These criteria are reflected in the arguments of the parse method.

Document InputStream


The first argument is an InputStream for reading the document to be parsed.
If this document stream cannot be read, parsing stops and the thrown IOException is passed up to the client application. If the stream can be read but not parsed (if the document is corrupted, for example), the parser throws a TikaException.
The parser implementation will consume this stream, but will not close it. Closing the stream is the responsibility of the client application that opened it initially. Listing 1 shows the recommended pattern for using streams with the parse method.

Listing 1. Recommended pattern for using streams with the parse method

InputStream stream = ...;      // open the stream
try {
    parser.parse(stream, ...); // parse the stream
} finally {
    stream.close();            // close the stream
}
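
The ownership rule behind Listing 1 can be tried without Tika on the classpath. The following is a minimal JDK-only sketch: consume is a hypothetical stand-in for parser.parse (it reads the stream to the end and never closes it), and TrackingStream is a helper invented here to record when close is called. On Java 7 and later, try-with-resources gives the same guarantee as the try/finally pattern.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class StreamOwnershipSketch {
    // Hypothetical stand-in for parser.parse(...): consumes the stream, never closes it.
    static long consume(InputStream stream) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        int n;
        while ((n = stream.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    // Illustration-only wrapper that remembers whether close() was called.
    static class TrackingStream extends FilterInputStream {
        boolean closed;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    // Returns { stream still open after consume, stream closed after the try block }.
    static boolean[] demo() {
        TrackingStream stream = new TrackingStream(new ByteArrayInputStream(new byte[10]));
        boolean openAfterConsume;
        try (InputStream in = stream) {           // the caller opened it, so the caller closes it
            consume(in);
            openAfterConsume = !stream.closed;    // the consumer left the stream open
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] { openAfterConsume, stream.closed };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("open after consume: " + result[0]);
        System.out.println("closed by caller:   " + result[1]);
    }
}
```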
                

XHTML SAX events


The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express structured content of the document, and SAX events enable streamed processing. Note that the XHTML format is used here only to convey structural information, not to render the documents for browsing.
The XHTML SAX events produced by the parser implementation are sent to a ContentHandler instance given to the parse method. If the content handler fails to process an event, parsing stops and the thrown SAXException is passed up to the client application.
Listing 2 shows the overall structure of the generated event stream (with indenting added for clarity).

Listing 2. Structure of the generated event stream


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>...</title>
  </head>
  <body>
    ...
  </body>
</html>
                

Parser implementations typically use the XHTMLContentHandler utility class to generate the XHTML output. Dealing with the raw SAX events can be complex, so Apache Tika (since V0.2) comes with several utility classes that can be used to process and convert the event stream to other representations.
For example, the BodyContentHandler class can be used to extract just the body part of the XHTML output and feed it as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:
ContentHandler handler = new BodyContentHandler(System.out);
parser.parse(System.in, handler, ...);
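
To see what "extracting just the body part" means at the SAX level, here is a minimal sketch that uses only the JDK's SAX classes. BodyTextSketch is a hypothetical handler written for this post, not Tika's BodyContentHandler, and the XHTML string stands in for the event stream a parser would produce.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical body-only handler for illustration; not Tika's BodyContentHandler.
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inBody;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("body".equals(qName)) {
            inBody = true;                 // start collecting character events
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("body".equals(qName)) {
            inBody = false;                // stop collecting after </body>
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inBody) {
            text.append(ch, start, length);
        }
    }

    // Feeds an XHTML string through a SAX parser and returns the text found inside <body>.
    static String extractBody(String xhtml) {
        BodyTextSketch handler = new BodyTextSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not process the XHTML events", e);
        }
        return handler.text.toString().trim();
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Sample</title></head>"
                + "<body><p>Hello, Tika.</p></body></html>";
        System.out.println(extractBody(xhtml)); // the title is skipped, the paragraph text is kept
    }
}
```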
                

Another useful class is ParsingReader, which uses a background thread to parse the document and returns the extracted text content as a character stream.

Listing 3. Example of the ParsingReader


InputStream stream = ...; // the document to be parsed
Reader reader = new ParsingReader(parser, stream, ...);
try {
 ...; // read the document text using the reader
} finally {
 reader.close(); // the document stream is closed automatically
}
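
The background-thread idea can be sketched with the JDK's piped character streams. This is an illustration of the technique under our own assumptions, not ParsingReader's actual implementation: TextProducer is a hypothetical interface standing in for "a parser that writes extracted text", and error propagation to the reading side is left out for brevity.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class BackgroundReaderSketch {
    // Hypothetical stand-in for a parser that writes extracted text to a Writer.
    interface TextProducer {
        void writeTo(Writer out) throws IOException;
    }

    // Starts the producer on a background thread and returns the reading end of the pipe.
    static Reader open(TextProducer producer) {
        try {
            PipedWriter writer = new PipedWriter();
            PipedReader reader = new PipedReader(writer);
            Thread worker = new Thread(() -> {
                try (Writer out = writer) {   // closing the writer signals end of text to the reader
                    producer.writeTo(out);
                } catch (IOException e) {
                    // A fuller version would hand this failure over to the reading side.
                }
            }, "parse-sketch");
            worker.setDaemon(true);
            worker.start();
            return reader;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Drains the reader into a string and closes it.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Reader reader = open(out -> out.write("extracted text"));
        System.out.println(readAll(reader));
    }
}
```

The client reads at its own pace while the producer blocks whenever the pipe buffer is full, which is what keeps memory use bounded for large documents.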
                

Document metadata


The final argument to the parse method is used to pass document metadata in and out of the parser. Document metadata is expressed as a Metadata object.
Table 2 lists some of the more interesting metadata properties.

Table 2. Metadata properties
Property | Description
Metadata.RESOURCE_NAME_KEY | The name of the file or resource that contains the document. A client application can set this property to allow the parser to use file name heuristics to determine the format of the document. The parser implementation may set this property if the file format contains the canonical name of the file (the GZIP format has a slot for the file name, for example).
Metadata.CONTENT_TYPE | The declared content type of the document. A client application can set this property based on, for example, an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document. The parser implementation sets this property to the content type according to which the document was parsed.
Metadata.TITLE | The title of the document. The parser implementation sets this property if the document format contains an explicit title field.
Metadata.AUTHOR | The name of the author of the document. The parser implementation sets this property if the document format contains an explicit author field.

Note that metadata handling is still being discussed by the Apache Tika development team, and it is likely that there will be some (backwards-incompatible) changes in metadata handling before Tika V1.0.
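
The "in and out" role of the metadata argument can be pictured with a plain map. The sketch below is hypothetical throughout: the key names, the parse method, and the naive title lookup are invented for illustration and are not Tika's Metadata class or constants. The client puts a hint in (the resource name), and the stand-in parser writes its findings back (content type and title).

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MetadataInOutSketch {
    // Hypothetical keys for illustration; not Tika's Metadata constants.
    static final String NAME = "resourceName";
    static final String TYPE = "contentType";
    static final String TITLE = "title";

    // Hypothetical parser stand-in: reads hints from the map and writes results into it.
    static void parse(String content, Map<String, String> metadata) {
        String name = metadata.getOrDefault(NAME, "");
        boolean html = name.endsWith(".html") || content.startsWith("<html");
        metadata.put(TYPE, html ? "text/html" : "text/plain");   // output metadata

        if (html) {
            int start = content.indexOf("<title>");
            int end = content.indexOf("</title>");
            if (start >= 0 && end > start) {
                // Only set the title when the document carries an explicit one.
                metadata.put(TITLE, content.substring(start + "<title>".length(), end));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(NAME, "index.html");                         // input metadata
        parse("<html><head><title>Welcome</title></head><body/></html>", metadata);
        System.out.println(metadata);
    }
}
```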

Parser implementations


Apache Tika comes with a number of parser classes for parsing various document formats, as shown in Table 3.

Table 3. Tika parser classes
Format | Description
Microsoft® Excel® (application/vnd.ms-excel) | Excel spreadsheet support is available in all versions of Tika and is based on the HSSF library from POI.
Microsoft Word® (application/msword) | Word document support is available in all versions of Tika and is based on the HWPF library from POI.
Microsoft PowerPoint® (application/vnd.ms-powerpoint) | PowerPoint presentation support is available in all versions of Tika and is based on the HSLF library from POI.
Microsoft Visio® (application/vnd.visio) | Visio diagram support was added in Tika V0.2 and is based on the HDGF library from POI.
Microsoft Outlook® (application/vnd.ms-outlook) | Outlook message support was added in Tika V0.2 and is based on the HSMF library from POI.
GZIP compression (application/x-gzip) | GZIP support was added in Tika V0.2 and is based on the GZIPInputStream class in the Java 5 class library.
bzip2 compression (application/x-bzip) | bzip2 support was added in Tika V0.2 and is based on bzip2 parsing code from Apache Ant, which was originally based on work by Keiron Liddle from Aftex Software.
MP3 audio (audio/mpeg) | The parsing of ID3v1 tags from MP3 files was added in Tika V0.2. If found, the following metadata is extracted and set:
  • TITLE Title
  • SUBJECT Subject
MIDI audio (audio/midi) | Tika uses the MIDI support in javax.sound.midi to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics as embedded text tracks that Tika knows how to extract.
Wave audio (audio/basic) | Tika supports sampled wave audio (.wav files, etc.) using the javax.sound.sampled package. Only sampling metadata is extracted.
Extensible Markup Language (XML) (application/xml) | Tika uses the javax.xml classes to parse XML files.
HyperText Markup Language (HTML) (text/html) | Tika uses the CyberNeko library to parse HTML files.
Images (image/*) | Tika uses the javax.imageio classes to extract metadata from image files.
Java class files | The parsing of Java class files is based on the ASM library and work by Dave Brosius in JCR-1522.
Java Archive Files | The parsing of JAR files is performed using a combination of the ZIP and Java class file parsers.
OpenDocument (application/vnd.oasis.opendocument.*) | Tika uses the built-in ZIP and XML features in the Java language to parse the OpenDocument document types used most notably by OpenOffice V2.0 and higher. The older OpenOffice V1.0 formats are also supported, although they are currently not auto-detected as well as the newer formats.
Plain text (text/plain) | Tika uses the International Components for Unicode Java library (ICU4J) to parse plain text.
Portable Document Format (PDF) (application/pdf) | Tika uses the PDFBox library to parse PDF documents.
Rich Text Format (RTF) (application/rtf) | Tika uses Java's built-in Swing library to parse RTF documents.
TAR (application/x-tar) | Tika uses an adapted version of the TAR parsing code from Apache Ant to parse TAR files. The TAR code is based on work by Timothy Gerard Endres.
ZIP (application/zip) | Tika uses Java's built-in ZIP classes to parse ZIP files.
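
Several rows in Table 3 lean on classes that ship with the JDK. As one example, the sketch below exercises GZIPInputStream directly: it compresses a string in memory and reads it back. The class and method names are our own, and this shows only the underlying JDK building block, not how Tika's GZIP parser is wired up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipSketch {
    // Compresses UTF-8 text into a GZIP byte array.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();   // complete once the GZIP stream has been closed
    }

    // Decompresses a GZIP byte array back into UTF-8 text.
    static String gunzip(byte[] compressed) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("content to analyze");
        // Every GZIP stream starts with the magic bytes 0x1f 0x8b.
        System.out.printf("magic: %02x %02x%n", compressed[0] & 0xff, compressed[1] & 0xff);
        System.out.println(gunzip(compressed));
    }
}
```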

You can also extend Apache Tika with your own parsers, and any contributions to Tika are welcome. The goal of Tika is to reuse existing parser libraries like Apache PDFBox or Apache POI as much as possible, so most of the parser classes in Tika are adapters to such external libraries.
Apache Tika also contains some general-purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the AutoDetectParser class that encapsulates all Tika functionality into a single parser that can handle any type of document. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.
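
To give a feel for what "various heuristics" can look like, here is a deliberately tiny sketch that checks a few well-known magic byte sequences first and falls back to the file extension. It is an assumption-laden illustration, not AutoDetectParser's detection logic: the class, the method names, and the small rule set are ours, and the type names are taken from Table 3.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class TypeGuessSketch {
    // Returns true when data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Guesses a content type from the leading bytes, then from the file name.
    static String guess(byte[] head, String name) {
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(head, new byte[] {0x1f, (byte) 0x8b})) {
            return "application/x-gzip";
        }
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(head, "{\\rtf".getBytes(StandardCharsets.US_ASCII))) {
            return "application/rtf";
        }
        // No magic bytes matched: fall back to a file name heuristic.
        String lower = name == null ? "" : name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".html") || lower.endsWith(".htm")) {
            return "text/html";
        }
        if (lower.endsWith(".txt")) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] pdfHead = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(pdfHead, "report.bin"));   // content wins over the misleading name
        System.out.println(guess(new byte[0], "notes.txt"));
    }
}
```

Checking content before the name is a design choice in this sketch: file names are easy to get wrong, while leading bytes are harder to mislabel by accident.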
Now it's time for hands-on activities. Here are the classes we will develop throughout our tutorial:
  1. BudgetScramble: Shows how to use Apache Tika metadata to determine which document has been changed recently and when.
  2. TikaMetadata: Shows how to get all Apache Tika metadata of a specific document, even if there is no data (just to display all metadata types).
  3. TikaMimeType: Shows how to use Apache Tika's MIME type support to detect the MIME type of a particular document.
  4. TikaExtractText: Shows Apache Tika's text-extraction capabilities and saves extracted text as an appropriate file.
  5. LanguageDetector: Introduces Nutch's language identification ability to identify the language of particular content.
  6. Summary: Sums up Tika features, such as MIME type detection, content charset detection, and metadata. In addition, it introduces cpdetector functionality to determine a file's charset encoding. Finally, it shows Nutch's language identification in action.
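
Before moving on to those classes, it may help to see the N-gram idea behind language identification in miniature. The sketch below is a heavy simplification under our own assumptions and is not Nutch's NgramProfiler or LanguageIdentifier: it keeps only the set of distinct character trigrams per language (real profilers also rank N-grams by frequency), and the two one-sentence "corpora" are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class NgramSketch {
    // Collects the distinct character trigrams of a text, spaces included.
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Picks the language whose profile shares the most trigrams with the text.
    static String identify(String text, Map<String, Set<String>> profiles) {
        Set<String> grams = trigrams(text);
        String best = "unknown";
        int bestScore = 0;
        for (Map.Entry<String, Set<String>> profile : profiles.entrySet()) {
            int score = 0;
            for (String gram : grams) {
                if (profile.getValue().contains(gram)) {
                    score++;
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = profile.getKey();
            }
        }
        return best;
    }

    // Made-up, one-sentence training samples; real profiles come from large corpora.
    static Map<String, Set<String>> sampleProfiles() {
        Map<String, Set<String>> profiles = new LinkedHashMap<>();
        profiles.put("en", trigrams("the quick brown fox jumps over the lazy dog and the cat"));
        profiles.put("de", trigrams("der schnelle braune fuchs jagt den faulen hund und die katze"));
        return profiles;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> profiles = sampleProfiles();
        System.out.println(identify("the lazy dog and the cat", profiles));      // prints "en"
        System.out.println(identify("den faulen hund und die katze", profiles)); // prints "de"
    }
}
```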
