B Oracle Text Supported Document Formats

Oracle Text uses the HTML export technology of Oracle Outside In for automatic filtering. This appendix provides tables with the document and graphic file formats supported by the automatic AUTO_FILTER filtering technology for this release.

See Also:

"AUTO_FILTER" for information on using AUTO_FILTER

B.1 About Document Filtering Technology

The automatic filtering technology in Oracle Text enables you to convert documents to HTML for document presentation with the CTX_DOC package.

To use automatic filtering for indexing and DML processing, you must specify the AUTO_FILTER object in your filter preference.

To use automatic filtering technology for converting documents to HTML with the CTX_DOC package, you need not use the AUTO_FILTER indexing preference.

B.1.1 Latest Updates for Patch Releases

The supported platforms and formats listed in this appendix apply to this release. The list of supported formats is updated for patch releases.

B.1.2 Restrictions on Format Support

The formats listed in this appendix are those formats recognized by AUTO_FILTER. Recognizing a format does not necessarily mean that text can be extracted from it. For example, a scanned document is usually an image and AUTO_FILTER does not perform optical character recognition. Similarly, text cannot be extracted for indexing from multimedia file types.

Password-protected documents and documents with password-protected content are not supported by the AUTO_FILTER filter.

For other limitations, see "Supported Document Formats" concerning specific document types.

B.1.3 Supported Platforms for AUTO_FILTER Document Filtering Technology

Several platforms can take advantage of AUTO_FILTER filter technology.

B.1.3.1 Supported Platforms

AUTO_FILTER filter technology is supported on the following platforms:

  • Windows (x86 32-bit) Windows 2000, Windows Server 2003, Windows Server 2008, Windows XP, and Windows Vista Enterprise

  • Windows (x86 64-bit) Windows Server 2003 and Windows Server 2008 x64 Standard, Enterprise, and Datacenter Editions (64-bit Extended Systems)

  • HP-UX (PA-RISC 64-bit) 11i

  • HP-UX (Itanium 64) 11i

  • IBM AIX (32-bit pSeries) 5.1 - 5.3

  • iSeries (OS/400 using PASE) V5R2

  • Red Hat Linux (x86) Advanced Server 3, 4, and 5

  • Red Hat Linux (x86) Red Hat Enterprise Linux (RHEL) 4

  • Red Hat Linux (Itanium 64) Advanced Server 3, 4, and 5

  • Red Hat Linux (zSeries, 31-bit) Advanced Server 3 and 4

  • Red Hat Enterprise Linux AS/ES 3.0, 4.0 and 5.0, x86-64 (AMD64/EM64T)

  • Oracle Linux 4.0 and 5.0, x86-64 (AMD64/EM64T)

  • SuSE Linux (x86) 9, 10, and Enterprise Server 9.0

  • SuSE Linux (x86 64-bit) SUSE Enterprise Server (SLES) 9, 10

  • SuSE Linux (Itanium 64) Enterprise Server 8

  • SuSE Linux (zSeries, 31-bit) 9

  • Sun Solaris (SPARC 64-bit) 9.x - 10.x

  • Sun Solaris (x86 64-bit) 10.x

Note that some of these platforms may not be supported by the Oracle Database.

B.1.4 Filtering on PDF Documents and Security Settings

A PDF document can have different levels of security settings as follows:

Table B-1 AUTO_FILTER Behavior with PDF Security Settings

Level 1: Requires a password for opening the document.

PDF Version  Encryption   AUTO_FILTER Support Level
1.2+         40 bit RC4   Not supported.
1.4+         128 bit RC4  Not supported.
1.5+         128 bit RC4  Not supported.
1.6+         128 bit AES  Not supported.
1.7+         256 bit AES  Not supported.

Level 2: Disallows user printing of the document.

PDF Version  Encryption   AUTO_FILTER Support Level
1.2+         40 bit RC4   Supported.
1.4+         128 bit RC4  Supported.
1.5+         128 bit RC4  Supported.
1.6+         128 bit AES  Not supported.
1.7+         256 bit AES  Not supported.

Level 3: Disallows user modification or change of the document.

PDF Version  Encryption   AUTO_FILTER Support Level
1.2+         40 bit RC4   Supported.
1.4+         128 bit RC4  Supported.
1.5+         128 bit RC4  Supported.
1.6+         128 bit RC4  Not supported.
1.7+         256 bit AES  Not supported.

Level 4: Disallows the user from copying or extracting content from the document.

PDF Version  Encryption   AUTO_FILTER Support Level
1.2+         40 bit RC4   Supported.
1.4+         128 bit RC4  Supported.
1.5+         128 bit RC4  Supported.
1.6+         128 bit AES  Not supported.
1.7+         256 bit AES  Not supported.
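
The support pattern in Table B-1 can be sketched as a small lookup. This is only an illustrative Python sketch and is not part of Oracle Text: the function name `auto_filter_supports` and the row labels are assumptions chosen for the example. It keys on the security level and the PDF version row only, because those two columns determine every entry in the table.

```python
# Rows of Table B-1, identified by the "PDF Version" column.
ALL_ROWS = ("1.2+", "1.4+", "1.5+", "1.6+", "1.7+")

# Rows that Table B-1 marks "Supported." for security levels 2, 3, and 4.
SUPPORTED_ROWS = frozenset({"1.2+", "1.4+", "1.5+"})


def auto_filter_supports(level: int, pdf_version_row: str) -> bool:
    """Return True if Table B-1 lists the combination as "Supported."

    `level` is the security level (1-4) and `pdf_version_row` is one of the
    version labels used in the table, for example "1.4+".
    """
    if level not in (1, 2, 3, 4):
        raise ValueError(f"unknown security level: {level}")
    if pdf_version_row not in ALL_ROWS:
        raise ValueError(f"unknown PDF version row: {pdf_version_row}")
    if level == 1:
        # A password is required to open the document: never supported.
        return False
    return pdf_version_row in SUPPORTED_ROWS
```

For example, `auto_filter_supports(2, "1.5+")` mirrors the "Supported." entry for a print-restricted PDF 1.5 document, while any Level 1 combination returns `False`.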


B.1.5 PDF Filtering Limitations

The following limitations apply when filtering PDF files:

  • Multi-byte PDFs are supported, provided the PDF document is created using Character ID-keyed (CID) fonts, predefined CJK CMap files, or ToUnicode font encodings, and the document does not contain embedded fonts.

  • Embedded fonts in a PDF document are not filtered correctly. They are usually displayed using the question mark (?) replacement character.

  • Hyperlinks in a PDF are not active when displayed in a browser or a viewing window.

  • Annotations, such as notes, sound, or movies, are not supported.

B.1.6 Environment Variables

No environment variables need to be set by the user.

B.1.7 General Limitations

AUTO_FILTER filter technology has the following limitations:

  • Any ASCII characters less than 0x20 (decimal 32) are converted to hexadecimal numbers.

  • Files larger than 2 GB are not handled.
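
An application that loads documents could check for these two limitations before indexing. The following Python sketch is one hedged way to do that. The helper names are hypothetical, and the limit is read here as 2 * 1024**3 bytes, which is an assumption because this appendix does not say whether "2GB" is binary or decimal.

```python
import os

# Assumption: "2GB" is interpreted as 2 * 1024**3 bytes.
AUTO_FILTER_MAX_BYTES = 2 * 1024**3


def too_large_for_auto_filter(size_bytes: int) -> bool:
    """Return True if a file of this size exceeds the assumed 2 GB limit."""
    return size_bytes > AUTO_FILTER_MAX_BYTES


def oversized_files(paths):
    """Return the subset of `paths` whose on-disk size exceeds the limit."""
    return [p for p in paths if too_large_for_auto_filter(os.path.getsize(p))]


def count_low_ascii(data: bytes) -> int:
    """Count bytes below 0x20, the range the filter converts to hexadecimal."""
    return sum(1 for b in data if b < 0x20)
```

Note that the range below 0x20 includes tab, line feed, and carriage return, so `count_low_ascii` is usually nonzero for ordinary text files.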

B.2 Supported Document Formats

Document filtering is used for indexing, for DML processing, and for converting documents to HTML with the CTX_DOC package. The tables in this section list the document formats that Oracle Text supports for filtering.

Note:

These lists do not represent the complete list of formats that Oracle Text can process. The USER_FILTER and PROCEDURE_FILTER filter types enable Oracle Text to process any document format, provided that an external filter exists that can convert the document to a textual format such as plain text, HTML, or XML.
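
As a rough illustration of the kind of external filter that this note refers to, the following Python sketch converts an input file to plain text. It assumes that the filter is invoked with an input file name followed by an output file name; that calling convention is an assumption and is not stated in this appendix, so check the USER_FILTER documentation before relying on it. The function names and the Latin-1 input encoding are hypothetical.

```python
import sys


def filter_to_text(src_path: str, dst_path: str) -> None:
    """Hypothetical filter body: rewrite a Latin-1 file as UTF-8 plain text."""
    # newline="" keeps the original line endings unchanged in both directions.
    with open(src_path, "r", encoding="latin-1", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


def main(argv) -> int:
    """Entry point; assumed convention is <script> <input file> <output file>."""
    if len(argv) != 3:
        print("usage: filter <input file> <output file>", file=sys.stderr)
        return 2
    filter_to_text(argv[1], argv[2])
    return 0
```

When deployed as a standalone script, the file would end with a call such as `sys.exit(main(sys.argv))`. A real filter would replace `filter_to_text` with logic that understands the proprietary format being converted.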

B.2.1 Archive File Format

When an archive file is filtered, the contents of all files inside the archive, including files in subfolders, are exported to a single output file.

Table B-2 lists the archive formats that Oracle Text supports.

Table B-2 Supported Archive File Formats

Archive Format            Version
7z (BZIP2 and split archives not supported)
7z Self Extracting .exe (BZIP2 and split archives not supported)
LZA Self Extracting Compress
LZH Compress
Microsoft Office Binder   95 – 97
Microsoft Cabinet (CAB)
RAR                       1.5, 2.0, 2.9
Self-extracting .exe
UNIX Compress
UNIX GZip
UNIX Tar
Uuencode
Zip                       PKZip
Zip                       WinZip


B.2.2 Database Formats

Format Version
DataEase 4.x
dBASE III, IV, V
First Choice DB Through 3.0
Framework DB 3.0
Microsoft Access 1.0, 2.0
Microsoft Access Report Snapshot (File ID only) 2000 – 3000
Microsoft Works DB for DOS 1.0, 2.0
Microsoft Works DB for Macintosh 2.0
Microsoft Works DB for Windows 3.0, 4.0
Paradox for DOS 2.0 – 4.0
Paradox for Windows 1.0
Q&A Database Through 2.0
R:BASE 5000 Through 3.1
R:BASE System V 1.0
Reflex 2.0
SmartWare II 1.02

B.2.3 Email Formats

Format Version
Apple Mail Message (EMLX) 2.0
Encoded mail messages MHT
Encoded mail messages Multi Part Alternative
Encoded mail messages Multi Part Digest
Encoded mail messages Multi Part Mixed
Encoded mail messages Multi Part News Group
Encoded mail messages Multi Part Signed
Encoded mail messages TNEF
IBM Lotus Notes Domino XML Language DXL 8.5
IBM Lotus Notes NSF (File ID) 7.x, 8.x
IBM Lotus Notes NSF (Windows, Linux x86-32 and Oracle Solaris 32-bit only with Notes Client or Domino Server) 8.x
MBOX Mailbox RFC 822
Microsoft Outlook Message (MSG) 97 – 2007
Microsoft Outlook Express (EML) MIME-encoded mail messages.
Microsoft Outlook Forms Template (OFT) 97 – 2007
Microsoft Outlook OST 97 – 2007
Microsoft Outlook PST 97 – 2007
Microsoft Outlook PST (Mac) 2001

B.2.3.1 MIME Support Notes

The following formats are supported:

  • MIME formats

    • EML

    • MHT (Web Archive)

    • NWS (Newsgroup single-part and multi-part)

    • Simple Text Mail (defined in RFC 2822)

  • TNEF format

  • MIME encodings, including

    • base64 (defined in RFC 1521)

    • binary (defined in RFC 1521)

    • binhex (defined in RFC 1741)

    • btoa

    • quoted-printable (defined in RFC 1521)

    • utf-7 (defined in RFC 2152)

    • uue

    • xxe

    • yenc

In addition, the body of a message can be encoded in several ways. The following encodings are supported:

  • HTML

  • RTF

  • TNEF

  • Text/enriched (defined in RFC 1523)

  • Text/richtext (defined in RFC 1341)

  • Embedded mail message (defined in RFC 822) - this is handled as a link to a new message

The attachments of a MIME message can be stored in many formats. Oracle Corporation processes all attachment types that its technology supports.

B.2.4 Graphic Formats (Raster and Vector Image)

The AUTO_FILTER filter recognizes the graphic formats listed in this section, so indexing a text column that contains any of these formats produces no error. Formats are categorized as either embedded graphics or standalone graphics. Embedded graphics are inserted or referenced within a document.

Table B-3 and Table B-4 list the supported raster and vector graphic formats.

Note:

The AUTO_FILTER filter cannot extract textual information from graphics.

Table B-3 Supported Raster Image Formats for AUTO_FILTER Filter

Format                          Version
CALS Raster (GP4)               Type I
CALS Raster (GP4)               Type II
Computer Graphics Metafile      ANSI
Computer Graphics Metafile      CALS
Computer Graphics Metafile      NIST
Encapsulated PostScript (EPS)   TIFF header Only
GEM Image (Bitmap)
Graphics Interchange Format (GIF)
IBM Graphics Data Format (GDF)  1.0
IBM Picture Interchange Format  1.0
JBIG2                           Graphic Embeddings in PDF
JFIF (JPEG not in TIFF format)
JPEG
JPEG 2000                       JP2
Kodak Flash Pix
Kodak Photo CD                  1.0
Lotus PIC
Lotus Snapshot
Macintosh PICT                  BMP only
Macintosh PICT2                 BMP only
MacPaint
Microsoft Windows Bitmap
Microsoft Windows Cursor
Microsoft Windows Icon
OS/2 Bitmap
OS/2 Warp Bitmap
Paint Shop Pro (Win32 only)     5.0, 6.0
PC Paintbrush (PCX)
PC Paintbrush DCX (multi-page PCX)
Portable Bitmap (PBM)
Portable Graymap PGM
Portable Network Graphics (PNG)
Portable Pixmap (PPM)
Progressive JPEG
StarOffice Draw                 6.x – 9.0
Sun Raster
TIFF                            Group 5 & 6
TIFF CCITT                      Group 3 & 4
TruVision TGA (Targa)           2.0
Word Perfect Graphics           1.0
WBMP wireless graphics format
X-Windows Bitmap                x10 compatible
X-Windows Dump                  x10 compatible
X-Windows Pixmap                x10 compatible
WordPerfect Graphics            2.0 – 10.0


Table B-4 Supported Vector Image Formats for AUTO_FILTER Filter

Graphics Format                           Version
Adobe Illustrator                         4.0 – 7.0, 9.0
Adobe Illustrator (XMP only)              11 – 13 (CS 1 – 3)
Adobe InDesign (XMP only)                 3.0 – 5.0 (CS 1 - 3)
Adobe InDesign Interchange (XMP only)
Adobe Photoshop (XMP only)                8.0 – 10.0 (CS 1 – 3)
Adobe PDF                                 1.0 – 1.7 (Acrobat 1 – 9)
Adobe PDF Package                         1.7 (Acrobat 8 – 9)
Adobe PDF Portfolio                       1.7 (Acrobat 8 – 9)
Adobe Photoshop                           4.0
Ami Draw                                  SDW
AutoCAD Drawing                           2.5, 2.6
AutoCAD Drawing                           9.0 – 14.0
AutoCAD Drawing                           2000i – 2010
AutoShade Rendering                       2
Corel Draw                                2.0 – 9.0
Corel Draw Clipart                        5.0, 7.0
Enhanced Metafile (EMF)
Escher graphics
FrameMaker Graphics (FMV)                 3.0 – 5.0
Gem File (Vector)
Harvard Graphics Chart DOS                2.0 – 3.0
Harvard Graphics for Windows
Hewlett Packard Graphics Language (HPGL)  2.0
IGES Drawing                              5.1 – 5.3
Micrografx Designer (DRW)                 Through 3.1
Micrografx Designer (DFS)                 6.0
Micrografx Draw (DRW)                     Through 4.0
Microsoft XPS (Text only)
Novell PerfectWorks Draw                  2
OpenOffice Draw                           1.1 – 3.0
Oracle Open Office Draw                   3.x
Visio (Page Preview mode WMF/EMF)         4.0
Visio                                     5.0 - 2007
Visio XML VSX (File ID only)              2007
Windows Metafile (WMF)

B.2.5 Multimedia Formats

The multimedia formats listed below are those formats recognized by AUTO_FILTER. Recognizing a format does not necessarily mean that text can be extracted from it. Also, the file name and file header information are not indexed. A scanned document is usually an image, and AUTO_FILTER does not perform optical character recognition. Similarly, text cannot be extracted for indexing from multimedia file types.

Format Version
AVI (Metadata extraction only)  
Flash (text extraction only) 6.x, 7.x, Lite
Flash (File ID only) 9, 10
Real Media (File ID only)  
MP3 (ID3 metadata only)  
MPEG-1 Audio layer 3 V ID3 v1 (File ID only)
MPEG-1 Audio layer 3 V ID3 v2 (File ID only)
MPEG-1 Video V 2 (File ID only)
MPEG-1 Video V 3 (File ID only)  
MPEG-2 Audio (File ID only)  
MPEG-4 (Metadata extraction only)  
MPEG-7 (Metadata extraction only)  
QuickTime (Metadata extraction only)  
Windows Media ASF (Metadata extraction only)  
Windows Media DVR-MS (Metadata extraction only)  
Windows Media Audio WMA (Metadata extraction only)  
Windows Media Playlist (File ID only)  
Windows Media Video WMV (Metadata extraction only)  
WAV (Metadata extraction only)  

B.2.6 Other Formats

Format Version
AOL Messenger (File ID only) 7.3
Microsoft InfoPath (File ID only) 2007
Microsoft Live Messenger (via XML filter) 10.0
Microsoft OneNote (File ID only) 2007
Microsoft Project (table view only) 98 – 2003
Microsoft Project (table view only) 2007, 2010
Microsoft Windows Compiled Help (File ID only) .chm
Microsoft Windows DLL -
Microsoft Windows Executable -
Microsoft Windows Explorer Command (File ID only) .scf
Microsoft Windows Help (File ID only) .hlp
Microsoft Windows Shortcut (File ID only) .lnk
Trillian Text Log File (via text filter) 4.2
Trillian XML Log File (File ID only) 4.2
TrueType Font (File ID only) ttf, ttc
vCalendar 2.1
vCard 2.1
Yahoo Messenger 6.x – 8

B.2.7 Presentation Formats

Format Version
Harvard Graphics Presentation DOS 3.0
IBM Lotus Symphony Presentations 1.x
Kingsoft WPS Presentation 2010
Lotus Freelance 1.0 – Millennium 9.6
Lotus Freelance for OS/2 2
Lotus Freelance for Windows 95, 97
Microsoft PowerPoint for Macintosh 4.0 – 2008
Microsoft PowerPoint for Windows 3.0 – 2010
Microsoft PowerPoint for Windows Slideshow 2007 – 2010
Microsoft PowerPoint for Windows Template 2007 – 2010
Novell Presentations 3.0, 7.0
OpenOffice Impress 1.1, 3.0
Oracle Open Office Impress 3.x
StarOffice Impress 5.2 – 9.0
WordPerfect Presentations 5.1 – X4

B.2.8 Spreadsheet Formats

Format Version
Enable Spreadsheet 3.0 – 4.5
First Choice SS Through 3.0
Framework SS 3.0
IBM Lotus Symphony Spreadsheets 1.x
Kingsoft WPS Spreadsheets 2010
Lotus 1-2-3 Through Millennium 9.6
Lotus 1-2-3 Charts for DOS and Windows Through 5.0
Lotus 1-2-3 for OS/2 Through 2.0
Microsoft Excel Charts 2.x – 2007
Microsoft Excel for Macintosh 98 – 2008
Microsoft Excel for Windows 3.0 – 2010
Microsoft Excel for Windows (text only via XML filter) 2003 XML
Microsoft Excel for Windows (.xlsb) 2007 – 2010 (Binary)
Microsoft Works SS for DOS 2.0
Microsoft Works SS for Macintosh 2.0
Microsoft Works SS for Windows 3.0, 4.0
Multiplan 4.0
Novell PerfectWorks Spreadsheet 2.0
OpenOffice Calc 1.1 – 3.0
Oracle Open Office Calc 3.x
PFS: Plan 1.0
Quattro Pro for DOS Through 5.0
Quattro Pro for Windows Through X4
SmartWare Spreadsheet  
SmartWare II SS 1.02
StarOffice Calc 5.2 – 9.0
SuperCalc 5.0
VP-Planner 1.0

B.2.9 Text and Markup Formats

Format Version
ANSI Text 7 and 8 bit
ASCII Text 7 and 8 bit
DOS character set  
EBCDIC  
HTML (CSS rendering not supported) 1.0 – 4.0
IBM DCA/RFT  
Macintosh character set  
Rich Text Format (RTF)  
Unicode Text  
UTF-8  
Wireless Markup Language  
XML (text only)  
XHTML (File ID only) 1.0

B.2.10 Word Processing and Desktop Publishing Formats

Format Version
Adobe FrameMaker (MIF only) 3.0 – 6.0
Adobe Illustrator Postscript Level 2
Ami  
Ami Pro for OS2  
Ami Pro for Windows 2.0, 3.0
DEC DX Through 4.0
DEC DX Plus 4.0, 4.1
Enable Word Processor 3.0 – 4.5
First Choice WP 1.0, 3.0
Framework WP 3.0
Hangul 97 – 2007
IBM DCA/FFT  
IBM DisplayWrite 2.0 – 5.0
IBM Writing Assistant 1.01
Ichitaro 5.0, 6.0, 8.0 – 13.0, 2004, 2010
JustWrite Through 3.0
Kingsoft WPS Writer 2010
Legacy 1.1
Lotus Manuscript Through 2.0
Lotus WordPro 9.7, 96 – Millennium 9.6
Lotus Word Pro (non Win32) 97 – Millennium 9.6
MacWrite II 1.1
Mass 11 Through 8.0
Microsoft Publisher (File ID only) 2003 – 2007
Microsoft Word for DOS 4.0 – 6.0
Microsoft Word for Macintosh 4.0 – 6.0, 98 – 2008
Microsoft Word for Windows 1.0 – 2010
Microsoft Word for Windows (text only via XML filter) 2003 XML
Microsoft Word for Windows 98-J
Microsoft WordPad  
Microsoft Works WP for DOS 2.0
Microsoft Works WP for Macintosh 2.0
Microsoft Works WP for Windows 3.0, 4.0
Microsoft Write for Windows 1.0 – 3.0
MultiMate Through 4.0
MultiMate Advantage 2.0
Navy DIF  
Nota Bene 3.0
Novell PerfectWorks Word Processor 2.0
OfficeWriter 4.0 – 6.0
OpenOffice Writer 1.1 – 3.0
Oracle Open Office Writer 3.x
PC File Doc 5.0
PFS: Write A, B
Professional Write for DOS 1.0, 2.0
Professional Write Plus for Windows 1.0
Q&A Write 2.0, 3.0
Samna Word IV 1.0 – 3.0
Samna Word IV+  
Samsung JungUm Global (File ID only)  
Signature 1.0
SmartWare II WP 1.02
Sprint 1.0
StarOffice Writer 5.2 – 9.0
Total Word 1.2
Wang IWP Through 2.6
WordMarc Composer  
WordMarc Composer+  
WordMarc Word Processor  
WordPerfect for DOS 4.2
WordPerfect for Macintosh 1.02 – 3.1
WordPerfect for Windows 5.1 – X4
WordStar 2000 for DOS 1.0 – 3.0
Wordstar for DOS 3.0 – 7.0
Wordstar for Windows 1.0
XyWrite Through III+