A PDF converter converts a PDF file into another file format,
such as Word, Excel, PowerPoint, plain text, HTML, or an image.
It must understand both the PDF document structure and the
target file format structure. For instance, a PDF to Word
converter must know PDF objects and the Word file structure.
There is no one-to-one mapping between PDF objects (text streams,
images, shapes, etc.) and Word document elements, so a PDF converter
has to create compatible Word document elements for each PDF object.
This process is further complicated because PDF object attributes
differ between PDF versions.
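
The lack of a one-to-one mapping can be illustrated with a minimal
sketch in Python. It assumes a hypothetical TextRun record (not a real
PDF library type) holding one positioned text fragment, and shows how
several fragments have to be merged into a single line of text before
a Word paragraph could be created from them. The tolerance value is
also an assumption chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical stand-in for one positioned text fragment on a PDF page."""
    x: float   # horizontal position of the fragment
    y: float   # baseline position; larger values are higher on the page
    text: str


def runs_to_lines(runs, y_tolerance=2.0):
    """Group fragments with nearly equal baselines into lines of text."""
    lines = []  # each entry is (baseline_y, [runs on that line])
    for run in sorted(runs, key=lambda r: -r.y):  # top of page first
        if lines and abs(lines[-1][0] - run.y) <= y_tolerance:
            lines[-1][1].append(run)
        else:
            lines.append((run.y, [run]))
    # Within each line, order fragments left to right and join them.
    return [
        " ".join(r.text for r in sorted(group, key=lambda r: r.x))
        for _, group in lines
    ]


runs = [
    TextRun(72, 700, "PDF"),
    TextRun(140, 700.5, "Word"),
    TextRun(100, 700, "to"),
    TextRun(72, 686, "Converter"),
]
print(runs_to_lines(runs))  # ['PDF to Word', 'Converter']
```

Here four separate fragments become two lines, so the converter cannot
simply emit one Word element per PDF object.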
There are two types of PDF Converters:
- OCR-Not-enabled
- OCR-enabled
Most PDF converters (including GIRDAC PDF to Word Converter
and PDF to Word Converter Pro)
belong to the first category. Very few belong to the
second category, and they are very costly. OCR (Optical Character
Recognition) is a technique for recognizing characters based on
the arrangement of pixels. Every image is a set of ordered pixels
(picture elements). Each pixel has a number that specifies
the color to display. Each character (letter, digit, etc.)
is a combination of ordered pixels.
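
As a small illustration of a character being a grid of pixel values,
the Python sketch below uses a made-up 5x5 grayscale image of the
letter "T" and reduces it to ink and background pixels. The pixel
values and the threshold of 128 are assumptions for the example, not
values taken from any particular OCR product.

```python
# Made-up 5x5 grayscale image of "T": 0 is black ink, 255 is white paper.
GRAY_T = [
    [10, 20, 15, 12, 18],
    [250, 240, 25, 245, 255],
    [255, 250, 30, 248, 252],
    [251, 255, 22, 249, 250],
    [253, 247, 19, 255, 251],
]


def binarize(image, threshold=128):
    """Return a grid of 1 (ink) and 0 (background) from grayscale values."""
    return [[1 if value < threshold else 0 for value in row] for row in image]


def render(grid):
    """Draw ink pixels as '#' and background pixels as '.'."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)


binary_t = binarize(GRAY_T)
print(render(binary_t))
# #####
# ..#..
# ..#..
# ..#..
# ..#..
```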
OCR software converts hand-written or typewritten text documents
into machine-editable text formats. Earlier versions of OCR were
trained to recognize specific fonts. Current OCR software is intelligent
enough to recognize most fonts with high accuracy.
Some OCR software can convert the image into a formatted version that
matches the original image. OCR uses algorithms to recognize characters
and neural networks to increase the accuracy.
Two methods are employed in OCR software:
- Matrix matching
- Feature extraction
Matrix matching is simpler than feature extraction.
Matrix matching compares each character image with a library
of character matrices. When an image matches one of
the pixel matrices, the software labels that image as
the corresponding character.
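
Matrix matching can be sketched in a few lines of Python. The
three-character template library, the 3x5 grid size, and the scoring
rule (fraction of agreeing pixels) are all assumptions made for this
illustration; real OCR software uses far larger libraries and more
careful scoring.

```python
# Hypothetical template library: each character is a 3x5 pixel matrix,
# with '#' for ink and '.' for background.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph, template):
    """Fraction of pixels that agree between two equally sized grids."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def recognize(glyph, templates=TEMPLATES):
    """Return the best matching character and its score."""
    best = max(templates, key=lambda ch: match_score(glyph, templates[ch]))
    return best, match_score(glyph, templates[best])


# A "T" with one stray ink pixel, as might come from a noisy scan.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
character, score = recognize(noisy_t)
print(character, round(score, 3))  # T 0.933
```

The noisy glyph differs from the "T" template in one pixel out of 15,
so it still scores highest against "T".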
Feature extraction uses artificial intelligence to analyze
features such as closed shapes, diagonal lines, line
intersections, etc. This method is flexible and is used
for both typewritten and hand-written documents.
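
One of the features named above, closed shapes, can be computed
without any template. The Python sketch below counts the background
regions that are fully enclosed by ink, using a flood fill. It is only
an illustration of a single feature under simple assumptions (tiny
hand-made glyphs, 4-connected background); it is not a complete
recognizer.

```python
from collections import deque


def count_holes(glyph):
    """Count background regions that are fully enclosed by ink ('#')."""
    # Pad with a background border so the outside is one connected region.
    width = len(glyph[0]) + 2
    grid = ["." * width] + ["." + row + "." for row in glyph] + ["." * width]
    height = len(grid)
    seen = [[False] * width for _ in range(height)]

    def flood(start_r, start_c):
        """Mark every background pixel reachable from the start pixel."""
        queue = deque([(start_r, start_c)])
        seen[start_r][start_c] = True
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and not seen[nr][nc] and grid[nr][nc] == ".":
                    seen[nr][nc] = True
                    queue.append((nr, nc))

    flood(0, 0)  # mark the outside background first
    holes = 0
    for r in range(height):
        for c in range(width):
            if grid[r][c] == "." and not seen[r][c]:
                holes += 1  # unreached background must be enclosed by ink
                flood(r, c)
    return holes


GLYPH_O = ["###", "#.#", "#.#", "#.#", "###"]
GLYPH_8 = ["###", "#.#", "###", "#.#", "###"]
GLYPH_L = ["#..", "#..", "#..", "#..", "###"]
print(count_holes(GLYPH_O), count_holes(GLYPH_8), count_holes(GLYPH_L))  # 1 2 0
```

Because the count does not depend on exact pixel positions, the same
feature holds for differently shaped versions of a character, which is
why this approach suits hand-written text.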
Some PDF documents have text on images. Scanning hand-written
or typed text usually results in text on an image. Such text
can be extracted as text by OCR-enabled PDF converters.
GIRDAC PDF to Word Converter and PDF to Word Converter Pro
extract such text as an image, not as text.
GIRDAC PDF to Word Converter and PDF to Word Converter Pro
do not convert PDF documents that have the security setting:
Content Copying: Not Allowed
or
Page Extraction: Not Allowed
One can see this information in the Adobe Reader top-level menu
by choosing File -> Document Properties and clicking on the
Security tab.
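
For readers who want to check the copying restriction programmatically,
the Python sketch below decodes it from the permission flags of an
encrypted PDF. It assumes the flags have already been read as the
signed integer stored in the /P entry of the document's encryption
dictionary, and it relies on the PDF specification's definition of
bit 5 (value 16) as permission to copy or extract text and graphics.
The function name is made up for this example, and the "Page
Extraction" setting is not decoded here.

```python
# Bit 5 of the permission flags (value 16): copy or extract text and graphics.
COPY_CONTENT_BIT = 1 << 4


def content_copying_allowed(p_value):
    """Check the copy bit of a /P permission value (signed 32-bit integer)."""
    # Python's & on negative integers behaves like two's complement,
    # so the stored signed value can be tested directly.
    return bool(p_value & COPY_CONTENT_BIT)


print(content_copying_allowed(-4))     # True: all permission bits set
print(content_copying_allowed(-3904))  # False: all permission bits cleared
```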