
Thread: Looking for some OCR software

  1. #1
    Senior Member
    Join Date
    Aug 2001
    Location
    Melbourne, FL USA
    Posts
    1,498

    Looking for some OCR software

    My mother passed away recently, and it is surprising what kind of stuff you find after 60 years of accumulation.

    My dad had the ultimate Swiss Army knife: it had a stapler (we found the replacement staples also) and a tape measure. I guess it's the white-collar Swiss Army knife.

    Mom wrote a book that I never knew about; I would put it at circa 1962, in Alaska. It is hard to read, though, because there are 140-some pages, single spaced, with no paragraph breaks. I will attach a page that will fit.

    From a cursory read of three sentences, I take it to be a love story on the Eastern Front.

    What I need is a reliable software package or a service that will convert the scanned pages into a text document. Does anyone have experience or recommendations?
    Attached Images

  2. #2
    Wow! What a find!!

    There are a few ways to go about scanning a document like that in. Often, I find the older the document, the less I want it to go through my mechanical scanner (certainly no autofeeder and only glass page by page if I must).

    As phones have gotten smarter and cameras have gone digital, I have been amazed at the improvement in the quality and ease of scanning material.

    Our first goal will be to digitize the document and get the document into the computer.
    If you have a scanner with glass, you could go ahead and start there. A digital camera, or a smartphone with a memory card, will also work (and in my experience, work well). You may need to adjust the lighting or angle if using a camera, to ensure you get the most accurate text.

    If the photos are on a digital camera or on a memory card in your phone, you will need to download them to the computer.

    Now that the photos are on a computer you need OCR software. I have Adobe Acrobat from my college days and that works very well. I also have Abbyy FineReader (spelled correctly) and that also works very well, especially if there is some skew to the photos (Adobe does not have a deskewer; Abbyy does). If you are "ready," and have done some reading of the instructions, you may be able to work with a program quickly enough that you can get the job done within a free trial. Then, only purchase the program if you need it and liked it. You may also be able to "stack" the free trials of programs together, trying one and then the other, before you purchase.

    There may be free OCR software with different benefits. I believe the free screenreader NVDA may have a free OCR software program, but I have not tried it.

    OCR software will not be perfect; I find that it misinterprets a lot of ones and L's, for example, or zeros and O's, etc. Typed pages will do much better than handwritten pages. But, you will be able to get somewhat close, and then you can make the corrections manually.
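
    To speed up the manual pass, here is a minimal Python sketch (my own illustration, not a feature of any of the programs above) that lists words mixing letters and digits, which is where a one read as an L or a zero read as an O tends to show up. The function name flag_suspicious is made up for this example, and the check is only a heuristic: it will miss an O that lands inside an all-letter word.

```python
import re

# A token is "suspicious" if it contains at least one letter AND at least
# one digit, e.g. "l962" (a misread "1962") or "he1lo" (a misread "hello").
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def flag_suspicious(text):
    """Return (line_number, token) pairs worth checking by eye."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MIXED.finditer(line):
            hits.append((lineno, match.group()))
    return hits

if __name__ == "__main__":
    sample = "In l962 she wrote\nhe1lo w0rld 1962 fine"
    for lineno, token in flag_suspicious(sample):
        print(f"line {lineno}: {token}")
```

    Pure numbers and pure words are left alone, so the list stays short enough to check against the scan.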

    FYI: Whatever software you use, make sure to save (and/or export) the files to a format that is not proprietary. Adobe has many choices, including save as PDF or save as a Word document, etc., and that generally works well. (You can get Adobe Acrobat Reader for free.) However, in my experience, software like Abbyy saves in a proprietary format; the file must be exported to something like PDF, Word, etc. before you can view it in anything else. But, once exported, it shouldn't "expire."

    My condolences on the loss of your mother.

    Take care,
    Mystery

  3. #3
    Senior Member
    Join Date
    Aug 2001
    Location
    Melbourne, FL USA
    Posts
    1,498
    Thanks for your input.

    We scanned it in page by page on the glass by hand, and the pages came out like the example above. What format should I convert them to before trying to use OCR software on them?

  4. #4

    Tesseract

    It will need to be corrected by hand, as the original has smudges and corrections. I use Tesseract: https://en.wikipedia.org/wiki/Tesseract_(software)
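
    On the format question: as far as I know, the tesseract command line reads common image formats such as PNG and TIFF, but not PDF directly, so saving each scanned page as a PNG should be a safe choice. Below is a rough Python sketch for running it over a folder of pages. It assumes tesseract is installed and on your PATH; the folder names "scans" and "text" and both function names are made up for the example.

```python
import subprocess
from pathlib import Path

def tesseract_command(image, out_dir, lang="eng"):
    """Build the command line for one page.

    The CLI form is: tesseract <image> <output base> -l <language>
    tesseract itself appends ".txt" to the output base.
    """
    image = Path(image)
    out_base = Path(out_dir) / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]

def ocr_pages(scan_dir, out_dir, pattern="*.png"):
    """Run tesseract on every matching page, in filename order."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(scan_dir).glob(pattern)):
        # check=True stops the loop if tesseract reports an error
        subprocess.run(tesseract_command(image, out_dir), check=True)

if __name__ == "__main__":
    ocr_pages("scans", "text")  # hypothetical folder names
```

    Naming the scans page_001.png, page_002.png and so on keeps the text files in page order.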

    Sample tesseract output:
    Code:
    ' Don't break it,"
    
    
    20
    They were only quiet, when the sun was hiding behind the silhouettes of the
    
    
    forest. He looked at the bluegray sky, and was reminded of Natascha's eyes
    I t became so still, that he thought, he could hear his own thinking. With
    the first bright beams of the sun, the wild-geese were interrupting the
    dead stillnes with their piercing cries. But they weren' the first. Every
    once in a while, he would hear the slamming of a barrack door, shuffling
    steps, than splashing...
    For moments reverence and nausea were fighting.
    nightwalker was emptying, he returned into reality.
    room, wishing for sleep.
    This year the summer was unusually warm.
    not remember one like it.
    Natascha was wearing a summerdress. The uniform was a bother. Often she
    went for walks through the woods. She carried an expression of melancholy
    around her eyes. An expression that belonged to women that gave themselves
    to men without love.
    One day she found some flowers, which blooming time had passed, long ago.
    She saw them under some willow brush, on the steep bank of the river, there,
    where the sun only touched them in the late afternoon. She knelt down, but
    didn't pick any of the little bluebells. At home, in a bbok about internal
    medizine, she had another blossom like it. After the visit in the next
    camp, she had not thrown the flower away, after it had wielded on the lapel
    of her suitjacket. She kept it as a remembrance to the prisoner with the
    dark eyes, and the unplanned touching of their hands. Secretly she connec
    ted an unfullfilled yearning from her girlhood with the flower. Twice
    Rubanow had found her asleep over the book, the bluebell in her hand.
    
    
     What are you doing with that dried up junk?" He had asked the last time,
    reaching clumsily for the stem.
    she cried out, hiding the dried flower between the pages
    of her book, pressing it against her breasts. Her eyes pierced him like
    daggers. That night nothing could keep him away. He took her by force.
    
    
    She bore it without a sound.
    
    
    'After completion of duty, everybody has the right to choose their new
    place of employment, in any state of the CCCP. If the person is married,
    
    
    he contract of the spouse becomes nullified.l
    
    
    This paragraph from her contract, Natascha knew by heart. She repeated it
    to herSElf, whenever the price to get away from the camp and prisoners, s
    seemed too high.
    
    
    One more year!"her hands closing themselves around the leaves of the flow
    ers and ripping them out of the ground by their roots. She smelled the
    fresh earth and came back to reality. "This I didn't want to do to you,"
    She said to the bluebells. She gathered the flowers and made a bouquette,
    But didn't take it to hefcabin. She sat down on the sandy bank, and threw
    flowe I fter flower into the river. The current went slowly. When a flower
    hit t . Water, it disturbedi ti it, and seemed to make the pebbles on the
    bottonfcome alive. It seemed as if they were jumping. The blossom floated
    and Natascha looked after every one of them. Doing this, she remembered
    something....
    
    
    The night before the revolution celebration,
    
    
    Tamara and Alexander!
    
    
    The evening in the club.
    
    
    And than....
    
    
    Natascha threw a bluebell into the water.
    The next day she didn't leave her room.
    her door.
    
    
    Why aren't you at the club?"
    
    
    "I don't feel like it."
    
    
    Let's go and dance!"
    "No I ll
    
    
    titans? gtdhlk "
    
    
    While the bladder of the
    He went back to his
    
    
    Old people from around there could
    
    
    In the evening Rubanow knouked at

  5. #5
    Senior Member
    Join Date
    Aug 2001
    Location
    Melbourne, FL USA
    Posts
    1,498
    Thank you for your reply. I found someone to type the last version into the computer for $100. That was the easiest way to get it into a Word document without confronting the OCR software, physical scanning, error correction, etc.

    I think it is a sweet deal. Unfortunately, the last page may be unreadable, so we may not know the ending.

  6. #6

    The Digital Dark Age has been upon us since the 80s

    Microsoft Word is a volatile and self-inconsistent format. The reason is that, to get ahead at Microsoft, the Office team has to perform better than other teams, including the Mac Business Unit. They booby-trap their code and documentation so no other team can benefit from it. Then there are frequent staff changes between versions, so old files are not readable by later versions of the software.

    I found this out when I recovered old files from tape. Then, the night before the court case, I was told the data was unreadable. The user no longer had a 486 with Windows for Workgroups 3.11 and Microsoft Office 4.2. We did not have a single 486 machine, so I had to use Bochs to faithfully emulate a 486 in pure software and run our installer ISO image on it. The Apple LaserWriter driver (the only printer driver we ever used) produced PostScript to a virtual port, which was captured as a file. I then sent that file to the printer and made a PDF from it. I burnt the original, the PostScript and the PDF to a CD, and hand-delivered the printout and the CD to the credit manager on the morning of the trial, with a note describing the environment necessary to open the Microsoft Office file.

    I did not appreciate an all-nighter doing digital archaeology due to obfuscated data.

    I also found out at Standards Australia that Microsoft Office for Mac is completely incompatible with Microsoft Office for Windows. I now use LibreOffice and iWork, both of which are the product of reverse engineering of actual Microsoft Office for Windows files.

    Be advised to export the obfuscated MS Word data to plaintext and PostScript if you wish to be able to read it in the future. Note that Microsoft do not use standard US-ASCII or UTF-8. They use their own byte-swapped UTF-16 with code substitutions for a number of common punctuation marks. They also do not include paragraph breaks in text output: each paragraph appears on a single very long line, with no blank lines between. I have a C program to remap punctuation to US-ASCII and an AWK script to convert Microsoft Text into plaintext. I.e., Microsoft Text is also obfuscated.
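
    For anyone without C or AWK to hand, the same idea can be sketched in a few lines of Python. This is not my C program or AWK script, only an illustration of the three steps: decode the UTF-16, remap punctuation to US-ASCII, and re-wrap each long line as a paragraph followed by a blank line. The name to_plaintext and the contents of PUNCT_MAP are assumptions made for the example; extend the table with whatever your own files contain.

```python
import textwrap

# Assumed table of common substitutions; not an exhaustive list.
PUNCT_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "--",  # en dash, em dash
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
}

def to_plaintext(raw, width=72):
    """Convert UTF-16 bytes (one paragraph per line) to wrapped plaintext."""
    text = raw.decode("utf-16")  # uses the byte order mark if one is present
    text = text.translate(str.maketrans(PUNCT_MAP))
    # Every non-empty input line is one paragraph; drop the empty ones.
    paragraphs = [line.strip() for line in text.splitlines() if line.strip()]
    # Wrap each paragraph and separate paragraphs with a blank line.
    return "\n\n".join(textwrap.fill(p, width=width) for p in paragraphs) + "\n"

if __name__ == "__main__":
    raw = "\u201cHello,\u201d she said\u2026\r\nSecond paragraph\u2019s text.\r\n".encode("utf-16")
    print(to_plaintext(raw), end="")
```

    Characters missing from the table pass through unchanged, so check the result for anything still outside US-ASCII before archiving it.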

    Computers used to be used by people who knew what they were doing and who wanted to own their own data. That meant that data was always plaintext, and if you could read the tape you could always read the data. That all changed in the 80s, when software vendors wanted to make users dependent on their software. That meant that your data would now only last as long as that volatile software.

    Users are users in the same sense as junkies.

    The advantage of the OCR software is that it can produce plaintext and if you edit the plaintext in a plaintext editor you will still have plaintext which can be read in the future.

    http://www.planix.com/~woods/ms-word.sucks.html
    http://www.cv.nrao.edu/~pmurphy/doc-...e.shtml#docrtf
    Last edited by zagam; 09-12-2016 at 02:52 AM. Reason: A simple truth
