I have a printed Japanese intonation dictionary that I would like to convert into a machine-readable file format.

One difficulty that prevents simple OCR is the special intonation 'guides' printed above the readings. I've attached a scan of some pages from the dictionary as examples.

Here are the first four entries from the dictionary, including the marks printed above the text that indicate intonation.

アー (~言う, ~した) →61

アー (~は行かない, ~だこうだ) →76, 86a 【感】(~驚いた) →66

アークトー arc 灯 →15

 ───┐ ─┐
アーケード,アーケード arcade →9

And one more (from page 1) to highlight the use of handakuten (カ゚キ゚ク゚ケ゚コ゚):

アイキョーケ゚ン (間狂言) →15

Each entry in the dictionary has the following pattern:

1- one or more readings (with intonation above) separated by commas
2- a space
3- one or more word data sections

Word data is:

a- the word
b- descriptive data about the word

I would need the above turned into a file with three columns:

1- the reading with intonation
2- the intonation pattern (H: High, L: Low)
3- the word(s) only (no descriptive data)

For the five entries above, the result would be:

アー {LH} (~言う, ~した)
アー {HL} (~は行かない, ~だこうだ)
アークトー {LHHHH} arc 灯
アーケード,アーケード {HHHLL,HLLLL} arcade
アイキョーケ゚ン {HHHHLLL} (間狂言)
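
To make the second column concrete, here is a minimal Python sketch of how I imagine a pattern being built once the position of the line above a reading is known. It is only an illustration under my own assumptions: the kana under the line are High and all others Low, each kana character (including ー and small ョ) counts as one position, and a combining handakuten stays attached to the kana before it. The function names and the (start, end) span format are made up for this example.

```python
# Combining dakuten / handakuten, which attach to the preceding kana.
COMBINING_MARKS = {"\u3099", "\u309a"}


def kana_units(reading):
    """Split a reading into positions, keeping combining marks with their kana."""
    units = []
    for ch in reading:
        if ch in COMBINING_MARKS and units:
            units[-1] += ch
        else:
            units.append(ch)
    return units


def pitch_pattern(reading, high_start, high_end):
    """Return H for positions under the line [high_start, high_end), L otherwise."""
    count = len(kana_units(reading))
    return "".join("H" if high_start <= i < high_end else "L" for i in range(count))


def format_row(readings, spans, word):
    """Build one output line: readings, {patterns}, word(s)."""
    patterns = [pitch_pattern(r, start, end) for r, (start, end) in zip(readings, spans)]
    return f"{','.join(readings)} {{{','.join(patterns)}}} {word}"


if __name__ == "__main__":
    # The line covers the first three kana of the first reading and
    # only the first kana of the second reading.
    print(format_row(["アーケード", "アーケード"], [(0, 3), (0, 1)], "arcade"))
```

How the line positions would be read from the scan in the first place is exactly the part I cannot work out myself.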

I've tried to keep the above simple, but I'm sure I am missing some edge cases.

Please don't hesitate to ask if anything is not clear or you have questions.

I don't know much about OCR, so I am hoping you will be able to help me figure out the following:

1- Is what I am asking for possible?
2- How long will it take (or how much will it cost)?

I have two main questions for this project:

Firstly, I understand the basics of OCR, but I don't see how OCR can get the intonation information from the scans. Can you describe in a bit more detail how you would achieve this?

Secondly, I'm worried about extracting data for the third column.

From the samples you can see that there is a lot of extra information in the dictionary's third 'column'. For me it's garbage since I just want the word(s).

Will it be possible to extract the words from all that 'garbage', and if so how do you propose to do it?
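
As a rough idea of the kind of cleanup I mean, here is a small Python sketch that removes only the arrow references (such as →61 or →76, 86a) from the word data. The pattern is my guess based on the few entries shown above and the function name is made up; it does not handle markers like 【感】 or additional word data sections, which I am not sure how to treat.

```python
import re

# An arrow followed by one or more comma-separated numbers,
# each with an optional trailing letter (e.g. "→76, 86a").
REFERENCE = re.compile(r"\s*→\s*\d+[a-z]?(?:\s*,\s*\d+[a-z]?)*")


def strip_references(word_data):
    """Remove arrow references, leaving the rest of the word data untouched."""
    return REFERENCE.sub("", word_data).strip()


if __name__ == "__main__":
    print(strip_references("arc 灯 →15"))
    print(strip_references("(~言う, ~した) →61"))
```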

