I am going to write this so that anyone new to the project can understand everything.
What is my Project?
The title is "Improving information retrieval methods for OCR data sets consisting of Indic scripts". It means I have to build a search engine which is capable of searching over erroneous data. Now what does that mean? Before explaining this, I first want to explain what a search engine means today.
Search engine: a system capable of finding documents related to the information need of a user, which the user expresses as a search query. It may be possible that what the user needs is not directly stated in any of the documents available to the search engine; it may be hidden in them. So the search system should also be intelligent, so that it can satisfy the user's needs.
What is different about the search system I am going to build is the data available to it, the data over which I will search for the user's information need. The data is erroneous, meaning the term the user is searching for may not directly exist in the data. "Directly" in the sense that no term in the data completely matches the user's term. For example, suppose the user searches for "surreal places Alaska" and no word in the corpus completely matches surreal, places, or Alaska, but there are words like sral, sueal, sural, plces, paces, alask, alas, and ask. So I have to match each query term against the closely related words in the corpus. This adds another level of intelligence to the search system, and this intelligence is a bit complex: in everyday language we use words like intuition or feeling for this kind of judgment. For example, according to my intuition, alask is an erroneous form of alaska. So I have to give this intuition capability to my search system, and then order the result documents by decreasing intuition.
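To give a flavour of this kind of fuzzy matching, here is a minimal sketch (not my actual scheme) using Python's standard difflib to find close matches for the query terms of the example above; the vocabulary is the made-up word list from the example:

```python
import difflib

# Hypothetical tiny corpus vocabulary with OCR errors (the made-up words
# from the "surreal places Alaska" example).
vocab = ["sral", "sueal", "sural", "plces", "paces", "alask", "alas", "ask"]

def close_matches(term, vocabulary, cutoff=0.6):
    """Return corpus words similar to a query term, best match first."""
    return difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff)

for term in ["surreal", "places", "alaska"]:
    print(term, "->", close_matches(term, vocab))
```

The cutoff plays the same role as a matching-probability threshold: below it, a corpus word is not even considered a candidate.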
This project is somewhat different from other GSoC projects in that it starts from scratch; there is no prior work on it. So the toughest thing I have to do is take many decisions about the design of the system: which libraries I am going to use, and how I am going to add the intuition capability to the search system. It was pretty clear from my first talk with my mentor that he wants me to take all these decisions myself, so that I learn more, research more before deciding, and understand the importance of a right decision by experiencing the problems caused by a wrong one. Taking decisions is tough for me because simply following whatever decision is standard in the community doesn't satisfy me; I am only satisfied that a decision is good after experiencing why the others are bad. Taking decisions brings a lot of responsibility with it. In the end, I am sure this will be a lifetime learning experience for me.
What I have planned:
I am not going to explain my plan in detail here, else I would be rewriting my proposal. For the complete proposal you can go here (for a more readable version, please go here).
In brief, my plan was:
First, I have to create some schemes that will help me match the query terms with the erroneous data. These are the schemes I proposed:
- PLCS: a probabilistic version of Longest Common Subsequence which returns the matching probability for the possible LCSs.
- PLD: a probabilistic version of Levenshtein Distance.
- Suffix tree: a probabilistic version of a suffix tree, to do the matching very efficiently.
- Stemming: a simple stemmer.
Then I am going to merge the results of all these schemes in a weighted manner to get the final results, arranged in weighted order of their matching score.
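The weighted merge can be sketched as follows; the scheme names, per-scheme scores, and weights below are all made up purely for illustration:

```python
# Hypothetical per-scheme matching scores for candidate corpus words
# against one query term; names, numbers, and weights are illustrative.
scheme_scores = {
    "plcs": {"alask": 0.90, "alas": 0.70},
    "pld":  {"alask": 0.85, "ask": 0.40},
    "stem": {"alask": 0.80},
}
weights = {"plcs": 0.5, "pld": 0.3, "stem": 0.2}

def merge(scheme_scores, weights):
    """Weighted sum of scheme scores per candidate, sorted best first."""
    combined = {}
    for scheme, scores in scheme_scores.items():
        w = weights[scheme]
        for word, s in scores.items():
            combined[word] = combined.get(word, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

print(merge(scheme_scores, weights))
```

The weights themselves would be tuned empirically once the full system is running.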
One of the most important tasks of this project is to create the Content-Based Probabilistic Character Error Model (CBPCEM). This model represents the probability that each character is misinterpreted as every other character, or as nothing. This misinterpretation happens at OCR time. I will construct this model from the corpus itself.
The other thing I have to consider is doing everything in a language-independent way, because my system should support all Indic scripts. This adds another layer of abstraction to the system.
Work Timeline:
Data:
Initially I started with the task of getting the data set. There is a data set on the web made for this kind of project by the FIRE community for their RISOT-2011 task, but it was not publicly available, and the procedure to get it is too long. Here the recommendation of Prasenjit sir helped me. Utpal Garain sir is one of the core members of RISOT, and Prasenjit sir told me to contact him directly on his behalf to get access to the data. Utpal sir responded fast and helped me get access to the RISOT data.
Data Processing:
Now that I had the data, the first thing I did was collect some simple stats about it. The data set consists of two versions of the same data: the original data without any errors (the base data), and the OCRed data with OCR errors (the erroneous version of the base data). While collecting the stats I found some inconsistency in the files present in the two versions. So my first task was to find such files and resolve the problem, logging all the information. You can find some of the corpus stats here, stats about the inconsistency between the two versions here, and the script which resolves the inconsistency by deleting files here.
Content Based Probabilistic Character Error Model:
Now my task is to make the CBPCEM, but the data set contains a lot of meta information which will be used while indexing the data but is of no use right now. So I wrote scripts to clean the data, which you can find here.
I have to write the code to build the CBPCEM, and the simplest way to do so is to first map the two versions of the data and then count what maps to what. This sounds trivial, but it is one of the most challenging parts of this project, because the two versions of the data differ; it may even be that there is no common map point between them at all. To get an intuition of the difference between the two versions, I wrote a script which compares them file by file. You can find the script here, and the stats for the difference between the two versions here. The difference between the two versions is pretty clear from them.
So I wrote code based on Levenshtein distance, a word window size, and some other parameters, which maps the words between the two versions of the data. This code may not find all the mappings, but it finds enough to construct the CBPCEM, and the number of mappings returned can be maximized empirically by tweaking the parameters.
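A simplified sketch of this word-mapping idea, assuming plain Levenshtein distance and a symmetric window around each word position (my actual script has more parameters and heuristics):

```python
def levenshtein(a, b):
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def map_words(clean, ocr, window=2, max_dist=2):
    """For each clean word, pick the closest OCR word within a window
    around the same position; drop pairs that are still too far apart."""
    pairs = []
    for i, w in enumerate(clean):
        lo, hi = max(0, i - window), min(len(ocr), i + window + 1)
        candidates = [(levenshtein(w, o), o) for o in ocr[lo:hi]]
        if candidates:
            d, best = min(candidates)
            if d <= max_dist:
                pairs.append((w, best))
    return pairs

clean = ["surreal", "places", "in", "alaska"]
ocr   = ["sural", "plces", "in", "alask"]
print(map_words(clean, ocr))
```

The window keeps the matching local (so a word is only compared against OCR words near its own position), and `max_dist` rejects pairs that are too dissimilar to be the same word.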
Now suppose I have the words matched between the two versions of the data. The main problem starts now. The two words may be of different lengths, so how am I going to find which letter maps to which? For this I wrote code consisting of some manual case rejections and recursion to find every possible mapping between the two words and return the best among them. The ideal way to do this is to use Dynamic Programming (DP) to check all possible cases, with memoization to make it faster. But writing out all the possible cases (in my mind) led me to reject many of them without even checking, because they can't occur. You can find the complete code here. This code is very precious to me (just like in LOTR).
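The DP alternative mentioned above can be sketched as a standard edit-distance table plus a backtrace over one optimal path; this is an illustration of the idea, not my actual code:

```python
def align(orig, ocr):
    """Character-level alignment of a clean word with its OCR form via
    the standard edit-distance DP, backtracing one optimal path.
    Returns (orig_char, ocr_char) pairs; None marks 'nothing'."""
    n, m = len(orig), len(ocr)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (orig[i - 1] != ocr[j - 1]))
    # Backtrace one optimal path from the bottom-right corner.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (orig[i - 1] != ocr[j - 1]):
            pairs.append((orig[i - 1], ocr[j - 1])); i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((orig[i - 1], None)); i -= 1
        else:
            pairs.append((None, ocr[j - 1])); j -= 1
    return pairs[::-1]

print(align("alaska", "alask"))
```

Counting the `(orig_char, ocr_char)` pairs over all matched word pairs is exactly the raw material for the character error model.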
This code generates a Content-Based Statistical Character Error Model, i.e., a model which represents the number of times each character is misinterpreted as every other character or as nothing. Then I had to write the code to convert this model into a probabilistic one. The script for that can be found here. The Content-Based Probabilistic Character Error Model for Bengali can be found here.
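The count-to-probability conversion amounts to normalizing each character's row so it sums to 1; a toy sketch with made-up counts:

```python
def normalize(counts):
    """Turn per-character confusion counts into probabilities so that
    each source character's row sums to 1 (the CBPCEM form)."""
    model = {}
    for src, row in counts.items():
        total = sum(row.values())
        model[src] = {tgt: c / total for tgt, c in row.items()}
    return model

# Toy counts: how often a clean character was read as each OCR character
# ("n" stands for nothing, as in the real model below).
counts = {"a": {"a": 90, "o": 8, "n": 2}}
print(normalize(counts))
```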
Here is an example of what the CBPCEM looks like (for two letters; n stands for nothing):
"ম": {
"n": 0.022254117995318203,
"ং": 0.0009977707581110258,
"অ": 0.01333585210401744,
"ই": 0.0023389512306272128,
"উ": 0.0032614185352958973,
"এ": 0.0023746554542421164,
"ঐ": 6.426760250682599e-05,
"ও": 0.002362321267902422,
"ক": 0.01837599014303761,
"খ": 0.0034892763987291893,
"গ": 0.007441409369047944,
"ঘ": 0.0014502406464671645,
"ঙ": 0.0014119397520439044,
"চ": 0.0045597539394742,
"ছ": 0.003800227728029893,
"জ": 0.009255833096387122,
"ঝ": 0.00042585401257048337,
"ঞ": 0.00042260817406003764,
"ট": 0.004620126535768492,
"ঠ": 0.000912729789137347,
"ড": 0.002061107454133056,
"ঢ": 0.00026031624853774975,
"ণ": 0.0010529500127886037,
"ত": 0.010406158264489098,
"থ": 0.005606212275241913,
"দ": 0.009447337568503422,
"ধ": 0.00234868874615855,
"ন": 0.016321374365925446,
"প": 0.010843697295697187,
"ফ": 0.0019117988826525512,
"ব": 0.017607375583764056,
"ভ": 0.004161164970391461,
"ম": 0.6847797438903581,
"য": 0.003724924274587551,
"র": 0.02017678134863292,
"ল": 0.00929543232621456,
"শ": 0.005808103430591638,
"ষ": 0.019191344776861585,
"স": 0.013529304079240006,
"হ": 0.00619695488414304,
"া": 0.013499442364943905,
"ি": 0.006618913890500989,
"ী": 0.0007342086710628304,
"ু": 0.0006796785840873416,
"ৃ": 0.0001350268820345435,
"ে": 0.010532745966396483,
"ৈ": 0.00024084121747507521,
"্": 0.011382506488431183,
"ৗ": 0.0003278296895550215,
"ড়": 0.0018007912055953063,
"য়": 0.006161899828230226
},
"n": {
"ং": 0.0072325395044072075,
"অ": 0.014854986795622836,
"ই": 0.0053889755072925255,
"উ": 0.005008565474265337,
"এ": 0.0059438442808275845,
"ঐ": 0.00011972168582130326,
"ও": 0.005878609708803868,
"ক": 0.025695173091563845,
"খ": 0.0059325969408234955,
"গ": 0.011551018184199437,
"ঘ": 0.0017663322624199378,
"ঙ": 0.002333198198626025,
"চ": 0.008203809798982541,
"ছ": 0.007292275376873369,
"জ": 0.014174397754930958,
"ঝ": 0.0006296010993400061,
"ঞ": 0.0017415881144109417,
"ট": 0.009287303552709784,
"ঠ": 0.0024619177564505993,
"ড": 0.003955064627215663,
"ঢ": 0.00038790825969658174,
"ণ": 0.0030660248851146705,
"ত": 0.021603640738965233,
"থ": 0.008017103954914662,
"দ": 0.017079710648431647,
"ধ": 0.005544188799348954,
"ন": 0.02907412396968117,
"প": 0.015717782744380952,
"ফ": 0.0030697739984493666,
"ব": 0.028846177878931633,
"ভ": 0.008757178927183721,
"ম": 0.02554645826262089,
"য": 0.014718019188461929,
"র": 0.0474590259403651,
"ল": 0.021910318209743394,
"শ": 0.01155926623353577,
"ষ": 0.009292802252267338,
"স": 0.02419477793501837,
"হ": 0.00824430022299726,
"া": 0.2104794716049654,
"ি": 0.09217819985573412,
"ী": 0.00904561071306636,
"ু": 0.008577971309785236,
"ূ": 1.749586222858294e-06,
"ৃ": 0.002059512925193192,
"ে": 0.13432273267372266,
"ৈ": 0.00205126487585686,
"্": 0.07317044518971264,
"ৗ": 0.006015077434186815,
"ড়": 0.0033397101585475033,
"য়": 0.015218150907310421
},
Probabilistic Longest Common Sub-sequence (PLCS):
This task involves writing a probabilistic version of the LCS code. What do I mean by probabilistic? A simple LCS tries to find the longest common subsequence, but in my case one of the two words being compared may have, say, one erroneous letter, so normal LCS would return a string one shorter than it would have if there were no error. PLCS works like this: if two letters do not match, PLCS consults the CBPCEM with a pre-specified threshold. For the letter in the data (not in the query term, which I am assuming to be correct), PLCS tries all the misinterpretations that have probability greater than the threshold. If a letter then matches, it is taken as a match, with the probability of the match being the probability of the misinterpretation. The overall probability of the match for the whole word is the product of the match probabilities of all the letters in the returned common subsequence.
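A minimal sketch of the PLCS idea, with a made-up toy confusion model and memoized recursion. Unlike my actual code, it returns only one best (length, probability) pair rather than all same-length subsequences, and it scores exact matches by the model's own self-confusion probability:

```python
import functools

# Toy confusion model: P(correct char -> OCR char); numbers are made up.
MODEL = {"a": {"a": 0.9, "o": 0.08},
         "l": {"l": 0.95, "i": 0.04},
         "s": {"s": 0.9, "z": 0.05},
         "k": {"k": 0.97}}

def plcs(query, data, threshold=0.03, model=MODEL):
    """Probabilistic LCS: letters 'match' through the confusion model
    when P(query_char misread as data_char) >= threshold.
    Returns (lcs_length, probability) for one best alignment."""
    @functools.lru_cache(maxsize=None)
    def go(i, j):
        if i == len(query) or j == len(data):
            return (0, 1.0)
        best = max(go(i + 1, j), go(i, j + 1))   # skip a letter
        p = model.get(query[i], {}).get(data[j], 0.0)
        if p >= threshold:
            ln, pr = go(i + 1, j + 1)
            best = max(best, (ln + 1, pr * p))   # model-backed match
        return best
    return go(0, 0)

print(plcs("alaska", "aliska"))
```

Because the tuples compare length first and probability second, `max` prefers the longest subsequence and, among equal lengths, the most probable one.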
Along with this, there can be one more issue: for two words, multiple common subsequences of the same length may exist. So I have to consider them all and return all of them, each with its own matching probability; these same-length subsequences may well have different probabilities. How I am doing this is explained here.
Also, right now I am simply multiplying the probabilities of all the matched letters. But with multiple same-length common subsequences, it may happen that in one of them the starting letters of one word map to the ending letters of the other. The number of matches is the same, but this is wrong. So I should also bring positional dependency into how I assign a probability to a matching. I haven't done it yet, but I will after making a simple, basic, complete search system first.
You can find the code for PLCS here. It returns all the possible PLCSs for two words, along with the LCS size, mapping details, and matching probability.
Probabilistic Levenshtein Distance (PLD):
Just as I explained the use of the CBPCEM in making PLCS, I am using it here to make LD probabilistic. In the same way, PLD could also return a matching probability, but right now I am not returning it, because the PLD probability function is going to be quite complex. The distance we return is the number of mismatches, and now I have a probability for the matched part, a probability for the unmatched part, and the distance, which comes from the unmatched part. So the final probability would be a combination of these three things, which I still have to derive. I will do it after making the search system first.
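A sketch of the PLD idea under one simplification: a substitution that the confusion model considers plausible costs nothing, while an implausible one costs one, as in plain Levenshtein distance. The toy model and threshold are made up, and my actual code differs (the probability part is still to be derived):

```python
# Toy confusion model: P(correct char -> OCR char); numbers are made up.
MODEL = {"a": {"a": 0.9, "o": 0.08},
         "l": {"l": 0.95, "i": 0.04},
         "s": {"s": 0.9, "z": 0.05},
         "k": {"k": 0.97}}

def pld(query, data, threshold=0.03, model=MODEL):
    """Probabilistic Levenshtein distance: a substitution is free when
    the model says query_char is plausibly misread as data_char
    (probability >= threshold), and costs one otherwise."""
    n, m = len(query), len(data)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            plausible = model.get(query[i - 1], {}).get(data[j - 1], 0.0) >= threshold
            sub = 0 if plausible else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[n][m]

# Plain Levenshtein would give 1 here, but "a" misread as "o" is
# plausible in the toy model, so PLD gives 0.
print(pld("alaska", "olaska"))
```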
In the case of PLD there are no multiple cases as in PLCS, so no worry there, and there is also no problem of position-dependent matching, since LD considers the whole string rather than its subsequences.
You can find the code for PLD here. It takes two strings, one original and one erroneous, uses the CBPCEM for the erroneous one, and, for now, just returns the PLD.
Suffix Trees:
I have just studied suffix trees and am already impressed by them. I am planning to use them, but only after making the search system. More about suffix trees is here.
Deciding IR Library:
My next task was to decide which search library I am going to use. The ironic part about this task is that I initially assumed it to be a simple, even trivial, decision. But it turned out to be the toughest decision I have had to take, and it took me about 10-12 days just to check some of the search libraries. This really showed me how tough a "trivial" decision can sometimes be: not because it is hard in itself, but because there are many options and the whole future of the project depends on it, so you want to select the best and most suitable one. In more than 10 days I was only able to compare some of these libraries and manually test just two of them: Lucene and Terrier. My complete research on this can be found here.
Finally I reduced my options to two: Terrier and Lucene. Selecting between the two was the toughest part. I selected Lucene because I found its code better organised and because Lucene is the de facto IR library in industry. Lucene's code is more complex than Terrier's, and Terrier supports the TREC format and evaluation methods by default while for Lucene I have to write them manually. Still, I selected Lucene.
Modifying Lucene:
Now I have to modify the Lucene code to include my schemes in it. I started understanding the code but got stuck at one place. I tried again and again with no success, so I finally asked for help on Stack Overflow and on the Lucene-dev IRC and mailing list. Initially I got no response, but then I heard from Mike McCandless; our discussion so far is posted here. Right now I am not responding to him, because I am planning to read the book Lucene in Action this weekend (which is a huge amount of work) before contacting him again. So for now I am stuck here.
Collecting Search Evaluations:
This involves getting the evaluation of the Lucene search system on the original data and then on the OCRed data. After that I will get the evaluation of the modified Lucene on the OCRed data. This way I will have a three-point comparison for my search system. Right now I have indexed the original data and the OCRed data with the normal Lucene system. Doing this is somewhat confusing because I am using PyLucene, and it is the first time I am using a language port. I don't fully know how it works, so I am a bit confused about where to use Python syntax and classes and where to use Java syntax and classes. The code for the indexing can be found here (and the helper code for the indexer here).
I also wrote the code for searching, but I still have to write the code to parse the TREC topics format. The code for the search files can be found here.
Right now I have paused this because I want to read the Lucene in Action book first, so that I know everything about Lucene. This is my weakness: if I am doing something and I don't fully understand it, I feel unsatisfied. It may be that something already exists in Lucene to handle the TREC format.
Expected Schedule vs Reality:
According to the schedule in my proposal, I should have completed the search system by now, but I am stuck there, as explained above. It may take more than one week to set up the complete system. This happened because there are many things mentioned above that I hadn't thought about while writing the schedule in the proposal. I think I can recover, since the proposal mentions nothing after this point except tweaking the system. Everything else is up to my expectations if we only look at the end points, not the complete path, because the path has changed completely compared to what I wrote in the proposal. For example, I placed the CBPCEM after PLCS and PLD, but I did the opposite. While writing the schedule I missed many details in a single line, and I have been able to recover from that so far, except for the search system, because selecting the search library took more than 10 days, which I had estimated at zero.
Results Till Now:
Except for the CBPCEM (which can be found here), there are no other relevant results to show yet. I will produce results only once I get the search system live.
Tasks Completed:
- Collecting data.
- Processing data.
- CBPCEM
- PLCS
- PLD
- Selecting IR library
Tasks Currently Working On:
- Modifying Lucene
- Collecting search system evaluation results
Remaining Tasks:
- Modifying the score scheme of Lucene
- Using suffix trees
- Making the CBPCEM using DP
- Adding positional dependency to the matching-probability calculation in PLCS
- Adding code to reject a matching when part of the string matches but that part is actually a different word in the dictionary
- Making a scheme to calculate the probability of matching for PLD
- Adding a model to do spell checking on query terms
- Tweaking the system to give better results