Final Report

Objective v/s Accomplishment (in brief):

  • Getting and preparing the data (OCRed and corresponding original versions) along with the topics (test search queries) and qrels (relevance judgements for the topics) files – Done
  • Building the character error model from the data, plus code to build a character error model from any corpus – Done
  • Implementing LCS so that it returns all possible LCSs – Done
  • Making it a Probabilistic LCS (PLCS) using the character error model – Done
  • Making a Probabilistic Levenshtein Distance (PLD) – Done
  • Studying suffix trees to see whether they can be useful – Done
  • Making PLCS return a probability of correctness with each possible LCS – Done
  • Making the complete IR system – In progress
  • —————————– Mid Term —————————–
  • Making the complete IR system – In progress
  • Stemming approach for searching – Done
  • Tweaking and testing the system (experimenting with it) – Not done yet
  • Documentation – Done for the work completed so far

Objective v/s Accomplishment (in detail):

In this section I am going to explain where I was before the mid-term and everything that happened after it. To know more about the previous work, see this.

Before Mid-Term:

I was doing two things in Lucene:

  • Writing code for creating a benchmark:
    I was unable to do this using the Lucene benchmark library. Actually, I wasn't even able to compile it properly.
  • Trying to modify Lucene:
    I was trying to find my way through the Lucene code; I got stuck in many places and didn't know which class was being called.

Now:

In short, I am at the same point as before the mid-term, but with many new insights about what I did wrong and what to do right.

I initially started by compiling the benchmark API with Lucene. The worst part is that Lucene takes about 15-20 minutes to compile and install. Also, to make the benchmark API work I had to resolve its dependencies in both the base Lucene Java source code and the Python interface. But after all the time I spent, it feels nice that I finally succeeded.

Then, as I was unable to understand Lucene's code structure, I started diving into the source code in more breadth. First I tried to make a debugger work with it. I tried pdb, but I already knew it wouldn't work because the base is in Java. Then I tried to set up plain Lucene (not PyLucene) in Eclipse and use its debugger. Setting it up in Eclipse took some time, and when I tried to use the debugger I hit another problem: my code was not stopping at any breakpoint. I searched for this on the net: some say it is a problem with Eclipse, others that it only affects a particular JDK. I wasn't able to figure out the cause, so I dropped it and went back to the most naive approach: putting print statements in the code to debug. The worst part of this decision was knowing it would take about 15-20 minutes each time I compiled and installed Lucene. Finally I figured out what was happening, and what the code flow was at the point where I was previously stuck. Then I spent about 10 more days following the same flow. I understood the basic structure of Lucene and what I would have to modify to make my schemes work with it. To know what I have understood, see this.
After understanding the complete structure and finding what I had to modify, I came to the conclusion that I wouldn't be able to modify Lucene and must shift to something else (knowing that I won't be able to complete my project on time, but also knowing that if I don't shift then I certainly won't complete it on time). I then chose Whoosh, because it is completely Python-based and follows the Lucene structure. This time, before starting to work on it directly, I first contacted its developer to ask whether my idea is possible to implement in Whoosh (and whether I can do it).
You can find my complete chat with Matt Chaput here.

Right now, following what Matt explained, I have started working on Whoosh, and so far I have been able to follow everything without any errors.
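To make the plan concrete, here is a rough sketch of the kind of query type Matt described (see the chat below). It is only a sketch under my current understanding: the ExpandingTerm base class and the expanded_terms() hook are what Matt pointed me to, while error_model, threshold and the variant generator are placeholders for my own scheme, not Whoosh API.

from whoosh import query

def generate_variants(word, error_model, threshold):
    # Single-substitution variants whose misreading probability clears
    # the threshold; the real scheme (PLCS/PLD) is much richer.
    yield word
    for i, ch in enumerate(word):
        for wrong, prob in error_model.get(ch, {}).items():
            if wrong != "n" and prob >= threshold:
                yield word[:i] + wrong + word[i + 1:]

class OCRTerm(query.ExpandingTerm):
    # Expand a clean query term into the erroneous variants that the
    # character error model predicts may appear in the OCRed index.
    def __init__(self, fieldname, text, error_model, threshold=0.01):
        self.fieldname = fieldname
        self.text = text
        self.error_model = error_model
        self.threshold = threshold

    def expanded_terms(self, ixreader, phrases=False):
        fieldobj = ixreader.schema[self.fieldname]
        for variant in generate_variants(self.text, self.error_model,
                                         self.threshold):
            btext = fieldobj.to_bytes(variant)
            # Keep only variants which actually occur in the index.
            if (self.fieldname, btext) in ixreader:
                yield self.fieldname, btext

This checks each generated variant against the index before matching, which is the point Matt makes at the end of the chat: many expansions are possible, but only the ones actually present in the corpus matter.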

What could be the reason?

When I wrote the proposal I knew about the IR techniques, about everything I would have to do to make current search schemes work on erroneous data, and about the algorithms I was going to use. But I had never modified a search system before, though I had used some. And that was the problem. I think if I had had a little experience with modifying a search system, the project would be in a better place now. I was good at designing schemes but not at doing open source development, albeit that is what GSoC is about: doing development and producing a working system at the end, which I have been unable to do. What's good for me is that after this 3-month journey (better, I would say, a venture) I now know how to design schemes and have some experience with modifying search systems, which I am going to build on until I complete the project.

There is one more thing I felt, though I am not sure myself whether it is a real reason or just a way of explaining away my problems. This is the first project I am doing completely alone (the coding part); before this I had always worked in a group. In my view, if you have someone else who knows everything about the project, it is easy to resolve impediments, whereas if you are alone you sometimes spend more time on errors, because maybe you are understanding something completely wrong (and no one is there to correct you), or sometimes you just get scared of an error and believe you won't be able to solve it (and no one is there to push you).

Chat with Matt about Whoosh

Re: [Whoosh] Using whoosh to make search engine over OCRed data
6 messages

Matt <ch4ppy@gmail.com> Mon, Sep 9, 2013 at 12:15 AM
Reply-To: whoosh@googlegroups.com
To: whoosh@googlegroups.com
> Hi,

Hi Abhishek! I’d be happy to help you figure out how to integrate your code with Whoosh.

> The change I want to incorporate:
> I will talk in terms of Vector Space Model.
> I have made some schemes for matching a query term with the labels of the inverted index. My matching schemes use LCS (longest common sub-sequences) and LD (Levenshtein Distance) instead of simple direct matching. So I want to replace the matching scheme at the base with my own schemes.

Do you want to be able to say “find all documents containing this term” but do term matching using a custom function? If so, that’s easy to do with a custom Query object.

Whoosh already has a FuzzyTerm query that finds documents containing terms within N Damerau–Levenshtein distance of a given search term.

FuzzyTerm uses an FSA of all the terms in a field (that also supports spell checking) for speed, which could possibly also help with LCS.

If you can explain exactly what you want to do in relation to search, I can better explain how to do it with Whoosh.

> Also it would be really helpful if you help me understand the basics of code.

Whoosh’s architecture is quite similar to Lucene’s:

(I need something like this in the docs 😉)

* An Index is based on a Schema, which contains a list of all the Fields. (A schema can have “dynamic fields”, so you can say “any time the user tries to index a field that matches *_int, use this Field object”.) — see src/whoosh/fields.py

* A Field defines how input in a certain document field (e.g. “title”) is indexed and looked up. The basic function of a Field object is to translate some input into sortable bytestring “terms”. For example, the NUMERIC field converts numeric input (ints, floats, Decimal) into sortable bytestrings. — see src/whoosh/fields.py

* A Format is wrapped by a Field and controls how a posting is represented as bytes on disk. This allows custom per-posting information similar to Lucene’s “payload” feature. — see src/whoosh/formats.py

* A Codec object provides a high-level interface (e.g. “Add this term to the index”) to lower-level reading and writing subsystems (e.g. an on-disk hash file). This is not documented yet. — see src/whoosh/codec/*.py

* A Writer object wraps a Codec and provides a user-level API for adding to the index (e.g. “add_document”). — see src/whoosh/writing.py

* The current writing architecture uses an append-only log structured merge (LSM) approach, where new documents are written to a new sub-index (called a segment). Over time smaller segments are merged into larger segments.

* A Reader object wraps a Codec and provides a user-level API for reading from the index. — see src/whoosh/reading.py

* A Searcher object wraps a Reader and adds extra search-related methods based on the lower-level Reader methods. — see src/whoosh/searching.py

* A Query object represents an abstract query, such as “find all documents containing the term ‘foo’ in the field ‘bar'”. It is independent of any particular Index. A complex query is represented as a “tree” of Query objects, e.g. And([Term(“x”, “y”), Term(“w”, “z”)]). Whoosh has many built-in query types, such as Phrase, Range, FuzzyTerm, Near, And, Or, Nested, etc. — see src/whoosh/query/*.py

* At search time, a call to the matcher() method on a Query object translates the Query tree into an equivalent Matcher tree specific to a given Searcher object. For example, a FuzzyTerm query object will be translated into a Union matcher of term matchers for all the index terms within N edits of the search term. Matchers do the actual work of examining term information in the index to find matching documents. — see src/whoosh/matching/*.py

* A Collector object does the work of “running” the Matcher tree and collecting the results. The Collector returns a Results object, which lets you access Hit objects. Whoosh has several collector objects, such as one that only returns the top N results, and collectors that wrap other collectors, such as one that remembers which terms matched in which documents. — see src/whoosh/collectors.py

Cheers,

Matt




Matt Chaput <matt@whoosh.ca> Mon, Sep 9, 2013 at 6:53 AM
Reply-To: whoosh@googlegroups.com
To: whoosh@googlegroups.com
> * At search time, a call to the matcher() method on a Query object translates the Query tree into an equivalent Matcher tree specific to a given Searcher object. For example, a FuzzyTerm query object will be translated into a Union matcher of term matchers for all the index terms within N edits of the search term. Matchers do the actual work of examining term information in the index to find matching documents. — see src/whoosh/matching/*.py

I forgot:

* A WeightingModel object represents a document scoring algorithm, e.g. BM25F. At search time, a WeightingModel is used to create a Scorer object which is passed to the leaves in the matcher tree to actually score matching documents. However, it is not REQUIRED for a Matcher's score() function to use the Scorer to calculate the score… a matcher may simply compute its own score value where that makes sense. — see src/whoosh/scoring.py


Abhishek Gupta <abhi.bansal21@gmail.com> Mon, Sep 9, 2013 at 3:24 PM
To: whoosh@googlegroups.com
Hey Matt, thanks for replying (so fast, and for explaining everything, though I had already got the gist of the Whoosh architecture because I had been diving into the Whoosh source code and documentation for the last two days).

Now first I will explain in detail what I want to do:
What I have to do:
I am making a search system which is going to search over erroneous data (because I am searching OCRed data). For example, suppose I am searching for abhishek and there is no abhishek in my whole corpus, but there are bishek, ahiek, …. So I must return them.
My Schemes:
As I already told you, my schemes are based on LCS and LD. But there is a twist: I am not using them directly, because using them directly would also match twitter & platter just as abhishek is matched with bishek. The error in the data is due to wrong predictions by the OCR, but in the case of twitter vs platter (suppose twitter is my search query and platter is a word appearing in many documents) there is very little chance that platter appeared in the data due to a misreading of twitter by the OCR, because there is very little chance of any letter of twitter being replaced by any letter of platter by the OCR, except for i -> l. There are also other kinds of errors introduced by OCR, like deletion/removal of a letter or insertion of a new letter.
So to get rid of these kinds of errors I have developed a character error model which lists the probability of each letter being converted into every other letter by the OCR, and I developed my LCS and LD schemes to make them probabilistic (PLCS and PLD). For example, take the two words twitter and platter: simple LCS would return tter, but PLCS may return itter if the probability of misreading i as l is greater than a pre-specified threshold. So I am returning all the possible matches along with the probability of how correct each matching could be.
How it is different from already present schemes in whoosh:
You suggested using FuzzyTerm, but this is different. In my case there is no problem in the query (the user's need is conveyed properly in the query); the errors are in the data. That is, I don't have to consider different versions of the query terms; instead I have to consider different versions of every term in the data. And I can't simply use an N-gram approach, because a word may be totally different.
Let me put it in very simple language:
I am going to give a partial yes to the matching of all the following pairs: (i,l), (e,c), (d,l), (b,l), (v,u), (P,R), but your schemes won't consider them.
What I have to do with whoosh:
  1. I have to modify the code of the collector to use my matching schemes.
  2. My matching schemes will return a probability which I then have to use in scoring.
What I want from you:
I already tried modifying the collector of Lucene, but had no success, because I was unable to completely understand the collector's code, where everything related to UTF encoding and how the index is stored comes in.
Also, I was unable to find all the points in Lucene where I would have to make this change, because there are multiple collectors and I would have to modify them all. Here it will be easier for me, as I am able to use a debugger to follow the flow of the code; I was unable to use any debugger with PyLucene.
Just help me out in understanding the collector and all the points which I have to modify.
Thanking you,
Abhishek Gupta



Matt Chaput <matt@whoosh.ca> Mon, Sep 9, 2013 at 9:36 PM
Reply-To: whoosh@googlegroups.com
To: whoosh@googlegroups.com
Sorry, I wasn't clear. I'd like to know how you want it to work with the search engine at a high level. For example, here's a guess:

“I want to be able to search for ‘twitter’ and have it also find,
for example, documents containing ‘twltter’ or ‘twlHer’, based on
my algorithm. I’d like to score each document based on which
generated terms matched.”

If that is correct, here’s an outline of what you’ll need:

* A custom Query type representing “search for this term using Abhishek Gupta’s algorithm”. The query would be very similar to the “ExpandingTerm” subclasses in Whoosh (FuzzyTerm and Variations). You would need to override the query’s expanded_terms() method to “expand” the user’s search term into the “error” words to look for in the inverted index.

* A custom Matcher type generated by the Query type. The matcher would wrap a normal term matcher, but return a custom score based on your algorithm (using, I guess, some combination of the sub-matcher’s score and the amount of “error” in the term).

* If your scoring algorithm needs to know which terms matched in a document, then you need a custom Collector that will look into the matcher tree at each matched document, find the matching terms, and use them as inputs to the algorithm. (The custom Matcher may need to have properties on it indicating which Query it was generated from, if you need to know which matched terms came from each original search term.)

Unfortunately, these three areas (subclassing ExpandingTerm, custom Matchers, and custom Collectors) are under-documented because they’re a bit low level.

If you don’t mind, I’ll try implementing a quick skeleton of how you would do this in Whoosh (with a fake function plugged in where your algorithm would go). I’m interested to see for myself how easy or hard it will be.


Abhishek Gupta <abhi.bansal21@gmail.com> Tue, Sep 10, 2013 at 12:31 AM
To: whoosh@googlegroups.com
Yes, you are right, and great (thanks for all the explanation you have provided)!! I have to do the same.

But there is one correction. In my view I shouldn't use ExpandingTerm, because it would make my search system slow: according to my error model a single word could be expanded into many words. So what I have planned, instead of using the normal term matcher, is to write my own matcher which will do the matching in terms of my scheme. I am following this way because, suppose I can expand the word twitter into 10 different words based on my scheme (error model), but in the actual corpus only two out of those ten words are present; this way I don't have to consider all the cases. Simply put, there are many possibilities for a word to expand into, but only some of them can appear in the corpus.
So please explain anything else you want to tell me about this scheme (not using the normal term matcher and instead making my own).
I really appreciate your help, but for the sake of learning I first want to do it on my own, else I will not be satisfied with my work (or at least my effort). I will continuously disturb you if I have any doubt 😛
 
Abhishek



Matt Chaput <matt@whoosh.ca> Tue, Sep 10, 2013 at 1:03 AM
Reply-To: whoosh@googlegroups.com
To: whoosh@googlegroups.com
> But there is one correction. In my view I shouldn't use ExpandingTerm, because it would make my search system slow: according to my error model a single word could be expanded into many words. So what I have planned, instead of using the normal term matcher, is to write my own matcher which will do the matching in terms of my scheme. I am following this way because, suppose I can expand the word twitter into 10 different words based on my scheme (error model), but in the actual corpus only two out of those ten words are present; this way I don't have to consider all the cases. Simply put, there are many possibilities for a word to expand into, but only some of them can appear in the corpus.

You can quickly tell which terms are in the index when you construct the Matcher object in Query.matcher(), e.g.:

class MyQuery(Query):

    def matcher(self, searcher, context=None):
        fieldname = self.field()

        # Usually a query is created with unicode text
        text = self.text

        # Ask the field to convert the term text to bytes
        fieldobj = searcher.schema[fieldname]
        bytestring = fieldobj.to_bytes(text)

        # Check the term's existence against the index reader
        ixreader = searcher.reader()
        if (fieldname, bytestring) in ixreader:
            # Searcher.postings() should maybe be called
            # Searcher.term_matcher() or something…
            return searcher.postings(fieldname, bytestring)
        else:
            return matching.NullMatcher()

The collector also prunes empty/finished matchers as it goes (in the Matcher.replace() method).

Since the index maps terms to lists of documents containing the terms, I'm not sure how your matcher will work if not by generating all the error variations and doing a big Union of them (like the Variations query). You could set the field to have a forward index (vector=True) and then examine the words in every document, but I think that will be slower. But maybe I just don't understand, so I'll let you work it out 🙂

Let me know whenever you need more info 🙂


Lucene is so big!!

Previously:
Working on solving the problems in evaluating Lucene and modifying Lucene.

Today/ In 10 Days / Tomorrow/ Impediments:
First of all, out of these 100 days, 4 days were wasted on my exams. The new GSoC timeline clashes badly with my college studies.

I decided to give two more days to evaluating Lucene: if I succeeded I would continue, else I would leave it and try to build my own evaluator. Fortunately, all the problems I was facing before were solved in these two days; unfortunately, Benchmark is still not working and I have no clue about the cause. So I decided to leave it for the future (which is getting smaller and smaller).

Now about modifying Lucene. Everything started with this, and I have tried many things with it; problems were stacking up exponentially. Since last time I have faced two more ridiculous problems. One is an Ivy error. I tried to solve it for about a day with no success, so I deleted the downloaded Ivy library (for context: the problem was that when I compile Lucene it tries to download Ivy even though Ivy is already downloaded there, and I don't know why it doesn't pick it up), and then everything worked fine. I don't even have a clue why this happens.

The other error: when I compile Lucene with some new modifications, it compiles completely but those modifications don't take effect. I tried many things but nothing worked. So I deleted my Lucene code and replaced it with fresh code from Lucene; when I compiled this, the changes took effect. But when I made some more changes, the same problem appeared again on the second compile. I thought through all the possible causes and followed various hunches, and luckily one worked: I have to clean the already-built jars first and then compile. I come from a C/C++ background, where the build tools work entirely on the basis of the time-stamps of files and the programs built from them; you can clean there, but you don't have to. I don't know why it behaves this way here, but I am happy I found the problem, because it saves me about 20 minutes on each compilation of Lucene, and now I only have to maintain a single version of Lucene.

What about the problems I was facing before in modifying Lucene? Luckily I managed to get past each one to some extent. Right now I am going deeper and deeper into Lucene to find the code block which does the actual matching. Lucene is too big; there are so many things going on in there. As I am unable to set up any debugger, I have to do things manually with print statements.
Right now I am only trying to follow an example of a single-word query in order to modify Lucene. But there are about 13 types of queries:

  • TermQuery
  • BooleanQuery
  • WildcardQuery
  • PhraseQuery
  • PrefixQuery
  • MultiPhraseQuery
  • FuzzyQuery
  • RegexpQuery
  • TermRangeQuery
  • NumericRangeQuery
  • ConstantScoreQuery
  • DisjunctionMaxQuery
  • MatchAllDocsQuery

as seen from outside; inside, I don't know how subtle it gets. I am currently using TermQuery to follow everything in Lucene; there are 12 more. Also, right now I am only doing a simple search, but there are options for multi-threaded, distributed, real-time search and search over multiple indexes. Then there are further classifications for different types of data structures. Then there is the index, so there is everything related to storage and memory optimization. This is what I have been able to find until now, but I don't know all that this beast contains.

To give a glimpse of how much is going on under the hood, here are the function calls I am tracing for a single TermQuery with default settings, i.e. without bothering about multi-threaded/distributed search.

[Figure: "lucene subtleness", the traced function calls for a single TermQuery]

As you can see, these are only the points of change I have been able to find so far (the number of combinations is already crossing 8000: 13*2*3*3*6*6), and I don't know what more is there. But I am pretty sure there is much more related to the choice of data structure, index type, and search schemes (right now I am simply using the Vector Space Model).

Problems are stacking up!!

First of all yippee!! I have passed the mid-term!!

Previously:
Wrote a mid-term summary, and was working on modifying Lucene & gathering evaluation results. I was also reading Lucene in Action.

Today (actually covering about 20 days, :-O):
First of all, you may be wondering why I haven't posted anything since the mid-term. At the start I was posting about daily, and now the gap has gradually grown to 20 days. The reasons are many, but I think the most significant one is that I haven't managed to successfully achieve anything in this period. It's not that I haven't worked, but primarily that I am facing more and more problems and getting confused & frustrated while solving them, because while solving one I keep finding two more. And these never-ending errors are killing me, though not literally.

So, a bit about the problems I am facing right now:

In gathering evaluation results:
Initially the problem was that I needed some Java libraries to use Lucene's benchmark API, and these Java libraries are not included in JCC. So I tried to get some help from the Lucene developer community, and they helped me include those Java libraries by modifying the main Makefile. But, as I mentioned above, I got more errors. The error is that the benchmark API is not included in the default PyLucene build, and it is also not included in the default process of building PyLucene from source. So I first had to figure out how to include it in the build. This is not easy for an anti-Java programmer, because Java build files seem so complex at first (Lucene uses Ant to build itself). Also, the confusion about what to modify, the base Lucene itself or the Python port, always stays with me. Andi (who responded to me on the developer list) gave me a hint to modify the Makefile, and as I didn't want to modify the Java build XML files, I started studying the Makefile and somehow included benchmark-jar as part of the main package target. But, as I said, I was sure I would get more errors, and indeed I got another one: while compiling PyLucene it now tries to compile benchmark-jar, which has a dependency on another API (org.apache.commons.compress.compressors). I have wasted about 5-6 days on this error with no success so far. I tried compiling this library manually and then providing the path to its compiled version; then I got another dependency error. I repeated this over and over, four levels down, until I hit a dependency on a lib whose source I am unable to find.

The main problem was that I was trying to modify the main Makefile, when I actually had to modify the build.xml file of the Lucene package. I found this solution because I tried to manually install Lucene (not PyLucene) and then compile benchmark for it; to make it work I had to modify the Ant XML file, which I was afraid of initially.

So finally, today I applied the same modifications to the base Lucene inside PyLucene, but the problem is not yet solved. There is now a benchmark.jar present in the right place, yet my code gives the same error, i.e. that there is no benchmark. I will try to solve it tomorrow.

The complete solution (or what seems to be a solution for now) is mentioned here.

Modifying Lucene:
As I was previously stuck finding the base source code which does the actual mapping between the query terms and the inverted index, I started reading the book Lucene in Action to make myself more familiar with Lucene's structure, design and control flow. I haven't finished the book, but I have read the 5-6 relevant chapters. The book, however, is written more from a user's perspective than a developer's.

I also tried to study the source code in more depth and started spending frustrating hours looking at the same code again & again so that I don't miss any subtle detail. I have understood many more things and now understand the code at a deeper level. Sharing the exact details of what I have understood here would be redundant.

I still haven't deduced the complete mechanism, but I am trying. I also tried to debug it using pdb (the Python debugger) to deduce the control flow, but pdb doesn't have any control over the JVM. So I set up the complete Lucene (not PyLucene) source code in Eclipse to debug it. But there I hit the problem that the Eclipse debugger does not stop at breakpoints, and I don't know why. I searched for it on the net without success. As some mentioned on SO that it is a problem with newer Eclipse and that using an older version (Indigo) solves it, I tried with that Eclipse too; still no success. Out of frustration I also tried Terrier in between, but I realized that I would again have to spend several days understanding its basic workflow, usage, structure and all that first-time stuff that I have already gone through for Lucene, and that every error can be solved given endless effort. Actually, I think the thought that there is something I haven't been able to solve doesn't let me move past it. Anyway, right now I am trying hard and have understood many more things. After a few more attempts, if I get no success, I will contact Michael.

Tomorrow:
I will work on the same thing.

Impediments:
Yes, right now I am stuck on both modifying Lucene & gathering evaluation results.

Mid-Term Summary!!

I am going to write this so that anyone new to the project can understand everything.

What is my Project?

The title is Improving information retrieval methods for OCR data sets consisting of Indic scripts. It means I have to build a search engine which is capable of searching over erroneous data. What does that mean? Before explaining it, I first want to explain what a search engine means today.

Search Engine: a system which is capable of finding the documents related to the information need of a user, which the user expresses as a search query. It may be that what the user needs is not directly specified in any of the documents available to the search engine; it may be hidden in the documents. So the search system should also be intelligent, so that it can satisfy the user's needs.

What is different about the search system I am going to build is the data available to it, over which I am going to search for the user's information need. The data is erroneous, meaning the term the user is searching for may not directly exist in the data. "Directly" in the sense that there is no term in the data which completely matches the user's term. For example, suppose the user is searching for "surreal places Alaska" and in the corpus there is no word which completely matches surreal, places or Alaska, but there are words like sral, sueal, sural, plces, paces, alask, alas, ask. So what I have to do is match each of the query terms against the closely related words in the corpus. This adds another level of intelligence to the search system, and this intelligence is a bit complex, as in normal language we use words like intuition or feeling for this kind of judgement. For example, according to my intuition, alask is an erroneous form of alaska. So I have to give this intuition capability to my search system, and then order the result documents by strength of intuition.

This project differs from other GSoC projects in that it starts from scratch; there is no prior work done on it. So the toughest thing in this project is that I have to take many decisions about the design of the system: which libraries I am going to use, and how I am going to add the intuition capability to the search system. It was pretty clear from the first talk I had with my mentor that he wants me to take all these decisions myself, so that I learn more, research more before deciding, and appreciate the importance of a right decision by experiencing the problems a wrong one causes. Taking decisions is tough for me, because simply following the community standard doesn't satisfy me; I am satisfied that a decision is good only after experiencing why the others are bad. Taking decisions brings a lot of responsibility with it. In the end, I am sure this will be a lifelong learning experience for me.

What I have planned:

I am not going to explain my plan in detail here, else I would be re-writing my proposal. For the complete proposal you can go here (for a more readable version go here, please).

In brief my plan was:

First I have to create some schemes which will help me match the query terms against the erroneous data. The following are the schemes I proposed:

  • PLCS: a probabilistic version of Longest Common Sub-sequence which returns the matching probability for each possible LCS.
  • PLD: a probabilistic version of Levenshtein Distance.
  • Suffix tree: a probabilistic version of the suffix tree, to do the matching very efficiently.
  • Stemming: a simple stemmer.

Then I am going to merge the results of all these schemes in a weighted manner to get the final results, arranged in the weighted order of their matching scores.

One of the most important tasks of this project is to create the Content-Based Probabilistic Character Error Model (CBPCEM). This is a model which represents the probability of every character being misinterpreted as any other character, or as nothing. This misinterpretation happens at OCR time. I will construct this model from the corpus itself.

The other thing I have to consider is doing everything in a language-independent way, because my system should support all Indic scripts. This adds another layer of abstraction to the system.

Work Timeline:

Data:

Initially I started with the task of getting the data-set. There is a data set present on the web which was made for this kind of project by the FIRE community for their RISOT-2011 task, but its data is not publicly available, and the procedure to get it is long. Here the recommendation of Prasenjit sir helped me: Utpal Garain is one of the core members of RISOT, and Prasenjit sir told me to contact him directly on his behalf to get access to the data. Utpal responded quickly and helped me get access to the RISOT data.

Data Processing:

Now I had the data. The first thing I did was to collect simple stats about it. The data-set consists of two versions of the same data: the original data without any errors (the base data) and the OCRed data with OCR errors (the erroneous version of the base data). While collecting the stats I found that there are some inconsistencies between the files present in the two versions of the data. So my first task was to find such files and resolve the problem accordingly, logging all the information. You can find some of the corpus stats here and stats about the inconsistencies between the two versions of the data here. The script which resolves these inconsistencies by deleting files is here.

Content Based Probabilistic Character Error Model:

Now my task was to make the CBPCEM, but the data-set with the actual data contains a lot of meta information which will be used while indexing the data but is of no use right now. So I wrote scripts to clean the data, which you can find here.

I had to write the code to build the CBPCEM, and the simplest way to do so is to first map the two versions of the data and then count what maps to what. This sounds trivial now, but it is one of the most challenging parts of this project, because the two versions of the data are different. It is very difficult to map them; it may even be that there is no common map point between the two versions. To get an intuition of the difference between the two versions of the data, I wrote a script which compares the two versions for each file. You can find the script here. The stats for the difference between the two versions of the data can be found here; the difference between the two versions is pretty clear from them.

So I wrote code, based on Levenshtein Distance, a word-window size and some other parameters, which maps the words between the two versions of the data. This code may not find all the mappings, but it finds a sufficient number of mappings to construct the CBPCEM. Also, the number of mappings returned can be maximized empirically by tweaking the parameters.
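Roughly, the mapping pass looks like this (a minimal sketch, not the actual code; the window and distance parameters and the use of the python-Levenshtein package are assumptions for illustration):

import Levenshtein  # python-Levenshtein package (an assumption here)

def map_words(orig_words, ocr_words, window=5, max_dist=2):
    # For each word of the original text, look for the closest OCRed
    # word inside a window around the same position, and accept the
    # pair if the Levenshtein distance is small enough.
    pairs = []
    for i, word in enumerate(orig_words):
        lo = max(0, i - window)
        hi = min(len(ocr_words), i + window + 1)
        candidates = [(Levenshtein.distance(word, c), c)
                      for c in ocr_words[lo:hi]]
        if candidates:
            dist, best = min(candidates)
            if dist <= max_dist:
                pairs.append((word, best))
    return pairs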

Now suppose I have the words matched between the two versions of the data. The main problem starts here: the two words may be of different lengths, so how am I going to find the mapping of which letter maps to which? For this I wrote code which uses some manual case rejections and recursion to find every possible mapping between the two words and return the best among them. The ideal way to do this is to use Dynamic Programming (DP) to check all the possible cases, with memoization to make it faster. But writing the code to check all possible cases (in my head) led me to reject many cases without even checking them, because they cannot occur. You can find the complete code here. This code is very precious to me (just like in LOTR).
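For a flavour of the problem, here is a sketch of the standard way to recover one character alignment, using the edit-distance table with backtracking ("n" standing for nothing, as in the model). My actual code works differently: it uses recursion with hand-written case rejections and returns the best of every possible mapping.

def align(orig, ocr):
    # Fill the standard edit-distance table...
    m, n = len(orig), len(ocr)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if orig[i - 1] == ocr[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    # ...then backtrack to read off (original char, OCRed char) pairs.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if orig[i - 1] == ocr[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                pairs.append((orig[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            pairs.append((orig[i - 1], "n"))   # letter dropped by OCR
            i -= 1
        else:
            pairs.append(("n", ocr[j - 1]))    # letter inserted by OCR
            j -= 1
    return pairs[::-1]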

This code generates a Content-Based Statistical Character Error Model, i.e. a model which represents the number of times every character is misinterpreted as every other character or as nothing. Then I had to write the code to convert this model into a probabilistic one. The script which does so can be found here. The Content-Based Probabilistic Character Error Model for Bengali can be found here.
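The conversion itself is just row-wise normalisation; roughly (a sketch, assuming the counts are stored as nested dicts shaped like the JSON below):

def to_probabilistic(count_model):
    # count_model maps each character to the counts of what it was
    # read as, e.g. {"ম": {"ম": 12040, "ন": 287, "n": 391}} (the
    # counts here are made up for illustration). Normalise each row
    # so its outcome probabilities sum to 1.
    prob_model = {}
    for char, confusions in count_model.items():
        total = float(sum(confusions.values()))
        prob_model[char] = {seen: count / total
                            for seen, count in confusions.items()}
    return prob_model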

Here is an example of what the CBPCEM looks like (for two letters; n stands for nothing):

"ম": {
 "n": 0.022254117995318203,
 "ং": 0.0009977707581110258,
 "অ": 0.01333585210401744,
 "ই": 0.0023389512306272128,
 "উ": 0.0032614185352958973,
 "এ": 0.0023746554542421164,
 "ঐ": 6.426760250682599e-05,
 "ও": 0.002362321267902422,
 "ক": 0.01837599014303761,
 "খ": 0.0034892763987291893,
 "গ": 0.007441409369047944,
 "ঘ": 0.0014502406464671645,
 "ঙ": 0.0014119397520439044,
 "চ": 0.0045597539394742,
 "ছ": 0.003800227728029893,
 "জ": 0.009255833096387122,
 "ঝ": 0.00042585401257048337,
 "ঞ": 0.00042260817406003764,
 "ট": 0.004620126535768492,
 "ঠ": 0.000912729789137347,
 "ড": 0.002061107454133056,
 "ঢ": 0.00026031624853774975,
 "ণ": 0.0010529500127886037,
 "ত": 0.010406158264489098,
 "থ": 0.005606212275241913,
 "দ": 0.009447337568503422,
 "ধ": 0.00234868874615855,
 "ন": 0.016321374365925446,
 "প": 0.010843697295697187,
 "ফ": 0.0019117988826525512,
 "ব": 0.017607375583764056,
 "ভ": 0.004161164970391461,
 "ম": 0.6847797438903581,
 "য": 0.003724924274587551,
 "র": 0.02017678134863292,
 "ল": 0.00929543232621456,
 "শ": 0.005808103430591638,
 "ষ": 0.019191344776861585,
 "স": 0.013529304079240006,
 "হ": 0.00619695488414304,
 "া": 0.013499442364943905,
 "ি": 0.006618913890500989,
 "ী": 0.0007342086710628304,
 "ু": 0.0006796785840873416,
 "ৃ": 0.0001350268820345435,
 "ে": 0.010532745966396483,
 "ৈ": 0.00024084121747507521,
 "্": 0.011382506488431183,
 "ৗ": 0.0003278296895550215,
 "ড়": 0.0018007912055953063,
 "য়": 0.006161899828230226
 },

"n": {
 "ং": 0.0072325395044072075,
 "অ": 0.014854986795622836,
 "ই": 0.0053889755072925255,
 "উ": 0.005008565474265337,
 "এ": 0.0059438442808275845,
 "ঐ": 0.00011972168582130326,
 "ও": 0.005878609708803868,
 "ক": 0.025695173091563845,
 "খ": 0.0059325969408234955,
 "গ": 0.011551018184199437,
 "ঘ": 0.0017663322624199378,
 "ঙ": 0.002333198198626025,
 "চ": 0.008203809798982541,
 "ছ": 0.007292275376873369,
 "জ": 0.014174397754930958,
 "ঝ": 0.0006296010993400061,
 "ঞ": 0.0017415881144109417,
 "ট": 0.009287303552709784,
 "ঠ": 0.0024619177564505993,
 "ড": 0.003955064627215663,
 "ঢ": 0.00038790825969658174,
 "ণ": 0.0030660248851146705,
 "ত": 0.021603640738965233,
 "থ": 0.008017103954914662,
 "দ": 0.017079710648431647,
 "ধ": 0.005544188799348954,
 "ন": 0.02907412396968117,
 "প": 0.015717782744380952,
 "ফ": 0.0030697739984493666,
 "ব": 0.028846177878931633,
 "ভ": 0.008757178927183721,
 "ম": 0.02554645826262089,
 "য": 0.014718019188461929,
 "র": 0.0474590259403651,
 "ল": 0.021910318209743394,
 "শ": 0.01155926623353577,
 "ষ": 0.009292802252267338,
 "স": 0.02419477793501837,
 "হ": 0.00824430022299726,
 "া": 0.2104794716049654,
 "ি": 0.09217819985573412,
 "ী": 0.00904561071306636,
 "ু": 0.008577971309785236,
 "ূ": 1.749586222858294e-06,
 "ৃ": 0.002059512925193192,
 "ে": 0.13432273267372266,
 "ৈ": 0.00205126487585686,
 "্": 0.07317044518971264,
 "ৗ": 0.006015077434186815,
 "ড়": 0.0033397101585475033,
 "য়": 0.015218150907310421
 },

Probabilistic Longest Common Sub-sequence (PLCS):

This task involves writing a probabilistic version of the LCS code. What do I mean by probabilistic? A simple LCS tries to match the longest common sub-sequence, but in my case it may be that one of the two words we are comparing has, say, one erroneous letter, so normal LCS would return a string one shorter than what should be returned if there were no error. PLCS works as follows: if two letters do not match, PLCS consults the CBPCEM with a pre-specified threshold. For the letter in the data (not in the query term, which I am assuming to be correct) PLCS tries all the misinterpretations which have probability greater than the threshold. If a letter then matches, it is taken as a match, but with the probability of the match set to the probability of the misinterpretation. The overall probability of the match of the complete word is the product of the probabilities of the matches of all the letters in the returned common sequence.

Along with this, there can be one more issue: for two words, multiple common sub-sequences of the same maximal length may exist. So I have to consider them all and return them all with their own matching probabilities. How I am doing this is explained here. Note also that the probabilities of these same-length common sub-sequences may differ.

Also, right now I am simply multiplying the probabilities of all the matched letters. But in the case of multiple same-length common sub-sequences it may happen that, in one of them, starting letters of one word map to ending letters of the other word. Though the number of matches is the same, this is wrong. So I should also bring positional dependency into assigning the probability to a matching. I haven't done this yet, but will do it after building a simple, basic, complete search system first.

You can find the code for PLCS here. This code returns all the possible PLCSs for two words, with the LCS size, mapping details and matching probability.
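To give a taste of the core recurrence, here is a much-simplified sketch of the idea (not the linked code: it returns a single best length and probability instead of enumerating every maximal sub-sequence with its mapping):

def plcs(query_word, data_word, error_model, threshold=0.01):
    # dp[i][j] holds (LCS length, matching probability) for the
    # prefixes query_word[:i] and data_word[:j]. Two letters "match"
    # either exactly (probability 1) or through the error model, when
    # the misreading probability clears the threshold.
    m, n = len(query_word), len(data_word)
    dp = [[(0, 1.0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            q, d = query_word[i - 1], data_word[j - 1]
            p = 1.0 if q == d else error_model.get(q, {}).get(d, 0.0)
            best = max(dp[i - 1][j], dp[i][j - 1])
            if p >= threshold:
                length, prob = dp[i - 1][j - 1]
                best = max(best, (length + 1, prob * p))
            dp[i][j] = best
    return dp[m][n]

On the twitter/platter example, this returns length 5 for itter whenever the i-to-l probability in the model clears the threshold, and falls back to the plain LCS tter otherwise.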

Probabilistic Levenshtein Distance (PLD):

Just as I explained the use of the CBPCEM in making PLCS, I am using it here to make LD probabilistic. In the same way, PLD could also return a matching probability, but right now I am not returning it, because the PLD probability function is going to be very complex. This is because the distance we return is the number of mismatches, and now I have a probability for the matched part & for the mismatched part, while the distance comes from the unmatched part. So the final probability would be a combination of these three things, which I still have to derive. I will do it after building the search system first.

In the case of PLD there are no multiple cases as with PLCS, so no worry there; there is also no problem of position-dependent matching, as LD considers the whole string, not a sub-sequence of it.

You can find the code for PLD here. It takes two strings, one original and one erroneous, uses the CBPCEM for the erroneous one, and (for now) returns just the PLD.
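A minimal sketch of the distance part follows (the probability part, as said above, is still to be derived; treating a likely misreading as a zero-cost substitution is a simplification of what the linked code does):

def pld(original, erroneous, error_model, threshold=0.01):
    # Standard Levenshtein DP, except that substituting a letter costs
    # nothing when the error model says the original letter is misread
    # as the erroneous one with probability above the threshold.
    m, n = len(original), len(erroneous)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            o, e = original[i - 1], erroneous[j - 1]
            likely = error_model.get(o, {}).get(e, 0.0) >= threshold
            sub = 0 if (o == e or likely) else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + sub)   # (mis)match
    return dp[m][n]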

Suffix Trees:

I have just studied suffix trees, am already impressed by them, and plan to use them, but only after making the search system. More about suffix trees is here.

Deciding IR Library:

My next task was to decide which search library I am going to use. The funny part of this task is that I initially assumed it to be a simple, even trivial, decision. But I think it is the toughest decision I have had to take, and it took me about 10-12 days just to check some of the search libraries. This decision really showed me how tough a "trivial" decision can sometimes be, not because it is hard in itself, but because there are many options and the whole future of the project depends on it, so you want to select the best and most suitable option. In more than 10 days I was only able to compare some of these libraries and manually test just two of them: Lucene and Terrier. My complete research about this can be found here.

Finally I reduced my options to two, Terrier and Lucene, but selecting between two is the toughest part. I selected Lucene because I found its code better organised and because Lucene is the de facto IR library in the industry, even though Lucene's code is more complex than Terrier's. Also, Terrier supports the TREC format and evaluation methods by default, while for Lucene I have to write them manually. Still, I selected Lucene.

Modifying Lucene:

Now I have to modify the Lucene code to include my schemes in it. I started trying to understand the code but got stuck at one place. I tried again & again with no success, so I finally asked for help on Stack Overflow, the Lucene-dev IRC channel & the mailing list. Initially I got no response, but then Mike McCandless replied. Our discussion so far is posted here. Right now I have not responded to him; I plan to read the book Lucene in Action this weekend (which is a huge amount of work) before contacting him again. But right now I am stuck here.

Collecting Search Evaluations:

This involves getting the evaluation of the Lucene search system on the original data and then on the OCRed data. After that I will get the evaluation of the modified Lucene on the OCRed data; this way I will have a three-point comparison for my search system. Right now I have indexed the original data and the OCRed data with the normal Lucene system. Doing this is kind of confusing, because I am using PyLucene and it is the first time I am using a language port; I don't know how it works, so I am a bit confused about where to use Python syntax & classes and where to use Java syntax & classes. The code for the indexing can be found here (and the helper code for the indexer here).
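For the record, the gist of the indexing code looks like the sketch below (much reduced; the exact import paths and the Version constant depend on the PyLucene release, so treat them as assumptions):

import lucene
lucene.initVM()

# Java classes imported into Python via JCC; this Java/Python mix is
# exactly what gets confusing.
from java.io import File
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.util import Version

directory = SimpleFSDirectory(File("index"))
analyzer = StandardAnalyzer(Version.LUCENE_43)
writer = IndexWriter(directory, IndexWriterConfig(Version.LUCENE_43, analyzer))

def add_doc(docno, text):
    doc = Document()
    doc.add(Field("docno", docno, TextField.TYPE_STORED))
    doc.add(Field("contents", text, TextField.TYPE_NOT_STORED))
    writer.addDocument(doc)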

I also wrote the code for searching, but I still have to write the code to parse the TREC topics format. The code for the search files can be found here.

Right now I have stopped doing this, because I want to read the Lucene in Action book first, so that I know everything about Lucene. This is my weakness: if I am doing something and I don't fully understand it, I feel unsatisfied. It may be that there is something already present in Lucene to handle the TREC format.

Expected Schedule vs Reality:

According to the schedule proposed in the proposal, I should by now have completed everything up to building a search system. But right now I am stuck there, as explained above. It may take more than one week to set up the complete system. This happened because there are many things mentioned above which I hadn't thought about while writing the schedule in the proposal. I think I can recover, as the proposal mentions nothing after this point except tweaking the system. Everything else is up to my expectations if we only look at the end points, not the complete path, because the path has changed completely compared to what I described in the proposal. For example, I put CBPCEM after PLCS and PLD, but I did the opposite. While writing the schedule I missed many details in a single line, but I have been able to absorb this so far, except for the search system, because selecting the search library took more than 10 days where I had estimated zero.

Results Till Now:

Except for the CBPCEM (which can be found here), there is no other relevant result to show. I will produce results only once I am able to make the search system live.

Tasks Completed:

  • Collecting data.
  • Processing data.
  • CBPCEM
  • PLCS
  • PLD
  • Selecting IR library

Tasks Currently working on:

  • Modifying Lucene
  • Collecting search system evaluation results

Remaining Tasks:

  • Modifying the scoring scheme of Lucene
  • Using suffix trees
  • Making the CBPCEM using DP
  • Adding positional dependency to the matching-probability calculation in PLCS
  • Adding code to reject a matching if some part of the string matches but that part is actually a different word in the dictionary.
  • Making a scheme to calculate the probability of matching for PLD.
  • Adding a model to do spell checking on query terms.
  • Tweaking the system to give better results.

Proposal!! (a more readable version)

Improving IR methods on OCRed texts using Probabilistic and Voting approaches.

 

Synopsis:

In this project we are going to apply multiple techniques for doing IR on OCRed text. Each technique will give multiple weighted results for each query, where these results contain each possibility (to mitigate the error introduced by OCR) with a probability assigned to it. Then we are going to use a weighted voting scheme to find the final results.

How is your project going to benefit language computing:

This project is going to develop an approach based on some already applied approaches (actually, combining and modifying those approaches to gain the benefits of each) for IR on OCRed text. The project will be built as an open source tool which anyone can use (there will be an API), modify (the code will be open and well documented), or change a single module of (the whole system will be modular). The complete project will also be implemented so that it can be used with any Indic script. As there is a great need today for performing IR on OCRed text, this tool is going to act as a basic IR system for noisy data in any Indic script.

Why would you like to do a project with us:

I would like to do a project on IR for Indian languages as I have already been doing this for about 5 months as part of my college course on Information Retrieval under Prof. Prasenjit Majumder. This kind of project is very intriguing, as one can extract information very easily from a very big corpus. And this is the only organisation in GSoC working on IR for Indian languages. Also, doing IR on erroneous data adds another level of complexity, which I like.

How many hours per week can you commit to the project:

My plan is to give 35-40 hours per week to this project.

Implementation details of the project:

First we are going to set up the corpus and qrels files for various languages. Then we are going to write code to extract various features for the various schemes that we are planning to apply.

Various schemes are:

  • Probabilistic Longest Common Sub-sequence (PLCS): One team applied longest common sub-sequence to the same task and got good results in RISOT 2012. We are going to modify this scheme to make it less prone to low recall. We are going to build a character-based language model for the OCRed text which will contain the probability of mis-classification of each character into any other character, as this and this paper suggest that OCR errors mainly occur for certain characters, and very often for those. First we find the LCS, and then we try to increase its size by adding more letters at the start or end, where possible, according to the language model, i.e. if the letters we are adding (which will be different between the two words in consideration, else it wouldn't be the LCS) have a mis-classification probability greater than a certain threshold. We repeat this for all the words, for as long as the LCS length can grow with all the current increments above the threshold. Then we follow the same process as explained in the paper above, but we also assign a weight to each of these pseudo-LCSs according to the probabilities in the language model. In this way we cover many cases that simple LCS would miss (and, thanks to our character language model, only those missed due to OCR error).
  • Probabilistic Levenshtein Distance (PLD): We are going to calculate the PLD between each query word and each vocabulary term, making it probabilistic using the same language model explained above. PLD and PLCS seem to do the same thing, but they are a bit different: PLCS considers a common sequence, and it may happen that the first attempt to increment the LCS fails the threshold condition, so we do not increment it any further, while PLD always compares the complete words, not a sequence. We could also use another approach based on the suffix tree, as mentioned here; I am not very sure about this, and we will have to choose between them empirically.
  • Stemming: In this we are going to do IR on more crudely stemmed words. The need for this, in spite of the two schemes above, is explained in the voting part.
  • Content-based Probabilistic Correction Model (CPCM): As explained here, we can use a language model on the words, but instead of words I prefer n-chargrams. The difference from PLCS is that CPCM uses a language model based on n-char grams instead of individual letters/characters. Why this is beneficial is explained later, in the voting part.
  • Voting part: Then we are going to merge the results of all these schemes to give the final result for the query (see the sketch after this list). The merging will be weighted: as PLCS and PLD can introduce their own error of correlating two totally different words, we are going to give more weight to Stemming and CPCM (that is why they are used here). Initially we are going to use a Bayesian approach for merging and select the top results according to probability (the benefit is that we can easily arrange all the results in ranked order). As Bayesian is the more classic approach, it will give us a benchmark for this project. Then we can try other learning algorithms for the voting; if the results improve, we can look for correlations between the different schemes which aren't visible under a direct naïve Bayes approach.
  • To build the language model and calculate various statistics (letter occurrence frequencies) from the data we could use Indri (Dump Index) or the CMU SLM Toolkit.
  • We plan to build the final voting system in a general (abstract) manner, so that more schemes can be added to it if anyone in the open source community develops them further.
  • We are going to use Python and various Python libraries to achieve this. We also plan to use a data-visualization library (maybe D3.js) to visualize the data and the responses of the various schemes, so that we can draw inferences from them.
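As a first illustration of the voting step mentioned above, here is a deliberately naive linear sketch (the Bayesian merge would replace the weighted sum; the scheme names and weights are illustrative, not fixed):

def merge_results(scheme_results, weights):
    # scheme_results maps a scheme name to its {doc_id: score} dict;
    # the final score of a document is the weighted sum of the scores
    # the schemes gave it, and documents are ranked by that.
    final = {}
    for scheme, results in scheme_results.items():
        w = weights.get(scheme, 1.0)
        for doc_id, score in results.items():
            final[doc_id] = final.get(doc_id, 0.0) + w * score
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)

# e.g. give stemming and CPCM more weight than PLCS and PLD:
# merge_results(results, {"plcs": 0.5, "pld": 0.5,
#                         "stemming": 1.0, "cpcm": 1.0})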

 

Phases/milestones with dates:

  • 28 May – 10 June : Making benchmark data (degradation of data to make it erroneous, like OCRed data) and qrels for various languages.
  • 11 June – 9 July : Implementation of the five schemes (PLCS, PLD, Suffix Tree, Stemming, CPCM)
    • 11 June – 18 June : PLCS
    • 19 June – 23 June : PLD
    • 23 June – 26 June : Stemming (only for those languages for which a stemmer is already available)
    • 26 June – 30 June : CPCM
    • 1 July – 7 July : Suffix Tree
  • 8 July – 9 July : Slack Time
  • 10 July – 25 July : Voting and complete IR system
  • 26 July – 5 September : Tweaking the system
  • 6 September – 15 September : Data representation, final documentation, documenting the API, testing code, etc.

 

Let us know who you are:

I am a 3rd year student at Dhirubhai Ambani Institute of Information & Communication Technology. I know how to program in Python, C++, C & Java. I did a course on Information Retrieval this year under Prof. Prasenjit Majumder; the main focus of the course was on IR methods for Indian languages. I have done a project on Authorship Attribution for Gujarati text using POS tags; my paper about it is here. I have used Lemur/Indri and the CMU SLM Toolkit for other project tasks. I have also done a course on Fuzzy Logic and Fuzzy Neural Networks. My current focus is on Machine Learning.

GitHub Repo : https://github.com/knoxxs

Curriculum vitae

things to ponder!!

  • An IDF matrix using suffix trees
  • A complete dynamic programming scheme to construct the character error model
  • After the word matching, test on the basis of semantics whether this word should be matched. For example, alas can be mapped to alaska, but semantically alas is another word, and it may be present in the corpus not due to any error.
  • What if the query term is erroneous? Add a spell-checking model first to correct it.
  • In the case of multiple same-length common sub-sequences, it may happen that starting letters of one word map to ending letters of the other word. Though the number of matches is the same, it is wrong. So I should also bring positional dependency into assigning the probability to a matching.
  • A probabilistic score for PLD.

Finally there is a light source but far far away!!

Previously:
I was stuck in the Lucene code and tried to contact the Lucene developer community, but got no response.

Today:
Finally, I got a response from Mike McCandless. He is one of the roughly six core developers of Lucene and also a co-author of Lucene in Action. I am sharing the discussion we had:

Abhishek Gupta <abhi.bansal21@gmail.com>
Jul 23 (3 days ago)
to dev

Hi,
I have a problem which is explained completely here. Please help!! or just give me some suggestion about from where to get help.

————————————————————————————

Abhishek Gupta <abhi.bansal21@gmail.com>
Jul 23 (3 days ago)
to dev

After some discussion on IRC, they wanted to know my objective in doing this. I have already posted the objective there and am posting it here again:
Sorry for the late response. I am making a search system in which the data I have indexed is erroneous. So I have made some schemes to match a query term (which is error-free) to the indexed terms (which might be erroneous). So I have to change the Lucene code where it matches a query term against the indexed data, so that I can put my matching schemes there.

————————————————————————————

Michael McCandless lucene@mikemccandless.com via lucene.apache.org
Jul 24 (2 days ago)
to Lucene/Solr

Is there some mapping from clean term X to dirty indexed terms A, B, C? If so, can’t you just take a TermQuery(X) and replace it with a BooleanQuery SHOULD A, B, C?

Mike McCandless

————————————————————————————

Abhishek Gupta <abhi.bansal21@gmail.com>
Jul 23 (3 days ago)
to dev

Michael, thanks for replying so fast.

No, there is no mapping. What I have is some code based on LCS (Longest Common Subsequence), Levenshtein Distance, a Suffix Tree and a probabilistic error model which maps an original word (e.g. michael) to an erroneous word (mihel). So I think I have to change the way the matching is done.

I am not sure whether the matching I am talking about is clear to you, so let me explain it a little. I am taking the case of the Vector Space Model, and for indexing, the case of an inverted list (I am actually not sure what Lucene uses). I am talking about the matching of each query term with the labels of the inverted index.

Also, just out of curiosity: the problem I mentioned in the SO question is that I cannot find the definition of Weight.scorer(). Can you help me with how things work there, and which model Lucene selects by default?
 ————————————————————————————
Michael McCandless lucene@mikemccandless.com via lucene.apache.org
Jul 24 (2 days ago)
to Lucene/Solr

But somehow from “michael” you’ll generate the N other terms to search for? And then it seems like you could just make a new query with those expanded terms?

If you need to know all the terms in the index, you can use a TermsEnum to iterate through them.

I’m only pushing this because doing this outside of Lucene is going to be far, far easier than modifying Lucene’s sources to do what seems to be essentially query expansion.

Each query has its own Weight impl, and those all implement Weight.scorer. E.g. see TermWeight.scorer(), which returns TermScorer.

From this, some things are clear. Right now he is not understanding my doubt, because he keeps pushing two things: first, not to modify the Lucene code, and second, the Weight implementation. There is also a possibility that I am not understanding what he is trying to suggest, because right now I am a novice in Lucene. So I decided to first read the book Lucene in Action and then contact him again. I think it will take me about 3-4 days to read the book, but I have to do it.

It feels so good that such people are helping you. 🙂

Tomorrow:
I will try to complete Lucene In Action.
Impediments:
Right now there is only one, where I am stuck in the Lucene code.

Lucene’s code is acting weird

Previously:
I started writing code for indexing and querying TREC data.

Today:
I first started by checking yesterday's code. I fixed two or three minute bugs, then ran the indexing code on both the original and the OCRed text. It ran fine; indexes were created for both. Then I started exploring Lucene internals, where I got stuck in finding the specific piece of code which actually does the matching. I tried hard but nothing came of it; I spent some more time, but today there was no success. Finally I decided to ask someone else for help. I posted my doubt on the #lucene-dev IRC channel, but no one replied. Then I posted it on Stack Overflow (you can find it here), but I got no answer; I think I have to wait longer for an answer on SO. I also posted the same doubt on the Lucene dev mailing list, and so far I have no replies.

I am not sure what to do next. I think I will wait until tomorrow for an answer. If I get nothing, I may shift to Terrier again. 😦 These 4-5 days with Lucene will have been wasted.

Tomorrow:
I will decide what to do about this problem.

Impediments:
Yes, right now the one explained above.

Language port is confusing!!

Previously:
I set up my system for work.

Today:
I started writing code to index and search TREC data using PyLucene. You can find the code here. I am writing this code first because I want to evaluate the original data using PyLucene, then evaluate the OCRed data using PyLucene. After that I will modify the Lucene code to include my matching schemes and run PyLucene on the OCRed data again. This way I will have a three-point comparison for my schemes.
I spent today writing the indexing and searching code. Writing it was horrific, as it is too confusing to decide which functions to use: Java or Python. But I was able to finish it. There are still many areas to explore and the code can be improved, but that is not my main concern, as I just want a baseline comparison to start with. After that I will modify the code to bring it closer to perfection.

Tomorrow:
I will start exploring Lucene internals to find out where I have to make modifications to plug in my schemes. I will also try to run unmodified Lucene on the original and OCRed text to get the comparison data.

Impediments:
None today. 🙂