Things never go the way you want. This hypothesis is nearly always true. But to me, "when things go better than you wanted, there is a problem somewhere".
I have changed jobs and joined THE big software firm. As always, the problem with a big firm is 'process'. Everything has its own process, so for everything, be it a request or a suggestion, the turnaround time is usually long. On top of this, the release cycle is near its end, and I am left with no work. Idle for more than a month, I decided to go ahead and implement some challenging stuff.
As my interest in machine learning and information retrieval is growing exponentially, I decided to do "text classification". The algorithm classifies a given text into a category. Given a web page or document, it automatically assigns it to a topic, say news, sports, religion, celebrities or anything else.
It's basically a machine learning algorithm. It learns a classification function from the training data. Using this function, we classify the test data into one of the categories. I went ahead and implemented the naïve Bayes classification algorithm as described in the Information Retrieval textbook. My aim was to build the prototype first, then do extensive tuning and find out what actually improves the classifier's accuracy. While I was building the prototype, I started reading papers online to see how this algorithm could be tuned. My major interest is in finding some better tuning algorithm. But what happened was a miracle...
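For anyone curious what "learning a function from the training data" looks like here, below is a minimal sketch of a naïve Bayes text classifier. It is not my actual code: I am assuming the multinomial variant with add-one smoothing, and the names `train_naive_bayes` and `classify` as well as the tiny training set are made up purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Learn log priors and smoothed log likelihoods.

    docs is a list of (tokens, label) pairs.
    """
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in docs:
        class_doc_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)

    total_docs = len(docs)
    log_prior = {c: math.log(n / total_docs) for c, n in class_doc_counts.items()}

    log_likelihood = {}
    for c in class_doc_counts:
        # Add-one smoothing: every vocabulary term gets one extra count,
        # so a term unseen in a class never gets probability zero.
        denom = sum(term_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            t: math.log((term_counts[c][t] + 1) / denom) for t in vocabulary
        }
    return log_prior, log_likelihood, vocabulary


def classify(model, tokens):
    """Return the class with the highest log posterior score."""
    log_prior, log_likelihood, vocabulary = model
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for t in tokens:
            if t in vocabulary:  # terms never seen in training are skipped
                score += log_likelihood[c][t]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training data, invented for this example only.
training_docs = [
    ("the team won the match".split(), "sports"),
    ("the striker scored a goal".split(), "sports"),
    ("the government passed the budget".split(), "news"),
    ("the minister announced an election".split(), "news"),
]
model = train_naive_bayes(training_docs)
print(classify(model, "a late goal won the match".split()))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on long documents.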
As usual, I don't normally make programming errors, and the code ran the first time. Guess what: I found the fundamental model very simple, so I had implemented the tuning stuff as well. After the program ran, the results were weird. The accuracy was below 40%. I couldn't spot the error. After struggling for a considerable amount of time, I decided this was not going to work, and just for the sake of it I removed the extra tuning algorithm and ran the simple one. And there it was: 100% accurate classification. I wasn't happy. I didn't jump in the air because it was 100%. I am quite sure classification algorithms can't predict that accurately, let alone at 100%. This is one of those moments where scoring 100 is not good. If you score a hundred, it is almost certainly due to some programming bug. Adding to my woes, I had removed my so-called tuning algorithm. That defeated the very purpose of this project, which was to find some heuristics on top of the existing algorithm. Disappointed. I realized that "scoring 100 is not always good". After all that effort, the simple model on its own gave 100% accuracy. Now how do I find the enthusiasm to work on and learn complex algorithms? The simple one did everything it had to do. Dejected. I felt something was wrong, something fundamentally wrong.
Then I started thinking about why it was 100%. How could it predict so accurately? The problem was that I needed to test with a large number of test cases. All my carefully selected test cases did the trick somehow. Finally I tested the whole program with many more test cases, and the accuracy came down to 92%. Happy! Delighted. Because I believe this is the right number, and also because it lets me work on advanced heuristics to increase the classifier's accuracy.
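To show how a handful of hand-picked test cases can flatter a classifier, here is a small sketch of the accuracy calculation. The `accuracy` helper and every label below are invented for illustration; the set sizes are not my real ones, I only chose them so the second figure comes out at 92%.

```python
def accuracy(predicted, actual):
    """Fraction of test cases where the predicted label matches the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one test case")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# A few hand-picked cases (made up): every one happens to be classified correctly.
small_actual = ["sports", "news", "sports", "news", "sports"]
small_predicted = list(small_actual)

# A larger made-up set: 50 cases, 4 of which are misclassified.
large_actual = ["sports", "news"] * 25
large_predicted = list(large_actual)
for i in (3, 17, 30, 44):
    # Flip the prediction to the other class to simulate a mistake.
    large_predicted[i] = "news" if large_actual[i] == "sports" else "sports"

print(accuracy(small_predicted, small_actual))
print(accuracy(large_predicted, large_actual))
```

With only five cases, a single mistake would move the score by 20 points, so a perfect score on a set that small says very little.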
That one moment when the result exceeded my expectations, I still wasn't happy with what I got! Expectations: achieve more or less than them, and there is a problem somewhere! Now that it's 92%, I have the greatest opportunity to explore advanced methods and heuristics and move towards 100%! I never want to reach the ultimate 100, as it would end my opportunities to learn more. There should always be another milestone!
Note:
The information retrieval book is available here: link
This was actually done using details from the assignment given at Stanford for the IR course: link