Tuesday, February 10, 2009

[week 5] Muddiest Points

When we discussed the Probability Ranking Principle, retrieval costs were mentioned, but we didn't go deeper into that concept. When costs are taken into account in a retrieval model:
  • What values are commonly used for these "C" costs?
  • Which variables or factors are taken into account when setting these cost values (hardware, size of the collection, mean document size, etc.)?
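
To make the question concrete, here is a minimal sketch of how costs enter the ranking rule, assuming the usual textbook formulation in which C0 is the cost of retrieving a non-relevant document and C1 is the cost of missing a relevant one. The function names, the cost values and the probabilities are made up for illustration; they are not values from the lecture.

```python
def expected_cost(p_rel, c_nonrel=1.0, c_miss=1.0):
    """Cost-based PRP score: C0 * P(R=0|d) - C1 * P(R=1|d).

    Lower values mean the document should be retrieved earlier.
    c_nonrel (C0) and c_miss (C1) are illustrative assumptions.
    """
    return c_nonrel * (1.0 - p_rel) - c_miss * p_rel


def rank_by_cost(p_rel_by_doc, c_nonrel=1.0, c_miss=1.0):
    """Return document ids ordered by increasing expected cost."""
    return sorted(
        p_rel_by_doc,
        key=lambda d: expected_cost(p_rel_by_doc[d], c_nonrel, c_miss),
    )


# Hypothetical relevance probabilities for three documents.
probs = {"d1": 0.2, "d2": 0.9, "d3": 0.5}
ranking = rank_by_cost(probs, c_nonrel=1.0, c_miss=3.0)
print(ranking)
```

One thing the sketch suggests: when the two costs are the same for every document, the expected cost is a decreasing function of P(R=1|d), so the order is the same as ranking by probability of relevance. The costs would only change the order if they varied from document to document, which is part of why I am asking how they are set.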
Another question I have: several models include constants, and after some experiments their authors suggest a range of values for those constants. I am not absolutely sure, but I assume these ranges are obtained using metrics such as precision and recall, on document collections from a programme like TREC. Is this enough to establish a model? It seems that the theory behind these probabilistic and language models is, in practice, oversimplified by the smoothing factors and other constants added to them. How can we be sure that the theory still holds once these factors and constants are added in practice?
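
As I understand it, the tuning procedure looks roughly like the sketch below: sweep a constant over a grid and keep the value that scores best on a collection with relevance judgments. This is only my assumption of the general procedure, using linear interpolation (Jelinek-Mercer) smoothing for a query-likelihood model and average precision as the metric. The four documents, the query, the judgments and the grid are invented toy data, not TREC data.

```python
import math
from collections import Counter

# Toy collection (invented for illustration only).
docs = {
    "d1": "probabilistic ranking of documents by relevance".split(),
    "d2": "language models for retrieval use smoothing".split(),
    "d3": "smoothing language models with collection statistics".split(),
    "d4": "hardware costs of large collections".split(),
}
collection = Counter(t for tokens in docs.values() for t in tokens)
collection_len = sum(collection.values())


def score(query, doc_tokens, lam):
    """Log query likelihood with linear interpolation smoothing.

    lam weights the document model, (1 - lam) the collection model.
    """
    tf = Counter(doc_tokens)
    total = 0.0
    for t in query:
        p = lam * tf[t] / len(doc_tokens) + (1 - lam) * collection[t] / collection_len
        if p == 0.0:
            return float("-inf")  # term unseen everywhere
        total += math.log(p)
    return total


def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


query = "smoothing language models".split()
relevant = {"d2", "d3"}          # hypothetical relevance judgments
grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # candidate values for the constant


def evaluate(lam):
    ranking = sorted(docs, key=lambda d: score(query, docs[d], lam), reverse=True)
    return average_precision(ranking, relevant)


results = {lam: evaluate(lam) for lam in grid}
best_lam = max(results, key=results.get)
print(results, best_lam)
```

The toy collection is far too small to separate the candidate values, so the sketch shows only the shape of the procedure. It also shows what worries me: the "best" constant is whatever maximizes a metric on one particular collection, which is an empirical result rather than something derived from the theory.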


