Sunday, February 19, 2012

How to access the data from a custom data mining plugin?

I'm stuck on a problem and I was hoping you would be so kind as to help me resolve it.

I'm implementing a clustering algorithm plugin for text mining. I've already read the tutorials and sample code provided by the MSDN Library.

Well... My problem is: I can't get at the data when the Predict method is called. I've read that this method implements the "core" of a custom algorithm. Here is a small snippet of my code so you can understand my doubt:

STDMETHODIMP ALGORITHM::Predict(
    /* [in] */ IDMContextServices* in_pContext,
    /* [in] */ DM_PREDICTION_FLAGS in_Flags,
    /* [in] */ IDMAttributeGroup* in_pPredAttGroup,
    /* [in] */ DM_CaseID in_CaseID,
    /* [in] */ ULONG in_ulCaseValues,
    /* [in] */ DM_ATTRIBUTE_VALUE* in_rgValues,
    /* [in] */ ULONG in_ulMaxPredictions,
    /* [in] */ ULONG in_ulMaxStates,
    /* [out] */ DM_ATTRIBUTE_STAT** io_prgPredictions,
    /* [out] */ ULONG* out_pulPredictions)
{
    // Walk the attribute/value pairs of the case being predicted.
    for (ULONG i = 0; i < in_ulCaseValues; i++)
    {
        DM_ATTRIBUTE_VALUE& dmattributevalue = in_rgValues[i];
        ULONG iAttribute = dmattributevalue.Attribute;
        if (iAttribute == DM_UNSPECIFIED)
            continue;
        double dblValue = ::DblContinuous(dmattributevalue.Value);
        char buffer[129];
        sprintf(buffer, "%f\t", dblValue);
        RENAN_Log::log(buffer);
    }
    return S_OK;
}

As you can see, I'm going through in_rgValues to get its values, but I'm only obtaining the first record of the table in the database. I need to iterate over some kind of result set so I can access all the records I need. Is there any way to do so?

I expected Predict() to receive a matrix containing all my data, but the only thing I noticed that could represent the data is that in_rgValues vector. I can go through this vector, but it holds only the first record of the table in the database (that's what's being written to my log). I need all of the records in order to preprocess the data and implement my clustering algorithm.

Well... That's it... I would be very pleased if you could help me.

|||

Hello

The training part should happen in the InsertCases method. There you can go over the whole training set, case by case, once or multiple times, and detect patterns.

Predict is a request to apply the already-detected patterns to a single case, e.g., determine the cluster that contains this row. Now, you mentioned text mining: are you using nested tables to represent your documents?
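For illustration only, here is roughly what "apply the detected patterns to a single case" could look like for a clustering algorithm. This is plain C++, not part of the plug-in interfaces: NearestCluster and the dense-vector representation of a case are assumptions for the sketch, and a real Predict would still have to fill io_prgPredictions/out_pulPredictions as declared in the API headers.

#include <cstddef>
#include <vector>

// Hypothetical helper: given the values of one case and the centroids
// detected during training, return the index of the closest centroid.
static std::size_t NearestCluster(const std::vector<double>& caseValues,
                                  const std::vector<std::vector<double> >& centroids)
{
    std::size_t best = 0;
    double bestDist = -1.0;
    for (std::size_t c = 0; c < centroids.size(); ++c)
    {
        double dist = 0.0;
        for (std::size_t i = 0; i < caseValues.size() && i < centroids[c].size(); ++i)
        {
            const double d = caseValues[i] - centroids[c][i];
            dist += d * d; // squared Euclidean distance
        }
        if (bestDist < 0.0 || dist < bestDist)
        {
            bestDist = dist;
            best = c;
        }
    }
    return best;
}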

|||

Your answer was really helpful. Thanks a lot.

Could you please also tell me which of the various objects that InsertCases receives holds my data?

I'm not intending to use nested tables. I think I'll read the raw text, remove the stopwords, do stemming, count the frequencies, and then cluster the data.

Thanks again and I would be very pleased if you could tell me what object holds the data in InsertCases.

Best regards,

-Renan Souza

|||

InsertCases allows you to iterate over all the training cases and retain whatever information your algorithm needs. One of the parameters is a case set object: basically, an iterator over all the training cases. That object can be used to traverse the whole training set (by calling its StartCases method), once or multiple times, as needed by your algorithm.

StartCases takes, as a parameter, a case processor object, which implements the ProcessCase method and handles training cases one by one. From that point, what you do with the training data is up to your algorithm.

So, in short, there is no single object that holds the cases; you iterate over all training cases and handle them according to the needs of your algorithm. A simple case processor could just accumulate cases in an in-memory table, but that would not work for large data sets.
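To make the shape of that concrete, here is a minimal sketch of a case processor. It assumes that ProcessCase receives the same case layout that Predict receives in the snippet above (a case ID plus an array of DM_ATTRIBUTE_VALUE entries); the exact IDMCaseProcessor signature should be taken from the plug-in sample headers, so treat this as an outline rather than a drop-in implementation.

// Sketch only: the signature is assumed to mirror the Predict case layout.
class CASEPROCESSOR : public IDMCaseProcessor
{
public:
    // Invoked by the case set once per training case during StartCases.
    STDMETHODIMP ProcessCase(
        /* [in] */ DM_CaseID in_CaseID,
        /* [in] */ ULONG in_ulCaseValues,
        /* [in] */ DM_ATTRIBUTE_VALUE* in_rgValues)
    {
        for (ULONG i = 0; i < in_ulCaseValues; i++)
        {
            DM_ATTRIBUTE_VALUE& value = in_rgValues[i];
            if (value.Attribute == DM_UNSPECIFIED)
                continue;
            // Update whatever the algorithm needs per case here:
            // counts, sums, candidate clusters, etc.
        }
        return S_OK;
    }

    // IUnknown plumbing (QueryInterface/AddRef/Release) omitted for brevity.
};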

Now, for the text analysis: your training cases are presented to the algorithm as attribute/value pairs. A TEXT attribute, say HairColor (with values of "Blonde", "Red", ...), will be presented to the algorithm as two integers: one is the attribute number, the other is the state number (1 for "Blonde", 2 for "Red", etc.). You can access the actual string value, but it is really not efficient and you will likely run into problems.

You could try the same approach used by our text mining solution (via Integration Services): do some preprocessing outside of Analysis Services and convert each document to a set of key words (including their frequencies, if necessary). Then pass the key words to your plug-in algorithm as a nested table.

|||

Thanks a lot, Mr. Bogdan.

As you can see, what I'm trying to do is a plugin to cluster texts in Brazilian Portuguese.

So... Here's one more doubt: what is the Predict method for? Can't I iterate over the IDMPushCaseSet from the InsertCases method, call my preprocessing classes, actually cluster the data, and export the results to a file? If so, what is Predict() for?

Could you please give me an example of how to access the actual string? I only need it to preprocess the text and calculate the TFxIDF weight. From that point on, I'll do the clustering using only the weights.

I've been trying to find an IDMPushCaseSet class reference, but I haven't found one. So... Would you be so kind as to tell me how to iterate through the IDMPushCaseSet?

Thanks a lot for your help. From all of you.

|||

First, on the IDMPushCaseSet: it is a "push" interface, used by the model to "push" training cases into your algorithm. The algorithm is supposed to invoke StartCases on the IDMPushCaseSet, passing an IDMCaseProcessor implementation. The IDMPushCaseSet object will then call ProcessCase for each individual case.
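As a hedged sketch of that flow (the full InsertCases parameter list is elided here and should be taken from the sample headers; only the case set parameter is shown):

// Outline of the push flow: hand a case processor to StartCases and let
// the case set push every training case into ProcessCase.
STDMETHODIMP ALGORITHM::InsertCases(
    /* [in] */ IDMPushCaseSet* in_pCaseSet
    /* ... remaining parameters as declared in the sample headers ... */)
{
    CASEPROCESSOR processor; // the IDMCaseProcessor sketch from earlier
    // One pass over the training set; call StartCases again if the
    // algorithm needs multiple passes.
    return in_pCaseSet->StartCases(&processor);
}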

Next, the processing results should not be exported to a file, but saved in the Save method implemented by the algorithm. Exporting to a file breaks the transactional integrity of the server (the file will not be updated/deleted together with the model).

On the text processing part: the algorithm sees the training cases as a set of attribute/value pairs. For a discrete attribute, such as a text attribute, the values are various states (0, 1, 2, ..., N). The algorithm should not care about the actual string (or other data type) representation of a state. Therefore, mining directly against a text column is not the right way to perform document clustering.

Text mining can be performed with a preprocessing step, similar to the text mining component in Integration Services (which, unfortunately, only works well for English texts):

- a collection of noun phrases (or other kinds of tokens that are representative of the document collection) is built by reading the documents

- each document is split into terms/phrases (of variable length)

- each term present in a document receives a score (frequency or TFxIDF) for that document; see the sketch at the end of this post

- effectively, the text data is converted into a 1->many representation (each document has a collection of terms/phrases with associated scores)

This preprocessing should happen outside Analysis Services, as it currently does not provide support for these kinds of operations.

Mining can only start after this step, with each document represented as a case with a nested table that contains all the terms for that document and their frequencies.
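For the scoring step mentioned above, here is a minimal, self-contained sketch of TFxIDF weighting in plain C++ (independent of Analysis Services; ComputeTfIdf and the map-based document representation are illustrative choices, not API names). Tokenizing, stopword removal, and stemming for Brazilian Portuguese are assumed to have happened already.

#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One document: term -> value (raw frequency on input, TFxIDF on output).
typedef std::map<std::string, double> TermMap;

// docs[d] maps each term to its raw frequency in document d.
std::vector<TermMap> ComputeTfIdf(const std::vector<TermMap>& docs)
{
    // Document frequency: in how many documents does each term appear?
    std::map<std::string, int> df;
    for (std::size_t d = 0; d < docs.size(); ++d)
        for (TermMap::const_iterator it = docs[d].begin(); it != docs[d].end(); ++it)
            df[it->first]++;

    // Classic TFxIDF: term frequency times log(N / document frequency).
    std::vector<TermMap> weighted(docs.size());
    const double n = static_cast<double>(docs.size());
    for (std::size_t d = 0; d < docs.size(); ++d)
        for (TermMap::const_iterator it = docs[d].begin(); it != docs[d].end(); ++it)
            weighted[d][it->first] = it->second * std::log(n / df[it->first]);

    return weighted;
}

The resulting per-document term/score collections are exactly the 1->many representation described above, ready to be fed to the algorithm as a nested table.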
