OpenAmplify developer's diary - part three: Topic intention comparisons

By Justin James, Special to ZDNet Asia
Friday, November 06, 2009 01:48 PM
Justin James is chronicling his process of using Hapax’s OpenAmplify Web service to create an application that can match documents with content that is similar or identical to the source document.

In part two of this series, I discussed comparison of the author information ("Demographics" and "Style") that the OpenAmplify output provides. In part three, I am shooting for a much more ambitious target, "Topic Intentions".

My goal is to be able to provide an approximation of how similar the two documents are in terms of what they discuss and how they discuss it.

Why do I want to do this? My application, Rat Catcher, computes what I call a "Semantic Match Score" (SM score), which indicates how similar the contents of the two documents are. From my initial testing, the SM score is a great supplement to the existing percentage that shows the number of matched "phrases" in the documents. What makes the SM score so useful is that it helps the user find documents that may be a "creative rewording" of the original and, as a result, will not have a very high phrase match percentage.

In the future, I plan to take the SM score much further. To begin with, I would like to use a high SM score to trigger a "thesaurus comparison" of documents, in which individual phrases are broken down to root word stems. From there, variations can be created from a thesaurus and each variation looked for in the target document. Needless to say, this will be a computationally brutal exercise, so if the SM score can be used to narrow the field to only those documents that warrant this treatment, I will be much happier.



My logic in this method is to do the following:

  1. Create a "Topic Intention" score from 0 to 100 for the "Top Topics". A score of 0 means that no Top Topics from the original document appear in the comparison document; a score of 100 means that all Top Topics in the original appear in the comparison document with matching Polarity, Requesting Guidance, and Offering Guidance.
  2. Replicate this logic for Proper Nouns.
  3. Replicate this logic for Locations, but instead of using the in-depth Topic Intention comparison, just check for the existence in each document.
  4. Combine the three scores into a composite score by adding them together and dividing by 3.
  5. Any errors result in immediate termination and a result of 0, for the sake of expediency.
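
Steps 4 and 5 can be sketched in a few lines. This is only an illustration of the list above, not the Rat Catcher code: the class name SemanticMatch and the methods Combine and ScoreOrZero are hypothetical, and the three inputs are assumed to already be on the 0 to 100 scale.

```csharp
using System;

public static class SemanticMatch
{
    // Step 4: the composite is the plain average of the three 0-100 scores.
    // (Hypothetical helper; the name is not from the original application.)
    public static float Combine(float topTopics, float properNouns, float locations)
    {
        return (topTopics + properNouns + locations) / 3f;
    }

    // Step 5: any exception thrown while scoring yields a result of 0.
    public static float ScoreOrZero(Func<float> score)
    {
        try
        {
            return score();
        }
        catch (Exception)
        {
            return 0f;
        }
    }
}
```

For example, SemanticMatch.Combine(90f, 60f, 30f) averages to 60, and wrapping the whole comparison in ScoreOrZero means a missing XML element produces a score of 0 rather than an unhandled exception.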

The SM score itself is expressed as a float with a range of 0 to 100 (0 being "no match" and 100 meaning "perfect match").

Here is my method declaration:

private float CompareOpenAmplifyContent(XDocument Original, XDocument CompareTo)

The code to perform the comparison between the Top Topics and the code for the Proper Nouns is identical, other than the XML elements referenced, so I am only going to show how I compare the Top Topics:

// Collect the Top Topics nodes from each OpenAmplify response.
var topOriginalTopics =
    Original.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("TopTopics").Elements();
var topCompareToTopics =
    CompareTo.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("TopTopics").Elements();

float topTopicsResult = 0;

if (topOriginalTopics.Count() > 0 && topCompareToTopics.Count() > 0)
{
    // Each original topic contributes an equal share of the 100 points.
    // Use 100f so the division is not truncated to an integer.
    var weight = 100f / topOriginalTopics.Count();

    foreach (var originalTopic in topOriginalTopics)
    {
        var originalName = originalTopic.Element("Topic").Element("Name").Value.ToLower().Trim();
        XElement compareToTopic = null;

        // Find the comparison topic with the same name, ignoring case and padding.
        foreach (var topic in topCompareToTopics)
        {
            if (topic.Element("Topic").Element("Name").Value.ToLower().Trim() == originalName)
            {
                compareToTopic = topic;
                break;
            }
        }

        if (compareToTopic == null)
        {
            continue;
        }

        topTopicsResult += CompareOpenAmplifyTopicIntentionResults(originalTopic, compareToTopic) * weight;
    }

    // Round the total and cap it at 100.
    topTopicsResult = (float)Math.Min(Math.Round(topTopicsResult), 100);
}

// An original document with no Top Topics is treated as a full match.
if (topOriginalTopics.Count() == 0)
{
    topTopicsResult = 100;
}

I create a list of the Top Topics in each document. Next, I iterate through the list of original Top Topics and search the comparison Top Topics for a node with the same name, breaking out of the inner loop as soon as one matches. If nothing was found, I continue to the next iteration (a bit of negative logic). Otherwise, I calculate the similarity of the two Topic Intention Results (the XML nodes which contain the details of an item within Topic Intentions), weight it by 100 divided by the number of Top Topics in the original document (so that a perfect match on every topic adds up to 100), and add it to the running score for the Top Topics. If the original had no Top Topics (unlikely), I give it a 100 percent match.

Here is my code to compare Topic Intentions:

private float CompareOpenAmplifyTopicIntentionResults(XElement Original, XElement CompareTo)
{
    if (Original == null || CompareTo == null)
    {
        return 0;
    }

    // Compare the text of each Name element (.Value); comparing the XElement
    // objects themselves with == only tests whether they are the same instance.
    var matchedItems = 0;
    if (Original.Element("Polarity").Element("Min").Element("Name").Value ==
        CompareTo.Element("Polarity").Element("Min").Element("Name").Value)
    {
        matchedItems++;
    }

    if (Original.Element("Polarity").Element("Mean").Element("Name").Value ==
        CompareTo.Element("Polarity").Element("Mean").Element("Name").Value)
    {
        matchedItems++;
    }

    if (Original.Element("Polarity").Element("Max").Element("Name").Value ==
        CompareTo.Element("Polarity").Element("Max").Element("Name").Value)
    {
        matchedItems++;
    }

    // Polarity counts once, as the fraction of its three parts that matched.
    var polarityRating = (float)matchedItems / 3;

    var offeringGuidanceRating = 0;
    if (Original.Element("OfferingGuidance").Element("Name").Value ==
        CompareTo.Element("OfferingGuidance").Element("Name").Value)
    {
        offeringGuidanceRating++;
    }

    var requestingGuidanceRating = 0;
    if (Original.Element("RequestingGuidance").Element("Name").Value ==
        CompareTo.Element("RequestingGuidance").Element("Name").Value)
    {
        requestingGuidanceRating++;
    }

    // Average the three ratings, giving a value between 0 and 1.
    var result = Math.Min((polarityRating + offeringGuidanceRating + requestingGuidanceRating) / 3, 1f);

    return result;
}

As you can see, there is nothing particularly complex or exciting about this code; it's just doing a quick and dirty comparison on an element-by-element basis between the two Topic Intention nodes. If you look carefully, you will see that the Polarity rating has three components for the three Polarity results (Mean, Min and Max). I am kicking back the results as a value between 0 and 1.
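
To make the weighting concrete, here is a small self-contained sketch that builds two sample nodes and scores them the same way. The element layout simply mirrors the names used in the method above; the root element name TopicResult, the sample values ("Negative", "Yes" and so on), and the class IntentionExample are assumptions made for illustration, not the documented OpenAmplify schema.

```csharp
using System;
using System.Xml.Linq;

public static class IntentionExample
{
    // Builds a hypothetical Topic Intention node. The layout follows the
    // element names used in the article's code, not an official schema.
    public static XElement Node(string min, string mean, string max,
                                string offering, string requesting)
    {
        return new XElement("TopicResult",
            new XElement("Polarity",
                new XElement("Min", new XElement("Name", min)),
                new XElement("Mean", new XElement("Name", mean)),
                new XElement("Max", new XElement("Name", max))),
            new XElement("OfferingGuidance", new XElement("Name", offering)),
            new XElement("RequestingGuidance", new XElement("Name", requesting)));
    }

    // Walks the given path of child elements and returns the text of the
    // Name element found at the end, e.g. Polarity/Min/Name.
    private static string Name(XElement node, params string[] path)
    {
        XElement current = node;
        foreach (var step in path)
        {
            current = current.Element(step);
        }
        return current.Element("Name").Value;
    }

    // Polarity counts as one rating (the fraction of Min/Mean/Max that match);
    // each guidance flag counts as one rating. The result is their average, 0 to 1.
    public static float Similarity(XElement a, XElement b)
    {
        int polarityMatches = 0;
        foreach (var part in new[] { "Min", "Mean", "Max" })
        {
            if (Name(a, "Polarity", part) == Name(b, "Polarity", part))
            {
                polarityMatches++;
            }
        }

        float offering = Name(a, "OfferingGuidance") == Name(b, "OfferingGuidance") ? 1f : 0f;
        float requesting = Name(a, "RequestingGuidance") == Name(b, "RequestingGuidance") ? 1f : 0f;

        return (polarityMatches / 3f + offering + requesting) / 3f;
    }
}
```

With these sample nodes, two topics that agree on all three Polarity values and on Offering Guidance but differ on Requesting Guidance score (1 + 1 + 0) / 3, or roughly 0.67. In a document with five Top Topics, that one topic would then contribute about 0.67 * 20 to the Top Topics score.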

To perform the Locations comparison:

// Collect the Locations nodes from each response (Original and CompareTo
// are the parameters from the method declaration shown earlier).
var originalLocations =
    Original.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("Locations").Elements();
var compareToLocations =
    CompareTo.Root.Element("AmplifyReturn").Element("TopicIntentions")
        .Element("Locations").Elements();

float locationsResult = 0;

if (originalLocations.Count() > 0 && compareToLocations.Count() > 0)
{
    foreach (var originalTopic in originalLocations)
    {
        foreach (var topic in compareToLocations)
        {
            if (topic.Element("Result").Element("Name").Value.ToLower().Trim() ==
                originalTopic.Element("Result").Element("Name").Value.ToLower().Trim())
            {
                // Each matched location adds an equal share of the 100 points;
                // 100f keeps the division in floating point.
                locationsResult += 100f / originalLocations.Count();
                break;
            }
        }
    }
}

// An original document with no Locations is treated as a full match.
if (originalLocations.Count() == 0)
{
    locationsResult = 100;
}

Again, there is nothing terribly complex here. I am just looping through and checking to see how many items in the original document appear in the comparison.
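
For longer lists, the same existence check could be written with a set lookup instead of a nested loop. The sketch below is an alternative formulation, not the code used in Rat Catcher; the class LocationOverlap and its Score method are hypothetical, and it works on plain strings on the assumption that the location names have already been pulled out of the XML.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocationOverlap
{
    // Returns the percentage (0-100) of original names that also appear in
    // the comparison list, ignoring case and surrounding whitespace.
    public static float Score(IEnumerable<string> original, IEnumerable<string> compareTo)
    {
        var originalNames = original.Select(n => n.Trim()).ToList();
        if (originalNames.Count == 0)
        {
            // Nothing to look for: treated as a full match, as in the article.
            return 100f;
        }

        // A case-insensitive set makes each lookup constant time on average.
        var seen = new HashSet<string>(
            compareTo.Select(n => n.Trim()),
            StringComparer.OrdinalIgnoreCase);

        int matched = originalNames.Count(n => seen.Contains(n));
        return 100f * matched / originalNames.Count;
    }
}
```

Whether this is worth doing depends on how many locations OpenAmplify returns per document; for a handful of entries the nested loop is just as good.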

In part four, I will dive into the SOAP interface to OpenAmplify, which will be of particular interest to Java and .NET developers, whose environments are very heavily geared towards SOAP interaction.

Justin James is an employee of Levit & James, Inc. in a multidisciplinary role that combines programming, network management, and systems administration. He has been blogging at TechRepublic since 2005.


