Listening to the market, meeting its demands

Posted on August 07, 2009 by elena

The machine learning and natural language processing market has a pretty loud voice these days. I remember that in November last year, when ASI officially incorporated, my Google Alerts on these two subjects were relatively devoid of content for weeks on end. In the last few months, however, I have seen an unprecedented increase in articles about new natural language processing and machine learning products on the market, including, of course, Microsoft Bing, which is a natural language search engine, and the soon-to-be-released Google Wave, which incorporates real-time machine translation.

Google Alerts have not been the only sign of a changing market for products like ours. In the last few months, for example, various advisers have encouraged us to capitalize on Twitter's success by building a sentiment analysis tool for the micro-blogging service. This push is a response to two market demands: to innovate in areas of machine learning, and to monetize Twitter or at least increase its usefulness for business.

We're listening to our clients as well. Our publishers want more community tools that incorporate our technology. They want us to move beyond abusiveness filtering...and we are. Right now we're developing several new tools that will recommend community members and their contributions based upon the quality of their content.

We've also become aware of the problems review sites face in moderating their content. As a result, we're developing a product that will help them moderate their own content and increase the effectiveness of their contextual advertising.

The loudest and clearest message from the market is the demand for ASI to remain flexible and responsive, and to innovate around existing problems. The simple truth is, the more we listen, the more we can solve.


Building consistency

Posted on June 13, 2009 by elena

I’ve been thinking a lot lately about our JuLiA training process: how to build a clear set of directions for freelance taggers, and how to ensure that tagging results in quality training documents. See, ultimately, JuLiA’s intelligence is entirely dependent upon the quality of human-tagged documents. The nature of her intelligence is that of a very consistent categorizer, so it follows that inconsistently tagged documents will build in an inherent inconsistency. Here are some of the questions that kept me up at night:

How do we build an expert system to perform a task at “better than human” accuracy with human-tagged training documents? Will not the training documents themselves be inconsistent? What’s more, when training documents are submitted by multiple people, are they not likely to be exponentially more inconsistent? Different people have different styles. Will their combined tags create an inconsistent JuLiA?

My solution is to build as much opportunity for consistency into the process as possible. Here's how:

  1. Train consistently: everyone gets the same training documents, the same phone call, the same spiel.

  2. Audit results: everyone gets the same number of documents to tag, and each tagger stops upon reaching certain milestones. At each interval, the project manager audits a consistent number of documents for accuracy and consistency.

  3. Reiterate training: during audits, notes are taken and sent to the tagger, and taggers correct their own mistakes.

  4. Be available: a means of reaching the project manager is always available, and taggers are encouraged to contact the PM whenever they become unsure of their decision-making.

  5. Iterate process: when unforeseen questions are asked or mistakes are made by one tagger, the rest will receive the same feedback regardless of their own performance.

  6. Alternate groups: because tagging is tedious and humans are prone to fatigue, boredom, and inconsistency when faced with boring tasks, no tagger will do more than 10,000 documents in a two-month period.
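
To make the audit step a little more concrete, here is a minimal Python sketch of how a project manager might spot-check one tagger's batch. Everything in it is an assumption for illustration: the function name, the label strings, and the use of plain percent agreement as the consistency measure are not a description of our actual tooling.

```python
import random


def audit_sample(tagged, auditor_label, sample_size, seed=0):
    """Spot-check a tagger's work against the auditor's judgment.

    tagged:        dict mapping doc_id -> label chosen by the tagger
    auditor_label: callable returning the auditor's label for a doc_id
    sample_size:   how many documents to audit at this milestone
    Returns (agreement_rate, list of doc_ids where the two disagree).
    """
    rng = random.Random(seed)      # fixed seed so the same audit can be repeated
    doc_ids = sorted(tagged)
    sample = rng.sample(doc_ids, min(sample_size, len(doc_ids)))
    disagreements = [d for d in sample if tagged[d] != auditor_label(d)]
    agreement = 1 - len(disagreements) / len(sample) if sample else 1.0
    return agreement, disagreements


# Hypothetical labels for four documents.
tagger = {1: "abusive", 2: "ok", 3: "ok", 4: "abusive"}
auditor = {1: "abusive", 2: "ok", 3: "abusive", 4: "abusive"}

rate, to_review = audit_sample(tagger, auditor.get, sample_size=4)
print(rate, to_review)
```

The list of disagreements is what would go back to the tagger as notes in step 3. Using the same sample size for every tagger keeps the audits comparable.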


To Each Their Own JuLiA

Posted on March 26, 2009 by elena

I was probably more excited than anyone has a right to be when we first launched our Wordpress plugin. It helped me meet two goals of mine. First of all, I had been wanting to develop a plugin or app for a while. The specifics weren’t set in stone, but the general outline was there. The second goal was to dust off my cobwebbed PHP skills. JuLiA for Wordpress is the first plugin of any kind that I have had a part in developing, and it not only dusted off my skills, but gave me an extra crash course in PHP. In short, it was fun and I’m proud of what we’ve done so far. The key phrase there is “so far.” Any developer worth his or ::her:: salt knows you need to iterate. We have plans.

To our dismay, we found out that many of our early adopters have non-English language blogs. Currently, JuLiA only works in English. Barring an all-caps “ENGLISH ONLY” note now attached to our plugin listing, there didn’t seem to be a lot we could do. Then, a couple bleary mornings ago, I found Microsoft’s live search translator API and tipped my coffee over in excitement. There is a chance that this might actually bear fruit…machine translation generally keeps the semantics of a phrase intact, but the words are often out of order. This is a problem when humans are reading the results, but much less of a problem when you have a machine reading. Anyway, it’s a possibility…if this ends up being a dead-end I’ll look elsewhere.
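
Here is a tiny Python sketch of why scrambled word order can matter less to a machine reader. It assumes a bag-of-words representation, which is only one common way to turn text into features and is not a claim about how JuLiA works internally. The example sentences are made up.

```python
from collections import Counter


def bag_of_words(text):
    """Count each lowercase token, ignoring the order the tokens appear in."""
    return Counter(text.lower().split())


fluent = "your argument is complete nonsense"
scrambled = "nonsense complete is your argument"   # stand-in for awkward MT output

# The two sentences read very differently to a human,
# but an order-insensitive model receives identical features for both.
print(bag_of_words(fluent) == bag_of_words(scrambled))
```

A model that relies on word order (phrases, n-grams, syntax) would not get off so lightly, so this is a best case rather than a guarantee.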

Besides the issue of us existing in a multi-lingual world without a multi-lingual product as yet, we also realize we need to train a more generalized version of JuLiA for your average blogger. The current version has been trained on the semantics of political blogs. We recognize that not everyone is posting lengthy diatribes about Obama’s stimulus plan and need to account for that.

We also want to give individual bloggers more control over their particular version of JuLiA. Like larger publishers, they should be able to customize JuLiA to their subject matter and definition of abusiveness. Right now we do all the customization for our larger customers and it just isn’t scalable for us to train versions for everyone. So the other big addition is going to be a sort of “to each their own JuLiA” functionality. At this point we’re not sure if everyone will get their own JuLiA for free or whether people will have to pay a small fee to customize her. Much could happen between now and the next release to influence our decision, but we’ll keep you updated. Needless to say, we’re always thinking about ways to make her better.


JuLiA Public Demo Is Live

Posted on February 09, 2009 by elena

JuLiA is a comment moderation system for online publishers. She uses machine learning and natural language processing techniques to recognize abusiveness in blog comments. The idea is that online publishers can use her to automatically publish or delete these comments based on the amount of abusiveness she has identified.
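
As a rough illustration of the publish-or-delete idea, here is a minimal Python sketch. The function name, the 0-to-1 score, the threshold values, and the middle "review" band are all assumptions made up for this example. They are not JuLiA's actual interface.

```python
def moderate(abuse_score, publish_below=0.3, delete_above=0.8):
    """Map an abusiveness score in [0, 1] to a moderation action.

    The thresholds are hypothetical knobs a publisher might tune
    to match its own editorial policy.
    """
    if not 0.0 <= abuse_score <= 1.0:
        raise ValueError("abuse_score must be between 0 and 1")
    if abuse_score < publish_below:
        return "publish"
    if abuse_score > delete_above:
        return "delete"
    return "review"   # assumed fallback: hand borderline comments to a human


for score in (0.1, 0.5, 0.9):
    print(score, moderate(score))
```

A stricter publication would lower both thresholds, and a more permissive one would raise them.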

Currently most online publishers use a combination of keyword filters, a staff of human moderators, and feedback from their own community in order to moderate incoming comments. There are problems with every one of these methods, and we (modestly) think we can do it a whole lot better.

Keyword filters especially are easy to trick. All you have to do is b r e a k u p a w o r d, repl@ce a letter with a symbol or num8er, or get creative with your insults, and you can get some pretty foul content past them.
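
To show how little it takes, here is a deliberately naive keyword filter in Python. The blocklist words and the function name are placeholders chosen for illustration. Real filters are usually more elaborate, but the same tricks tend to apply.

```python
import re

BLOCKLIST = {"idiot", "moron"}   # placeholder words for illustration


def naive_filter(comment):
    """Return True if any run of letters in the comment is on the blocklist."""
    words = re.findall(r"[a-z]+", comment.lower())
    return any(word in BLOCKLIST for word in words)


examples = [
    "you are an idiot",       # plain spelling
    "you are an i d i o t",   # word broken up with spaces
    "you are an idi0t",       # letter replaced with a number
    "you are an !diot",       # letter replaced with a symbol
]

for comment in examples:
    print(naive_filter(comment), comment)
```

Only the first comment is caught. You can patch each trick with another rule, but every rule you add invites a new workaround.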

When it comes to human moderators, bias and laziness (not to mention cost) tend to be really big problems, but even good moderators are fallible to a point. It’s really difficult to stay neutral when your job is to read through hundreds, maybe even thousands of reactionary, racist, threatening, stupid, nasty, and heinous comments every day. Even the best and most consistent moderators are given a Herculean challenge. Basically, online publications with user-generated content are the Augean stables and human moderators are the ones who have to shovel that shit. I know how hard it is because in order to train JuLiA, I wind up tagging and auditing thousands of these nasty comments every week. It makes you doubt humanity, I swear.

And finally, there’s the community. Hmm…what to say about that? Relying on the mob to moderate itself is kind of a crapshoot, and you can see vastly different results when you look at different websites. To a certain extent, the success of this strategy relies on the kind of user your content is going to attract. For example, the comments section on USA Today is absolutely overrun with trolls. On the other hand, The Daily Kos has achieved some pretty decent results with their labyrinthine community moderation scheme, but that’s partially because their users are a bit savvier to begin with.

But there is a deeper problem with community self-moderation, and that is the fact that the users tend to form biased cliques, and eventually this bias becomes institutionalized. On The Daily Kos for example, you might be able to say something really nasty and offensive, and it will get through as long as you are attacking a group or an individual that their moderators hate as well. And the reverse situation is even worse…if you say something at all critical of a group or individual that their moderators love, then it is subject to censorship even if it contains no abusive language. The community moderation system on Slashdot is another famous example of this phenomenon.

So the basic gist is that all of the current methods have some serious limitations, and even when they are used in combination, a lot of abusive comments get by every day. This is a big problem for publications, because it reflects poorly on the publication as well as the community. Our demo lets you see a slice of the abusive content that gets by all these systems on a daily basis. We want to demonstrate the fact that JuLiA, unlike everything else out there, actually catches this stuff. And when you see the kind of stuff she’s catching, it’s actually sort of surprising to discover some of the truly awful things making it through the editors at these supposedly “respectable” publications.

So check it out...


Geotagging and sentiment analysis

Posted on January 18, 2009 by elena

I’ve been scouting out OpenCalais these last couple days, as we think we may want to use their service as a jumping-off point for some other products we’ve been thinking about developing. In case you’re unfamiliar with OpenCalais, they have a suite of tools and plugins that automatically generate rich metadata for your web content. Using a combination of natural language processing and machine learning, Calais analyzes your text and finds the entities (people, events, products, relationships) within it, generating semantic metadata in the form of RDF tags. You can then use this metadata to develop additional applications to enhance your web services. Check out their document viewer to get a sense of how Calais recognizes and tags named entities in text. You’ll notice that you can also view the RDF version of the text.

In the forums I found an excellent tutorial from Guilhem Vellut for text geotagging using Calais…the demo application can be found here. I’m intrigued by the possibilities for applications combining geotagging and mapping with sentiment analysis. Just off the top of my head:


  • Hyperlocal trend visualization

  • Political bias/opinion by locality

  • Hyperlocal product/company/celebrity opinion analysis

I’d also like to see ebooks with geotagged content. Wouldn’t it be nice to walk in an author’s or character’s footsteps?

In any case, I'm sure there are more applications for the combination of sentiment analysis and geotagging and visualization. Maybe you have some up your sleeve.


How good does Natural Language Processing need to get?

Posted on January 15, 2009 by elena

There is a lot of discussion in academic circles about diminishing returns when it comes to machine learning algorithms. Research often runs into a cap on accuracy, usually somewhere high, between 92% and 100%, at which point tweedy types stop scratching heads and start pounding chalkboards. (No knocks meant at the academic community. Professor brat here.)

At these percentages, the accuracy progress slows to a crawl and you’ve entered the domain of diminishing returns. You don’t have true artificial intelligence, but you have a very smart algorithm. You’re killing yourself for a point or two more.

I’ve spent the last week or so tagging thousands upon thousands of training documents, which we’ve used in turn to re-train our current algorithm. We are seeking our own version of the holy accuracy grail, the difference being, we are not after 100%. That’s just not our focus, and it doesn’t have to be.

At the moment we are focused on using the accuracy we have to do new, interesting, and useful things. The fact is, even without 100% accuracy, you can do just that. For example, sentiment analysis starts to get eerily good at 95%-98% accuracy.

At the Cogito blog, Luca Scagliarini has a semantic dream:

I have been looking forward to a successful semantic web application because the enterprise sector, where I think semantic technologies can really make the difference, needs a success in the consumer arena to move beyond its resistance to adopt these technologies.

Luca has it right when it comes to gaining traction with enterprise clients, and it’s the same with investors.

You have to build something that is adopted by consumers, which is why a lot of investment groups are watching semantic technologies with intense interest. You don’t need 100% accuracy with these technologies. In the end it’s about another kind of number…the number of people using your app.


sentiment analysis…where the action is

Posted on January 12, 2009 by elena

I recently read a post by Curt Monash at Text Technologies about the applicability of semantic technologies that got me thinking. His list emphasized areas where semantic search has potential: i.e. transactional, enterprise, and public-facing site search. Semantic search is a form of knowledge mining and extraction. The idea is that when you search for something on a complex site, you get more accurate results and even context for the subject you’re searching for. Semantic search is dependent on metadata such as tagged entities (people, places, items) and the relationships between them.

Initially, you need to analyze unstructured text (in ecommerce sites, company databases, and university websites for example) for entities, facts, and events. These entities are tagged, along with facts (Elena Haliczer: COO) and events (Recession: Layoffs). This whole process requires a significant corpus of unstructured text, which is why some initial stabs at semantic search engines are trained with Wikipedia text.
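
To picture what the tagged output of that analysis might look like, here is a small Python sketch. The class names and fields are hypothetical, and the sample sentence is invented around the two examples above. Real semantic search systems typically use richer formats such as RDF.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str            # e.g. "person", "place", "item"


@dataclass(frozen=True)
class Fact:
    subject: Entity
    value: str           # e.g. a role or attribute of the subject


@dataclass(frozen=True)
class Event:
    trigger: str
    outcome: str


@dataclass
class TaggedDocument:
    text: str
    entities: list = field(default_factory=list)
    facts: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def facts_about(self, name):
        """Return the values of every fact whose subject has the given name."""
        return [f.value for f in self.facts if f.subject.name == name]


elena = Entity("Elena Haliczer", "person")
doc = TaggedDocument(
    text="Elena Haliczer is COO. The recession led to layoffs.",  # invented sample
    entities=[elena],
    facts=[Fact(elena, "COO")],
    events=[Event("Recession", "Layoffs")],
)

print(doc.facts_about("Elena Haliczer"))
```

Once text is stored this way, a search for a person can return their role and related events instead of just pages containing the name.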

Think about it. Wikipedia is a huge connected textual database created by humans making human connections between entries. It’s perfect if you want to develop a search engine that “thinks” (or at least extracts connections) like a human. Cool right?

Curt emphasizes search in his list because so many sites need improved search and semantic technologies have a lot of scope there. However, he also says “the action is in sentiment analysis” and that’s where I actually get excited. When you’re successfully mining unstructured text for sentiment, you can apply this knowledge to simplify complex business practices and make better business decisions.

For example, we’ve trained JuLiA to recognize sentiment in user-generated content. Publishers can use this knowledge to simplify and even automate their moderation process according to their editorial policies.

There are a lot of other applications for sentiment analysis. My own applications list would include online reputation management, trend and buzz analysis, customer satisfaction, and national security threat monitoring and management. That’s where the heart of our business (and the real action) is at.


blog is live!

Posted on January 11, 2009 by elena

Hi everyone, and welcome to the blog. This is the place to be if you want to keep up on topics of interest in text mining, computational linguistics and natural language processing. Elena and I will be making regular updates about breaking news stories, general industry trends, and our own in-house research. Be sure to check in daily for updates, or you can sign up for our RSS feed at the top of this page.

Many thanks to Bill Harding at Bonanzle for his lightweight Rails blog plugin. If you're a Rails developer looking for an easy blog solution, Bloggity is it.

