Three's A Crowd: New entry to my device family.

This particular post is more personal and opinionated than the technical posts I usually share. But don't worry, it's still geeky.

Recently, as a reward to myself, I've gotten myself a Nexus 7. It was certainly worth the hard work I've done over the past year for Soshio. It is very tempting to write this post as a review of the Google tablet. But the biggest change for me is the acquisition of what can be considered an unessential gadget. Having a third device gave rise to a strong question for a self-proclaimed minimalist: How can I do as much as possible while carrying as little, and as lightly, as possible?

Entertainment and media consumption are obviously the tablet's job. Mobile is still clearly the king of need-it-now information, e.g. maps, nearby venues, and upcoming appointments.

The laptop, however, is harder to delegate to. Of course, tasks such as coding and downloading media make sense for the laptop. However, as I sit here in a cafe typing on my tablet, I wonder what else I can use my laptop for. The 7 inch size makes the tablet perfect for thumb typing when held vertically. Its light weight, a third of my laptop's, makes it tough for me to justify carrying both.

Perhaps, instead of finding a powerful machine for my next laptop, a simple machine that has a solid text editor and can run Python smoothly would suffice? Here's looking at you, Chromebook.

Full-time Entrepreneur: Year One

Time is a funny thing. They say time flies while you’re having fun. For such a joyous and challenging journey as a full-time entrepreneur, it’s uncanny how quickly a year can go by. It is now more than a year since I left my last job to start Soshio. I gave up on the magical city of Edinburgh and joined the millions of dreamers in New York City. (Sound the cliche alarm!) I've found new friends, reconnected with old ones, grown apart from past colleagues, and recruited new partners. I hope to share some of my thoughts and lessons learned, mostly for my own satisfaction.

Startup versus Fundup

To be honest, one of the most frequent and most annoying questions I've gotten revolves around funding. Is it necessary to ask for someone else’s money to show that a business is legit? I would think that quitting a good job, spending life savings, and working without pay is significant enough proof of legitimacy. One of my great disappointments in the ecosystem today is that a startup is less about business and more about funding. There are enough people discussing this subject that I won’t go into it any more than I already have.

On Asking for Help

Now, please allow me to go on a slight tangent. Have you seen or heard Amanda Palmer’s TED talk? It’s fantastic. She’s been a long time advocate of crowdfunding. I hope more musicians will follow in her footsteps.

Okay, back to startup lessons. Just as the aforementioned talk suggested, learning how to ask for help is both invaluable and underappreciated. On a tech team, questioners are seldom celebrated. During interviews, you’re judged on whether or not you know the answer, rarely on how you would find out if you don’t. (Answer: Google is your friend. Or Baidu, if you’re in China.) As an entrepreneur, it’s also easy to fall into the habit of acting tough, as if everything is okay. This can be caused by the urge to disprove the skeptics or to avoid disappointing your supporters.

It’s crucial to know when to ask for help, especially when the pressure of making a living and fulfilling expectations builds up through the days. It’s also critical to know whom to ask for help. Skeptics and opportunists often don’t have your best interests in mind. I've found their solutions are often self-serving or driven by pride. In comparison, a stranger may provide more objective and useful advice. As they’re less familiar with your situation, new acquaintances are less likely to expect returns from their suggestions.

A Needle of Advice in a Haystack of Success Stories

There is no shortage of people sharing their life lessons and secrets to success in the startup community. I've learned to take them with a grain of salt, and I would suggest others do the same. To have a secret to success, one must first be successful in a repeatable way. One big exit does not provide enough data points, even if it is a billion dollar deal. Context is everything for advice, and most contexts are hard to reproduce. An entrepreneur who broke out 10 years ago may not be able to reproduce the same result today. More importantly, each person’s definition of success is different. I've learned to assess my own definition of success against the advice giver’s before taking their suggestions. I am not them, and what worked for them may not work for me. This thought process has treated me well so far.

Then again, what do I know? I've not succeeded yet.

Vacation Guru: Where Do You Wanna Go?

I was revisiting my old side project archive recently to see if there were any half-finished projects. There are a couple, but today I want to share a specific one with you. VacationGuru recommends travel destinations based on a location input. For example, if you’ve enjoyed your time in Paris, you can use it to look up other destinations that you might also fancy visiting. This side project was influenced by my full-time job at the time, at Skyscanner. Ultimately, despite a working prototype and a pitch presentation to the business team, it was not picked up and integrated into the main product :(

Internally, VacationGuru utilizes search engine technologies to determine similarities between destinations. During initialization, it indexes travel reviews and guides on WikiTravel. (Theoretically speaking, any reputable text-based travel guide with good coverage, e.g. LonelyPlanet, will do.) By encoding travel guides as bags-of-words, we can treat each destination as a document in a corpus. Then, utilizing text analytics techniques such as cosine similarity or language models, we can retrieve documents (each describing a destination) that are closely similar to the input. Intuitively, this is what we are proposing: If the travel guides of two destinations use the same descriptions, we can infer that these two destinations are similar to each other.

There is fine print, however. When indexing a travel guide, not all sections of the guide can or should be weighted the same. For example, the “Economy” and “History” sections of the guide may not be as relevant to travellers as the “Culture” or “Climate” sections. The symptom of this issue was very clear when I first indexed the Wikipedia entries of the destinations instead of WikiTravel -- the recommendations were mostly based on geographical and historical similarities. This lesson reinforced the importance of the data cleansing and feature selection steps in any data mining task.
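One way to picture the section weighting described above is a weighted bag-of-words. This is only a minimal sketch: the weight values, the `SECTION_WEIGHTS` table, and the `weighted_bag_of_words` helper are all hypothetical and are not taken from the VacationGuru code.

```python
from collections import Counter

# Hypothetical weights for illustration only; the real project may have
# used different sections and values.
SECTION_WEIGHTS = {"Culture": 1.0, "Climate": 1.0, "History": 0.2, "Economy": 0.1}


def weighted_bag_of_words(sections, weights=SECTION_WEIGHTS, default_weight=0.5):
    """Build a weighted bag-of-words from a {section_name: text} mapping."""
    bag = Counter()
    for name, text in sections.items():
        weight = weights.get(name, default_weight)
        for word in text.lower().split():
            # Each occurrence counts for the weight of the section it appears in.
            bag[word] += weight
    return bag


guide = {
    "Culture": "museums cafes museums",
    "History": "revolution empire",
}
bag = weighted_bag_of_words(guide)
```

With these assumed weights, words from the "Culture" section dominate the bag, while words from the "History" section still contribute, just far less.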

If you are interested in playing with the code, you can go to my GitHub repository. Unfortunately, after nearly a year of not maintaining the environment, the demo doesn't seem to be working. The slides at the end of the post are from the Skyscanner presentation, highlighting some lessons learned and the product's potential.

What other applications of text analytics and/or search engine technology have you been hacking up lately?

Language Model: Today's winning lottery is ...

One of my favorite techniques for document similarity is the Language Model. Like the Vector Space Model, it attempts to compare entities without semantic understanding. Also like the Vector Space Model, it treats each document as a bag-of-words, without grammatical or ordinal influences. However, instead of vectors and angles, the Language Model utilizes probabilities to express the similarity between documents.

Imagine each unique word as a lottery ball with its own number. Thus, each document is a bin of lottery balls. If a word occurs twice in a document, then there are two balls with that number in the bin. In this analogy, your search query is the lottery ticket. In order to select the best bin to maximize your chance of winning -- or finding relevant information -- we must calculate, for each bin, the probability of drawing the numbers on your lottery ticket.

Thanks to the lottery analogy (similar to bag-of-words), calculating that probability is easy. Because the probability of each ball being drawn is independent of the others, your odds of winning from a bin are the product of the odds of drawing each number on your ticket; that is, p(i,j,k|D) = p(i|D)p(j|D)p(k|D) for a bin D and balls i, j, and k. Furthermore, the odds of drawing a specific number are simply the count of that ball divided by the total number of balls in that bin, or p(i|D) = count(i)/count(D) for bin D and ball i.

Because the odds of winning are the product of the individual probabilities, a problem occurs when a word is completely missing from a document: the whole product drops to zero. In a more concrete example, if you want to search for "film review for Happythankyoumoreplease" (which I enjoyed), you don't want a document to be filtered out just because it doesn't have the words "film" or "for". In order to avoid such a dreaded fate, smoothing functions are applied. Just as the name suggests, these techniques smooth out the sudden drop of probability to zero when one query word is missing.
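The product rule and the zero-probability problem above can be sketched in a few lines. This is an illustrative sketch, not code from the original project; the `query_likelihood` name and the sample words are assumptions, and the bin is assumed to be non-empty.

```python
from collections import Counter


def query_likelihood(query_words, document_words):
    """p(query | D) as the product of count(word) / count(D), with no smoothing."""
    counts = Counter(document_words)
    total = len(document_words)  # assumes a non-empty document
    probability = 1.0
    for word in query_words:
        probability *= counts[word] / total
    return probability


bin_d = ["film", "review", "film", "great"]

# (2/4) * (1/4): both query words are in the bin.
score = query_likelihood(["film", "review"], bin_d)

# "for" never appears in the bin, so the whole product collapses to zero.
missing = query_likelihood(["film", "for"], bin_d)
```

The second query shows why smoothing is needed: one absent word is enough to wipe out an otherwise relevant document.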

The concept behind smoothing functions (in language models) is simple -- avoid a zero numerator or denominator. To do so, you add an arbitrary value to both the numerator and the denominator. Depending on that value, the winning odds behave slightly differently and become sensitive to different combinations of word occurrence odds. A smoothing technique I use quite often is Dirichlet Smoothing. It uses all the documents (often referred to as the corpus in linguistics-land) to derive that smoothing value. That is, it takes into account how the document in question relates to the whole corpus. As much as I'd like to, I cannot claim expertise on smoothing functions; a quick search on Wikipedia should be a good starting point for anyone interested in finding out more.

While it sounds complicated, the implementation is rather straightforward. To find the winning odds of a bin -- or the relevancy of a document -- for each lottery number you picked, you count how many times that number appears in that bin (n) and in all bins (m), along with the total number of balls in that bin (x) and in all bins (y). The score is the product of (n + m)/(x + y) over every lottery number you picked.
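Here is a minimal sketch of the simplified score described above, using the same n, m, x, and y. The `smoothed_score` name and the toy corpus are assumptions, and "all bins" is taken to include the bin being scored. Note that the textbook form of Dirichlet smoothing scales the corpus term with a tunable parameter, (n + mu * m/y) / (x + mu); the sketch follows the simpler counts from the paragraph instead.

```python
from collections import Counter


def smoothed_score(query_words, document_words, corpus_documents):
    """Product of (n + m) / (x + y) over the query words."""
    doc_counts = Counter(document_words)
    corpus_counts = Counter(w for doc in corpus_documents for w in doc)
    x = len(document_words)                        # balls in this bin
    y = sum(len(doc) for doc in corpus_documents)  # balls in all bins
    score = 1.0
    for word in query_words:
        n = doc_counts[word]     # occurrences in this bin
        m = corpus_counts[word]  # occurrences in all bins
        score *= (n + m) / (x + y)
    return score


corpus = [
    ["film", "review", "film", "great"],
    ["review", "for", "a", "book"],
]
query = ["film", "great", "for"]
scores = [smoothed_score(query, doc, corpus) for doc in corpus]
```

Neither document contains every query word, yet both get a non-zero score, and the first document (which mentions "film" twice and "great" once) ranks higher.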

On Mantaphrase and Communicating Across Languages

A while ago, I came upon an iPhone app called Mantaphrase. It caught my eye especially for the problem it tries to solve: everyday conversation with a foreign language speaker.

On Mantaphrase

The app allows users to select from a predetermined list of questions or phrases to translate. The user can then hand their phone to the “listener” to select a reply. The app also supports suggested common responses and follow-up questions. Essentially, it’s similar to a collaborative reading of a choose-your-own-adventure book.

The concept is great and I like the user experience. However, the scalability of this app is certainly a concern. How does it anticipate most of the things I want to say? Even if it assumes that “phrase occurrence” follows a power law, e.g. the 80-20 rule, it’s still an enormous number of phrases to accommodate. But I digress. I am certain the developers have ways to solve such challenges. For example, mining those “Everyday [language]” books in your local Barnes & Noble.

What I really want to explore is the notion of understanding another person with a different culture and language. While precompiled “choose-your-conversation” apps are useful for small talk and general inquiries, they’re not enough for a deep and meaningful talk (not that it’s Mantaphrase’s goal anyway). For that, the speaker and the listener need to connect on emotional and contextual levels.

Emotional Communication

As Robert Plutchik observed in his studies, there are genetically embedded emotions and bodily expressions within every person, regardless of their cultural background. You can read more about these basic emotions in my previous post. This shared platform is what makes communication in person easier than in text. Emotional signals are easier to communicate through body language. However, how do you share the context of your thoughts with another person across different languages?

Contextual Communication

When I observe or experience meetings between people who do not speak the same language, I tend to recognize the same few tactics: One, speak louder and slower. But (spoiler!) that does not work; the listener isn’t deaf. In fact, it might just agitate them more. Two, hand gestures. This works rather well depending on the concept being expressed, as well as the skill of the “speaker”. However, even sign language varies between cultures and countries.

Interestingly, the highest “return on investment” I’ve noticed comes from a game many of us have played before. Yes, I am talking about Pictionary. Through analogy with objects most people recognize, a surprising number of deeper concepts can be expressed. Perhaps the true app to solve the challenge of cross-lingual communication is one that can solve Pictionary puzzles?

This, of course, is my personal observation. If you know any studies in this area, please share. I’d love to read more about the subject.

 

Linear Regression, MLP, Gaussian Process

I gave a talk at Skyscanner a while ago about the various forms of regressions in Machine Learning (ML) and how they might be applicable. While I have yet to get confirmation that I can post the video of the talk here, I'd like to share a selection of the techniques in a more abstract form.

Linear Regression

Linear regression is the basic form of regression that we are most familiar with through our high school Algebra classes. Remember when you were given an equation y = a x + b and asked to find a and b for a given set of (x,y)? That is regression at its core. Through a set of training points, a machine learning engine would attempt to identify the constants that best fit those points. By definition, a linear regression problem is any problem that can be simplified into the form f(x) = w^T phi(x), where phi is a fixed function of the input (it may itself be non-linear) and f is linear in the weights w. Since the constants a, b, c, ..., etc. in functions such as a x + b or a x^2 + b x + c can be captured in a vector [a, b, c, ..., etc.]^T, these are all linear regression tasks; so is (a x + b)^4 once it is expanded into powers of x and each coefficient is treated as its own weight. This vector is the weight vector w in the generalized equation.

To optimize a regression solution, an error function is estimated and minimized. The error function measures the average difference (commonly the squared difference) between the predicted values and the training targets as a function of the weight vector w. The optimized weights can be identified as a point where the derivative, or slope, of the error function is 0. Linear regression is great in that its error function has only one global minimum. However, it is also limited in the types of function it can describe. The most commonly seen implementation of linear regression is in spreadsheet applications, when you ask the program to find and plot the best fit line for your data points.
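For the simplest case, y = a x + b, the point where the slope of the squared error is 0 has a well-known closed form. The sketch below assumes a mean squared error function; the `fit_line` and `mean_squared_error` names and the training points are made up for illustration.

```python
def fit_line(points):
    """Least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Setting the derivative of the squared error to 0 gives these formulas.
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mean_squared_error(points, a, b):
    """Average squared difference between predictions and training targets."""
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


training = [(0, 1), (1, 3), (2, 5), (3, 7)]  # points on y = 2x + 1
a, b = fit_line(training)
error_at_optimum = mean_squared_error(training, a, b)
error_nearby = mean_squared_error(training, a + 0.1, b)
```

Any nudge away from the fitted weights, such as `a + 0.1`, increases the error, which is what having a single global minimum looks like in practice.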

Multilayer Perceptron (MLP), aka Neural Network

To get rid of the limitation of the linear regression solution, ML scientists apply non-linear functions such as the sigmoid and inverse tangent functions to linear functions. Thus a perceptron is created, taking the form h(x) = g(f(x)), where g(z) is a non-linear transformation of a linear function f(x) = w^T x. However, researchers soon realized an important drawback of a single-perceptron solution. In any direction perpendicular (or orthogonal in multidimensional space) to the weight vector w, the predictions do not change. This flaw would greatly increase the error rate, as a whole direction of predictions is error-prone.
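The blind spot is easy to check numerically. The following sketch assumes a sigmoid for g and picks arbitrary example weights; the `perceptron` helper and all the numbers are illustrative only.

```python
import math


def perceptron(weights, x):
    """h(x) = g(w^T x) with a sigmoid as the non-linear function g."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


w = [2.0, 1.0]
x = [0.5, 0.5]

# A direction orthogonal to w: 2.0 * -1.0 + 1.0 * 2.0 == 0.
orthogonal = [-1.0, 2.0]

# Moving along the orthogonal direction leaves w^T x, and so h(x), unchanged.
moved_sideways = [xi + 3.0 * oi for xi, oi in zip(x, orthogonal)]

# Moving along w itself does change the prediction.
moved_along_w = [xi + 3.0 * wi for xi, wi in zip(x, w)]

same = perceptron(w, x), perceptron(w, moved_sideways)
different = perceptron(w, x), perceptron(w, moved_along_w)
```

However far the input travels in the orthogonal direction, a single perceptron gives the same answer, which is the flaw the next paragraph addresses.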

Ultimately, the solution is simple: use more perceptrons! While each perceptron has a blind spot, that spot would be covered by other perceptrons. This divide and conquer technique can be seen in many other parts of Computer Science, such as quicksort in the sorting problem. To aggregate the different predictions g(f(x)) from multiple perceptrons, more perceptrons are used. This introduces a new layer of perceptrons that treats the outputs of the previous layer as its inputs. Thus, an MLP system is born.

Because of the multilayer nature of the solution, calculating the derivative of an MLP error function is a difficult task. Fortunately, the error measured at the output can be passed backward through the layers using the chain rule, so each layer's weights receive their share of the blame. Therefore, backward propagation techniques can be used to estimate where the error function has a slope of 0 and greatly speed up the optimization of the MLP system.

MLP is not without its weaknesses, however. The error function of an MLP solution tends to have multiple valleys, making optimization a form of gambling. If the starting weight vector is within a local valley instead of the global one, the "optimum" weight vector from the system is very likely to miss the global best.

That being said, MLP is widely adopted in commercial products due to its flexibility and efficiency in finding "second-best" solutions.

Gaussian Process

Gaussian process regression is one of the latest regression techniques. Instead of creating a single function (using an optimized weight vector), it produces a distribution over all possible functions given the training points. To do this, it leverages the definition of a Gaussian Process: a collection of random variables in which any finite subset has a joint Gaussian distribution. This, however, does mean that the technique assumes all the features, including all dimensions of x and y, have Gaussian distributions. While it appears limiting, this assumption lets the system train using the covariance matrix of the joint distribution.

This covariance matrix encompasses the interrelationships between every feature and every other feature. When predicting a value, the regression system generates a Gaussian distribution with a covariance matrix produced by a Kernel function. The Kernel function describes how a feature y of a function changes depending on how all the other features x change. In a Gaussian process, the system trains on the hyperparameters, or parameters of the Kernel function, instead of a weight vector as in traditional regression. The Squared Exponential Covariance and its various forms are the most popular choices for the Kernel function.

When visualizing all the possible predictions for y at all the different points of x, what we get in return is a tubular plot where the variance of the prediction widens as x moves away from a training point and narrows as x gets closer to one. This holistic prediction allows engineers to account for how the training data relates to the prediction depending on how far apart they are.
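The widening and narrowing of the variance can be seen even in the smallest possible case. The sketch below is a deliberately reduced illustration: it assumes a zero-mean prior, a squared exponential kernel with fixed hyperparameters, and only ONE training point, so no matrix inversion is needed. The function names and values are assumptions, not from the talk.

```python
import math


def squared_exponential(x1, x2, length_scale=1.0, signal_variance=1.0):
    """Squared exponential kernel: high covariance for nearby inputs."""
    return signal_variance * math.exp(-((x1 - x2) ** 2) / (2.0 * length_scale ** 2))


def predict(x_new, x_train, y_train, noise=1e-9):
    """Posterior mean and variance at x_new given a single training point."""
    k_tt = squared_exponential(x_train, x_train) + noise  # small jitter for stability
    k_nt = squared_exponential(x_new, x_train)
    k_nn = squared_exponential(x_new, x_new)
    mean = k_nt / k_tt * y_train
    variance = k_nn - k_nt ** 2 / k_tt
    return mean, variance


# One training point at x = 0 with y = 1.5; predict at growing distances.
predictions = [predict(x_new, x_train=0.0, y_train=1.5) for x_new in (0.0, 0.5, 2.0)]
variances = [variance for _, variance in predictions]
```

At the training point the variance is almost zero, and it grows back toward the prior variance as the query moves away. With more training points the same formulas use covariance matrices instead of single numbers, which is where the cost discussed next comes from.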

On the flip side of the coin, the technique suffers greatly on speed. The operation of calculating multiple covariance matrices to optimize the hyperparameters is complex and expensive. This complexity may account for why this technique has yet to be widely adopted in commercial applications. Nonetheless, I have high hopes and excitement for Gaussian Process because of the comprehensiveness of its predictions.

And well, it's just freaking sexy.

Cosine Similarity: Are you heading the right direction?

Today, I want to introduce a very simple, yet powerful, technique for recommendation -- Cosine Similarity. However, before I discuss it, an important analogy -- the Vector Space Model -- has to be introduced.

Vector Space Model

Generally, recommendation systems, especially text-based ones, are built around examining the "closeness" between two pieces of content. That is, answering the question: Are they talking about, or representing, the same thing? That question in itself is very semantic. Unfortunately, assigning meaning to a list of characters (which is basically what a word is) with minimal human training is a tough challenge, one that leading scientists are still trying to solve effectively and accurately today. Meanwhile, there are numerous powerful mathematical and statistical techniques within the Data Mining community that cannot be applied directly because they are semantics-free.

Vector Space Model (VSM) is a powerful analogy for examining semantics-based problems. Instead of looking at a text article through the meaning of each word, VSM assumes that the article as a whole is a bag of words, where each word occurs independently of the others. Therefore, if we treat each word as carrying its own independent meaning, how many times the word occurs indicates how far the article leans toward that word's meaning. With that analogy, we can map an article onto a multi-dimensional space as a vector, with each coordinate based on how often each word occurs. Furthermore, a collection of articles, such as the web, would become numerous vectors. Suddenly, all the powerful and mature linear algebraic techniques from the Data Mining community can be used.

Now, intuition about expressing the similarity between two vectors might point to using their end points for comparison. For example, it sounds like a great idea to compare two articles based on the Euclidean Distance between their end points. However, that approach's flaw is exposed when one article is a copy of the other, but with the whole text duplicated multiple times. That is, if one article contains "x, y, z" as text, then the other contains "x, y, z, x, y, z, x, y, z". In this case, the Euclidean approach fails because, even though the two articles are exactly the same semantically, their end points are far apart: every coordinate differs by 2.
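The duplicated-text example can be checked directly. This is a small sketch; the `to_vector` and `euclidean_distance` helpers are illustrative names, and the vocabulary is fixed by hand.

```python
import math
from collections import Counter


def to_vector(text, vocabulary):
    """Map a text to word counts, one coordinate per vocabulary word."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]


def euclidean_distance(a, b):
    """Straight-line distance between the end points of two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


vocabulary = ["x", "y", "z"]
original = to_vector("x y z", vocabulary)
tripled = to_vector("x y z x y z x y z", vocabulary)

# The two texts say the same thing, yet the distance is far from zero.
distance = euclidean_distance(original, tripled)
```

The two vectors point in exactly the same direction; only their lengths differ, and that length difference is all the Euclidean distance is measuring here.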

Cosine Similarity

To remedy that flaw, another approach is introduced. Cosine Similarity (CosSim) accounts for the ratio between the different words, x:y:z in the previous example, while discarding the total word count of each article. In other words, CosSim normalizes all article vectors to have a magnitude of 1 unit while maintaining the ratio between the words. This approach is also referred to as Simplex. The key to enforcing that is calculating the angles between the vectors. This way, we can avoid normalizing each article vector explicitly while factoring out the repeating sets of words.

How about a meta-analogy? Imagine a blank sheet of paper as a pole sticking straight out of the ground; it is free from influences because no words indicate what it's really about. As we write more words on the paper, each word acts as a rope pulling the pole in that word's direction. If we happen to use a word more than once, the pull of that word's rope grows proportionally stronger. As we finish writing the paper, the final tilt of the pole is a representation of what the article is about. Now imagine two articles, thus two poles, stemming from the same point on the ground. A way to express how close these two poles are is through the angle formed between the two tilted poles. That is cosine similarity.

Okay, I think I have whetted your appetite enough. Now that I've explained the reasoning behind using VSM and CosSim, I can reveal the greatest magic behind cosine similarity. The equation for calculating CosSim between two article vectors is A · B / (||A|| ||B||), or the dot product of vectors A and B divided by the product of the magnitude (length) of vector A and the magnitude of vector B. That's it!

There are many tools in all sorts of programming languages to perform these operations; if not, they are very easy to implement as well. The specific difficulty I've come across while implementing this method is transforming a sparse representation, which takes fewer resources, into one that encompasses all dimensions (words) from both article vectors. Yet even this challenge can be easily overcome using set operations such as taking the union of the two sets of words.
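A minimal sketch of cosine similarity over sparse word counts, using a set union to cover every word from both articles, might look like the following. The `cosine_similarity` name and the sample texts are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {word: count} mappings."""
    words = set(a) | set(b)  # union of the two sets of words
    dot = sum(a.get(w, 0) * b.get(w, 0) for w in words)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty article has no direction to compare
    return dot / (norm_a * norm_b)


article = Counter("x y z".split())
tripled = Counter("x y z x y z x y z".split())
unrelated = Counter("p q".split())

same_direction = cosine_similarity(article, tripled)
no_overlap = cosine_similarity(article, unrelated)
```

The duplicated article now scores as identical to the original, while two articles with no words in common score zero.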

Cosine similarity, however, is in no way the end game for recommendation. It is not the best solution, and many state-of-the-art techniques outshine CosSim in terms of accuracy. However, its true strength is in its simplicity and flexibility. More often than not, it will not give you the best accuracy you are seeking, but it will provide an acceptable solution with minimal effort.

One Shortcoming of Agile Development

In an effort to migrate some of the relevant posts from my G+ account, here is an old (but hopefully interesting) post from Nov 2011:

We've been "practicing" Agile Development since this summer. Many things went right, many things were not used, and many things weren't implemented correctly. However, recent events pointed out a potential shortcoming with Agile Development I've not noticed before.

It strongly reminds me of an article by Jeff Atwood regarding A/B Testing. Simply put, the whole incremental improvement paradigm shared between A/B Testing and Agile Development assumes the solution can be reached via hill climbing.

The concept of hill climbing is that you can find the peak, or best solution, by constantly finding better solutions around the current one. However, this is not necessarily true. As described by Jeff Atwood via a great parallel to Groundhog Day, hill climbing can only guarantee reaching a local optimum rather than the global one. This is because it does not accept steps back.
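The limitation shows up in even a tiny example. In this sketch, the `hill_climb` function and the two-hill `landscape` are invented purely for illustration.

```python
def hill_climb(score, start, step=1, max_steps=1000):
    """Move to the better neighbor until no neighbor improves the score."""
    current = start
    for _ in range(max_steps):
        best = max((current - step, current + step), key=score)
        if score(best) <= score(current):
            return current  # no step back is ever accepted
        current = best
    return current


def landscape(x):
    """A small hill peaking at x = 2 and a taller hill peaking at x = 10."""
    return max(4 - (x - 2) ** 2, 10 - (x - 10) ** 2)


from_left = hill_climb(landscape, start=0)   # gets stuck on the small hill
from_right = hill_climb(landscape, start=7)  # happens to reach the tall hill
```

Starting at 0, the climber stops at the small peak because reaching the taller hill would require walking downhill first.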

Similar to the problem with A/B Testing, once a user story has been delivered, any rework without substantial business value struggles to take priority. Therefore, while a re-architecture job may position the product for far greater potential, it's difficult to justify its value to the business and stakeholders. Furthermore, it's even more difficult to write atomic user stories when the same cards have already been played.

It would be great if I am proven wrong, because I am really enjoying the Agile practices so far. Granted, it's easier to act Agile than to be Agile. When applied correctly, its value brings more long-term benefit than most alternatives.

Huh? From Thinkudo to Soshio

In case you haven't noticed, Thinkudo has changed quite dramatically over the summer. These changes reflect how busy we have been lately. We have found ourselves a business partner as part of Edge Collective's incubation program. Furthermore, the work that Thinkudo Labs has been doing has been moved under a new brand, Soshio.

Before you ask, Soshio is our play on the accented "social" to reflect its nature in bridging across multiple cultures. Soshio will be the new face of Chinese social media analysis, an effort started by Thinkudo Labs. To learn more about what we have done with Soshio, from the values offered to a fresh look at the brand, check out the "Soshio" button to the left. Alternatively, you can use this link.

So what's going to happen to Thinkudo Labs? It's not going to go away. However, it will remain with me, Ken Hu, as a lab to launch future projects, share thoughts on text technologies, and be a window to reach me for consulting and contracting within relevant domains.

In the meantime, I hope you enjoy Soshio. And don't be shy about sharing your feedback with me!

Emotion Analysis: Applying Emotion Detection

UPDATE: It was pointed out to me that the links to the papers were not working. They are fixed now. Sorry about that!

In my previous post, we covered the quest to identify the building blocks of all human emotions. Although the quest has yet to end, its applications and implications are already budding within the academic community. In this post, I will cover a few interesting research projects and studies I’ve come across.

Call Centers

In 2005, Vidrascu and Devillers published their research on identifying emotions from call center conversations. They took medical emergency calls in France and attempted to build a classifier from the data.

What I found most interesting about this research is its observations on mixed and conflicting emotions. Vidrascu and Devillers used the Big Six (the widely adopted six emotions identified by Paul Ekman) as the basic emotions to classify, but also used more complex emotions, including relief and anxiety. When they offered the conversations to professional annotators, they made an interesting observation. A significant share (between 10 and 20%) of conversation segments were conflictual, with the speaker expressing positive and negative emotions simultaneously. Because the researchers utilized both lexical and tonal signals, the number of concurrent emotions detected is much larger than in purely text-based systems. For example, the caller may express relief in his words and embarrassment in his tone. Vidrascu and Devillers also recommended a list of features to consider when dealing with auditory analysis.

Although their goal is to improve the analytics, it’s not hard to imagine a system for reviewing and improving service quality being born from the research. If you are interested, you can find their paper here.

Instant Messaging

Due to the rising popularity of instant messaging in the last decade, academic interest in applying text analytics to the medium has also grown. I came upon two interesting research projects approaching a similar vision from opposite directions.

In 2008, Bhattacharyya and Bhattacharyya published a paper on emotion detection in internet chat. Beyond the popular WordNet dictionary, they also employed other resources, including Urban Dictionary and NoSlang, to introduce more features to their data. This approach is especially critical for internet slang and acronyms. Their list is worth considering when dealing with other internet-based corpora. You can find their paper here.

The research project by Ma, Prendinger, and Ishizuka came from the application angle. They proposed an instant messenger in which an avatar would express the emotion of its user. Like all the other works mentioned in this post, they used the Big Six as the basic emotions to extract. Ma et al. built a lexicon from the basic emotions and their first-degree synonyms in WordNet. Their paper is here.
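To give a feel for what a lexicon-based approach looks like, here is a toy sketch. It is not the method from the paper: the `EMOTION_LEXICON` entries are hand-picked examples rather than WordNet synonyms, and the `detect_emotions` helper is hypothetical.

```python
from collections import Counter

# Hypothetical toy lexicon keyed by the Big Six; a real system would derive
# the word lists from a resource such as WordNet.
EMOTION_LEXICON = {
    "happiness": {"happy", "glad", "joyful"},
    "sadness": {"sad", "unhappy", "sorrowful"},
    "anger": {"angry", "furious", "mad"},
    "fear": {"afraid", "scared", "fearful"},
    "disgust": {"disgusted", "revolted"},
    "surprise": {"surprised", "amazed", "astonished"},
}


def detect_emotions(message):
    """Return the emotions whose lexicon words appear in the message."""
    words = [word.strip(".,!?") for word in message.lower().split()]
    hits = Counter()
    for emotion, keywords in EMOTION_LEXICON.items():
        hits[emotion] = sum(1 for word in words if word in keywords)
    # Most frequent emotions first; emotions with no hits are dropped.
    return [emotion for emotion, count in hits.most_common() if count > 0]


detected = detect_emotions("I am so glad and amazed you came!")
neutral = detect_emotions("See you at noon")
```

A keyword lookup like this cannot see tone or context, which is exactly why the extra resources and signals discussed in the studies above matter.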

The applications for instant messaging, and for SMS as mentioned by Bhattacharyya, are especially intriguing to me. As most in-person communication is non-verbal, the ability to capture, interpret, and convey these hidden expressions is critical to improving cross-cultural and cross-lingual communication.