Collaborative messaging applications such as Slack (or its open-source counterpart Mattermost) are popular communication tools for organizations, and the logs of the conversations on these forums often contain very valuable question and answer sets. Unfortunately, these Q&A sets can be difficult to identify and search since they are usually embedded in non-topical conversations. To help solve this problem I developed a search tool, '/qlog', that operates as a Mattermost slash bot and performs a natural language search for question and answer sets within the messaging conversation logs. I gave a YouTube presentation on the project below.

Introduction

I recently completed a fellowship program called 'Insight Data Science'. The program is a seven-week intensive postdoctoral training sprint focused on shaping the computational tools honed during a PhD into a data science skill set, ready for the fast-paced practical problems of industry. Insight has had remarkable success in facilitating the transition from academia to industry, and many of its previous fellows are now leaders in the data science space.

A side effect of this success is a very active private Slack network that Insight hosts for all of its alumni. Here alumni exchange messages on topics as wide-ranging as "where to meet up during the SciPy conference" or "where is the best sushi in Austin?". In addition to these fun and logistical discussions, this network is often a place where many alumni go to ask technical questions of their peers. Because of the unique position of the alumni network at the leading edge of the field, these question and answer sessions often hold valuable information.

For my project as an Insight fellow, I consulted with Insight on an internal project to build a natural language tool to search and extract knowledge from this resource. A major challenge of this project was dealing with the fact that Slack conversations are uncurated, and often consist of a sequence of short back-and-forth messages that are only loosely grouped by subject matter. Because the Slack search tools do not take the message-to-message context of a question and answer discussion into account, they will often fail to return the key question and answer sets within the archives.

To solve this problem and make the discussion history available for search and retrieval, I had three main objectives:

  • Develop a pipeline to extract data from the Slack logs and transform it into a format that can be searched using machine learning tools.
  • Implement an algorithm that extracts valuable content from the dataset.
  • Deploy a search tool that returns the question and answer sets from the archives. Since Insight is planning to transition the alumni network from Slack to Mattermost, I built my tool as a Mattermost slash integration, but the basic approach should work as a Slack integration as well.

Extracting and transforming the data

The first step in creating a searchable resource is to transform the strings of text data into a format suitable for machine learning algorithms that can make comparisons between the search query and the messages stored in the database.

A simple approach is to represent each message using an array of values where each value represents the number of times a given word appears in the message. In order to work, this array needs an entry for every possible word we might encounter in a message (the vocabulary), or, more practically, a spot for each word that is important to my search and classification task. These counts are known as the term frequencies (TF). Typically, before performing these counts we want to preprocess our text data so that words that clearly carry very similar meaning are collected into the same counts. For instance, words that differ only in capitalization, or in the way they are conjugated ('having' vs 'have'), should be considered the same. These sorts of preprocessing steps are nicely implemented in Python in the nltk library, which I took advantage of in building my search tool.
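As a rough sketch of this step, the snippet below lowercases each message and stems its tokens with nltk's PorterStemmer. The exact preprocessing choices in the actual tool (stemming vs. lemmatization, tokenization, stop-word handling) may differ, and the example message is invented.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(message: str) -> str:
    # Lowercase and split the message, then stem each token so that
    # conjugated forms ('having' vs 'have') collapse to a common root.
    tokens = message.lower().split()
    return " ".join(stemmer.stem(token) for token in tokens)

print(preprocess("Having trouble installing Spark on my laptop"))
# e.g. 'have troubl instal spark on my laptop'
```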

Even after performing this processing, the TF counts are often heavily biased towards words that occur frequently within a specific corpus of messages, but don't really help distinguish between messages of that corpus. For instance, in my case the corpus is the Insight fellows alumni network, and the word 'insight' is likely to show up in many messages regardless of whether the message is part of a technical or non-technical discussion. In contrast, a term such as 'python' or 'spark' provides much more identifying information. To incorporate this fact it is common to multiply the TF counts by what is known as the inverse document frequency of the term (IDF); in my case this is the reciprocal of the number of messages in which a word appears. Together, the process of counting and weighting terms is known as a TF-IDF embedding, which I implemented using the scikit-learn TfidfVectorizer object.
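A minimal sketch of this embedding step, assuming a small invented list of already-preprocessed messages in place of the real Slack archive:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-preprocessed Slack messages standing in for the archive.
messages = [
    "have anyon tri spark for larg join",
    "python panda work fine for me",
    "where is the best sushi in austin",
]

# Fit the vocabulary on the message corpus; each message becomes a sparse
# vector of term counts weighted by inverse document frequency.
vectorizer = TfidfVectorizer()
message_vectors = vectorizer.fit_transform(messages)

print(message_vectors.shape)  # (number of messages, vocabulary size)
```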

Training a question and answer detector

The TF-IDF representation of text is at the heart of many pre-built search algorithms. However, for this task I wanted a way of returning packets of messages that contain question and answer sets -- more complete discussions relevant to the search query than single-message results. For instance, high-quality results would consist of a sequence of messages where the first message asks a question and the following message contains an answer to that question. In comparison, unanswered questions, or unrelated orphan statements, should be ignored.

To bias the search results in this way, I built a question and answer detector using logistic regression. Since the Slack data is unlabeled, I drew on an archive of Stack Exchange data released under a Creative Commons license. This contains over 18K data science messages that have been pre-labeled as either questions or answers. I embed these Stack Exchange messages using the Slack vocabulary, and then train a logistic regression model, being careful to control for overfitting by using 10-fold cross validation on my training set. This is particularly important here because I will eventually use the model in a different context (the Slack message set) than the one it was trained on (the Stack Exchange data).
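The sketch below shows roughly how such a detector could be trained with scikit-learn. The `vectorizer`, `posts`, and `labels` arguments are placeholders for the TF-IDF vectorizer fit on the Slack vocabulary and the labelled Stack Exchange data, and details such as the solver settings are assumptions rather than the exact configuration I used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def train_question_detector(vectorizer, posts, labels):
    """Train a question-vs-answer detector on labelled Stack Exchange posts.

    `vectorizer` is the TfidfVectorizer already fit on the Slack vocabulary,
    `posts` is a list of Stack Exchange message strings, and `labels` marks
    each post as a question (1) or an answer (0).
    """
    X = vectorizer.transform(posts)
    y = np.array(labels)

    clf = LogisticRegression(max_iter=1000)

    # 10-fold cross validation estimates how well the detector generalizes
    # before it is applied to the unlabelled Slack messages.
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"mean CV accuracy: {scores.mean():.2f}")

    return clf.fit(X, y)
```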

Evaluating the performance of the question detector is a bit tricky since what I really care about is how it performs on the Slack data -- which is unlabelled -- but I can look at how the detector performs on a held-out set of the Stack Exchange messages. To do this, I can examine the overall accuracy, which comes in nicely at 91%.

Applying the detector to identify question and answer sets

Now that I have a method for distinguishing questions from answers, I want to apply it to the sequence of messages posted on Slack, while taking into account the order in which these messages arrive. Before deciding if a message is a question or an answer, the logistic regression model predicts the probability of the message falling into each of these two classes. Since I trained the model on two classes, the probability that a message is an answer is one minus the probability that it is a question. With this in mind, the motif indicating that a given message (message n) poses the question of a question and answer set is one where the model predicts a high probability of a question for message n and a low probability of a question for message n+1. Thus, I score each message n for membership in Q&A groups by multiplying these two probabilities together.
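In other words, the score for message n is P(question | message n) × (1 − P(question | message n+1)). A minimal sketch of this scoring step, assuming the trained detector and the TF-IDF message vectors from the earlier snippets:

```python
import numpy as np

def qa_scores(clf, message_vectors):
    """Score each Slack message for opening a question-and-answer pair.

    `clf` is the trained question detector and `message_vectors` holds the
    TF-IDF vectors of the messages in the order they were posted.
    """
    # Probability that each message is a question (assuming class 1 = question).
    p_question = clf.predict_proba(message_vectors)[:, 1]

    # Message n scores highly when it looks like a question and the message
    # that follows it (n + 1) looks like an answer.
    p_next_is_answer = 1.0 - np.roll(p_question, -1)
    scores = p_question * p_next_is_answer

    # The final message has no follow-up, so it cannot open a Q&A pair.
    scores[-1] = 0.0
    return scores
```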

Natural language Q&A search

Having scored each message using a model that gives me some prior estimate of whether that message kicks off a valuable Q&A discussion, I now want to implement a tool to search for and extract these discussions. For this, I again use the TF-IDF approach to encode my search query as a numerical vector, and then compute the cosine similarity between this query vector and all the messages in the Slack archives. To find the best search results I multiply each of these similarity values by the pre-computed Q&A score discussed above and then return the messages with the highest values to the user.
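A sketch of this final search step, assuming the `vectorizer`, `preprocess` helper, message vectors, and Q&A scores from the earlier snippets; the function name and `top_k` parameter are illustrative, not the exact interface of the /qlog integration:

```python
from sklearn.metrics.pairwise import cosine_similarity

def search(query, vectorizer, message_vectors, qa_score, messages, top_k=3):
    """Return the archived messages most relevant to `query`, biased toward
    messages that open a Q&A discussion.

    `vectorizer` and `message_vectors` come from the TF-IDF step, `qa_score`
    is the per-message Q&A score computed above, and `messages` is the list
    of raw Slack messages in posting order.
    """
    # Embed the query with the same preprocessing and TF-IDF vocabulary
    # used for the archive.
    query_vector = vectorizer.transform([preprocess(query)])

    # Cosine similarity between the query and every archived message.
    similarity = cosine_similarity(query_vector, message_vectors).ravel()

    # Weight relevance by the pre-computed Q&A score and return the top hits.
    combined = similarity * qa_score
    best = combined.argsort()[::-1][:top_k]
    return [messages[i] for i in best]
```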

