Text Classification

Felix Bogdanski, since 22.6.2016

I want to present a neural learning approach whose underlying technique can be used in many scenarios where raw, unstructured text is the primary data source. Together, we will build a model that is able to infer the category of new, unknown content. A significant part of the process is feature extraction, because computers understand numbers, not words. One example use case is a news crawler that constantly searches the web for new content. When new content is found, it tries to infer its category, e.g. technology, sports or business, in order to add it to the right search index.

The classification should be sufficiently precise, yielding reasonable results for us humans. Further, it should be able to abstract over writing style, since the internet is for everyone. Unsurprisingly, a well-trained artificial neural network will be our tool of choice to achieve this. ANNs are strong universal approximators that naturally cope with non-linear shapes. Humans are mostly non-linear thinkers, so ANNs are good at learning all the subtle differences in the way we express ourselves through language. I wrote several articles with examples using ANNs, so if you are new to the topic, you may want to read this introduction first. We will not stop at theory; we will actually code this intelligence in Scala using the lightweight library NeuroFlow. Everything is open source, so feel free to check the code and data on GitHub.

Words are Numbers

Neural nets and Big Data are really close friends: the mathematical nature of neural nets (no inherent state, except for LSTM nets) begs for huge data sets in order to produce natural results that do not run contrary to our expectations. Well, unless our expectations are 'wrong', of course. We need to throw a lot of data at our network, and luckily the internet is an inexhaustible source of human-written text. Let's use the well-known 20 Newsgroups data set1, hoping that within this set there are many different writing styles our net can learn from. Not only is the amount of data promising; the inherent categorical ordering of newsgroup messages also makes this data set a convenient sparring partner, since we don't need to label data manually. On the other hand, this means we need to _trust_ that people don't mix up categories too often when posting to newsgroups.
_rec.autos_:

  "The story is this: I bought a car out of state, and I'm trying to get the safety inspection in Pennsylvania. The problem is that the car has aftermarket tint on all windows except the windshield. The tint is rather weak, and you can clearly see the inside of the car through the tint. The inspection garage said that they won't pass it unless I get a waiver from the state police. So I went to the state police - the officer told me that aftermarket tint is illegal, and I can get a waiver only for a pre-84 car or for a medical reason. I asked him to show me the section of the vehicle code that says it's illegal. He showed it to me and the paraghaph said that you can't have tint, if you can't see the inside of the car because of the tint. When I told him that you can in fact see the inside very well, he shut the book and said "It's just illegal, and in fact we can have someone give you a ticket for it right now." Well, won't argue with that..."

_sci.med_:

  "I do not have enough medical expertise to have much of an opinion one way or another on hidden candida infections. I can understand the skepticism of those who see this associated with various general kinds of symptoms, while there is a lack of solid demonstration that this happens and causes such general symptoms. (To understand this skepticism, one only needs to know of past failures that shared these characteristics with the notion of hidden candida infection. There have been quite a few, and the proponents of all thought that the skeptics were overly skeptical.) On the other hand, I am happy to read that some people are sufficiently interested in this possibility, spurred by suggestive clinical experience, to research it further. The doubters may be surprised. (It has happened before.)"
Let's examine two samples from the set. The first is a post from _rec.autos_, the second one from _sci.med_. If we didn't know the respective category of the two posts, we would still be able to roughly categorize them, e.g. the first post is about car-specific inspection rules, thus C1 = cars, whereas the second post is about a nasty (better not google it!) infection, thus C2 = med. Both texts are clearly separable, although they share many words, like 'I', 'that', 'it'. These words are _noise_, because all texts make use of them, so they scarcely contribute to the actual context. The difficulty in language processing lies in all the irregularities with respect to the words that do form the context, for instance using the word 'medical' in totally different contexts. We need to ensure that our net can grasp these subtle differences. The NLP community has different models for this kind of problem. The one that works well for our scenario is the well-known _word2vec Skip-gram_ model, a shallow neural net that computes similarities between words, depending on their context, using a clever one-hot encoding scheme and a linear bottleneck to produce results. It was ultimately refined by Mikolov et al.2 If you are interested in the details, I suggest reading their paper.

The key idea is to express a word as a vector of dimension _N_. Now, the question is: if words are vectors, how can we express their _meaning_, their _context_, in terms of numbers? If we simply count and index all words used in the data set, we get a numerical representation that lets us compute a fingerprint of a given text corpus, but these numbers carry little interdependence, so we might not get natural results if we used them for training. A word alone may come with infinite meaning; words only make sense when they stand in relation to other words. Let's express this in vector notation:
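
As a small illustration (with a toy vocabulary of _V = 5_ words, chosen only for this example), each word is first assigned a one-hot vector of the vocabulary's dimension, and the context window around the word 'brown' is simply the sum of its neighbours' vectors:

  e_{\text{the}} = (1,0,0,0,0), \quad e_{\text{quick}} = (0,1,0,0,0), \quad e_{\text{brown}} = (0,0,1,0,0), \quad e_{\text{fox}} = (0,0,0,1,0), \quad e_{\text{jumps}} = (0,0,0,0,1)

  \text{context}(\text{brown}) = e_{\text{the}} + e_{\text{quick}} + e_{\text{fox}} + e_{\text{jumps}} = (1, 1, 0, 1, 1)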

The size of the context window is 4 here. The larger the window, the larger the scope of a word. There will be stronger and weaker interdependences among some of these words, since word vectors can be referenced in many different contexts, e.g. the same words are used in different idioms and phrases.

Further, if we have a vocabulary of _V_ distinct words, we assign to each word a unique one-hot vector of dimension _V_, so a context window can be expressed as one summed-up vector, as sketched in the example above. This context window is then slid over all lines, yielding the training targets. So, what is the actual feature here? It is the unique identity of a word. Words with little interdependence only 'know their place' because they know what they are not. Look, we found symmetry: in fact, all words are mutually dependent within our vector space, and their distance is a measure of similarity.
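
To make the sliding window concrete, here is a minimal Scala sketch (the sentence, tokenizer and window radius are toy choices for illustration, not the actual word2vec preprocessing) that produces one summed one-hot context vector per centre word:

  // Sketch: summed one-hot context vectors for a window of size 4 (2 words on each side).
  val tokens = "i bought a car out of state".split(" ").toSeq
  val vocab  = tokens.distinct.zipWithIndex.toMap          // word -> index
  val V      = vocab.size

  def oneHot(word: String): Vector[Double] =
    Vector.tabulate(V)(i => if (i == vocab(word)) 1.0 else 0.0)

  val windows = tokens.indices.map { i =>
    // up to 2 neighbours on each side, excluding the centre word itself
    val context = (math.max(0, i - 2) until math.min(tokens.size, i + 3)).filter(_ != i).map(tokens)
    val summed  = context.map(oneHot).reduce((l, r) => l.zip(r).map { case (a, b) => a + b })
    tokens(i) -> summed
  }

  windows.foreach { case (centre, vec) => println(s"$centre -> ${vec.mkString(" ")}") }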

To meet all expectations, we need to minimize the degree of contrariness of our mathematical expression, where _E_ is some error function measuring the distance between predicted and actually sampled context windows. Since words are used in many different contexts, finding the optimal solution will inevitably lead to a compromise for large texts.
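
For the Skip-gram model of Mikolov et al.2, this error can be written as the negative average log probability of the context words given each centre word, with the probability defined via a softmax over the vocabulary:

  E = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t), \qquad p(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_w}^{\top} v_{w_I}\right)}

Here _T_ is the number of training words, _c_ the size of the context window, _V_ the vocabulary size, and v and v' are the 'input' and 'output' vector representations of a word (notation as in the paper).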

The outcome is that words frequently used in similar contexts will huddle together, whereas words that don't share common contexts end up far apart within our space. This plot shows the visual representation of all words within our data set for _N = 3_. Note that in practice, a much higher dimension (20, 200, ...) is typically used for the word vectors. But, humans as we are, plotting the third dimension explains the principle better than a thousand words. Two more cool things. One is the astonishing fact that we can perform simple algebraic operations, e.g. adding and subtracting word vectors, to solve equations like "Audi is to Car as Aspirin is to ???". The other is that Mikolov et al. provide a very efficient implementation in C, which we will gratefully use for feature extraction.
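
As a sketch of such vector arithmetic (the lookup words are illustrative, the nearest neighbour naturally depends on the trained model, and `dict` is assumed here to be a plain `Map[String, Array[Double]]` from word to vector), one could answer the analogy with a cosine-similarity search:

  // Sketch: solve "a is to b as c is to ?" via vector arithmetic and cosine similarity.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norm
  }

  def analogy(dict: Map[String, Array[Double]], a: String, b: String, c: String): Option[String] =
    for { va <- dict.get(a); vb <- dict.get(b); vc <- dict.get(c) } yield {
      val target = vb.indices.map(i => vb(i) - va(i) + vc(i)).toArray   // target ≈ v(b) - v(a) + v(c)
      dict.filter { case (w, _) => w != a && w != b && w != c }
          .maxBy { case (_, v) => cosine(target, v) }
          ._1
    }

  // e.g. analogy(dict, "audi", "car", "aspirin") might land near a medical term, depending on the model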

Training the Net

Note that we focus on messages from the categories _cars_ and _med_, since we want to keep the overall training time short for this article. To build our vocabulary, we first need to compile the C implementation locally and then merge all individual text files of our data set into one huge text file, because _word2vec_ works on a single file. Let's build our feature model:

  bin/word2vec -train all.txt -output all-vec.txt -debug 2 -size 20 -window 10 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0
  
We set the word vector dimension to _N = 20_ and the window size to _cw = 10_, plus some other parameters. The output _all-vec.txt_ will give us a dictionary mapping from word to vector representation:

  the -0.127186 0.277165 -0.027995 -0.057839 0.131760 -0.279101 -0.412328 -0.299498 0.064663 -0.325453 0.196605 ...
  a -0.052591 0.278205 0.023183 -0.106372 0.146495 -0.280094 -0.375689 -0.279042 0.076930 -0.291453 0.167866 ...
  to -0.071252 0.301636 0.012357 -0.089838 0.157550 -0.289630 -0.425848 -0.288347 0.048744 -0.313050 0.204709 ...
  ...
  
Now is a good moment to enter Scala-land, so let's import everything we need to build our net architecture:

  import neuroflow.application.plugin.IO._
  import neuroflow.application.plugin.Style._
  import neuroflow.application.processor.Util._
  import neuroflow.core.Activator._
  import neuroflow.core._
  import neuroflow.dsl._
  import neuroflow.nets.cpu.DenseNetwork._
  
For the sake of completeness, let's briefly introduce the environment and all helpers we need:

  val netFile = "/Users/felix/github/unversioned/ct.nf"
  val maxSamples = 100
  val dict = word2vec(getResourceFile("file/newsgroup/all-vec.txt"))

  def readAll(dir: String, max: Int = maxSamples, offset: Int = 0) =
    getResourceFiles(dir).drop(offset).take(max).map(scala.io.Source.fromFile)
      .flatMap(bs => try { Some(strip(bs.mkString)) } catch { case _: Throwable => None })

  def readSingle(file: String) = Seq(strip(scala.io.Source.fromFile(getResourceFile(file)).mkString))

  def strip(s: String) = s.replaceAll("[^a-zA-Z ]+", "").toLowerCase

  def normalize(xs: Seq[String]): Seq[Seq[String]] = xs.map(_.split(" ").distinct.toSeq)

  def vectorize(xs: Seq[Seq[String]]) = xs.map(_.flatMap(dict.get)).map { v =>
    val vs = v.reduce((l, r) => l.zip(r).map(l => l._1 + l._2))
    val n = v.size.toDouble
    vs.map(_ / n)
  }
  
Here, _netFile_ is the place where we store our trained net. With _maxSamples_ we limit the training set to 100 messages per category. The _dict_ is our recently created word2vec model loaded into the JVM heap. The functions _readAll_ and _readSingle_ take care of loading the message files and _strip_ their content, because we are interested in plain words without special characters. The function _normalize_ maps a set of texts to a set of respective word sets. The function _vectorize_ tries to find the vector representations for the words of a given word set. If we have a closer look at _vectorize_, we notice that a little more happens than just finding the corresponding vectors. Our model assigns a vector to each word, so one single message file will contain _k_ vectors of dimension _N = 20_, where _k_ is the word count. This is a very interesting perspective on the whole Big Data discussion, since the _curse of dimensionality_ can turn even small data into big data quite quickly.
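
To make the file format concrete: conceptually, all the word2vec loader has to do is turn each line of _all-vec.txt_ into a word plus its _N_ components. A minimal hand-rolled sketch of such a loader (not NeuroFlow's actual implementation; the header line that the word2vec text format may carry is skipped by requiring more than one numeric component) could look like this:

  import scala.io.Source
  import scala.util.Try

  // Sketch: parse a word2vec text model ("word c1 c2 ... cN" per line) into a map.
  def loadVectors(path: String): Map[String, Array[Double]] =
    Source.fromFile(path).getLines()
      .map(_.trim.split("\\s+").toList)
      .collect {
        // keep only lines that look like a word followed by at least two numeric components
        case word :: components if components.size > 1 &&
          components.forall(c => Try(c.toDouble).isSuccess) =>
          word -> components.map(_.toDouble).toArray
      }.toMap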


The mean word of a text, simplified in ℝ²

What can we do to reduce dimensionality and keep the training phase short? Well, we can't reduce the dimensionality itself, because we just defined it to be 20, but we can significantly reduce the number of training samples. For any given text corpus _t_ from, let's say, class _C = cars_ with _k_ word vectors, we can either train the net on all individual words, which will take _a long_ time, or we simply compute the average (or mean) of all _k_ word vectors of _t_ in advance and train on this average instead. This is a very elegant trick, since we would have to average the prediction results of a given text in any case, because our net is trained on words, not paragraphs (of course, we could average over paragraphs). This way we perform the averaging of word vectors _before_ training our net, instead of training our net on all word vectors and subsequently averaging the results. It turns out that this works really well, because we have a substantial amount of data. Isn't it an exciting perspective which seems to arise naturally when we treat words as numbers? I mean, we can calculate the average word of a whole text, even if that word does not actually exist!
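
In formula form, for a text _t_ whose words map to vectors v_1, ..., v_k, the training input is simply the mean vector:

  \bar{v}(t) = \frac{1}{k} \sum_{i=1}^{k} v_i, \qquad v_i \in \mathbb{R}^{20}

which is exactly what the _vectorize_ helper above computes.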

  val cars = normalize(readAll("file/newsgroup/cars/"))
  val med = normalize(readAll("file/newsgroup/med/"))

  val trainCars = vectorize(cars).map((_, ->(1.0, 0.0)))
  val trainMed = vectorize(med).map((_, ->(0.0, 1.0)))

  val allTrain = trainCars ++ trainMed

  println("No. of samples: " + allTrain.size)

  val L =
            Vector  (20)           ::
            Dense   (40, Tanh)     ::
            Dense   (40, Tanh)     ::
            Dense   (2, Sigmoid)   ::   SoftmaxLogEntropy()

  val net = Network(
    layout = L,
    settings = Settings[Double](iterations = 15000, learningRate = { case _ => 1E-4 })
  )

  net.train(allTrain.map(_._1), allTrain.map(_._2))

  File.write(net, netFile)
  
Our layout is of kind _[20, 40, 40, 2]_: we need 20 input neurons for a word vector of dimension _N = 20_, two hidden layers with 40 neurons each to store the knowledge, and 2 output neurons for the classes _C = { cars, med }_. Because the word2vec components span the full range [-1, 1], we work with the _Tanh_ activator in the first layers, mapping to [0, 1] with _Sigmoid_ in the output layer, trained with the _Softmax_ log-entropy loss.
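
For reference, the two activators and the ranges they map to (which is why _Tanh_ matches the word vector components and _Sigmoid_ the class probabilities):

  \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \in (-1, 1), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \in (0, 1)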



               _   __                      ________
              / | / /__  __  ___________  / ____/ /___ _      __
             /  |/ / _ \/ / / / ___/ __ \/ /_  / / __ \ | /| / /
            / /|  /  __/ /_/ / /  / /_/ / __/ / / /_/ / |/ |/ /
           /_/ |_/\___/\__,_/_/   \____/_/   /_/\____/|__/|__/   
                                                              1.5.7


              Network : neuroflow.nets.cpu.DenseNetwork

              Weights : 2.480 (≈ 0,0189209 MB)
            Precision : Double

                 Loss : neuroflow.core.SoftmaxLogEntropy
               Update : neuroflow.core.Vanilla

               Layout : 20 Vector
                        40 Dense (φ)
                        40 Dense (φ)
                        2 Dense (σ)
                      




      
                 O     O      
                 O     O      
           O     O     O      
           O     O     O      
           O     O     O     O
           O     O     O     O
           O     O     O      
           O     O     O      
                 O     O      
                 O     O      



  [run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:567] Training with 199 samples, batch size = 199, batches = 1.
  [run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:635] Breeding batches ...
  Mär 09, 2018 11:24:02 AM com.github.fommil.jni.JniLoader liberalLoad
  INFORMATION: successfully loaded /var/folders/t_/plj660gn6ps0546vj6xtx92m0000gn/T/jniloader5703846428977288045netlib-native_system-osx-x86_64.jnilib
  [run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:868] Iteration 1.1, Avg. Loss = 78,3524, Vector: 121.89397416123774  34.81072677614606  
  [run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:881] Iteration 2.1, Avg. Loss = 78,2962, Vector: 121.70090635111697  34.891518522053396  
  [run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:888] Iteration 3.1, Avg. Loss = 78,2376, Vector: 121.49919197504813  34.97606651932853  
  ...
  
To check the generality of our model, we feed it samples we did not use for training (see _offset_) and check their recognition and error rates:

  val cars = normalize(readAll("file/newsgroup/cars/", offset = maxSamples, max = maxSamples))
  val med = normalize(readAll("file/newsgroup/med/", offset = maxSamples, max = maxSamples))

  val testCars = vectorize(cars)
  val testMed = vectorize(med)

  def eval(id: String, maxIndex: Int, xs: Seq[Seq[Double]]) = {
    val (ok, fail) = xs.map(net.evaluate).map(k => k.indexOf(k.max) == maxIndex).partition(l => l)
    println(s"Correctly classified $id: ${ok.size.toDouble / (ok.size.toDouble + fail.size.toDouble) * 100.0} % !")
  }

  eval("cars", 0, testCars)
  eval("med", 1, testMed)
  
Crossing fingers ...

  Correctly classified cars: 98.98 % !
  Correctly classified med: 97.0 % !
  
To me, this is a good recognition rate, taking into consideration that we lost information by averaging over word vectors _(maybe this is even the reason for the good rate, who knows?)_. As a last point, let's see if our net would make a good news crawler, correctly categorizing the news we saw at the very beginning of this article. We paste the respective text into _free.txt_ to evaluate it.

  val free = normalize(readSingle("file/newsgroup/free.txt"))
  val testFree = vectorize(free)

  testFree.map(net.evaluate).foreach(k => println(s"Free classified as: ${if (k.indexOf(k.max) == 0) "cars" else "med"}"))
  
Now, let's check the pharma text:

  "They say there’s no such thing as a free lunch, and for doctors fed by drug companies, the old adage might be true. Even cheap meals provided by pharmaceutical sales reps were associated with higher prescription rates for the brands being promoted, a new JAMA Internal Medicine study concluded.

  Researchers delved into data from the U.S. Open Payments database, mandated by the Sunshine Act, and matched it with Medicare Part D prescribing information on more than 276,000 doctors. They focused specifically on meals, rather than speaking fees or research payments.

  They also focused on four on-patent drugs from four different drug classes, three treating cardiovascular issues and one antidepressant. AstraZeneca’s Crestor represented statin meds; Forest Laboratories’ Bystolic represented beta blockers; Daiichi Sankyo’s Benicar, ACE inhibitors and angiotensin receptor blockers; and Pfizer’s Pristiq, SSRI and SNRI antidepressants."

  => Free classified as: med
  Source: http://www.fiercepharma.com/pharma/hey-doc-about-about-a-burger-a-side-branded-drug-scripts
  
Yep, correctly classified! How about the Tesla story:

  "For the budget-conscious Tesla fans who can't wait for a Model 3, a cheaper version of the Model S is now available. Tesla said it has released a new Model S 60 Thursday which starts at $66,000. An all-wheel drive version starts at $71,000.The '60' in the Model S name refers to the battery size, which determines the range of the car on a single charge.

  The new Model S 60 is estimated to go 210 miles on a single charge, which is a bit less than the range of the previous cheapest Model S. A slightly larger battery will let the '60' go 249 miles, but it'll cost you at least $74,500.

  Even the $66,000 starting price still puts it firmly in luxury car territory, even though it is about $23,500 cheaper than the next cheapest version of the Model S -- the longer-range, all wheel drive Model S 90D that starts at $89,500.

  The Model X, the Tesla crossover which debuted last year, has an $83,000 starting price."

  => Free classified as: cars
  Source: http://money.cnn.com/2016/06/10/autos/cheaper-tesla-model-s/index.html
  
Yep, the result is as expected for this one, too.

Final Thoughts

If we want to use Big Data to better understand how humans think and move, we need to find the right expression of data in terms of numbers. If we numerically graph the world, patterns emerge, and we can use deep learning techniques to teach these patterns to a computer, so it can take decisions, just as we humans do, every single day. Thanks for reading.
1: The 20 Newsgroups data set. Downloaded in June 2016 from http://qwone.com/%7Ejason/20Newsgroups
2: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. Downloaded in June 2016 from http://arxiv.org/pdf/1301.3781.pdf