The classification should be sufficiently precise, yielding reasonable results for us humans. Further, it should abstract over writing style, since the internet is used by all kinds of people. Unsurprisingly, a well-trained artificial neural network is the tool of choice to achieve this: ANNs are strong universal approximators that naturally cope with non-linear shapes, and since humans are mostly non-linear thinkers, they are well suited to learning the subtle differences in the way we express ourselves through language. I wrote several articles with examples using ANNs, so if you are new to the topic, you may want to read this introduction first. We will not stop at the theory; we will actually code this classifier in Scala using the lightweight library NeuroFlow. Everything is open source, so feel free to check the code and data on GitHub.
rec.autos:
"The story is this: I bought a car out of state, and I'm trying to get the safety inspection in Pennsylvania. The problem is that the car has aftermarket tint on all windows except the windshield. The tint is rather weak, and you can clearly see the inside of the car through the tint. The inspection garage said that they won't pass it unless I get a waiver from the state police. So I went to the state police - the officer told me that aftermarket tint is illegal, and I can get a waiver only for a pre-84 car or for a medical reason. I asked him to show me the section of the vehicle code that says it's illegal. He showed it to me and the paraghaph said that you can't have tint, if you can't see the inside of the car because of the tint. When I told him that you can in fact see the inside very well, he shut the book and said "It's just illegal, and in fact we can have someone give you a ticket for it right now." Well, won't argue with that..."

sci.med:
"I do not have enough medical expertise to have much of an opinion one way or another on hidden candida infections. I can understand the skepticism of those who see this associated with various general kinds of symptoms, while there is a lack of solid demonstration that this happens and causes such general symptoms. (To understand this skepticism, one only needs to know of past failures that shared these characteristics with the notion of hidden candida infection. There have been quite a few, and the proponents of all thought that the skeptics were overly skeptical.) On the other hand, I am happy to read that some people are sufficiently interested in this possibility, spurred by suggestive clinical experience, to research it further. The doubters may be surprised. (It has happened before.)"
The key idea is to express a word as a vector of dimension _N_. Now, if words are vectors, how can we express their _meaning_, their _context_, in terms of numbers? If we simply count and index all words used in the data set, we get a numerical representation to compute a fingerprint of a given text corpus, but these numbers poorly interdepend, so we might not get natural results if we used them for training. A word alone may carry many meanings, so words only make sense if they stand in relation to other words. Let's express this in vector notation, using a context window of size 4: the larger the window, the larger the scope of a word. There will be greater and lesser interdependences among some of these words, since word vectors can be referenced in many different contexts, e.g. the same words used in different idioms and phrases. Further, if we have a vocabulary of _V_ distinct words, we assign to each word a unique hot vector of dimension _V_, determined by the distinct word count of the vocabulary, so a context window can be expressed as one summed-up vector (see the sketch below). This context window is then slid over all lines, yielding the training samples. So, what is the actual feature here? It is the unique identity of a word. Words with little interdependence only 'know their place' because they know what they are not. Look, we found symmetry: in fact, all words are mutually dependent within our vector space, and their distance is a measure of similarity. To meet all expectations, we need to minimize an error function _E_ measuring the distance between all sampled context windows. Since words are used in many different contexts, finding the optimal solution will inevitably lead to a compromise for large texts.
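To make the hot vector idea a bit more concrete, here is a minimal, purely illustrative Scala sketch. It is not the actual word2vec implementation, just a toy: the tiny corpus and all names are made up.

// Toy sketch: index a tiny vocabulary and express a context window as one summed-up hot vector.
object ContextWindowSketch extends App {

  val corpus = Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")

  // Assign each distinct word a unique index; the hot vector dimension is the vocabulary size V.
  val vocab: Map[String, Int] = corpus.distinct.zipWithIndex.toMap
  val V = vocab.size

  // A hot vector has a single 1.0 at the word's index and 0.0 everywhere else.
  def hot(word: String): Vector[Double] =
    Vector.tabulate(V)(i => if (i == vocab(word)) 1.0 else 0.0)

  // The context window of size 4 around position i, expressed as one summed-up vector.
  def window(i: Int, size: Int = 4): Vector[Double] =
    (i - size / 2 to i + size / 2)
      .filter(j => j != i && j >= 0 && j < corpus.size)
      .map(j => hot(corpus(j)))
      .foldLeft(Vector.fill(V)(0.0))((acc, v) => acc.zip(v).map { case (a, b) => a + b })

  println(window(3))  // the summed hot vector of the words surrounding "fox"
}

Word2vec learns the dense _N_-dimensional word vectors from samples like these; we will not implement that part ourselves, but use the original C tool below.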
The outcome is that words used frequently in similar contexts huddle together, whereas words that don't share common contexts end up far away from each other within our space. This plot shows the visual representation of all words within our data set for _N = 3_. Note that in practice a much higher dimension (20, 200, ...) is typically used for the word vectors, but, humans as we are, plotting the third dimension explains the principle better than a thousand words. Two more cool things. One is the astonishing fact that we can perform simple algebraic operations on word vectors, e.g. adding and subtracting them, to solve analogies like "Audi is to Car as Aspirin is to ???" (a small sketch of this follows after the dictionary excerpt below). The other is that Mikolov et al. provide a very efficient implementation in C, which we will gratefully use for feature extraction.
bin/word2vec -train all.txt -output all-vec.txt -debug 2 -size 20 -window 10 -sample 1e-4 -negative 5 -hs 0 -binary 0 -cbow 0
We set the word vector dimension to _N = 20_ and the window size to _cw = 10_, plus some other parameters. The output _all-vec.txt_ will give us a dictionary mapping from word to vector representation:
the -0.127186 0.277165 -0.027995 -0.057839 0.131760 -0.279101 -0.412328 -0.299498 0.064663 -0.325453 0.196605 ...
a -0.052591 0.278205 0.023183 -0.106372 0.146495 -0.280094 -0.375689 -0.279042 0.076930 -0.291453 0.167866 ...
to -0.071252 0.301636 0.012357 -0.089838 0.157550 -0.289630 -0.425848 -0.288347 0.048744 -0.313050 0.204709 ...
...
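As a quick aside, here is what the analogy arithmetic mentioned above could look like once such a dictionary is loaded into memory. This is a purely illustrative sketch, not part of the classifier we build below: it assumes the vectors live in a plain Map[String, Array[Double]], and all names are made up.

object AnalogySketch {

  type WordVec = Array[Double]

  def add(a: WordVec, b: WordVec): WordVec      = a.zip(b).map { case (x, y) => x + y }
  def subtract(a: WordVec, b: WordVec): WordVec = a.zip(b).map { case (x, y) => x - y }

  // Cosine similarity as a simple measure of closeness in the word vector space.
  def cosine(a: WordVec, b: WordVec): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = (v: WordVec) => math.sqrt(v.map(x => x * x).sum)
    dot / (norm(a) * norm(b))
  }

  // "a is to b as c is to ???": compute (b - a) + c and return the nearest word by cosine similarity.
  // e.g. analogy(dict, "audi", "car", "aspirin") might return something like Some("drug").
  def analogy(dict: Map[String, WordVec], a: String, b: String, c: String): Option[String] =
    for {
      va <- dict.get(a)
      vb <- dict.get(b)
      vc <- dict.get(c)
    } yield {
      val target = add(subtract(vb, va), vc)
      (dict - a - b - c).maxBy { case (_, v) => cosine(target, v) }._1
    }
}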
Now is a good moment to enter Scala-land, so let's import everything we need to build our net architecture:
import neuroflow.application.plugin.IO._
import neuroflow.application.plugin.Style._
import neuroflow.application.processor.Util._
import neuroflow.core.Activator._
import neuroflow.core._
import neuroflow.dsl._
import neuroflow.nets.cpu.DenseNetwork._
For the sake of completeness, let's briefly introduce the environment and all helpers we need:
val netFile = "/Users/felix/github/unversioned/ct.nf"
val maxSamples = 100
val dict = word2vec(getResourceFile("file/newsgroup/all-vec.txt"))
// Read up to `max` stripped messages from a resource directory, skipping the first `offset` files.
def readAll(dir: String, max: Int = maxSamples, offset: Int = 0) =
  getResourceFiles(dir).drop(offset).take(max).map(scala.io.Source.fromFile)
    .flatMap(bs => try { Some(strip(bs.mkString)) } catch { case _ => None })

// Read and strip a single message file.
def readSingle(file: String) = Seq(strip(scala.io.Source.fromFile(getResourceFile(file)).mkString))

// Keep plain letters and spaces only, lowercased.
def strip(s: String) = s.replaceAll("[^a-zA-Z ]+", "").toLowerCase

// Map each text to its distinct words.
def normalize(xs: Seq[String]): Seq[Seq[String]] = xs.map(_.split(" ").distinct.toSeq)

// Look up the word vectors of each text and average them into one vector per text.
def vectorize(xs: Seq[Seq[String]]) = xs.map(_.flatMap(dict.get)).map { v =>
  val vs = v.reduce((l, r) => l.zip(r).map(l => l._1 + l._2))
  val n = v.size.toDouble
  vs.map(_ / n)
}
Here, _netFile_ is the place where we store our trained net. With _maxSamples_ we limit the training set to 100 messages per category. The _dict_ is our recently created word2vec model loaded into the JVM heap. The functions _readAll_ and _readSingle_ take care of loading message files and _strip_ their content, because we are interested in plain words without illegal characters. The function _normalize_ maps a set of texts to a set of respective word sets. The function _vectorize_ tries to find the vector representations for the words of a given word set. If we have a closer look at _vectorize_, we notice that a little more happens than just finding the corresponding vectors. Our model assigns a vector to each word, so one single message file contains _k_ vectors of dimension _N = 20_, where _k_ is the word count. This is a very interesting perspective concerning the whole Big Data discussion, since the _curse of dimensionality_ can turn even small data into big data quite quickly. To keep the input size fixed, _vectorize_ therefore sums these _k_ vectors component-wise and divides by _k_, i.e. it computes the average of all word vectors, the 'mean word' of the message.
_The mean word of a text, simplified in ℝ²_
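To see the averaging on a toy scale, here is a tiny, hypothetical ℝ² example (the numbers and words are made up) of how the word vectors of a very short message collapse into one mean word, exactly what _vectorize_ does in _N = 20_ dimensions:

// Hypothetical ℝ² example of the mean word of a message.
val words = Seq(
  Seq(0.8, -0.2),  // e.g. "car"
  Seq(0.6,  0.1),  // e.g. "engine"
  Seq(0.7, -0.3)   // e.g. "tires"
)

val sum  = words.reduce((l, r) => l.zip(r).map { case (a, b) => a + b })  // Seq(2.1, -0.4)
val mean = sum.map(_ / words.size)                                        // ≈ Seq(0.7, -0.133), the mean word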
val cars = normalize(readAll("file/newsgroup/cars/"))
val med = normalize(readAll("file/newsgroup/med/"))
val trainCars = vectorize(cars).map((_, ->(1.0, 0.0)))
val trainMed = vectorize(med).map((_, ->(0.0, 1.0)))
val allTrain = trainCars ++ trainMed
println("No. of samples: " + allTrain.size)
val L =
Vector (20) ::
Dense (40, Tanh) ::
Dense (40, Tanh) ::
Dense (2, Sigmoid) :: SoftmaxLogEntropy()
val net = Network(
layout = L,
settings = Settings[Double](iterations = 15000, learningRate = { case _ => 1E-4 })
)
net.train(allTrain.map(_._1), allTrain.map(_._2))
File.write(net, netFile)
Our layout is of kind _[20, 40, 40, 2]_: we need 20 input neurons for the mean word vector of dimension _N = 20_, two hidden layers with 40 neurons each to store the knowledge, and 2 output neurons for the classes _C = { cars, med }_. Because the word2vec vectors range over the full interval [-1, 1], we work with the _Tanh_ activator in the first layers, mapping to [0, 1] with _Sigmoid_ and _Softmax_ in the last layers.
_ __ ________
/ | / /__ __ ___________ / ____/ /___ _ __
/ |/ / _ \/ / / / ___/ __ \/ /_ / / __ \ | /| / /
/ /| / __/ /_/ / / / /_/ / __/ / / /_/ / |/ |/ /
/_/ |_/\___/\__,_/_/ \____/_/ /_/\____/|__/|__/
1.5.7
Network : neuroflow.nets.cpu.DenseNetwork
Weights : 2.480 (≈ 0,0189209 MB)
Precision : Double
Loss : neuroflow.core.SoftmaxLogEntropy
Update : neuroflow.core.Vanilla
Layout : 20 Vector
40 Dense (φ)
40 Dense (φ)
2 Dense (σ)
O O
O O
O O O
O O O
O O O O
O O O O
O O O
O O O
O O
O O
[run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:567] Training with 199 samples, batch size = 199, batches = 1.
[run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:635] Breeding batches ...
Mär 09, 2018 11:24:02 AM com.github.fommil.jni.JniLoader liberalLoad
INFORMATION: successfully loaded /var/folders/t_/plj660gn6ps0546vj6xtx92m0000gn/T/jniloader5703846428977288045netlib-native_system-osx-x86_64.jnilib
[run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:868] Iteration 1.1, Avg. Loss = 78,3524, Vector: 121.89397416123774 34.81072677614606
[run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:881] Iteration 2.1, Avg. Loss = 78,2962, Vector: 121.70090635111697 34.891518522053396
[run-main-0] INFO neuroflow.nets.cpu.DenseNetworkDouble - [09.03.2018 11:24:02:888] Iteration 3.1, Avg. Loss = 78,2376, Vector: 121.49919197504813 34.97606651932853
...
To check the generality of our model, we feed it with samples we did not use for training (see _offset_) and check their recognition and error rates:
val cars = normalize(readAll("file/newsgroup/cars/", offset = maxSamples, max = maxSamples))
val med = normalize(readAll("file/newsgroup/med/", offset = maxSamples, max = maxSamples))
val testCars = vectorize(cars)
val testMed = vectorize(med)
def eval(id: String, maxIndex: Int, xs: Seq[Seq[Double]]) = {
  val (ok, fail) = xs.map(net.evaluate).map(k => k.indexOf(k.max) == maxIndex).partition(l => l)
  println(s"Correctly classified $id: ${ok.size.toDouble / (ok.size.toDouble + fail.size.toDouble) * 100.0} % !")
}
eval("cars", 0, testCars)
eval("med", 1, testMed)
Crossing fingers ...
Correctly classified cars: 98.98 % !
Correctly classified med: 97.0 % !
To me, this is a good recognition rate, taking into consideration that we lost information by averaging over word vectors _(maybe this is even the reason for the good rate, who knows?)_. As a last point, let's see if our net would make a good news crawler, correctly categorizing the news we saw at the very beginning of this article. We paste the respective text into _free.txt_ to evaluate it.
val free = normalize(readSingle("file/newsgroup/free.txt"))
val testFree = vectorize(free)
testFree.map(net.evaluate).foreach(k => println(s"Free classified as: ${if (k.indexOf(k.max) == 0) "cars" else "med"}"))
Now, let's check the pharma text:
"They say there’s no such thing as a free lunch, and for doctors fed by drug companies, the old adage might be true. Even cheap meals provided by pharmaceutical sales reps were associated with higher prescription rates for the brands being promoted, a new JAMA Internal Medicine study concluded.
Researchers delved into data from the U.S. Open Payments database, mandated by the Sunshine Act, and matched it with Medicare Part D prescribing information on more than 276,000 doctors. They focused specifically on meals, rather than speaking fees or research payments.
They also focused on four on-patent drugs from four different drug classes, three treating cardiovascular issues and one antidepressant. AstraZeneca’s Crestor represented statin meds; Forest Laboratories’ Bystolic represented beta blockers; Daiichi Sankyo’s Benicar, ACE inhibitors and angiotensin receptor blockers; and Pfizer’s Pristiq, SSRI and SNRI antidepressants."
=> Free classified as: med
Source: http://www.fiercepharma.com/pharma/hey-doc-about-about-a-burger-a-side-branded-drug-scripts
Yep, correctly classified! How about the Tesla story:
"For the budget-conscious Tesla fans who can't wait for a Model 3, a cheaper version of the Model S is now available. Tesla said it has released a new Model S 60 Thursday which starts at $66,000. An all-wheel drive version starts at $71,000.The '60' in the Model S name refers to the battery size, which determines the range of the car on a single charge.
The new Model S 60 is estimated to go 210 miles on a single charge, which is a bit less than the range of the previous cheapest Model S. A slightly larger battery will let the '60' go 249 miles, but it'll cost you at least $74,500.
Even the $66,000 starting price still puts it firmly in luxury car territory, even though it is about $23,500 cheaper than the next cheapest version of the Model S -- the longer-range, all wheel drive Model S 90D that starts at $89,500.
The Model X, the Tesla crossover which debuted last year, has an $83,000 starting price."
=> Free classified as: cars
Source: http://money.cnn.com/2016/06/10/autos/cheaper-tesla-model-s/index.html
Yep, the result is as expected for this one, too.