Computers read newspapers too.

Staying up to date has always been the bread and butter of any media company. In the face of the flood of data that nowadays inundates the world’s news networks and social media channels, computer scientists have entered the field and are working to develop machine-reading software that can read up on the stories for which human researchers do not have time.

Pioneering research into the use of machine reading in the media is being conducted at the London Media Technology Campus (LMTC), the main base of operations for the strategic partnership formed between the BBC and University College London (UCL) in 2014. The new research centre has also given a new home to UCL’s machine reading group, led by the German machine reading expert Dr. Sebastian Riedel. He states that models using artificial neural networks made an impressive comeback by winning several machine translation contests in 2014 with almost no annotated training data. The entire field of natural language processing (NLP) has been going through a revival ever since.

Machine Reading

Machine reading—also referred to as natural language understanding—is a branch of NLP that concerns itself with how textual information can be transformed into more abstract data representations that computers can manipulate more easily. In earlier times, the standard approach to machine reading was similar to a computational assembly line. Computer scientists would write down a set of predefined rules as to how a language functions according to its syntax, grammar, and semantics. These rules would then be used by the software to try to establish meaning. However, the problem with this approach is that languages are incredibly complex systems with a myriad of ambiguities, making it next to impossible to pre-define all of the rules.
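To get a feel for how brittle this was, here is a minimal sketch of the rule-based approach using the open-source NLTK library; the toy grammar and example sentence are purely illustrative. The parser can only handle sentences that happen to match its hand-written rules, and everything else simply fails.

```python
# A toy illustration of the rule-based approach: a tiny hand-written
# grammar that a chart parser uses to derive sentence structure.
# The grammar and sentence are illustrative only; real systems needed
# thousands of such rules and still struggled with ambiguity.
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | N
VP -> V NP
Det -> 'the' | 'a'
N  -> 'broadcaster' | 'story'
V  -> 'covers'
""")

parser = nltk.ChartParser(grammar)
sentence = "the broadcaster covers a story".split()

# Every parse tree licensed by the hand-written rules is printed;
# words or constructions outside the grammar simply fail to parse.
for tree in parser.parse(sentence):
    print(tree)
```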

Modern machine reading has therefore increasingly turned towards statistical methods that ‘learn’ the language in a more natural way. This is done by computationally examining a large volume of sample texts (called corpora) as ‘training data’ to establish statistical relationships between the elements of sentences (words, parts of speech, etc.). Statistical models that achieve this can generally be classified as either supervised or unsupervised learning models. Supervised models require so-called annotated corpora – essentially pre-chewed texts in which humans have already supplied the answers the model is expected to produce. While this approach is perfectly valid, its biggest flaw is that improving the models and expanding their capabilities requires ever more annotated training data, and creating these datasets takes a lot of time.
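To make the idea of ‘pre-chewed’ text concrete, the snippet below sketches a tiny annotated corpus; the sentences and part-of-speech tags are invented for illustration, but real annotated corpora look much the same, only thousands of times larger.

```python
# A tiny, invented example of an annotated corpus for supervised learning:
# each token is paired with the answer (here, a part-of-speech tag) that a
# human annotator has already supplied and that the model must learn to predict.
annotated_corpus = [
    [("Computers", "NOUN"), ("read", "VERB"), ("newspapers", "NOUN"), ("too", "ADV")],
    [("The", "DET"), ("BBC", "PROPN"), ("reports", "VERB"), ("the", "DET"), ("news", "NOUN")],
]

# Producing thousands of such hand-labelled sentences is what makes
# supervised training data so expensive to create.
for sentence in annotated_corpus:
    print(" ".join(f"{word}/{tag}" for word, tag in sentence))
```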

Unsupervised learning models (like neural networks) are the big winners behind the aforementioned advances in machine translation. While these models are inherently less accurate than supervised methods given the same amount of training data, their great advantage is that they can also be trained on non-annotated data. This is what makes methods such as neural networks suitable to be transferred directly to machine reading. The sheer mass of raw training data that is publicly available on the Internet can offset the lower accuracy and allows these models to be trained with relatively little annotated input.
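As a rough illustration of learning from raw text, a few lines of Python using the open-source gensim library are enough to fit word vectors on plain, unlabelled sentences; the toy corpus and hyper-parameters below are assumptions chosen for brevity, not the lab’s own setup.

```python
# A minimal sketch of unsupervised training: the model sees only plain,
# tokenised sentences with no human annotation attached. On a corpus this
# small the resulting vectors are essentially noise; real models are fit
# on millions of sentences scraped from the web.
from gensim.models import Word2Vec

raw_sentences = [
    "the broadcaster covers the election".split(),
    "the newspaper covers the election results".split(),
    "reporters follow the story all night".split(),
]

# vector_size, window and epochs are illustrative hyper-parameters.
model = Word2Vec(raw_sentences, vector_size=25, window=3, min_count=1, epochs=50)

# Every word now has a learned vector representation.
print(model.wv["election"][:5])
```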

Neural Networks in NLP

Generally, whenever an input is fed into a neural-net model, it is run through several ‘hidden layers’ of activation functions: see Figure 1. The simplest example of such a function is a ‘perceptron’, which returns 1 if the input is greater than some threshold and 0 otherwise (this is similar to the neurons in the human brain, which either fire or don’t – depending on the strength of the input). The outputs of the activation functions are passed on to the next layer until the end is reached, when a final output is computed. The output of the first round is nothing more than a random guess. Nonetheless, the model can calculate how far it was from the correct result by comparing the outputs of the model to the actual solutions. This then enables the model to adjust its activation functions to achieve a lower error rate for the next round of inputs. This technique is called back-propagation and is used to fine-tune the hidden layers over the course of many iterations. In this training phase, the model requires some human input to provide the solutions. However, after its training is complete, it can process data and generate outputs without the need for further human help.

Figure 1: The structure of a neural network model. Note that in reality, the number of hidden layers can vary.
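The whole loop just described (a random first guess, comparison with the known answers, back-propagation of the error) fits in a few dozen lines of Python. The sketch below is purely illustrative: a smooth sigmoid stands in for the hard 0/1 perceptron so that gradients can be computed, and the tiny XOR dataset and layer sizes are assumptions rather than anything the lab actually trains on.

```python
# An illustrative numpy sketch of the training loop described above. A smooth
# sigmoid replaces the hard 0/1 perceptron threshold so that back-propagation
# can compute gradients; the XOR toy data and layer sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy inputs and the answers we want the network to reproduce (XOR).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# One hidden layer of four units, randomly initialised: the first
# forward pass really is nothing more than a random guess.
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

learning_rate = 1.0
for step in range(10000):
    # Forward pass: input -> hidden layer -> output.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Compare the model's outputs to the actual solutions...
    error = output - y

    # ...and back-propagate the error to adjust every layer.
    grad_out = error * output * (1 - output)
    grad_hid = (grad_out @ W2.T) * hidden * (1 - hidden)
    W2 -= learning_rate * hidden.T @ grad_out
    b2 -= learning_rate * grad_out.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ grad_hid
    b1 -= learning_rate * grad_hid.sum(axis=0, keepdims=True)

print(np.round(output, 2))  # should end up close to [[0], [1], [1], [0]]
```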


In NLP, such neural networks are used to create ‘word embeddings’—see Figure 2—by reading hundreds of thousands of sentences. The goal is to situate every word in a vector space that records how far away (mathematically) each word is from every other word. Moreover, it has been discovered that the exact offset between two words can be related back to the nature of their difference. An example of this is gender: the mathematical distance from “queen” to “king” is approximately the same as the distance from “aunt” to “uncle”. This ‘encoding’ of relationships is a general property of word embeddings, which makes them a very natural representation of language data.


Figure 2: Visualisation of a word embedding in the jobs region.
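This offset property can be checked directly with any pretrained embedding. The sketch below fetches a small GloVe model through gensim’s downloader; the particular model, and the download itself, are assumptions made purely for illustration.

```python
# A minimal sketch of the 'vector offset' property: the displacement from
# 'queen' to 'king' should roughly match the displacement from 'aunt' to
# 'uncle'. The pretrained model (a few tens of megabytes) is downloaded
# on first use.
import gensim.downloader as api
import numpy as np

wv = api.load("glove-wiki-gigaword-50")

offset_royalty = wv["king"] - wv["queen"]
offset_family = wv["uncle"] - wv["aunt"]

cosine = np.dot(offset_royalty, offset_family) / (
    np.linalg.norm(offset_royalty) * np.linalg.norm(offset_family)
)
print(f"similarity of the two offsets: {cosine:.2f}")

# The same property lets the model complete analogies directly:
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```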

Applications

Newspapers, magazines, and broadcasters alike are increasingly dependent on external news agencies to supply them with both breaking news and factual data to support their articles. In a constant race not to miss out on any big stories, broadcasting giants are forced to spend millions of pounds per year to get access to current information. In-house research and information gathering has become increasingly impractical and unreliable. Part of the LMTC research group is therefore attempting to use machine reading to automatically read through the news reports and social media feeds of the world, and then to create knowledge databases about entities such as countries, companies, and influential people, aggregating news stories and constantly updating the information.
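As a rough sketch of what one entry in such a knowledge base might look like (the field names and update logic are illustrative assumptions, not the LMTC schema), each entity record simply accumulates facts and stories as the software reads.

```python
# An illustrative sketch of an entity record in such a knowledge base.
# Field names, update logic and the example headline are assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EntityRecord:
    name: str                                     # a country, company, or person
    entity_type: str
    facts: dict = field(default_factory=dict)     # attribute -> latest value
    stories: list = field(default_factory=list)   # aggregated headlines
    last_updated: datetime | None = None

    def update(self, attribute, value, headline):
        """Fold a newly read news item into the record."""
        self.facts[attribute] = value
        self.stories.append(headline)
        self.last_updated = datetime.now(timezone.utc)

record = EntityRecord(name="BBC", entity_type="company")
record.update("headquarters", "Broadcasting House, London",
              "BBC and UCL expand machine-reading research partnership")
print(record.facts, len(record.stories))
```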

“We want our software to not only understand language, but also to answer questions”, says Dr. Riedel. He states that voicing such ambitions invites frequent comparisons to present-day personal assistants like Apple’s Siri, Google Now, and Microsoft’s Cortana. However, the technology at work within these applications is a world apart from what runs inside modern machine-reading software. While the algorithms behind smart assistants, with their ability to recognise and extract meaning from anybody’s voice, represent a remarkable (and at times quite goofy) achievement of computer science, these systems lack the knowledge needed to answer any “real” questions that cannot easily be looked up online.

Riedel believes that much more interesting applications of intelligent personal assistants arise when they are coupled with machine reading algorithms. In the years to come, he plans to drive the lab’s research towards building an expert assistant system that can be commanded to familiarise itself with any topic the user might need assistance with, not only the aforementioned media knowledge bases. After the AI has finished reading everything there is to read about the topic, it should then be able to answer even complicated questions. “Lawyers will be able to task their personal assistant to read the legislation of an entire country. The AI will then be able to answer specialised questions and find the excerpts the lawyer is looking for”.

To make this vision a reality, the research group is currently building a machine-reading AI that can pass elementary-level science exams just by reading the relevant textbooks. This may sound like child’s play in comparison to reading a 2000-page law code, but one must not forget that, for a computer, a complex subject matter is potentially more computationally intensive, but not inherently more difficult. Bluntly put, software doesn’t care whether it is reading a complicated manual for an MRI scanner or a collection of children’s stories. And it will read everything.
