Our technology is a software system for extracting named entities (people, companies, locations) mentioned in news articles and in readers’ comments. The system implements fast and effective named entity recognition and linking techniques. It also offers an interactive visualization of the results and of the linking process, including the associations among the named entities.
The software system is developed in Python. It uses Stanford NER to extract entity mentions from text and classifies each of them as a person, location, organization, or miscellany. It implements the Pair-Linking algorithm to quickly disambiguate each extracted mention to its corresponding profile in Wikipedia. The front-end web service is based on Flask.
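To make the linking step more concrete, here is a minimal sketch of greedy pairwise disambiguation in the spirit of Pair-Linking. It is a simplified illustration, not the system's actual implementation: the function name `pair_linking`, the dictionary-based inputs, the additive scoring, and all scores in the example are assumptions made for this sketch. The idea shown is that the most confident pair of (mention, candidate) assignments is fixed first, and later pairs are accepted only if they agree with earlier decisions.

```python
from itertools import combinations, product


def pair_linking(candidates, local_score, relatedness):
    """Greedy pairwise disambiguation (illustrative sketch only).

    candidates:  dict mapping mention -> list of candidate entity names
    local_score: dict mapping (mention, entity) -> float (hypothetical scores)
    relatedness: dict mapping frozenset({entity_a, entity_b}) -> float
    """
    # Score every combination of candidates for every pair of mentions.
    scored = []
    for m1, m2 in combinations(candidates, 2):
        for e1, e2 in product(candidates[m1], candidates[m2]):
            score = (
                local_score.get((m1, e1), 0.0)
                + local_score.get((m2, e2), 0.0)
                + relatedness.get(frozenset((e1, e2)), 0.0)
            )
            scored.append((score, m1, e1, m2, e2))
    scored.sort(key=lambda item: item[0], reverse=True)

    assignment = {}
    for _, m1, e1, m2, e2 in scored:
        # Skip pairs that contradict an earlier, more confident decision.
        if assignment.get(m1, e1) != e1 or assignment.get(m2, e2) != e2:
            continue
        assignment[m1] = e1
        assignment[m2] = e2
        if len(assignment) == len(candidates):
            break

    # A mention that never appeared in a pair falls back to its best local candidate.
    for mention, cands in candidates.items():
        if mention not in assignment and cands:
            assignment[mention] = max(
                cands, key=lambda e: local_score.get((mention, e), 0.0)
            )
    return assignment


# Toy example with made-up scores.
candidates = {
    "Jordan": ["Michael Jordan", "Jordan (country)"],
    "Bulls": ["Chicago Bulls", "Bull"],
}
local_score = {
    ("Jordan", "Michael Jordan"): 0.5,
    ("Jordan", "Jordan (country)"): 0.5,
    ("Bulls", "Chicago Bulls"): 0.6,
    ("Bulls", "Bull"): 0.4,
}
relatedness = {frozenset(("Michael Jordan", "Chicago Bulls")): 0.9}

links = pair_linking(candidates, local_score, relatedness)
print(links)
```

In this toy case the two mentions are ambiguous on their own, and the relatedness between the two candidates is what tips the decision. Scoring all pairs up front is the naive approach; a production implementation would likely avoid enumerating every pair.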
The system’s input and output are described as follows:
Input: The system takes the raw text of news articles and their associated comments. Each article and its comments are formatted in JSON; an example can be found in the folder “core/sample_documents”.
Output: The system outputs the entity mentions extracted from each article and comment. Each mention comes with a label (person, location, organization, or miscellany) and, if the mention is likely to be associated with an entity profile, a link to Wikipedia.
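The input and output described above could be modeled roughly as follows. This is a sketch under stated assumptions: the JSON field names (`text`, `comments`), the `Mention` class, the `source` convention, and the sample strings are all hypothetical and chosen for illustration. The actual schema is the one used by the files in “core/sample_documents”.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# The four labels named in the output description.
LABELS = {"person", "location", "organization", "miscellany"}


@dataclass
class Mention:
    """One extracted entity mention (hypothetical structure)."""
    text: str
    label: str
    source: str                           # "article" or "comment:<index>"
    wikipedia_url: Optional[str] = None   # None when no confident link exists

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")


def iter_texts(document):
    """Yield (source, text) for the article body and then each comment."""
    yield "article", document["text"]
    for index, comment in enumerate(document.get("comments", [])):
        yield f"comment:{index}", comment["text"]


# Assumed input shape: one article with a list of comments.
raw = """
{
  "text": "Example article body.",
  "comments": [{"text": "Example reader comment."}]
}
"""
document = json.loads(raw)
sources = [source for source, _ in iter_texts(document)]

# A linked mention and an unlinked one, serialized back to JSON.
linked = Mention("Singapore", "location", "article",
                 "https://en.wikipedia.org/wiki/Singapore")
unlinked = Mention("the committee", "organization", "comment:0")
print(json.dumps([asdict(linked), asdict(unlinked)], indent=2))
```

Keeping the Wikipedia link optional mirrors the description above: a mention always has a label, but it is linked only when the system considers the association likely.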
Entity extraction is usually the first step in analyzing text documents. It enables a wide range of applications and use cases, such as resolving a person’s identity for government security and fraud detection, tracking customer sentiment around products and companies, and providing targeted search and recommendation engines for content publishers.
People interact with digital news on a daily basis, which creates a huge source of information for mining. However, existing systems have limitations of their own and are not specialized for news articles and user comments, as our software system is.
Unlike existing systems on the market, our software system focuses specifically on extracting entities from news articles and user comments. It collectively links concepts across articles and comments for better accuracy. The visualization in our software system is unique and provides customers with an interactive view of the extraction results. The whole system is modularized into smaller sub-components that can be modified and maintained easily.