In this section, we discuss related work in three aspects: text segmentation, POS tagging, and semantic labeling.

Text Segmentation. We consider text segmentation as the task of dividing a text into a sequence of terms. Statistical approaches, such as the n-gram model [21, 22, 23], calculate the frequencies of words co-occurring as neighbors in a training corpus. When the frequency exceeds a predefined threshold, the corresponding neighboring words can be treated as a term.
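Such frequency-based segmentation can be sketched as follows. This is a minimal illustration, not the method of the cited works: the helper names, the greedy left-to-right merging strategy, and the threshold value are all assumptions made for the example.

```python
from collections import Counter

def bigram_terms(corpus_sentences, threshold=2):
    """Count neighboring word pairs in a corpus; pairs whose
    co-occurrence frequency exceeds the threshold become terms."""
    counts = Counter()
    for sentence in corpus_sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return {pair for pair, c in counts.items() if c > threshold}

def segment(text, terms):
    """Greedy left-to-right segmentation: merge a neighboring word
    pair into a single term when it appears in the term set."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in terms:
            out.append(words[i] + " " + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out
```

For example, a corpus in which "new york" co-occurs frequently yields the segmentation ["i", "love", "new york"] for the input "i love new york".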
Vocabulary-based approaches [18, 19, 20] extract terms by checking for their existence or frequency in a predefined vocabulary. The most obvious drawback of existing text segmentation methods is that they consider only surface features and ignore the requirement of semantic coherence within a segmentation. This can lead to incorrect segmentations, as described in Challenge 1. To this end, we propose to exploit context semantics when conducting text segmentation.

POS Tagging. POS tagging determines the lexical types (i.e., POS tags) of the words in a text.
Rule-based POS taggers attempt to assign POS tags to unknown or ambiguous words based on a large number of hand-crafted [10, 11] or automatically learned [12, 13] linguistic rules. Statistical POS taggers avoid the cost of constructing tagging rules by building a statistical model automatically from a corpus and labeling untagged texts based on the learned statistics. Mainstream statistical POS taggers employ the well-known Markov model [14, 15, 16, 17], which learns both lexical probabilities and sequential probabilities from a labeled corpus and tags a new sentence by searching for the tag sequence that maximizes the combination of lexical and sequential probabilities. Note that both rule-based and statistical POS taggers rely on the assumption that texts are correctly structured, which is not always the case for short texts. More importantly, existing methods consider only lexical features and ignore word semantics. This can lead to mistakes, as illustrated in Challenge 3. Our work attempts to build a tagger that considers both lexical features and underlying semantics for type detection.
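The search over lexical (emission) and sequential (transition) probabilities at the core of such Markov-model taggers can be sketched with the standard Viterbi algorithm. The sketch below is a generic illustration under simplifying assumptions (first-order transitions, a log-probability floor for unseen events); any concrete tag set and probability tables would come from a labeled corpus.

```python
import math

def viterbi(words, tags, start, trans, emit):
    """Find the tag sequence maximizing the combination of
    sequential (transition) and lexical (emission) probabilities."""
    eps = 1e-12  # probability floor for unseen events
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {t: (math.log(start.get(t, eps)) +
                math.log(emit.get((t, words[0]), eps)), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Extend the best predecessor path with tag t for word w.
            p, prev = max(
                (best[s][0] + math.log(trans.get((s, t), eps)) +
                 math.log(emit.get((t, w), eps)), s)
                for s in tags)
            new[t] = (p, best[prev][1] + [t])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]
```

With toy probability tables favoring determiner-noun-verb sequences, the sentence "the dog barks" is tagged DT NN VB.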
Semantic Labeling. Semantic labeling discovers hidden semantics in a natural language text. Named entity recognition (NER) locates named entities in a text and classifies them into predefined categories (e.g., persons, organizations, locations) using linguistic grammar-based techniques as well as statistical models such as CRFs [1] and HMMs [2]. Topic models [3] attempt to recognize “latent topics”, represented as probabilistic distributions over words, based on observable statistical relations between texts and words.
Entity linking [5, 6, 7, 8] employs existing knowledge bases and focuses on retrieving “explicit topics” expressed as probabilistic distributions over the entire knowledge base. Despite the high accuracy achieved by existing work on semantic labeling, there are still some limitations. First, categories, “latent topics”, and “explicit topics” differ from human-understandable concepts. Second, short texts do not always follow the syntax of a written language, which is an indispensable feature for mainstream NER tools. Third, short texts do not contain sufficient content to support statistical models such as topic models. The works most closely related to ours are those of Song et al. [19] and Kim et al. [20], which also represent semantics as concepts.
Song et al. [19] employ a Bayesian inference mechanism to conceptualize instances and short texts, and eliminate instance ambiguity based on homogeneous instances. Kim et al. [20] capture semantic relatedness between instances using a probabilistic topic model (i.e., LDA) and disambiguate instances based on related instances. In this work, we observe that other terms, such as verbs, adjectives, and attributes, can also help with instance disambiguation. We incorporate type detection into our framework for short text understanding and conduct instance disambiguation based on various types of context information.
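The intuition behind Bayesian conceptualization can be illustrated with a naive Bayes sketch over a small isA table. This is only an illustration of the general idea, not the model of [19]: the table, the uniform concept prior, and the conditional-independence assumption are all hypothetical.

```python
# Toy isA table: P(instance | concept). All values are illustrative assumptions.
P_INSTANCE_GIVEN_CONCEPT = {
    "fruit":   {"apple": 0.4, "orange": 0.3, "pear": 0.3},
    "company": {"apple": 0.7, "microsoft": 0.3},
}
P_CONCEPT = {"fruit": 0.5, "company": 0.5}  # uniform prior (assumption)

def conceptualize(instances):
    """Posterior over concepts given co-occurring instances, assuming
    instances are conditionally independent given the concept."""
    posterior = {}
    for concept, prior in P_CONCEPT.items():
        p = prior
        for inst in instances:
            p *= P_INSTANCE_GIVEN_CONCEPT[concept].get(inst, 1e-6)
        posterior[concept] = p
    z = sum(posterior.values())
    return {c: p / z for c, p in posterior.items()}
```

Here the homogeneous co-occurring instance "orange" pushes the posterior for {"apple", "orange"} toward the concept "fruit", disambiguating the polysemous instance "apple".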