Achievements

Our datasets, software, and talks are listed here.


Datasets

FewJoint

Link: https://github.com/AtmaHou/MetaDialog

Tags: Task-oriented Dialogue, NLU

Intro: FewJoint is a novel few-shot learning (FSL) benchmark for joint multi-task learning, built to promote FSL research in NLP. To reflect real-world NLP complexity beyond simple N-way classification, FewJoint is built on an important and challenging NLP problem: Task-oriented Dialogue Language Understanding. Task-oriented Dialogue is a growing research area that develops dialogue systems to help users achieve goals, such as booking tickets. Language Understanding is a fundamental module of Task-oriented Dialogue that extracts semantic frames from user utterances. It contains two subtasks: Intent Detection and Slot Tagging. Through the Slot Tagging task, FewJoint covers one of the most common structured prediction problems: sequence labeling. In addition, because Intent Detection and Slot Tagging naturally depend on each other, FewJoint captures the multi-task challenge of NLP problems. To reduce the effect of randomness and support adequate evaluation, FewJoint includes 59 dialogue domains drawn from real industrial APIs, a considerable number of domains compared with existing few-shot and dialogue datasets.
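To make the two subtasks concrete, here is a minimal Python sketch of how an intent label and BIO slot tags combine into one semantic frame. The utterance, the intent name `book_ticket`, the slot names, and the helper `decode_bio` are all hypothetical illustrations and are not taken from the FewJoint data or code.

```python
from typing import Dict, List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (slot_name, text) spans from token-level BIO tags."""
    spans: List[Tuple[str, str]] = []
    label = None          # slot name of the span being built, if any
    words: List[str] = []  # tokens of the span being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close the previous one first.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            # Continuation of the current span.
            words.append(token)
        else:
            # "O" or an inconsistent "I-" tag: close any open span.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical utterance with one intent and two slots.
tokens = ["book", "a", "ticket", "to", "new", "york", "tomorrow"]
tags = ["O", "O", "O", "O", "B-destination", "I-destination", "B-date"]

frame: Dict[str, object] = {
    "intent": "book_ticket",            # output of Intent Detection
    "slots": decode_bio(tokens, tags),  # output of Slot Tagging
}
print(frame)
```

The dependency between the two subtasks shows up in the frame itself: a `destination` slot is plausible for a ticket-booking intent and much less so for, say, a weather query.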

Few-NERD

Link: https://github.com/thunlp/Few-NERD

Tags: Named Entity Recognition

Intro: Few-NERD is a manually annotated few-shot named entity recognition (NER) dataset, consisting of 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmarks are included to assess different levels of generalization for NER models. Few-NERD (SUP) is a fully supervised NER benchmark that is verified to be more difficult than conventional datasets. Few-NERD (INTER) and Few-NERD (INTRA) are episodic benchmarks that adopt N-way K~2K-shot sampling for this sequence labeling task.
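Because a sentence can contain several entities, an exact K-shot support set is hard to sample in sequence labeling, which is why the count per type is relaxed to a range between K and 2K. The following is a simplified greedy sketch of that idea, not the official Few-NERD sampler. The function name `sample_support`, the toy data, and the retry limit are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List, Optional, Sequence, Set, Tuple


def sample_support(
    sentences: Sequence[List[str]],  # entity types mentioned in each sentence
    classes: Set[str],
    n: int,
    k: int,
    seed: int = 0,
    max_attempts: int = 100,         # assumption: retry with new target types
) -> Optional[Tuple[List[str], List[int]]]:
    """Greedily pick sentences so each of N types appears K to 2K times."""
    rng = random.Random(seed)
    for _ in range(max_attempts):
        target = rng.sample(sorted(classes), n)
        counts: Counter = Counter()
        chosen: List[int] = []
        order = list(range(len(sentences)))
        rng.shuffle(order)
        for idx in order:
            sent_counts = Counter(sentences[idx])
            # Skip sentences with no entities or with non-target types.
            if not sent_counts or any(c not in target for c in sent_counts):
                continue
            # Skip sentences that would push a type above 2K.
            if any(counts[c] + m > 2 * k for c, m in sent_counts.items()):
                continue
            chosen.append(idx)
            counts.update(sent_counts)
            # Stop as soon as every target type has at least K entities.
            if all(counts[c] >= k for c in target):
                return target, chosen
    return None  # no valid support set found within max_attempts


# Toy corpus: each inner list holds the entity types found in one sentence.
toy_sentences = [
    ["A"], ["A"], ["B"], ["B"], ["C"], ["C"], ["A", "B"], ["B", "C"],
]
result = sample_support(toy_sentences, {"A", "B", "C"}, n=2, k=2)
print(result)
```

A query set can be drawn the same way from the remaining sentences.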

FewRel

Link: https://github.com/thunlp/FewRel

Tags: Relation Extraction

Intro: FewRel is a large-scale few-shot relation extraction dataset, which contains more than one hundred relations and tens of thousands of annotated instances across different domains. The benchmark is established as N-way K-shot sentence-level classification. The second edition of the dataset, FewRel 2.0, adds domain adaptation (DA) and none-of-the-above (NOTA) detection challenges to evaluate cross-domain generalization more comprehensively.
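As a rough illustration of the N-way K-shot setting with NOTA queries, the sketch below builds one episode from a dictionary that maps relation names to instances. It is a minimal sketch and not the FewRel data loader. The function `sample_episode`, the `nota_rate` parameter, and the toy data are hypothetical, and the way NOTA queries are mixed in here is an assumption.

```python
import random
from typing import Dict, List, Tuple


def sample_episode(
    data: Dict[str, List[str]],  # relation name -> list of instances
    n: int,
    k: int,
    q: int,
    nota_rate: float = 0.0,      # assumption: NOTA queries per regular query
    seed: int = 0,
) -> Tuple[List[str], Dict[str, List[str]], List[Tuple[str, int]]]:
    """Build one N-way K-shot episode with Q queries per relation."""
    rng = random.Random(seed)
    relations = rng.sample(sorted(data), n)
    support: Dict[str, List[str]] = {}
    query: List[Tuple[str, int]] = []
    for label, rel in enumerate(relations):
        # Sample without replacement so support and query do not overlap.
        picked = rng.sample(data[rel], k + q)
        support[rel] = picked[:k]
        query.extend((inst, label) for inst in picked[k:])
    # NOTA queries come from relations outside the episode and get label n.
    others = [r for r in sorted(data) if r not in relations]
    if others:
        for _ in range(int(nota_rate * len(query))):
            rel = rng.choice(others)
            query.append((rng.choice(data[rel]), n))
    rng.shuffle(query)
    return relations, support, query


# Toy data: 5 relations with 10 placeholder instances each.
toy_data = {
    f"rel{r}": [f"rel{r}-sentence-{i}" for i in range(10)] for r in range(5)
}
relations, support, query = sample_episode(
    toy_data, n=3, k=2, q=2, nota_rate=0.5
)
print(relations, len(query))
```

A model evaluated on such an episode classifies each query into one of the N relations or, when NOTA queries are present, into the extra class N.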

Software

Meta Dialog Platform (MDP)

Link: https://github.com/AtmaHou/MetaDialog

Intro: Meta Dialog Platform (MDP) is a toolkit for few-shot learning on NLP tasks, covering text classification and sequence labeling. For sequence labeling, it supports state-of-the-art few-shot methods: CDT, the use of semantics from label names or label descriptions, a pair-wise embedding mechanism, and various deep pre-trained embeddings compatible with huggingface/transformers, such as BERT and Electra.

Talks

Wanxiang Che: Few-shot Natural Language Understanding in Task-oriented Dialogue Systems

Link: https://hub.halobug.cn/view/3855

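Intro: Few-shot learning aims to let computers learn new tasks from only a few examples, as humans do. In recent years it has become a hot research topic in the machine learning community and is regarded as a key direction for bringing machine intelligence closer to human intelligence. Task-oriented dialogue systems frequently need to adapt to new domains and new requirements, and new domains often lack data, so they offer an excellent application scenario for few-shot learning techniques. Natural language understanding, a key module of task-oriented dialogue systems, mainly consists of two tasks: user intent detection and slot filling. We explored how each of these tasks can meet the few-shot challenge: (1) slot filling: few-shot sequence labeling; (2) user intent detection: few-shot text classification and multi-label classification; (3) in addition, few-shot learning in natural language processing currently lacks a unified benchmark that reflects the challenges of real-world tasks, so we annotated a new few-shot dataset, FewJoint, and organized the SMP 2020 technical evaluation, hoping to advance research on few-shot learning in natural language processing.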