SIPHS Consumer Health Vocabulary

SIPHS-CHV: A Lexical Database of Consumer Health Terminology

SIPHS Consumer Health Vocabulary (SIPHS-CHV) is a dataset of layman medical terminology. SIPHS-CHV was collected by analysing four years of content in 68 health-themed subreddits and annotating the most frequent terms with their corresponding SNOMED-CT entities. Each term is assigned two annotations: a General SNOMED-CT identifier and a Specific one, denoting the literal and the contextual meaning of the term, respectively.

COMETA: A Corpus for Medical Entity Linking in the Social Media

COMETA is built on SIPHS-CHV and provides four biomedical Entity Linking scenarios for training and evaluating machine learning algorithms. The scenarios combine two sampling strategies (stratified and zero-shot) with SIPHS-CHV's General and Specific annotations. You can learn more about COMETA here.
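The four scenarios are simply the combinations of the two sampling strategies with the two annotation levels. The short sketch below enumerates them; the constant and function names are illustrative assumptions and do not correspond to any file or split names in the actual COMETA release.

```python
from itertools import product

# Labels taken from the description above; the identifiers are hypothetical.
SAMPLING_STRATEGIES = ("stratified", "zero-shot")
ANNOTATION_LEVELS = ("General", "Specific")


def list_scenarios():
    """Return every (sampling strategy, annotation level) pair."""
    return list(product(SAMPLING_STRATEGIES, ANNOTATION_LEVELS))


if __name__ == "__main__":
    for sampling, level in list_scenarios():
        print(f"{sampling} split, {level} annotations")
```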

Obtaining SIPHS-CHV and COMETA

You can request a copy of SIPHS-CHV and COMETA by contacting Prof. Nigel Collier.

Please provide details about yourself: your affiliation, your role, a brief research statement (4 to 6 lines) about the intended use of the corpus, references to any related past research you have conducted in this area, and the names of other investigators involved in your project.

If you use COMETA in your research, please cite:

    @inproceedings{basaldella-etal-2020-cometa,
        title = "{COMETA}: A Corpus for Medical Entity Linking in the Social Media",
        author = "Basaldella, Marco and Liu, Fangyu and Shareghi, Ehsan and Collier, Nigel",
        booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
        month = nov,
        year = "2020",
        publisher = "Association for Computational Linguistics"
    }

Online preview

Search for terms using the textbox below; results appear automatically. You can look up terms by their surface form, their preferred SNOMED definition, or their SNOMED ID.

Please note that this facility is provided for demonstration purposes only and does not display the full content of SIPHS-CHV/COMETA.

Surface | General SNOMED ID | Specific SNOMED ID | Sentence | Subreddit
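For readers who obtain the data and want a similar lookup offline, here is a minimal sketch of a record type with the five preview columns and a simple search over it. The class name, field names, and sample rows are assumptions made for illustration: the rows are invented placeholders, not real dataset entries, and the identifiers are deliberately not real SNOMED IDs. Search by preferred SNOMED definition is left out because it would need a SNOMED-CT lookup that is not among the columns listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PreviewRow:
    """One preview row; field names are hypothetical, mirroring the column headers."""
    surface: str
    general_id: str
    specific_id: str
    sentence: str
    subreddit: str


def search(rows: List[PreviewRow], query: str) -> List[PreviewRow]:
    """Return rows whose surface form contains the query, or whose General or
    Specific identifier equals it (case-insensitive). A blank query matches nothing."""
    q = query.strip().lower()
    if not q:
        return []
    return [
        r for r in rows
        if q in r.surface.lower()
        or q in (r.general_id.lower(), r.specific_id.lower())
    ]


# Invented placeholder rows, for illustration only.
ROWS = [
    PreviewRow("term-a", "GEN-0001", "SPEC-0001",
               "Placeholder sentence containing term-a.", "r/placeholder"),
    PreviewRow("term-b", "GEN-0002", "SPEC-0002",
               "Placeholder sentence containing term-b.", "r/placeholder"),
]

if __name__ == "__main__":
    for row in search(ROWS, "term-a"):
        print(row.surface, row.general_id, row.specific_id)
```

Substring matching on the surface form and exact matching on identifiers is one reasonable choice; the online preview may behave differently.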

If you are the author of one of the messages in our dataset and you want to withdraw your content from the dataset, please email us and we will process your request as soon as possible; please provide your Reddit username and the URL of the post for verification purposes.