HLP Research Center API

Kusuri API


Description


Kusuri is a Named Entity Recognizer optimized to detect and extract drug mentions in tweets. Its interface is a RESTful service. Kusuri performs its classification using either a CNN or a lexicon + BERT classifier, and has been optimized for speed in order to process millions of tweets in hours. This version is an updated and simplified version of the first system described in [Weissenbacher et al., JAMIA 2019](https://doi.org/10.1093/jamia/ocz156).
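
For a quick check, the service can be called directly with Python's requests library. The sketch below is a minimal, unofficial example: it posts two tweets to the predict endpoint used in the Python example at the bottom of this page and applies the CNN classifier; the full input and output formats are specified in the sections below.

    import requests

    # Minimal call: classify two tweets with the CNN classifier.
    # The endpoint is the one used in the Python example at the bottom of this page.
    payload = {
        "classifier": "CNN",
        "tweets": [
            {"tweet_id": "1", "text": "the first tweet"},
            {"tweet_id": "2", "text": "the second tweet"}],
    }
    resp = requests.post("https://hlp.ibi.upenn.edu/kuuri/v0.1/predict", json=payload)
    resp.raise_for_status()
    print(resp.json()["tweets"])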


Input Specification


Kusuri provides one service, predict, through a POST request. The parameters of the prediction should be posted as JSON with the following format:

{'tweets': [{'tweet_id': 1, 'text': 'the first tweet'}, {'tweet_id': '2', 'text': 'the second tweet'}, ...], 'classifier': 'classifier_available', 'lexicon': [{'drug': 'drug1'}, {'drug': 'drug2'}, {'drug': 'drug3'}]}

  • tweets: the keys 'tweet_id' and 'text' are expected. Other keys/values can be included; they will be ignored by the API but returned unchanged in the JSON output.
  • classifier: the available classifiers are lexicon, CNN, and BERT. lexicon applies a baseline lexicon classifier to the input tweets: all tweets containing a phrase matching an entry in the lexicon are labeled 1. CNN applies a CNN trained to detect drugs in a corpus of tweets with the natural distribution, i.e. where tweets mentioning drugs occur rarely. BERT applies a BERT classifier trained to detect drugs in a corpus of tweets mentioning drug names; it has been optimized to disambiguate drug homonyms and will not perform well on a corpus with the natural distribution. BERT can still be applied to such a corpus if the tweets are pre-filtered by the lexicon classifier first (see the sketch after this list).
  • lexicon (optional): by default Kusuri uses a lexicon built from a list of entries from RxNorm (2018), a manually curated list of variants of these entries, and a manually generated list of generic terms referring to drugs. A user-defined lexicon can be passed to Kusuri and applied in place of the default lexicon; it should be formatted as a list of objects with a 'drug' key, as in the example above.
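
As mentioned in the classifier descriptions above, a corpus with the natural distribution can be processed in two passes: the lexicon classifier first filters the corpus cheaply, and BERT then disambiguates only the flagged tweets. The following sketch illustrates this workflow; the field names follow the input and output specifications on this page, the endpoint is the one used in the Python example below, and the sample corpus is a placeholder.

    import requests

    URL = "https://hlp.ibi.upenn.edu/kuuri/v0.1/predict"  # endpoint used in the Python example below
    corpus = ["the first tweet", "the second tweet"]      # placeholder: replace with your own tweets
    tweets = [{"tweet_id": str(i), "text": t} for i, t in enumerate(corpus)]

    # Pass 1: baseline lexicon matching over the whole corpus
    lex = requests.post(URL, json={"tweets": tweets, "classifier": "lexicon"}).json()
    flagged_ids = {str(t["tweetID"]) for t in lex["tweets"] if t["prediction"] == 1}

    # Pass 2: BERT disambiguation, restricted to the tweets flagged by the lexicon
    candidates = [t for t in tweets if t["tweet_id"] in flagged_ids]
    if candidates:
        bert = requests.post(URL, json={"tweets": candidates, "classifier": "BERT"}).json()
        print(bert["tweets"])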

Output Specification


If the prediction is successful, Kusuri returns a JSON object with the following format:

  1. If the lexicon classifier is used:

    {'tweets': [{'tweetID': 1, 'text': 'the first tweet', 'drugDetected': None, 'prediction': 0}, {'tweetID': '2', 'text': 'the second tweet', 'drugDetected': ['the second'], 'prediction': 1}]}

  2. If the CNN or BERT classifier is used:

    {'tweets': [{'tweetID': 1, 'text': 'the first tweet', 'drugDetected': None, 'prediction': 0}, {'tweetID': '2', 'text': 'the second tweet', 'drugDetected': ['the second'], 'prediction': 1}]}
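
For example, a small helper (not part of the API) can keep only the tweets predicted to mention a drug from a parsed response:

    def drug_tweets(result):
        """Return the tweets predicted to mention a drug from a parsed Kusuri response."""
        return [t for t in result["tweets"] if t["prediction"] == 1]

    # e.g.: for t in drug_tweets(resp.json()): print(t["tweetID"], t["drugDetected"])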


Python Example

    '''
    Created on Aug 24, 2020
    
    @author: dweissen
    '''
    import requests
    import pandas as pd
    import json

    import logging as lg
    
    if __name__ == '__main__':

          
        # Configure logging so that the info messages below are displayed
        lg.basicConfig(level=lg.INFO)

        # The tweets can also be built from a pandas DataFrame of examples, e.g.:
        # exJSON['tweets'] = json.loads(examples.to_json(orient="records"))

        # An example request: four tweets, a user-defined lexicon, and the lexicon classifier.
        # The extra key 'label' is ignored by the API and returned unchanged in the output.
        exJSON = {
            "tweets": [
                {"tweetID": "10", "text": "My first tweet", "label": "1"},
                {"tweetID": "11", "text": "My second tweet", "label": "2"},
                {"tweetID": "12", "text": "My third tweet", "label": "1"},
                {"tweetID": "13", "text": "My fourth tweet", "label": "3"}],
            "lexicon": [
                {"drug": "first"},
                {"drug": "drug2"},
                {"drug": "drug3"}],
            "classifier": "lexicon"
        }
        
        # POST the request to the predict service
        resp = requests.post('https://hlp.ibi.upenn.edu/kuuri/v0.1/predict', json=exJSON)
        if resp.status_code != 200:
            raise Exception(f'POST /predict/ ERROR: {resp.status_code}')
        else:
            exJSON = resp.json()
            if 'errors' in exJSON:
                raise Exception(f'POST /predict/ ERROR: {resp.json()}')
            elif 'tweets' in exJSON:
                # Normalize the predicted tweets into a dataframe
                df = pd.json_normalize(exJSON['tweets'])
                lg.info(f'Prediction done for {len(df)} tweets.')
                lg.info(f'First 10 tweets:\n{df.head(10)}')
                df.to_csv('/tmp/outDrug.tsv', sep='\t')
            else:
                raise Exception(f'POST /predict/ ERROR: unexpected json received from the REST service: {exJSON}')
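
Since the example tweets above carry a 'label' field, a possible follow-up is to compare gold labels with the returned predictions using scikit-learn. This is only a sketch: it assumes the 'label' column holds binary (0/1) drug annotations rather than the placeholder values used in the request above, and it reuses the dataframe df built by the script.

    from sklearn.metrics import classification_report, confusion_matrix

    # df is the dataframe built from the API response in the script above;
    # 'label' is assumed to hold binary (0/1) gold annotations.
    gold = df['label'].astype(int)
    pred = df['prediction'].astype(int)
    print(confusion_matrix(gold, pred))
    print(classification_report(gold, pred))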