Streamlit app to detect "Relevant" chemical compound structures and other entities from patent documents.

shashank524/patent_analysis

Chemical Structure Extraction

A web app that uses deep learning to detect relevant chemical structures in patent documents.

This project was built during my internship at iReadRx.

Learn more by checking out the blog posts linked below.

Models Used

v1 (chemical structure detection):

Dataset

I trained a YOLOv5 model on images of PDF pages containing organic compound structures, annotated by our chemistry team.

Transfer learning, combined with hyperparameter evolution using a genetic algorithm, gave good training results.
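To give a feel for what hyperparameter evolution does, here is a minimal sketch of a mutate-and-select loop. It is not YOLOv5's own evolution code and not the setup used in this project: the `mutate`, `evolve`, and `toy_fitness` names, the bounds, and the starting values are all illustrative assumptions, and the toy fitness function stands in for a validation metric that would normally come from a short training run.

```python
import random


def mutate(hyp, bounds, rng, sigma=0.2, prob=0.8):
    """Return a mutated copy of `hyp`, with every value clipped to `bounds`."""
    child = dict(hyp)
    for name, (low, high) in bounds.items():
        if rng.random() < prob:
            # Multiplicative Gaussian noise around the parent value.
            child[name] = hyp[name] * (1.0 + rng.gauss(0.0, sigma))
        child[name] = min(max(child[name], low), high)
    return child


def evolve(fitness, start, bounds, generations=50, seed=0):
    """Keep the best candidate seen so far; mutate it once per generation."""
    rng = random.Random(seed)
    best, best_score = dict(start), fitness(start)
    for _ in range(generations):
        candidate = mutate(best, bounds, rng)
        score = fitness(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


def toy_fitness(hyp):
    """Hypothetical stand-in for a validation metric (higher is better)."""
    return -((hyp["lr0"] - 0.01) ** 2) - ((hyp["momentum"] - 0.9) ** 2)


# Illustrative search space and starting point (assumed values).
bounds = {"lr0": (1e-5, 0.1), "momentum": (0.6, 0.98)}
start = {"lr0": 0.05, "momentum": 0.7}

best, best_score = evolve(toy_fitness, start, bounds)
print(best, best_score)
```

In a real run, each call to the fitness function would train the model briefly and return a metric such as mAP, which is why evolution is expensive and usually run for a limited number of generations.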

Training Notebook

v2 (distinguishing between reactions, relevant structures, intermediates, etc.):

Dataset

As YOLOv5 didn't perform well on the new dataset, Detectron2 was used instead.

Training Notebook

Getting Started

To get a local copy up and running, follow these steps.

git clone https://github.com/shashank524/patent_analysis.git
cd patent_analysis

Colab

If you want to train the models, or just learn hands-on how this project works, Colab is the best place to do so. All the required Colab notebooks are here.

The first few cells take care of the installation.

Docker

By default, the Dockerfile builds an image that can run both YOLOv5 and Detectron2.

To use Detectron2 alone, or YOLOv5 and Detectron2 together, build the image as-is:

docker build -t compoundextractor .

If you don't want to use Detectron2, comment out the following lines in the Dockerfile before building. This step is optional; it only reduces the size of the Docker image.

RUN pip install --user torch==1.9 torchvision==0.10 -f https://download.pytorch.org/whl/cu111/torch_stable.html
RUN pip install --user 'git+https://github.com/facebookresearch/fvcore'
RUN pip install cython pyyaml==5.1
RUN pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
RUN python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
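If you would rather not edit the Dockerfile by hand, the commenting-out step can be scripted. The sketch below is only an illustration: the `comment_out_detectron2` helper and its marker strings are assumptions based on the five lines listed above, so check the result against your actual Dockerfile before building.

```python
# Substrings that identify the Detectron2-related RUN lines listed above.
# These markers are an assumption; adjust them if your Dockerfile differs.
DETECTRON2_MARKERS = ("torch==", "fvcore", "pyyaml", "cocoapi", "detectron2")


def comment_out_detectron2(dockerfile_text, markers=DETECTRON2_MARKERS):
    """Return the Dockerfile text with matching RUN lines commented out."""
    result = []
    for line in dockerfile_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("RUN") and any(m in stripped for m in markers):
            result.append("# " + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


# Small made-up Dockerfile fragment used only to demonstrate the helper.
sample = (
    "FROM python:3.8\n"
    "RUN pip install -r requirements.txt\n"
    "RUN pip install cython pyyaml==5.1\n"
    "RUN python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'\n"
)
print(comment_out_detectron2(sample))
```

To apply it to a real file, read the Dockerfile's text, pass it through the helper, and write the result back (keeping a backup of the original).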

Finally, run the container:

docker run -p 8501:8501 compoundextractor:latest

Without Docker

Run the following commands in your terminal to set up everything required without Docker.

pip install -r requirements.txt

pip install --user torch==1.9 torchvision==0.10 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install --user 'git+https://github.com/facebookresearch/fvcore'
pip install cython pyyaml==5.1
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'
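After the installation, a quick check that the packages can be found may save a failed app start. The snippet below is a small sketch using only the standard library; the `missing_packages` helper is hypothetical, and the module names in `REQUIRED` are assumptions about what the commands above install (for example, cocoapi is imported as `pycocotools`, and `streamlit` is assumed to come from `requirements.txt`).

```python
import importlib.util

# Top-level import names assumed to be provided by the install commands above.
REQUIRED = ["torch", "torchvision", "fvcore", "pycocotools", "detectron2", "streamlit"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

This only checks that each package is discoverable, not that its version matches the pinned ones or that a GPU is available.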

Run the Streamlit app

streamlit run app.py

Results

How do you build on top of this?

I have divided each part of this project, from training to deployment, into separate Colab notebooks.

For example, if you want to know why I chose to perform inference a certain way, you can look at the relevant Colab notebook and perhaps find a better way to do the same thing.

iReadRx Blog

🤝 Contributing

Contributions, issues, and feature requests are welcome!

Feel free to check the issues page.

Show your support

Give a ⭐️ if you like this project!

Acknowledgements

📝 License

This project is MIT licensed.
