Solution description and data story
- The project’s primary objective is to find new ways to improve PROV’s search functionality.
- We generate new metadata for each image to make it easier to find. The metadata combines a caption with tags, which are the model’s answers to questions about the image.
- We built an app that combines generative AI models with human feedback to produce correct and relevant image captions and tags.
- The resulting dataset supports future development through techniques such as model fine-tuning and reinforcement learning from human feedback (RLHF), a technique used to improve ChatGPT’s performance.
Our proposed solution has four components:
- Using public APIs to access image data.
- Using state-of-the-art generative AI models to automatically generate image captions and tags.
- An app that puts humans in the loop to enhance the model’s outputs.
- A demonstration of how the refined data can be used to improve the model.
Using public APIs to access image data
This component is a data pipeline written in Python that uses the PROV API to collect image links and descriptions. The user can query by either serial number or keyword.
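The two query modes can be sketched as follows. This is a minimal illustration, not the project's pipeline: the endpoint URL, the Solr-style query syntax, the response shape and every field name below are assumptions that should be checked against PROV's API documentation.

```python
from urllib.parse import urlencode

# ASSUMPTION: the endpoint and field names are illustrative and are not taken
# from the project; confirm them against PROV's API documentation.
BASE_URL = "https://api.prov.vic.gov.au/search/query"
SERIAL_FIELD = "series_id"      # hypothetical field for the serial-number query
IMAGE_FIELD = "iiif-thumbnail"  # hypothetical field holding the image link
TEXT_FIELD = "title"            # hypothetical field holding the description


def build_query_url(keyword=None, serial_number=None, rows=20):
    """Build a search URL from exactly one of keyword or serial number."""
    if (keyword is None) == (serial_number is None):
        raise ValueError("pass exactly one of keyword or serial_number")
    if keyword is not None:
        query = f'text:"{keyword}"'  # assumed Solr-style phrase query
    else:
        query = f"{SERIAL_FIELD}:{serial_number}"
    params = {"q": query, "rows": rows, "wt": "json"}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_images(payload):
    """Return image links and descriptions from a decoded JSON response."""
    # ASSUMPTION: results are nested under response -> docs.
    docs = payload.get("response", {}).get("docs", [])
    return [
        {"image_url": doc[IMAGE_FIELD], "description": doc.get(TEXT_FIELD, "")}
        for doc in docs
        if doc.get(IMAGE_FIELD)  # skip records without an image link
    ]
```

In use, the URL from build_query_url would be fetched with an HTTP client, the body decoded as JSON, and the result passed to extract_images. Keeping URL construction and response parsing separate from the network call makes both easy to test offline.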
Using state-of-the-art generative AI models to automatically generate image captions and tags
For our machine learning project, we chose to work with the BLIP model, an open-source vision-language model from Salesforce. We focused on two main tasks:
- Image Captioning
- Image Tag Generation
Using the BLIP model, we generated captions for images retrieved from the PROV API. To generate tags, we used BLIP's Visual Question Answering capability: the model answers a set of predefined questions about each image, and its answers become the tags. Future iterations of this project could use more context-specific questions to produce more context-specific tags.
In our implementation, we did not use third-party APIs for model inference. Instead, we downloaded the model and ran it locally within our environment. This ensured that no image data was sent to external services, preventing potential data leakage.
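The captioning and question-answering steps described above could look roughly like this with the Hugging Face transformers library. This is a sketch under assumptions: the checkpoint names, the filtering rule in answers_to_tags and the function names are illustrative choices, and the project's actual questions and post-processing are not given here.

```python
# ASSUMPTION: these public BLIP checkpoints are illustrative; the project may
# have used different weights.
CAPTION_MODEL = "Salesforce/blip-image-captioning-base"
VQA_MODEL = "Salesforce/blip-vqa-base"

# Hypothetical rule: bare yes/no style answers carry no meaning as tags.
UNINFORMATIVE = {"yes", "no", "none", "unknown"}


def answers_to_tags(answers):
    """Turn {question: answer} pairs into a de-duplicated list of tags."""
    tags = []
    for answer in answers.values():
        tag = answer.strip().lower()
        if tag and tag not in UNINFORMATIVE and tag not in tags:
            tags.append(tag)
    return tags


def caption_and_tags(image, questions):
    """Caption a PIL image and derive tags from BLIP's answers to questions."""
    # Imported here so answers_to_tags works without the heavy dependencies.
    from transformers import (
        BlipForConditionalGeneration,
        BlipForQuestionAnswering,
        BlipProcessor,
    )

    # Weights are downloaded once and cached; inference then runs locally.
    cap_processor = BlipProcessor.from_pretrained(CAPTION_MODEL)
    cap_model = BlipForConditionalGeneration.from_pretrained(CAPTION_MODEL)
    output = cap_model.generate(**cap_processor(image, return_tensors="pt"))
    caption = cap_processor.decode(output[0], skip_special_tokens=True)

    vqa_processor = BlipProcessor.from_pretrained(VQA_MODEL)
    vqa_model = BlipForQuestionAnswering.from_pretrained(VQA_MODEL)
    answers = {}
    for question in questions:
        inputs = vqa_processor(image, question, return_tensors="pt")
        output = vqa_model.generate(**inputs)
        answers[question] = vqa_processor.decode(output[0], skip_special_tokens=True)

    return caption, answers_to_tags(answers)
```

caption_and_tags needs transformers, PyTorch and Pillow installed. Open-ended questions such as "What is the setting?" tend to suit tagging better than yes/no questions, which is why bare "yes" and "no" answers are dropped in this sketch.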
An app that puts humans in the loop to enhance the model’s outputs
Our app provides the following features:
- Keyword search on PROV’s image repository. The search loads images through the PROV API.
- Interactive presentation of the model-generated description and tags.
- Generation of descriptions and tags from any image URL.
A demonstration of how the refined data can be used to improve the model
This component adopts principles from reinforcement learning from human feedback (RLHF): user feedback is collected and used to improve the model in later iterations. The app's interface lets users give feedback on model outputs and correct them. Over time, this feedback builds up a set of more accurate descriptions and tags that can be used to improve the model.
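One way to store this feedback so that it can later feed fine-tuning is sketched below. The record layout, the JSON Lines file format and all names are assumptions made for illustration; the write-up does not specify how the app persists feedback.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FeedbackRecord:
    """One reviewed image (hypothetical layout)."""
    image_url: str
    model_caption: str
    final_caption: str  # caption after human review
    model_tags: list
    final_tags: list    # tags after human review


def append_feedback(path, record):
    """Append one reviewed record to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")


def load_corrections(path):
    """Return caption pairs for records where the reviewer changed the caption."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            if row["model_caption"] != row["final_caption"]:
                pairs.append(
                    {
                        "image_url": row["image_url"],
                        "rejected": row["model_caption"],
                        "chosen": row["final_caption"],
                    }
                )
    return pairs
```

Keeping both the model output and the human-approved version means the same file can serve two purposes: the final captions and tags can be written back as image metadata, and the rejected and chosen pairs form a preference-style dataset for later fine-tuning.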
Together, these four components form a complete solution with a systematic path to improvement. The refined captions and tags can be added to the images’ metadata, which makes the images easier to find. They can also be used to improve the model so that its outputs are better aligned with an organization’s objectives.