DataTrans

Project Info

Team Name


NetMine


Team Members


ARIO, Dorna

Project Description


Our team, NetMine, has two members: Dorna Heidari and Majid Zarrinkolah (Ario). We participated at Deakin University and entered the challenge "Making public archives more accessible" with a project named DataTrans.
Our goal was to make public archives more accessible. Technically, DataTrans extracts text from various documents using OCR and transforms it into vectors. We then use a Large Language Model (LLM) to ask questions of the data and gain insights from it.

We both contributed to the code for this part of the project, which extracts information from large numbers of documents and supports tasks such as tagging, title retrieval, and more. For this purpose, we implemented an Optical Character Recognition (OCR) function that extracts text from various document types, including PDFs that contain both images and text. The extracted text is stored as documents, which are then divided into chunks. Each chunk is transformed into a vector that can be retrieved later. These vectors are what connect the archive to the large language model.
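The chunk-and-vectorise step described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual code: the function names `chunk_text` and `embed`, the word-based chunk sizes, and the hashed bag-of-words vector are all assumptions made for the example. A real pipeline would most likely use a trained embedding model in place of `embed`, and the OCR step is assumed to have already produced plain text.

```python
import hashlib
import math
import re


def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text, dim=64):
    """Toy stand-in for an embedding model (assumption, not a real model).

    Each token is hashed into one of `dim` buckets, and the resulting
    count vector is L2-normalised so vectors are comparable by cosine.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def index_document(ocr_text):
    """Return (chunk, vector) pairs ready to be stored and searched."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(ocr_text)]
```

Overlapping chunks are used here so that a sentence cut at a chunk boundary still appears whole in at least one chunk; the overlap size is a tuning choice rather than something the project specifies.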

We had a variety of large language models to choose from, all of which require substantial memory. For our implementation, we chose Llama 2, which has billions of parameters. Where higher output quality is needed, a larger variant, such as the 70-billion-parameter model, can be used instead. With the vectorised data stored in our database, we feed it to Llama 2 and can then ask the model questions. These queries can request titles, brief overviews, summaries, terms, and keywords. The results read much like a human's understanding of the document.
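One plausible way to connect the stored vectors to the language model is to retrieve the chunks most similar to a question and place them in the prompt. The sketch below assumes this retrieval-style approach; the names `cosine`, `top_k`, and `build_prompt` and the prompt wording are hypothetical, and the actual call to Llama 2 is deliberately left out because the project does not say how the model was run.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, index, k=3):
    """Return the texts of the k chunks most similar to the query.

    `index` is a list of (chunk_text, vector) pairs.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Assemble a prompt that asks the model to answer from the context."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The string returned by `build_prompt` would then be passed to whichever Llama 2 runtime is in use, with a question such as "Suggest a title and five keywords for this document."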

The extracted information can also be stored in a database, which allows efficient search and retrieval and makes the associated documents easier to access.
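As an illustration of the storage step, the sketch below keeps the model's answers (title, summary, keywords) alongside the source file name and searches them by keyword. It assumes SQLite purely for convenience; the project does not name its database, and the table layout and the function names `open_archive`, `add_document`, and `search` are invented for this example.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the metadata database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               source_file TEXT NOT NULL,
               title TEXT,
               summary TEXT,
               keywords TEXT
           )"""
    )
    return conn


def add_document(conn, source_file, title, summary, keywords):
    """Store the metadata produced for one document; returns its row id."""
    cur = conn.execute(
        "INSERT INTO documents (source_file, title, summary, keywords) "
        "VALUES (?, ?, ?, ?)",
        (source_file, title, summary, ", ".join(keywords)),
    )
    conn.commit()
    return cur.lastrowid


def search(conn, term):
    """Find documents whose title, summary, or keywords contain `term`."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT source_file, title FROM documents "
        "WHERE title LIKE ? OR summary LIKE ? OR keywords LIKE ? "
        "ORDER BY id",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()
```

A search then returns the file names of matching records, so a catalogue user can go from a keyword straight to the original PDF or JPEG.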


#ai #information_retrieval #ocr #llm

Data Story


DataTrans is an AI platform that retrieves information from digital documents, such as images and PDF files, using OCR and an LLM.


Evidence of Work

Video

Homepage

Team DataSets

ACT Memory

Data Set

Challenge Entries

Making public archives more accessible

Online catalogues, like ACT Memory, provide information about government records and, where possible, provide copies of the records themselves. These records are generally in PDF or JPEG format. This makes the documents difficult to search for, access, and use. How might governments with record catalogues, like ACT Memory, solve this problem and make these rich sources of information more useful?

7 teams have entered this challenge.