Intellectual Property Office
Non-Confidential Disclosures
“Automatic Categorization of Figures in Scientific Documents”
PSU Invention Disclosure No. 3195
Field of the Invention/Key Words:
Automatic Image Categorization; Feature Extraction; Document Searching
Background:
Figures are an integral part of documents. In scientific documents especially, figures such as graphs, flow charts, diagrams, drawings, and photographs are often used to illustrate key ideas and findings and to help readers understand the technical details of the work. Human beings can interpret figures quickly and can often grasp the ideas they convey without reading the accompanying details; as the saying goes, “A picture is worth a thousand words.” The critical role of figures in understanding the contents of scientific documents warrants more effective use of them in scientific digital libraries. While the text within documents, including figure captions, is typically indexed for retrieval, end users of current digital libraries have no search engines or tools for finding information within the figures themselves. Ideally, search engines should use both textual and figure information to help users find relevant documents.
Invention description:
We have created an architecture (Figure 1) for retrieving documents by integrating figures with other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We have developed a machine-learning-based approach that automatically assigns figures to one of several categories based on their function in scholarly articles. The architecture uses both global features, such as texture, and part features, such as lines, to discriminate among figure categories. This approach has been evaluated on a test-bed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce results acceptable for real-world use. Our tools will be integrated into a scientific document digital library.
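The idea of combining global and part features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the disclosed method: the specific features (intensity spread and mean gradient magnitude as stand-ins for texture; the fraction of mostly dark rows and columns as a stand-in for line detection), the nearest-centroid classifier, the function names, and the synthetic "graph" and "photograph" images used in place of real figures.

```python
import numpy as np

def global_texture_features(img):
    """Hypothetical global features: intensity spread and mean gradient magnitude."""
    gy, gx = np.gradient(img.astype(float))
    return np.array([img.std(), np.hypot(gx, gy).mean()])

def line_part_features(img, dark=0.5, min_frac=0.8):
    """Hypothetical part features: fraction of rows/columns that are mostly dark,
    i.e. that look like long straight horizontal or vertical lines."""
    ink = img < dark
    row_lines = (ink.mean(axis=1) >= min_frac).mean()
    col_lines = (ink.mean(axis=0) >= min_frac).mean()
    return np.array([row_lines, col_lines])

def extract_features(img):
    """Concatenate global and part features into one vector."""
    return np.concatenate([global_texture_features(img), line_part_features(img)])

class NearestCentroid:
    """Toy classifier (an assumption): assign the label of the closest class mean."""
    def fit(self, X, y):
        self.labels_ = sorted(set(y))
        y = np.array(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return [self.labels_[i] for i in dists.argmin(axis=1)]

def make_graph(rng, size=32):
    """Synthetic stand-in for a graph: white canvas, two axis lines, a few points."""
    img = np.ones((size, size))
    img[-3, :] = 0.0   # horizontal axis
    img[:, 2] = 0.0    # vertical axis
    img[rng.integers(0, size - 4, 10), rng.integers(4, size, 10)] = 0.0
    return img

def make_photo(rng, size=32):
    """Synthetic stand-in for a photograph: dense random texture."""
    return rng.random((size, size))

rng = np.random.default_rng(0)
makers = {"graph": make_graph, "photograph": make_photo}

# Build a small labeled training set of feature vectors.
train_X, train_y = [], []
for label, make in makers.items():
    for _ in range(20):
        train_X.append(extract_features(make(rng)))
        train_y.append(label)

model = NearestCentroid().fit(np.array(train_X), train_y)

# Categorize two previously unseen synthetic figures.
test_X = np.array([extract_features(make_graph(rng)),
                   extract_features(make_photo(rng))])
predicted = model.predict(test_X)
print(predicted)
```

In this toy setup the line features separate axis-bearing graphs from texture-heavy images, while the gradient feature captures how dense the texture is. A practical system would presumably need richer features and a stronger classifier than the ones assumed here.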
Advantages:
- Automatic categorization of figures for indexing
- Enables search of information within figures
- Could potentially be used to extract data from within figures
- Automated extraction of figures within scientific documents
Contact:
Bradley A. Swope
Sr. Licensing Officer
Intellectual Property Office
The Pennsylvania State University
113 Technology Center
University Park, PA 16802-7000
Phone: (814) 863-5987
Fax: (814) 865-3591
E-mail: bradswope@psu.edu