SWG for Utilizing Documents Printed on Paper

Some information useful for AI is not available as digital data but is printed on paper. It is thus desirable to digitize such information so that AI can fully utilize it. The purpose of this SWG is to provide foundations for such digitization.

This SWG plans to begin with reviewed books and peer-reviewed articles owned by publishers and academic societies, and then to gradually expand its scope to include corporate documents. Books and articles will be collected in cooperation with the Japan Electronic Publishing Association (JEPA). The SWG hopes that corporate (intra-company) documents will be provided by companies, public offices, local governments, and research institutions.

Although this SWG uses document recognition technology for digitization, it has some special requirements. First, the document must be digitized in a form that AI can easily use. Second, the digitized document does not necessarily have to be editable. Third, the layout of the paper original does not need to be reproducible, but the digitized document does need structure that supports information access.
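The third requirement can be illustrated with a minimal sketch, assuming a purely hypothetical representation: the names Section, iter_sections, and find_by_heading and the sample content are illustrative assumptions, not a format chosen or proposed by this SWG. The sketch keeps only logical structure (headings, paragraphs, nesting) and no page layout, yet still lets an information user locate content by heading.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Section:
    """Hypothetical logical unit of a digitized document (not an SWG format)."""
    heading: str
    paragraphs: List[str] = field(default_factory=list)
    children: List["Section"] = field(default_factory=list)


def iter_sections(section: Section) -> Iterator[Section]:
    """Yield the section and all of its descendants in reading order."""
    yield section
    for child in section.children:
        yield from iter_sections(child)


def find_by_heading(root: Section, keyword: str) -> List[Section]:
    """Return every section whose heading contains the keyword (case-insensitive)."""
    return [s for s in iter_sections(root) if keyword.lower() in s.heading.lower()]


# Made-up sample content, for illustration only.
book = Section(
    heading="Sample Book",
    children=[
        Section(heading="Chapter 1: Methods", paragraphs=["Recognized text of chapter 1."]),
        Section(heading="Chapter 2: Results", paragraphs=["Recognized text of chapter 2."]),
    ],
)

matches = find_by_heading(book, "results")
```

Nothing in this structure records fonts, coordinates, or page breaks, and nothing in it needs to be editable; it only needs to be navigable.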

This SWG considers three types of players: paper document providers, electronic document providers, and information users.

A paper document provider is a player who provides documents printed on paper.
An electronic document provider is a player who receives documents printed on paper and provides digitized documents.
An information user is a player who receives digitized documents and utilizes them for AI processing.

This SWG will serve as a hub for these three types of players and will provide recommendations on how to:

  • Choose image formats for documents scanned from paper
  • Choose an image format for sample images of characters such as Gaiji (characters outside standard character sets)
  • Choose or develop formats for the digitized documents

Although document recognition technology will be used for digitization, this SWG will not conduct research and development of document recognition itself. That work is left to electronic document providers, who should pursue it at their own risk. However, this SWG may handle common components, or simple information sharing, in areas where no intellectual property rights arise.

