NSF Grant Funds Research on Islamicate Manuscript Transcription Methods
August 02, 2022
The innovative humanities-computer science collaboration will enable tools for accurate and automatic transcription of Arabic script.
By Jessica Weiss ’05
The University of Maryland has received a nearly $300,000 grant from the National Science Foundation that will support efforts to improve the way handwritten documents from the premodern Islamicate world—primarily in Persian and Arabic—are turned into machine-readable text for use by academics or the public.
Assistant Professor Matthew Thomas Miller and Mellon Postdoctoral Fellow Jonathan Parkes Allen, both of the Roshan Institute for Persian Studies, will work with researchers at the University of California San Diego (UCSD), led by computer scientist Taylor Berg-Kirkpatrick, on the innovative humanities-computer science collaboration. UCSD received its own $300,000 award.
Over three years, the researchers will work in the domain of handwritten text recognition, a set of methods designed to automatically read diverse types of human handwriting with high levels of accuracy.
“This work has the potential to remove substantial roadblocks for digital study of the premodern Islamicate written tradition and would be really transformative for future studies of these manuscripts,” Miller said. “We are very grateful to the NSF for its support.”
This latest research proposal builds on a number of ongoing efforts to develop open-source technology to expand digital access to manuscripts and books from the premodern Islamicate world in Arabic, Persian, Ottoman Turkish and Urdu; Miller currently leads an interdisciplinary team of researchers on a $1.75 million grant from the Mellon Foundation as well as a $300,000 grant from the National Endowment for the Humanities.
There are hundreds of thousands, perhaps even millions, of premodern Islamicate books and manuscripts spanning more than a millennium, from the 7th to the 19th centuries, forming perhaps the largest archive of cultural production of the premodern world. Scanning and digitization efforts over the last decade have made images of Islamicate manuscripts in a large number of collections available to the public. However, these images remain mostly “locked” for digital search and manipulation because their contents have not been transcribed into machine-readable text.
The task is made more difficult by the diversity and intricacy of many Arabic manuscripts, said Allen, who is a historian of early modern Ottoman religious and cultural history. The main text may be written alongside diagonal notes, annotations and corrections, in multiple colors and “hands.”
Under the NSF grant, researchers will develop new techniques for “unsupervised” transcription, an approach that removes the need for extensive manual labor by human transcribers. Eventually, the tools under development will produce models that can automatically transcribe large quantities of Persian and Arabic script in a multitude of different styles with substantially higher accuracy than is currently possible.
“The Arabic script tradition is so extensive and so broad,” Allen said. “People need to be able to read these manuscripts, search within them, and integrate them into their research.”