Workshop on Chinese Historical Databases: Sources, Methods, Prospects held at HKUST, January 11-12, 2024

Participants at the workshop Chinese Historical Databases: Sources, Methods, Prospects, held at HKUST on January 11 and 12, 2024

Cameron Campbell organized a meeting on Chinese Historical Databases: Sources, Methods, Prospects on January 11 and 12, 2024 at the Hong Kong University of Science and Technology.

The meeting is one in a series of activities intended to promote the development of research infrastructure for studying China’s past organized under the auspices of and with support from the RGC Areas of Excellence Project Quantitative History of China (Chen Zhiwu PI). Staff from the HKUST School of Humanities and Social Sciences, including Lee-Campbell Group RA Shengbin Wei, provided logistical support.

The meeting brought together historians and social scientists constructing databases suited for the quantitative analysis of Chinese history. Participants from Hong Kong, mainland China, and Europe introduced their databases. Some of these projects were already complete, others were in progress, and some were still in the planning stages. Presentations and discussion focused not only on the content of the databases and prospects for analysis, but also on nuts-and-bolts issues related to the construction, preservation, documentation, and dissemination of the databases. Several presentations covered techniques being used to automate the creation of databases, including OCR, tokenization, entity recognition, and record linkage.

Lee-Campbell Group members including Cameron Campbell, Dong Hao, Gao Shuaiqi, Chen Jun, Wu Yibei, James Lee, Hou Yueran, and Matt Noellert made presentations introducing their databases.

In addition to the presenters, other faculty and students attended as observers.

The meeting concluded with the development of plans for training workshops for historians to help them learn how to construct databases and make use of existing ones.

Christian Henriot has written a more detailed discussion of the Chinese historical databases meeting at the ENEP website.


Introductory Remarks by Chen Zhiwu, Cameron Campbell

Session 1 – New Approaches

Chair: Cameron Campbell

Lin Zhan
Content and Value of the Chinese Genealogy Database

Guenther Lomas
The Process of Building the Chinese Genealogy Database

Chen Yuqi
Geocoding the Past World: Unearthing Coordinates of Early China from Texts Using Large Language Models

Session 2 – Geographic, Economic, and Other Context

Chair: Chen Zhiwu

Hu Heng

Ma Debin
Quantifying Living Standards, an Overview

Ziang Liu
Early Modern Wages: Data and Limits

Gao Shuaiqi

Session 3 – Late Imperial China I

Chair: James Lee

Ma Min

Dong Hao
East Asian Population Databases

Christian Henriot
Modern China Historical Database: Current Status and Future Prospects

Session 4 – Late Imperial China II

Chair: Debin Ma

Cameron Campbell
CGED-Q: Current Status and Future Plans

Chen Jun
CGED-Q ZSBL: Military Officials

Fu Haiyan

Session 5 – ROC

Chair: Dong Hao

Yibei Wu
Late Qing and Beiyang Student Records, and Beiyang and ROC Officials

Hou Yueran
Construction of Occupational Database of Tsinghua Students Studying in America with Boxer Indemnity Fund (1909-1944)

Lik Hang Tsui
Ink Trails: Correspondence and Connections in a Dataset of Epistolary Manuscripts from Song China

Session 6 – ROC and PRC

Chair: Christian Henriot

Matthew Noellert
Lee-Campbell Group Post-1949 Rural Datasets

James Lee
Lee-Campbell Group PRC and ROC Educational, Academic, and Professional Datasets

Chen Ting
Post-1949 County Gazetteers

Pierre Landry
China’s provincial CCP élite since 1921

Future Directions

Panel with remarks by Cameron Campbell, Zhiwu Chen, Christian Henriot, and James Z. Lee

James Lee, Christian Henriot, Zhiwu Chen and Cameron Campbell at the closing session. Photo by Xue Qin.

Participant Roster

Tsui Lik Hang 徐力恒

Tutorial for using R to analyze the CGED-Q JSL Public Releases

Chen Jun, my MA student at Central China Normal University, has shared slides and sample code he produced to help anyone planning to use R to analyze the CGED-Q JSL public releases. The materials are all in Chinese. They introduce how to import the public data into R, create and transform variables, process strings to create variables, and tabulate and graph results. We hope that this will be useful to users of the data.

New paper by others using CMGPD-LN

We were pleased to learn that Yu Bai, Yanjun Li, and Pak Hong Lam have published a paper, “Quantity-quality trade-off in Northeast China during the Qing dynasty,” in the Journal of Population Economics using the public release of the CMGPD-LN! We hope that their paper, along with other recent publications using the dataset, will inspire more researchers to use it.

Here is a link to their paper:

We are eternally grateful for the support from NICHD that allowed us to prepare the CMGPD-LN for release, and to ICPSR for hosting the dataset.

CGED-Q JSL receives Best Project Award (最佳项目奖) at China Digital Humanities 2022 Annual Meeting

We are pleased to report that the China Government Employee Database-Qing (CGED-Q) Jinshenlu (JSL) dataset was one of four projects to receive the Best Project Award (最佳项目奖) at the China Digital Humanities 2022 Annual Meeting, held at Renmin University on November 26 and 27.

For more information about the award, please see the final report of the CDH 2022 meeting.

For more information about the CGED-Q JSL, please see the project page at the Lee-Campbell Group Website.

CGED-Q Jinshenlu 1850-1864 Public Release now available

We have just made available for download the China Government Employee Database-Qing (CGED-Q) Jinshenlu 1850-1864 Public Release. This release consists of 341,092 quarterly records of 37,632 officials (by our linkage) who served between 1850 and 1864. The information is drawn from 26 quarterly editions.

We chose 1850-1864 as the next period for a release since it includes the Taiping Rebellion, a major event in 19th century Qing history.

Each record includes information about the post, and if it was occupied, the holder, including their name, province and county of origin, qualification, and other information.

Together with our previous release of 686,945 records for the period 1900-1912, we have now released publicly more than 1,000,000 records from the CGED-Q.

The 1850-1864 and 1900-1912 releases may be downloaded at the HKUST Dataspace, the Harvard Dataverse, and the mirror site at Renmin University Institute for Qing History:

HKUST Dataspace

Harvard Dataverse

Renmin University Institute for Qing History


Github repository with code for the CMGPD Public Release

We created a repository to share the Stata code that processes the original CMGPD data from the Excel spreadsheets produced by our coders and turns it into the working file that is the basis of our analysis and of the public release available at ICPSR. The repository is intended to help users of the data better understand how the spreadsheets transcribed by the coders became the datasets available at ICPSR. The code for linking individuals to their kin may be of particular interest.

Major phase of data entry for the China Government Employee Database-Qing Jinshenlu (CGED-Q JSL) completed

In November 2021, our coders completed the entry of virtually all the quarterly editions of the rosters of Qing civil officials 縉紳錄 and military officials 中樞備覧 available to the Lee-Campbell Group, including all the editions from the published Tsinghua University Library collection and other editions from the Columbia University and Harvard University libraries, as well as the National Library and Shanghai Library. We are grateful to the staff of all these libraries, in particular the Columbia University Library, for their cooperation in making their holdings available. We have also located a number of other editions in the Peking University Library and the Palace Museum Library, but do not yet have access to these data. We are not aware of any other readily accessible editions in other collections.

The CGED-Q JSL now consists of 4,433,600 records of 327,618 officials for the period between 1760 and 1912. Of these records, 3,843,644 are records of civil offices in editions of the jinshenlu and 589,956 are records of military offices in editions of the zhongshubeilan. The data are most complete for the period 1830 to 1912. According to our analysis based on our most recent record linkage, 261,451 of these officials were civil officials, 58,482 were military officials, and 7,685 appeared as both civil and military officials. Please note that since these counts of officials are based on record linkage, they may change as we adjust our nominative linkage procedures.
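As a quick arithmetic check on the figures above, the following minimal Python sketch confirms that the civil and military record counts add up to the total number of records, and that the three groups of officials add up to the total number of officials. The variable names are our own choices for illustration and do not correspond to fields in the dataset.

```python
# Counts as reported in this post; variable names are illustrative only.
civil_records = 3_843_644
military_records = 589_956
total_records = 4_433_600

civil_only_officials = 261_451
military_only_officials = 58_482
both_civil_and_military = 7_685
total_officials = 327_618

# The record counts should sum to the reported total.
records_sum = civil_records + military_records

# The three mutually exclusive groups of officials should sum to the total.
officials_sum = (
    civil_only_officials + military_only_officials + both_civil_and_military
)

# Average number of records per linked official.
records_per_official = total_records / total_officials

print(records_sum == total_records)
print(officials_sum == total_officials)
print(f"{records_per_official:.1f} records per official on average")
```

Because the counts of officials depend on record linkage, the average number of records per official would shift if the linkage procedures were revised.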

Figure 1 (below) summarizes the coverage of the entered 縉紳錄 editions by decade (black bars) and compares it to the potential coverage if all the editions in different collections were entered. In the 1840s, and then from 1870 to 1912, we have entered at least one edition per year. In the 1830s, and then from 1850 to 1869, we have at least one edition entered for 9 out of 10 years in each decade. Between 1800 and 1830, the coverage of our entered data is spottier. We have at least one edition in 7 out of 10 years in the 1800s, 4 out of 10 years in the 1810s, and 6 out of 10 years in the 1820s. From 1760 to 1800, our coverage is less complete, with at least one edition entered for only 2 to 4 years in each decade.

Figure 1. Entered and Available Editions

Based on our review of the catalogs of other collections, it should still be possible to improve coverage of the last half of the 18th century and first half of the 19th century. The heights of the green bars represent the numbers of years for which at least one edition appears to exist in other collections. Most of these are in the Peking University Library and the Palace Museum. We hope very much to gain access to these collections at some point in the future.
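For readers who want to produce a similar per-decade summary from their own list of editions, the sketch below shows one way to count the years in each decade that have at least one edition. It is an illustration only, not the code we used: the function name is hypothetical, and the sample years are invented rather than taken from the CGED-Q.

```python
from collections import Counter


def years_covered_by_decade(edition_years):
    """Count, per decade, the distinct years with at least one edition.

    edition_years may contain repeats, since several quarterly editions
    can fall in the same year.
    """
    distinct_years = set(edition_years)
    counts = Counter((year // 10) * 10 for year in distinct_years)
    return dict(sorted(counts.items()))


# Invented sample data for illustration; not actual CGED-Q edition years.
sample_edition_years = [1801, 1801, 1803, 1812, 1815, 1815, 1821, 1829]

coverage = years_covered_by_decade(sample_edition_years)
for decade, n_years in coverage.items():
    print(f"{decade}s: {n_years} year(s) with at least one edition")
```

Counting distinct years rather than editions matches the way coverage is described here, where a year counts as covered once any one of its quarterly editions has been entered.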

Figure 2 presents a more detailed view of the coverage of the editions so far. From about 1865 onward, we have 3 or 4 editions per year entered all the way to 1911. From 1830 to 1865 or so, we have at least one or two editions per year entered, except for one year each in the 1850s and 1860s where we have no editions at all. Before 1830, it is more common to have one or two editions entered, or none at all.

Figure 2. Entered editions by year

For more details about the CGED, please see the project page.

Addendum – 30 April 2022

Since November 2021, we have found five more editions that had been entered but not added to our central work file. This post and the content of related pages have been updated accordingly.