Creating synthetic data for testing the clerical linkage interface
- Generate ~10,000 fake people into the Person Index
- Generate 5,000 fake households into the Household Index
- Generate 500 fake communal establishments (CEs) into the CE Index
- Assign people from the Person Index to each household or CE:
  - Households should contain between 1 and 5 people
  - CEs should contain between 6 and 49 people
- Create some twins
- Create households with shared surnames and with few shared first names
- Copy a random 94% of the Person Index into the Census table and remove 3% of households/CEs
- Copy a random 94% of the Person Index into the CCS table and remove people from 3% of households while keeping the address info in the CCS household/CE list ('dummy households')
- For records that appear in both census and CCS, perturb on either record:
  - 35% of forename variables
  - 32% of surname variables
  - 1% of sex variables
  - 15% of day of birth
  - 13% of month of birth
  - 13% of year of birth
  - 4% of postcodes
- Swap the geography (i.e. edit the household table) for 50 of the CCS addresses (to mimic people moving house)
- Duplicate 2% of people between 2 and 5 times within the same household, and 5% of people into different households. Each duplicate should have its own person ID
- Set a relationship status for each person assigned to a census household. Note that a resident can have a relationship status only if they are in a household, not in a CE
- Add 200 visitor records and link them randomly to the census households (TODO: make some of the census visitors, say 50, person records in CCS)
- Add 200 visitor records and link them randomly to the CCS households (TODO: make some of the CCS visitors, say 50, person records in Census)
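
The sampling and perturbation steps above could be sketched roughly as follows. This is a minimal illustration, not the actual generator: the field names, the name lists, the placeholder postcode format, and the helper functions (make_person_index, sample_fraction, perturb_value, perturb_matched) are all assumptions made for the example. Only the 94% sampling rate and the per-field perturbation rates come from the list. The sketch also assumes that "perturb on either record" means one side of each selected pair is picked at random and that each percentage is a share of matched records.

```python
import random

# Perturbation rates from the list above (share of matched records per field).
PERTURB_RATES = {
    "forename": 0.35, "surname": 0.32, "sex": 0.01,
    "dob_day": 0.15, "dob_month": 0.13, "dob_year": 0.13,
    "postcode": 0.04,
}

# Hypothetical name pools, purely for illustration.
FORENAMES = ["Alex", "Sam", "Priya", "Tom", "Mia", "Omar", "Jo", "Lena"]
SURNAMES = ["Smith", "Jones", "Patel", "Brown", "Khan", "Evans"]


def make_person_index(n_people, rng):
    """Build n_people fake person records keyed by person ID."""
    return {
        pid: {
            "person_id": pid,
            "forename": rng.choice(FORENAMES),
            "surname": rng.choice(SURNAMES),
            "sex": rng.choice(["M", "F"]),
            "dob_day": rng.randint(1, 28),
            "dob_month": rng.randint(1, 12),
            "dob_year": rng.randint(1930, 2020),
            "postcode": "ZZ" + str(rng.randint(1, 99)) + " 1AA",  # placeholder format
        }
        for pid in range(n_people)
    }


def sample_fraction(ids, fraction, rng):
    """Return a random subset holding round(fraction * len(ids)) of the IDs."""
    ids = list(ids)
    return set(rng.sample(ids, round(fraction * len(ids))))


def perturb_value(field, value, rng):
    """Return a value guaranteed to differ from the original."""
    if field == "sex":
        return "F" if value == "M" else "M"
    if field == "dob_day":
        return value % 28 + 1          # stays within 1..28
    if field == "dob_month":
        return value % 12 + 1          # stays within 1..12
    if field == "dob_year":
        return value - 1
    # String fields: simulate a typo by dropping one random character.
    i = rng.randrange(len(value))
    return value[:i] + value[i + 1:]


def perturb_matched(census, ccs, rng):
    """For each field, perturb one side of a fixed share of matched records."""
    matched = sorted(census.keys() & ccs.keys())
    counts = {}
    for field, rate in PERTURB_RATES.items():
        chosen = rng.sample(matched, round(rate * len(matched)))
        for pid in chosen:
            target = rng.choice([census, ccs])  # perturb either record
            target[pid][field] = perturb_value(field, target[pid][field], rng)
        counts[field] = len(chosen)
    return counts


rng = random.Random(42)  # fixed seed so the test data is reproducible
people = make_person_index(1000, rng)
# Independent 94% samples; records are copied so edits do not touch the index.
census = {pid: dict(people[pid]) for pid in sample_fraction(people, 0.94, rng)}
ccs = {pid: dict(people[pid]) for pid in sample_fraction(people, 0.94, rng)}
counts = perturb_matched(census, ccs, rng)
print(counts)
```

Seeding the generator keeps the synthetic data identical between runs, which helps when comparing clerical decisions against a known truth. Copying each record before perturbing leaves the Person Index untouched, so it can act as the ground truth for evaluating links.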