Commit e6a03208 authored by Marie Adler

Update README.md

parent 383d1a3e
Branches main
@@ -29,17 +29,20 @@ The following instructions are intended to enable the use of the Python script a
Libraries:
In order for the program to run, additional libraries may need to be installed locally. The libraries used are specified in the first lines of the files.
Creating the standard form:
The program is designed to merge two tables with personal data. To do this, both tables must first be brought into a standardized form. The design of these tables is explained in the corresponding article. Because source data vary widely, the conversion into the standardized form has to be tailored to each data set. As an example, the creation of the standard form for the Leipzig data sources used in the corresponding article is described here. In the respective programs ("normform_KLF.py" and "normform_KLK.py"), the underlying data is read in, processed and output as a CSV file. These programs cannot be used unchanged for other data tables, but they provide a suitable starting point for adapting to other data.
Creating the standard form (Normform):
The program is designed to merge two tables with personal data. To do this, both tables must first be brought into standard form. The design of these tables is explained in the corresponding article. Because source data vary widely, the conversion into the standard form has to be tailored to each data set. As an example, the creation of the standard form for the Leipzig data sources used in the corresponding article is described here. In the respective programs ("normform_KLF.py" and "normform_KLK.py"), the underlying data is read in, processed and output as a CSV file. These programs cannot be used unchanged for other data tables, but they provide a suitable starting point for adapting to other data.
Input files:
The CSV files resulting from the Normform programs must be named "normform1.csv" and "normform2.csv". To merge persons within a single data source, the two files may have identical content.
Shortening the processing time:
For particularly large tables, the processing time can be shortened by setting the variable sortingBySurnameGiven to 1 at the start of "main.py". The data tables are then sorted alphabetically by the variable surnameGiven. If "normform1.csv" and "normform2.csv" have identical content, only persons whose surnames (variable surnameGiven) begin with the same letter are merged. The drawback is that entries for the same person are not merged when the spellings of the surname begin with different letters (e.g. "Bauer" and "Pauer").
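The effect of comparing only surnames with the same first letter can be illustrated with a minimal sketch. It is not the code in "main.py"; the function name candidate_pairs and the sample rows are hypothetical, and only the column name surnameGiven is taken from the description above.

```python
from collections import defaultdict


def candidate_pairs(table1, table2, key="surnameGiven"):
    """Yield row pairs whose surnames start with the same letter.

    Hypothetical helper for illustration only; the real program may
    organise its comparison differently.
    """
    # Group the second table by the first letter of the surname.
    blocks = defaultdict(list)
    for row in table2:
        blocks[row[key][:1].upper()].append(row)

    # Walk the first table in alphabetical order and compare each row
    # only with rows from the matching first-letter group.
    for row1 in sorted(table1, key=lambda r: r[key]):
        for row2 in blocks.get(row1[key][:1].upper(), []):
            yield row1, row2


# Made-up sample data.
people1 = [
    {"id": "1", "surnameGiven": "Bauer"},
    {"id": "2", "surnameGiven": "Becker"},
]
people2 = [
    {"id": "7", "surnameGiven": "Pauer"},
    {"id": "8", "surnameGiven": "Bauer"},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(people1, people2)]
```

In this sample, "Pauer" (id 7) is never compared with "Bauer" (id 1), which is exactly the drawback described above: fewer comparisons, but spelling variants in the first letter are missed.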
Output files:
The program generates the file "tableResult.csv". Its columns are separated by tab characters. The structure of the table corresponds to the standard form. In addition, it contains the columns "idSource1" and "idSource2", which can be used to trace which entries have been merged. The "id" column is named "idGlobal" in the results table.
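Because the results table is tab-separated, it has to be read with the tab character as delimiter. The following is a minimal sketch using only the Python standard library; the function name read_result and the sample content are hypothetical, and the column surnameGiven merely stands in for the remaining standard-form columns.

```python
import csv
import io


def read_result(handle):
    """Read a tab-separated results table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up sample standing in for the content of "tableResult.csv".
sample = (
    "idGlobal\tidSource1\tidSource2\tsurnameGiven\n"
    "1\t12\t57\tBauer\n"
)
rows = read_result(io.StringIO(sample))

# Trace which source entries were merged into each result row.
merged = [(r["idGlobal"], r["idSource1"], r["idSource2"]) for r in rows]
```

For the real file, the same function could be called on `open("tableResult.csv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need to be adjusted.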
Further iterations:
When a single data table is merged with itself, one run of the program generally recognizes and merges only two entries for the same person. If persons are present more than once in the basic table, the results table can therefore be fed into another iteration as "normform1.csv" and "normform2.csv". Before doing so, the columns "idSource1" and "idSource2" must be deleted from the results table, and the column "idGlobal" must be renamed to "id".
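The preparation for a further iteration can be sketched as follows. This is a minimal illustration, not part of the repository: the function name prepare_next_iteration is hypothetical, and the delimiter expected for the "normform" input files is not stated here, so it is exposed as the parameter out_delimiter (tab by default, which is an assumption).

```python
import csv
import io


def prepare_next_iteration(src, dst, out_delimiter="\t"):
    """Turn a tab-separated results table into a new input table.

    Drops "idSource1" and "idSource2" and renames "idGlobal" to "id".
    out_delimiter is an assumption; set it to whatever the normform
    files in your setup use.
    """
    reader = csv.DictReader(src, delimiter="\t")
    kept = [c for c in reader.fieldnames if c not in ("idSource1", "idSource2")]
    header = ["id" if c == "idGlobal" else c for c in kept]

    writer = csv.writer(dst, delimiter=out_delimiter, lineterminator="\n")
    writer.writerow(header)
    for row in reader:
        writer.writerow([row[c] for c in kept])


# Made-up sample standing in for the content of "tableResult.csv".
src = io.StringIO(
    "idGlobal\tidSource1\tidSource2\tsurnameGiven\n"
    "1\t12\t57\tBauer\n"
)
dst = io.StringIO()
prepare_next_iteration(src, dst)
output = dst.getvalue()
```

The resulting table would then be saved twice, once as "normform1.csv" and once as "normform2.csv", before running the program again.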