From 7a3a8180144fed572455ceea2ed77dc25811d834 Mon Sep 17 00:00:00 2001
From: Marie Adler <adler@hab.de>
Date: Tue, 9 Jan 2024 09:32:27 +0000
Subject: [PATCH] Update README.md

---
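Note on the "Unstable internet connection" section of the README text added by this patch: it describes pausing for one second when the GOV web service cannot be reached. The snippet is a minimal sketch of that idea, not the code in "provincefinder.py". The function name `call_with_pause`, its parameters, the retry loop and the use of `ConnectionError` are assumptions made for illustration; only the on/off switch (`withSleeping`) and the one-second pause come from the README.

```python
import time


def call_with_pause(request, with_sleeping=1, pause_seconds=1.0, max_attempts=5):
    """Call request() and return its result.

    Hypothetical helper: if with_sleeping is truthy and request() raises
    ConnectionError, wait pause_seconds and try again, up to max_attempts
    calls in total. With with_sleeping set to 0 the error is raised at once.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return request()
        except ConnectionError:
            # Give up if pausing is switched off or the attempts are used up.
            if not with_sleeping or attempt >= max_attempts:
                raise
            time.sleep(pause_seconds)
```

A caller would wrap the web service lookup in a function without arguments and pass it as `request`. Bounding the number of attempts is a design choice of this sketch; the README does not say how often the script retries.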
 README.md | 40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index e5d85be..84fa315 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,5 @@
+*Please scroll down for the English version.*
+
 # Goldberg - Urbanonyme
 
 Die folgende Anleitung soll eine Benutzung des Python-Skripts und eine Interpretation der Ergebnisse ermöglichen. Der Programmcode ist kompatibel mit der Python-Version 3.6.
@@ -21,7 +23,43 @@ Die Clusterbildung von Orten nimmt bei deren Identifizierung eine bedeutende Rol
 Ausgabedateien:
 Das Programm produziert drei Dateien, in der die einzelnen Spalten per Tabstopp voneinander getrennt sind. Die Datei „quality.csv“ gibt Auskunft über die Beschaffenheit und Qualität der Informationen in den GEDCOM-Dateien. Pro GEDCOM-Datei existiert eine Zeile mit Angaben zum Dateinamen, der Anzahl der Ortsangaben in der Datei, dann diese Anzahl der Ortsangabe aufgeteilt in Orte ohne Treffer (noHit), Orte mit mehr als einem Treffer (moreThanOneHit) und Orte mit genau einem Treffer (definitely coordinates), den Mittelpunkt der Längen- sowie der Breitengrade, die Anzahl existierender Cluster, die Anzahl relevanter Cluster, sowie eine Liste der Koordinaten der Mittelpunkte relevanter Cluster.
 Die Datei „provincesdict.csv“ enthält vier Spalten: Die unveränderte Ortsbezeichnung einer Quelle, den Dateinamen, die GOV-ID und die zugeordnete Provinz. Sie hat den Zweck, dass doppelt vorkommende Ortsbezeichnungen in einer Datei nicht doppelt verarbeitet werden müssen.
-Die Datei „placefinder.csv“ enthält zu jeder Ortsangabe Informationen über die ID (GOV-ID), die Koordinaten, eine Information wie die Zuordnung zur GOV-ID stattgefunden hat, die bereinigte Version des Ortsnamens, den originalen Ortsnamen sowie den Namen der Datei, in der der Name vorkommt
+Die Datei „placefinder.csv“ enthält zu jeder Ortsangabe Informationen über die ID (GOV-ID), die Koordinaten, eine Information wie die Zuordnung zur GOV-ID stattgefunden hat, die bereinigte Version des Ortsnamens, den originalen Ortsnamen sowie den Namen der Datei, in der der Name vorkommt.
 
 
 Jan Michael Goldberg, 30. Juni 2022
+
+__________________________________________________________________________________
+# Goldberg - Urbanonyms
+
+The following instructions explain how to use the Python script and interpret its results. The program code is compatible with Python version 3.6.
+
+## Libraries:
+Additional libraries may need to be installed locally before the program can run. The libraries used are listed in the first lines of the respective files.
+
+## Input files:
+The program processes location data from GEDCOM files. The GEDCOM files must be named with consecutive numbers ("1.ged", "2.ged" etc.), and each number may be used only once. These files are placed in a subfolder "data". If sources other than GEDCOM files are to be used, the program must be modified. It is not advisable to use only a single list of location data, as the program relies on relating location data to a context. Context here means that locations are named in the same source, which implies geographical proximity. A separate file should therefore be created and processed for the location data of each context.
+
+The "data" subfolder also contains the Mini-GOV files. The Mini-GOVs for Germany, Poland, Austria, Switzerland, the Czech Republic, Denmark, France and the Netherlands are included by default.
+The program also opens the files "quality.csv", "placefinder.csv" and "provincesdict.csv", which are located in the same folder as the "main.py" file. These are also the output files of the program (see below). If they do not exist, they are generated. If they do exist, the existing data is reused so that GEDCOM files that have already been processed are not processed again. This is particularly useful when the program crashes occasionally due to an unstable Internet connection (see next section).
+
+## Unstable internet connection:
+The program uses the GOV web service to retrieve information on individual locations, which requires a continuous Internet connection. Temporary disconnections, particularly over Wi-Fi, can cause the program to crash, so an optional delay is built in for such disruptions. It can be switched on and off manually: the variable withSleeping is located in the file "provincefinder.py" at the beginning of the function "provinceFinder()". If it is set to 1 and a connection to the web service cannot be established, the program pauses for one second. However, this also means that the program takes longer to run. The delay is not activated by default.
+
+## Parallelization:
+GEDCOM files are processed in parallel to increase speed. You can specify how many processor cores are used by changing the "Pool()" parameter in "main.py". If it is left empty, which is the default in the script, all available cores are used.
+
+## Province assignment:
+The location details are assigned to different provinces. By default, "provincefinder.py" assigns provinces for the periods before 1871 and after 1990. Provincial assignment is not possible for the period in between. However, this can be adapted and extended as needed. The reference time can be changed via the referencetime variable in the "parallel()" function in "main.py". It is set to the year 1800 by default.
+
+## Cluster:
+The clustering of locations plays an important role in their identification. The minimum distance and the minimum number of locations in a cluster can be varied. The minimum distance between two clusters can be changed in the "qualityChecker()" function in the "qualitychecker.py" file, in the condition "if distance <= 50:". The same function contains the variable minimumClusterSize, which sets the minimum size of a cluster. It is set to 6 locations by default.
+
+## Output files:
+The program produces three files in which the columns are separated by tabs. The "quality.csv" file provides information about the nature and quality of the information in the GEDCOM files. It contains one line per GEDCOM file with the file name, the number of locations in the file, this number broken down into locations without hits (noHit), locations with more than one hit (moreThanOneHit) and locations with exactly one hit (definitely coordinates), the midpoint of the longitudes and of the latitudes, the number of existing clusters, the number of relevant clusters and a list of the coordinates of the centers of relevant clusters.
+The file "provincesdict.csv" contains four columns: the unchanged location name from a source, the file name, the GOV-ID and the assigned province. Its purpose is to ensure that location names occurring more than once in a file do not have to be processed twice.
+For each place name, the "placefinder.csv" file contains the ID (GOV-ID), the coordinates, information on how the assignment to the GOV-ID took place, the cleaned version of the place name, the original place name and the name of the file in which the name appears.
+
+
+Jan Michael Goldberg, June 30, 2022 *(translation by Marie Adler, January 09, 2024)*
+
+
-- 
GitLab