The procedures used to recover the historical weather and climate data can be divided into three main parts. The first is the pre-transcription processing, which includes locating the observations, obtaining digital image files, processing the image files and configuring the app for data transcription (Fig. 4a). The second part is the actual transcription process. The third part is the post-transcription processing of the now machine-readable weather observations (Fig. 4b).
Fig. 4
The pre- and post-processing climatic data extraction procedures. (a) Pre-transcription processing steps focus on obtaining and cataloguing images of the weather observations and formatting the transcription app. (b) Once the data is transcribed, it is verified using various techniques and converted to standard modern units.
A traceable transcription and validation process is critical to maintaining transparency in data records. Here, traceability begins by maintaining a connection to the original archival source through the digital image file of the original meteorological observations. The code for the transcription app can be found on the GitHub platform39.
The image files
The image files obtained from the NARA M1958 repository13 were organized by station location and renamed according to register type, page type and date. Three considerations went into naming the image files. First, we wished to embed metadata of interest, such as the location and time period covered by the data, in the image file name, so that the image file could be located easily when the weather data obtained from it were examined later in the project. Second, we wished to create a unique identifier for each image file. Finally, we wanted to maintain traceability of the data from the original archive identifier to the data export in flat files. To accomplish these three goals, the elements listed in Fig. 5 were combined to form each image file name.
Fig. 5
Image file nomenclature for a traceable data cycle process.
Each image file was given a unique identifier composed of the station location, the observer’s name if necessary for disambiguation, the register type, the originating archive identifier, the date of the observations and the page type. A typical file name is YorkFactory_USSS-316_M1958_1883-06-01_OBS-1.jpg. The image files were then uploaded to the web app, and appropriate transcription environments created to replicate each register type.
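The naming scheme can be illustrated with a minimal Python sketch that builds and parses such file names. This is not the project's own code: the class `ImageName`, the function `parse_filename` and the field names are hypothetical, and the sketch assumes that the elements are joined by underscores in the order listed above and that no element itself contains an underscore (which would not hold for diary register types such as Amherstburg_AM-1).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ImageName:
    """Elements of an image file name (hypothetical field names)."""
    station: str
    register_type: str
    archive_id: str
    obs_date: date
    page_type: str
    observer: Optional[str] = None  # only present when needed for disambiguation

    def filename(self, ext: str = "jpg") -> str:
        # Assumed order: station, [observer], register, archive, date, page type.
        parts = [self.station]
        if self.observer:
            parts.append(self.observer)
        parts += [self.register_type, self.archive_id,
                  self.obs_date.isoformat(), self.page_type]
        return "_".join(parts) + "." + ext


def parse_filename(name: str) -> ImageName:
    """Split a file name back into its elements (assumes no '_' inside an element)."""
    stem, dot, _ext = name.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {name!r}")
    parts = stem.split("_")
    if len(parts) == 5:
        station, register, archive, day, page = parts
        observer = None
    elif len(parts) == 6:
        station, observer, register, archive, day, page = parts
    else:
        raise ValueError(f"unexpected number of elements in {name!r}")
    return ImageName(station, register, archive,
                     date.fromisoformat(day), page, observer)


example = parse_filename("YorkFactory_USSS-316_M1958_1883-06-01_OBS-1.jpg")
print(example.register_type, example.obs_date, example.page_type)
```

Parsing a name and rebuilding it with `filename()` should return the original string, which gives a simple check that a file name still carries the archive identifier and date needed for traceability.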
Image files were examined for quality, and notes were made on any quality issues. Notes include comments on the condition of the image or of the original pages, such as “major ink smear”, “badly scanned, but mostly legible”, “pasted in values”, or “heavy bleedthrough”. The image files from NARA and NCEI were obtained from microfilm images of the original documents, and so are at several removes from the original documents (Fig. 2). In some cases, the original documents were not accessible, due to either their fragility or their location in an overseas archive. Problems with the image files originated from two main sources. The first, which is inherent to the original documents, was their quality and the conditions under which the weather observations were recorded and transmitted to the Smithsonian Institution. Tears in the pages were common in older documents and in pages from trading posts in Canada’s interior such as Fort Simpson or Michipicoten, which presumably had long postal journeys over rough terrain. A substantial number of pages had ink blots or bleedthrough obscuring sections of the page. At Wolfville, two months of humidity observations had recalculated values pasted over the originally recorded values. Some of these issues, particularly the bleedthrough and pasted sections, might have been easier to resolve had the images been available in colour rather than as black-and-white microfilm. The second major quality issue occurred when part of the document was obscured by what appears to be tape or bindings on the document itself, or by possible microfilm issues. These problems can leave portions of the observations on the page irrecoverable.
The images from UKMO were taken directly from the original sources, either by photograph or by high-quality scan. In the photographs, the binding of the original volumes distorted the values near the bound edges of the pages and made them difficult to read. The images scanned professionally by the UKMO library and archives staff were the highest-quality images, with few issues.
The register types
As most of the records were inspired by the Smithsonian volunteer weather observing network, the observers recorded their observations on pre-printed forms distributed first by the Smithsonian Institution (Fig. 2), and later by its various successors such as the US Signal Service. Similarly, the observations from the UKMO archives were largely taken by military observers with standard printed forms. Although an advantage of the printed forms is that they provide uniformity across stations and time, there is nonetheless some variety in the forms. Formats changed over time as observing variables were added or removed, or as observing instructions were updated with evolving needs and improving instrumentation. Different forms were sent to different observers depending on the types of observations made. The forms were catalogued and given a code based on the number of pages in the form: the “100” code family denotes a one-page form, “200” a two-page form, and so on (Table 1). A new register type was coded if the variables recorded changed or if the layout of the printed form changed. A subtype was noted if the observer added handwritten modifications or additional observations. This structure is designed to be flexible, as new register types are continually being identified in new source materials.
Table 1 Register and Page Types.
Some of the observations are recorded in personal diaries or in handwritten tables. These are unique, and as such do not conform to register page type cataloguing. These are given register types with the abbreviation of their location followed by a numerical designation for each change in the variables or layout of the diary or table. For example, the Amherstburg register14 changed formats several times, and the designations assigned to the register types are Amherstburg_AM-1, Amherstburg_AM-2, and Amherstburg_AM-3.
The page type
Each register type can have one or more page types. A page type has a specific organization of information, both meteorological observations and metadata such as station location, observer, date and variables observed. It should be noted that not all observers had sufficient time or the necessary instruments to record all variables listed in the forms; thus not all variables listed on a register type were necessarily recorded. The Smithsonian and later US Army Signal Service observation forms were sheets designed to be folded and sent by mail13. Their format changed over time: at times the form consisted of one sheet folded in two, with instructions printed on the reverse side. Later, the form consisted of four pages: a page of instructions, two inside pages for observations, and a page for recording remarks and casual phenomena. The page types are divided into observations pages (OBS; Fig. 2a,b), casual phenomena pages (CP; Fig. 2c), and instruction pages (INS; Fig. 2d). On some forms the casual phenomena and instructions appear on the same page (CP-I).
Each page and register type has a specific combination of observing times and meteorological variables recorded. An example of the meteorological variables and the original measurement units for register type USSS-316 is shown in Table 2, along with modern equivalents and units where possible.
Table 2 Example of meteorological variables for register type USSS-316 with units and abbreviations: barometer, thermometer, cloud and wind, precipitation, humidity and weather remarks.
The microfilming of the original documents led to some documents being photographed as one image, while others were split into two separate images: the left-hand side of one original document page and the right-hand side of another. To capture this diversity of formats, the observations pages are subdivided into full observations pages (OBS-F), left-side pages (OBS-L) and right-side pages (OBS-R). The register types for the US Signal Service 314 and 316 had distinct pages, so these were named 1 and 2 rather than left and right (Fig. 2a,b). Each register type has one or more page types associated with it (Table 1).
The transcription process
As observations are transcribed into the web app, they are saved directly into a database. Both our transcription environment and our data output are designed to resemble the original observations as closely as possible, for reasons of error reduction and conservation of scientific heritage.
The transcription interface for each register type is therefore built up to reflect the observation groupings in the original register pages. For example, within the user interface (UI) on the administrator pages, a field group is created and named “Clouds”. Fields are created and named “Cloud direction”, “Cloud amount” and “Cloud kind” (Fig. 6). Field values are then created for the field “Cloud kind”, which provide the options for the drop-down menu, such as Cumulus, Nimbus, etc. These field values are then linked with the field “Cloud kind” in the UI (Fig. 7), the field “Cloud kind” is linked to the field group “Clouds”, and the field group “Clouds” is linked to a register schema, such as USSS-316.
Fig. 6
Example of the relationship between the register types, page types, field groups, fields and field options.
Fig. 7
Example of the transcription app (https://eccc.opendatarescue.org/). Values from the image file (background) of the original weather register are entered into the transcription bar. The values are then saved directly into the database and can be verified by looking at the transcription data. Each field group is colour coded.
All linkages made in the UI are reflected in the back-end database. Field values, fields and field groups can be used in more than one register schema. The field options for variables that have a limited set of valid values, such as cloud type or wind direction, are accessed by a drop-down menu to limit transcription or interpretation errors (Fig. 7). Fields that are not constrained to limited options, such as barometer observations, have a free-text entry field.
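The hierarchy of register schema, field group, field and field value can be sketched as a small data structure. The following Python sketch is an illustration only and is not taken from the transcription app: the class names, the `accepts` method and the “Barometer” field name are assumptions, and only the two cloud kinds named in the text are listed as options.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    name: str
    # A list of options models a drop-down menu; None models free-text entry.
    options: Optional[List[str]] = None

    def accepts(self, value: str) -> bool:
        """Free-text fields accept anything; drop-down fields only listed options."""
        return self.options is None or value in self.options


@dataclass
class FieldGroup:
    name: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class RegisterSchema:
    name: str
    groups: List[FieldGroup] = field(default_factory=list)


# Illustrative options only; the full list belongs to the app's database.
cloud_kind = Field("Cloud kind", options=["Cumulus", "Nimbus"])
clouds = FieldGroup("Clouds", [Field("Cloud direction"),
                               Field("Cloud amount"),
                               cloud_kind])
barometer = Field("Barometer")  # hypothetical free-text field

# The same field group object can be linked to more than one register schema.
usss_316 = RegisterSchema("USSS-316", [clouds])
usss_314 = RegisterSchema("USSS-314", [clouds])

print(cloud_kind.accepts("Cumulus"), cloud_kind.accepts("Cumulis"))
```

Restricting a field to listed options is what prevents a misspelt cloud kind from being saved, while the free-text field leaves numeric readings unconstrained.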
Occasionally, modifications by the original observers necessitated the addition of new fields and sometimes even the creation of new register types during the transcription process. The observer for York Factory, for example, added minimum and maximum thermometer and supplementary barometer observations to the register. These additional observations, as well as additions to the printed observation forms, account for the differences between register types USSS-314 and USSS-316.
Data transformation to modern standards
Table 2a–f gives an overview of the historical variables, the modern equivalents, historical and modern units, and the internationally agreed-upon abbreviations for these variables where they appear in standardized filename datasets. Information on the specific observations, instruments and variables, such as the wind scale, is found in historical technical documents, such as Instructions to Observers pamphlets or articles38,40.
Not all historical variables have yet been assigned recognized modern equivalents. Most historical observations from the 19th century are not recorded in the standard SI units accepted in internationally exchanged data files (Table 2a–f). The information needs to be transformed into modern units as designated by, for example, the World Meteorological Organization (WMO) standards. Conversion values for pressure (Table 2a), temperature (Table 2b), precipitation (Table 2d) and humidity (Table 2e) are well known. Wind and cloud directions are transformed from cardinal directions to degrees (Tables 2c, 3). Variables which are recorded in ordinal scales, such as wind force (Tables 2c, 4) or cloud velocity (Table 2c), are more difficult to transform into modern equivalents. The Smithsonian Institution developed a wind scale which was contemporaneous with, but not completely equivalent to, the Beaufort wind scale.
Table 3 Wind and cloud direction conversions from cardinal and intercardinal directions to degrees.
Table 4 Smithsonian and USSG Wind Scale conversions.
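The direction conversion can be sketched as a simple lookup. The Python sketch below is an assumption-laden illustration and does not reproduce Table 3: it assumes a 16-point compass with north at 0° and 22.5° between points, and the function name `direction_to_degrees` is hypothetical. Where Table 3 differs (for example, by coding north as 360°), the table takes precedence.

```python
# Assumed 16-point compass, clockwise from north (N = 0 degrees).
POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
          "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
DEGREES = {point: index * 22.5 for index, point in enumerate(POINTS)}


def direction_to_degrees(text: str) -> float:
    """Convert a transcribed compass direction such as 'NE' or 'n.w.' to degrees."""
    key = text.strip().upper().replace(".", "")
    if key not in DEGREES:
        raise ValueError(f"unrecognized direction: {text!r}")
    return DEGREES[key]


print(direction_to_degrees("NE"), direction_to_degrees("SSW"))
```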
The Royal Engineers were requested to measure the wind force in pounds per square foot. These measurements were sometimes recorded in pounds and ounces, such as at Halifax (e.g. 3 15; Fig. 8a), and sometimes in decimal pounds, such as at Kingston (e.g. 3.8; Fig. 8b). At still other stations, such as New Westminster, the engineers measured the wind force using the Beaufort scale. Some of the wind force observations are further complicated by observers changing methods of recording partway through their records (e.g. from Beaufort to miles per hour; see Fig. 8c). With up to five different methods of recording, the wind force field is one of the most difficult to interpret correctly. As the SEF file formatting standard requests wind speed in m/s rather than wind force scales, wind force was one of the most complicated variables to address.
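Entries recorded in pounds and ounces must first be brought to decimal pounds before any further conversion. A minimal Python sketch follows, assuming that the two numbers in an entry such as “3 15” are separated by whitespace and that the second number is avoirdupois ounces (16 to the pound), possibly with a decimal fraction; the function name is hypothetical.

```python
def pounds_ounces_to_pounds(entry: str) -> float:
    """Convert an entry such as '3 15' (3 lb 15 oz) to decimal pounds."""
    parts = entry.split()
    if len(parts) != 2:
        raise ValueError(f"expected 'pounds ounces', got {entry!r}")
    pounds, ounces = float(parts[0]), float(parts[1])
    if pounds < 0 or not 0 <= ounces < 16:
        raise ValueError(f"value out of range in {entry!r}")
    return pounds + ounces / 16.0  # 16 ounces to the pound


print(pounds_ounces_to_pounds("3 15"))
```

Under these assumptions the Halifax example “3 15” corresponds to 3.9375 decimal pounds, which is directly comparable with the decimal entries such as 3.8 recorded at Kingston.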
Fig. 8
(a) Wind force measurements taken by the Royal Engineers at Halifax, in pounds per square foot, measured in pounds, ounces, and fractions of ounces; (b) wind force measurements by the Royal Engineers in Kingston, in pounds and decimal tenths of pounds; (c) original “wind force” measurements for Winnipeg, where an abrupt transition takes place from a one to ten scale (Smithsonian scale) to values up to the high 30s.
Most conversions were applied using the standard functions contained in the lmrlib.py routine produced by the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) from NOAA41. The formula used for converting wind force in pounds per square foot to metres per second is given by Equation 1:
Equation 1. Conversion from wind force measured in pounds per square foot to metres per second
$$ws = 0.44704\,\sqrt{\frac{wf}{0.00256}}$$
where ws is wind speed in m/s and wf is wind force in lb/ft². The constant 0.00256 relates wind pressure in pounds per square foot to the square of the wind speed in miles per hour, and 0.44704 converts miles per hour to metres per second.
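As a worked illustration, Equation 1 can be written as a short Python function. This is a sketch of the equation as printed, not the lmrlib.py routine; the function name `wind_force_to_ms` is hypothetical, and wf is taken to be in the units used in Equation 1.

```python
import math

MPH_TO_MS = 0.44704        # metres per second in one mile per hour
PRESSURE_COEFF = 0.00256   # coefficient from Equation 1


def wind_force_to_ms(wf: float) -> float:
    """Apply Equation 1: wind force (pressure) to wind speed in m/s."""
    if wf < 0:
        raise ValueError("wind force cannot be negative")
    return MPH_TO_MS * math.sqrt(wf / PRESSURE_COEFF)


# Example: the Kingston-style decimal entry 3.8
print(round(wind_force_to_ms(3.8), 2))
```

For an entry of 3.8, the term under the square root is 3.8 / 0.00256 ≈ 1484.4, whose square root is about 38.5, giving a wind speed of roughly 17.2 m/s.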
Cloud velocities are also recorded on a scale of 1 to 10 (Table 2c). Cloud amounts, or cloud cover, were commonly recorded in tenths in the 19th century, whereas the units prescribed by the SEF standards are oktas (eighths). Cloud types used in the Smithsonian and other records are listed in Table 5, along with equivalents from the International Cloud Atlas42.
Table 5 Cloud types used in the Smithsonian registers and International Cloud Atlas equivalents.
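The conversion of cloud amounts from tenths to eighths (oktas) can be illustrated with a short Python sketch. The rounding rule is an assumption made here for illustration: the sketch scales linearly by 8/10 and rounds to the nearest whole okta, which is not necessarily the rule applied in this dataset, and the function name is hypothetical.

```python
def tenths_to_oktas(tenths: int) -> int:
    """Convert cloud cover in tenths (0-10) to oktas (0-8).

    Assumption: linear scaling by 8/10, rounded to the nearest whole okta.
    """
    if not 0 <= tenths <= 10:
        raise ValueError("cloud amount in tenths must be between 0 and 10")
    # Integer arithmetic avoids floating-point rounding surprises.
    return (tenths * 8 + 5) // 10


print([tenths_to_oktas(t) for t in range(11)])
```

Under this rule two pairs of tenths values (2 and 3, 7 and 8) map to the same okta, so the conversion is not reversible, which is one reason for keeping the original tenths alongside the converted values.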
Historical weather remarks are more difficult to translate to modern synoptic weather codes (Tables 2f, 6). The weather conditions described in the historical registers and the modern Canadian synoptic codes are not fully equivalent. Some conditions described in the historical documents have no parallel in the synoptic codes, and similarly, some synoptic codes have no exact equivalent in the historical wording of past weather conditions.
Table 6 Historical Weather remarks and equivalent Canadian Synoptic Weather Codes46.