SciGRID_gas - Data Model of the European Gas Transport Network.

The current transition in the European energy sector towards climate neutrality requires detailed and reliable energy system modeling. The quality and relevance of energy system modeling depend heavily on the availability and quality of model input datasets. However, detailed and reliable datasets, especially for energy infrastructure, are still missing. In this contribution, we present our approach to developing an open-source and open-data model of the gas transport network in Europe. Various freely available data sources were used to collect gas transport datasets and their attributes. The resulting datasets were processed, and duplicate elements were merged. Statistical and heuristic methods were used to generate missing network element attributes. As a result, we successfully created a gas transport network model using only open-source data. The SciGRID_gas model contains 237,000 km of pipeline data, which is a good approximation of known length values. In addition, datasets of compressor stations, LNG terminals, storage, production sites, gas power plants, border points, and demand time series are provided. Finally, we discuss data gaps and how they can potentially be closed.


Introduction
The most significant shares of gross energy consumption in Europe in 2019 were held by oil and petroleum products (34.5 %), followed by natural gas (23.1 %), together representing almost 60 % of energy consumption [Eur21b]. The high share of natural gas in the energy mix reflects its essential role as an energy carrier and the need to decarbonize sectors where it is used as a fuel source. This will be achieved mainly by ramping up the integration of renewable energy sources (RES) in the energy system. In this context, new flexibility concepts are needed to integrate higher RES shares while maintaining energy supply stability and reliability. Such flexibilities are offered by Power-to-X (P2X) technologies combined with energy storage [SB15]. P2X refers to various processes that convert and "store" electricity, using surplus RES electric power to balance energy seasonally and spatially. A promising P2X technology is power-to-gas (P2G), which has a vast potential for decarbonizing different energy sectors such as heating and transport. P2G uses surplus electrical power to produce hydrogen via water electrolysis [SSZ21, KKT+21, SGR+15]. Hydrogen can then be used as a fuel directly or converted to LPG, syngas, or methane. The produced gas can then be transported using the current gas transport network.
arXiv:2201.08827v1 [physics.soc-ph] 21 Jan 2022
Despite the importance of modeling and analysis of the gas sector and its interactions with other energy sectors, no reliable datasets describing the gas transport grid exist. Examples of available datasets are limited to single countries and do not provide details of the grid components. Such examples are the LKD-EU dataset for Germany [KKS+17b] and the National Grid dataset for the UK [nat20]. The lack of datasets motivated us to initiate the SciGRID_gas project [PM18] at the DLR Institute of Networked Energy Systems. The goal of the SciGRID_gas project is to derive a reliable and detailed dataset for the gas transport grid in Europe which can be used for modeling and analysis purposes. In practice, the source code of the data model, the geo-referenced datasets describing the gas transport grid, and the documentation are made available under open source licences. In order to use SciGRID_gas in the simulation of energy systems, the integration of the datasets in existing energy system models such as open_eGo, PyPSA, and pandapipes is planned. With the SciGRID_gas data model, we would like to answer the following research questions:
• Can we build a reliable data model for the European gas transport grid using only publicly available data?
• Is the amount of available parameter data sufficient to estimate missing parameter data via heuristics and statistical methods?
This contribution is structured as follows: In Chapter 2, we discuss the data sources used for constructing the open-source gas transport network model. This is followed by the discussion of the model architecture in Chapter 3. Chapter 4 gives a short overview of the methods used for creating the model. Due to the page limit, more detailed information is available in the respective model documentation, which is accessible online [PM18]. Chapter 5 presents our model's graphical and statistical results. This is followed by the discussion in Chapter 6 and the conclusion and outlook in Chapter 7.

Data sources
Obtaining reliable open-source data of the gas transport system is a challenging task. The grid data of gas Transmission System Operators (TSOs) are commonly not standardized, nor are they freely accessible [Ent20]. Data are generally not geo-referenced and mainly available as PDF maps. Most individual TSOs are not willing to share their data for competitive reasons. Within the SciGRID_gas project, we have gathered freely available data from different sources. The most relevant are presented below. In each subsection name, we first indicate the source of the dataset (e.g., Web Search), followed by the name (e.g., INET) we gave the dataset in the SciGRID_gas project.

Web Search - INET
We carried out a web search on all gas network components and compiled the gathered data into the INET dataset. The data stems from TSO press releases, TSO transparency platforms, and TSO public data. Some TSO information had to be made available by the TSOs due to EU regulations [The13]. Other information has been made public as part of a company's self-presentation and advertisement. The collected information includes data on network components. This includes their positions but also relevant energy modeling parameters, like diameter, capacity, power, pressure, etc.

German gas model - LKD
The long-term planning and short-term optimization dataset (LKD) [KKS+17a] contains geo-referenced data on gas facilities in Germany. It was created by several German research institutes and includes information on gas pipelines, production sites, storage, compressor locations, and nodes. The SciGRID_gas project was granted the right to use, change and redistribute the LKD data under an open license from the LKD project members.

EntsoG - EMAP
The development of the European gas transport network is coordinated by the European Network of Transmission System Operators for Gas (EntsoG). EntsoG is an association of 44 European TSOs, three associated partners, and nine observers. EntsoG members are required to publish certain information according to EU directives. A significant amount of this information is incorporated in the freely available and regularly updated map of the gas pipelines, drilling platforms, and storage facilities. The SciGRID_gas project extracted the rough course and location of the depicted gas pipelines, storage, and production facilities using the 2019 map.

Eurostat - Cons
The European Statistical Office (Eurostat) collects and publishes data on energy supply, transformation, and consumption on a monthly and yearly basis. Eurostat statistics, such as the complete energy balances dataset [Eur21a] and others, provided the data foundation for one of our studies regarding the European gas demand. We derived gas demand time series with a daily resolution covering the years 2010-2019 with a NUTS 3 spatial resolution for 27 European countries. To provide detailed information for modelers and to keep the dataset flexible, the time series distinguish between the household, commercial, and industry sectors. Figure 1 provides an exemplary data plot for the annually averaged residential gas demand in Europe (2010-2019) disaggregated into NUTS 3 regions. The data derivation techniques showed good benchmarking results against three existing time series of gas demand in Germany [San21a] originating from the DemandRegio project [F.20]. Significant insights were obtained when analyzing the time series concerning the seasonal, geographical, and sector-specific variability of the gas demand in Europe [San21a].

OpenStreetMap - OSM
OpenStreetMap (OSM) [Hel18] is a freely modifiable and accessible geo-data database with steadily increasing data coverage and data quality. In the past, OSM data has contributed to the field of energy system modeling, for example in the creation of power grid models [MMS+17] or the optimization of flexibility options for urban areas [AMVA17a, AMVA17b]. With esy-osmfilter [PL20], we created a Python library to easily access and filter data from OpenStreetMap. We used esy-osmfilter to analyze the European gas pipeline data in OSM. Our analysis showed that gas transport pipeline data is well represented and rapidly growing. However, some countries show significant data gaps. Moreover, data relating to system-relevant components like compressor stations or storage is clearly missing. With osmscigrid [AO20], we created another library to convert OSM pipeline data directly to the SciGRID_gas data format for easier integration of OSM datasets. OSM data has the downside of being licensed under ODbL, which is not compatible with the CC-BY license, a hurdle that can be overcome using a collective database. However, we have decided that the current data will not be used directly in our model but only to validate the topology.

Landsat 5
In order to address missing data, we conducted a feasibility study on gas pipeline detection using remote sensing and Artificial Intelligence (AI) methods. Our approach detects pipeline routes by identifying the 16-28 m wide construction lane of a gas pipeline. We trained a convolutional neural network to discriminate between pixels labeled as Pipeline and pixels labeled as Background. For this purpose, we used Landsat 5 imagery. Training and tests on the British gas transmission network and the NEL pipeline showed good evaluation scores. These results prove the concept of using AI and remote sensing methods on historic open-source satellite imagery to detect pipeline pathways [DPZM21].

Model Architecture
The SciGRID_gas data model network consists of several component classes, each representing a list of objects. The following component classes have been implemented: PipeSegments (PS), BorderPoints (BP), Compressors (CS), LNGs (LNG), PowerPlants (PP), Productions (PO), Consumers (CO), Storages (ST). Any object which is a member of a component class is defined as an element of that respective class and can therefore be described by a common component-specific set of attributes. To keep track of the data processing steps, we have also included parameter information for each attribute of a specific element.
To make this data structure suitable for gas networks, we need to restructure the data using nodes and edges. These are connected to the elements in our dataset by a unique ID. In that way, all components with the exception of PipeSegments are implemented as nodes. PipeSegments, in contrast, each have a start and an end node, so they were implemented as edges. Intermediate pipeline points, reflecting the geographical course of PipeSegments, are stored in separate lists.
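The node-and-edge layout described above can be illustrated with a minimal sketch. The class names, field names, and example values below are hypothetical and chosen for illustration only; they are not the actual SciGRID_gas schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A point-like network element, e.g. a compressor or a storage."""
    id: str
    component: str  # component class, e.g. "Compressors"
    lat: float
    lon: float
    attributes: dict = field(default_factory=dict)
    # Per-attribute processing information, e.g. {"max_power_MW": "raw"}
    methods: dict = field(default_factory=dict)


@dataclass
class PipeSegment:
    """An edge that links two nodes through their unique IDs."""
    id: str
    start_node: str
    end_node: str
    # Intermediate (lat, lon) points tracing the geographical course
    course: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)
    methods: dict = field(default_factory=dict)


def neighbours(node_id, segments):
    """Return the IDs of all nodes directly connected to node_id."""
    result = set()
    for seg in segments:
        if seg.start_node == node_id:
            result.add(seg.end_node)
        elif seg.end_node == node_id:
            result.add(seg.start_node)
    return result


# Hypothetical example network with three nodes and two pipe segments
nodes = {
    "N1": Node("N1", "Compressors", 52.0, 8.0,
               {"max_power_MW": 25.0}, {"max_power_MW": "raw"}),
    "N2": Node("N2", "Storages", 52.5, 9.0),
    "N3": Node("N3", "BorderPoints", 53.0, 7.0),
}
segments = [
    PipeSegment("PS1", "N1", "N2", course=[(52.2, 8.4)],
                attributes={"diameter_mm": 900}),
    PipeSegment("PS2", "N3", "N1"),
]
```

Keeping only node IDs on the edges means that a node's attributes can be updated in one place without touching the pipe segments that reference it.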

Methodology
In this section, we describe the methodology we developed to create the gas network dataset. This covers, in particular, the release of datasets from various sources, the creation of a merged network dataset, the post-processing, and the visualization.

Data Basis
We have converted the datasets from the various data sources mentioned in Chapter 2 from their original formats to the SciGRID_gas format described in Chapter 3 and released them in the downloads section of the project website [PM18]. Table 1 gives an overview of the data sources constituting the IGGIELGNC-1 dataset.

Data Merging Process
In the next step, we have merged the various datasets we obtained. This task required identifying duplicate elements that exist in more than one dataset. For this task, we rely on the criteria of spatial and name similarity using the fuzzywuzzy Python package [Inc14]. The algorithm evaluates the identity of objects with an identity score between 0 and 100, where 0 indicates no similarity and 100 indicates an apparent duplicate. Components are merged if they exceed a component-dependent threshold between 80 and 95. If the likely duplicates do not share the same attribute values, the values of the subjectively most trustworthy source from Chapter 2 are adopted. This process works for all components which are implemented as nodes. The process of merging pipelines is more complex. For edges, the process is built around a similarity check of the start and end node positions and comparisons of the diameter, pressure, capacity, and length values. The respective algorithm is described in more detail in the documentation of the final dataset on the website [PM18]. In terms of data, we have used the pipeline data from the EMAP dataset as a basis for the pipeline system and combined them with pipeline data from the INET and LKD datasets.
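The duplicate check for node components can be sketched as follows. This is a simplified illustration, not the project's algorithm: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for fuzzywuzzy, and the equal weighting of name and spatial similarity, the 10 km cut-off, the threshold of 90, and all example records are assumptions.

```python
import math
from difflib import SequenceMatcher


def name_score(a, b):
    """Name similarity on a 0-100 scale (stand-in for a fuzzywuzzy ratio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(h))


def identity_score(a, b, max_km=10.0):
    """Combine name and spatial similarity into one 0-100 score.

    The 50/50 weighting and the 10 km cut-off are illustrative assumptions.
    """
    d = distance_km(a["lat"], a["lon"], b["lat"], b["lon"])
    spatial = 100.0 * max(0.0, 1.0 - d / max_km)
    return 0.5 * name_score(a["name"], b["name"]) + 0.5 * spatial


def merge(trusted, other, threshold=90.0):
    """Merge two records if they look like duplicates, else return None.

    Values of the more trusted source win; gaps are filled from the other.
    """
    if identity_score(trusted, other) < threshold:
        return None
    merged = dict(other)
    merged.update({k: v for k, v in trusted.items() if v is not None})
    return merged


# Hypothetical records from two sources
a = {"name": "Alpha Compressor Station", "lat": 51.660, "lon": 7.630,
     "power_MW": None}
b = {"name": "alpha compressor station", "lat": 51.661, "lon": 7.631,
     "power_MW": 30.0}
c = {"name": "Beta LNG Terminal", "lat": 43.0, "lon": 3.0, "power_MW": None}

merged_ab = merge(a, b)  # duplicate: name and position taken from a
merged_ac = merge(a, c)  # no duplicate: returns None
```

In this sketch the merged record keeps the name and position of the trusted source and takes the power value from the second source, which mirrors the rule that the most trustworthy source prevails where attribute values conflict.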

Attribute Generation
Once a merged dataset has been compiled, we focus on predicting missing data on the attribute level. Depending on the attribute, the approaches differ in how well they estimate missing values. One approach is to use heuristics to determine missing values. For example, in the case of missing pipeline capacity values, one could use the capacity of an adjacent compressor station to derive this value, taking all other incoming and outgoing pipelines into consideration. This approach, of course, requires the construction of meaningful heuristics and sufficient data.
Another approach is exploiting linear relations between different parameters of the same element. For example, one can identify a linear relation between the maximal power and the maximal capacity of a compressor. For this purpose, we have used the Lasso linear regression method from scikit-learn [sl19]. However, in most cases a meaningful linear correlation is not identifiable, or the data density is insufficient. In those cases, we rely on a statistical approach and calculate mean or median values.
We have also used some of our unused data from Chapter 2 to derive statistical correlations and heuristics in some particular situations. For all derived values, the data generation process also stores the applied method in the corresponding metadata of this value. This is especially useful if a user wants to distinguish raw and generated data. The user can then identify further heuristics and develop other data generation methods based only on the original attribute data.
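The combination of regression, statistical fallback, and method tracking can be sketched as follows. This is a minimal illustration under stated assumptions: it uses an ordinary least-squares fit in plain Python instead of the Lasso regression from scikit-learn, and the attribute names, the correlation threshold of 0.8, the minimum of three data points, and the example values are all hypothetical.

```python
import math
from statistics import median


def least_squares(xs, ys):
    """Return (slope, intercept, r) of a least-squares line, or None."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # constant data: no meaningful linear relation
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


def fill_missing(elements, target, predictor, min_r=0.8, min_points=3):
    """Fill missing target values and record the method used for each value."""
    known = [(e[predictor], e[target]) for e in elements
             if e.get(predictor) is not None and e.get(target) is not None]
    fit = least_squares([x for x, _ in known], [y for _, y in known])
    use_fit = (fit is not None and len(known) >= min_points
               and abs(fit[2]) >= min_r)
    # Fallback: median of all known target values (assumes at least one exists)
    fallback = median(e[target] for e in elements
                      if e.get(target) is not None)

    for e in elements:
        methods = e.setdefault("methods", {})
        if e.get(target) is not None:
            methods.setdefault(target, "raw")
        elif use_fit and e.get(predictor) is not None:
            slope, intercept, _ = fit
            e[target] = slope * e[predictor] + intercept
            methods[target] = "linear_regression"
        else:
            e[target] = fallback
            methods[target] = "median"
    return elements


# Hypothetical compressor records; None marks a missing value
compressors = [
    {"max_power_MW": 10.0, "max_cap": 20.0},
    {"max_power_MW": 20.0, "max_cap": 40.0},
    {"max_power_MW": 30.0, "max_cap": 60.0},
    {"max_power_MW": 40.0, "max_cap": None},  # filled by regression
    {"max_power_MW": None, "max_cap": None},  # filled by median
]
fill_missing(compressors, "max_cap", "max_power_MW")
```

Because every value carries its method label, a user can later filter for entries marked "raw" and rerun a different estimation on the original data only.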

Post-Processing and Visualization
Finally, we have added our artificial Consumer (or demand) nodes to the network and connected them to the nearest pipeline. We have compiled the final dataset for the three different consumer aggregation levels NUTS 1, NUTS 2, and NUTS 3, which resulted in the datasets IGGIELGNC-1 [DPS+21a], IGGIELGNC-2 [DPS+21b], and IGGIELGNC-3 [DPS+21c], respectively. Further, some cleanup routines have been implemented, e.g., for removing or connecting isolated elements to create a coherent network. Also, during post-processing, the elevation of each node was determined with the help of the Bing Maps Elevation API [Mic20]. Additionally, we have released qplot [Plu20], a matplotlib [Hun07] based visualization library for SciGRID_gas data. The library was used for the creation of Fig. 4.
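Connecting a consumer node to the network can be sketched as a nearest-neighbour search. The sketch below is a simplification and an assumption on our side: it attaches the consumer to the closest pipeline node rather than to the closest point along a pipeline course, and the node IDs and coordinates are hypothetical.

```python
import math


def distance_km(p, q):
    """Great-circle distance between two (lat, lon) points in km."""
    radius = 6371.0  # mean Earth radius in km
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(h))


def connect_consumer(consumer_pos, pipeline_nodes):
    """Return (node_id, distance_km) of the pipeline node nearest the consumer."""
    node_id, pos = min(pipeline_nodes.items(),
                       key=lambda item: distance_km(consumer_pos, item[1]))
    return node_id, distance_km(consumer_pos, pos)


# Hypothetical pipeline nodes and one consumer at a NUTS region centroid
pipeline_nodes = {"N1": (52.0, 8.0), "N2": (52.5, 9.0), "N3": (53.0, 7.0)}
consumer = (52.4, 8.9)

nearest_id, gap_km = connect_consumer(consumer, pipeline_nodes)
# A new connecting edge could then be stored, e.g. as:
link = {"start_node": "CO1", "end_node": nearest_id, "length_km": gap_km}
```

For a full network, a spatial index would avoid the linear scan over all nodes, but the linear version is enough to show the idea.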

Results
We have released our final results under the name IGGIELGNC-1 on our website [PM18], where it is linked to its respective Zenodo repository. The data is licensed under CC-BY, available in CSV and geojson formats, and accompanied by a documentation of the methods. We want to emphasize that the following results are from version 1.1 of the dataset and are subject to changes in future updates.
We have merged the pipeline systems of INET with a total length of about 60,000 km, EMAP with a total length of about 207,000 km, and LKD with a total length of about 27,000 km into a network which finally contains 237,000 km of gas transport pipelines. This pipeline system is plotted in Fig. 2. For comparison, the extrapolated pipeline validation dataset from OSM in 2020 only contains a total length of 108,000 km. In Fig. 3 we show the pipeline system with all other components, which sum up to 109 BP, 248 CS, 32 LNG, 314 PP, 102 PD, 108 CO, and 294 ST (refer to Chapter 3 for the nomenclature). A country-wise overview of all components for some EU countries is shown in Table 2. Next, we looked at the attribute data for PipeSegments, Storages, LNGs and Compressors. For this purpose, we have chosen up to three of the most relevant attributes for each component and determined their respective data density. Tab. 3 shows the result in percent. We have not considered BorderPoints or Consumers for this analysis as these components are based on data aggregation and have no real physical counterpart. We have visually validated the topology of our network datasets with the OSM pipeline data. This process is illustrated in Fig. 4 for the validation of the INET data for the region of Spain. Our overall impression was that the topology of all major pipelines is in good agreement with the currently available OSM data.
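The data density mentioned above can be read as the share of elements of a component for which an attribute value is known. A minimal sketch of this calculation follows; the attribute name and the example records are hypothetical and do not reproduce values from Tab. 3.

```python
def data_density(elements, attribute):
    """Percentage of elements that carry a known value for the attribute."""
    if not elements:
        return 0.0
    known = sum(1 for e in elements if e.get(attribute) is not None)
    return 100.0 * known / len(elements)


# Hypothetical pipe segments; None or a missing key marks an unknown value
pipes = [
    {"id": "PS1", "diameter_mm": 900},
    {"id": "PS2", "diameter_mm": None},
    {"id": "PS3"},
    {"id": "PS4", "diameter_mm": 1200},
]
density = data_density(pipes, "diameter_mm")  # 2 of 4 known
```

To report the density of raw data only, the same count could be restricted to values whose stored method marks them as original rather than generated.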

Discussion
We presented our approach to creating a gas transport network data model of Europe from publicly available data sources. In terms of pipelines, our data model has a total length of about 237,000 km, which is roughly in accordance with the commonly assumed length of 200,000 km [CLM+14]. The slight overestimation is probably a direct result of our broader definition of the European gas network, which also incorporates the western part of Russia, some North African states, and Turkey. However, the total length is a good indicator for the overall success of modeling the European gas transport network from open-source data.
Further, we have used OpenStreetMap data to validate our network's topology visually. Our data show good accordance in this regard. Nevertheless, the OSM dataset currently contains only about 45 % of all pipelines due to significant data gaps in some regions. Also, this method cannot be used to validate parallel pipelines because they are not explicitly specified in OSM. The examination of other components like PowerPlants and Productions has shown data gaps in some regions stemming from incomplete data sources. We have further noticed that some countries still have missing BorderPoints. Since our algorithm generates these artificial nodes, this will be updated in future dataset versions.
Further validation of our dataset is currently not feasible due to a lack of relevant open-source validation datasets. This is also why we have avoided evaluating our attribute generation methods, which are described in more detail in the dataset documentation. Instead, we want to discuss the potential accuracy of such methods. The accuracy of any attribute generation method designed to predict partially unknown data scales with the percentage of known data. We have analyzed our data regarding important parameters of different network components. We can therefore state that the generated attributes for LNGs, Storages and PowerPlants are more trustworthy than those for PipeSegments, Compressors and Productions.
From our perspective, more focus needs to be put on the data acquisition for these components. Such data might be provided by OpenStreetMap (OSM) soon. Pluta and Lünsdorf [PL20] have stated that the gas transport data content grew rapidly between 2014 and 2019. If this trend continues or is even supported by TSOs, OSM might become a good source for this data. At some point, it might even be possible to create an entire network from OSM data, as was done for the open-source power transmission dataset SciGRID_power [MMK16]. The reason why this is currently not possible for the gas grid is mainly rooted in the fact that gas transport pipelines are buried underground. This makes the direct identification of their position and additional properties difficult for OpenStreetMap mappers.

Conclusion and Outlook
This contribution shows that creating a European gas transport network model is possible using only open-source data sources. Further, we have used statistical and heuristic methods to generate missing network element attributes. Nevertheless, complete data validation was not possible due to the lack of verification data. However, our analysis of the underlying component attribute data showed that gas transport pipeline and compressor station data show low density in terms of their main attributes. Since both components are critical in modeling gas flows, future work will aim to reduce these data gaps. We have discussed how some data gaps can potentially be closed either by a steady growth of OSM data or by remote sensing methods. However, some data, especially on the pipeline materials and roughness values, are not accessible without the assistance of TSOs. We believe that our data model is a valid approximation of the current European gas network. It can potentially encourage TSOs to make their data open-source, which in the long term would result in a more precise representation of the gas grid and better-suited energy scenarios.

Figure 2. The pipeline system of the European gas transport grid in the IGGIELGNC-1 dataset. Pipelines are colored according to their diameter values.

Figure 3. Extract of the European gas network of the IGGIELGNC-1 dataset with all its components.

Figure 4. Comparison of pipelines from OSM (black) and the INET dataset (colored with diameter [mm]).

Table 1. Overview of the available data sources for different gas transport components.

Table 2. Total pipeline (PS) length and compressor station (CS) count for some countries. Data source: IGGIELGNC-1