Application of remote sensing methods for statistical estimation of organic matter in soils

Maintaining soil fertility and optimizing land management strategies on a global scale necessitates a comprehensive understanding of soil’s physicochemical properties. This study proposes the deployment of remote sensing techniques as a more efficient, accurate, and cost-effective alternative for exploring the properties of chernozem soils and for predicting soil organic matter (SOM) in specific regions of Kyiv, Ukraine. The spectral properties of the chernozem soils were examined by Landsat 5 TM and 8 OLI satellite imagery. A mosaic of the mean spectral reflectance values for the study period (1986-2015) was created on the Google Earth Engine. The construction of the predictive model employed reflectance values across six bands, a range of vegetation indices (RNDSI, NDWI, NDBI, BAEI, MSAVI, and NDVI), topographic data (slope), and climatic parameters (temperature, precipitation, soil moisture) as regressors. The optimal model was subsequently determined using the Forward stepwise method, which is predicated on F-test and Akaike criteria. This model yielded notable metrics of Multiple R = 0.651, R 2 = 0.424, and Adjusted R 2 = 0.397. The most significant predictors within the model were determined to be RED, NIR, slope, and soil moisture. Upon validation, the model displayed no evidence of heteroscedasticity, multicollinearity, or potential outliers, thereby underscoring its reliability and robustness. The proposed model serves as a valuable tool, offering both reliability and precision in predicting SOM in specific regions of Ukraine.


Introduction
The significance of soil organic matter (SOM) research cannot be overstated, being at the forefront of solutions to pressing global issues.As a substantial carbon repository, SOM is indispensable to strategies aimed at mitigating climate change, with research on its dynamics providing invaluable insights for enhancing carbon sequestration efforts (Baveye et al., 2020;Navarro-Pedreño, Almendro-Candel and Zorpas, 2021).Moreover, SOM bolsters soil fertility, a critical factor in augmenting agricultural productivity, thereby addressing concerns of food security in the face of an escalating global population (Khangura et al., 2023;Parikh and James, 2012;Das et al., 2021;Chatohin and Lindin, 2001).Beyond that, SOM serves as a vital energy source for soil microorganisms, fostering soil biodiversity (Bach et al., 2020;Wang et al., 2022;Zhang et al., 2021).Recent advancements in remote sensing technologies offer transformative possibilities in the field of SOM research, allowing for large-scale, non-invasive, and temporal assessments of soil conditions (Omia et al., 2023;David, 2013;Das et al., 2021;Bouzekri et al., 2015).The ability of satellites to rapidly cover extensive and often inaccessible geographical regions presents an unrivaled advantage over traditional, labor-intensive soil sampling methods.Combined with ground-based measurements, satellite data offers improved accuracy of SOM estimates, providing a comprehensive analysis of soil fertility indicators (Belenok et al., 2021;Huang et al., 2018;Weiss, Jacob and Duveillerc, 2020;Chen et al., 2006;Zhang et al., 2021;Yuzugullu et al., 2020).Satellite imagery, particularly multispectral and hyperspectral images, are essential tools in assessing SOM due to their ability to capture reflectance and absorption properties of different soil constituents (Luo et al., 2023;Gopp et al., 2017;Gopp et al., 2019;Khangura et al., 2023).Organic matter influences reflectance in the visible (VIS) and near-infrared (NIR) regions, making these spectral bands key in SOM evaluation (Reis et al., 2021;Bouasria et al., 2020;Yu et al., 2021;Yuzugullu et al., 2020;Zhang et al., 2021).
The spectral characteristics of bare soil are shaped by several factors, most notably the content of organic matter, often referred to as humus or SOM, and the level of soil moisture.Specifically, increases in either SOM content or soil moisture are associated with reduced spectral brightness coefficients, a pattern particularly pronounced in the red band of the spectral range (Zhang et al., 2021;Wang et al., 2022;Kravtsova, 2005;Shikhov et al., 2020;Prudnikova and Savin, 2021).Dry soils exhibit a common trait: a monotonic escalation in spectral brightness coefficients as the wavelength extends from 0.4 to 2 microns.As the SOM content rises, the spectral brightness of the soil correspondingly diminishes (Rukhovich et al., 2021).Further, the soils structure can influence its spectral brightness.Unstructured soils, for example, reflect approximately 10-15% lighter than well-structured soils, underscoring the intricate interplay of soil properties on its spectral characteristics (Kravtsova, 2005;Jing et al., 2020;Wang et al., 2022;Yang et al., 2021;Zhang et al., 2021).
The evaluation of SOM using remote sensing techniques utilizes specific spectral indices such as the Ratio Normalized Difference Soil Index (RNDSI) or the Normalized Difference Vegetation Index (NDVI).These indices help in amplifying the SOM signal while simultaneously suppressing the effects of other variables (Yang et al., 2021;Deng et al., 2015;Montero et al., 2023).While the primary application of NDVI lies in vegetation analysis, it can indirectly infer SOM levels as high vegetation areas often correlate with increased SOM content.Many techniques underscore the use of the NDVI as an identifier for bare soil surfaces (Ding et al., 2016;Tian et al., 2021;Lu, Liu and Liu, 2022).Guidelines have been established, which suggest scales of NDVI values for reflectance in the RED (0.25) and NIR (0.3) ranges for open ground (the average NDVI value is thus is equal to 0.025) (Koroleva et al., 2017;Kulyanitsa et al., 2017;Luo et al., 2021).According to the study (Gasmi et al., 2021) suggested using an NDVI threshold value of 0.22 to separate bare soils from other land types.Further, ground cover studies utilize mathematical models built upon parameters like fractional vegetation coverage, nitrogen reflectance index, yellow leaf index, bare soil index, and slope, achieving an accuracy of approximately 90% (Das et al., 2021).Land degradation hot spots and soil salinity are studied using spectral characteristics like surface albedo, the Salinized Land Degradation Index (SDI), the Soil Surface Moisture Index (LSM), Enhanced Vegetation Index (EVI) (Rukhovich et al., 2021;Higginbottom and Symeonakis, 2014;Jiang et al., 2008;Yuzugullu et al., 2020;Gopp et al., 2019).To minimize the effect of some vegetation spectral indices on soil brightness, based on the wavelengths of the RED and NIR ranges bands, it is proposed to use indices like the Soil-Adjusted Vegetation Index (SAVI) (Huete, 1988) and the Modified Soil Adjusted Vegetation Index (MSAVI) (Qi et al., 1994), as well as the NPV-Soil Separation Index (NSSI), developed to distinguish non-photosynthetic vegetation soils using two NIR bands (Tian et al., 2021).In the study (Chen et al., 2006), authors proposed the utilization four vegetation indices -NDVI, the Normalized Difference Water Index (NDWI) (Gao, 1996), the Difference Built-up Index (NDBI), the Normalized Difference Bareness Index (NDBaI) -for differentiating bare soils from other land use and land cover (LULC) types, setting threshold values for each.These indices are based on the spectral properties inherent to all LULC categories (Chen et al., 2006;Li and Chen, 2014).
A multitude of studies have established a correlation between SOM content in soils with pixel values of multispectral satellite data, often employing multiple linear regression (MLR) or alternative mathematical methodologies (Lu, Liu and Liu, 2022;Luo et al., 2021;Ahmed and Iqbal, 2014;Bouasria et al., 2020).Some authors (Gasmi et al., 2021;Demattê et al., 2007) have successfully used MLR to predict clay content.In a distinct approach (Mirzaee et al., 2016) authors applied geostatistical methods, specifically different types of the kriging method, to predict SOM content in soils.In the study (Jing et al., 2020) authors examined the impact of time gaps between field sampling and the acquisition of Landsat TM/OLI satellite data on soil nutrient predictions, using both MLR and artificial neural network methodologies.According to study (Ahmed and Iqbal, 2014;Fiorio and Demattȩ, 2009), authors employed MLR to correlate soil surface variables with spectral data.These studies demonstrate the diversity of approaches and tools used in exploring the link between SOM and spectral data.
The technique of this study consists in the correct separation of the bare soil cover from other LULC in order to establish an excellent linear relationship between the spectral indices and the actual soil parameters.It is this indicator and the calculated vegetation indices that will be the basis for forecasting the fertility of agricultural land in the Kyiv region.
The aim of the study is to establish a statistical relationship between SOM and the mean spectral reflectance extracted from satellite images.This process based on using MLR and further assessing the accuracy and significance of this relationship for the chernozems of Kyiv region.The establishment of such a relationship will make it possible to forecast the SOM content in the soil for further assessment of soil fertility.

Characteristics of the study area
Located in the Dnipro River basin in the northern part of Ukraine, the Kyiv region encompasses an area of 28.1 thousand square kilometers, excluding the Kyiv.This region constitutes approximately 4.7% of Ukraine's total land area, with the inclusion of Kyiv bringing the total to 28.9 thousand square kilometers.From an administrative perspective, the region is divided into 25 districts, along with 13 cities of regional subordination.In addition, it includes 30 cities of district subordination, classified as urban-type settlements, and 1,182 rural settlements.As of the start of 2021, the population of the Kyiv region stood at approximately 1,788,530 thousand individuals (Website of the Kyiv Regional State Administration, 2021).
The Kyiv region's lands are uniquely positioned within a transitional soilclimatic zone.This zone serves as an interface between Polissya, characterized by soddy-podzolic soils, and the Forest-steppe region, known for its transitional chernozem soils.This distinctive location contributes to the diversity of soil types in the Kyiv region and influences its ecological and agricultural potential (State Department of Environmental Protection in Kyiv region, 2021).
According to the Main Directorate of the State Land Agency, the total land area within the administrative boundaries of the Kyiv region is 2816.2thousand hectares.The region's agricultural land spans 1658.9 thousand hectares, representing approximately 58.9% of the region's total land area.Built-up areas account for 137.4 thousand hectares, which equates to 4.9% of the total.Industrial lands -22.9 thousand hectares (0.8%), while lands designated for transport and communications represent 29.7 thousand hectares (0.9%).Land allotted to law enforcement agencies covers 26.3 thousand hectares (0.9%).The region designates 56.0 thousand hectares for environmental protection.Recreational areas cover a total of 1.8 thousand hectares, with 0.4 thousand hectares intended for general recreation and an additional 1.4 thousand hectares set aside specifically for recreational facilities.Finally, 1.2 thousand hectares are reserved for historical and cultural purposes (State Department of Environmental Protection in Kyiv region, 2020; Main Department of the State Land Agency in Kyiv region, 2012).
Figure 1 illustrates the territory of the Kyiv region with an emphasis on 15 highlighted districts (excluding city councils).For these highlighted districts, a statistical relationship will be established between the humus content (SOM) in the soil and the Landsat 8 OLI satellite data.The creation Figure 1 employed geoinformation technologies, following the methods delineated in the study (Belenok et al., 2021;Liashenko et al., 2020) in ArcGIS Pro software.

Figure 1. Study area
The Kyiv region boasts a diverse soil cover, with chernozems, constituting approximately 50% of the region's arable land, being the most prevalent.Other soil types found across a significant portion of the region include podzolized soils.This diverse mix of soil types underscores the rich and varied terrain of the Kyiv region (Figure 2).The degree of plowing of the territory exceeds 60 % (State Institution "Soils protection Institution of Ukraine", 2016).In the Kyiv region, an estimated 1353.7 thousand hectares of land are degraded, representing 48.1% of the region's total land area and 81.4% of its agricultural land.This level of degradation is among the highest in the Forest-Steppe and Steppe regions of Ukraine.One of the key contributing factors to these negative processes is the excessive pressure exerted on the region's land resources, which includes intensive agricultural activity and widespread plowing of the territory (State Service of Ukraine for Geodesy, Cartography and Cadastre, 2020).The overall condition of the soils in the Kyiv region is currently deemed satisfactory.Nevertheless, in order to maintain and enhance this status, it is imperative to implement consistent, reliable, and timely soil cover monitoring (State Institution "Soils protection Institution of Ukraine", 2020).
It is important to note that since February 24, 2022, a full-scale war has been unfolding in Ukraine.Consequently, the current statistical data concerning land resources may be subject to change depending on the impact of the ongoing conflict.Therefore, when interpreting the data, the influence of these extraordinary circumstances should be taken into consideration.

Conducting ground surveys and obtaining factual data on the percentage of SOM
Agrochemical research is conducted in Ukraine every five years.To date, ten complete rounds of such studies have been completed, and the 11 th round is currently underway with data still being gathered.The State Institution 'Soils Protection Institute of Ukraine' carries out agrochemical studies on soils and produces accompanying reports (State Institution "Soils protection Institution of Ukraine", 2020).In our study, we utilized data on the humus content, also known as Soil Organic Matter (SOM), from the 5 th to the 10th rounds of investigation.It is pertinent to note that the 5 th round spanned the years 1986-1990, the 6th round covered 1991-1995, the 7 th round included 1996-2000, the 8 th round ranged from 2001-2005, the 9 th round incorporated 2006-2010, and the 10 th round covered 2011-2015.Thus, the SOM percentage represents the average value over five years for each round, broken down by district.During the 5 th round, 1094.8 thousand hectares of agricultural land were surveyed, with an average SOM percentage of 2.7% for the entire Kyiv region.In the 6 th round, the surveyed area contracted to 770.8 thousand hectares, and the average SOM decreased to 2.6%.For the 7 th round, the surveyed area expanded to 939.3 thousand hectares while the average SOM remained at 2.6%.In the 8 th round, the surveyed area began to shrink again, reaching 843.2 thousand hectares, with a SOM of 2.87%.This decrease continued into the 9 th round (793.2 thousand hectares, SOM -2.9%) and the 10 th round (761.4 thousand hectares, SOM -2.98%) (Romanova et al., 2018;Yacuk and Balyk, 2013).As a result, the SOM data for each district were extracted from statistical reports provided by the State Institution 'Soils Protection Institute of Ukraine.'This information can also be found in the published article by (Romanova et al., 2018).
The State Institution 'Soils Protection Institute of Ukraine' collects soil samples from March to May.Following this, laboratory tests are conducted and the SOM data is calculated as an average value spanning five years.Consequently, we selected satellite images for the months of March through May from 1986 to 2015.It is important to note that many of the images from May display vegetation.This article proposes a method for estimating the predicted SOM indicator based on spectral characteristics.To ensure the reliability of the study, we decided to focus on a single soil type, specifically chernozems, which predominate in the southern part of the region, as depicted in Figure 2. The 15 selected districts from the Kyiv region, along with their corresponding humus content (SOM) in the soils and the soil types, were identified by the authors and are presented in Figure 3.
Drawing from empirical data (Romanova et al., 2018), a significant increase in SOM over time was reported, with marked alterations especially observable in the southern part of the region.Despite the ongoing intensification of agricultural land use in the region, it is evident that the SOM data can be enhanced through the application of various organic fertilizers.Specifically, within the Kyiv region, the incorporation of by-products and residual vegetable matter from agricultural crops is identified as the primary mechanism for replenishing SOM.Romanova's research provides further validation, confirming the annual increase in organic contributions.The second factor contributing to the rise in SOM within the region is the strategic removal, initiated in 2001, of highly eroded and unproductive lands with low SOM content from active cultivation.The area of the investigated lands decreased by 30.5 % (from 1986 to 2015).SOM content in soils of Kyiv region according to (Romanova et al., 2018) by survey rounds  in 15 selected districts shown in Table A1 (Appendix A).

Landsat data
In this study, we utilized data from the Landsat 5 TM ("USGS Landsat 5 Level 2, Collection 2, Tier 1" collection in GEE) and Landsat 8 OLI ("USGS Landsat 8 Level 2, Collection 2, Tier 1" collection in GEE) satellites, both of which are publicly accessible as remote sensing materials.As the groundbased measurements for the SOM data were collected between 1986 and 2015, we ensured that the satellite data also spanned this time interval.Among the Landsat satellite system, three missions, specifically Landsat 5, 7, and 8, were operational during this period.However, it is worth noting that data acquired after May 31, 2003, from these missions may contain data gaps (Landsat Missions, 2021).
For the 5 th through 9 th rounds (1986-2010), we exclusively used data from Landsat 5 TM.The 10 th round, which spanned the interval from 2011 to 2015, was covered by data from both Landsat 5 TM (which operated until 2012) and Landsat 8 OLI (which has been operational since 2013).When data from the Landsat 5 TM were loaded for what were considered the most favorable periods -March and May of 2011 and 2012, the resulting images were found to be of poor quality.This could possibly be due to the deterioration of the detectors as the Landsat 5 TM satellite mission approached its end.As a result, the decision was made to use data from the Landsat 8 OLI satellite.The Landsat 8 OLI data offers a spatial resolution of 30 meters in the visible, near-infrared (NIR), and shortwave infrared (SWIR 1 and SWIR 2) ranges, and it has a 16-day temporal resolution (Landsat Missions, 2021).
The number of images that were used for conducting the research in each of the rounds are displayed in Table 1.

Satellite elevation and climate data
Elevation and climate data from GEE were utilized to construct a regression model in this study.The 'NASA SRTM Digital Elevation 30 m' collection served as the source for terrain slope data, expressed in degrees.Additionally, the 'TerraClimate: Monthly Climate and Climatic Water Balance for Global Terrestrial Surfaces, University of Idaho' collection provided data on precipitation accumulation, measured in millimeters, as well as soil moisture, also measured in millimeters, which was derived using a one-dimensional soil water balance model.Temperature data (ºC) were also sourced from this collection.

Satellite Data Preprocessing in GEE
Landsat 5 TM and Landsat 8 OLI data can be freely downloaded from GEE (Google Earth Engine, 2021).Hence, the study employs the GEE cloud computing platform, offering powerful tools for processing remote sensing data and conducting geoinformation analysis.Using GEE mitigates challenges associated with loading, storing, and processing satellite data at different times.This issues commonly encountered when dealing with large volumes of spatial data (Mateo-García et al., 2018; Shetty Ai., Umesh and Shetty A., 2021).To access GEE over the internet, a Google account is required for coding operations.The data catalog houses a vast repository of publicly available geospatial datasets.These incorporate observations from a variety of satellite and aerial imaging systems spanning both optical and non-optical wavelength ranges.The catalog also includes environmental variables, weather and climate data, land cover, topographic information, and socio-economic data.All this data is preprocessed and made available in a ready-to-use format (Gorelick et al., 2017).The GEE platform comprises two principal components that function in tandem -the Google Earth Engine Explorer, which is used for viewing datasets, and the Google Earth Engine Playground.The Google Earth Engine Playground application employs a JavaScript API to download and analyze extensive satellite images, as well as to execute complex geostatistical and geospatial operations (Shetty Ai., Umesh and Shetty A., 2021).
In GEE Landsat 5 TM and Landsat 8 OLI data were downloaded from Collection 2 Level-2 already with surface reflectance values.That is, they are atmospheric corrected.Landsat 5 TM and Landsat 8 OLI surface reflectance data (Level 2 Science Product -L2SP) were generated using the Land Surface Reflectance Code algorithm.This algorithm converts pixel values to Top of Atmosphere (TOA) Reflectance and Brightness Temperature values using calibration parameters from metadata.Then the TOA Reflectance values are subjected to atmospheric correction using additional data (Landsat 8 Collection 2 (C2) Level 2 Science Product (L2SP) (Guide, 2020; Vermote et al., 2016) The subsequent step in processing Landsat data in GEE involves masking clouds and their shadows.Pixels obscured by cloud cover and cloud shadows do not provide useful information about the Earth's surface.Therefore, their inclusion in data processing would lead to deliberately inaccurate results.GEE masks cloud pixels according to the C function of the mask (CFmask) algorithm using data from the QA_PIXEL (Pixel Quality Assessment) band.The CFmask algorithm can be found, for example, in the study (Joshi, Wynne and Thomas, 2019;Zekoll et al., 2021).
Finally, employing a date filter, mosaic Landsat images were created for the months of March through May for rounds 5 th through 10 th , encapsulating 15 months across five years for each round.These mosaic Landsat images consist of images with atmospheric corrected pixel values (surface reflectance) that have been averaged over all scenes of this satellite covering the 15 districts of the Kyiv region for the designated time interval, with clouds and their shadows masked.

Snow masking
Images from March and May were considered for this study because during this time, the snow in the area would have already melted and the primary growing season for most plants would not have begun yet.Extensive vegetation on the soils can lead to reflections that distort the soil's spectral curve.To account for any areas where snow might still be present, a trait common in March and April for this latitude, the Normalized Difference Snow Index (NDSI) was utilized.We employed the NDSI, as described in (Riggs, Hall, and Salomonson, 1994) to mask areas that might still be covered with snow.This specific threshold was established after examining the satellite data for the study area.The calculation of the NDSI was conducted using the following formula: where Green and SWIR 1 are the pixel values in Landsat's bands (2 and 5 for Landsat 5 TM and 3 and 6 for Landsat 8 OLI, respectively. The threshold NDSI < 0 was utilized for snow masking, with positive NDSI values indicating areas covered with snow.The mean spectral reflectance (MSR) value was obtained for pixels unaffected by snow, clouds, and cloud shadows.Subsequently, the resulting mosaic was cropped to the precise area of the polygonal feature representing the Kyiv region.Finally, the processed mosaic was uploaded to the GEE platform from the Humanitarian Data Exchange (The Humanitarian Data Exchange, 2021).

Determination of bare soils from remote sensing data
According to (Romanova et al., 2018) soil samples were collected from bare soils in the Kyiv region for further determination of SOM.Consequently, the subsequent step involved developing a method to isolate bare soils using the mosaic of Landsat satellite data.Figure 2 indicates that chernozem soils, which are the focus of this article's spectral analysis, are predominantly located in the south and southeast of the Kyiv region.Therefore, for the upcoming investigation, the corresponding 15 districts of the Kyiv region, as depicted in Figure 1 and Figure 3 were specifically chosen.The polygon data for these districts were also obtained from the Humanitarian Data Exchange (The Humanitarian Data Exchange, 2021).
The conventional approach for identifying different types of land use, known as LULC (Land Use/Land Cover), involves satellite data classification.Both supervised and unsupervised classification methods are commonly employed, as described in the scientific literature (Mekuriaw, Cherinet and Tsegaye, 2021;Olthof and Fraser, 2014).Nevertheless, the classification method is rather cumbersome and can be effectively used to distinguish all types of LULCs typical for a given territory.In our study, we need to select only bare soils, and we are not interested in selecting other LULC types.Therefore, it is advisable to use a technique based on the reflective spectral properties of bare soils, making it possible to decipher this particular type of LULC.These data for the first seven bands of Landsat 8 OLI in GEE are shown in Figure 4.
The spectral reflectance properties of soils in the study area were examined by analyzing spectral profiles and using different combinations of bands from the Landsat 8 OLI satellite.(Figure 5).
By analyzing Figure 5a and Figure 5b, it is possible to determine the color representation of bare soils in composite R-G-B = 4-3-2 and decipher them.It is important to consider the soil property described in (Shikhov et al., 2020), as mentioned earlier, which states that soils of different genetic affiliations exhibit a monotonic increase in reflection coefficients as wavelength increases in the range from 0.4 to 2 μm.In our study, this corresponds to the B1-B6 band of Landsat 8 OLI.By comparing the spectral profiles of different bare soils (Figure 5a) and overlaying a thematic map of soils in the Kyiv region (Figure 2) onto Figure 5a it can be concluded that the last two spectral profiles in the "Profiles" list do not correspond to chernozem soils.Therefore, for further analysis, they should be masked.
Based on the findings from previous studies regarding the spectral reflectance properties of chernozems and the obtained spectral profiles, the authors chose to utilize specific threshold values for vegetation indices to isolate bare soils chernozems, as presented in Table 2. Additionally, the Built-Up Area Extraction Index (BAEI) was employed to separate bare soils from built-up areas.The vegetation indices were calculated using the formulas provided below.NDVI differentiates green vegetation from other surfaces, and elevated NDVI values indicate a high presence of leaf biomass, dense canopy coverage, or a large leaf area (Jasinski, 1990;Tucker, 1979;Wilson and Sader, 2002): (2) NDBI (Normalized Difference Built-Up Index) is utilized to emphasize built-up areas while mitigating variations in surface illumination and atmospheric effects (Zha, Gao and Ni, 2003): Normalized Difference Water Index (NDWI was introduced as a novel index for remote sensing applications, specifically for detecting vegetation liquid water and assessing the spectral reflectance properties of soils (Gao, 1996): where ρ is surface reflectance value at the wavelength specified in μm; MSAVI2 (Modified Soil Adjusted Vegetation Index) expands the dynamic range of the vegetation signal, thereby reducing the impact of soil background and enhancing the sensitivity of vegetation.This sensitivity is determined by the ratio of the vegetation signal to the soil noise (Qi et al., 1994): Built-Up Area Extraction Index (BAEI) is calculated by the equation (Bouzekri et al., 2015): Vegetation indices were computed based on the average spectral reflectance of Landsat 5 TM and Landsat 8 OLI, following the formulas -.Using the predetermined threshold values of vegetation indices (Table 2) and the polygons representing the territories of city councils acquired from (The Humanitarian Data Exchange, 2021), a binary raster mask was generated.In this mask, a value of 1 indicates bare soils chernozems, while 0 represents other types of LULC.The generated mask was then applied to the previously obtained MSR values.As a result, images of bare soils were obtained, where the pixels retain the MSR values.

Results
To establish the statistical relationship between the MSR value obtained from Landsat and SOM, it was decided to employ a multiple linear regression (MLR) model.Since each of the 15 selected districts in the Kyiv region is characterized by a single the SOM value, this value should correspond to a single mean spectral reflectance value from Landsat for each band.Consequently, within GEE, zonal statistics were utilized to calculate the average MSR values for the seven selected Landsat bands, the average values of vegetation indices for each area, as well as additional data on relief and climate factors such as slope, precipitation, soil moisture, and temperature.
Indeed, MLR enables the assessment of the individual impact of each factor and their collective influence on the target variable, in this case, the SOM data.MLR allows for the determination of the degree of influence that each factor exerts on the productive trait and provides insights into the combined effect of these factors on SOM.By analyzing the MLR model coefficients and statistical significance, we can understand the relative importance and contribution of each factor in explaining the variations in SOM.In the MLR model, SOM was considered as the dependent variable.As predictors or regressors in the MLR analysis, we utilized spectral bands with wavelengths common to Landsat 5 TM and Landsat 8 OLI, which are associated with the spectral reflectance properties of soils.Additionally, vegetation indices capable of detecting bare soils were employed as predictors.Furthermore, topographic data, specifically slope, and climatic data including temperature, precipitation, and soil moisture, were also included as predictors in the model (Table 3).The MSR value in Table 3 indicates the MSR obtained from Landsat 5 TM and Landsat 8 OLI for each of the six selected bands, band combinations, and vegetation indices.These MSR values were calculated within the 15 districts of interest.In each of the six rounds of ground surveys (5 th -10 th rounds), there is one averaged value of SOM.Considering that ground surveys were conducted in 15 districts of the Kyiv region, each MLR regressor model consists of 90 values (6 rounds × 15 districts = 90 values).These input data for the MLR calculations are presented in Table A2  .Scatterplot matrix for correlation between model variables a Pearson's correlation analysis was conducted to examine the relationships between the variables used in the MLR model.The Pearson's correlation coefficient (Pearson's r) was employed to evaluate the strength and direction of the linear association between the variables (Mendenhall and Sincich, 2013).
The results of the Pearson correlation analysis between MSR and SOM values are presented in Table 4. Marked * correlations are significant at p < 0.05.For clarity, the data from Table 4 are shown on Figure 6.
The analysis of Table 4 reveals a notable and statistically significant negative relationship between SOM and MSR in the Green and Red bands (r ≈ -05 ÷-0.6 ).There is also a slightly weaker negative relationship between SOM and MSR in the Blue and NIR bands (r ≈ -04).Additionally, an inverse relationship is observed between SOM and MSR in the SWIR 1 band (r ≈ -0.28).Regarding the climatic data and topography, the analysis indicates a significant negative relationship between SOM and Slope, as well as between SOM and Soil moisture.However, the correlations between SOM and the other MSR variables are relatively weak and not statistically significant at p < 0.05.
At the initial stage, a total of 13 predictors for the MLR model were selected.Subsequently, employing the method of least squares, the MLR coefficients and other regression parameters were calculated.In our analysis, it was employed two methods to determine the most significant variables: the Forward stepwise method and the Backward stepwise method.The Forward stepwise method involves iteratively adding or removing independent variables from the model at each regression step based on the results of the F test.Variables are added to the model if they contribute significantly to improving the regression model's fit.This process continues until a better regression model is achieved.On the other hand, the Backward stepwise method removes independent variables from the regression equation one at a time, again based on the F test results.Variables are eliminated from the model if their inclusion does not significantly improve the fit of the regression model.This process continues until a better regression model is obtained.The essence of the F test in testing the significance of MLR is to compare two variances: model-explained (MSR) and unexplained (MSE) (Mendenhall and Sincich, 2013).If MSR ≈ MSE, then the regression is insignificant (the constructed model does not allow explaining the behavior of the predicted value depending on the the predictors' values).If , then the regression is significant.Two hypotheses are considered: H 0 (1) -regression is not significant, H 1 (1) -regression is significant, which is what it is used for F test: Next, the significance is checked F (value is calculated F 0 ) through the F-distribution (Fisher-Snedecor distribution) for the obtained values F, DFR, DFE.If, F 0 > α where α = 0.05 is significance level, then the hypothesis H 0 (1) , is accepted, otherwise, the hypothesis H 1 (1) is accepted.For a detailed step-by-step procedure of the Forward stepwise and Backward stepwise methods, please refer to Appendix B. As a result of these methods, we obtained two models, and their respective parameters are presented in Table 5.
In order to determine the best model among the two obtained, we utilized the Akaike Information Criterion (AIC) as proposed by (Akaike, 1973).The AIC serves as a heuristic measure that aims to balance the reduction in the number of model parameters and the quality of model fit.According to this criterion, the model with the smaller AIC value should be selected, irrespective of its absolute value.The obtained values for the AIC criterion are as follows: AIC I = -240.023 Based on the selection process, Model I derived from the Forward stepwise method was chosen for predicting SOM.To assess the normality of the values of the model regressors and the residuals, we conducted the Shapiro-Francia test and the Kolmogorov-Smirnov test.The Shapiro-Francia test, which is an extension of the Shapiro-Wilk test suitable for samples larger than 50, was used for this purpose.These tests help us determine if the values of the model regressors and the residuals follow a normal distribution (Shapiro and Francia, 1972).The results of the regressor test are shown in the following Figure 7.
The values of all four regressors in Model I are normally distributed at a significance level of α = 0.001, as a. ∀ p -value: p ≥ a.As a limitation of our model at this stage, it should be noted that the NIR and Slope regressors will not be normally distributed if we consider a significance level of α = 0.01 During the assessment of the residuals' normality using the Shapiro-Francia test and the Kolmogorov-Smirnov test, the results depicted in Figure 8 were obtained.It can be observed that the residuals are normally distributed at a significance level of α = 0.001 (p = 0.951 > 0.001).The effect size KS -D, which measures the difference between the sample distribution and the normal distribution, is very small, specifically 0.04269.This suggests that the deviation from normality is negligible.To test our model on heteroscedasticity (the hypothesis H 0homoscedasticity, error variances are equal) we used The Breusch-Pagan test (Breusch and Pagan, 1979).The test result showed that there was no heteroscedasticity: To test our model for multicollinearity, we used the Variance Inflation Factor (VIF) (Mendenhall and Sincich, 2013).At first glance, there is a suspicion of multicollinearity in the model, since according to the Table 5 p-value of t Stat for Soil moisture and Slope greater than α = 0.05 (0.075 and 0.131, respectively).But the VIF calculation showed that there is no multicollinearity: VIF Red = 1.24,VIF NIR = 1.18,VIF Slope = 1.30,VIF Soil moisture = 1.03 According to (Mendenhall and Sincich, 2013) a severe multicollinearity problem exists if the largest of VIF is greater than 10.
To identify potential outliers, we calculated Cook's distance.All obtained values of Cook's distance are approximately equal and are in the range from 0 to 0.16, which indicates the absence of potential outliers (Mendenhall and Sincich, 2013).
Finally, we have constructed graphs that illustrate the relationship between the Predicted vs. Observed Values (Figure 9) and Predicted vs. Residual Scores (Figure 10).Confidence bands were constructed at a probability level of 0.95.
In Figure 9 and Figure 10 the confidence bands line (red dashed curve) shows the confidence level (0.95), which is the probability that the "true" fitted line (in the population) falls between the bands.

Discussion
The findings from the constructed MLR model for predicting SOM in the chernozem soils of the Forest-Steppe zone of Ukraine provide significant insights into the region's soil fertility.The application of a combination of reflectance values in six bands of Landsat 5 TM and Landsat 8 OLI satellites, vegetation indices (NDSI, NDWI, NDBI, BAEI, MSAVI, and NDVI), topographic data (slope), and climatic data (temperature, precipitation, soil moisture) as regressors has contributed to the development of an effective predictive model.The Forward stepwise method (Mendenhall and Sincich, 2013) was instrumental in achieving the optimal model, as it allowed for the selection of the most impactful regressors based on F test and Akaike criteria (Akaike, 1973).The model yielded a Multiple R = 0.651 and an R 2 = 0.424 and the most significant predictors of an effective predictive model are RED, NIR, slope and soil moisture.These metrics suggest that the model can explain approximately 42.4% of the variation in the SOM in the soils of the examined regions.The adjusted R 2 of 0.397 further supports this and confirms that the model maintains a decent level of predictive power even when adjusting for the number of predictors in the model.Fairly similar results were obtained in the study (Yu et al., 2021) where the accuracy of the model was improved by adding the same environmental factors and spectral combination.However, the estimation effect of back propagation neural network (BPNN) model (with R 2 being 0.947) was better than that MLR model (with R 2 being 0.410).Similar to the findings of this study (Forkuor et al., 2017;Wang and Ge, 2012) also recorded elevation, slope, RED and NIR as the most important variable influencing soil indicators.According to the study (Luo et al., 2023) adding the multiyear average temperature effectively improves the SOM mapping performance.The authors used Sentinel images (10 m) and the random forest algorithm, which obviously affects the accuracy of the model.Some studies (Lu, Liu and Liu, 2022) suggest that the environmental variables (mainly clay index) have the greatest influence on the prediction accuracy of SOM.Using time series 7 MOD09A1 images (Zhang et al., 2021) noted that the pixel dates of the training samples and precipitation data were the main factors controlling the model performance.The simultaneous combination of dead fuel index (DFI), NDVI and environmental factors improves the SOM mapping performance (Yang et al., 2021) especially in infertile soils.Undoubtedly, the accuracy of SOM mapping performance will be affected by the spatial resolution of the images.In our study, the period 1986-2015 was selected, which is decisive for why we used the Landsat system.
Given that the RED and NIR bands emerged as the most significant spectral predictors in our analysis, it would be valuable to compare our findings with the results of Ukrainian scientists who have studied the same type of soils.
According to the study (Chornyy and Abramov, 2016;Achasov and Bidolah, 2008;Chatohin and Lindin, 2001;Sakhatsky, 2008;Truskavetsky, 2006;Byndych, 2021) the humus content (SOM) is most linearly related to RED, NIR and GREEN band.In the study (Abramov, 2012) author built a parabolic relationship between spectral index RED/NIR and the humus content (SOM).Long-term studies of soils in Transcarpathia made it possible to conclude that the best relationship was traced between the humus content (SOM) and the spectral index NIR/RED (Hebryn-Baidy, 2017).According to study (Truskavetsky et al., 2015) there is a linear relationship between SWIR1,2 bands and the humus content (SOM).The fact that in the listed studies authors used exclusively spectral information is interesting.In our study, we used environmental factors in addition to spectral data.It should be noted that the mentioned authors did not specify the correlation between the bands.According to our research, for the selected area, there existed a high positive correlation between the GREEN and RED bands (R = +0.96).Therefore, including both of them simultaneously in the model would yield an incorrect result.This increased the accuracy of our research to some extent.Some studies prove that there is a relationship between SOM and other bands.Based on the MLR analysis, the authors (Ahmed and Iqbal, 2014) establish a significant relationship (R 2 = 0.55) between SOM and spectral reflectance in the BLUE, Thermal and Mid-Infrared bands.Whereas in our study, the BLUE band showed worse significance, and others band were not considered.In the study (Demattê et al., 2007;Fiorio and Demattȩ, 2009;Reis et al., 2021), authors demonstrate that SOM, clay and sand are the most important attributes that can be predicted by MLR.It is important to note that the validation procedure compared the content of each soil attribute determined by the spectral models with laboratory traditional analysis.In our study, the accuracy of the obtained models was evaluated exclusively on the basis of mathematical calculations.A noteworthy approach is shown in the study (Bouasria et al., 2020), authors increase the resolution of the image (from 30 m to 15 m) by pansharpening method.Thus, the significance of the relationship between SOM and spectral reflectance in the multispectral range increases.
It should be noted that the research period and soil conditions vary depending on the natural and geographical location of study area.The authors also note that the soil should be in an air-dry state, the surface roughness should be minimal, and there should be no vegetation on the soil surface (Gopp et al., 2017;Gopp et al., 2019;David, 2013;Wang and Ge, 2012;Wang et al., 2022;Savin et al., 2021).These conditions were met in our study by using NDSI, NDWI, NDVI, NDBI, and BAEI masks and climatic data.Stable relationships between SOM and spectral reflectance exist only at medium and high content of SOM (1-5%) (Sullivan et al., 2005;Reis et al., 2021).Indicators of SOM in our study fluctuate within the limits from 2.69% to 3.69%.It is also worth noting that the sample size can affect the accuracy of the model.However, this factor is worth a separate study.Additional factors not considered in this study, such as other soil components, the impact of human activity, and biodiversity changes, could further refine the model's predictive capacity.Also, the model was built and tested only in the Kyiv region of Ukraine, primarily on chernozem soils, which may limit its applicability to other soil types and regions.In conclusion, the model serves as a useful tool for understanding and predicting SOM in chernozem soils.With the continuous advancements in satellite technology and increasing availability of detailed environmental data, refining this model for higher accuracy and broader applicability is a promising avenue for future research.

Conclusion
This article focuses on constructing a MLR model to predict the SOM in chernozem soils within the Forest-Steppe zone of Ukraine.The statistical data used for analysis consisted of percent values of SOM obtained from six rounds of field studies conducted between 1986 and 2015.The model development was carried out for 15 districts located predominantly on chernozem soils within the Kyiv region, Ukraine.
As regressors in the model, reflectance values from six bands of Landsat 5 TM and Landsat 8 OLI satellites, vegetation indices for bare soil detection, topographic data (slope), and climate data (temperature, precipitation, soil moisture) were employed.The optimal model was obtained through the Forward stepwise method, based on F-test and Akaike criteria.The model exhibited the following metrics: Multiple R = 0.651, R 2 = 0.424, and Adjusted R 2 = 0.397.Validation of the model demonstrated the absence of heteroscedasticity, multicollinearity, and potential outliers, indicating its reliability.The most influential predictors in the effective predictive model were identified as RED, NIR, slope, and soil moisture.The developed model can be utilized for predicting the SOM content in chernozem soils based on remote sensing data.Future research efforts could be directed towards exploring the efficacy of nonlinear regression models for predicting SOM in chernozem soils and evaluating the model's performance on other soil types.

Figure 2 .
Figure 2. Soil of Kyiv region, constructed by the authors based on the data described in (Atlas of natural conditions and natural resources of the UkrainianSSR, 1978)

Figure 4 .
Figure 4. Landsat 8 OLI satellite data mosaic plotted in GEE for the month's March-May for round 10 th (2011-2015): a -for the territory of the Kyiv region (False Color Composite R-G-B = 5-4-3), b -for the territory of the selected 15 districts of the Kyiv region (Composite R-G-B = 7-6-5)

Figure 5 .
Figure 5.The determination of the LULC type is indicated by points (a), and the corresponding spectral profiles (b) are provided.A fragment of the Landsat 8 OLI image for the study area is presented as a composite with the bands R-G-B represented by 4-3-2.

Table 1 .
The number of Landsat images used.

Table 2 .
Threshold values of vegetation indices for bare soils detection.

Table 3 .
Predictors of MLR model

Table 4 .
in Appendix A. Afterwards, Correlation matrix model variables