Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts
Authors:
Padarian José, Fuentes Ignacio
Abstract
A large amount of descriptive information is available in geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Given the advances in natural language processing and machine learning, it is possible to utilise descriptive information by encoding it as dense vectors. These word embeddings, which encode information about a word and its linguistic relationships with other words, lie in a multidimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. As this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific to geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. We also presented an example where we successfully emulated part of a taxonomic analysis of soil profiles that was originally applied to soil numerical data, which would not be possible without the use of embeddings. The resulting embeddings and test suite will be made available for other researchers to use and expand upon.
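To make the idea of angles in the embedding space having a linguistic interpretation more concrete, the following is a minimal sketch of cosine similarity and of analogy solving by vector arithmetic. The three-dimensional vectors, the word list, and the function names are invented for illustration only; they are not GeoVec embeddings and this is not necessarily the evaluation procedure used by the authors.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# They are NOT taken from GeoVec or any trained model.
toy_embeddings = {
    "clay":   [0.9, 0.1, 0.0],
    "sand":   [0.1, 0.9, 0.0],
    "fine":   [0.8, 0.0, 0.3],
    "coarse": [0.0, 0.8, 0.3],
    "river":  [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c.

    The three query words are excluded from the candidates, which is a
    common convention in analogy tests (assumed here, not from the source).
    """
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


# Related words should form a smaller angle than unrelated ones.
related = cosine_similarity(toy_embeddings["clay"], toy_embeddings["fine"])
unrelated = cosine_similarity(toy_embeddings["clay"], toy_embeddings["river"])

# "fine" is to "clay" as "coarse" is to ...?
answer = solve_analogy("fine", "clay", "coarse", toy_embeddings)
```

With these hand-picked vectors, "clay" is closer to "fine" than to "river", and the analogy resolves to "sand". With real embeddings the same arithmetic would be applied to vectors of a few hundred dimensions over the full vocabulary.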
Publisher
Copernicus GmbH
Cited by
13 articles.