
WikiTextGraph: A Python Tool for Parsing Multilingual Wikipedia Text and Graph Extraction
DOI: https://doi.org/10.5334/jors.572 | Journal eISSN: 2049-9647
Language: English
Submitted on: Apr 15, 2025
Accepted on: Jul 28, 2025
Published on: Sep 12, 2025
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year
© 2025 Paschalis Agapitos, Juan-Luis Suárez, Gustavo Ariel Schwartz, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.