Journal of ICT Research and Applications ITB Bandung Vol. 15 No. 3 2021
Development of Focused Crawlers for Building Large Punjabi News Corpus

Dublin Core

Title

Journal of ICT Research and Applications ITB Bandung Vol. 15 No. 3 2021
Development of Focused Crawlers for Building Large Punjabi News Corpus

Subject

corpus; crawler; NLP; Punjabi language; scraper; text extraction; text processing.

Description

Abstract. Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely utilized to build corpora in different domains and languages. This study developed a focused set of web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented using the Python programming language and were utilized to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and extracted corpora were made publicly available to the scientific community for research purposes.

Creator

Gurjot Singh Mahi & Amandeep Verma

Source

DOI: 10.5614/itbj.ict.res.appl.2021.15.3.1

Publisher

IRCS-ITB

Date

08 September 2021

Contributor

Sri Wahyuni

Rights

ISSN: 2337-5787

Format

PDF

Language

English

Type

Text

Coverage

Journal of ICT Research and Applications ITB Bandung Vol. 15 No. 3 2021

Tags

Repository, Repository Horizon University Indonesia, Repository Universitas Horizon Indonesia, Horizon.ac.id, Horizon University Indonesia, Universitas Horizon Indonesia, HorizonU, Repo Horizon

Citation

Gurjot Singh Mahi & Amandeep Verma, “Journal of ICT Research and Applications ITB Bandung Vol. 15 No. 3 2021: Development of Focused Crawlers for Building Large Punjabi News Corpus,” Repository Horizon University Indonesia, accessed December 22, 2024, https://repository.horizon.ac.id/items/show/3434.