Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Yingwen Fu, School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, People's Republic of China

Nankai Lin, School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, People's Republic of China

https://orcid.org/0000-0003-2838-8273

Zhihe Yang, School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, People's Republic of China

Shengyi Jiang, School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, People's Republic of China; Guangzhou Key Laboratory of Multilingual Intelligent Processing, Guangzhou, People's Republic of China. Correspondence: [email protected]

https://orcid.org/0000-0002-6753-474X

Article: 2159014 | Received 15 Aug 2022, Accepted 06 Dec 2022, Published online: 28 Dec 2022

https://doi.org/10.1080/09540091.2022.2159014


References

Abinaya, N., John, N., Ganesh, B. H. B., Kumar, A. M., & Soman, K. P. (2014). AMRITA_CENFIRE-2014: named entity recognition for Indian languages using rich features. In Proceedings of the forum for information retrieval evaluation (pp. 103–111). Association for Computing Machinery. https://doi.org/10.1145/2824864.2824882.
Akbik, A., Bergmann, T., & Vollgraf, R. (2019). Pooled contextualized embeddings for named entity recognition. In J. Burstein, C. Doran, and T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, June 2–7 (Vol. 1, pp. 724–728). Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1078.
Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual string embeddings for sequence labelling. In E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), Proceedings of the 27th international conference on computational linguistics, COLING 2018, August 20–26 (pp. 1638–1649). Association for Computational Linguistics. https://aclanthology.org/C18-1139/.
Alfina, I., Manurung, R., & Fanany, M. I. (2016). DBpedia entities expansion in automatically building dataset for Indonesian NER. In 2016 international conference on advanced computer science and information systems (ICACSIS) (pp. 335–340). IEEE. https://doi.org/10.1109/ICACSIS.2016.7872784.
Alfina, I., Savitri, S., & Fanany, M. I. (2017). Modified DBpedia entities expansion for tagging automatically NER dataset. In 2017 international conference on advanced computer science and information systems (ICACSIS) (pp. 216–221). IEEE. https://doi.org/10.1109/ICACSIS.2017.8355036.
Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay named entity recognition based on rule-based approach. International Journal of Machine Learning and Computing, 4(3). https://doi.org/10.7763/IJMLC.2014.V4.428
Aliod, D. M., van Zaanen, M., & Smith, D. (2006). Named entity recognition for question answering. In L. Cavedon and I. Zukerman (Eds.), Proceedings of the Australasian language technology workshop, ALTA 2006, November 30–December 1 (pp. 51–58). Australasian Language Technology Association. https://aclanthology.org/U06-1009/.
Anbukkarasi, S., Varadhaganapathy, S., Jeevapriya, S., Kaaviyaa, A., Lawvanyapriya, T., & Monisha, S. (2022). Named entity recognition for Tamil text using deep learning. In 2022 international conference on computer communication and informatics (ICCCI) (pp. 1–5). https://doi.org/10.1109/ICCCI54379.2022.9740745.
Asmai, S. A., Salleh, M. S., Basiron, H., & Ahmad, S. (2018). An enhanced Malay named entity recognition using combination approach for crime textual data analysis. International Journal of Advanced Computer Science and Applications, 9(9). https://doi.org/10.14569/issn.2156-5570
Chiu, J. P. C., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357–370. https://doi.org/10.1162/tacl_a_00104
Dai, X., & Adel, H. (2020). An analysis of simple data augmentation for named entity recognition. In D. Scott, N. Bel, and C. Zong (Eds.), Proceedings of the 28th international conference on computational linguistics, COLING 2020, December 8–13 (pp. 3861–3867). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.343.
Derczynski, L., Nichols, E., van Erp, M., & Limsopatham, N. (2017). Results of the WNUT2017 shared task on novel and emerging entity recognition. In L. Derczynski, W. Xu, A. Ritter, and T. Baldwin (Eds.), Proceedings of the 3rd workshop on noisy user-generated text, NUT@EMNLP 2017, September 7 (pp. 140–147). Association for Computational Linguistics. https://doi.org/10.18653/v1/w17-4418.
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, June 2–7 (Vol. 1, pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1423.
Etzioni, O., Cafarella, M. J., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D. S., & Yates, A. (2005). Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1), 91–134. https://doi.org/10.1016/j.artint.2005.03.001
Fu, Y., Lin, N., Lin, X., & Jiang, S. (2021). Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition. Journal of Intelligent & Fuzzy Systems, 41(1), 563–574. https://doi.org/10.3233/JIFS-202286
Gunawan, W., Suhartono, D., Purnomo, F., & Ongko, A. (2018). Named-entity recognition for Indonesian language using bidirectional LSTM-CNNs. Procedia Computer Science, 135, 425–432. The 3rd International conference on computer science and computational intelligence (ICCSCI 2018), empowering smart technology in digital era for a better life. https://www.sciencedirect.com/science/article/pii/S1877050918314832.
Guo, J., Xu, G., Cheng, X., & Li, H. (2009). Named entity recognition in query. In J. Allan, J.A. Aslam, M. Sanderson, C. Zhai, and J. Zobel (Eds.), Proceedings of the 32nd annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 2009, July 19–23 (pp. 267–274). ACM. https://doi.org/10.1145/1571941.1571989.
Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2021). A survey on recent approaches for natural language processing in low-resource scenarios. In K. Toutanova (Eds.), Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2021, June 6–11 (pp. 2545–2568). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.201.
Hedderich, M. A., Lange, L., & Klakow, D. (2021). ANEA: Distant supervision for low-resource named entity recognition. CoRR abs/2102.13129. https://arxiv.org/abs/2102.13129.
Hovy, E. H., Marcus, M. P., Palmer, M., Ramshaw, L. A., & Weischedel, R. M. (2006). OntoNotes: The 90% solution. In R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson (Eds.), Human language technology conference of the North American chapter of the association of computational linguistics, proceedings, June 4–9. The Association for Computational Linguistics. https://aclanthology.org/N06-2015/.
Ikhwantri, F. (2019). Cross-lingual transfer for distantly supervised and low-resources Indonesian NER. arXiv preprint arXiv:1907.11158.
Jain, A., Paranjape, B., & Lipton, Z. C. (2019). Entity projection via machine translation for cross-lingual NER. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 1083–1092). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1100.
Keung, P., Lu, Y., & Bhardwaj, V. (2019). Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019 November 3–7 (pp. 1355–1360). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1138.
Kosasih, J. A., & Khodra, M. L. (2018). Transfer learning for Indonesian named entity recognition. In 2018 international symposium on advanced intelligent informatics (SAIN) (pp. 173–178). https://doi.org/10.1109/SAIN.2018.8673345.
Kurniawan, K., & Louvan, S. (2018). Empirical evaluation of character-based model on neural named-entity recognition in Indonesian conversational texts. In W. Xu, A. Ritter, T. Baldwin, and A. Rahimi (Eds.), Proceedings of the 4th workshop on noisy user-generated text, NUT@EMNLP 2018, November 1 (pp. 85–92). Association for Computational Linguistics. https://doi.org/10.18653/v1/w18-6112.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In K. Knight, A. Nenkova, and O. Rambow (Eds.), NAACL HLT 2016, the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, June 12–17 (pp. 260–270). The Association for Computational Linguistics. https://doi.org/10.18653/v1/n16-1030.
Li, F., Wang, Z., Hui, S. C., Liao, L., Song, D., & Xu, J. (2021). Effective named entity recognition with boundary-aware bidirectional neural networks. In J. Leskovec, M. Grobelnik, M. Najork, J. Tang, and L. Zia (Eds.), WWW '21: The web conference 2021, April 19–23 (pp. 1695–1703). ACM/IW3C2. https://doi.org/10.1145/3442381.3449995.
Li, J., Sun, A., & Ma, Y. (2021). Neural named entity boundary detection. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1790–1795. https://doi.org/10.1109/TKDE.69
Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., & Li, J. (2020). A unified MRC framework for named entity recognition. In D. Jurafsky, J. Chai, N. Schluter, and J.R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 5849–5859). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.519.
Lin, B. Y., Lee, D., Shen, M., Moreno, R., Huang, X., Shiralkar, P., & Ren, X. (2020). TriggerNER: Learning with entity triggers as explanations for named entity recognition. In D. Jurafsky, J. Chai, N. Schluter, and J.R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 8503–8511). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.752.
Luthfi, A., Distiawan, B., & Manurung, R. (2014). Building an Indonesian named entity recognizer using Wikipedia and DBPedia. In 2014 international conference on Asian language processing, IALP 2014, October 20–22 (pp. 19–22). IEEE. https://doi.org/10.1109/IALP.2014.6973520.
Ma, X., & Hovy, E. H. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, August 7–12 (Vol. 1). The Association for Computer Linguistics. https://doi.org/10.18653/v1/p16-1101.
Mayhew, S., Tsai, C., & Roth, D. (2017). Cheap translation for cross-lingual named entity recognition. In M. Palmer, R. Hwa, and S. Riedel (Eds.), Proceedings of the 2017 conference on empirical methods in natural language processing, EMNLP 2017, September 9–11 (pp. 2536–2545). Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1269.
McCallum, A. (2003). Efficiently inducing features of conditional random fields. In C. Meek and U. Kjærulff (Eds.), UAI '03, proceedings of the 19th conference in uncertainty in artificial intelligence, August 7–10 (pp. 403–410). Morgan Kaufmann.
Menezes, D. S., Milidiú, R., & Savarese, P. (2019). Building a massive corpus for named entity recognition using free open data sources. In 8th Brazilian conference on intelligent systems, BRACIS 2019, October 15–18 (pp. 6–11). IEEE. https://doi.org/10.1109/BRACIS.2019.00011.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio and Y. LeCun (Eds.), 1st international conference on learning representations, ICLR 2013, May 2–4. Workshop Track Proceedings. http://arxiv.org/abs/1301.3781.
Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In K. Su, J. Su, and J. Wiebe (Eds.), ACL 2009, proceedings of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing of the AFNLP, August 2–7 (pp. 1003–1011). The Association for Computer Linguistics. https://aclanthology.org/P09-1113/.
Morsidi, F., Sarkawi, S., Sulaiman, S., Mohammad, S. A., & Wahid, R. A. (2015). Malay named entity recognition: A review. Journal of ICT in Education, 2, 1–14. https://ejournal.upsi.edu.my/index.php/JICTIE/article/view/2596
Ni, J., Dinu, G., & Florian, R. (2017). Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In R. Barzilay and M. Kan (Eds.), Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, July 30–August 4 (Vol. 1, pp. 1470–1480). Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1135.
Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In A. Moschitti, B. Pang, and W. Daelemans (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP 2014, October 25–29 (pp. 1532–1543). ACL. https://doi.org/10.3115/v1/d14-1162.
Prabhakar, D. K., Dubey, S., Goel, B., & Pal, S. (2014). ISMFIRE-2014: Named entity recognition for Indian languages. In Proceedings of the forum for information retrieval evaluation (pp. 98–102). Association for Computing Machinery. https://doi.org/10.1145/2824864.2824881.
Raiman, J., & Miller, J. (2017). Globally normalized reader. In M. Palmer, R. Hwa, and S. Riedel (Eds.), Proceedings of the 2017 conference on empirical methods in natural language processing, EMNLP 2017, September 9–11 (pp. 1059–1069). Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1111.
Ranaivo-Malançon, B. (2006). Automatic identification of close languages-case study: Malay and Indonesian. ECTI Transactions on Computer and Information Technology (ECTI-CIT), 2(2), 126–134. https://doi.org/10.37936/ecti-cit.200622.
Salleh, M. S., Asmai, S. A., Basiron, H., & Ahmad, S. (2017). A Malay named entity recognition using conditional random fields. In 2017 5th international conference on information and communication technology (ICOIC7) (pp. 1–6). IEEE. https://doi.org/10.1109/ICoICT.2017.8074647.
Sang, E. F. T. K., & Meulder, F. D. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In W. Daelemans and M. Osborne (Eds.), Proceedings of the seventh conference on natural language learning, CONLL 2003, held in cooperation with HLT-NAACL 2003, May 31–June 1 (pp. 142–147). ACL. https://aclanthology.org/W03-0419/.
Sharum, M. Y., Abdullah, M. T., Sulaiman, M. N., Murad, M. A. A., & Hamzah, Z. A. Z. (2011). Name extraction for unstructured Malay text. In 2011 IEEE symposium on computers & informatics (pp. 787–791). IEEE. https://doi.org/10.1109/ISCI.2011.5959017.
Srinivasan, R., & Subalalitha, C. (2019). Automated named entity recognition from Tamil documents. In 2019 IEEE 1st international conference on energy, systems and information processing (ICESIP) (pp. 1–5). https://doi.org/10.1109/ICESIP46348.2019.8938383.
Sulaiman, S., Wahid, R., Sarkawi, S., & Omar, N. (2017). Using Stanford NER and Illinois NER to detect Malay named entity recognition. International Journal of Computer Theory and Engineering, 9(2), 147–150. https://doi.org/10.7763/IJCTE.2017.V9.1128
Ulanganathan, T., Ebrahim, A., Xian, B. C. M., Bouzekri, K., Mahmud, R., & Hoe, O. H. (2017). Benchmarking Mi-NER: Malay entity recognition engine. In 9th international conference on information, process, and knowledge management (pp. 52–58).
Wei, J. W., & Zou, K. (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019 November 3–7 (pp. 6381–6387). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1670.
Wu, Q., Lin, Z., Karlsson, B., Lou, J., & Huang, B. (2020a). Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 6505–6514). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.581.
Wu, Q., Lin, Z., Karlsson, B. F., Huang, B., & Lou, J. (2020b). UniTrans: Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In C. Bessiere (Ed.), Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020 (pp. 3926–3932). IJCAI.org. https://doi.org/10.24963/ijcai.2020/543.
Wu, S., & Dredze, M. (2019). Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 833–844). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1077.
Xia, M., Kong, X., Anastasopoulos, A., & Neubig, G. (2019). Generalized data augmentation for low-resource translation. In A. Korhonen, D.R. Traum, and L. Màrquez (Eds.), Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, July 28–August 2 (Vol. 1, pp. 5786–5796). Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1579.
Zamin, N., & Bakar, Z. A. (2015). Name entity recognition for Malay texts using cross-lingual annotation projection approach. In O. Gervasi (Eds.), Computational science and its applications – ICCSA 2015 – 15th international conference proceedings, part I, June 22–25 (Vol. 9155, pp. 242–256). Springer. https://doi.org/10.1007/978-3-319-21404-7_18.
Zamin, N., Oxley, A., & Bakar, Z. A. (2013). Projecting named entity tags from a resource rich language to a resource poor language. Journal of Information and Communication Technology, 12, 121–146. https://e-journal.uum.edu.my/index.php/jict/article/view/8140.
Zhao, S., Liu, T., Zhao, S., & Wang, F. (2019). A neural multi-task learning framework to jointly model medical named entity recognition and normalization. In The thirty-third AAAI conference on artificial intelligence, AAAI 2019, the thirty-first innovative applications of artificial intelligence conference, IAAI 2019, the ninth AAAI symposium on educational advances in artificial intelligence, EAAI 2019, January 27–February 1 (pp. 817–824). AAAI Press. https://doi.org/10.1609/aaai.v33i01.3301817.
Zhao, W., Zhao, S., Chen, S., Weng, T. H., & Kang, W. (2022). Entity and relation collaborative extraction approach based on multi-head attention and gated mechanism. Connection Science, 34(1), 670–686. https://doi.org/10.1080/09540091.2022.2026295
Zhou, G., & Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th annual meeting of the association for computational linguistics, July 6–12 (pp. 473–480). ACL. https://aclanthology.org/P02-1060/.
Zirikly, A. (2015). Cross-lingual transfer of named entity recognizers without parallel corpora. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing, ACL 2015, July 26–31 (Vol. 2, pp. 390–396). The Association for Computer Linguistics. https://doi.org/10.3115/v1/p15-2064.
