Research Article

Towards Malay named entity recognition: an open-source dataset and a multi-task framework

Article: 2159014 | Received 15 Aug 2022, Accepted 06 Dec 2022, Published online: 28 Dec 2022

References

  • Abinaya, N., John, N., Ganesh, B. H. B., Kumar, A. M., & Soman, K. P. (2014). AMRITA_CENFIRE-2014: named entity recognition for Indian languages using rich features. In Proceedings of the forum for information retrieval evaluation (pp. 103–111). Association for Computing Machinery. https://doi.org/10.1145/2824864.2824882.
  • Akbik, A., Bergmann, T., & Vollgraf, R. (2019). Pooled contextualized embeddings for named entity recognition. In J. Burstein, C. Doran, and T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, June 2–7 (Vol. 1, pp. 724–728). Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1078.
  • Akbik, A., Blythe, D., & Vollgraf, R. (2018). Contextual string embeddings for sequence labelling. In E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), Proceedings of the 27th international conference on computational linguistics, COLING 2018, August 20–26 (pp. 1638–1649). Association for Computational Linguistics. https://aclanthology.org/C18-1139/.
  • Alfina, I., Manurung, R., & Fanany, M. I. (2016). DBpedia entities expansion in automatically building dataset for Indonesian NER. In 2016 international conference on advanced computer science and information systems (ICACSIS) (pp. 335–340). IEEE. https://doi.org/10.1109/ICACSIS.2016.7872784.
  • Alfina, I., Savitri, S., & Fanany, M. I. (2017). Modified DBpedia entities expansion for tagging automatically NER dataset. In 2017 international conference on advanced computer science and information systems (ICACSIS) (pp. 216–221). IEEE. https://doi.org/10.1109/ICACSIS.2017.8355036.
  • Alfred, R., Leong, L. C., On, C. K., & Anthony, P. (2014). Malay named entity recognition based on rule-based approach. International Journal of Machine Learning and Computing, 4(3). https://doi.org/10.7763/IJMLC.2014.V4.428
  • Aliod, D. M., van Zaanen, M., & Smith, D. (2006). Named entity recognition for question answering. In L. Cavedon and I. Zukerman (Eds.), Proceedings of the Australasian language technology workshop, ALTA 2006, November 30–December 1 (pp. 51–58). Australasian Language Technology Association. https://aclanthology.org/U06-1009/.
  • Anbukkarasi, S., Varadhaganapathy, S., Jeevapriya, S., Kaaviyaa, A., Lawvanyapriya, T., & Monisha, S. (2022). Named entity recognition for Tamil text using deep learning. In 2022 international conference on computer communication and informatics (ICCCI) (pp. 1–5). https://doi.org/10.1109/ICCCI54379.2022.9740745.
  • Asmai, S. A., Salleh, M. S., Basiron, H., & Ahmad, S. (2018). An enhanced Malay named entity recognition using combination approach for crime textual data analysis. International Journal of Advanced Computer Science and Applications, 9(9). https://doi.org/10.14569/issn.2156-5570
  • Chiu, J. P. C., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4, 357–370. https://doi.org/10.1162/tacl_a_00104
  • Dai, X., & Adel, H. (2020). An analysis of simple data augmentation for named entity recognition. In D. Scott, N. Bel, and C. Zong (Eds.), Proceedings of the 28th international conference on computational linguistics, COLING 2020, December 8–13 (pp. 3861–3867). International Committee on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.343.
  • Derczynski, L., Nichols, E., van Erp, M., & Limsopatham, N. (2017). Results of the WNUT2017 shared task on novel and emerging entity recognition. In L. Derczynski, W. Xu, A. Ritter, and T. Baldwin (Eds.), Proceedings of the 3rd workshop on noisy user-generated text, NUT@EMNLP 2017, September 7 (pp. 140–147). Association for Computational Linguistics. https://doi.org/10.18653/v1/w17-4418.
  • Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio (Eds.), Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, June 2–7 (Vol. 1, pp. 4171–4186). Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1423.
  • Etzioni, O., Cafarella, M. J., Downey, D., Popescu, A., Shaked, T., Soderland, S., Weld, D. S., & Yates, A. (2005). Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1), 91–134. https://doi.org/10.1016/j.artint.2005.03.001
  • Fu, Y., Lin, N., Lin, X., & Jiang, S. (2021). Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition. Journal of Intelligent & Fuzzy Systems, 41(1), 563–574. https://doi.org/10.3233/JIFS-202286
  • Gunawan, W., Suhartono, D., Purnomo, F., & Ongko, A. (2018). Named-entity recognition for Indonesian language using bidirectional LSTM-CNNs. Procedia Computer Science, 135, 425–432. The 3rd International conference on computer science and computational intelligence (ICCSCI 2018), empowering smart technology in digital era for a better life. https://www.sciencedirect.com/science/article/pii/S1877050918314832.
  • Guo, J., Xu, G., Cheng, X., & Li, H. (2009). Named entity recognition in query. In J. Allan, J.A. Aslam, M. Sanderson, C. Zhai, and J. Zobel (Eds.), Proceedings of the 32nd annual international ACM SIGIR conference on research and development in information retrieval, SIGIR 2009, July 19–23 (pp. 267–274). ACM. https://doi.org/10.1145/1571941.1571989.
  • Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2021). A survey on recent approaches for natural language processing in low-resource scenarios. In K. Toutanova (Eds.), Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2021, June 6–11 (pp. 2545–2568). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.201.
  • Hedderich, M. A., Lange, L., & Klakow, D. (2021). ANEA: Distant supervision for low-resource named entity recognition. CoRR abs/2102.13129. https://arxiv.org/abs/2102.13129.
  • Hovy, E. H., Marcus, M. P., Palmer, M., Ramshaw, L. A., & Weischedel, R. M. (2006). OntoNotes: The 90% solution. In R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson (Eds.), Human language technology conference of the North American chapter of the association of computational linguistics, proceedings, June 4–9. The Association for Computational Linguistics. https://aclanthology.org/N06-2015/.
  • Ikhwantri, F. (2019). Cross-lingual transfer for distantly supervised and low-resources Indonesian NER. arXiv preprint arXiv:1907.11158.
  • Jain, A., Paranjape, B., & Lipton, Z. C. (2019). Entity projection via machine translation for cross-lingual NER. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 1083–1092). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1100.
  • Keung, P., Lu, Y., & Bhardwaj, V. (2019). Adversarial learning with contextual embeddings for zero-resource cross-lingual classification and NER. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 1355–1360). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1138.
  • Kosasih, J. A., & Khodra, M. L. (2018). Transfer learning for Indonesian named entity recognition. In 2018 international symposium on advanced intelligent informatics (SAIN) (pp. 173–178). https://doi.org/10.1109/SAIN.2018.8673345.
  • Kurniawan, K., & Louvan, S. (2018). Empirical evaluation of character-based model on neural named-entity recognition in Indonesian conversational texts. In W. Xu, A. Ritter, T. Baldwin, and A. Rahimi (Eds.), Proceedings of the 4th workshop on noisy user-generated text, NUT@EMNLP 2018, November 1 (pp. 85–92). Association for Computational Linguistics. https://doi.org/10.18653/v1/w18-6112.
  • Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In K. Knight, A. Nenkova, and O. Rambow (Eds.), NAACL HLT 2016, the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies, June 12–17 (pp. 260–270). The Association for Computational Linguistics. https://doi.org/10.18653/v1/n16-1030.
  • Li, F., Wang, Z., Hui, S. C., Liao, L., Song, D., & Xu, J. (2021). Effective named entity recognition with boundary-aware bidirectional neural networks. In J. Leskovec, M. Grobelnik, M. Najork, J. Tang, and L. Zia (Eds.), WWW '21: The web conference 2021, April 19–23 (pp. 1695–1703). ACM/IW3C2. https://doi.org/10.1145/3442381.3449995.
  • Li, J., Sun, A., & Ma, Y. (2021). Neural named entity boundary detection. IEEE Transactions on Knowledge and Data Engineering, 33(4), 1790–1795. https://doi.org/10.1109/TKDE.69
  • Li, X., Feng, J., Meng, Y., Han, Q., Wu, F., & Li, J. (2020). A unified MRC framework for named entity recognition. In D. Jurafsky, J. Chai, N. Schluter, and J.R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 5849–5859). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.519.
  • Lin, B. Y., Lee, D., Shen, M., Moreno, R., Huang, X., Shiralkar, P., & Ren, X. (2020). TriggerNER: Learning with entity triggers as explanations for named entity recognition. In D. Jurafsky, J. Chai, N. Schluter, and J.R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 8503–8511). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.752.
  • Luthfi, A., Distiawan, B., & Manurung, R. (2014). Building an Indonesian named entity recognizer using Wikipedia and DBPedia. In 2014 international conference on Asian language processing, IALP 2014, October 20–22 (pp. 19–22). IEEE. https://doi.org/10.1109/IALP.2014.6973520.
  • Ma, X., & Hovy, E. H. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th annual meeting of the association for computational linguistics, ACL 2016, August 7–12 (Vol. 1). The Association for Computer Linguistics. https://doi.org/10.18653/v1/p16-1101.
  • Mayhew, S., Tsai, C., & Roth, D. (2017). Cheap translation for cross-lingual named entity recognition. In M. Palmer, R. Hwa, and S. Riedel (Eds.), Proceedings of the 2017 conference on empirical methods in natural language processing, EMNLP 2017, September 9–11 (pp. 2536–2545). Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1269.
  • McCallum, A. (2003). Efficiently inducing features of conditional random fields. In C. Meek and U. Kjærulff (Eds.), UAI '03, proceedings of the 19th conference in uncertainty in artificial intelligence, August 7–10 (pp. 403–410). Morgan Kaufmann.
  • Menezes, D. S., Milidiú, R., & Savarese, P. (2019). Building a massive corpus for named entity recognition using free open data sources. In 8th Brazilian conference on intelligent systems, BRACIS 2019, October 15–18 (pp. 6–11). IEEE. https://doi.org/10.1109/BRACIS.2019.00011.
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio and Y. LeCun (Eds.), 1st international conference on learning representations, ICLR 2013, May 2–4. Workshop Track Proceedings. http://arxiv.org/abs/1301.3781.
  • Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In K. Su, J. Su, and J. Wiebe (Eds.), ACL 2009, proceedings of the 47th annual meeting of the association for computational linguistics and the 4th international joint conference on natural language processing of the AFNLP, August 2–7 (pp. 1003–1011). The Association for Computer Linguistics. https://aclanthology.org/P09-1113/.
  • Morsidi, F., Sarkawi, S., Sulaiman, S., Mohammad, S. A., & Wahid, R. A. (2015). Malay named entity recognition: A review. Journal of ICT in Education, 2, 1–14. https://ejournal.upsi.edu.my/index.php/JICTIE/article/view/2596
  • Ni, J., Dinu, G., & Florian, R. (2017). Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In R. Barzilay and M. Kan (Eds.), Proceedings of the 55th annual meeting of the association for computational linguistics, ACL 2017, July 30–August 4 (Vol. 1, pp. 1470–1480). Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1135.
  • Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In A. Moschitti, B. Pang, and W. Daelemans (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP 2014, October 25–29 (pp. 1532–1543). ACL. https://doi.org/10.3115/v1/d14-1162.
  • Prabhakar, D. K., Dubey, S., Goel, B., & Pal, S. (2014). ISMFIRE-2014: Named entity recognition for Indian languages. In Proceedings of the forum for information retrieval evaluation (pp. 98–102). Association for Computing Machinery. https://doi.org/10.1145/2824864.2824881.
  • Raiman, J., & Miller, J. (2017). Globally normalized reader. In M. Palmer, R. Hwa, and S. Riedel (Eds.), Proceedings of the 2017 conference on empirical methods in natural language processing, EMNLP 2017, September 9–11 (pp. 1059–1069). Association for Computational Linguistics. https://doi.org/10.18653/v1/d17-1111.
  • Ranaivo-Malançon, B. (2006). Automatic identification of close languages-case study: Malay and Indonesian. ECTI Transactions on Computer and Information Technology (ECTI-CIT), 2(2), 126–134. https://doi.org/10.37936/ecti-cit.200622.
  • Salleh, M. S., Asmai, S. A., Basiron, H., & Ahmad, S. (2017). A Malay named entity recognition using conditional random fields. In 2017 5th international conference on information and communication technology (ICOIC7) (pp. 1–6). IEEE. https://doi.org/10.1109/ICoICT.2017.8074647.
  • Sang, E. F. T. K., & Meulder, F. D. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In W. Daelemans and M. Osborne (Eds.), Proceedings of the seventh conference on natural language learning, CONLL 2003, held in cooperation with HLT-NAACL 2003, May 31–June 1 (pp. 142–147). ACL. https://aclanthology.org/W03-0419/.
  • Sharum, M. Y., Abdullah, M. T., Sulaiman, M. N., Murad, M. A. A., & Hamzah, Z. A. Z. (2011). Name extraction for unstructured Malay text. In 2011 IEEE symposium on computers & informatics (pp. 787–791). IEEE. https://doi.org/10.1109/ISCI.2011.5959017.
  • Srinivasan, R., & Subalalitha, C. (2019). Automated named entity recognition from Tamil documents. In 2019 IEEE 1st international conference on energy, systems and information processing (ICESIP) (pp. 1–5). https://doi.org/10.1109/ICESIP46348.2019.8938383.
  • Sulaiman, S., Wahid, R., Sarkawi, S., & Omar, N. (2017). Using Stanford NER and Illinois NER to detect Malay named entity recognition. International Journal of Computer Theory and Engineering, 9(2), 147–150. https://doi.org/10.7763/IJCTE.2017.V9.1128
  • Ulanganathan, T., Ebrahim, A., Xian, B. C. M., Bouzekri, K., Mahmud, R., & Hoe, O. H. (2017). Benchmarking Mi-NER: Malay entity recognition engine. In 9th international conference on information, process, and knowledge management (pp. 52–58).
  • Wei, J. W., & Zou, K. (2019). EDA: easy data augmentation techniques for boosting performance on text classification tasks. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 6381–6387). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1670.
  • Wu, Q., Lin, Z., Karlsson, B., Lou, J., & Huang, B. (2020a). Single-/multi-source cross-lingual NER via teacher-student learning on unlabeled data in target language. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Proceedings of the 58th annual meeting of the association for computational linguistics, ACL 2020, July 5–10 (pp. 6505–6514). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.581.
  • Wu, Q., Lin, Z., Karlsson, B. F., Huang, B., & Lou, J. (2020b). UniTrans: Unifying model transfer and data transfer for cross-lingual named entity recognition with unlabeled data. In C. Bessiere (Ed.), Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020 (pp. 3926–3932). IJCAI.org. https://doi.org/10.24963/ijcai.2020/543.
  • Wu, S., & Dredze, M. (2019). Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing, EMNLP-IJCNLP 2019, November 3–7 (pp. 833–844). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1077.
  • Xia, M., Kong, X., Anastasopoulos, A., & Neubig, G. (2019). Generalized data augmentation for low-resource translation. In A. Korhonen, D.R. Traum, and L. Màrquez (Eds.), Proceedings of the 57th conference of the association for computational linguistics, ACL 2019, July 28–August 2 (Vol. 1, pp. 5786–5796). Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1579.
  • Zamin, N., & Bakar, Z. A. (2015). Name entity recognition for Malay texts using cross-lingual annotation projection approach. In O. Gervasi (Eds.), Computational science and its applications – ICCSA 2015 – 15th international conference proceedings, part I, June 22–25 (Vol. 9155, pp. 242–256). Springer. https://doi.org/10.1007/978-3-319-21404-7_18.
  • Zamin, N., Oxley, A., & Bakar, Z. A. (2013). Projecting named entity tags from a resource rich language to a resource poor language. Journal of Information and Communication Technology, 12, 121–146. https://e-journal.uum.edu.my/index.php/jict/article/view/8140.
  • Zhao, S., Liu, T., Zhao, S., & Wang, F. (2019). A neural multi-task learning framework to jointly model medical named entity recognition and normalization. In The thirty-third AAAI conference on artificial intelligence, AAAI 2019, the thirty-first innovative applications of artificial intelligence conference, IAAI 2019, the ninth AAAI symposium on educational advances in artificial intelligence, EAAI 2019, January 27–February 1 (pp. 817–824). AAAI Press. https://doi.org/10.1609/aaai.v33i01.3301817.
  • Zhao, W., Zhao, S., Chen, S., Weng, T. H., & Kang, W. (2022). Entity and relation collaborative extraction approach based on multi-head attention and gated mechanism. Connection Science, 34(1), 670–686. https://doi.org/10.1080/09540091.2022.2026295
  • Zhou, G., & Su, J. (2002). Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th annual meeting of the association for computational linguistics, July 6–12 (pp. 473–480). ACL. https://aclanthology.org/P02-1060/.
  • Zirikly, A. (2015). Cross-lingual transfer of named entity recognizers without parallel corpora. In Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing of the Asian federation of natural language processing, ACL 2015, July 26–31 (Vol. 2, pp. 390–396). The Association for Computer Linguistics. https://doi.org/10.3115/v1/p15-2064.