htmlparser.h (a minimal HTML/XML parser)

A compact HTML/XHTML parser. ~~It also ships with a sloppy layout engine.~~ It is what remains of a browser I wrote for the Nintendo DS.

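Character-encoding handling is not yet fully separated from the parser. The internal encoding was designed to be Unicode, but the code also runs with SJIS so that it is easier to test.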

If anyone feels like it, please rewrite this cleanly.

Requires encode.h.

Sample

#include <string>
#include <iostream>
#include "htmlparser.h"
using namespace std;

// Write a wide string to a narrow stream as SJIS, using the Encode helper.
ostream & operator <<(ostream &os,const wstring &ws){
    std::string s;
    os << Encode::stringEncode_SJIS(s,ws.begin(),ws.end());
    return os;
}

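// Print the subtree rooted at e, one node per line, indented by depth d.
void dumpDomTree(HTML::Element *e,int d=0)
{
    for (int i=0;i<d;i++)
        cout << "  " ;
    cout << e->tagName ;
    // nodeType 3 is a text node (TEXT_NODE in the DOM): show its text too.
    if (e->nodeType==3) {
        cout << "[" << ((HTML::TextNode*)e)->text << "]";
    }
    cout << endl ;
    // Only containers have children; recurse into each of them.
    if (e->isContainer()) {
        HTML::Container *e2 = (HTML::Container*)e; 
        unsigned int i;
        for (i=0;i<e2->childNodes.size();i++) {
            dumpDomTree(e2->childNodes[i],d+1);
        }
    }
}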

int main(int argc,char *argv[])
{
    string s =
        "<html>"
            "<head><title>タイトル</title></head>"
            "<body>"
                "<h2 id='testid'>head</h2>"
                "<p>aaa<a href='aaa.html' name=\"abcd\">linkあああ</a></p>"
                "sss<hr />aa"
            "</body>"
        "</html>";

    // Parse the markup and dump the resulting tree.
    HTML::Document document;
    document.parse(s);
    dumpDomTree(&document);

    // Look up a single element by its id attribute.
    HTML::Element *e = document.getElementById("testid");
    if (e) {
        cout << "testid: " << e->tagName << " " << e->getInnerText() << endl;
    }

    // Collect elements by their name attribute; matches are stored in list.
    HTML::NodeList list;
    if (document.getElementsByName(list,"abcd").size()){
        cout << "abcd: " << list[0]->tagName << " " << list[0]->getInnerText() << endl;
    }

    return 0;
}

API

Only a very small subset, but a DOM-like API is provided.

Element.getTagName()
Element.getInnerText()
Element.getElementsByTagName(vector<Element*> &list, string tagname)
Element.getElementsByClassName(vector<Element*> &list, string classname)
Container.appendChild(Element *e)
Document.getElementById(string id)
Document.createElement(string tagname)
Document.parse(string html)
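
The lookup functions in this list take the result list as their first argument instead of returning a new one. The sketch below illustrates that calling style on its own. `Node` is a hypothetical stand-in written for this example, not one of the classes in htmlparser.h, and the ownership model (`std::unique_ptr`) is an assumption that the library does not necessarily share.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an element node; not the classes from htmlparser.h.
struct Node {
    std::string tagName;
    std::vector<std::unique_ptr<Node>> childNodes;

    explicit Node(const std::string &name) : tagName(name) {}

    // Takes ownership of the child and returns a non-owning pointer to it.
    Node *appendChild(std::unique_ptr<Node> child) {
        assert(child && "appendChild requires a non-null child");
        childNodes.push_back(std::move(child));
        return childNodes.back().get();
    }

    // Appends every descendant whose tag matches to `list`, in document
    // order, and returns that same list so the call can be chained.
    std::vector<Node *> &getElementsByTagName(std::vector<Node *> &list,
                                              const std::string &name) {
        for (auto &child : childNodes) {
            if (child->tagName == name)
                list.push_back(child.get());
            child->getElementsByTagName(list, name);
        }
        return list;
    }
};
```

Because the function returns the list it was given, a call such as `root.getElementsByTagName(list, "p").size()` can be used directly in a condition, which is the pattern the sample uses with `getElementsByName`.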

It was written as a parser, so it has no methods for removing nodes. I may bring it up to roughly DOM 1.0 level when I feel like it.